Skip to content

Latest commit

 

History

History

0-omc

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 

Onramp Compiler -- First Stage

This is the first-stage compiler. It is implemented in compound Onramp Assembly and compiles Onramp Minimal C (omC).

This document describes the implementation. For a specification of the language it compiles, see Onramp Minimal C.

Lexer

The lexer consumes characters from the input file and produces tokens. Each token is one of the following types:

  • "a" -- alphanumeric (a keyword, a type name or a variable name)
  • "n" -- number
  • "c" -- character literal
  • "s" -- string literal (not merged with adjacent string literals)
  • "p" -- punctuation (a valid operator or other punctuation token)
  • "e" -- end-of-file

The token types are represented by letters. This lets us compare token types using mix-type arguments in assembly.

The current token type and value are stored in the global variables lexer_type and lexer_value. The parser loads these directly where needed. It calls the lexer functions lexer_consume(), lexer_accept(), lexer_is(), lexer_take() and lexer_expect() to parse tokens.

The lexer does not use the typical C lexer hack. Instead the lexer just returns tokens of type "alphanumeric". It is up to the parser to determine whether such tokens are keywords, type names or variable names. The lexer is entirely context-free.

The lexer also handles the few preprocessor directives supported (#line and #pragma). The lexer does not handle comments; they must be stripped by an Onramp preprocessor.

Types

Onramp Minimal C only supports the base types void, char and int, along with pointers to them: void*, int**, char*** and so on.

Since a complete type always prefixes a name, types and names are always parsed separately. This makes for much simpler parsing.

We also require that most qualifiers (typedef, extern, static) come first, and they can only be used at file scope. This means we don't have to worry about qualifiers during type parsing either. The only qualifier that can appear anywhere is const and it is ignored.

Types are stored in the low seven bits of a char. This makes it possible to work with types entirely in mix-type instruction arguments in assembly. Types have the following layout:

  • bits 0-4: The pointer indirection count. This is 0 for non-pointers and 1 or more for pointers.
  • bits 5-6: The basic type. 1 is void, 2 is char and 3 is int.
  • bit 7: The l-value flag.

For example 0x10 is void, 0x21 is char*, 0x73 is an l-value of type int***, and so on.

Expression Evaluation

All expression parsing functions emit code on-the-fly. The code they emit places the result of the expression in r0, and they return the type of the value the code places in r0. The stack is used for storage of all intermediate values. We essentially treat the bytecode as a stack machine, ignoring almost all numbered registers.

For example, parse_binary_expression() first parses the left-hand side into r0 and pushes it; then parses the right-hand side into r0; then pops the left-hand side to r1 and calculates the result into r0. Constants and variables are always loaded into r0; mix-type arguments are not used. An expression like (5 * 2) + (9 / 4) therefore emits code like this:

imw r0 5       ; 5
push r0
imw r0 2       ; 2
pop r1
mul r0 r1 r0   ; *
push r0
imw r0 9       ; 9
push r0
imw r0 4       ; 4
pop r1
div r0 r1 r0   ; /
pop r1
add r0 r1 r0   ; +

The resulting code is essentially in reverse Polish notation.

This is 23 bytecode instructions. It is of course extremely inefficient, but it is also extremely simple, and it works. A major advantage of spilling all variables and expressions to the stack is that function calls don't need to do anything special to preserve registers.

When an expression returns an l-value, the emitted code places the address of the value in r0 instead of the value itself. It is effectively an additional indirection, except that this type can also be the target of an assignment expression. The function compile_dereference_if_lvalue() is used to convert an l-value into an r-value; you'll see this called wherever the real value is needed.

Local Variables

Local variable definitions are stored in a stack in locals.os. When a variable is defined, it is pushed onto this stack. When a block closes, all variable definitions in that block are popped.

The values of variables are stored in the current function's stack frame. Since we don't support arrays or structs, all variables take up exactly one word of space. A variable's index (plus one) is its position in words under the frame pointer.

void foo(int a) {       //   ---------------
    char b;             //   | (prev. rfp) |  <- rfp
    {                   //   ---------------
        void* c;        //   |      a      |  <- rfp -4
    }                   //   ---------------
    {                   //   |      b      |  <- rfp -8
        int* d;         //   ---------------
        char** e;       //   |    c/d/f    |  <- rfp -12
    }                   //   ---------------
    int*** f;           //   |      e      |  <- rfp -16
}                       //   ---------------

The definition of local variables and the opening and closing of blocks does not emit any instructions. The compiler instead tracks the maximum number of variables that were defined at any time in the function and reserves space for the full stack frame at the entry point of the function (see Function Generation below.)

Function Generation

The size of a function's stack frame can only be determined after the entire function has been parsed. Since we are a single-pass compiler, we need to emit code before we know the frame size.

We treat our output file as a stream so we can't seek back and patch the frame size in after the fact. We also don't have support in our linker to fill in values for us. We need to output the value after the function is compiled.

To do this, we place the preamble at the end of the function. For example, a function foo() is compiled like this:

@_F_foo
    ; ...
    ; ...
    ; ...
    leave
    ret

=foo
    enter
    imw r9 <framesize>
    sub rsp rsp r9
    jmp ^_F_foo

We start with a static symbol to contain the function's contents. Its name is the function name prefixed with _F_. It assumes the correct stack space has been reserved. We simply emit each statement one by one with labels to jump among if and while blocks.

Once the entire function has been emitted, we now know the required size of the stack frame. We can now define the actual function, which is just a "shunt" or "trampoline". It performs the preamble, reserves the appropriate stack space, and jumps to the _F_ symbol.

The resulting function is thus technically in two separate symbols. This is of course inefficient but it is very easy to implement for bootstrapping.

Strings

When a literal string appears in a function, we assign the string a unique id, copy the string from the lexer and store it in an array. Once the function has been emitted, all strings are output as symbols and freed. For example, a function bar() that uses several strings is compiled like this:

@_F_bar
    ; ...
    ; ...
    imw r0 ^_Sx0
    add r0 rpp r0
    ; ...
    ; ...
    imw r0 ^_Sx1
    add r0 rpp r0
    ; ...
    ; ...
    imw r0 ^_Sx2
    add r0 rpp r0
    ; ...
    ; ...
    leave
    ret

=bar
    enter
    imw r9 <framesize>
    sub rsp rsp r9
    jmp ^_F_bar

@_Sx0
    "Alice"'00

@_Sx1
    "Bob"'00

@_Sx2
    "Carol"'00

We don't support concatenation of adjacent string literals in this stage.