How Compilers and Interpreters are Made

How Compilers and Interpreters are Made

This blog post is a brief overview of how compilers are made and how they work. It is a high-level overview and doesn't go into the nitty-gritty details of compiler design. If you want to skip the technical jargon and just want to know about how Cyclone was created, you can find it in this blog post How Cyclone was concieved, and Why?

So without further ado, let's dive into the world of compilers.

What is a Compiler?

A compiler is a program that translates code written in a high-level programming language into machine code that a computer can understand. The process of translation is called compilation, and the resulting machine code is called an executable.

Image

General Compilation Process

A compiler typically consists of several stages, each of which performs a specific task in the translation process. These stages are:

  1. Lexical Analysis: This stage breaks the input code into tokens, which are the smallest units of meaning in a programming language. It also removes any whitespace and comments from the code.

  2. Parsing (Syntax Analysis): This stage takes the tokens produced by the lexical analysis stage and arranges them into a hierarchical structure called a parse tree. The parse tree represents the syntactic structure of the code.

  3. Semantic Analysis: This stage checks the code for semantic errors, such as type mismatches or undeclared variables. It also performs type inference and type checking.

  4. Intermediate Code Generation: This stage translates the parse tree into an intermediate representation that is easier to work with than the original code. This intermediate representation is often in the form of an abstract syntax tree (AST) or three-address code.

  5. Code Optimization: This stage improves the efficiency of the code by applying various optimization techniques, such as constant folding, loop unrolling, and dead code elimination.

  6. Code Generation: This stage translates the intermediate representation into machine code that can be executed by the target machine. This machine code is specific to the target architecture and may include instructions for the CPU, memory management, and I/O operations.

Error Handling in Compilers

Before we dive deeper in each step of the compilation process, let's first understand how errors are handled in compilers.

In Cyclone, each component of compiler maintains its own list of diagnostics and after completion of each stage, compiler checks if any diagnostics are present in the list. If any diagnostics are present, compiler reports them to the user and stops the compilation process.

Example code snippet:

class Lexer
{
public:
    Lexer(SyntaxTree *syntaxTree) : _syntaxTree(syntaxTree), input(syntaxTree->Text), pos(0), currentChar(input[0]), lookAhead(input[1]) {};
    std::vector<Token> tokenize();
    const DiagnosticBag &GetDiagnostics() const
    {
        return _diagnostics;
    }

private:
    DiagnosticBag _diagnostics;
    SourceText input;
    SyntaxTree *_syntaxTree;
    size_t pos;
    char currentChar;
    char lookAhead;
    void advance();
    Token GenerateStringToken();
    Token GenerateWhitespaceToken();
    Token GenerateNumberToken();
    Token GenerateIdentifierToken();
    Token GenerateSingleLineComment();
    Token GenerateMultiLineComment();
    SyntaxKind checkKeyword(const std::string &keyword);
};

It should be noted compiler doesn't necessarily stop the compilation process after encountering an error. In case of trivial errors, compiler fabricates the missing information and continues the compilation process. This is done to ensure that the user gets as much information as possible about the errors in the code.

Lexical Analysis

Lexical analysis is the first stage of the compilation process and probably the simplest. We just take input code, match it against our set of language grammer and syntax rules, and generate tokens.

Example code snippet:

std::vector<Token> Lexer::tokenize()
{
    std::vector<Token> tokens;
    while (currentChar != '\0')
    {
        switch (currentChar)
        {
        case '+':
            tokens.push_back(Token{_syntaxTree, SyntaxKind::PLUS, "+", pos});
            advance();
            break;

        case '-':
            tokens.push_back(Token{_syntaxTree, SyntaxKind::MINUS, "-", pos});
            advance();
            break;

        case '*':
            tokens.push_back(Token{_syntaxTree, SyntaxKind::MULTIPLY, "*", pos});
            advance();
            break;
        ...
        }
      ...
    }

Parsing (Syntax Analysis)

Parsing is the second stage of the compilation process. In this stage, we take the tokens produced by the lexical analysis stage and arrange them into a hierarchical structure called a parse tree.

This can be done using a parser generator tool like Yacc or Bison, or by writing a parser by hand. Types of Parsers include:

  • Top-Down Parsers: Recursive Descent, LL(k), LL(*)
  • Bottom-Up Parsers: LR(0), SLR(1), LALR(1), LR(1)
In Cyclone we are using a hand-written Recursive Descent Parser.

Example code snippet:


SyntaxTree *Parser::parse()
{
    auto expression = parseExpression();
    auto endOfFile = _lexer->Match(SyntaxKind::EndOfFileToken);
    return new SyntaxTree(expression, endOfFile);
}

ExpressionSyntax *Parser::parseExpression()
{
    return parseAddition();
}

ExpressionSyntax *Parser::parseAddition()
{
    auto left = parseMultiplication();
    while (_lexer->Current().Kind == SyntaxKind::PLUS || _lexer->Current().Kind == SyntaxKind::MINUS)
    {
        auto op = _lexer->Next();
        auto right = parseMultiplication();
        left = new BinaryExpressionSyntax(left, op, right);
    }
    return left;
}

Semantic Analysis

Semantic analysis is the third stage of the compilation process. In this stage, we check the code for semantic errors, such as type mismatches or undeclared variables. We also perform type inference and type checking. This process is also called Binding.

In Cyclone, we Bind the types by recursively traversing the parse tree and assigning types to each node based on the context.

Example code snippet:

class Binder
{
public:
    Binder(SyntaxTree *syntaxTree) : _syntaxTree(syntaxTree) {};
    BoundExpression *BindExpression(ExpressionSyntax *syntax);
    const DiagnosticBag &GetDiagnostics() const
    {
        return _diagnostics;
    }

private:
    DiagnosticBag _diagnostics;
    SyntaxTree *_syntaxTree;
    BoundExpression *BindLiteralExpression(LiteralExpressionSyntax *syntax);
    BoundExpression *BindBinaryExpression(BinaryExpressionSyntax *syntax);
    BoundExpression *BindParenthesizedExpression(ParenthesizedExpressionSyntax *syntax);
    BoundExpression *BindUnaryExpression(UnaryExpressionSyntax *syntax);
    BoundExpression *BindNameExpression(NameExpressionSyntax *syntax);
};


BoundExpression *Binder::BindBinaryExpression(BinaryExpressionNode *node)
{
    BoundExpression *boundLeft = BindExpression(node->left);
    BoundExpression *boundRight = BindExpression(node->right);

    if (boundLeft->type == TypeSymbol::Error || boundRight->type == TypeSymbol::Error)
    {
        return new BoundErrorExpression();
    }
    BoundBinaryOperator *boundOperator = BoundBinaryOperator::Bind(node->OperatorToken.Kind, boundLeft->type, boundRight->type);

    if (boundOperator == nullptr)
    {

        _diagnostics.ReportUndefinedBinaryOperator(node->OperatorToken.Location, node->OperatorToken.value, boundLeft->type.ToString(), boundRight->type.ToString());
        return new BoundErrorExpression();
    }

    return new BoundBinaryExpression(boundLeft, boundOperator, boundRight);
}

Intermediate Code Generation

Intermediate code generation is the fourth stage of the compilation process. In this stage, we translate the parse tree into an intermediate representation that is easier to work with than the original code. This intermediate representation is often in the form of an abstract syntax tree (AST) or three-address code.

In Cyclone, we generate an AST that represents the structure of the code. This AST is then used in the code optimization and code generation stages.

Lowering / Optimization

Lowering is the process of converting high-level constructs into lower-level constructs. This is done to simplify the code and make it easier to optimize. In Cyclone, we lower the code by converting it into a lower-level intermediate representation called Lowered Tree.

Example of Lowering is conversion of for loop into while loop. And while loop into if and goto statements:

BoundStatement *Lowerer::RewriteWhileStatement(BoundWhileStatement *node)
{
    BoundLabel *bodyLabel = GenerateLabel();

    BoundGotoStatement *gotoContinue = new BoundGotoStatement(*node->ContinueLabel);
    BoundLabelStatement *bodyLabelStatement = new BoundLabelStatement(*bodyLabel);

    BoundLabelStatement *continueLabelStatement = new BoundLabelStatement(*node->ContinueLabel);

    BoundConditionalGotoStatement *gotoTrue = new BoundConditionalGotoStatement(*bodyLabel, node->Condition);
    BoundLabelStatement *breakLabelStatement = new BoundLabelStatement(*node->BreakLabel);

    BoundBlockStatement *result = new BoundBlockStatement({gotoContinue,
                                                           bodyLabelStatement,
                                                           node->Body,
                                                           continueLabelStatement,
                                                           gotoTrue,
                                                           breakLabelStatement});
    return RewriteStatement(result);
}

BoundStatement *Lowerer::RewriteForStatement(BoundForStatement *node)
{
    BoundVariableDeclaration *variableDeclaration = new BoundVariableDeclaration(node->Variable, node->LowerBound);
    BoundVariableExpression *variableExpression = new BoundVariableExpression(node->Variable);

    LocalVariableSymbol *upperBoundSymbol = new LocalVariableSymbol("upperBound", true, TypeSymbol::Integer);
    BoundVariableDeclaration *upperBoundDeclaration = new BoundVariableDeclaration(*upperBoundSymbol, node->UpperBound);

    BoundBinaryOperator *lessOrEquals = BoundBinaryOperator::Bind(SyntaxKind::LESS_EQUALS, TypeSymbol::Integer, TypeSymbol::Integer);
    BoundBinaryExpression *condition = new BoundBinaryExpression(variableExpression, lessOrEquals, new BoundVariableExpression(*upperBoundSymbol));

    BoundLabelStatement *continueLabelStatement = new BoundLabelStatement(*node->ContinueLabel);
    BoundExpressionStatement *increment = new BoundExpressionStatement(new BoundAssignmentExpression(node->Variable, new BoundBinaryExpression(variableExpression, BoundBinaryOperator::Bind(SyntaxKind::PLUS, TypeSymbol::Integer, TypeSymbol::Integer), new BoundLiteralExpression("1", TypeSymbol::Integer))));

    BoundBlockStatement *whileBody = new BoundBlockStatement({node->Body, continueLabelStatement, increment});
    BoundWhileStatement *whileStatement = new BoundWhileStatement(condition, whileBody, node->BreakLabel, GenerateLabel());
    BoundBlockStatement *result = new BoundBlockStatement({variableDeclaration,
                                                           upperBoundDeclaration,
                                                           whileStatement});

    return RewriteStatement(result);
}

Code Generation

Code generation is the final stage of the compilation process. In this stage, we translate the intermediate representation into machine code that can be executed by the target machine. This machine code is specific to the target architecture and may include instructions for the CPU, memory management, and I/O operations.

In Cyclone, we took the easy route and generated C++ code from the intermediate representation. This C++ code is then compiled using a C++ compiler to generate the final executable.

Example code snippet:

class Emitter
{
public:
    Emitter(std::string filename, BoundProgram *program, std::unordered_map<VariableSymbol, std::any> &variables) : _program(program), _globals(variables), Filename(filename)
    {
        _locals.push(std::unordered_map<VariableSymbol, std::any>());
    }
    void Emit();

private:
    BoundProgram *_program;
    std::string Filename;
    std::ofstream codeStream;
    std::unordered_map<VariableSymbol, std::any> &_globals;
    std::stack<std::unordered_map<VariableSymbol, std::any>> _locals;

    void EmitIncludes();
    void EmitFunctions();
    void EmitStatement(BoundBlockStatement *node);
    void EmitVariableDeclaration(BoundVariableDeclaration *node);
    void EmitExpressionStatement(BoundExpressionStatement *node);
    void EmitGotoStatement(BoundGotoStatement *node);
    void EmitConditionalGotoStatement(BoundConditionalGotoStatement *node);
    void EmitLabelStatement(BoundLabelStatement *node);
    void EmitReturnStatement(BoundReturnStatement *node);

    void EmitExpression(BoundExpression *node);
    void EmitLiteralExpression(BoundLiteralExpression *node);
    void EmitConversionExpression(BoundConversionExpression *node);
    void EmitCallExpression(BoundCallExpression *node);
    void EmitArrayInitializerExpression(BoundArrayInitializerExpression *node);
    void EmitArrayAccessExpression(BoundArrayAccessExpression *node);

    void EmitType(TypeSymbol type);

    std::string EscapeString(const std::string &str);
};

Bonus: Interpreters

Interpreters are programs that execute code directly without the need for compilation. They read the code line by line and execute it on the fly. Interpreters are slower than compilers because they don't generate machine code, but they are easier to use and debug.

In Cyclone, we have an interpreter that can execute the intermediate representation directly. This is useful for testing and debugging the code without having to compile it.

Example code snippet:



class Evaluator
{
public:
    Evaluator(BoundProgram *program, std::unordered_map<VariableSymbol, std::any> &variables) : _program(program), _globals(variables)
    {
        _locals.push(std::unordered_map<VariableSymbol, std::any>());
    }
    std::any Evaluate()
    {
        return EvaluateStatement(_program->statement);
    };

    std::any EvaluateExpression(BoundExpression *node);

private:
    BoundProgram *_program;
    std::unordered_map<VariableSymbol, std::any> &_globals;
    std::stack<std::unordered_map<VariableSymbol, std::any>> _locals;

    void EvaluateExpressionStatement(BoundExpressionStatement *node);
    void EvaluateVariableDeclaration(BoundVariableDeclaration *node);
    void Assign(VariableSymbol variable, std::any value);
    // void AssignArray(VariableSymbol variable, std::any value, int index);
    std::any EvaluateStatement(BoundBlockStatement *node);

    std::any EvaluateLiteralExpression(BoundLiteralExpression *node);
    std::any EvaluateVariableExpression(BoundVariableExpression *node);
    std::any EvaluateAssignmentExpression(BoundAssignmentExpression *node);
    std::any EvaluateUnaryExpression(BoundUnaryExpression *node);
    std::any EvaluateBinaryExpression(BoundBinaryExpression *node);
    std::any EvaluateCallExpression(BoundCallExpression *node);
    std::any EvaluateConversionExpression(BoundConversionExpression *node);
    std::any EvaluateArrayInitializerExpression(BoundArrayInitializerExpression *node);
    std::any EvaluateArrayAccessExpression(BoundArrayAccessExpression *node);
    std::any EvaluateArrayAssignmentExpression(BoundArrayAssignmentExpression *node);
    std::any _lastValue = "";
};

Conclusion

Compilers are complex programs that perform a series of transformations on code to translate it from a high-level programming language to machine code. Each stage of the compilation process has its own set of challenges and requirements, and the overall process requires careful planning and design.

In Cyclone, we have implemented a simple compiler that can transpiles into C++ which is then compiled to machine code. While our compiler is far from perfect, it demonstrates the basic principles of compiler design and implementation.

If you have and question, do reach out, I am sure I can teach you something new and in return I know I’ll learn from you. I am always happy to chat😁, so feel free to contact🤗. Looking forward to having an interesting and engaging conversation. 😉

Until then.

Jayvardhan Patil, 😎 DevsTomorrow


    Theme

    Presets

    Background

    Custom:

    Primary

    Custom:

    Secondary

    Custom:

    Border

    Custom:

    Mode

    Light
    Dark