Amila Senadheera

Tech enthusiast

The journey of implementing the WinZigC programming language

Published on February 05, 2024

When I was an undergraduate, I took the 'Compiler Design' module during the last semester of my final year. It was an engaging module that allowed us to see how the knowledge gained from other core Computer Science modules, such as Data Structures and Algorithms, and Theory of Computing, could be applied together. As the semester project for the module, we were given a grammar for a programming language called WinZigC and were tasked with writing a parser to construct the Abstract Syntax Tree (AST) for a given set of test programs.

I successfully completed the assignment by writing both a top-down parser and a bottom-up parser (utilizing a left-child right-sibling binary tree). The latter method was the preferred approach for the assignment. Additionally, we delved into the application of attribute grammar to generate code for a stack machine in a language called Tiny. This task proved to be time-consuming even for a simple program, and, of course, there was an exam question related to it.

WinZigC compiler

Fast forward to December 2023. When I look back at those test programs for the WinZigC grammar, they turn out to be valid programs, and I can interpret the results in my mind. I thought, why not write a compiler for the WinZigC grammar and see how it runs on real hardware? During my search, I stumbled upon a project called LLVM, which provides the infrastructure for writing compilers. This project has had a significant impact on the tech industry, powering AI and large language models (LLMs). For instance, when you train a model using a TPU with Python libraries, it's LLVM that makes it possible. Moreover, languages like C++, Swift, Rust, and many others use LLVM as their backend.

Exploring LLVM

The most amazing concept in LLVM is the existence of a language-agnostic form used to describe a program called the Intermediate Representation (IR). IR sits between high-level programming languages and machine code. It shares similarities with assembly, but it is not purely assembly. When you have a program in the IR, you can perform analyses and optimizations at that level and then compile it to the machine architecture of your choice. This process makes building compilers incredibly flexible and helps avoid repetitive work across compiler implementations.

I went through the tutorial for implementing Kaleidoscope. The primary goal of this tutorial is to provide ideas about LLVM constructs when implementing a language. I gained insights into concepts such as SSA form, where you generate instructions for a machine with an infinite number of registers, each of which can be assigned only once. While following the tutorial, I began considering ways to bring the WinZigC grammar to generate LLVM bitcode.

Observations from the test programs

In some programs, I noticed instances of assigning to an undeclared variable 'd'. Initially, I thought these were syntactically correct but semantically incorrect programs. However, upon looking into the grammar and examining a few programs, I realized that these programs are indeed correct. The undeclared variable is mysteriously declared, and its purpose is to discard the return value of a function call when it is not used. Of course, there isn't a void return type in the grammar.

Another observation I made was the presence of assignments to the function name within the function. I understood that this is a mechanism to change the return value of the function without explicitly using a return statement. This feature, although unconventional in some programming languages, is known and used in languages like PASCAL.

Choices I made for the project

I implemented the compiler using C++ since LLVM is also written in C++, making it easy to add LLVM libraries as dependencies. I chose Bazel as the build system and utilized Google-published Bazel build libraries for LLVM (LLVM upstream uses CMake as its build system). For logging, I integrated glog, and for writing tests, I employed googletest. All of these projects are amazing open-source initiatives from Google.

In my codebase, I added only the dependencies required for generating LLVM bitcode. Consequently, the clang backend is utilized to build the binary for machine code. Clang supports cross-compilation and defaults to the host machine architecture; in my case, it compiled to x86-64 architecture for my Intel Macbook.

First things first

I started by writing a few lines of code and configured the LLDB Debugger for Visual Studio Code. This setup significantly sped up my debugging process and proved invaluable in addressing runtime issues. Additionally, I developed a simple bash script for compiling and running programs, a process that continued to evolve with the incorporation of unit and integration tests.

For testing, each program included numerous language constructs. This allowed me to iteratively test individual programs as I implemented more features. The first functionality I implemented was simple expression evaluation and printing. To handle input and output, I utilized the C library functions scanf and printf. At this stage, the program could read input from the command line and output results accordingly.

Subsequently, I proceeded to implement binary expressions and control flows (branching and looping).

WinZigC program -> executable binary

The process a WinZigC program goes through to become an executable binary is as follows:

  1. A WinZigC program is just a text file. The Lexer identifies different tokens such as integers, identifiers, operators (+ * := and mod), delimeters (begin, end), comments, spaces, and newlines. The latter three are ingnored by the Screener.
  2. Labeled tokens are then fed into the Parser. The Parser constructs the Abstract Syntax Tree (AST) using Look-Ahead Top-Down derivation.
  3. Next, the AST nodes are visited using the visitor pattern in a depth-first manner to generate instructions. An LLVM BasicBlock is created to group a sequence of instructions with a label. When branching occurs, a branch instruction is added to navigate to the destination BasicBlock.
  4. At this point, we have LLVM bitcode (IR) created in memory. Optional optimizations can be applied to this bitcode. It can then be serialized into binary or human-readable text form.
  5. In the final step, the Clang backend creates a binary with the target machine architecture using above bitcode.

Below is a sample WinZigC program and the LLVM text representation and the Control flow graph (CFG):

WinZigC program to find factors of a number

program factors:

    i : integer; 

function Factor ( i : integer ):integer;
    j : integer;
    if i > 0 then
	for (j := 1; j <= i; j:=j+1) 
	    if i mod j = 0 then output ( j )
end Factor;


	d:=Factor ( i )
    until i <= 0

end factors.

LLVM bitcode for the factors program

; ModuleID = 'factors'
source_filename = "factors"

@d = internal global i32 0
@i = internal global i32 0
@0 = private unnamed_addr constant [4 x i8] c"%d\0A\00", align 1
@1 = private unnamed_addr constant [3 x i8] c"%d\00", align 1

declare i32 @printf(i8*, ...)

declare i32 @scanf(i8*, ...)

define i32 @Factor(i32 %0) {
  %Factor = alloca i32, align 4
  store i32 0, i32* %Factor, align 4
  %i = alloca i32, align 4
  store i32 %0, i32* %i, align 4
  %j = alloca i32, align 4
  %gttmp = icmp sgt i32 %0, 0
  br i1 %gttmp, label %for_cond, label %ifcont10

for_cond:                                         ; preds = %entry, %ifcont
  %j8 = phi i32 [ %addtmp, %ifcont ], [ 1, %entry ]
  store i32 %j8, i32* %j, align 4
  %letmp.not = icmp sgt i32 %j8, %0
  br i1 %letmp.not, label %ifcont10, label %for_body

for_body:                                         ; preds = %for_cond
  %modtmp = srem i32 %0, %j8
  %eqtmp = icmp eq i32 %modtmp, 0
  br i1 %eqtmp, label %then6, label %ifcont

then6:                                            ; preds = %for_body
  %1 = call i32 (i8*, ...) @printf(i8* nonnull dereferenceable(1) getelementptr inbounds ([4 x i8], [4 x i8]* @0, i64 0, i64 0), i32 %j8)
  br label %ifcont

ifcont:                                           ; preds = %for_body, %then6
  %addtmp = add i32 %j8, 1
  br label %for_cond

ifcont10:                                         ; preds = %entry, %for_cond
  ret i32 0

define i32 @main() {
  br label %repeat_body

repeat_body:                                      ; preds = %repeat_body, %entry
  %0 = call i32 (i8*, ...) @scanf(i8* getelementptr inbounds ([3 x i8], [3 x i8]* @1, i64 0, i64 0), i32* nonnull @i)
  %i = load i32, i32* @i, align 4
  %1 = call i32 @Factor(i32 %i)
  store i32 %1, i32* @d, align 4
  %i1 = load i32, i32* @i, align 4
  %letmp = icmp slt i32 %i1, 1
  br i1 %letmp, label %repeat_exit, label %repeat_body

repeat_exit:                                      ; preds = %repeat_body
  ret i32 0

CFG for 'Factor' before optimizations

CFG for 'Factor' after optimizations

What's next

This is a short post to give an overview of the project. I will definitely provide more content on the nitty-gritty details of implementing the WinZigC compiler. Until then, stay tuned for more updates on the project!

Happy Coding!

If you like it, share it!

Created by potrace 1.16, written by Peter Selinger 2001-2019 © 2024 Developer Diary. All Rights Reserved.

Made withusing Gatsby, served to your browser from a home grown Raspberry Pi cluster.