docs

How it works

The Parser

The parser takes the flat token stream from the lexer and builds a tree that represents the structure of the program - nesting, operator precedence, and the relationships between statements and the expressions they contain.

Input

List[Token]

received from previous stage

Output

Program

passed to next stage

Source

parser.py

295 lines

Why a tree?

Source code has inherent structure. An if statement contains a condition and one or two blocks. A block contains statements. A statement might contain an expression. An expression might contain sub-expressions. This hierarchy of containment is exactly what a tree represents.

The Abstract Syntax Tree (AST) is the parser's answer. It is "abstract" because it drops irrelevant syntax tokens (braces, the word "che", newlines) and keeps only the semantically meaningful structure.

The kemlang-py grammar (BNF)

A context-free grammar defines what sequences of tokens form a valid program.

kemlang-py grammar in BNF notation

  program      ::= "kem bhai" statement* "aavjo bhai"

  statement    ::= print_stmt
                 | declaration
                 | assignment
                 | if_stmt
                 | while_stmt
                 | break_stmt
                 | continue_stmt

  print_stmt   ::= "bhai bol" expression
  declaration  ::= "aa" IDENTIFIER "che" expression
  assignment   ::= IDENTIFIER "che" expression
  break_stmt   ::= "tame jao"
  continue_stmt::= "aagal vado"

  if_stmt      ::= "jo" expression block ( "nahi to" block )?
  while_stmt   ::= "farvu" block "jya sudhi" expression
  block        ::= "{" statement* "}"

  expression   ::= comparison
  comparison   ::= term ( ( "==" | "!=" | "<" | ">" | "<=" | ">=" ) term )*
  term         ::= factor ( ( "+" | "-" ) factor )*
  factor       ::= unary ( ( "*" | "/" | "%" ) unary )*
  unary        ::= "-" unary | primary
  primary      ::= INTEGER | FLOAT | STRING | BOOLEAN
                 | "bapu tame bolo"
                 | IDENTIFIER
                 | "(" expression ")"

Operator precedence

Why does 1 + 2 * 3 equal 7and not 9? The grammar encodes precedence by stratification: lower in the grammar = tighter binding = higher precedence.

operator precedence chain - lower = tighter binding

  expression()        lowest precedence (entry point)
    └── comparison()  == != < > <= >=
          └── term()  + -
                └── factor()  * / %
                      └── unary()  - (negation)
                            └── primary()  literals, variables, (expr)
                                           highest precedence

  Parsing  1 + 2 * 3 :

    term() needs a left operand  -> calls factor()
      factor() needs left  -> calls unary() -> primary() -> Literal(1)
      factor() sees no * / % -> returns Literal(1)
    term() has left=1, sees '+', needs right -> calls factor()
      factor() needs left  -> calls unary() -> primary() -> Literal(2)
      factor() sees '*'   -> calls unary() -> primary() -> Literal(3)
      factor() returns Binary(*, Literal(2), Literal(3))   <- tighter!
    term() returns Binary(+, Literal(1), Binary(*, 2, 3))

  Result tree: 1 + (2 * 3)  =  7   correct

Recursive descent

Each grammar rule has a corresponding Python method. Methods call each other recursively, naturally mirroring the nested structure of the grammar.

kemlang/parser.py - how rules call each other
def statement(self) -> Stmt | None:
    if self.match(TokenType.JO):
        return self.if_statement()    # dispatch on current token
    if self.match(TokenType.AA):
        return self.declaration()
    # ...

def if_statement(self) -> If:
    condition   = self.expression()   # parse the condition
    then_branch = self.block()        # parse { ... }
    else_branch = None
    if self.match(TokenType.ELSE):
        else_branch = self.block()    # optional nahi to { ... }
    return If(condition, then_branch, else_branch)

def expression(self) -> Expr:
    return self.comparison()          # delegate to next level

def comparison(self) -> Expr:
    left = self.term()
    while self.match(TokenType.EQUAL, TokenType.NOT_EQUAL, ...):
        op    = self.previous()
        right = self.term()
        left  = Binary(left, op, right)
    return left

AST node types

Every node is an immutable @dataclass. The parser builds the tree; the interpreter reads it without modification.

node typefields
STATEMENTS
Programstatements: list[Stmt]
Blockstatements: list[Stmt]
Printexpression: Expr
Declarationname: str, initializer: Expr
Assignmentname: str, value: Expr
Ifcondition: Expr, then_branch: Block, else_branch: Block | None
Whilebody: Block, condition: Expr (body runs before condition is checked)
Break(no fields)
Continue(no fields)
EXPRESSIONS
Literalvalue: int | float | str | bool | None
Variablename: str
Binaryleft: Expr, operator: Token, right: Expr
Unaryoperator: Token, right: Expr
Input(no fields) - evaluates bapu tame bolo at runtime

Parsing a complete example

source
kem bhai
  aa x che 10
  jo x > 5 {
    bhai bol "big"
  } nahi to {
    bhai bol "small"
  }
aavjo bhai
resulting AST

  Program
  ├── Declaration
  │   ├── name:         "x"
  │   └── initializer:  Literal(10)
  └── If
      ├── condition:    Binary
      │                  ├── left:     Variable("x")
      │                  ├── operator: Token(GREATER, ">")
      │                  └── right:    Literal(5)
      ├── then_branch:  Block
      │   └── Print
      │       └── expression: Literal("big")
      └── else_branch:  Block
          └── Print
              └── expression: Literal("small")

The AST contains no braces, no che, no jo, no nahi to. The parser consumed those tokens to understand structure, then discarded them.

Parse errors

The parser raises ParseError when the current token does not match what the grammar expects. It reports both the expected token and the actual one found.

example error messages
ParseError: expected '{' after 'jo' condition at line 3, column 14
ParseError: expected 'jya sudhi' after loop body at line 7, column 0
ParseError: program must start with 'kem bhai' at line 1, column 0