
How it works

The Lexer

The lexer is the first stage of the pipeline. It reads raw source text one character at a time and produces a stream of tokens - the atoms from which the parser builds a program tree.

Input:   str           (received from previous stage)
Output:  List[Token]   (passed to next stage)
Source:  lexer.py      (256 lines)

What is a lexer?

A lexer (also called a scanner or tokenizer) converts a flat string of characters into a structured sequence of tokens. The parser cannot work directly on raw text - it needs discrete, labeled units it can reason about grammatically.

Think of it like reading a sentence. Before you can parse "the cat sat on the mat" into subject-verb-object structure, your brain first groups the characters into words. The lexer does the same thing for source code.
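
The word-grouping analogy can be made concrete with a toy tokenizer. This is purely illustrative, not kemlang-py code: it splits text on whitespace and attaches a label to each chunk, which is the essence of what a lexer does.

```python
# Toy illustration of lexing: group characters into labeled units.
def naive_tokenize(text: str) -> list[tuple[str, str]]:
    tokens = []
    for word in text.split():
        # Label each chunk; a real lexer has many more categories.
        kind = "NUMBER" if word.isdigit() else "WORD"
        tokens.append((kind, word))
    return tokens

print(naive_tokenize("the cat sat on 2 mats"))
# -> [('WORD', 'the'), ('WORD', 'cat'), ('WORD', 'sat'),
#     ('WORD', 'on'), ('NUMBER', '2'), ('WORD', 'mats')]
```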

What is a token?

A token is a small, labeled chunk of source text. In kemlang-py, every token carries four required fields plus an optional literal value:

kemlang/types.py
from dataclasses import dataclass
from typing import Any

@dataclass
class Token:
    type:    TokenType   # what kind of thing this is (an enum member)
    lexeme:  str         # the exact source text, e.g. "bhai bol"
    line:    int         # 1-indexed line number in the source file
    col:     int         # 0-indexed column offset on that line
    literal: Any = None  # parsed value for strings/numbers
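
To make the fields concrete, here is a self-contained usage sketch. The TokenType members shown are a minimal stand-in for kemlang-py's real enum, not its actual definition:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any

class TokenType(Enum):   # minimal stand-in, not kemlang-py's full enum
    BHAI_BOL = auto()
    STRING = auto()

@dataclass
class Token:
    type:    TokenType
    lexeme:  str
    line:    int
    col:     int
    literal: Any = None  # only strings/numbers carry a parsed value

# The STRING token from the trace below: lexeme keeps the quotes,
# literal holds the parsed value without them.
tok = Token(TokenType.STRING, '"kem cho!"', 2, 10, literal="kem cho!")
print(tok.lexeme, tok.literal)
```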

How the scanning loop works

The lexer maintains a cursor (self.current) that advances through the source string. On each iteration of the main loop, it calls scan_token(), which reads the character at the cursor, decides what kind of token starts here, and advances the cursor past it.

kemlang/lexer.py - main loop
def tokenize(self) -> list[Token]:
    while not self.is_at_end():
        self.start = self.current   # mark start of next token
        self.scan_token()           # consume chars, emit token

    self.tokens.append(Token(TokenType.EOF, "", self.line, self.col))
    return self.tokens
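
The loop above leans on a few cursor helpers (is_at_end, advance, and the line/col bookkeeping). A hypothetical, self-contained sketch of what they might look like; kemlang-py's actual helpers may differ:

```python
class MiniCursor:
    """Hypothetical sketch of the cursor state the scanning loop relies on."""

    def __init__(self, source: str):
        self.source = source
        self.current = 0   # index of the next unread character
        self.line = 1      # 1-indexed, matching Token.line
        self.col = 0       # 0-indexed, matching Token.col

    def is_at_end(self) -> bool:
        return self.current >= len(self.source)

    def advance(self) -> str:
        """Consume one character, keeping line/col in sync."""
        ch = self.source[self.current]
        self.current += 1
        if ch == "\n":
            self.line += 1
            self.col = 0
        else:
            self.col += 1
        return ch
```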

The multi-word keyword problem

Most programming languages use single-word reserved words: if, while, print. kemlang-py's Gujarati keywords are phrases: bhai bol (print), aavjo bhai (program end), bapu tame bolo (read input). The word bhai alone is not a valid token - only the full phrase is.

kemlang-py solves this by checking for multi-word sequences at the start of every scan_token() call, before doing anything else, using Python's str.startswith() to peek ahead.

kemlang/lexer.py - multi-word keyword detection
self.multiword_keywords = [
    ("kem bhai",       TokenType.KEM_BHAI),
    ("aavjo bhai",     TokenType.AAVJO_BHAI),
    ("bhai bol",       TokenType.BHAI_BOL),
    ("bapu tame bolo", TokenType.BAPU_TAME_BOLO),
    ("bhai chhe",      TokenType.BHAI_CHHE),
    ("bhai nathi",     TokenType.BHAI_NATHI),
    ("jya sudhi",      TokenType.JYA_SUDHI),
    ("tame jao",       TokenType.TAME_JAO),
    ("aagal vado",     TokenType.AAGAL_VADO),
    ("nahi to",        TokenType.ELSE),
]

# At scan time: check multi-word before anything else
remaining = self.source[self.current - 1:]
for phrase, token_type in self.multiword_keywords:
    if remaining.startswith(phrase):
        self.current += len(phrase) - 1   # advance past full phrase
        self.add_token(token_type)
        return

Step-by-step scan trace

input source
kem bhai
  bhai bol "kem cho!"
aavjo bhai
scanning trace - cursor advances left-to-right through source

  Line 1:  k e m   b h a i
           ^
           remaining starts with "kem bhai"  -> MATCH
           emit  KEM_BHAI   lexeme='kem bhai'   1:0
           advance 8 chars, hit '\n', increment line counter

  Line 2:     b h a i   b o l   " k e m   c h o ! "
           ^
           skip 2 leading spaces
              ^
              remaining starts with "bhai bol"  -> MATCH
              emit  BHAI_BOL  lexeme='bhai bol'  2:2
              advance 8 chars

              skip 1 space
                         ^
                         char is '"'  -> start string scan
                         advance until closing '"' found
                         emit  STRING  lexeme='"kem cho!"'  2:10
                         advance 10 chars

  Line 3:  a a v j o   b h a i
           ^
           remaining starts with "aavjo bhai"  -> MATCH
           emit  AAVJO_BHAI  lexeme='aavjo bhai'  3:0
           advance 10 chars

  End of source:
           emit  EOF  lexeme=''  4:0

  Final stream:
  ┌──────────────┬──────────────────────┬───────┐
  │ type         │ lexeme               │ pos   │
  ├──────────────┼──────────────────────┼───────┤
  │ KEM_BHAI     │ 'kem bhai'           │ 1:0   │
  │ BHAI_BOL     │ 'bhai bol'           │ 2:2   │
  │ STRING       │ '"kem cho!"'         │ 2:10  │
  │ AAVJO_BHAI   │ 'aavjo bhai'         │ 3:0   │
  │ EOF          │ ''                   │ 4:0   │
  └──────────────┴──────────────────────┴───────┘

Scanning priority

scan_token() decision order - first match wins

  On each new character:

  1.  whitespace     space / tab / '\r'            skip, advance
  2.  newline        '\n'                          emit NEWLINE, next line
  3.  multi-word kw  "bhai bol", "kem bhai"...  emit keyword token
  4.  comment        //                          skip to end of line
  5.  operator       + - * / % ( ) { }          emit operator token
  6.  two-char op    == != <= >=  (peek next)    emit operator token
  7.  string         "                           scan to closing "
  8.  digit          0-9                         scan integer or float
  9.  letter         a-z A-Z _                   scan word, check keywords
                                                 -> keyword token or IDENTIFIER
  10. (no match)     any other character         raise LexerError
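
The first-match-wins order can be condensed into a dispatch sketch. This is not the real scan_token() (two-character operators are lumped in with single-character ones here, and it only classifies rather than emitting tokens), but it shows why the ordering matters, e.g. the comment check must run before '/' is treated as an operator:

```python
# Condensed sketch of the first-match-wins dispatch. `remaining` is the
# unscanned tail of the source and is assumed non-empty.
MULTIWORD_PHRASES = ("kem bhai", "bhai bol", "aavjo bhai", "bapu tame bolo")

def classify(remaining: str) -> str:
    ch = remaining[0]
    if ch in " \t":
        return "skip-whitespace"
    if ch == "\n":
        return "NEWLINE"
    if any(remaining.startswith(p) for p in MULTIWORD_PHRASES):
        return "multi-word keyword"
    if remaining.startswith("//"):      # must precede the operator check
        return "comment"
    if ch in "+-*/%(){}":
        return "operator"
    if ch == '"':
        return "string"
    if ch.isdigit():
        return "number"
    if ch.isalpha() or ch == "_":
        return "word"                   # keyword lookup or IDENTIFIER
    raise ValueError(f"unexpected character {ch!r}")
```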

All token types

  token type                                    source text / meaning

  MULTI-WORD KEYWORDS
  KEM_BHAI                                      kem bhai - program start
  AAVJO_BHAI                                    aavjo bhai - program end
  BHAI_BOL                                      bhai bol - print statement
  BAPU_TAME_BOLO                                bapu tame bolo - read input
  BHAI_CHHE                                     bhai chhe - boolean true
  BHAI_NATHI                                    bhai nathi - boolean false
  JYA_SUDHI                                     jya sudhi - while condition
  TAME_JAO                                      tame jao - break
  AAGAL_VADO                                    aagal vado - continue
  ELSE                                          nahi to - else branch

  SINGLE-WORD KEYWORDS
  AA                                            aa - variable declaration
  CHE                                           che - assignment
  JO                                            jo - if
  FARVU                                         farvu - loop body

  LITERALS
  INTEGER                                       42, 0, -1 (Python int)
  FLOAT                                         3.14, 0.5 (Python float)
  STRING                                        "hello" (double-quoted, single-line)
  BOOLEAN                                       bhai chhe / bhai nathi
  IDENTIFIER                                    x, score, myVar (variable names)

  OPERATORS / DELIMITERS
  PLUS / MINUS / MULTIPLY / DIVIDE / MODULO     + - * / %
  EQUAL / NOT_EQUAL                             == !=
  LESS / GREATER / LESS_EQUAL / GREATER_EQUAL   < > <= >=
  LEFT_BRACE / RIGHT_BRACE                      { }

  SPECIAL
  EOF                                           end of file - always the last token emitted
  NEWLINE                                       line break - filtered out by the parser

What the lexer rejects

Unexpected character: bhai bol x?2

? is not part of any token type. A LexerError is raised immediately at the bad position.

Unterminated string: bhai bol "hello

The lexer scans for a closing " on the same line. Hitting end-of-line before the closing quote raises a LexerError.

Invalid number: 3.14.15

Numbers may have at most one decimal point. The second . raises a LexerError.
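
As an illustration of the unterminated-string case, here is a hedged sketch of a string scanner that enforces the same-line rule. The names are hypothetical and kemlang-py's actual implementation may differ:

```python
class LexerError(Exception):
    pass

def scan_string(source: str, start: int) -> tuple[str, int]:
    """Scan a string whose opening quote sits at `start`.

    Returns (literal_without_quotes, index_past_closing_quote).
    Raises LexerError if the line or the source ends before the
    closing quote is found.
    """
    i = start + 1
    while i < len(source) and source[i] not in '"\n':
        i += 1
    if i >= len(source) or source[i] == "\n":
        raise LexerError(f"unterminated string starting at index {start}")
    return source[start + 1:i], i + 1
```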

Why hand-written over regex?

Better error messages

A hand-written scanner knows exactly where it is in the source at all times. It can report the precise line and column of every error.

Multi-word keyword support

Regex-based lexers split on word boundaries before checking for keywords. Matching 'bhai bol' as a two-word unit is much simpler with startswith().

Full control over scanning

The scanner can implement context-sensitive behavior (string scanning differs from identifier scanning) without complex regex lookahead.