How it works
The Lexer
The lexer is the first stage of the pipeline. It reads raw source text one character at a time and produces a stream of tokens - the atoms from which the parser builds a program tree.
Input: str (received from previous stage)
Output: List[Token] (passed to next stage)
Source: lexer.py (256 lines)
What is a lexer?
A lexer (also called a scanner or tokenizer) converts a flat string of characters into a structured sequence of tokens. The parser cannot work directly on raw text - it needs discrete, labeled units it can reason about grammatically.
Think of it like reading a sentence. Before you can parse "the cat sat on the mat" into subject-verb-object structure, your brain first groups the characters into words. The lexer does the same thing for source code.
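The word-grouping analogy can be seen in miniature with plain Python (illustrative only, not the project's code):

```python
# A sentence as a flat character sequence versus grouped "tokens".
sentence = "the cat sat"
characters = list(sentence)  # what the lexer receives: raw characters
words = sentence.split()     # what it produces: discrete units
print(characters[:5])        # -> ['t', 'h', 'e', ' ', 'c']
print(words)                 # -> ['the', 'cat', 'sat']
```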
What is a token?
A token is a small, labeled chunk of source text. In kemlang-py, every token carries four pieces of information, plus an optional parsed literal:
from dataclasses import dataclass
from typing import Any

@dataclass
class Token:
    type: TokenType      # what kind of thing this is
    lexeme: str          # the exact source text, e.g. "bhai bol"
    line: int            # 1-indexed line number in the source file
    col: int             # 0-indexed column offset on that line
    literal: Any = None  # parsed value for strings/numbers

How the scanning loop works
The lexer maintains a cursor (self.current) that advances through the source string. On each iteration of the main loop, it calls scan_token(), which reads the character at the cursor, decides what kind of token starts here, and advances the cursor past it.
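The start/current cursor pattern can be sketched in miniature before looking at the real code. This toy scanner only splits on spaces (illustrative, not the actual lexer.py):

```python
class MiniScanner:
    """Toy scanner: emits bare words instead of typed tokens."""

    def __init__(self, source: str):
        self.source = source
        self.start = 0            # start of the token being scanned
        self.current = 0          # cursor advancing through the source
        self.tokens: list[str] = []

    def is_at_end(self) -> bool:
        return self.current >= len(self.source)

    def tokenize(self) -> list[str]:
        while not self.is_at_end():
            self.start = self.current  # mark start of next token
            self.scan_token()          # consume chars, emit token
        self.tokens.append("EOF")
        return self.tokens

    def scan_token(self) -> None:
        ch = self.source[self.current]
        self.current += 1
        if ch == " ":
            return                     # whitespace: skip, emit nothing
        while not self.is_at_end() and self.source[self.current] != " ":
            self.current += 1          # consume the rest of the word
        self.tokens.append(self.source[self.start:self.current])

print(MiniScanner("kem bhai").tokenize())  # -> ['kem', 'bhai', 'EOF']
```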
def tokenize(self) -> list[Token]:
    while not self.is_at_end():
        self.start = self.current  # mark start of next token
        self.scan_token()          # consume chars, emit token
    self.tokens.append(Token(TokenType.EOF, "", self.line, self.col))
    return self.tokens

The multi-word keyword problem
Most programming languages use single-word reserved words: if, while, print. kemlang-py's Gujarati keywords are phrases: bhai bol (print), aavjo bhai (program end), bapu tame bolo (read input). The word bhai alone is not a valid token - only the full phrase is.
kemlang-py solves this by checking for multi-word sequences at the start of every scan_token() call, before doing anything else, using Python's string startswith() to peek ahead.
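That lookahead can be exercised standalone. The helper below is hypothetical (the real lexer does this inline in scan_token()) and uses a shortened keyword table:

```python
from typing import Optional, Tuple

# Abbreviated stand-in for the real multiword_keywords table.
MULTIWORD = [
    ("bapu tame bolo", "BAPU_TAME_BOLO"),
    ("bhai bol", "BHAI_BOL"),
    ("kem bhai", "KEM_BHAI"),
]

def match_multiword(source: str, pos: int) -> Optional[Tuple[str, int]]:
    """Return (token_type, chars consumed) if a keyword phrase starts at pos."""
    remaining = source[pos:]
    for phrase, token_type in MULTIWORD:
        if remaining.startswith(phrase):
            return token_type, len(phrase)
    return None

print(match_multiword("bhai bol x", 0))  # -> ('BHAI_BOL', 8)
print(match_multiword("bolo bhai", 0))   # -> None
```

One subtlety: if two phrases ever shared a prefix (say a hypothetical "bhai bol" and "bhai bole"), the longer one would need to appear first in the table, since the first startswith() match wins.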
self.multiword_keywords = [
    ("kem bhai", TokenType.KEM_BHAI),
    ("aavjo bhai", TokenType.AAVJO_BHAI),
    ("bhai bol", TokenType.BHAI_BOL),
    ("bapu tame bolo", TokenType.BAPU_TAME_BOLO),
    ("bhai chhe", TokenType.BHAI_CHHE),
    ("bhai nathi", TokenType.BHAI_NATHI),
    ("jya sudhi", TokenType.JYA_SUDHI),
    ("tame jao", TokenType.TAME_JAO),
    ("aagal vado", TokenType.AAGAL_VADO),
    ("nahi to", TokenType.ELSE),
]

# At scan time: check multi-word before anything else.
# scan_token() has already consumed one character, so look back one.
remaining = self.source[self.current - 1:]
for phrase, token_type in self.multiword_keywords:
    if remaining.startswith(phrase):
        self.current += len(phrase) - 1  # advance past the full phrase
        self.add_token(token_type)
        return

Step-by-step scan trace
Source program:

kem bhai
  bhai bol "kem cho!"
aavjo bhai

Line 1: kem bhai
    remaining starts with "kem bhai" -> MATCH
    emit KEM_BHAI  lexeme='kem bhai'  1:0
    advance 8 chars, hit '\n', increment line counter

Line 2:   bhai bol "kem cho!"
    skip 2 leading spaces, cursor at col 2
    remaining starts with "bhai bol" -> MATCH
    emit BHAI_BOL  lexeme='bhai bol'  2:2
    advance 8 chars, then skip 1 space, cursor at col 10
    char is '"' -> start string scan
    advance until closing '"' found
    emit STRING  lexeme='"kem cho!"'  2:10
    advance 10 chars, hit '\n', increment line counter

Line 3: aavjo bhai
    remaining starts with "aavjo bhai" -> MATCH
    emit AAVJO_BHAI  lexeme='aavjo bhai'  3:0
    advance 10 chars

End of source:
    emit EOF  lexeme=''  4:0

Final stream:

┌──────────────┬──────────────────────┬───────┐
│ type         │ lexeme               │ pos   │
├──────────────┼──────────────────────┼───────┤
│ KEM_BHAI     │ 'kem bhai'           │ 1:0   │
│ BHAI_BOL     │ 'bhai bol'           │ 2:2   │
│ STRING       │ '"kem cho!"'         │ 2:10  │
│ AAVJO_BHAI   │ 'aavjo bhai'         │ 3:0   │
│ EOF          │ ''                   │ 4:0   │
└──────────────┴──────────────────────┴───────┘

Scanning priority
On each new character, scan_token() checks in this order:

1. whitespace (space / tab)                          -> skip, advance
2. newline ('\n')                                    -> emit NEWLINE, increment line counter
3. multi-word keyword ("bhai bol", "kem bhai", ...)  -> emit keyword token
4. comment (//)                                      -> skip to end of line
5. operator (+ - * / % ( ) { })                      -> emit operator token
6. two-char operator (== != <= >=, peek next char)   -> emit operator token
7. string (")                                        -> scan to closing "
8. digit (0-9)                                       -> scan integer or float
9. letter (a-z A-Z _)                                -> scan word, check keyword table
                                                        -> keyword token or IDENTIFIER
10. no match                                         -> raise LexerError
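The dispatch order can be condensed into a character classifier. This is a sketch, not the real scan_token(): the multi-word keyword check is omitted (it needs string lookahead, not a single character), and '/' still needs one char of lookahead to distinguish a comment from division:

```python
def classify(ch: str) -> str:
    """Decide which scanning branch a character triggers, in priority order."""
    if ch in " \t":
        return "whitespace"
    if ch == "\n":
        return "newline"
    if ch == "/":
        return "comment-or-slash"  # '//' -> comment, lone '/' -> operator
    if ch in "+-*%(){}":
        return "operator"
    if ch in "=!<>":
        return "two-char-op"       # peek next char for == != <= >=
    if ch == '"':
        return "string"
    if ch.isdigit():
        return "number"
    if ch.isalpha() or ch == "_":
        return "word"
    return "error"                 # anything else -> LexerError

print([classify(c) for c in '"7b?'])  # -> ['string', 'number', 'word', 'error']
```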
What the lexer rejects
bhai bol x?2
  ? is not part of any token type. A LexerError is raised immediately at the offending position.

bhai bol "hello
  The lexer scans for a closing " on the same line. Reaching end-of-line before the closing quote raises a LexerError.

3.14.15
  Numbers may have at most one decimal point. The second . raises a LexerError.
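The decimal-point rule can be sketched as a standalone check (LexerError and scan_number here are simplified stand-ins, not the project's actual code):

```python
class LexerError(Exception):
    """Simplified stand-in for the lexer's error type."""

def scan_number(text: str):
    """Accept digits with at most one '.'; reject anything else."""
    seen_dot = False
    for i, ch in enumerate(text):
        if ch == ".":
            if seen_dot:
                raise LexerError(f"unexpected '.' at column {i}")
            seen_dot = True
        elif not ch.isdigit():
            raise LexerError(f"unexpected {ch!r} at column {i}")
    return float(text) if seen_dot else int(text)

print(scan_number("3.14"))  # -> 3.14
try:
    scan_number("3.14.15")
except LexerError as e:
    print("LexerError:", e)  # -> LexerError: unexpected '.' at column 4
```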
Why hand-written over regex?
Better error messages
A hand-written scanner knows exactly where it is in the source at all times. It can report the precise line and column of every error.
Multi-word keyword support
A regex-based lexer typically splits the input on word boundaries before checking for keywords, so a phrase like bhai bol arrives as two separate words that would need re-joining. Matching it as a single two-word unit is much simpler with startswith().
Full control over scanning
The scanner can implement context-sensitive behavior (string scanning differs from identifier scanning) without complex regex lookahead.
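The word-boundary point can be seen directly with a quick regex sketch (using Python's re module, not the project's code):

```python
import re

# A typical "split into word-ish tokens" pattern: words, quoted strings,
# or any single non-space character.
pieces = re.findall(r'\w+|"[^"]*"|\S', 'bhai bol "kem cho!"')
print(pieces)  # -> ['bhai', 'bol', '"kem cho!"']
```

The phrase bhai bol comes out as two separate tokens; gluing them back into a keyword would take a second pass, whereas the hand-written scanner recognizes the full phrase with a single startswith() check.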