A comprehensive lexical analyzer (tokenizer) implementation in C++ that converts regular expressions into deterministic finite automata (DFA) for efficient pattern matching and tokenization.
This project implements the complete theoretical foundation of lexical analysis used in compiler construction:
Regular Expression → AST → NFA → DFA → Tokenizer
The lexical analyzer takes regular expression patterns, converts them through multiple representations, and uses the resulting DFAs to tokenize input strings using a longest-match greedy strategy.
- Fundamental building block for both NFA and DFA
- Manages unique state IDs, acceptance status, and transitions
- Supports epsilon transitions (represented as `'\0'`)
- Thread-safe ID generation using a static counter
- Non-deterministic Finite Automaton implementation
- Supports Thompson's Construction operations:
  - `alternation(a, b)` - implements `a|b`
  - `concatenation(a, b)` - implements `ab`
  - `kleeneStar(a)` - implements `a*`
- Maintains start state and accepting states
- Note: Composition operations are destructive (modify input NFAs)
- Recursive descent parser for regular expressions
- Converts regex strings to Abstract Syntax Trees (AST)
- Supports operators with correct precedence:
  - Parentheses `()` - highest precedence (grouping)
  - Kleene star `*` - binds tightly to its operand
  - Concatenation (implicit) - medium precedence
  - Alternation `|` - lowest precedence
- Token types: `CHAR`, `ALTERNATION`, `STAR`, `LPAREN`, `RPAREN`, `END`
- Implements Thompson's Construction algorithm
- Converts regex AST to NFA
- Recursively builds NFAs using composition operations
- Creates predictable NFA structures with epsilon transitions
- Implements the Subset Construction (Powerset Construction) algorithm
- Converts NFA to DFA by computing epsilon closures
- Creates `CombinedState` objects representing sets of NFA states
- Eliminates non-determinism for efficient matching
- Deterministic Finite Automaton implementation
- Uses `CombinedState` - a wrapper containing a set of NFA states
- No epsilon transitions (fully deterministic)
- Supports efficient pattern matching
- High-level tokenization interface
- Manages multiple regex patterns with associated names
- Implements longest-match greedy tokenization strategy
- Automatically handles whitespace
- Full pipeline integration: regex → parser → Thompson → subset → DFA
✓ Complete Regex Support
- Character literals
- Alternation (`|`)
- Kleene star (`*`)
- Concatenation (implicit)
- Parentheses for grouping
✓ Efficient Tokenization
- Longest-match strategy
- Greedy matching
- Multiple pattern support with priority
- Automatic whitespace handling
✓ Robust Implementation
- Proper epsilon closure computation
- Correct DFA state acceptance tracking
- Safe iterator handling
- Exception-based error reporting
- C++17 compatible compiler (clang++ or g++)
- Make build system
```sh
# Build the project
make

# Clean build artifacts
make clean

# Rebuild from scratch
make rebuild

# Build and run
make run
```

```
lexical-analyzer/
├── state.h / state.cpp          # State representation
├── NFA.h / NFA.cpp              # NFA implementation
├── DFA.h / DFA.cpp              # DFA implementation
├── regex_parser.h / .cpp        # Regex parser
├── ThompsonConstructor.h / .cpp # NFA construction
├── SubsetConstructor.h / .cpp   # NFA to DFA conversion
├── Lexer.h / Lexer.cpp          # Tokenization engine
├── main.cpp                     # Test suite
├── Makefile                     # Build configuration
└── README.md                    # This file
```
```cpp
#include "Lexer.h"

#include <iostream>
#include <string>
#include <vector>

using namespace std;

int main() {
    Lexer lexer;

    // Add patterns with names
    lexer.addPattern("if|else|while", "KEYWORD");
    lexer.addPattern("(a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z)*", "IDENTIFIER");
    lexer.addPattern("(0|1|2|3|4|5|6|7|8|9)*", "NUMBER");

    // Tokenize input
    string input = "if x else y";
    vector<LexerToken> tokens = lexer.tokenize(input);

    // Process tokens
    for (const auto& token : tokens) {
        cout << token.patternName << ": " << token.value << endl;
    }
    // Output:
    // KEYWORD: if
    // IDENTIFIER: x
    // KEYWORD: else
    // IDENTIFIER: y
    return 0;
}
```

Thompson's Construction creates an NFA from a regular expression with the following properties:
- Exactly one start state
- Exactly one accepting state
- Epsilon transitions connect sub-NFAs
- O(n) states for regex of length n
Example: For regex `(a|b)*c`:

```
               ┌─ε→ a ─ε┐
start ─ε→ ○ ───┤        ├──→ ○ ─ε→ c ──→ (accept)
          ↑    └─ε→ b ─ε┘    │
          └────────ε─────────┘
```
Converts NFA to DFA by:
- Computing epsilon closure of start state → DFA start state
- For each DFA state and input symbol:
- Find all NFA states reachable via that symbol
- Compute epsilon closure of those states
- Create new DFA state if needed
- Mark DFA states as accepting if they contain any NFA accepting state
Time Complexity: O(2^n) worst case, typically much better in practice
The lexer implements greedy longest-match strategy:
- At each position, try all patterns
- Select the pattern with the longest match
- If multiple patterns match the same length, use the first registered
- Advance input position by match length
- Skip whitespace automatically
- Throw exception for unrecognizable characters
The project includes a comprehensive test suite (main.cpp) covering:
- Single character matching: Basic pattern recognition
- Concatenation: Multi-character sequences
- Whitespace handling: Automatic space skipping
- Alternation patterns: Choice between alternatives (`if|else`)
- Mixed patterns: Complex combinations
- Edge cases: Empty strings, whitespace-only input
- Error handling: Invalid character detection
Run tests with:

```sh
make run
```

The implementation carefully handles C++ iterator lifetime:
- Stores vector copies before iteration to avoid dangling iterators
- Prevents undefined behavior from iterating over temporary objects
DFA matching checks acceptance status:
- After every successful transition
- At the start (for patterns like `a*` that accept the empty string)
- Uses backtracking to the last accepting position on failure
The tokenization loop properly handles index advancement:
- Accounts for automatic loop increment when manually advancing
- Prevents off-by-one errors and character skipping
- No support for extended regex features (e.g., character classes `[a-z]`, quantifiers `{n,m}`, anchors `^`/`$`)
- No DFA optimization/minimization
- No support for capture groups or backreferences
- Destructive NFA composition operations (modifies input NFAs)
- Case-sensitive matching only
Potential improvements:
- DFA minimization using Hopcroft's algorithm
- Extended regex syntax support
- Character class support (`[a-z]`, `[0-9]`)
- Quantifiers (`+`, `?`, `{n,m}`)
- Escape sequences (`\n`, `\t`, `\.`)
- Non-destructive NFA operations (deep cloning)
- Performance profiling and optimization
- Visualization of NFA/DFA state machines
- Unicode support
This implementation is based on fundamental concepts from formal language theory and compiler construction:
- Regular Languages: Languages recognizable by finite automata
- Thompson's Construction: Kenneth Thompson (1968) - efficient NFA construction
- Subset Construction: Rabin & Scott (1959) - NFA to DFA conversion
- Lexical Analysis: First phase of compilation, tokenization of source code
- "Compilers: Principles, Techniques, and Tools" (Dragon Book) - Aho, Lam, Sethi, Ullman
- "Introduction to Automata Theory, Languages, and Computation" - Hopcroft, Motwani, Ullman
- "Engineering a Compiler" - Cooper & Torczon
This project was developed as an educational implementation of lexical analysis theory.
Special Thanks: Claude AI assisted with debugging, optimization, and comprehensive testing.
This project is provided for educational purposes.
Note: This implementation prioritizes clarity and correctness over performance. It serves as a learning tool for understanding compiler construction fundamentals.