A comprehensive lexical analyzer (tokenizer) implementation in C++ that converts regular expressions into deterministic finite automata (DFA) for efficient pattern matching and tokenization.
This project implements the complete theoretical foundation of lexical analysis used in compiler construction:
Regular Expression → AST → NFA → DFA → Tokenizer
The lexical analyzer takes regular expression patterns, converts them through multiple representations, and uses the resulting DFAs to tokenize input strings using a longest-match greedy strategy.
- Fundamental building block for both NFA and DFA
- Manages unique state IDs, acceptance status, and transitions
- Supports epsilon transitions (represented as `'\0'`)
- Thread-safe ID generation using a static counter
- Non-deterministic Finite Automaton implementation
- Supports Thompson's Construction operations:
  - `alternation(a, b)` - implements `a|b`
  - `concatenation(a, b)` - implements `ab`
  - `kleeneStar(a)` - implements `a*`
- Maintains start state and accepting states
- Note: Composition operations are destructive (modify input NFAs)
- Recursive descent parser for regular expressions
- Converts regex strings to Abstract Syntax Trees (AST)
- Supports operators with correct precedence:
  - Parentheses `()` - highest precedence (grouping)
  - Kleene star `*` - binds tightly to its operand
  - Concatenation (implicit) - medium precedence
  - Alternation `|` - lowest precedence
- Token types: `CHAR`, `ALTERNATION`, `STAR`, `LPAREN`, `RPAREN`, `END`
- Implements Thompson's Construction algorithm
- Converts regex AST to NFA
- Recursively builds NFAs using composition operations
- Creates predictable NFA structures with epsilon transitions
- Implements the Subset Construction (Powerset Construction) algorithm
- Converts NFA to DFA by computing epsilon closures
- Creates `CombinedState` objects representing sets of NFA states
- Eliminates non-determinism for efficient matching
- Deterministic Finite Automaton implementation
- Uses `CombinedState` - a wrapper containing a set of NFA states
- No epsilon transitions (fully deterministic)
- Supports efficient pattern matching
- High-level tokenization interface
- Manages multiple regex patterns with associated names
- Implements longest-match greedy tokenization strategy
- Automatically handles whitespace
- Full pipeline integration: regex → parser → Thompson → subset → DFA
✓ Complete Regex Support
- Character literals
- Alternation (`|`)
- Kleene star (`*`)
- Concatenation (implicit)
- Parentheses for grouping
✓ Efficient Tokenization
- Longest-match strategy
- Greedy matching
- Multiple pattern support with priority
- Automatic whitespace handling
✓ Robust Implementation
- Proper epsilon closure computation
- Correct DFA state acceptance tracking
- Safe iterator handling
- Exception-based error reporting
- C++17 compatible compiler (clang++ or g++)
- Make build system
```sh
# Build the project
make

# Clean build artifacts
make clean

# Rebuild from scratch
make rebuild

# Build and run
make run
```

```
lexical-analyzer/
├── state.h / state.cpp          # State representation
├── NFA.h / NFA.cpp              # NFA implementation
├── DFA.h / DFA.cpp              # DFA implementation
├── regex_parser.h / .cpp        # Regex parser
├── ThompsonConstructor.h / .cpp # NFA construction
├── SubsetConstructor.h / .cpp   # NFA to DFA conversion
├── Lexer.h / Lexer.cpp          # Tokenization engine
├── main.cpp                     # Test suite
├── Makefile                     # Build configuration
└── README.md                    # This file
```
```cpp
#include "Lexer.h"

#include <iostream>
#include <string>
#include <vector>

using namespace std;

int main() {
    Lexer lexer;

    // Add patterns with names
    lexer.addPattern("if|else|while", "KEYWORD");
    lexer.addPattern("(a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z)*", "IDENTIFIER");
    lexer.addPattern("(0|1|2|3|4|5|6|7|8|9)*", "NUMBER");

    // Tokenize input
    string input = "if x else y";
    vector<LexerToken> tokens = lexer.tokenize(input);

    // Process tokens
    for (const auto& token : tokens) {
        cout << token.patternName << ": " << token.value << endl;
    }
    // Output:
    // KEYWORD: if
    // IDENTIFIER: x
    // KEYWORD: else
    // IDENTIFIER: y
    return 0;
}
```

Thompson's Construction creates an NFA from a regular expression with the following properties:
- Exactly one start state
- Exactly one accepting state
- Epsilon transitions connect sub-NFAs
- O(n) states for regex of length n
Example: For regex `(a|b)*c`:

```
               ┌─ε→ a ─ε┐
start ─ε→ ○ ───┤        ├──→ ○ ─ε→ c ──→ (accept)
          ↑    └─ε→ b ─ε┘    │
          └────────ε─────────┘
```
Converts NFA to DFA by:
- Computing epsilon closure of start state → DFA start state
- For each DFA state and input symbol:
- Find all NFA states reachable via that symbol
- Compute epsilon closure of those states
- Create new DFA state if needed
- Mark DFA states as accepting if they contain any NFA accepting state
Time Complexity: O(2^n) worst case, typically much better in practice
The lexer implements greedy longest-match strategy:
- At each position, try all patterns
- Select the pattern with the longest match
- If multiple patterns match the same length, use the first registered
- Advance input position by match length
- Skip whitespace automatically
- Throw exception for unrecognizable characters
The project includes a comprehensive test suite (main.cpp) covering:
- Single character matching: Basic pattern recognition
- Concatenation: Multi-character sequences
- Whitespace handling: Automatic space skipping
- Alternation patterns: Choice between alternatives (`if|else`)
- Mixed patterns: Complex combinations
- Edge cases: Empty strings, whitespace-only input
- Error handling: Invalid character detection
Run tests with:

```sh
make run
```

The implementation carefully handles C++ iterator lifetime:
- Stores vector copies before iteration to avoid dangling iterators
- Prevents undefined behavior from iterating over temporary objects
DFA matching checks acceptance status:
- After every successful transition
- At the start (for patterns like `a*` that accept the empty string)
- Uses backtracking to the last accepting position on failure
The tokenization loop properly handles index advancement:
- Accounts for automatic loop increment when manually advancing
- Prevents off-by-one errors and character skipping
- No support for extended regex features (e.g., character classes `[a-z]`, quantifiers `{n,m}`, anchors `^`/`$`)
- No DFA optimization/minimization
- No support for capture groups or backreferences
- Destructive NFA composition operations (modifies input NFAs)
- Case-sensitive matching only
Potential improvements:
- DFA minimization using Hopcroft's algorithm
- Extended regex syntax support
- Character class support (`[a-z]`, `[0-9]`)
- Quantifiers (`+`, `?`, `{n,m}`)
- Escape sequences (`\n`, `\t`, `\.`)
- Non-destructive NFA operations (deep cloning)
- Performance profiling and optimization
- Visualization of NFA/DFA state machines
- Unicode support
This implementation is based on fundamental concepts from formal language theory and compiler construction:
- Regular Languages: Languages recognizable by finite automata
- Thompson's Construction: Kenneth Thompson (1968) - efficient NFA construction
- Subset Construction: Rabin & Scott (1959) - NFA to DFA conversion
- Lexical Analysis: First phase of compilation, tokenization of source code
- "Compilers: Principles, Techniques, and Tools" (Dragon Book) - Aho, Lam, Sethi, Ullman
- "Introduction to Automata Theory, Languages, and Computation" - Hopcroft, Motwani, Ullman
- "Engineering a Compiler" - Cooper & Torczon
This project was developed as an educational implementation of lexical analysis theory.
Special Thanks: Claude AI assisted with debugging, optimization, and comprehensive testing.
This project is provided for educational purposes.
Note: This implementation prioritizes clarity and correctness over performance. It serves as a learning tool for understanding compiler construction fundamentals.