
ardafincan/lexical-analyzer


Lexical Analyzer

A comprehensive lexical analyzer (tokenizer) implementation in C++ that converts regular expressions into deterministic finite automata (DFA) for efficient pattern matching and tokenization.

This README was written by AI to explain the project's features and constraints.

Overview

This project implements the complete theoretical foundation of lexical analysis used in compiler construction:

Regular Expression → AST → NFA → DFA → Tokenizer

The lexical analyzer takes regular expression patterns, converts them through multiple representations, and uses the resulting DFAs to tokenize input strings using a longest-match greedy strategy.

Architecture

Core Components

1. State (state.h, state.cpp)

  • Fundamental building block for both NFA and DFA
  • Manages unique state IDs, acceptance status, and transitions
  • Supports epsilon transitions (represented as '\0')
  • Thread-safe ID generation using a static counter
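
A minimal sketch of the State interface described above; the member names here are assumptions for illustration, not the exact API in state.h. Epsilon transitions are keyed on '\0' as noted, and IDs come from an atomic counter so concurrent construction stays safe:

```cpp
#include <atomic>
#include <map>
#include <vector>

// Illustrative State sketch (names are assumptions; see state.h for the
// real class). '\0' encodes an epsilon transition.
class State {
    static std::atomic<int> nextId;           // shared counter for unique IDs
    int id;
    bool accepting = false;
    std::multimap<char, State*> transitions;  // symbol -> target states
public:
    State() : id(nextId++) {}
    int getId() const { return id; }
    void setAccepting(bool a) { accepting = a; }
    bool isAccepting() const { return accepting; }
    void addTransition(char symbol, State* target) {
        transitions.insert({symbol, target});
    }
    // All states reachable on one symbol (an NFA may have several).
    std::vector<State*> next(char symbol) const {
        std::vector<State*> out;
        auto range = transitions.equal_range(symbol);
        for (auto it = range.first; it != range.second; ++it)
            out.push_back(it->second);
        return out;
    }
};
std::atomic<int> State::nextId{0};
```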

2. NFA (NFA.h, NFA.cpp)

  • Non-deterministic Finite Automaton implementation
  • Supports Thompson's Construction operations:
    • alternation(a, b) - Implements a|b
    • concatenation(a, b) - Implements ab
    • kleeneStar(a) - Implements a*
  • Maintains start state and accepting states
  • Note: Composition operations are destructive (modify input NFAs)

3. RegexParser (regex_parser.h, regex_parser.cpp)

  • Recursive descent parser for regular expressions
  • Converts regex strings to Abstract Syntax Trees (AST)
  • Supports operators with correct precedence:
    • Parentheses () - highest precedence
    • Kleene star * - binds tightly to operand
    • Concatenation (implicit) - medium precedence
    • Alternation | - lowest precedence
  • Token types: CHAR, ALTERNATION, STAR, LPAREN, RPAREN, END
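
The precedence ladder above maps naturally onto one parse function per level. The sketch below is a simplified stand-in for the real RegexParser (function and node names are illustrative): alternation calls concatenation, which calls the star level, which calls the atom level, so tighter-binding operators end up deeper in the AST:

```cpp
#include <memory>
#include <string>

// Hypothetical recursive-descent sketch; the project's RegexParser differs
// in naming and token handling. No error recovery: malformed input is
// undefined here, as befits a sketch.
struct Node {
    char kind;  // 'c' literal, '|' alternation, '.' concatenation, '*' star
    char ch = 0;
    std::unique_ptr<Node> left, right;
};

struct Parser {
    std::string src;
    size_t pos = 0;
    char peek() const { return pos < src.size() ? src[pos] : '\0'; }

    // alt := cat ('|' cat)*          -- lowest precedence
    std::unique_ptr<Node> parseAlt() {
        auto lhs = parseCat();
        while (peek() == '|') {
            ++pos;
            auto n = std::make_unique<Node>();
            n->kind = '|'; n->left = std::move(lhs); n->right = parseCat();
            lhs = std::move(n);
        }
        return lhs;
    }
    // cat := star star*              -- implicit concatenation
    std::unique_ptr<Node> parseCat() {
        auto lhs = parseStar();
        while (peek() != '\0' && peek() != '|' && peek() != ')') {
            auto n = std::make_unique<Node>();
            n->kind = '.'; n->left = std::move(lhs); n->right = parseStar();
            lhs = std::move(n);
        }
        return lhs;
    }
    // star := atom '*'*              -- postfix star binds tightest
    std::unique_ptr<Node> parseStar() {
        auto n = parseAtom();
        while (peek() == '*') {
            ++pos;
            auto s = std::make_unique<Node>();
            s->kind = '*'; s->left = std::move(n);
            n = std::move(s);
        }
        return n;
    }
    // atom := '(' alt ')' | CHAR     -- parentheses reset precedence
    std::unique_ptr<Node> parseAtom() {
        if (peek() == '(') {
            ++pos;
            auto n = parseAlt();
            ++pos;  // consume ')'
            return n;
        }
        auto n = std::make_unique<Node>();
        n->kind = 'c'; n->ch = src[pos++];
        return n;
    }
};

// Prefix-form dump, to make the resulting precedence visible.
std::string show(const Node* n) {
    if (n->kind == 'c') return std::string(1, n->ch);
    if (n->kind == '*') return "(* " + show(n->left.get()) + ")";
    return "(" + std::string(1, n->kind) + " " + show(n->left.get()) +
           " " + show(n->right.get()) + ")";
}
```

For `a|bc*`, this yields alternation at the root with the star nested innermost: `(| a (. b (* c)))`.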

4. ThompsonConstructor (ThompsonConstructor.h, ThompsonConstructor.cpp)

  • Implements Thompson's Construction algorithm
  • Converts regex AST to NFA
  • Recursively builds NFAs using composition operations
  • Creates predictable NFA structures with epsilon transitions

5. SubsetConstructor (SubsetConstructor.h, SubsetConstructor.cpp)

  • Implements the Subset Construction (Powerset Construction) algorithm
  • Converts NFA to DFA by computing epsilon closures
  • Creates CombinedState objects representing sets of NFA states
  • Eliminates non-determinism for efficient matching

6. DFA (DFA.h, DFA.cpp)

  • Deterministic Finite Automaton implementation
  • Uses CombinedState - wrapper containing sets of NFA states
  • No epsilon transitions (all deterministic)
  • Supports efficient pattern matching

7. Lexer (Lexer.h, Lexer.cpp)

  • High-level tokenization interface
  • Manages multiple regex patterns with associated names
  • Implements longest-match greedy tokenization strategy
  • Automatically handles whitespace
  • Full pipeline integration: regex → parser → Thompson → subset → DFA

Features

Complete Regex Support

  • Character literals
  • Alternation (|)
  • Kleene star (*)
  • Concatenation (implicit)
  • Parentheses for grouping

Efficient Tokenization

  • Longest-match strategy
  • Greedy matching
  • Multiple pattern support with priority
  • Automatic whitespace handling

Robust Implementation

  • Proper epsilon closure computation
  • Correct DFA state acceptance tracking
  • Safe iterator handling
  • Exception-based error reporting

Building and Running

Prerequisites

  • C++17 compatible compiler (clang++ or g++)
  • Make build system

Build Commands

# Build the project
make

# Clean build artifacts
make clean

# Rebuild from scratch
make rebuild

# Build and run
make run

Project Structure

lexical-analyzer/
├── state.h / state.cpp           # State representation
├── NFA.h / NFA.cpp               # NFA implementation
├── DFA.h / DFA.cpp               # DFA implementation
├── regex_parser.h / .cpp         # Regex parser
├── ThompsonConstructor.h / .cpp  # NFA construction
├── SubsetConstructor.h / .cpp    # NFA to DFA conversion
├── Lexer.h / Lexer.cpp           # Tokenization engine
├── main.cpp                      # Test suite
├── Makefile                      # Build configuration
└── README.md                     # This file

Usage Example

#include <iostream>
#include <string>
#include <vector>

#include "Lexer.h"

using namespace std;

int main() {
    Lexer lexer;

    // Add patterns with names
    lexer.addPattern("if|else|while", "KEYWORD");
    lexer.addPattern("(a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z)*", "IDENTIFIER");
    lexer.addPattern("(0|1|2|3|4|5|6|7|8|9)*", "NUMBER");

    // Tokenize input
    string input = "if x else y";
    vector<LexerToken> tokens = lexer.tokenize(input);

    // Process tokens
    for (const auto& token : tokens) {
        cout << token.patternName << ": " << token.value << endl;
    }
    // Output:
    // KEYWORD: if
    // IDENTIFIER: x
    // KEYWORD: else
    // IDENTIFIER: y

    return 0;
}

Algorithm Details

Thompson's Construction

Creates an NFA from a regular expression with the following properties:

  • Exactly one start state
  • Exactly one accepting state
  • Epsilon transitions connect sub-NFAs
  • O(n) states for regex of length n

Example: For regex (a|b)*c:

      ε     ┌─ε─→ a ─ε─┐
start ─→ ○ ─┤           ├─→ ○ ─ε→ c ─→ (accept)
            └─ε─→ b ─ε─┘     ↑
                 ε ←─────────┘

Subset Construction

Converts NFA to DFA by:

  1. Computing epsilon closure of start state → DFA start state
  2. For each DFA state and input symbol:
    • Find all NFA states reachable via that symbol
    • Compute epsilon closure of those states
    • Create new DFA state if needed
  3. Mark DFA states as accepting if they contain any NFA accepting state

Time Complexity: O(2^n) DFA states in the worst case for an NFA with n states; typically far fewer in practice
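
The epsilon-closure step at the heart of the algorithm can be sketched as a worklist traversal. This simplified version uses plain integer state IDs and a transition multimap rather than the project's State/CombinedState classes (those names and structures are assumptions of the sketch):

```cpp
#include <map>
#include <set>
#include <stack>
#include <utility>

// Simplified epsilon-closure sketch: states are ints, transitions live in
// a multimap keyed by (state, symbol). '\0' marks epsilon, as in state.h.
using Transitions = std::multimap<std::pair<int, char>, int>;

std::set<int> epsilonClosure(const std::set<int>& states, const Transitions& t) {
    std::set<int> closure = states;   // every state is in its own closure
    std::stack<int> work;
    for (int s : states) work.push(s);
    while (!work.empty()) {
        int s = work.top(); work.pop();
        auto range = t.equal_range({s, '\0'});
        for (auto it = range.first; it != range.second; ++it)
            if (closure.insert(it->second).second)  // newly discovered state
                work.push(it->second);              // follow its epsilons too
    }
    return closure;
}
```

Subset construction applies this closure to the NFA start state, then repeatedly to the symbol-successors of each DFA state until no new state sets appear.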

Longest-Match Tokenization

The lexer implements a greedy longest-match strategy:

  1. At each position, try all patterns
  2. Select the pattern with the longest match
  3. If multiple patterns match the same length, use the first registered
  4. Advance input position by match length
  5. Skip whitespace automatically
  6. Throw exception for unrecognizable characters
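
The steps above can be sketched as a single loop. Here each pattern is reduced to a callback returning the match length at a position (-1 for no match); in the real Lexer that role is played by the compiled DFAs, and the type names below are illustrative:

```cpp
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hedged sketch of the greedy longest-match loop (names are assumptions).
struct Pattern {
    std::string name;
    std::function<int(const std::string&, size_t)> match;  // length or -1
};
struct Token { std::string name, value; };

std::vector<Token> tokenize(const std::string& input,
                            const std::vector<Pattern>& patterns) {
    std::vector<Token> out;
    size_t pos = 0;
    while (pos < input.size()) {
        if (input[pos] == ' ') { ++pos; continue; }   // skip whitespace
        int bestLen = -1;
        const Pattern* best = nullptr;
        for (const auto& p : patterns) {              // strict '>' means the
            int len = p.match(input, pos);            // first-registered pattern
            if (len > bestLen) { bestLen = len; best = &p; }  // wins ties
        }
        if (bestLen <= 0)                             // no progress possible
            throw std::runtime_error("unrecognizable character");
        out.push_back({best->name, input.substr(pos, bestLen)});
        pos += bestLen;                               // advance by match length
    }
    return out;
}
```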

Test Suite

The project includes a comprehensive test suite (main.cpp) covering:

  • Single character matching: Basic pattern recognition
  • Concatenation: Multi-character sequences
  • Whitespace handling: Automatic space skipping
  • Alternation patterns: Choice between alternatives (if|else)
  • Mixed patterns: Complex combinations
  • Edge cases: Empty strings, whitespace-only input
  • Error handling: Invalid character detection

Run tests with:

make run

Technical Highlights

Iterator Safety

The implementation carefully handles C++ iterator lifetime:

  • Stores vector copies before iteration to avoid dangling iterators
  • Prevents undefined behavior from iterating over temporary objects

State Acceptance Tracking

DFA matching checks acceptance status:

  • After every successful transition
  • At the start (for patterns like a* that accept empty strings)
  • Uses backtracking to last accepting position on failure
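
A minimal sketch of this matching discipline, using a flat transition table instead of the project's DFA/CombinedState classes (the struct and member names are assumptions): the start state is checked before consuming input, every transition re-checks acceptance, and on a dead end the function falls back to the last accepting length:

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>

// Simplified longest-prefix DFA run with backtracking to the last
// accepting position (illustrative; not the project's DFA class).
struct SimpleDFA {
    std::map<std::pair<int, char>, int> delta;  // (state, symbol) -> state
    std::set<int> accepting;
    int start = 0;

    // Length of the longest accepted prefix of input[pos..], or -1.
    int longestMatch(const std::string& input, size_t pos) const {
        int state = start;
        // Check the start state: patterns like a* accept the empty string.
        int last = accepting.count(start) ? 0 : -1;
        for (size_t i = pos; i < input.size(); ++i) {
            auto it = delta.find({state, input[i]});
            if (it == delta.end()) break;       // dead end: keep last match
            state = it->second;
            if (accepting.count(state))         // check after every transition
                last = int(i - pos + 1);
        }
        return last;
    }
};
```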

Loop Control

The tokenization loop properly handles index advancement:

  • Accounts for automatic loop increment when manually advancing
  • Prevents off-by-one errors and character skipping

Limitations

  • No support for extended regex features (e.g., character classes [a-z], quantifiers {n,m}, anchors ^$)
  • No optimization/minimization of DFAs
  • No support for capture groups or backreferences
  • Destructive NFA composition operations (modifies input NFAs)
  • Case-sensitive matching only

Future Enhancements

Potential improvements:

  • DFA minimization using Hopcroft's algorithm
  • Extended regex syntax support
  • Character class support [a-z], [0-9]
  • Quantifiers +, ?, {n,m}
  • Escape sequences \n, \t, \.
  • Non-destructive NFA operations (deep cloning)
  • Performance profiling and optimization
  • Visualization of NFA/DFA state machines
  • Unicode support

Theory Background

This implementation is based on fundamental concepts from formal language theory and compiler construction:

  • Regular Languages: Languages recognizable by finite automata
  • Thompson's Construction: Kenneth Thompson (1968) - efficient NFA construction
  • Subset Construction: Rabin & Scott (1959) - NFA to DFA conversion
  • Lexical Analysis: First phase of compilation, tokenization of source code

Recommended Reading

  • "Compilers: Principles, Techniques, and Tools" (Dragon Book) - Aho, Lam, Sethi, Ullman
  • "Introduction to Automata Theory, Languages, and Computation" - Hopcroft, Motwani, Ullman
  • "Engineering a Compiler" - Cooper & Torczon

Authors & Acknowledgments

This project was developed as an educational implementation of lexical analysis theory.

Special Thanks: Claude AI assisted with debugging, optimization, and comprehensive testing.

License

This project is provided for educational purposes.


Note: This implementation prioritizes clarity and correctness over performance. It serves as a learning tool for understanding compiler construction fundamentals.
