| 1 | +# CLAUDE.md |
| 2 | + |
| 3 | +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. |
| 4 | + |
| 5 | +## Project Overview |
| 6 | + |
| 7 | +Ruby gem providing text classification via two algorithms: |
| 8 | +- **Bayes** (`Classifier::Bayes`) - Naive Bayesian classification |
| 9 | +- **LSI** (`Classifier::LSI`) - Latent Semantic Indexing for semantic classification, clustering, and search |
| 10 | + |
| 11 | +## Common Commands |
| 12 | + |
| 13 | +```bash |
| 14 | +# Run all tests |
| 15 | +rake test |
| 16 | + |
| 17 | +# Run a single test file |
| 18 | +ruby -Ilib test/bayes/bayesian_test.rb |
| 19 | +ruby -Ilib test/lsi/lsi_test.rb |
| 20 | + |
| 21 | +# Run tests with the pure Ruby vector implementation (GSL disabled) |
| 22 | +NATIVE_VECTOR=true rake test |
| 23 | + |
| 24 | +# Interactive console |
| 25 | +rake console |
| 26 | + |
| 27 | +# Generate documentation |
| 28 | +rake doc |
| 29 | +``` |
| 30 | + |
| 31 | +## Architecture |
| 32 | + |
| 33 | +### Core Components |
| 34 | + |
| 35 | +**Bayesian Classifier** (`lib/classifier/bayes.rb`) |
| 36 | +- Train with `train(category, text)` or dynamic methods like `train_spam(text)` |
| 37 | +- Classify with `classify(text)`, which returns the best-matching category |
| 38 | +- Uses log probabilities for numerical stability |
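To illustrate why log probabilities help, here is a minimal sketch of a naive Bayes classifier in plain Ruby. `TinyBayes` and everything inside it are hypothetical names used for illustration; this is not the code in `lib/classifier/bayes.rb`, it assumes add-one smoothing and equal category priors, and it does not cover the dynamic `train_spam`-style methods.

```ruby
# Hypothetical TinyBayes: a minimal multinomial naive Bayes with add-one
# smoothing and equal priors. NOT the gem's implementation.
class TinyBayes
  def initialize(*categories)
    @counts = categories.to_h { |c| [c, Hash.new(0)] } # category => { word => count }
    @totals = Hash.new(0)                              # category => total words seen
    @vocab = {}                                        # every word seen in training
  end

  def train(category, text)
    tokenize(text).each do |word|
      @counts[category][word] += 1
      @totals[category] += 1
      @vocab[word] = true
    end
  end

  # Sum log-likelihoods instead of multiplying raw probabilities, so long
  # documents do not underflow to 0.0.
  def scores(text)
    words = tokenize(text)
    @counts.to_h do |category, counts|
      denom = @totals[category] + @vocab.size
      [category, words.sum { |w| Math.log((counts[w] + 1).to_f / denom) }]
    end
  end

  # The best category is the one with the highest (least negative) score.
  def classify(text)
    scores(text).max_by { |_, score| score }.first
  end

  private

  # Simplified tokenizer (assumption): lowercase alphabetic runs, no stemming.
  def tokenize(text)
    text.downcase.scan(/[a-z]+/)
  end
end

bayes = TinyBayes.new(:spam, :ham)
bayes.train(:spam, "cheap pills buy now")
bayes.train(:ham, "meeting notes for monday")
puts bayes.classify("buy cheap pills") # prints "spam"
```

Each score is a sum of negative logarithms, so comparing scores is equivalent to comparing the underlying probability products without ever computing them.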
| 39 | + |
| 40 | +**LSI Classifier** (`lib/classifier/lsi.rb`) |
| 41 | +- Uses Singular Value Decomposition (SVD) for semantic analysis |
| 42 | +- Uses the optional `gsl` gem, when installed, for roughly 10x faster matrix operations; otherwise falls back to the pure Ruby SVD |
| 43 | +- Key operations: `add_item`, `classify`, `find_related`, `search` |
| 44 | +- `auto_rebuild` option controls automatic index rebuilding after changes |
| 45 | + |
| 46 | +**String Extensions** (`lib/classifier/extensions/word_hash.rb`) |
| 47 | +- `word_hash` / `clean_word_hash` - tokenize text into a hash of stemmed word frequencies |
| 48 | +- `CORPUS_SKIP_WORDS` - stopwords filtered during tokenization |
| 49 | +- Uses `fast-stemmer` gem for Porter stemming |
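The tokenization described above can be sketched roughly as follows. This is an illustration under assumptions, not the gem's code: `SKIP_WORDS` is a tiny stand-in for `CORPUS_SKIP_WORDS`, `sketch_word_hash` is a hypothetical name, and stemming is left out because `fast-stemmer` is an external gem.

```ruby
require "set"

# Tiny stand-in stopword list (assumption); the gem's CORPUS_SKIP_WORDS
# is a separate, larger list.
SKIP_WORDS = Set.new(%w[a an and the is of to in])

# Hypothetical helper: lowercase the text, split it into alphabetic words,
# drop stopwords, and count how often each remaining word occurs.
# No stemming is applied in this sketch.
def sketch_word_hash(text)
  text.downcase
      .scan(/[a-z]+/)
      .reject { |word| SKIP_WORDS.include?(word) }
      .tally
end

# "cat" is counted twice and "other" once; "the" and "and" are dropped.
p sketch_word_hash("The cat and the other cat")
```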
| 50 | + |
| 51 | +**Vector Extensions** (`lib/classifier/extensions/vector.rb`) |
| 52 | +- Pure Ruby SVD implementation (`Matrix#SV_decomp`) |
| 53 | +- Vector normalization and magnitude calculations |
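As a rough sketch of the magnitude and normalization calculations, the helpers below work on plain arrays. The names `magnitude`, `normalize`, and `cosine` are hypothetical and do not mirror the methods defined in `vector.rb`; the zero-vector handling is also an assumption.

```ruby
# Euclidean length of a vector given as an array of numbers.
def magnitude(values)
  Math.sqrt(values.sum { |v| v * v })
end

# Scale a vector to unit length. A zero vector is returned unchanged
# (as floats) to avoid dividing by zero; this choice is an assumption.
def normalize(values)
  mag = magnitude(values)
  return values.map(&:to_f) if mag.zero?

  values.map { |v| v / mag }
end

# Cosine similarity: the dot product of the two unit-length vectors.
def cosine(a, b)
  normalize(a).zip(normalize(b)).sum { |x, y| x * y }
end

puts magnitude([3, 4]) # prints 5.0
```

Normalizing first means that comparing two documents reduces to a dot product, which is why both a raw vector and its normalized form are worth keeping around.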
| 54 | + |
| 55 | +### GSL Integration |
| 56 | + |
| 57 | +LSI checks for the `gsl` gem at load time. When available: |
| 58 | +- Uses `GSL::Matrix` and `GSL::Vector` for faster operations |
| 59 | +- Serialization is handled by `vector_serialize.rb` |
| 60 | +- Test without GSL: `NATIVE_VECTOR=true rake test` |
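A load-time check of this kind could look like the sketch below. The method name `select_vector_backend` and the returned symbols are hypothetical; the only details taken from this file are the `gsl` gem name and the `NATIVE_VECTOR=true` switch.

```ruby
# Sketch only: report which vector backend would be chosen.
# Forcing a LoadError when NATIVE_VECTOR=true sends both "GSL disabled"
# and "GSL not installed" down the same fallback path.
def select_vector_backend(env = ENV)
  raise LoadError, "GSL disabled via NATIVE_VECTOR" if env["NATIVE_VECTOR"] == "true"

  require "gsl" # raises LoadError when the gem is not installed
  :gsl
rescue LoadError
  :native # fall back to the pure Ruby implementation
end

puts select_vector_backend({ "NATIVE_VECTOR" => "true" }) # prints "native"
```

Passing the environment in as a parameter is a choice made for this sketch so that the fallback can be exercised without changing the real `ENV`.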
| 61 | + |
| 62 | +### Content Nodes (`lib/classifier/lsi/content_node.rb`) |
| 63 | + |
| 64 | +Internal data structure storing: |
| 65 | +- `word_hash` - term frequencies |
| 66 | +- `raw_vector` / `raw_norm` - initial vector representation |
| 67 | +- `lsi_vector` / `lsi_norm` - reduced-dimensionality representation after SVD |
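The relationship between these fields can be pictured with the simplified sketch below. `SketchNode` and `build_raw_vectors` are hypothetical names, the raw vector here holds plain term counts over a fixed vocabulary (the real class may weight terms differently), and the LSI fields stay `nil` because no SVD is performed.

```ruby
# Hypothetical stand-in for a content node; field names follow the list above.
SketchNode = Struct.new(:word_hash, :raw_vector, :raw_norm, :lsi_vector, :lsi_norm)

# Fill raw_vector with one entry per vocabulary term (0.0 when the term is
# absent) and raw_norm with the same vector scaled to unit length.
def build_raw_vectors(node, vocabulary)
  node.raw_vector = vocabulary.map { |term| node.word_hash.fetch(term, 0).to_f }
  mag = Math.sqrt(node.raw_vector.sum { |v| v * v })
  node.raw_norm = mag.zero? ? node.raw_vector : node.raw_vector.map { |v| v / mag }
  node
end

frequencies = { "cat" => 2, "dog" => 1 }
node = build_raw_vectors(SketchNode.new(frequencies), %w[cat dog fish])
p node.raw_vector # prints [2.0, 1.0, 0.0]
```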