SlowDB is an educational vector database implementation focused on learning database internals, vector storage, and search mechanisms. Built entirely in Python for clarity and learning purposes, it prioritizes clear implementation over performance.
- Understanding vector database architecture
- Implementing core database components from scratch
- Learning about LSM trees, HNSW indexes, and vector compression
- Exploring distributed systems concepts
-
Memory-Mapped Storage
- Basic file structure implementation: This allows for efficient file I/O operations by mapping files directly into memory, enabling faster access to data.
- Simple append-only log: This design choice ensures that data is written sequentially, which is optimal for write-heavy workloads and simplifies recovery.
- Memory mapping with Python's
mmap: Usingmmapallows the database to handle large files without loading them entirely into memory, thus optimizing memory usage.
-
LSM Tree Implementation
- In-memory memtable using sorted dictionary: The memtable is a write-optimized structure that allows for fast inserts and updates. It is sorted to facilitate efficient reads and merges.
- Simple SSTable implementation: SSTables (Sorted String Tables) are immutable data structures that store sorted key-value pairs, providing efficient read access and enabling easy compaction.
- Basic compaction strategy: Compaction is crucial for maintaining performance as it merges smaller SSTables into larger ones, reducing the number of files and improving read efficiency.
- Vector metadata storage: Metadata about vectors (e.g., dimensions, offsets) is stored alongside the actual data to facilitate quick access and management.
-
Vector Storage
- Basic vector serialization: Vectors are serialized for storage, allowing for efficient disk usage and retrieval.
- Simple product quantization implementation: This technique reduces the storage requirements of vectors while maintaining their representational quality, which is essential for large datasets.
- NumPy-based vector operations: Leveraging NumPy for vector operations provides optimized performance due to its underlying C implementations.
-
HNSW Index
- Basic graph structure: The Hierarchical Navigable Small World (HNSW) graph is used for efficient nearest neighbor search, allowing for logarithmic time complexity in search operations.
- Insertion algorithm: The insertion algorithm ensures that new vectors are added to the graph while maintaining its navigability and structure.
- Search implementation: The search algorithm traverses the graph to find the nearest neighbors efficiently, utilizing a priority queue to manage candidate nodes.
- Simple distance metrics: Various distance metrics (e.g., Euclidean, cosine) are implemented to measure similarity between vectors, allowing users to choose the most appropriate metric for their use case.
-
Memory Management
- Active segment handling: This involves managing segments of data in memory to optimize read and write operations, ensuring that frequently accessed data is readily available.
- Basic cache implementation: A caching layer is implemented to store recently accessed data, reducing the need for repeated disk I/O operations.
- Memory usage monitoring: Monitoring memory usage helps in optimizing performance and preventing memory leaks, ensuring the system remains responsive.
-
Basic Distribution
- Consistent hashing implementation: This technique allows for efficient data distribution across multiple nodes, ensuring that the addition or removal of nodes has minimal impact on data locality.
- Simple node management: Basic functionalities for managing nodes in a distributed system are implemented, allowing for easy scaling and maintenance.
- Basic data partitioning: Data is partitioned across nodes to balance load and improve access times.
-
Replication
- Leader-follower implementation: This model ensures data availability and fault tolerance by replicating data across multiple nodes, with one node acting as the leader and others as followers.
- Basic failover mechanism: In case of node failure, the system can automatically promote a follower to a leader, ensuring continuous availability of the database.
- Core Implementation: Python 3.8+
- Key Dependencies:
- NumPy (vector operations): Provides efficient array operations and mathematical functions.
- mmap (memory-mapped files): Enables efficient file I/O operations.
- Protocol Buffers (serialization): Used for efficient data serialization and communication.
- gRPC (network communication): Facilitates communication between distributed components.
- Educational: Clear, well-documented code over performance, making it easier for learners to understand the underlying concepts.
- From Scratch: Minimal use of specialized libraries to encourage learning and understanding of core principles.
- Modular: Clean interfaces between components to promote maintainability and extensibility.
- Documented: Extensive comments and documentation to aid understanding and facilitate contributions.
- Testable: Comprehensive test coverage to ensure reliability and correctness of the implementation.
- Production-level performance: The focus is on educational value rather than optimization for production use.
- Production-ready reliability: While reliability is considered, the primary goal is to provide a learning platform.
- Complex optimization techniques: The implementation prioritizes clarity over advanced optimizations.
- Advanced compression methods: Basic compression techniques are used to keep the implementation straightforward.
(Coming soon)
- Basic storage engine implementation
- Simple vector operations
- HNSW index implementation
- Basic query capabilities
- Simple distribution system
This is an educational project. Contributions that improve code clarity, documentation, or add learning resources are welcome.