Atlas is a RAG-based knowledge copilot for Obsidian markdown vaults. It performs chunking and vector search, and generates grounded LLM responses with citations back to the original notes.
Introduction • Architecture • Setup • Quick Start • Results
As a famous LLM once said:
A knowledge copilot is an AI assistant that understands your knowledge base and helps you query, synthesize, and navigate it—while staying faithful to the original sources.
A good knowledge copilot can search semantically, answer with context, connect ideas, cite sources, and assist rather than replace thinking. It's basically a "copilot" for our brains.
A good knowledge copilot is built using RAG (Retrieval-Augmented Generation). And what is that?
RAG, or Retrieval-Augmented Generation, is a technique that retrieves external knowledge (as context) and feeds it to the model alongside the user's query. An LLM trained on generic data for next-token prediction may be good at that task, but it cannot answer questions specific to our own knowledge sources. RAG ensures that we don't rely on the model's weights alone but "augment" the model with our knowledge base, so it can answer questions about that knowledge base.
RAG is good because:
- Reduces hallucinations
- Enables citations
- Keeps answers faithful to source material
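To make this concrete, here is a minimal retrieve-then-generate sketch. The names used here (`embed_fn`, `vector_index`, `llm`) are illustrative placeholders, not Atlas's actual API:

```python
# Illustrative RAG loop; embed_fn, vector_index, and llm are hypothetical placeholders.
def answer_with_rag(query, embed_fn, vector_index, llm, k=5):
    # 1. Retrieve: embed the query and fetch the k most similar note chunks.
    query_vector = embed_fn(query)
    chunks = vector_index.search(query_vector, k)

    # 2. Augment: pack the retrieved chunks into the prompt as grounded context.
    context = "\n\n".join(f"[{c.note_id}] {c.text}" for c in chunks)
    prompt = (
        "Answer using only the context below and cite the note ids.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

    # 3. Generate: the LLM answers from the retrieved context, which enables citations.
    return llm.generate(prompt)
```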
Obsidian is a lightweight application for taking notes and building knowledge bases. It stores every note as markdown, making it easy to load, process, and render a large number of notes.
- Obsidian Frontmatter
  - Frontmatter (YAML) is metadata at the top of a note, enclosed in `---`, used for organizing data such as aliases, tags, type, status, and links, enabling filtering, display, and automation with Obsidian's core features.
  - It provides key-value pairs of structured data, improving note management.
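For example, a note might begin with a frontmatter block like this (the field values are made up):

```yaml
---
aliases: [meal-prep]
tags: [cooking, health]
type: recipe
status: draft
date: 2023-10-01
---
```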
Imagine you're like me and spend a lot of time taking notes about everything in your life: cooking, learning, grocery shopping, books read, and so on. I do exactly that, but I increasingly realized that I rarely go back and use the knowledge in those notes, precisely because it is scattered and unstructured. With a RAG-based knowledge copilot, that retrieval and synthesis work can be offloaded, and we can interact with our knowledge base in natural language via an LLM.
So this repo is a RAG-based knowledge copilot that operates on an Obsidian vault. In other words, if you have a bunch of notes, you now have a natural-language AI assistant that can answer your questions based on your notes.
The initial scope is note summarization and answering questions that require reasoning over one or more notes.
We decided to use TinyLlama. It is an interesting model because:
- it is pretrained from scratch
- it has a tiny parameter budget (1.1B), so it is lightweight enough for our purposes
- it is pretrained on a massive dataset (3T tokens)
- we use the chat variant
It illustrates the scaling-law finding that even a small LLM, when trained on enough quality data, can reach competitive performance.
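For reference, the chat variant can be loaded with Hugging Face transformers roughly as below. This is a sketch, not Atlas's exact setup; the model id points at the public `TinyLlama/TinyLlama-1.1B-Chat-v1.0` checkpoint and the generation settings are arbitrary:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch only: model id and generation settings are illustrative, not Atlas's exact config.
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")  # or "cpu"

messages = [{"role": "user", "content": "Summarize this note: ..."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to("cuda")
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```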
Architecture diagram:
A sample of `obsidian_index.json` is shown below:
[
  {
    "note_id": "folder/sample note.md",
    "title": "sample note",
    "relative_path": "folder/sample note.md",
    "raw_text": "note body. Lorem Ipsum",
    "frontmatter": {"tags": ["personal", "health"], "date": "2023-10-01"},
    "headings": ["Heading 1", "Heading 2"],
    "tags": ["tag1", "tag2"],
    "wikilinks": ["wikilink|custom name"],
    "word_count": 127
  },
  ...
]

It's a list of dictionaries, where each dictionary represents one note from the Obsidian vault.
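Downstream code can consume this index with plain `json`; a quick illustrative read looks like:

```python
import json

# Load the processed vault index (the path here is just an example).
with open("obsidian_index.json", encoding="utf-8") as f:
    notes = json.load(f)

# Each entry is one note's metadata and raw text.
for note in notes:
    print(note["note_id"], note["word_count"], note["tags"])
```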
Add the project root, i.e., the folder containing this README, to PYTHONPATH in whichever way you prefer. One way is to create a .env file containing the following:
PYTHONPATH=\full\path\to\projectroot
Place this .env file in the project root. This works for VS Code.
Another option is to run `$env:PYTHONPATH = "\full\path\to\projectroot"` in PowerShell to set the environment variable before running the scripts.
Install PyTorch and torchvision via `pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu118` (Conda doesn't install the GPU build on Windows).
Before committing changes, run `pre-commit run --all-files` or `pre-commit run --files <file1> <file2> ...`
Run `python .\atlas\core\ingest\obsidian_vault_processor.py`
In the above script, modify:
- `obsidian_vault_path` to point to your Obsidian vault's root folder, i.e., the folder containing the `.obsidian` folder
- `obsidian_index_path` to specify where the `obsidian_index.json` will be saved. This JSON file contains the processed data after ingesting and processing the notes from the Obsidian vault. See the Architecture section for the structure of this JSON.
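For example (the paths below are hypothetical):

```python
# Hypothetical values inside obsidian_vault_processor.py
obsidian_vault_path = r"C:\Users\me\Documents\MyVault"               # folder containing .obsidian
obsidian_index_path = r"C:\Users\me\atlas_data\obsidian_index.json"  # where the index will be written
```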
Run `python .\atlas\core\chunker\structural_chunker.py`
In the above script, modify:
- `processed_data_path` to specify where the `obsidian_index.json` is present
- `output_path` to specify where the `chunked_data.json` will be saved. This JSON file contains the chunks generated from the notes processed by the "Obsidian Vault Processor" module. See the `README` in `atlas/core/chunker` for the structure of this JSON.
- `max_words` to control the size of the chunks created. This should be chosen primarily based on the token limit of the encoding model and the context size of the LLM used in later modules.
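For example (hypothetical values):

```python
# Hypothetical values inside structural_chunker.py
processed_data_path = r"C:\Users\me\atlas_data\obsidian_index.json"
output_path = r"C:\Users\me\atlas_data\chunked_data.json"
max_words = 200  # keep chunks within the embedder's token limit and the LLM's context window
```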
Run `python .\atlas\core\embedder\sentence_transformer\impl_embedder.py`
In the above script, modify:
- `chunk_data_path` to specify where the `chunked_data.json` is present
- `output_path` to specify where `embedded_chunks.json` will be saved. This JSON is exactly like `chunked_data.json`, with an added `embedding` for each chunk. See the `README` in `atlas/core/embedder` for the structure of this JSON.
- `encoder_config_path` to specify your own configuration settings for the encoder model used to generate the chunk embeddings. By default, see `atlas/core/configs/sentence_transformer_config.yaml` to change the encoder model used and its configuration. The following can be changed:

  model_name: sentence-transformers/all-MiniLM-L6-v2
  batch_size: 32
  normalize_embeddings: true
  device: cuda

Run `python .\atlas\core\indexer\run_indexer.py`
In the above script, modify:
- `results_save_path` to specify where the index and metadata file will be saved
- `embedded_chunks_json_file` to specify where the `embedded_chunks.json` is present
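For example (hypothetical paths):

```python
# Hypothetical values inside run_indexer.py
results_save_path = r"C:\Users\me\atlas_data\index"
embedded_chunks_json_file = r"C:\Users\me\atlas_data\embedded_chunks.json"
```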
Run `python .\atlas\core\retriever\context.py`
In the above script, modify:
- `results_load_path` to specify where the index and metadata file are present and will be loaded from
- `user_query` to specify the user prompt/query
- `k` to specify the number of most relevant chunks to retrieve as context for the user query
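For example (hypothetical values):

```python
# Hypothetical values inside context.py
results_load_path = r"C:\Users\me\atlas_data\index"
user_query = "What did I note about high-protein dinner recipes?"
k = 5  # number of most relevant chunks used as context
```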
Run unit tests via VS Code
or
Run only unit tests - `pytest -m unittest`
Run only integration tests - `pytest -m integration`
Run only tests that can be run on CI - `pytest -m runonci`
Run ALL tests - `pytest`
Note: any time a pytest marker is added to a test, ensure it is registered in `pytest.ini`, otherwise pytest will complain.
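For example, the markers above would be registered in `pytest.ini` roughly like this (the descriptions are placeholders):

```ini
[pytest]
markers =
    unittest: unit tests
    integration: integration tests
    runonci: tests that can be run on CI
```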
