This repository contains a set of scripts that convert selected articles from a Wikipedia XML dump into plain-text files suitable for building embedded model knowledge bases. Knowledge bases like this enable retrieval-augmented generation (RAG), improving recall accuracy and reducing hallucinations in language models. Paired with reasoning models, this opens the door to assistants capable of competent reasoning over real-world information.
The project consists of scripts that:
- Extract lists of desired Wikipedia article titles.
- Convert those articles from their original Wikipedia XML format into clean TXT files.
The extracted text can be used as training data for language models, semantic search engines, or other natural language processing applications.
Wikipedia dumps, even those containing only one revision per article, are large and unwieldy documents (expectedly so). Embedding an entire Wikipedia dump on a consumer GPU, without running multiple instances of the embedding model, would take a prohibitively long time. Thus, we filter articles based on the level of specificity required.
Wikipedia editors compile and maintain a five-level "vital articles" list that ranks articles by how fundamental they are to an encyclopedia. This list gives users control over the size of their knowledge base. Below is a table describing the distribution of articles by level:
| Level | Number of Articles |
|---|---|
| 1 | 10 |
| 2 | 100 |
| 3 | 1000 |
| 4 | 10000 |
| 5 | 50000 |
Inspired by Einstein's approach to organizing knowledge, we break Wikipedia articles into two distinct categories: general and special. The general level encompasses the foundational articles that form the broad base of your knowledge base (KB), ensuring it contains essential information across various domains. The special level focuses on in-depth coverage of specific topics you choose, allowing for highly customized and comprehensive knowledge without sacrificing the core utility of your KB.
Run the following script to clone the repository and install dependencies:
```bash
git clone https://github.com/varunvasudeva1/wiki-kb.git
cd wiki-kb
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

- Download your Wikipedia dump of choice from here. A torrent download is recommended.
- Uncompress the `.bz2` result to obtain an `.xml` file (warning: depending on the device you run this action on, it may take some time). Place this at the root of the `wiki-kb` directory.
- Configure options in `config.json`.
- Run the script:

```bash
python main.py
```
- `data_filename`: Filename of the XML containing articles.
- `general_level`: The desired level for your KB's general knowledge. Set a value between 1 and 5.
- `special_level`: The desired level for your KB's specialized knowledge in a select few topics. Set a value between 1 and 5, or `null` if no specialized knowledge topics are needed. Value must be greater than `general_level`.
- `special_level_topics`: The topics that the special level applies to. Set to `[]` if no specialized knowledge topics are needed.
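For example, a `config.json` for a KB with level-3 general coverage and level-4 depth on two chosen topics might look like the sketch below. The filename and topic strings are illustrative placeholders, not values the project prescribes:

```json
{
  "data_filename": "enwiki-latest-pages-articles.xml",
  "general_level": 3,
  "special_level": 4,
  "special_level_topics": ["Physics", "Computer science"]
}
```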
To utilize your KB with language models, the resulting text outputs need to be embedded by an embedding model so that language models can retrieve them before providing answers. If you're using a custom RAG pipeline, you probably don't need to read this section anyway.
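If you do roll your own pipeline, the embedding step might look something like the following minimal sketch. It assumes a local Ollama server with `nomic-embed-text` already pulled, uses Ollama's `/api/embeddings` endpoint, and leaves chunking and the choice of vector store to you:

```python
import json
from pathlib import Path

import requests

OLLAMA_URL = "http://localhost:11434/api/embeddings"
MODEL = "nomic-embed-text"  # assumes you've run `ollama pull nomic-embed-text`


def embed(text: str) -> list[float]:
    # Request an embedding vector for `text` from the local Ollama server.
    response = requests.post(OLLAMA_URL, json={"model": MODEL, "prompt": text})
    response.raise_for_status()
    return response.json()["embedding"]


# Embed every article produced by main.py (assumes the default `output/` directory).
vectors = {}
for path in Path("output").glob("*.txt"):
    vectors[path.stem] = embed(path.read_text(encoding="utf-8"))

# Persist the vectors however your pipeline expects; plain JSON is used here for simplicity.
Path("embeddings.json").write_text(json.dumps(vectors))
```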
If you don't know what embedding is and just want a way to ground your LLM responses, follow these steps to build your embedded KB:
- Install Open WebUI.
- Navigate to `Admin Settings` > `Documents`.
  - Select between `Ollama` and `OpenAI`. If `Ollama`, set the appropriate embedding model. For a good model that balances runtime and performance, run `ollama pull nomic-embed-text` from a terminal and then select it as the `Embedding Model`.
  - Crank `Embedding Batch Size` to the maximum of `8192`.
  - Save your configuration.
- Make your way to `Workspace` > `Knowledge` > `+`. Choose a title and description.
- Upload your `output` directory.
- Wait. Embedding thousands of documents may take quite some time.
To see my detailed guide on setting up Open WebUI with other components for a complete LLM server, click here.
> [!TIP]
> To strike a balance between embedding efficiency and knowledge base size, (`general_level` = 3, `special_level` = 4) or (`general_level` = 4, `special_level` = 5) is recommended. This keeps your KB complete enough to be very useful without being so large that embedding and uploading take forever.
Contributions are welcome! If you'd like to contribute, please fork this repository and submit a pull request. Please discuss any major changes with the maintainers first via issues.
This project is licensed under the MIT License.