This project is an advanced information retrieval (IR) system developed as part of our university curriculum. It provides a graphical user interface to explore and evaluate various IR models and text processing techniques. The system is built using Python and leverages the tkinter library for the interface and nltk for natural language processing.
Authors:

- TAOURIRT Hamza
- Sebti Asma
- Project Description
- Core Features
- Technical Architecture
  - Preprocessing and Indexing
  - Search and Retrieval Models
  - Evaluation Framework
- Technologies Used
- Setup and Usage
  - Prerequisites
  - Running the Application
  - Using the Interface
- File Structure
## Project Description

This application allows users to perform searches on a document collection using several classic information retrieval models. It is designed to be both a practical search tool and an educational platform for comparing the effects of different preprocessing pipelines and retrieval algorithms.
Users can input free-text or boolean queries, select tokenization and stemming methods, and choose from Vector Space, Probabilistic (BM25), or Boolean retrieval models. The system also features a robust evaluation mode that calculates standard IR metrics like Precision, Recall, and F-score, and visualizes performance with a Recall/Precision curve.
## Core Features

- Graphical User Interface: An intuitive UI built with `tkinter` for easy interaction.
- Flexible Preprocessing:
  - Tokenization: Choice between simple whitespace splitting (`Split`) or a more robust `RegexpTokenizer`.
  - Normalization: Optional stemming with `PorterStemmer` or `LancasterStemmer`.
- Multiple Retrieval Models:
  - Vector Space Model: Supports three similarity measures:
    - Scalar Product
    - Cosine Similarity
    - Jaccard Coefficient
  - Probabilistic Model: Implements the BM25 ranking function with configurable `k` and `b` parameters.
  - Boolean Model: Processes queries with `AND`, `OR`, and `NOT` operators, respecting operator precedence.
- Comprehensive Evaluation Module:
  - Uses a predefined set of queries and relevance judgements (`Queries` and `Judgements` files).
  - Calculates and displays Precision, Recall, F-score, P@5, and P@10.
  - Generates and displays an interpolated Recall/Precision curve for visual performance analysis.
- Pre-computed Indexes: The system uses pre-generated inverted files and descriptor files to ensure fast query processing.
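The three vector-space similarity measures listed above can be sketched as follows. This is a minimal, dependency-free illustration over sparse term-weight dictionaries; the Jaccard variant shown is the weighted generalization, which may differ in detail from the project's exact formula.

```python
from math import sqrt

def scalar_product(q, d):
    """Dot product between two sparse term->weight vectors."""
    return sum(w * d.get(t, 0.0) for t, w in q.items())

def cosine_similarity(q, d):
    """Dot product normalized by the vectors' Euclidean norms."""
    num = scalar_product(q, d)
    den = sqrt(sum(w * w for w in q.values())) * sqrt(sum(w * w for w in d.values()))
    return num / den if den else 0.0

def jaccard_coefficient(q, d):
    """Weighted Jaccard: dot(q, d) / (|q|^2 + |d|^2 - dot(q, d))."""
    num = scalar_product(q, d)
    den = sum(w * w for w in q.values()) + sum(w * w for w in d.values()) - num
    return num / den if den else 0.0

# Hypothetical query and document vectors for illustration only.
query = {"retrieval": 1.0, "model": 1.0}
doc = {"retrieval": 0.5, "model": 0.25, "boolean": 0.75}
```

All three measures share the same dot product; cosine and Jaccard only differ in how they normalize it, which is why they produce the same ranking as the scalar product when all documents have equal norm.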
## Technical Architecture

The project is divided into two main parts: an offline indexing script and the main online search interface.

### Preprocessing and Indexing

The `Create_Documents_Desc_Inv.py` script is responsible for offline processing. It reads the raw text documents from the `Collections/` directory and performs the following steps:
- Tokenization and Stop Word Removal: It processes the text to generate a list of tokens, removing common English stop words.
- Frequency and Weight Calculation: For each term in each document, it calculates the term frequency (TF) and a normalized TF-IDF weight: `poids = (freq / max_freq) * log10(N / n_i + 1)`, where `freq` is the term's frequency in the document, `max_freq` is the largest term frequency in that document, `N` is the total number of documents, and `n_i` is the number of documents containing the term.
- Stemming: It applies the Porter and Lancaster stemmers to the tokens and recalculates frequencies and weights for the stemmed terms.
- Index Generation: It generates two types of files for each configuration (tokenization/stemming combination):
  - Descriptor Files (`Descripteur*.txt`): Store `(doc_id, token, frequency, weight)` tuples, sorted by document ID. Used for models requiring document-term information.
  - Inverted Files (`Inverse*.txt`): Store `(token, doc_id, frequency, weight)` tuples, sorted by token. This is the classic inverted index structure.
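The offline pipeline above can be sketched roughly as follows. This is a dependency-free illustration: a plain whitespace split and a tiny hard-coded stop-word list stand in for `nltk`'s tokenizers and stop-word corpus, and the descriptor/inverted structures are returned as in-memory tuple lists rather than written to files.

```python
from math import log10

# Stand-in for nltk's English stop-word list (illustrative subset only).
STOP_WORDS = {"the", "a", "of", "and", "is"}

def tokenize(text):
    # Stand-in for the Split/RegexpTokenizer options: lowercase + whitespace split.
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def build_indexes(docs):
    """docs: {doc_id: raw_text}. Returns (descriptor, inverted) tuple lists."""
    n = len(docs)
    tokens_by_doc = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    # n_i: number of documents containing each term (document frequency).
    df = {}
    for toks in tokens_by_doc.values():
        for t in set(toks):
            df[t] = df.get(t, 0) + 1
    descriptor = []
    for doc_id, toks in tokens_by_doc.items():
        freqs = {t: toks.count(t) for t in set(toks)}
        max_freq = max(freqs.values())
        for t, freq in freqs.items():
            # poids = (freq / max_freq) * log10(N / n_i + 1)
            poids = (freq / max_freq) * log10(n / df[t] + 1)
            descriptor.append((doc_id, t, freq, round(poids, 4)))
    descriptor.sort(key=lambda row: row[0])                        # sorted by doc ID
    inverted = sorted((t, d, f, w) for d, t, f, w in descriptor)   # sorted by token
    return descriptor, inverted
```

Stemming is omitted here; the real script repeats the frequency/weight pass once per stemmer to produce the Porter and Lancaster index variants.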
### Search and Retrieval Models

The `Interface.py` file contains the logic for the GUI and the retrieval models. When a user executes a search:
- The query is preprocessed using the same tokenization and stemming settings selected in the UI.
- The appropriate pre-computed index file is loaded based on the UI settings.
- The selected retrieval model is executed:
  - Vector/Probabilistic Models: Iterate through documents, calculate a relevance score (RSV) for each, and return a ranked list.
  - Boolean Model: Parses the query logic and returns an unranked set of documents that strictly satisfy the boolean conditions.
- The results are displayed in the results table.
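The ranked-retrieval step for the probabilistic model can be sketched as follows. This is a minimal illustration of one common BM25 formulation; the exact IDF variant and default `k`/`b` values in `Interface.py` may differ.

```python
from math import log10

def bm25_scores(query_terms, docs, k=1.5, b=0.75):
    """docs: {doc_id: token list}. Returns [(doc_id, rsv)] ranked by descending RSV."""
    n = len(docs)
    avgdl = sum(len(toks) for toks in docs.values()) / n
    # Document frequency of each term.
    df = {}
    for toks in docs.values():
        for t in set(toks):
            df[t] = df.get(t, 0) + 1
    scores = []
    for doc_id, toks in docs.items():
        dl = len(toks)
        rsv = 0.0
        for t in query_terms:
            f = toks.count(t)
            if f == 0:
                continue
            # Classic BM25 IDF; can be negative for terms in over half the collection.
            idf = log10((n - df[t] + 0.5) / (df[t] + 0.5))
            rsv += idf * f * (k + 1) / (f + k * (1 - b + b * dl / avgdl))
        scores.append((doc_id, rsv))
    return sorted(scores, key=lambda s: s[1], reverse=True)
```

Raising `b` toward 1 penalizes long documents more strongly, while raising `k` lets repeated query terms keep contributing to the score, which is why the UI exposes both as tunable parameters.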
### Evaluation Framework

When the "Queries Dataset" option is enabled, the system enters evaluation mode:
- A query is selected from the `Queries` file using the spinbox.
- The corresponding set of known relevant documents is loaded from the `Judgements` file.
- The search is performed, and the retrieved document list is compared against the relevance judgements.
- Metrics are calculated and displayed, and the Recall/Precision graph is plotted using `matplotlib`.
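The metric computations behind this mode can be sketched as follows. This is a minimal illustration: `P@k` here counts relevant documents in the top `k` ranks, and the curve uses standard 11-point interpolation, which may differ in detail from the project's implementation.

```python
def evaluate(retrieved, relevant):
    """retrieved: ranked list of doc ids; relevant: set of judged-relevant ids."""
    ret, rel = set(retrieved), set(relevant)
    tp = len(ret & rel)
    precision = tp / len(ret) if ret else 0.0
    recall = tp / len(rel) if rel else 0.0
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    def p_at(k):
        # Precision restricted to the top-k ranked results.
        return len(set(retrieved[:k]) & rel) / k

    return {"P": precision, "R": recall, "F": f_score,
            "P@5": p_at(5), "P@10": p_at(10)}

def interpolated_curve(retrieved, relevant):
    """11-point interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    rel = set(relevant)
    points, hits = [], 0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in rel:
            hits += 1
            points.append((hits / len(rel), hits / rank))  # (recall, precision)
    curve = []
    for level in (i / 10 for i in range(11)):
        ps = [p for r, p in points if r >= level]
        curve.append(max(ps) if ps else 0.0)  # interpolated precision at this level
    return curve
```

The 11 interpolated values are exactly what a `matplotlib` line plot of the Recall/Precision curve needs: recall levels on the x-axis, interpolated precision on the y-axis.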
## Technologies Used

- Language: Python 3
- GUI: `tkinter`
- NLP: `nltk` (Natural Language Toolkit)
- Scientific Computing: `numpy`
- Plotting: `matplotlib`
## Setup and Usage

### Prerequisites

Ensure you have Python 3 installed (`tkinter` ships with most standard Python distributions). Then, install the required libraries:

```
pip install nltk numpy matplotlib
```

Note: on first use, `nltk` may also need to download corpus data such as its English stop-word list (e.g. via `nltk.download('stopwords')`).