This project implements an algorithm designed to process and analyze text data by removing stopwords and performing linguistic analysis. The algorithm is built to be efficient and modular, allowing for easy integration into larger text processing pipelines. It leverages Python for its simplicity and robust library ecosystem.
- Diogo Almeida 70140: dmh.almeida@campus.fct.unl.pt
- Duarte Rodrigues 70150: dms.rodrigues@campus.fct.unl.pt
- Lina Lekbouri 72697: l.lekbouri@campus.fct.unl.pt
Before you get started, make sure you have the following software installed on your machine:
- Git
- Python (version 3.8 or higher)
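The two prerequisites above can be checked from a short script. This is only a minimal sketch: the function name `check_prerequisites` is hypothetical and not part of the project, and it only confirms that the interpreter is at least 3.8 and that a `git` executable can be found on the PATH.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)  # minimum version stated in the prerequisites


def check_prerequisites(version_info=sys.version_info, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all good.

    Hypothetical helper for illustration; it is not part of the project.
    """
    problems = []
    # Compare only (major, minor) against the required minimum.
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required, "
            f"found {version_info[0]}.{version_info[1]}"
        )
    # shutil.which returns None when the executable is not on the PATH.
    if which("git") is None:
        problems.append("git was not found on PATH")
    return problems


if __name__ == "__main__":
    issues = check_prerequisites()
    for issue in issues:
        print("Missing prerequisite:", issue)
    if not issues:
        print("All prerequisites found.")
```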
Cloning the Repository
- Open your terminal or command prompt.
- Clone the repository using its URL:
  git clone https://github.com/linalek/Data-Mining-Keyword-Extractor.git
- Navigate into the project folder:
  cd Data-Mining-Keyword-Extractor
To get started with this project, follow these steps to set up your environment:
- Create a Virtual Environment
  Create a virtual environment to isolate project dependencies:
  python -m venv venv
- Activate the Virtual Environment
  Activate the virtual environment:
  - On Windows:
    venv\Scripts\activate
  - On macOS/Linux:
    source venv/bin/activate
- Install Dependencies
  Install the required dependencies listed in requirements.txt:
  pip install -r requirements.txt
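If it is unclear whether the virtual environment is actually active before installing dependencies, a quick check from Python can help. The sketch below assumes an environment created with `python -m venv`; the function name `in_virtual_environment` is hypothetical. It relies on the fact that inside such an environment `sys.prefix` points at the environment folder while `sys.base_prefix` points at the base installation.

```python
import sys


def in_virtual_environment(prefix=None, base_prefix=None):
    """Return True when the interpreter runs inside a venv.

    Hypothetical helper for illustration. The optional arguments exist
    only so the comparison can be exercised without a real environment.
    """
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    # The two prefixes are equal when no virtual environment is active.
    return prefix != base_prefix


if __name__ == "__main__":
    if in_virtual_environment():
        print("Virtual environment active:", sys.prefix)
    else:
        print("No virtual environment active; activate venv first.")
```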
To run the tests, ensure you are in the root folder PAD_Project2. For example, to run the test_stopwords test located in the tests folder, execute:
python -m tests.test_stopwords
To run any other test, replace test_stopwords with the name of the desired test file in the tests folder, using the format python -m tests.<test_name>.
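To see which commands the `python -m tests.<test_name>` format produces, the following sketch lists one command per test file. It assumes that test files follow the `test_*.py` naming pattern, which is inferred from `test_stopwords` rather than stated by the project, and the function name `list_test_commands` is hypothetical.

```python
from pathlib import Path


def list_test_commands(tests_dir="tests"):
    """Build one `python -m tests.<test_name>` command per test file.

    Hypothetical helper; assumes test files are named test_*.py.
    """
    folder = Path(tests_dir)
    return [
        # path.stem drops the .py extension, giving the module name.
        f"python -m {folder.name}.{path.stem}"
        for path in sorted(folder.glob("test_*.py"))
    ]


if __name__ == "__main__":
    # Run from the project root so that the tests folder is found.
    for command in list_test_commands():
        print(command)
```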
NOTE: to run the stopwords test, the path to the test corpus for Test1 must be changed in the code:
("Test1", read_text_files("path/to/your/corpus"))
To run the main extraction pipeline on a specific corpus, follow these steps:
- Prepare your corpus
  Ensure that your corpus is a folder containing plain text files. Each file should represent one document. The algorithm will process all files within the folder.
- Set the corpus path
  Open the main.py file and locate the corpus_path:str variable inside the main() function. Replace its value with the path to your corpus folder. For example:
  corpus_path:str = "path/to/your/corpus"
- Run the main script
  Once the corpus path is correctly set, execute the main script from the root of the project using:
  python main.py
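Before running the pipeline, it can be useful to confirm that the corpus folder has the expected layout. The sketch below shows one way a folder of plain text files could be read as one document per file. It is not the project's own loader: `load_corpus` is a hypothetical name, and the UTF-8 encoding is an assumption.

```python
from pathlib import Path


def load_corpus(corpus_path):
    """Read every regular file in the folder; one file is one document.

    Hypothetical helper for illustration, not the project's loader.
    Returns a dict mapping file name to file content.
    """
    folder = Path(corpus_path)
    if not folder.is_dir():
        raise NotADirectoryError(f"Corpus folder not found: {folder}")
    documents = {}
    for path in sorted(folder.iterdir()):
        if path.is_file():  # subfolders are skipped
            # UTF-8 is assumed; undecodable bytes are replaced, not fatal.
            documents[path.name] = path.read_text(encoding="utf-8", errors="replace")
    return documents


if __name__ == "__main__":
    try:
        docs = load_corpus("path/to/your/corpus")
        print(f"Found {len(docs)} document(s).")
    except NotADirectoryError as error:
        print(error)
```

A folder that yields zero documents, or that raises the error above, points to a wrong `corpus_path` value.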