This web application is designed to label scraped webpages. It allows users to annotate and tag web content for further analysis. With this application, you can easily label and categorize webpages based on your specific requirements.
For more information, see the next section "Further Documentation".
- Clone the Repository
git clone https://github.com/Pantonius/TagPag.git
cd TagPag
- Set up a Virtual Environment
For example, install pyenv as per their instructions and set up a virtual environment for the project:
pyenv install 3.12.7
pyenv virtualenv 3.12.7 tagpag-env
On Windows, for example, install conda as per their instructions and set up a virtual environment for the project:
conda create -n tagpag-env python=3.12.7
conda activate tagpag-env
- Install the Requirements
pip install -r requirements.txt
or
conda install --file requirements.txt
- Start the Project
streamlit run src/app.py
Notice that a new .env file has been created from the .env-example file, which uses the example data located in example_workdir.
At this point you can take a look around. Maybe the usage documentation can be of service.
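The .env bootstrap mentioned above can be pictured roughly as follows. This is a minimal sketch, not TagPag's actual code: the helper name `ensure_env_file` is hypothetical, and it only assumes that `.env-example` sits next to `.env` in the project root.

```python
import shutil
from pathlib import Path


def ensure_env_file(project_root: str = ".") -> bool:
    """Copy .env-example to .env if .env is missing.

    Returns True if a new .env was created, False if one already existed.
    Hypothetical helper for illustration; not part of TagPag.
    """
    root = Path(project_root)
    env_path = root / ".env"
    example_path = root / ".env-example"

    if env_path.exists():
        return False  # never overwrite an existing configuration
    if not example_path.exists():
        raise FileNotFoundError(f"{example_path} not found")

    shutil.copyfile(example_path, env_path)
    return True
```

Because an existing .env is never overwritten, local changes to the configuration survive a restart under this sketch.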
- Open the .env file with any text editor. If the file does not exist, create one by copying the content of .env-example into .env (e.g., use cp .env-example .env).
- Set up a WORKING_DIR (i.e., a directory that will contain all the data of the project) and LABELS (i.e., the labels that will be used to tag the webpages).
WORKING_DIR = '/PATH/TO/WORKING_DIRECTORY'
LABELS = 'label_1,label_2,label_3'
- Make sure that the TASKS_FILE and HTML_DIR are in the WORKING_DIR.
The TASKS_FILE should contain at least two columns, whose names are defined by TASKS_ID_COLUMN (by default, _id) and TASKS_URL_COLUMN (by default, url).
The HTML_DIR should contain the HTML files that are associated with the tasks. Name each file TASK_ID.html, where TASK_ID is that task's value in the column named by TASKS_ID_COLUMN.
Your folder structure should look like this:
WORKING_DIR/
├── TASKS_FILE
└── HTML_DIR
    ├── FIRST_ID.html
    ├── SECOND_ID.html
    └── ...
If you copied .env from .env-example, the program will assume the following naming:
example_workdir
├── tasks.csv
└── html
    ├── FIRST_ID.html
    ├── SECOND_ID.html
    └── ...
- (Re)start TagPag
streamlit run src/app.py
A more detailed guide to setting up the project can be found in the doc folder. It will lead you through the process using the example data of the example_workdir.
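Before restarting, it can help to check that the working directory matches the layout described above. The following is a minimal sketch under stated assumptions, not part of TagPag: the function name `check_workdir` is hypothetical, the tasks file is assumed to be a comma-separated CSV with a header row, and the defaults (`tasks.csv`, `html`, `_id`, `url`) are taken from the example naming and column defaults mentioned in this README.

```python
import csv
from pathlib import Path


def check_workdir(working_dir, tasks_file="tasks.csv", html_dir="html",
                  id_column="_id", url_column="url"):
    """Return a list of human-readable problems found in the working directory.

    Hypothetical helper for illustration; an empty list means the layout
    looks consistent with the structure described in this README.
    """
    root = Path(working_dir)
    tasks_path = root / tasks_file
    html_path = root / html_dir

    # Both the tasks file and the HTML directory must live in the working dir.
    if not tasks_path.is_file():
        return [f"missing tasks file: {tasks_path}"]
    if not html_path.is_dir():
        return [f"missing HTML directory: {html_path}"]

    problems = []
    with tasks_path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        columns = reader.fieldnames or []

        # The ID and URL columns are required.
        for required in (id_column, url_column):
            if required not in columns:
                problems.append(f"missing column: {required}")
        if problems:
            return problems

        # Every task needs a matching TASK_ID.html file.
        for row in reader:
            task_id = row[id_column]
            if not (html_path / f"{task_id}.html").is_file():
                problems.append(f"missing HTML file for task {task_id}")
    return problems
```

For the example data, a call such as `check_workdir("example_workdir")` would list any tasks that lack an HTML file. If your .env sets different values for TASKS_FILE, HTML_DIR, or the column names, pass them as arguments.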
For more information on how to use Streamlit, refer to the Streamlit Documentation.
