Data extraction tool for the Spanish digital newspaper La Razón.

1. Scrapes the web

Stores the scraped news articles in the "data" folder in JSON format.
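
A minimal sketch of the storage step, assuming each article is a plain dictionary. The function name `save_articles`, the file name `news.json`, and the `title`/`body` fields are hypothetical and are not taken from the scraper in this repository.

```python
import json
from pathlib import Path


def save_articles(articles, folder="data", filename="news.json"):
    """Write a list of article dicts to <folder>/<filename> as JSON.

    The folder is created if it does not exist yet.
    """
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / filename
    with out_path.open("w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps Spanish characters (ñ, á, ...) readable
        json.dump(articles, fh, ensure_ascii=False, indent=2)
    return out_path


if __name__ == "__main__":
    # Hypothetical article structure, for illustration only
    sample = [{"title": "Titular de ejemplo", "body": "Texto de la noticia."}]
    print(save_articles(sample))
```

Writing with `ensure_ascii=False` is a design choice that keeps accented characters human-readable in the stored files; the default would escape them as `\uXXXX` sequences.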

2. Processes the files

Splits the list of news articles into n chunks for later reading and tokenizing.
Creates n temporary files and writes the articles of each chunk to its own file.
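
The chunking step could look roughly like this. It is a sketch under assumptions: `split_into_chunks` and `write_chunks` are hypothetical names, the chunks are contiguous and near-equal in size, and each temporary file holds its chunk as a JSON list. The actual code in `parser_funcs.py` may split and store the data differently.

```python
import json
import tempfile


def split_into_chunks(items, n):
    """Split items into n contiguous chunks whose sizes differ by at most one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        # The first `extra` chunks receive one additional item
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def write_chunks(chunks):
    """Write each chunk to its own temporary JSON file and return the paths."""
    paths = []
    for chunk in chunks:
        # delete=False keeps the file on disk so a later step can reopen it
        with tempfile.NamedTemporaryFile(
            "w", suffix=".json", delete=False, encoding="utf-8"
        ) as fh:
            json.dump(chunk, fh, ensure_ascii=False)
            paths.append(fh.name)
    return paths


if __name__ == "__main__":
    news = [f"noticia {i}" for i in range(7)]
    for path in write_chunks(split_into_chunks(news, 3)):
        print(path)
```

Because the files are created with `delete=False`, whichever step consumes them is responsible for removing them afterwards.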

3. Tokenization

Opens and merges all the temporary files, tokenizes their contents, and writes the result as plain text.
Writes all the tokens to a JSON file as a list.
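
As an illustration of this step, the sketch below reads several chunk files, tokenizes every article, and writes one flat token list to a JSON file. The regular-expression tokenizer is a simple stand-in, not necessarily the tokenizer this project uses, and the assumption that each chunk file contains a JSON list of article strings is hypothetical, as are the function names.

```python
import json
import re

# Words (Unicode-aware) or single punctuation marks; a simple stand-in tokenizer
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def tokenize_files(chunk_paths, out_path):
    """Tokenize every article in the chunk files and write the tokens as a JSON list."""
    tokens = []
    for path in chunk_paths:
        with open(path, encoding="utf-8") as fh:
            # Assumption: each chunk file holds a JSON list of article strings
            for article in json.load(fh):
                tokens.extend(tokenize(article))
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(tokens, fh, ensure_ascii=False)
    return tokens


if __name__ == "__main__":
    print(tokenize("La Razón publica noticias, ¿no?"))
```

In Python 3, `\w` matches accented letters by default for `str` patterns, so Spanish words such as "Razón" stay in one piece.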

4. Tagging

Adds the grammatical category (part-of-speech tag) to every word.
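
To show the shape of the tagged output, here is a toy sketch that pairs each token with a tag looked up in a tiny hand-written lexicon. This is only an illustration: the lexicon, the tag names, and the `tag_tokens` function are hypothetical, and a real run would rely on a trained part-of-speech tagger, which this README does not name.

```python
# Tiny hand-written lexicon, for illustration only; not a real tagger model
TOY_LEXICON = {
    "el": "DET",
    "la": "DET",
    "periódico": "NOUN",
    "noticias": "NOUN",
    "publica": "VERB",
    ".": "PUNCT",
}


def tag_tokens(tokens, lexicon=TOY_LEXICON, default="X"):
    """Pair every token with its tag; unknown tokens get the default tag."""
    return [(tok, lexicon.get(tok.lower(), default)) for tok in tokens]


if __name__ == "__main__":
    for word, tag in tag_tokens(["El", "periódico", "publica", "noticias", "."]):
        print(word, tag)
```

The result is a list of (word, tag) pairs, a convenient input shape for the frequency count in the next step.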

5. Frequencies

Counts how many times a word with a given tag appears in the corpus.

See the freqs folder for an example.
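
One way to sketch the counting step is with `collections.Counter` keyed on (word, tag) pairs. The function name `count_tagged` and the choice to lowercase words before counting are assumptions, not details confirmed by this repository.

```python
from collections import Counter


def count_tagged(tagged_tokens):
    """Count occurrences of each (word, tag) pair.

    Words are lowercased first (an assumption), so "La" and "la" share a count.
    """
    return Counter((word.lower(), tag) for word, tag in tagged_tokens)


if __name__ == "__main__":
    tagged = [("La", "DET"), ("casa", "NOUN"), ("la", "DET"), ("casa", "VERB")]
    for (word, tag), freq in count_tagged(tagged).most_common():
        print(word, tag, freq)
```

Keying on the pair rather than on the word alone keeps homographs apart, so the same spelling tagged as a noun and as a verb is counted separately.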

