The goal of this project is to ingest and analyze a sample of the public web-crawl data prepared by Common Crawl.
Execute the download.sh Bash script to fetch three segments of the most recent crawl from commoncrawl.org to the local machine:
sh download.sh
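For reference, here is a minimal Python sketch of what download.sh presumably does, assuming the standard Common Crawl layout in which each crawl publishes a gzipped list of its WARC paths (the crawl ID below is only a placeholder):

```python
# Hypothetical equivalent of download.sh; the actual script may differ.
import gzip
import urllib.request

CRAWL = "CC-MAIN-2024-10"  # placeholder: substitute the most recent crawl ID
BASE = "https://data.commoncrawl.org"

# Each crawl publishes a gzipped index of its WARC file paths.
with urllib.request.urlopen(f"{BASE}/crawl-data/{CRAWL}/warc.paths.gz") as resp:
    paths = gzip.decompress(resp.read()).decode().splitlines()

# Fetch the first three segments into the working directory.
for path in paths[:3]:
    name = path.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(f"{BASE}/{path}", name)
    print(f"downloaded {name}")
```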
Run the following command to start a Postgres DB instance and our dockerized Python project:
docker compose up
Once the containers are up, the pipeline runs the following steps, each implemented by its own script; illustrative sketches of each step follow the list.

- For each downloaded WARC file, extract every external link and save the links line by line to a text file (ingest_warc.py).
- Load the links into a single-column Postgres table running on Docker (insert_db.py).
- Read the table into a pandas DataFrame for the aggregations in the next steps (read_table.py).
- Add a flag column to the DataFrame indicating whether the link points to a home page or a subsection (add_flag.py).
- Aggregate the DataFrame by primary link (domain) and compute each domain's frequency, while also keeping track of its subsections (aggregate_by_domain.py).
- Add a column indicating the country of each URL (add_country.py).
- Add a column that categorizes the type of content hosted by the website, using an external API (categorize_domains.py).
- Save the aggregated DataFrame into the Postgres DB (save_aggregate_table.py).
- Save the final aggregated DataFrame as columnar (Arrow) files following a partition schema on country name (save_results.py).
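A minimal sketch of the link extraction, assuming the warcio package for WARC parsing and a crude href regex; the actual ingest_warc.py may parse HTML differently:

```python
# Sketch of ingest_warc.py: write external links from HTML responses, one per line.
import re
from urllib.parse import urlparse

from warcio.archiveiterator import ArchiveIterator

HREF = re.compile(rb'href="(https?://[^"]+)"')

def extract_links(warc_path: str, out_path: str) -> None:
    with open(warc_path, "rb") as warc, open(out_path, "w") as out:
        for record in ArchiveIterator(warc):
            if record.rec_type != "response":
                continue
            page_host = urlparse(record.rec_headers.get_header("WARC-Target-URI") or "").netloc
            body = record.content_stream().read()
            for match in HREF.finditer(body):
                link = match.group(1).decode("utf-8", errors="replace")
                if urlparse(link).netloc != page_host:  # keep external links only
                    out.write(link + "\n")
```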
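Loading that file into Postgres could look like the following, assuming psycopg2; host, database, and credentials are placeholders that should match docker-compose.yml:

```python
# Sketch of insert_db.py: bulk-load the link file into a one-column table.
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect(host="localhost", dbname="crawl",
                        user="postgres", password="postgres")  # placeholder credentials
with conn, conn.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS links (url TEXT)")
    with open("links.txt") as f:  # output of ingest_warc.py (file name assumed)
        rows = [(line.strip(),) for line in f if line.strip()]
    # execute_values batches the inserts, far faster than row-by-row execute().
    execute_values(cur, "INSERT INTO links (url) VALUES %s", rows)
conn.close()
```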
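Reading the table back into pandas is a one-liner with SQLAlchemy (the connection string is again a placeholder):

```python
# Sketch of read_table.py: materialize the links table as a DataFrame.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://postgres:postgres@localhost/crawl")  # placeholder DSN
df = pd.read_sql("SELECT url FROM links", engine)
```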
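One plausible reading of the home-page flag, continuing with the df above: treat a URL as a home page when its path is empty or "/" and it carries no query string (the column name is a choice made here, not taken from the source):

```python
# Sketch of add_flag.py: flag home pages vs. subsections.
from urllib.parse import urlparse

def is_home_page(url: str) -> bool:
    parsed = urlparse(url)
    return parsed.path in ("", "/") and not parsed.query

df["is_home_page"] = df["url"].map(is_home_page)  # column name assumed
```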
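The aggregation step, interpreting "primary link" as the URL's host: group by domain, count occurrences, and collect the distinct subsection paths:

```python
# Sketch of aggregate_by_domain.py: frequency per domain plus its subsections.
from urllib.parse import urlparse

df["domain"] = df["url"].map(lambda u: urlparse(u).netloc)
df["path"] = df["url"].map(lambda u: urlparse(u).path)

agg = (
    df.groupby("domain")
      .agg(frequency=("url", "size"),
           subsections=("path", lambda p: sorted(set(p) - {"", "/"})))
      .reset_index()
)
```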
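How the country is determined is not specified in the source; a simple stand-in maps country-code TLDs, and a real implementation might geolocate the host IP instead:

```python
# Sketch of add_country.py: naive country lookup from the country-code TLD.
# The mapping is illustrative only; extend it or swap in IP geolocation.
CC_TLD = {"de": "Germany", "fr": "France", "uk": "United Kingdom", "jp": "Japan"}

def country_of(domain: str) -> str:
    tld = domain.rsplit(".", 1)[-1].lower()
    return CC_TLD.get(tld, "unknown")

agg["country"] = agg["domain"].map(country_of)
```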
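The external categorization API is not named, so the endpoint and response shape below are purely hypothetical; only the requests plumbing is meant to carry over:

```python
# Sketch of categorize_domains.py. The endpoint and JSON shape are HYPOTHETICAL
# placeholders; substitute the real categorization API.
import requests

def categorize(domain: str) -> str:
    resp = requests.get("https://api.example.com/categorize",  # hypothetical endpoint
                        params={"domain": domain}, timeout=10)
    resp.raise_for_status()
    return resp.json().get("category", "unknown")

agg["category"] = agg["domain"].map(categorize)
```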
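Writing the aggregate back to Postgres can reuse the SQLAlchemy engine from the read step; list-valued columns don't map cleanly onto a SQL column, so subsections is serialized first:

```python
# Sketch of save_aggregate_table.py: persist the aggregate (table name assumed).
agg["subsections"] = agg["subsections"].map(",".join)
agg.to_sql("aggregated_links", engine, if_exists="replace", index=False)
```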
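Finally, pyarrow can write the result as Arrow (IPC) files under a Hive-style country=... directory layout; the output root "results" is a placeholder:

```python
# Sketch of save_results.py: partitioned columnar output, one directory per country.
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.Table.from_pandas(agg)
ds.write_dataset(
    table,
    base_dir="results",  # placeholder output root
    format="arrow",      # Arrow IPC files
    partitioning=ds.partitioning(pa.schema([("country", pa.string())]), flavor="hive"),
)
```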