GitHub, the go-to hub for developers worldwide, has seen a whopping 12 million contributors shaping around 31 million projects since 2008. It's the busy epicenter of open-source development.
Now, picture this: you're a student navigating the coding universe, or someone just stepping into a new stack. GitHub, with its countless projects, feels like a vast library without a roadmap. The challenge is real: finding your way through the code maze.
So, what's the plan? We're diving into the GitHub realm to decode its secrets: which programming languages dominate the scene, and which practices are being used?
Why bother, you ask? For students grasping coding nuances, or for those embarking on a new tech stack, it's like having a guidebook. We aim to demystify GitHub, making it more navigable, and bringing you insights into the coding practices that define different languages. Ready to jump into the GitHub adventure with us? 🚀
Since we’ll be working with huge amounts of data, we need a powerful tool that allows us to perform all the required operations and analysis. The large dataset considered for this project comprises more than 2.8 million open source GitHub repositories, over 145 million unique commits, 2 billion different file paths, and the contents of the latest revision for 163 million files.
- Python 3.9
- pip, with the following dependencies installed:
  - python-dotenv: `pip install python-dotenv`
  - pandas: `pip install pandas`
  - PySpark: `pip install pyspark`
- Set up a `.env` file with the bucket path to the datasets (see `.env` for reference)
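As a rough sketch, a script might load the bucket path from `.env` like this (assuming the variable is named `BUCKET`, as in the GCP run instructions; the import fallback is only so the sketch runs without python-dotenv installed):

```python
import os

try:
    from dotenv import load_dotenv  # pip install python-dotenv
except ImportError:  # fallback so the sketch runs without the package
    def load_dotenv():
        return False

def get_bucket_path() -> str:
    """Return the GCS bucket path configured in .env (or the environment)."""
    load_dotenv()  # reads key=value pairs from a local .env file, if present
    bucket = os.environ.get("BUCKET")
    if not bucket:
        raise RuntimeError("BUCKET is not set; see .env for reference")
    return bucket
```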
The data flow starts with datasets stored in a GCS bucket. The Python scripts are submitted as PySpark jobs to Dataproc, which processes the data and stores the results back in the GCS bucket.
The project consists of 9 scripts, each with a specific purpose. They are all built on a common interface in order to streamline the coding process.
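The interface itself isn't reproduced here, but as a rough sketch of the idea (hypothetical names and bucket layout; the real `script_interface.py` may differ), each script could supply only a name and a transformation, with the read/write plumbing shared:

```python
import os
from abc import ABC, abstractmethod

class Script(ABC):
    """Hypothetical base class: each script supplies a name and a
    transformation; loading from and writing to the bucket is shared."""

    name: str = "unnamed"

    def input_path(self, bucket: str) -> str:
        # Hypothetical layout: datasets live under /data in the bucket.
        return f"{bucket}/data"

    def output_path(self, bucket: str) -> str:
        return f"{bucket}/results/{self.name}"

    @abstractmethod
    def transform(self, df):
        """Script-specific logic applied to the loaded DataFrame."""

    def run(self):
        # PySpark is imported lazily so the sketch reads without it installed.
        from pyspark.sql import SparkSession
        bucket = os.environ["BUCKET"]
        spark = SparkSession.builder.appName(self.name).getOrCreate()
        df = spark.read.csv(self.input_path(bucket), header=True)
        self.transform(df).write.mode("overwrite").csv(self.output_path(bucket))
```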
The scripts, and a brief explanation of what they do, are as follows:
- Top 15 languages: obtains the top 15 languages used in GitHub repositories.
- Top 5 licenses: obtains the top 5 open-source licenses used in GitHub repositories.
- Main vs master: obtains the number of repositories that use the master branch vs the main branch as a head branch.
- Most active repos: obtains the 25 repos with the most commits and at least one commit in the last two years.
- How many repos have READMEs: obtains how many repos have a README as documentation.
- How many repos have .md: obtains how many repos have a `.md` file that isn't a README.
- Top 5 single language repositories: obtains the top 5 languages used in the repositories with just one language.
- Multiple language repositories: obtains and combines multi-language statistics for repositories. Requires the `-l`/`--language` argument followed by a language. Includes:
- Total count of repositories with more than one language
- The average number of languages per repository
- The top 25 language combinations for a given language. For example, the top 25 language combinations used in repositories that use Python.
- Top build tools: obtains the top build tools used with the number of repositories using them.
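To illustrate the kind of aggregation these scripts perform, here is the top-N-languages idea in plain Python (the actual scripts do this with PySpark over the `languages` table; the sample rows below are made up):

```python
from collections import Counter

def top_languages(repo_languages, n=15):
    """Count how many repositories use each language and return the n
    most common (language, repo_count) pairs."""
    counts = Counter()
    for languages in repo_languages:
        counts.update(set(languages))  # count each language once per repo
    return counts.most_common(n)

# Made-up sample: one language list per repository.
repos = [
    ["Python", "Shell"],
    ["Python"],
    ["JavaScript", "HTML", "CSS"],
    ["Python", "JavaScript"],
]
```

The same group-and-count shape also underlies the licenses, branch-name, and build-tool scripts, just over different columns.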
- Clone the repository.
- Install the requirements.
- Run the script you want to execute:
  - There's a flag for testing purposes (`-t` or `--test`), which will run the script on a small subset of the data. For example, to run the `top_15_languages` script on a small subset of the data, run `spark-submit scripts/top_15_languages.py -t`.
  - Some scripts have a flag (`-l` or `--language`) to choose the language the script filters by. For example, to run the `top_5_single_language_repositories` script for the language Python, run `spark-submit scripts/top_5_single_language_repositories.py -t -l Python`.
  - The logs of the script can be found in the `logs` folder. These provide a cleaner view of the script's execution.
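The `-t`/`--test` and `-l`/`--language` flags can be sketched with `argparse` (hypothetical; the real flag handling lives in the shared interface and may differ):

```python
import argparse

def parse_args(argv=None):
    """Parse the common script flags; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser(description="GitHub dataset script")
    parser.add_argument("-t", "--test", action="store_true",
                        help="run on a small subset of the data")
    parser.add_argument("-l", "--language",
                        help="language to filter by (where supported)")
    return parser.parse_args(argv)
```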
- Clone the scripts into the bucket, in the same directories as in the repo. Also store the datasets in `/data`.
- Install the requirements and set up the environment:

  ```shell
  python -m pip install python-dotenv
  export BUCKET={bucket_dir}
  cd ~
  mkdir logs
  touch logs/logs.log
  ```
- Scripts can be run with the following command: `spark-submit --py-files $BUCKET/scripts/script_interface.py $BUCKET/scripts/{script_name}.py`. For example, to run the `top_15_languages` script, run `spark-submit --py-files $BUCKET/scripts/script_interface.py $BUCKET/scripts/top_15_languages.py`.
When searching for large datasets, we found that Google Cloud offers multiple public datasets through the BigQuery platform. Our dataset comes directly from GitHub and is publicly available on BigQuery as "Github Activity Data". It contains data up to November 2022.
The dataset contains information from open source GitHub repositories, and for each repository it provides very detailed information about commits, contents, files, languages and licenses.
The dataset exceeds 3TB in size, in total. It consists of the following tables: commits, contents, files, languages, licenses, sample_commits, sample_contents, sample_files and sample_repos.
For our scripts, we'll be focusing mainly on languages, licenses, sample_contents, sample_files and sample_commits, which add up to 37 GB in size.
The test datasets are small subsets of the original dataset, used for testing purposes and not included in the final results. They were obtained using the following query on BigQuery and then downloaded as .csv files:

```sql
SELECT * FROM `bigquery-public-data.github_repos.{table_name}` LIMIT 1000
```
These can be found in the `resources` folder.
Some of these datasets had to be cleaned before use, mostly because BigQuery stores some values as records, so they download as a column containing JSON. The cleaning scripts used can be found in the `src` folder. This was mostly needed for the scripts that use the `languages` table.
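For example, a record column exported from BigQuery may arrive as a JSON string per row; a minimal cleaning step (illustrative only, with made-up field names resembling the `languages` table) could flatten it like this:

```python
import json

def flatten_language_record(record_json):
    """Turn an exported JSON record (a list of {name, bytes} objects)
    into a plain list of language names."""
    return [entry["name"] for entry in json.loads(record_json)]

# Made-up example of one exported cell value.
row = '[{"name": "Python", "bytes": 5421}, {"name": "Shell", "bytes": 120}]'
```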
Due to size and cost limitations on GCP, the datasets used were the sample ones. For example, the `contents` table weighs 2.44 TB, while `sample_contents` weighs just 24 GB.
Nonetheless, the scripts provided in this repo are able to run on the full-size datasets as well.
The aforementioned datasets were downloaded from BigQuery and uploaded to the GCS bucket, to streamline development and execution of the scripts and to keep costs down.
