# Replication Package for "TGIF: The Evolution of Developer Commit Times"

This repository contains the replication package for the research paper "TGIF: The Evolution of Developer Commit Times". Follow the steps outlined below to reproduce the results presented in the study.

The list of GitHub repositories analyzed in the study is provided in `sampling/projects-accepted.txt`.
- Visit the SEART GitHub Search.
- Apply the following filters:
  - Number of Commits: Minimum = 12,730
  - Number of Stars: Minimum = 10
  - Number of Forks: Minimum = 10
  - Number of Contributors: Minimum = 10
  - Exclude Forks
  - Created no later than: 12/31/2024
- Download the search results as a CSV file.
- Navigate to the `sampling` directory.
- Run `extract_and_validate_repos.py` to extract repository names from the CSV file and validate that they exist on GitHub.
- Fetch the projects by running `fetch-projects.sh`.
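The extraction step can be sketched as follows. This is a minimal illustration, assuming the SEART CSV export has a `name` column holding `owner/repo` identifiers; the real `extract_and_validate_repos.py` additionally queries GitHub to validate each repository.

```python
import csv
import io


def extract_repo_names(csv_text: str, name_column: str = "name") -> list[str]:
    """Pull owner/repo identifiers out of a SEART search CSV export."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[name_column] for row in reader if row.get(name_column)]


# Tiny inline example standing in for the downloaded CSV file.
sample = "name,stars\ntorvalds/linux,150000\npython/cpython,60000\n"
names = extract_repo_names(sample)
```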
- Return to the base directory and then navigate to the `data-cleaning/inactive-projects` directory.
- Run `remove_inactive_repos.py` to identify and remove repositories whose last commit was before 2015 from `results.json`.
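The inactivity filter amounts to dropping entries whose last commit predates 2015; a sketch, where the `lastCommit` field name and date format are assumptions about the structure of `results.json`:

```python
from datetime import datetime

CUTOFF = datetime(2015, 1, 1)


def is_active(repo: dict) -> bool:
    # "lastCommit" as an ISO date string is an assumed field name.
    return datetime.strptime(repo["lastCommit"], "%Y-%m-%d") >= CUTOFF


repos = [
    {"name": "a/dormant", "lastCommit": "2013-06-01"},
    {"name": "b/active", "lastCommit": "2023-01-15"},
]
active = [r for r in repos if is_active(r)]
```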
- Return to the base directory and then navigate to the `data-cleaning/timezone-reliability-assessment` directory.
- Run `count_all_timezone_commits.bash` for every desired year to count all commits per timezone.
- Run `analyze_filtered_timezone_commits.py` to count commits from contributors with timezone variation (filtering out likely automated commits).
- Run `calculate_yearly_timezone_variations.py` to calculate variation metrics, including standard deviation, coefficient of variation, and entropy.
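The three variation metrics can be computed from a year's per-timezone commit counts roughly as follows; this is a sketch of the formulas, not the exact implementation in `calculate_yearly_timezone_variations.py`:

```python
import math


def variation_metrics(counts: list[int]) -> tuple[float, float, float]:
    """Population std dev, coefficient of variation, and Shannon entropy
    (in bits) of a distribution of commit counts, e.g. commits per timezone."""
    n, total = len(counts), sum(counts)
    mean = total / n
    std = math.sqrt(sum((c - mean) ** 2 for c in counts) / n)
    cv = std / mean if mean else 0.0
    entropy = -sum((c / total) * math.log2(c / total) for c in counts if c > 0)
    return std, cv, entropy


std, cv, entropy = variation_metrics([50, 30, 20])
```

For this example distribution the entropy is about 1.49 bits; commits all falling in a single timezone would give zero entropy and zero variation.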
- Return to the base directory and then navigate to the `write-data-in-csv` directory.
- Generate commit counts and proportions per day by running `commit_count_per_day.py`.
- Generate commit counts and proportions per hour by running `commit_count_per_hour.py`.
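Counting commits per weekday and per hour boils down to bucketing commit timestamps; a minimal sketch (the timestamp format is an assumption, not necessarily what the scripts consume):

```python
from collections import Counter
from datetime import datetime


def bucket_commits(timestamps: list[str]) -> tuple[Counter, Counter]:
    """Count commits per weekday name and per hour of day."""
    days: Counter = Counter()
    hours: Counter = Counter()
    for ts in timestamps:
        dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        days[dt.strftime("%A")] += 1
        hours[dt.hour] += 1
    return days, hours


days, hours = bucket_commits(
    ["2024-01-05 10:00:00", "2024-01-05 23:30:00", "2024-01-06 09:15:00"]
)
# Proportions follow by dividing each bucket by the total commit count.
```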
- Return to the base directory and then navigate to the `statistical-analysis` directory.
- For Mann-Kendall trend tests, navigate to `mann-kendall` and run the desired scripts.
- For Kruskal-Wallis tests, navigate to `kruskal-wallis` and run `kruskal.py`.
- For effect-size calculations, navigate to `effect-sizes` and run the Cohen's h scripts.
- For linear regression analysis, navigate to `linear-regression` and run the regression assumption scripts to verify that the linear regression assumptions are not satisfied.
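Two of these statistics are simple enough to sketch directly: the Mann-Kendall S statistic (the basis of the trend test) and Cohen's h for comparing two proportions. This illustrates the formulas only, not the package's scripts:

```python
import math


def mann_kendall_s(series: list[float]) -> int:
    """Mann-Kendall S: sum of signs over all ordered pairs; positive S
    suggests an increasing trend, negative S a decreasing one."""
    s = 0
    for i in range(len(series)):
        for j in range(i + 1, len(series)):
            s += (series[j] > series[i]) - (series[j] < series[i])
    return s


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size for two proportions (arcsine transformation)."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


s = mann_kendall_s([0.30, 0.29, 0.27, 0.26, 0.24])  # strictly decreasing
h = cohens_h(0.30, 0.25)
```

A strictly decreasing series of length 5 yields S = -10 (all 10 ordered pairs decrease), and h is about 0.11 here, a small effect by Cohen's conventional benchmarks.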
- Return to the base directory and then navigate to the `plots` directory.
- Run the desired plotting scripts, such as `daily_stacked_bar_chart.py`, `hourly_frequencies.py`, and `total_commits_per_period.py`.
To analyze the distribution of programming languages in the sampled projects:
- Navigate to the `distribution-of-languages` directory.
- Run `find_loc_and_occurrences.py` to count repositories and lines of code per programming language.
- Run `find_last_commits_per_project.py` to analyze repository activity over time and generate the active-repositories-per-year data.
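The language-distribution counts amount to a per-language aggregation; a sketch, assuming (repository, language, LOC) records as input (the real script derives these from the fetched projects):

```python
from collections import defaultdict


def language_distribution(records):
    """Aggregate repository counts and total lines of code per language."""
    repo_count: dict = defaultdict(int)
    total_loc: dict = defaultdict(int)
    for _repo, language, loc in records:
        repo_count[language] += 1
        total_loc[language] += loc
    return dict(repo_count), dict(total_loc)


counts, loc = language_distribution(
    [("a/x", "Python", 12000), ("b/y", "C", 90000), ("c/z", "Python", 3000)]
)
```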
To analyze differences between company-backed and volunteering projects:
- Navigate to the `paid_vs_volunteering` directory.
- See `random_sample.py` to understand how the random sample of repositories for manual classification was generated.
- Observe the classification of the repositories in `random_repos_sample.txt` as company-backed or volunteering projects.
- Run `python find_enterprise_projects.py` to find the number of repositories from our sample that are classified as enterprise or enterprise-like in a dataset of open-source enterprise software (https://zenodo.org/records/3742962).
- Navigate to the `projects_maturity` directory.
- Run `python find_average_std.py` to compute the average and standard deviation for the repositories in the sample.
- Navigate to the `freebsd-age` directory.
- Run `./freebsd-age.sh` and observe the average age of FreeBSD developers in 2007 and 2023.