# Replication Package for "TGIF: The Evolution of Developer Commit Times"

This repository contains the replication package for the research paper "TGIF: The Evolution of Developer Commit Times". Follow the steps outlined below to reproduce the results presented in the study.

The list of GitHub repositories analyzed in the study is provided in `sampling/projects-accepted.txt`.
- Visit the SEART GitHub Search.
- Apply the following filters:
  - Number of Commits: Minimum = 12,730
  - Number of Stars: Minimum = 10
  - Number of Forks: Minimum = 10
  - Number of Contributors: Minimum = 10
  - Exclude Forks
  - Created no later than: 12/31/2024
- Download the search results as a CSV file.
- Navigate to the `sampling` directory.
- Run `extract_and_validate_repos.py` to extract repository names from the CSV file and validate that they exist on GitHub.
- Fetch the projects by running `fetch-projects.sh`.
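The extraction step can be sketched as follows. This is a minimal illustration, assuming the SEART CSV export has a `name` column holding `owner/repo` identifiers; the real `extract_and_validate_repos.py` additionally queries GitHub to validate each repository.

```python
import csv
import io


def extract_repo_names(csv_text: str, name_column: str = "name") -> list[str]:
    """Pull owner/repo identifiers out of a SEART search CSV export."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[name_column] for row in reader if row.get(name_column)]


# Tiny inline example standing in for the downloaded CSV file.
sample = "name,stars\ntorvalds/linux,150000\npython/cpython,60000\n"
names = extract_repo_names(sample)
```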
- Return to the base directory and then navigate to the `data-cleaning/inactive-projects` directory.
- Run `remove_inactive_repos.py` to identify and remove repositories whose last commit was before 2015 from `results.json`.
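The inactivity filter amounts to dropping entries whose last commit predates 2015; a sketch, where the `lastCommit` field name and date format are assumptions about the structure of `results.json`:

```python
from datetime import datetime

CUTOFF = datetime(2015, 1, 1)


def is_active(repo: dict) -> bool:
    # "lastCommit" as an ISO date string is an assumed field name.
    return datetime.strptime(repo["lastCommit"], "%Y-%m-%d") >= CUTOFF


repos = [
    {"name": "a/dormant", "lastCommit": "2013-06-01"},
    {"name": "b/active", "lastCommit": "2023-01-15"},
]
active = [r for r in repos if is_active(r)]
```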
- Return to the base directory and then navigate to the `data-cleaning/timezone-reliability-assessment` directory.
- Run `count_all_timezone_commits.bash` for every desired year to count all commits per timezone.
- Run `analyze_filtered_timezone_commits.py` to count commits from contributors with timezone variation (filtering out likely automated commits).
- Run `calculate_yearly_timezone_variations.py` to calculate variation metrics, including standard deviation, coefficient of variation, and entropy.
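The three variation metrics can be computed from a year's per-timezone commit counts roughly as follows; this is a sketch of the formulas, not the exact implementation in `calculate_yearly_timezone_variations.py`:

```python
import math


def variation_metrics(counts: list[int]) -> tuple[float, float, float]:
    """Population std dev, coefficient of variation, and Shannon entropy
    (in bits) of a distribution of commit counts, e.g. commits per timezone."""
    n, total = len(counts), sum(counts)
    mean = total / n
    std = math.sqrt(sum((c - mean) ** 2 for c in counts) / n)
    cv = std / mean if mean else 0.0
    entropy = -sum((c / total) * math.log2(c / total) for c in counts if c > 0)
    return std, cv, entropy


std, cv, entropy = variation_metrics([50, 30, 20])
```

For this example distribution the entropy is about 1.49 bits; commits all falling in a single timezone would give zero entropy and zero variation.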
- Return to the base directory and then navigate to the `write-data-in-csv` directory.
- Generate commit counts and proportions per day by running `commit_count_per_day.py`.
- Generate commit counts and proportions per hour by running `commit_count_per_hour.py`.
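Counting commits per weekday and per hour boils down to bucketing commit timestamps; a minimal sketch (the timestamp format is an assumption, not necessarily what the scripts consume):

```python
from collections import Counter
from datetime import datetime


def bucket_commits(timestamps: list[str]) -> tuple[Counter, Counter]:
    """Count commits per weekday name and per hour of day."""
    days: Counter = Counter()
    hours: Counter = Counter()
    for ts in timestamps:
        dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        days[dt.strftime("%A")] += 1
        hours[dt.hour] += 1
    return days, hours


days, hours = bucket_commits(
    ["2024-01-05 10:00:00", "2024-01-05 23:30:00", "2024-01-06 09:15:00"]
)
# Proportions follow by dividing each bucket by the total commit count.
```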
- Return to the base directory and then navigate to the `statistical-analysis` directory.
- For Mann-Kendall trend tests, navigate to `mann-kendall` and run the desired scripts.
- For Kruskal-Wallis tests, navigate to `kruskal-wallis` and run `kruskal.py`.
- For effect-size calculations, navigate to `effect-sizes` and run the Cohen's h scripts.
- For linear regression analysis, navigate to `linear-regression` and run the regression assumption scripts to verify that the linear regression assumptions are not satisfied.
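Two of these statistics are simple enough to sketch directly: the Mann-Kendall S statistic (the basis of the trend test) and Cohen's h for comparing two proportions. This illustrates the formulas only, not the package's scripts:

```python
import math


def mann_kendall_s(series: list[float]) -> int:
    """Mann-Kendall S: sum of signs over all ordered pairs; positive S
    suggests an increasing trend, negative S a decreasing one."""
    s = 0
    for i in range(len(series)):
        for j in range(i + 1, len(series)):
            s += (series[j] > series[i]) - (series[j] < series[i])
    return s


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size for two proportions (arcsine transformation)."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


s = mann_kendall_s([0.30, 0.29, 0.27, 0.26, 0.24])  # strictly decreasing
h = cohens_h(0.30, 0.25)
```

A strictly decreasing series of length 5 yields S = -10 (all 10 ordered pairs decrease), and h is about 0.11 here, a small effect by Cohen's conventional benchmarks.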
- Return to the base directory and then navigate to the `plots` directory.
- Run the desired plotting scripts, such as `daily_stacked_bar_chart.py`, `hourly_frequencies.py`, and `total_commits_per_period.py`.
To analyze the distribution of programming languages in the sampled projects:
- Navigate to the `distribution-of-languages` directory.
- Run `find_loc_and_occurrences.py` to count repositories and lines of code per programming language.
- Run `find_last_commits_per_project.py` to analyze repository activity over time and generate the active-repositories-per-year data.
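The language-distribution counts amount to a per-language aggregation; a sketch, assuming (repository, language, LOC) records as input (the real script derives these from the fetched projects):

```python
from collections import defaultdict


def language_distribution(records):
    """Aggregate repository counts and total lines of code per language."""
    repo_count: dict = defaultdict(int)
    total_loc: dict = defaultdict(int)
    for _repo, language, loc in records:
        repo_count[language] += 1
        total_loc[language] += loc
    return dict(repo_count), dict(total_loc)


counts, loc = language_distribution(
    [("a/x", "Python", 12000), ("b/y", "C", 90000), ("c/z", "Python", 3000)]
)
```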
To analyze differences between company-backed and volunteering projects:
- Navigate to the `paid_vs_volunteering` directory.
- See `random_sample.py` to understand how the random sample of repositories for manual classification was generated.
- Observe the classification of the repositories in `random_repos_sample.txt` as company-backed or volunteering projects.
- Run `python find_enterprise_projects.py` to find the number of repositories from our sample that are classified as enterprise or enterprise-like in a dataset of open-source enterprise software (https://zenodo.org/records/3742962).
- Navigate to the `projects_maturity` directory.
- Run `python find_average_std.py` to compute the average and standard deviation for the repositories in the sample.
- Navigate to the `freebsd-age` directory.
- Run `./freebsd-age.sh` and observe the average age of FreeBSD developers in 2007 and 2023.