This repository contains the replication package for the research paper 'TGIF: The Evolution of Developer Commit Times'

vtalos/commit-patterns-replication-package

Replication Package for "TGIF: The Evolution of Developer Commit Times"

This repository contains the replication package for the research paper "TGIF: The Evolution of Developer Commit Times". Follow the steps outlined below to reproduce the results presented in the study.

The list of GitHub repositories analyzed in the study is provided at sampling/projects-accepted.txt.

Collect Initial Projects

  1. Visit the SEART GitHub Search.
  2. Apply the following filters:
    • Number of Commits: Minimum = 12,730
    • Number of Stars: Minimum = 10
    • Number of Forks: Minimum = 10
    • Number of Contributors: Minimum = 10
    • Exclude Forks
    • Created no later than: 12/31/2024
  3. Download the search results as a CSV file.

Sampling

  1. Navigate to the sampling directory.
  2. Run extract_and_validate_repos.py to extract repository names from the CSV file and validate they exist on GitHub.
  3. Fetch the projects by running fetch-projects.sh.
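The extraction part of step 2 can be sketched as follows. This is a minimal illustration, not the package's extract_and_validate_repos.py: the column name `name` and the sample rows are assumptions made for the example, and the GitHub existence check is left out because it needs network access.

```python
import csv
import io


def extract_repo_names(csv_text, column="name"):
    """Return unique 'owner/repo' names from CSV text, in first-seen order.

    The column name is an assumption; adjust it to the header of the
    downloaded CSV file.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    names = []
    for row in reader:
        name = (row.get(column) or "").strip()
        owner, _, repo = name.partition("/")
        # Keep only well-formed owner/repo entries and skip duplicates.
        if owner and repo and "/" not in repo and name not in seen:
            seen.add(name)
            names.append(name)
    return names


# Hypothetical rows, used only to exercise the function.
sample_csv = (
    "name,commits\n"
    "owner-a/repo-a,20000\n"
    "malformed-entry,15000\n"
    "owner-a/repo-a,20000\n"
    "owner-b/repo-b,13000\n"
)
print(extract_repo_names(sample_csv))
```

Filtering malformed and duplicate rows before validation keeps the later fetch step from requesting the same repository twice.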

Data Cleaning & Writing Data to CSV Files

Data Cleaning

  1. Return to the base directory and then navigate to the data-cleaning/inactive-projects directory.
  2. Run remove_inactive_repos.py to identify repositories whose last commit predates 2015 and remove them from results.json.
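The filtering rule in step 2 could look like the sketch below. It assumes a mapping from repository name to the ISO 8601 timestamp of its last commit; the actual layout of results.json is not described here, so both the input shape and the function name `split_by_activity` are hypothetical.

```python
from datetime import datetime


def split_by_activity(last_commits, cutoff_year=2015):
    """Split repositories into (active, inactive) by last-commit year.

    last_commits maps a repository name to an ISO 8601 timestamp string
    (an assumed input shape, not the package's format).
    """
    active, inactive = {}, {}
    for repo, stamp in last_commits.items():
        year = datetime.fromisoformat(stamp).year
        # A last commit before the cutoff year marks the repository inactive.
        target = inactive if year < cutoff_year else active
        target[repo] = stamp
    return active, inactive


# Hypothetical entries on either side of the cutoff.
active, inactive = split_by_activity({
    "owner-a/repo-a": "2014-12-31T23:59:59+00:00",
    "owner-b/repo-b": "2015-01-01T00:00:00+00:00",
})
print(sorted(active), sorted(inactive))
```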

Assess Timezone Reliability

  1. Return to the base directory and then navigate to the data-cleaning/timezone-reliability-assessment directory.
  2. Run count_all_timezone_commits.bash once for each year of interest to count all commits per timezone.
  3. Run analyze_filtered_timezone_commits.py to count commits only from contributors who show timezone variation, which filters out likely automated commits.
  4. Run calculate_yearly_timezone_variations.py to calculate variation metrics including standard deviation, coefficient of variation, and entropy.
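The three variation metrics named in step 4 can be illustrated with the sketch below. It assumes a list of commit counts per timezone for a single year, uses the population standard deviation, and measures entropy in bits; whether calculate_yearly_timezone_variations.py makes the same choices is not stated here, so treat these as assumptions.

```python
import math
import statistics


def variation_metrics(counts):
    """Return std, coefficient of variation, and Shannon entropy (bits).

    counts holds commits per timezone for one year (assumed input shape).
    """
    total = sum(counts)
    mean = statistics.fmean(counts)
    std = statistics.pstdev(counts)  # population standard deviation
    cv = std / mean if mean else float("nan")
    # Entropy of the share of commits falling in each timezone;
    # empty timezones contribute nothing.
    entropy = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            entropy -= p * math.log2(p)
    return {"std": std, "cv": cv, "entropy": entropy}


# Hypothetical counts: evenly spread vs. concentrated in one timezone.
print(variation_metrics([10, 10, 10, 10]))
print(variation_metrics([40, 0, 0, 0]))
```

An even spread maximizes entropy and drives the coefficient of variation to zero, while a single dominant timezone does the opposite.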

Write Data to CSV

  1. Return to the base directory and then navigate to the write-data-in-csv directory.
  2. Generate commit counts and proportions per day by running commit_count_per_day.py.
  3. Generate commit counts and proportions per hour by running commit_count_per_hour.py.
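As an illustration of step 2, the sketch below counts commits per weekday and derives proportions. It assumes ISO 8601 timestamps that carry the author's UTC offset, so the weekday reflects the author's local time; the input format and the function name `commits_per_day` are assumptions, not taken from commit_count_per_day.py.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def commits_per_day(timestamps):
    """Map each weekday to a (count, proportion) pair.

    The weekday is taken from the timestamp as written, i.e. in the
    author's local offset, without converting to UTC.
    """
    counts = Counter(
        DAYS[datetime.fromisoformat(ts).weekday()] for ts in timestamps
    )
    total = sum(counts.values())
    return {
        day: (counts[day], counts[day] / total if total else 0.0)
        for day in DAYS
    }


# Hypothetical author timestamps.
result = commits_per_day([
    "2024-01-05T17:30:00+02:00",
    "2024-01-05T09:00:00-05:00",
    "2024-01-06T11:00:00+00:00",
    "2024-01-01T08:00:00+01:00",
])
print(result["Friday"])
```

The same pattern applies to hourly counts by grouping on the `hour` attribute instead of the weekday.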

Statistical Analysis & Plots

  1. Return to the base directory and then navigate to the statistical-analysis directory.
  2. For Mann-Kendall trend tests, navigate to mann-kendall and run the desired scripts.
  3. For Kruskal-Wallis tests, navigate to kruskal-wallis and run kruskal.py.
  4. For effect size calculations, navigate to effect-sizes and run the Cohen's h scripts.
  5. For linear regression analysis, navigate to linear-regression and run the regression assumption scripts, which check the linear regression assumptions and show that they are not satisfied.
  6. Return to the base directory and then navigate to the plots directory.
  7. Run the desired plotting scripts such as daily_stacked_bar_chart.py, hourly_frequencies.py, total_commits_per_period.py, etc.
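Two of the statistics used in the steps above can be sketched in a few lines. The code below shows the standard definition of Cohen's h for two proportions and the S statistic at the core of the Mann-Kendall trend test. It is illustrative only: the package's scripts may rely on libraries and also compute variances and p-values, which this sketch omits.

```python
import math


def cohens_h(p1, p2):
    """Cohen's h effect size for two proportions in [0, 1]."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def mann_kendall_s(series):
    """Mann-Kendall S: increasing pairs minus decreasing pairs."""
    s = 0
    for i in range(len(series) - 1):
        for j in range(i + 1, len(series)):
            diff = series[j] - series[i]
            # Adds +1 for an increase, -1 for a decrease, 0 for a tie.
            s += (diff > 0) - (diff < 0)
    return s


# Hypothetical yearly proportions of commits on a given day.
proportions = [0.10, 0.11, 0.12, 0.14]
print(mann_kendall_s(proportions))
print(cohens_h(proportions[-1], proportions[0]))
```

A positive S suggests an upward trend over the years, and the sign of h indicates which of the two proportions is larger.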

Distribution of Programming Languages

To analyze the distribution of programming languages in the sampled projects:

  1. Navigate to the distribution-of-languages directory.
  2. Run find_loc_and_occurrences.py to count repositories and lines of code per programming language.
  3. Run find_last_commits_per_project.py to analyze repository activity over time and generate active repos per year data.

Paid vs Volunteering Analysis

To analyze differences between company-backed and volunteering projects:

  1. Navigate to the paid_vs_volunteering directory.
  2. See random_sample.py to understand how the random sample of repositories for manual classification was generated.
  3. Review the classification of the repositories in random_repos_sample.txt as company or volunteering projects.
  4. Run python find_enterprise_projects.py to count the repositories in our sample that are classified as enterprise or enterprise-like in a dataset of open-source enterprise software (https://zenodo.org/records/3742962).
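A reproducible random sample of the kind mentioned in step 2 might be drawn as in the sketch below. The seed, the sample size, and the function name `sample_repos` are hypothetical; consult random_sample.py for the procedure actually used.

```python
import random


def sample_repos(repos, k, seed=0):
    """Draw k distinct repositories with a fixed seed.

    Sorting first makes the result independent of input order, so the
    same seed always yields the same sample.
    """
    rng = random.Random(seed)
    return rng.sample(sorted(repos), k)


# Hypothetical repository names.
pool = [f"owner-{i}/repo-{i}" for i in range(20)]
picked = sample_repos(pool, k=5, seed=42)
print(picked)
```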

Average and Standard Deviation of Repository Age

  1. Navigate to the projects_maturity directory.
  2. Run python find_average_std.py to compute the average and the standard deviation of repository age across the sample.
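The computation in step 2 amounts to something like the sketch below. It assumes repository age is measured in years from the creation date to a fixed reference date and uses the sample standard deviation; both choices, and the function name `age_stats`, are assumptions rather than a description of find_average_std.py.

```python
import statistics
from datetime import date


def age_stats(creation_dates, reference):
    """Return (mean, sample standard deviation) of repository age in years."""
    ages = [(reference - created).days / 365.25 for created in creation_dates]
    return statistics.fmean(ages), statistics.stdev(ages)


# Hypothetical creation dates and reference date.
mean_age, std_age = age_stats(
    [date(2015, 1, 1), date(2020, 1, 1)],
    reference=date(2025, 1, 1),
)
print(round(mean_age, 2), round(std_age, 2))
```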

FreeBSD Age of Developers

  1. Navigate to the freebsd-age directory.
  2. Run ./freebsd-age.sh and observe the average age for FreeBSD developers in 2007 and 2023.
