This repository was archived by the owner on Dec 15, 2025. It is now read-only.

JetBrains-Research/task-tracker-post-processing



Status: Archived

No longer maintained


TaskTracker postprocessing

Overview

This tool prepares raw data collected by the TaskTracker plugin for further analysis. The data contains snapshots of code captured during the solution process and records of user interaction with the IDE.

The tool consists of two major modules:

  • data processing
  • data visualization

Data processing

Requirements for the source data

  1. The source data has to be in the .csv format.
  2. Activity-tracker files have the ide-events prefix. We use the activity-tracker plugin.
  3. TaskTracker files can have any name that starts with the key of the task whose data is collected in that file. We use the TaskTracker plugin together with the activity-tracker plugin.
  4. The columns for activity-tracker files are listed in the const file (the ACTIVITY_TRACKER_COLUMN const).
  5. The columns for task-tracker files are listed in the const file (the TASK_TRACKER_COLUMN const).
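The naming rules above can be checked with a small helper. This is an illustrative sketch, not the repository's API: `classify`, `ACTIVITY_TRACKER_PREFIX`, and the task keys shown are assumptions made for the example.

```python
# Illustrative only (not the repo's code): classify raw .csv files
# by the naming rules described above.
ACTIVITY_TRACKER_PREFIX = "ide-events"   # prefix from rule 2 above
TASK_KEYS = ["pies", "zero"]             # hypothetical task keys

def classify(filename: str) -> str:
    """Return 'ati', 'tt', or 'unknown' for a raw data file name."""
    if not filename.endswith(".csv"):
        return "unknown"
    if filename.startswith(ACTIVITY_TRACKER_PREFIX):
        return "ati"
    if any(filename.startswith(key) for key in TASK_KEYS):
        return "tt"
    return "unknown"
```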

Processing

The correct order for data processing is:

  1. Do primary data preprocessing (use preprocess_data function from preprocessing.py).
  2. Merge task-tracker files and activity-tracker files (use merge_tt_with_ati function from merging_tt_with_ati.py).
  3. Find tests results for the tasks (use run_tests function from tasks_tests_handler.py).
  4. Reorganize files structure (use reorganize_files_structure function from task_scoring.py).
  5. [Optional] Remove intermediate diffs (use remove_intermediate_diffs function from intermediate_diffs_removing.py).
  6. [Optional, only for Python language] Remove inefficient statements (use remove_inefficient_statements function from inefficient_statements_removing.py).
  7. [Optional] Add int experience column (use add_int_experience function from int_experience_adding.py).

Note: although you can run the steps independently, the data for the Nth step must already have passed all the steps before it.
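Step 2 above pairs each code snapshot with the IDE events recorded up to that moment. The following is a minimal sketch of that idea only, assuming timestamp-sorted inputs; it is not the implementation of merge_tt_with_ati.

```python
# Sketch of the merge idea (not merge_tt_with_ati itself): attach to each
# TaskTracker snapshot the activity-tracker events recorded up to its timestamp.
def merge_by_timestamp(snapshots, events):
    """snapshots/events: lists of (timestamp, payload), sorted by timestamp."""
    merged, i = [], 0
    for ts, code in snapshots:
        batch = []
        # Consume all IDE events that happened at or before this snapshot.
        while i < len(events) and events[i][0] <= ts:
            batch.append(events[i][1])
            i += 1
        merged.append((ts, code, batch))
    return merged
```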

Available languages

  • C++
  • Java
  • Kotlin
  • Python

Visualization

You can visualize different parts of the pipeline.

Participants distribution

Note: run before 'reorganize_files_structure', because the old file structure is used to count unique users.

Use the get_profile_statistics function from statistics_gathering.py to get the age and experience statistics. After that, run the plot_profile_statistics function from profile_statistics_plots.py with the necessary column and options, passing the serialized statistics files as a parameter.

Two column types are available:

  1. STATISTICS_KEY.AGE
  2. STATISTICS_KEY.EXPERIENCE

Two chart types are available:

  1. CHART_TYPE.BAR
  2. CHART_TYPE.PIE

Other options:

  1. to_union_rare: merge the rare values. A value is rare if its frequency is less than or equal to STATISTICS_RARE_VALUE_THRESHOLD from consts.py (default: 2).
  2. format: save the output to a file in different formats. The default value is html because the plots are interactive.
  3. auto_open: open plots automatically.
  4. x_category_order: choose the sort order for the X axis. Available values are stored in PLOTTY_CATEGORY_ORDER from consts.py. The default value is PLOTTY_CATEGORY_ORDER.TOTAL_ASCENDING.
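The to_union_rare option can be pictured as follows. This sketch only illustrates the idea of merging infrequent values into one bucket; the function name `union_rare` and the "other" label are assumptions, not the repository's code.

```python
from collections import Counter

# Illustration of the to_union_rare idea: values whose frequency is at or
# below the threshold (STATISTICS_RARE_VALUE_THRESHOLD, default 2) are
# merged into a single "other" bucket.
def union_rare(values, threshold=2, other="other"):
    counts = Counter(values)
    merged = Counter()
    for value, count in counts.items():
        merged[other if count <= threshold else value] += count
    return dict(merged)
```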

Tasks distribution

Note: Run after 'reorganize_files_structure'.

Use plot_tasks_statistics function from tasks_statistics_plots.py to plot tasks statistics.

Available options:

  1. plot_name: choose the file name. The default value is task_distribution_plot.
  2. format: save the output in different formats. The default value is html because the plots are interactive.
  3. auto_open: open plots automatically.

Activity tracker plots

Use the create_ati_data_plot function from ati_data_plots.py to plot the length of the current fragment together with the actions performed in the IDE.

Scoring solutions plots

Note: Run after 'run_tests'.

Use plot_scoring_solutions function from scoring_solutions_plots.py to plot scoring solutions.


Installation

Simply clone the repository and run the following commands:

  1. pip install -r requirements.txt
  2. pip install -r dev-requirements.txt
  3. pip install -r test-requirements.txt

Usage

Run the necessary file for available modules:

| File | Module | Description |
|------|--------|-------------|
| processing.py | Data processing module | Includes all steps from the Data processing section |
| plots.py | Plots module | Includes all plots from the Visualization section |

A simple configuration: python <file> <args>

Use the -h option to show help for each module.

Data processing module

See description: usage

File for running: preprocessing.py

Required arguments:

  1. path — the path to data.

Optional arguments:

--level — use to set the level for the preprocessing. Available levels:

| Value | Description |
|-------|-------------|
| 0 | primary data processing |
| 1 | merge task-tracker files and activity-tracker files |
| 2 | find tests results for the tasks |
| 3 | reorganize files structure |
| 4 | remove intermediate diffs |
| 5 | [only for Python language] remove inefficient statements |
| 6 | add int experience column |

Note: the Nth level runs all the levels before it. The default value is 3.
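The path argument and the --level flag described above could be parsed roughly as follows. This is a hedged sketch based on the table, assuming argparse; the actual preprocessing.py may differ.

```python
import argparse

# Sketch of the preprocessing CLI described above (assumption: argparse-based).
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="TaskTracker preprocessing")
    parser.add_argument("path", help="the path to data")
    parser.add_argument("--level", type=int, default=3, choices=range(0, 7),
                        help="preprocessing level; level N runs all levels before it")
    return parser
```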

Plots module

See description: usage

File for running: plots.py

Required arguments:

  1. path — the path to data.
  2. plot_type — the type of plot. Available values:
| Value | Description |
|-------|-------------|
| participants_distr | use to visualize Participants distribution |
| tasks_distr | use to visualize Tasks distribution |
| ati | use to visualize Activity tracker plots |
| scoring | use to visualize Scoring solutions plots |

Optional arguments:

| Parameter | Description |
|-----------|-------------|
| --type_distr | The distribution type. Only for plot_type participants_distr. Available values are programExperience and age. The default value is programExperience. |
| --chart_type | The chart type. Only for plot_type participants_distr. Available values are bar and pie. The default value is bar. |
| --to_union_rare | Use to merge the rare values. Only for plot_type participants_distr. |
| --format | Use to save the output to a file in different formats (all plots except plot_type ati). Available values are html and png. The default value is html. |
| --auto_open | Use to open plots automatically. |
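The argument table above could be parsed with argparse roughly as follows. This is a sketch built from the table, not the actual plots.py implementation, and the function name `build_plots_parser` is an assumption.

```python
import argparse

# Sketch of the plots CLI described above (assumption: argparse-based).
def build_plots_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="TaskTracker plots")
    parser.add_argument("path", help="the path to data")
    parser.add_argument("plot_type",
                        choices=["participants_distr", "tasks_distr", "ati", "scoring"])
    parser.add_argument("--type_distr", choices=["programExperience", "age"],
                        default="programExperience")
    parser.add_argument("--chart_type", choices=["bar", "pie"], default="bar")
    parser.add_argument("--to_union_rare", action="store_true")
    parser.add_argument("--format", choices=["html", "png"], default="html")
    parser.add_argument("--auto_open", action="store_true")
    return parser
```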

Tests running

We use the pytest library for tests.

Note: if you get a ModuleNotFoundError when running tests, run pip install -e . before using the test system.

Note: we use several compilers to check tasks; you can find all of them in the Dockerfile. We also use the Kotlin compiler to check Kotlin tasks, so you need to install it separately if you have Kotlin files.

Use python setup.py test from the root directory to run ALL tests. If you want to run only some tests, use the --test_level param.

You can use different test levels for param --test_level:

| Param | Description |
|-------|-------------|
| all | all tests from all modules (the default value) |
| plots | tests from the plots module |
| process | tests from the preprocessing module |
| test_scoring | tests from the test scoring module |
| util | tests from the util module |
| cli | tests from the cli module |
