Repository for analysis and experiments in the BigCode project.

hughesthe1st/bigcode-analysis

 
 

BigCode Analysis

This repository collects the analysis done in the BigCode project. It covers datasets, models, architecture choices, and more.

Contents

  • Data analysis: In the folder data_analysis, we analyze these two datasets: python-all-license (private) and python-safe-license. We provide the following statistics:
    • percentage of near duplicates
    • percentage of configuration/test and uncommon files
    • file size distribution
    • loss analysis
    • natural language distribution in comments/docstrings
    • number of files that can be successfully compiled
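Two of the statistics above, the file size distribution and the number of files that compile, can be approximated with the standard library alone. The following is a minimal sketch, not the code in data_analysis: the function names `compiles` and `summarize` are hypothetical, file size is assumed to mean UTF-8 byte length, and "compiles" is assumed to mean that the built-in `compile()` accepts the source under the interpreter running the script.

```python
import statistics


def compiles(source: str) -> bool:
    """Return True if `source` compiles as Python under the running interpreter."""
    try:
        compile(source, "<dataset-file>", "exec")
        return True
    except (SyntaxError, ValueError):
        # ValueError covers sources containing null bytes on older Python versions.
        return False


def summarize(files: list[str]) -> dict:
    """Compute simple per-dataset statistics (hypothetical helper, for illustration)."""
    if not files:
        raise ValueError("need at least one file")
    sizes = [len(f.encode("utf-8")) for f in files]  # size in bytes, assuming UTF-8
    n_ok = sum(compiles(f) for f in files)
    return {
        "n_files": len(files),
        "pct_compilable": 100.0 * n_ok / len(files),
        "median_bytes": statistics.median(sizes),
        "max_bytes": max(sizes),
    }


stats = summarize(["x = 1\n", "def f(:\n", "print('hi')\n"])
print(stats)
```

Note that the result depends on the Python version doing the check: a file written for an older syntax may fail to compile under a newer interpreter, and the reverse.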

We also provide code to run near-deduplication and to detect the natural language of comments in Python datasets.
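To make the idea of near-deduplication concrete, here is a small sketch that flags pairs of files whose token shingles have a high Jaccard similarity. It is an illustration under stated assumptions, not the repository's implementation: it compares all pairs exactly, which only scales to small collections (large-scale pipelines typically approximate this with MinHash and locality-sensitive hashing), and the names `shingles`, `jaccard`, `near_duplicate_pairs`, the shingle size `k`, and the default threshold are all hypothetical choices.

```python
import re
from itertools import combinations


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-token shingles of `text` (tokens are runs of word characters)."""
    tokens = re.findall(r"\w+", text)
    if len(tokens) < k:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(files: list, threshold: float = 0.85, k: int = 3) -> list:
    """Return index pairs (i, j), i < j, whose shingle sets reach the threshold."""
    sets = [shingles(f, k) for f in files]
    return [
        (i, j)
        for i, j in combinations(range(len(files)), 2)
        if jaccard(sets[i], sets[j]) >= threshold
    ]


a = "def add(a, b):\n    return a + b\n"
b = "def add(a, b):\n    return a + b  # sum\n"
c = "import os\nprint(os.getcwd())\n"
print(near_duplicate_pairs([a, b, c], threshold=0.8))
```

In this example the first two files differ only by a trailing comment, so they share five of six distinct shingles and are reported as a pair at a threshold of 0.8; the third file shares none.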

  • Multi-query attention experiments; for details, refer here

Languages

  • Jupyter Notebook 96.5%
  • Python 3.4%
  • Shell 0.1%