Preprocessing Scripts for Influenza virus files

This repository contains a preprocessing pipeline for influenza protein data files. The pipeline offers several preprocessing options: removing reassortant entries, translating sequences, and removing duplicates.

The data should be organized as follows:

{path_folder}  
│  
├── ha_cleaned/  
│      ├── ha_{bin}.fa  
│      └── ...  
│  
└── na_cleaned/  
       ├── na_{bin}.fa  
       └── ...

{path_folder}: This is the main folder containing the data.
ha_cleaned/: This subfolder contains the hemagglutinin (HA) sequences after initial manual curation.
- ha_{bin}.fa: This file contains HA sequences for a specific flu season bin (e.g., ha_10_11.fa for the flu season 2010-2011).
na_cleaned/: This subfolder contains the neuraminidase (NA) sequences after initial manual curation.
- na_{bin}.fa: This file contains NA sequences for a specific flu season bin (e.g., na_10_11.fa for the flu season 2010-2011).

The sequences in these files have already undergone initial cleaning steps, such as removing partially sequenced strains and sequences with a large number of ambiguous characters. Those cleaning steps are not part of these preprocessing scripts.

Additionally, two more variables can be modified at the beginning of the file:

first_bin: The first flu season bin to analyze.
last_bin: The last flu season bin to analyze.

These variables allow you to specify a subset of files to process if needed.
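
As a rough illustration of how a bin range could map onto the folder layout above, here is a minimal sketch. It assumes that `first_bin` and `last_bin` are strings in the same `10_11` style used in the file names and that both fall within the same century; the helper names `season_bins` and `input_files` are hypothetical and are not taken from the scripts.

```python
from pathlib import Path


def season_bins(first_bin: str, last_bin: str) -> list[str]:
    """Return season labels such as '10_11' from first_bin to last_bin, inclusive.

    Assumption: bins are written as two-digit start and end years ('10_11').
    """
    start = int(first_bin.split("_")[0])
    end = int(last_bin.split("_")[0])
    if start > end:
        raise ValueError("first_bin must not be later than last_bin")
    return [f"{y % 100:02d}_{(y + 1) % 100:02d}" for y in range(start, end + 1)]


def input_files(path_folder: str, first_bin: str, last_bin: str) -> list[Path]:
    """List the HA and NA files expected for every bin in the selected range."""
    root = Path(path_folder)
    files = []
    for label in season_bins(first_bin, last_bin):
        files.append(root / "ha_cleaned" / f"ha_{label}.fa")
        files.append(root / "na_cleaned" / f"na_{label}.fa")
    return files


for path in input_files("data", "10_11", "12_13"):
    print(path)
```

Listing the expected paths before running a step makes it easy to spot a missing season file early.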

Below are the available options:

Options

1. Remove Reassortant

This option removes reassortant entries from the data files. Reassortants are viruses whose genome combines gene segments from different parent strains, and they are often filtered out in certain analyses.
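
The README does not say how reassortants are recognized. The sketch below shows one possible approach under a stated assumption: that reassortant entries are flagged by a keyword in the FASTA header. The functions `parse_fasta` and `remove_reassortants`, the keyword, and the example headers are all hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def remove_reassortants(records, keyword="reassortant"):
    """Keep only records whose header does not contain the keyword.

    Assumption: reassortants are marked in the header text (case-insensitive).
    """
    needle = keyword.lower()
    return [(h, s) for h, s in records if needle not in h.lower()]


example = """>A/Example/1/2010
ATGAAG
ACC
>A/Example/2/2010 Reassortant
ATGAAA
"""
kept = remove_reassortants(parse_fasta(example))
print(len(kept), "record(s) kept")
```

If the real data marks reassortants differently, only the test inside `remove_reassortants` would need to change.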

2. Translate

This option translates the nucleotide sequences in the data files into amino-acid (protein) sequences, the form needed for protein-level analysis.
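
Assuming that translation here means converting nucleotide sequences into amino-acid sequences, the idea can be sketched with the standard genetic code. This is an illustrative pure-Python version, not the code used by the scripts; the function name `translate`, the choice to read from the first base, and the use of `X` for codons with ambiguous bases are assumptions.

```python
from itertools import product

# Standard genetic code, with codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides, stop_at_first_stop=True):
    """Translate a nucleotide sequence in reading frame 1.

    Codons containing ambiguous bases become 'X'; a trailing partial
    codon is ignored.
    """
    seq = nucleotides.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")
        if aa == "*" and stop_at_first_stop:
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCTAA"))
```

Checking a few translated sequences by eye (for example, that they start with `M` and contain no unexpected stop codons) is a quick way to validate this step before removing duplicates.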

3. Remove Duplicates

This option removes duplicate entries from the data files. Duplicate entries can skew analysis results and are often removed to ensure data integrity.
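
A minimal sketch of duplicate removal follows. It assumes that two entries count as duplicates when their sequences are identical (ignoring case) and that the first occurrence is the one to keep; the scripts may use a different criterion. The function name `remove_duplicates` and the example records are hypothetical.

```python
def remove_duplicates(records):
    """Drop records whose sequence was already seen, keeping the first one.

    `records` is a list of (header, sequence) tuples; input order is preserved.
    """
    seen = set()
    unique = []
    for header, sequence in records:
        key = sequence.upper()
        if key not in seen:
            seen.add(key)
            unique.append((header, sequence))
    return unique


records = [
    ("strain_a", "MKTIIALSYI"),
    ("strain_b", "MKTIIALSYI"),  # same sequence as strain_a
    ("strain_c", "MKAIIVLLMV"),
]
print([header for header, _ in remove_duplicates(records)])
```

Using a set for the sequences already seen keeps the pass linear in the number of records.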

4. Full Pipeline (Not Suggested)

This option runs the full preprocessing pipeline: removing reassortants, translating, and removing duplicates. Using this option is not recommended. It is better to run the three steps separately, in that order, and check the intermediate results before proceeding to the next step.
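
To illustrate what checking intermediate results can look like, the sketch below applies the steps one at a time and records how many entries remain after each. The function `run_steps` and the stand-in step functions are hypothetical; the real steps would be the ones provided by the scripts.

```python
def run_steps(records, steps):
    """Apply each (name, function) step in order and report record counts.

    Returns the final records and a list of (name, count_before, count_after).
    """
    report = []
    for name, step in steps:
        before = len(records)
        records = step(records)
        report.append((name, before, len(records)))
    return records, report


# Stand-in steps operating on (header, sequence) tuples, for illustration only.
steps = [
    ("remove reassortants", lambda rs: [r for r in rs if "reassortant" not in r[0].lower()]),
    ("remove duplicates", lambda rs: list({r[1]: r for r in reversed(rs)}.values())[::-1]),
]
data = [("a", "MKT"), ("b reassortant", "MKA"), ("c", "MKT")]
final, report = run_steps(data, steps)
for name, before, after in report:
    print(f"{name}: {before} -> {after}")
```

A step that removes far more entries than expected is a signal to inspect its output before moving on.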


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Preprocessing Scripts for Influenza virus files

The data should be organized as follows:

Options

1. Remove Reassortant

2. Translate

3. Remove Duplicates

4. Full Pipeline (Not Suggested)

About

Uh oh!

Releases 1

Packages

Languages

valegale/preprocessing_influenza

Folders and files

Latest commit

History

Repository files navigation

Preprocessing Scripts for Influenza virus files

The data should be organized as follows:

Options

1. Remove Reassortant

2. Translate

3. Remove Duplicates

4. Full Pipeline (Not Suggested)

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages