This repository contains a Nextflow pipeline for harmonizing proteomics data using the ProHarMeD package. The pipeline processes a proteomics dataset in three steps: filtering protein IDs, remapping gene names, and mapping orthologs.
The corresponding Docker image, which the Nextflow file calls, can be found here: https://hub.docker.com/repository/docker/kadam0/proharmed/general
To run the pipeline, you need to specify several parameters:
- --count_file: This flag should be followed by the path to your count file, which contains the raw count data for your analysis.
- --out_dir: Use this flag to specify the output directory where the results will be stored.
- --file_type: The file type for plots (e.g., png, pdf).
- --protein_column: Name of the column with protein IDs.
- --organism: The organism for protein ID filtering.
- --rev_con: Boolean flag indicating if decoy protein IDs should be kept.
- --reviewed: Boolean flag indicating if protein IDs should be reduced to reviewed ones only.
- --mode: Mode for remapping gene names.
- --gene_column: Name of the column with gene names.
- --skip_filled: Boolean flag indicating if already filled gene name cells should be skipped.
- --fasta: Path to the FASTA file for remapping gene names.
- --tar_organism: Target organism for ortholog mapping.
Example command:
./nextflow run main.nf --count_file path/to/count_file.txt --out_dir output_directory --file_type png --protein_column "Protein IDs" --organism rat --rev_con false --reviewed false --mode uniprot_one --gene_column "Gene names" --skip_filled true --fasta path/to/fasta_file.fasta --tar_organism human
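When the pipeline is launched from a script, the same call can be assembled programmatically. The following is a minimal Python sketch that assumes only the flags listed above; build_command is a hypothetical helper that is not part of this repository, and the parameter values simply mirror the example command.

```python
import shlex

# Parameter values copied from the example command; adjust the paths to your project.
PARAMS = {
    "count_file": "path/to/count_file.txt",
    "out_dir": "output_directory",
    "file_type": "png",
    "protein_column": "Protein IDs",
    "organism": "rat",
    "rev_con": "false",
    "reviewed": "false",
    "mode": "uniprot_one",
    "gene_column": "Gene names",
    "skip_filled": "true",
    "fasta": "path/to/fasta_file.fasta",
    "tar_organism": "human",
}


def build_command(params, nextflow="./nextflow", script="main.nf"):
    """Return the Nextflow call as an argument list (hypothetical helper)."""
    command = [nextflow, "run", script]
    for name, value in params.items():
        command += [f"--{name}", str(value)]
    return command


command = build_command(PARAMS)
# shlex.join quotes values that contain spaces, such as the column names.
print(shlex.join(command))
# To actually launch the pipeline: subprocess.run(command, check=True)
```

Passing the arguments as a list avoids manual quoting of column names that contain spaces.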
The pipeline generates several output files and directories:
- filtered_protein_ids (directory): Contains filtered protein IDs with relevant details.
- remap_gene_names (directory): Contains remapped gene names.
- map_orthologs (directory): Contains ortholog mapping results.
- harmonized_data.csv: The final merged dataset after processing.
Each directory includes data files and visualization plots summarizing the processing steps and outcomes.
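After a run, it can be useful to confirm that all expected entries were written. The sketch below is one way to do this in Python; it assumes the entry names from the output list above, and missing_outputs is a hypothetical helper, not something the pipeline provides.

```python
from pathlib import Path

# Entry names taken from the output list above.
EXPECTED_DIRS = ["filtered_protein_ids", "remap_gene_names", "map_orthologs"]
EXPECTED_FILES = ["harmonized_data.csv"]


def missing_outputs(out_dir):
    """Return the expected output entries that are absent from out_dir."""
    root = Path(out_dir)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    # "output_directory" is the value passed as --out_dir in the example command.
    absent = missing_outputs("output_directory")
    print("missing:", absent if absent else "nothing")
```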
To get started, clone this repository and ensure you have Nextflow installed. Prepare your count file according to the required format and run the command with the appropriate paths to your files. Check the output directory for results and detailed analysis.
Note: This README provides a general guide. Users should adjust paths and file names according to their specific project structure and requirements.
The filter_protein_ids function processes a DataFrame containing protein IDs. It allows for the removal of decoy IDs, contamination IDs, and the option to filter IDs based on a specified organism. Additionally, the function can restrict protein IDs to reviewed ones and manage empty cells in the dataset.
- data: A DataFrame that includes a column with protein IDs.
- protein_column: The name of the column in data that contains the protein IDs.
- organism: (Optional) Specify the organism to filter the protein IDs. Options: human, rat, mouse, rabbit.
- rev_con: (Optional) A boolean value indicating if decoy protein IDs should be kept. Default is FALSE.
- keep_empty: A boolean value indicating if rows with empty protein ID cells should be retained. Default is FALSE.
- res_column: The name of the column where the filtered protein IDs will be stored. If NULL, the original protein_column will be overwritten. Default is FilteredProteinIDs.
- reviewed: (Optional) A boolean value indicating if the protein IDs should be reduced to reviewed ones only. Default is FALSE.
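To make the effect of rev_con and keep_empty concrete, here is a small Python sketch of the filtering logic on plain lists. It is an illustration under stated assumptions, not the ProHarMeD implementation: it assumes MaxQuant-style cells in which several IDs are separated by semicolons, that decoy and contaminant entries carry the prefixes REV__ and CON__, and that rev_con controls both. The helper names and the example IDs are hypothetical. Organism and reviewed filtering require UniProt lookups and are left out.

```python
def filter_cell(cell, rev_con=False):
    """Filter one semicolon-separated protein ID cell (assumed format)."""
    ids = [i.strip() for i in cell.split(";") if i.strip()]
    if not rev_con:
        # Assumption: decoys and contaminants are marked with REV__ / CON__ prefixes.
        ids = [i for i in ids if not i.startswith(("REV__", "CON__"))]
    return ";".join(ids)


def filter_rows(cells, rev_con=False, keep_empty=False):
    """Apply filter_cell to every cell; drop emptied cells unless keep_empty is set."""
    filtered = [filter_cell(c, rev_con) for c in cells]
    return filtered if keep_empty else [c for c in filtered if c]


cells = ["ID_A;REV__ID_B", "CON__ID_C", "ID_D"]  # placeholder IDs for illustration
print(filter_rows(cells))                   # ['ID_A', 'ID_D']
print(filter_rows(cells, keep_empty=True))  # ['ID_A', '', 'ID_D']
print(filter_rows(cells, rev_con=True))     # all three cells unchanged
```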
The remap_genenames function remaps protein IDs to their associated gene names, offering various modes for filling empty entries or optionally replacing existing gene names.
- data: A DataFrame containing a column with protein IDs.
- mode: Mode of refilling. Options: all, fasta, uniprot, uniprot_primary, uniprot_one. Default is uniprot_one.
- protein_column: Name of the column with protein IDs. Default is FilteredProteinIDs.
- gene_column: (Optional) Name of the column with gene names. Default is Gene names.
- res_column: Name of the column for results. If NULL, the gene_column will be overwritten. Default is RemappedGeneNames.
- skip_filled: (Optional) A boolean value indicating if already filled gene name cells should be skipped. Default is TRUE.
- organism: (Optional) Specify the organism for matching IDs. Options: human, rat, mouse, rabbit. Default is rat.
- fasta: (Optional) FASTA file to use when mode is all or fasta. Default is NULL.
- keep_empty: A boolean value indicating if rows with empty reduced gene names should be retained. Default is TRUE.
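The interaction between skip_filled and an existing gene name column can be sketched as follows. This is a simplified Python illustration, not the package's code: remap_sketch is a hypothetical helper, the lookup dictionary stands in for the UniProt or FASTA lookups, and the IDs and names are placeholders.

```python
def remap_sketch(rows, lookup, skip_filled=True):
    """Return one remapped gene name per row.

    rows: list of (protein_id, gene_name) pairs; gene_name may be "".
    lookup: dict protein ID -> gene name (hypothetical stand-in for the real lookups).
    """
    remapped = []
    for protein_id, gene_name in rows:
        if skip_filled and gene_name:
            # Already filled cells are kept as they are.
            remapped.append(gene_name)
        else:
            # Empty cells (or all cells when skip_filled is False) are looked up.
            remapped.append(lookup.get(protein_id, ""))
    return remapped


lookup = {"ID_A": "GeneA"}  # placeholder mapping
rows = [("ID_A", ""), ("ID_A", "OldName"), ("ID_X", "")]
print(remap_sketch(rows, lookup))                     # ['GeneA', 'OldName', '']
print(remap_sketch(rows, lookup, skip_filled=False))  # ['GeneA', 'GeneA', '']
```

With skip_filled left at its default, existing gene names are never overwritten; only the empty cell in the first row is filled.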
The map_orthologs function maps gene names from one organism to their orthologs in another organism. This function is useful for cross-species analyses and comparative genomics studies.
- data: A DataFrame containing a column with gene names.
- gene_column: The name of the column in data that contains the gene names. Default is RemappedGeneNames.
- organism: The organism corresponding to the current gene names in the data. Default is rat.
- tar_organism: The target organism to which you want to map the orthologs. Default is human.
- res_column: The name of the column where the orthologs will be stored. If NULL, the gene_column will be overwritten. Default is Orthologs.
- keep_empty: A boolean value indicating whether rows with empty cells (where orthologs could not be found) should be kept or deleted. Default is TRUE.
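How res_column and keep_empty shape the result can be shown with a short Python sketch. It is an illustration of the described behavior only, not the package's implementation: orthologs_sketch is a hypothetical helper, rows are plain dictionaries rather than a DataFrame, and the mapping table holds placeholder names instead of real ortholog pairs.

```python
def orthologs_sketch(rows, table, gene_column="RemappedGeneNames",
                     res_column="Orthologs", keep_empty=True):
    """Attach an ortholog to each row; table is a hypothetical source -> target mapping."""
    # With res_column=None the gene column itself is overwritten.
    target = res_column if res_column is not None else gene_column
    result = []
    for row in rows:
        ortholog = table.get(row[gene_column], "")
        if not ortholog and not keep_empty:
            continue  # drop rows for which no ortholog was found
        new_row = dict(row)  # copy so the input rows stay unchanged
        new_row[target] = ortholog
        result.append(new_row)
    return result


table = {"gene_a": "GENE_A"}  # placeholder mapping
rows = [{"RemappedGeneNames": "gene_a"}, {"RemappedGeneNames": "gene_b"}]

print(orthologs_sketch(rows, table))                    # both rows kept, second one empty
print(orthologs_sketch(rows, table, keep_empty=False))  # only the gene_a row
print(orthologs_sketch(rows, table, res_column=None))   # gene column overwritten
```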
The create_overview_plot function generates an overview plot of the logging data. This plot provides a high-level summary of the data processing steps.
- logging: A DataFrame containing the overview logging data returned by one of the ProHarMeD methods.
- out_dir: (Optional) The directory where the plot will be saved.
- file_type: (Optional) The file format for the saved plot. Options: png, pdf, jpg.
The create_filter_detailed_plot function creates a detailed plot specifically for the logging data of the filter_protein_ids method. This plot provides insights into how the protein IDs were filtered based on various criteria.
- logging: A DataFrame containing detailed logging data returned by the filter_protein_ids method.
- organism: The organism specified in the filter_protein_ids method.
- reviewed: The reviewed parameter value used in the filter_protein_ids method.
- decoy: The rev_con parameter value used in the filter_protein_ids method.
- out_dir: (Optional) The directory where the plot will be saved.
- file_type: (Optional) The file format for the saved plot.
The create_remap_detailed_plot function generates a detailed plot for the logging data associated with the remap_genenames method. This plot provides insights into the gene name remapping process.
- logging: A DataFrame containing detailed logging data returned by the remap_genenames method.
- out_dir: (Optional) The directory where the plot will be saved.
- file_type: (Optional) The file format for the saved plot.
The create_ortholog_detailed_plot function generates a detailed plot for the logging data associated with the map_orthologs method. This plot illustrates the mapping of orthologs from one organism to another.
- logging: A DataFrame containing detailed logging data returned by the map_orthologs method.
- organism: The source organism used in the map_orthologs method.
- out_dir: (Optional) The directory where the plot will be saved.
- file_type: (Optional) The file format for the saved plot.