Skip to content

Input format error #8

@hanclair

Description

@hanclair

Hello,

I am hitting an error with what I assume to be a file format issue. This is the error message that I get:

2022-09-13 10:09:24,357 | tejaas.utils.args | INFO | Running TEJAAS v1.0.1
2022-09-13 10:09:24,360 | tejaas.utils.args | INFO | Method: rr
2022-09-13 10:09:24,360 | tejaas.utils.args | INFO | Null Model: perm
2022-09-13 10:09:24,360 | tejaas.utils.args | INFO | Sigma_beta: 0.1
2022-09-13 10:09:24,360 | tejaas.main | DEBUG | Using 2 cores
2022-09-13 10:09:24,389 | tejaas.iotools.data | DEBUG | Reading expression levels for trans-eQTL discovery
2022-09-13 10:09:24,769 | tejaas.iotools.data | DEBUG | Reading expression levels for target gene discovery
2022-09-13 10:09:25,710 | tejaas.iotools.data | DEBUG | Found 7155 genes of 218 samples
2022-09-13 10:09:25,710 | tejaas.iotools.data | DEBUG | Reading gencode file for gene information
2022-09-13 10:09:30,776 | tejaas.iotools.data | DEBUG | Selecting common samples of genotype and gene expression
2022-09-13 10:09:30,776 | tejaas.iotools.data | DEBUG | Before expression selection: 7155 genes from 218 samples
Traceback (most recent call last):
  File "/groups/stern/home/hanc/miniconda3/envs/py39/bin/tejaas", line 8, in <module>
    sys.exit(main())
  File "/groups/stern/home/hanc/miniconda3/envs/py39/lib/python3.9/site-packages/tejaas/main.py", line 73, in main
    data.load()
  File "/groups/stern/home/hanc/miniconda3/envs/py39/lib/python3.9/site-packages/tejaas/iotools/data.py", line 248, in load
    expression_selected = rpkm._normalize_expr(expression[:, exprmask][indices, :])
IndexError: arrays used as indices must be of integer (or boolean) type

This is my Tejaas command:

mpirun -n ${NCORE} ${RUN_PATH} --vcf ${GENOFILE} --include-SNPs ${INCSTRING} --chrom ${CHROM} \
                               --gx ${GXPRFILE} --gxcorr ${GXPRFILE} --gtf ${GENEINFO} --trim --outprefix ${OUTPREFIX_RR} \
                               --method rr --prior-sigma 0.1 --knn 30 --cismask --null perm --psnpthres 0.1 --pgenethres 0.1 --biotype 

This is the format of my expression file:

Gene_IDs        A001    A002    A003    A004    A005   
g5649   4.80323384689818        5.02117811663358        3.92602967965938        3.92602967965938        4.29661519218949        
g5653   4.28097534826939        5.02117811663358        3.92602967965938        3.92602967965938        3.92602967965938        
g5654   3.92602967965938        3.92602967965938        5.08096788235617        5.17450993990352        4.29661519218949        

My VCF file:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  A001    A002    A003    A004    A005    
1       243     ScPoQx8_4268;HRSCAF=5273_243    T       C       131     .       .       GT:PL   0/1:64,0,34     0/0:.   ./.:.   0/0:.   ./.:.   
1       247     ScPoQx8_4268;HRSCAF=5273_247    A       G       125     .       .       GT:PL   0/1:64,0,35     0/0:.   ./.:.   0/0:.   ./.:.     
1       275     ScPoQx8_4268;HRSCAF=5273_275    C       T       159     .       .       GT:PL   0/0:.   0/0:.   ./.:.   0/0:.   ./.:.   

My annotation file:

ScPoQx8_4355;HRSCAF=5393        AUGUSTUS        gene    1       1540    0.98    -       .       ID "g1"; gene_type "protein_coding";
ScPoQx8_4355;HRSCAF=5393        AUGUSTUS        mRNA    1       1540    0.98    -       .       ID "g1.t1"; Parent "g1" gene_type "protein_coding";
ScPoQx8_4355;HRSCAF=5393        AUGUSTUS        CDS     1110    1540    0.98    -       0        Parent "g1.t1" gene_type "protein_coding";
ScPoQx8_4355;HRSCAF=5393        AUGUSTUS        exon    1110    1540    .       -       .        Parent "g1.t1" gene_type "protein_coding";
ScPoQx8_4355;HRSCAF=5393        AUGUSTUS        gene    15395   25390   0.65    +       .       ID "g2"; gene_type "protein_coding";
ScPoQx8_4355;HRSCAF=5393        AUGUSTUS        mRNA    15395   25390   0.65    +       .       ID "g2.t1"; Parent "g2" gene_type "protein_coding";
ScPoQx8_4355;HRSCAF=5393        AUGUSTUS        CDS     15395   15430   0.65    +       0        Parent "g2.t1" gene_type "protein_coding";
ScPoQx8_4355;HRSCAF=5393        AUGUSTUS        exon    15395   15430   .       +       .        Parent "g2.t1" gene_type "protein_coding";

This GTF was generated from a GFF3 using rtracklayer in R, and then I manually added gene_type "protein_coding"; to the last field.

Troubleshooting that I tried:

  1. Judging by the solution from the closed issue "GTF read error", I assumed my issue is the GTF file format. I tried to verify that the expression and VCF file formats are acceptable by test running Tejaas with my expression and VCF files and the gencode.v19 GTF that came with the example, but hit the same error as above.
  2. I thought perhaps Tejaas does not allow for VCF with missing data, so I removed all the sites with any missing genotypes in the VCF but also hit the same error as above. (Are VCFs with missing genotypes allowed?)
  3. I simplified the chromosome names in the annotation file to match those from gencode, like this:
chr1    AUGUSTUS        gene    28451   29982   0.29    -       .       ID "g13599"; gene_type "protein_coding";
chr1    AUGUSTUS        mRNA    28451   29982   0.29    -       .       ID "g13599.t1"; Parent "g13599" gene_type "protein_coding";
chr1    AUGUSTUS        CDS     28451   28712   0.43    -       1        Parent "g13599.t1" gene_type "protein_coding";
chr1    AUGUSTUS        exon    28451   28712   .       -       .        Parent "g13599.t1" gene_type "protein_coding";
chr1    AUGUSTUS        CDS     29864   29982   0.29    -       0        Parent "g13599.t1" gene_type "protein_coding";
chr1    AUGUSTUS        exon    29864   29982   .       -       .        Parent "g13599.t1" gene_type "protein_coding";

but ended up with a different error:

2022-09-13 12:23:29,867 | tejaas.utils.args | INFO | Running TEJAAS v1.0.1
2022-09-13 12:23:29,872 | tejaas.utils.args | INFO | Method: rr
2022-09-13 12:23:29,872 | tejaas.utils.args | INFO | Null Model: perm
2022-09-13 12:23:29,872 | tejaas.utils.args | INFO | Sigma_beta: 0.1
2022-09-13 12:23:29,872 | tejaas.main | DEBUG | Using 2 cores
2022-09-13 12:23:29,909 | tejaas.iotools.data | DEBUG | Reading expression levels for trans-eQTL discovery
2022-09-13 12:23:30,386 | tejaas.iotools.data | DEBUG | Reading expression levels for target gene discovery
2022-09-13 12:23:31,327 | tejaas.iotools.data | DEBUG | Found 7155 genes of 218 samples
2022-09-13 12:23:31,327 | tejaas.iotools.data | DEBUG | Reading gencode file for gene information
Traceback (most recent call last):
  File "/groups/stern/home/hanc/miniconda3/envs/py39/bin/tejaas", line 8, in <module>
    sys.exit(main())
  File "/groups/stern/home/hanc/miniconda3/envs/py39/lib/python3.9/site-packages/tejaas/main.py", line 73, in main
    data.load()
  File "/groups/stern/home/hanc/miniconda3/envs/py39/lib/python3.9/site-packages/tejaas/iotools/data.py", line 241, in load
    gene_info = readgtf.gencode(self.args.gtf_file, trim=self.args.gxtrim, biotype=self.args.biotype)
  File "/groups/stern/home/hanc/miniconda3/envs/py39/lib/python3.9/site-packages/tejaas/iotools/readgtf.py", line 48, in gencode
    gene_name = infolist[4].strip().split(' ')[1].replace('"','')
IndexError: list index out of range

Now I am not sure which input file is likely causing the issue.
I am able to successfully run the example files that came with the download.

Any suggestions would much appreciated. Thank you.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions