kfuku52 / amalgkit Goto Github PK

RNA-seq data amalgamation for a large-scale evolutionary transcriptomics

License: BSD 3-Clause "New" or "Revised" License

Python 55.13% Shell 0.07% R 44.81%

amalgkit's Introduction

Overview

AMALGKIT (/əm`ælgkit/) is a toolkit to integrate RNA-seq data from the NCBI SRA database and from private fastq files to generate unbiased cross-species transcript abundance dataset for a large-scale evolutionary gene expression analysis.

Installation

# Installation with pip
pip install git+https://github.com/kfuku52/amalgkit

# This should show complete options
amalgkit -h

Functions

See Wiki for details.

amalgkit metadata: NCBI SRA metadata retrieval
amalgkit integrate: Appending local fastq info to a metadata table
amalgkit config: Creating a series of config files for the metadata selection
amalgkit select: Selecting SRA entries for analysis
amalgkit getfastq: Generating fastq files
amalgkit quant: Transcript abundance estimation
amalgkit merge: Generating transcript abundance tables
amalgkit cstmm: Cross-species TMM normalization using single-copy genes
amalgkit curate: Automatic removal of outlier samples and unwanted biases
amalgkit csca: Generating plots with cross-species correlation analysis
amalgkit sanity: Checking the integrity of AMALGKIT input and output files

Citation

Although AMALGKIT supports novel unpublished functions, some functionalities including metadata curation, expression level quantification, and further curation steps have been described in this paper, in which we reported the transcriptome amalgamation of 21 vertebrate species.

Fukushima K*, Pollock DD*. 2020. Amalgamated cross-species transcriptomes reveal organ-specific propensity in gene expression evolution. Nature Communications 11: 4459 (DOI: 10.1038/s41467-020-18090-8) open access

Licensing

amalgkit is BSD-licensed (3 clause). See LICENSE for details.

amalgkit's People

Contributors

Stargazers

Watchers

Forkers

docxology

amalgkit's Issues

raise Exception if returncode is not zero.

It would be nice for debugging if amalgkit throws an error when external tools do not exit correctly. Here is an example code. Could you insert similar assertions or raise Exception() in relevant places?

    run_iqtree = subprocess.run([g['iqtree_exe'], '-s', g['alignment_file'], '-te', g['rooted_tree_file'],
                                 '-m', g['iqtree_model'], '--seqtype', 'CODON'+str(g['genetic_code']),
                                 '--threads-max', str(g['threads']), '-T', 'AUTO', '--ancestral', '--rate', '--redo'],
                                stdout=sys.stdout, stderr=sys.stderr)
    assert (run_iqtree.returncode==0), "IQ-TREE did not finish safely: {}".format(run_iqtree.stdout.decode('utf8'))

curate infiles

@Hego-CCTB Some of the curate infile PATHs are handled in a weird way. It's jointed with --work_dir like below, so it assumes a truncated relative PATH to be provided. This is not something users can easily expect. Could you change them so that users can provide the full PATH or the complete, not truncated, relative PATH?

meta_out = os.path.join(args.work_dir, args.metadata)

(By the way, --metafile will be renamed to --metadata in curate in my next push to make it consistent with the other amalgkit subcommand.)

cstmm output files

I noted that cstmm generates normalization_factor_histogram.pdf and normalization_factor.tsv under --work_dir. These files should be stored under a new subdirectory.

Metadata assisted quantification (batch mode)

I'm currently looking at ~2000 SRR Runs of about 20 species. Up until now I manually ordered all downloaded files by species and let a local bash script run amalgkit quant for me. For this amount of data, though, this seems unpractical.

Since the metadata file contains all relevant information, it should be fairly easy to implement automatic quantification, as long as the user provides the folder containing all getfastq outputs, as well as an Index folder containing relevant kallisto indices.

While I'm at it, I can turn this into a --batch mode, where all quantification processes are sent to a cluster queue.

infer quant --ref and --index

As reference fasta should be identifiable from the input metadata table (by species name), we should support quant --ref "infer", which automatically detects the reference fasta in the default reference directory (I think we don't have such dir yet, so let's say amalgkit_out/reference/). File names should be like Arabidopsis_thaliana_ANYANNOTATIONHERE.fasta.gz. quant --index "infer" should also be supported similarly.

versions

As some people have started using this program, we need a well-defined version number to easily track down a problem. I guess you can learn how to do this from csubst. The version number is defined in init.py and loaded by setup.py and other modules when needed.
https://github.com/kfuku52/csubst

minimal test dataset

To make testing easier and quicker, we should find a minimal dataset, and run most, if not all, functionalities from metadata to curate with it. Ideally,

small file size of .sra: some bacterial dataset?
multiple BioProjects: 2 or 3?
2 species
reference transcriptome fasta files are downloadable, maybe from amalgkit repo
pre-calculated orthofinder outputs are downloadable, maybe from amalgkit repo

obsoleted options in getfastq

NCBI no longer supports the aspera download, so we should obsolete these options.

--ascp yes|no         default=no: Download SRA files using ascp instead of http protocol.
  --ascp_exe PATH       default=ascp: PATH to ascp executable.
  --ascp_key PATH       default=: PATH to ascp key. See https://www.ncbi.nlm.nih.gov/books/NBK242625/
  --ascp_option STR     default=-v -k 1 -T -l 300m: ascp options.

simplify subparsers

The arguments shared among subparsers can be simplified like this. We can specify parents= in aadd_parser.
https://stackoverflow.com/questions/7498595/python-argparse-add-argument-to-multiple-subparsers

Implement catfastq

Adding a new sub-command that fetches RNA-seq SRA files and generates assembly-ready concatenated fastq files. Input files should be the outputs of amalgkit metadata. This feature wouldn't be immediately necessary for @Hego-CCTB's project but needed for my collaboration so I will work on it.

Expected dependency

pfastq-dump or parallel-fastq-dump

change --work_dir

--work_dir may not be the best name to represent the functionality of this option. People would expect something like --outdir, because --work_dir sounds like it's a place to generate temporary files.

too stringent outlier removal

@Hego-CCTB I remember that I sent suggestions to you regarding this issue. Could you give me some updates?

species-wise merge

merge should be done for each species but not implemented so currently.

quant overhaul

While fixing some minor bugs in quant, it became clear that quant needs some love.

obsolete --build_index
make metadata a required input
read library type(single/paired end) and fragment length directly from the local metadata, instead of querying SRA again.
the way kallisto is called currently makes it harder to implement alternative quantification programs (such as salmon)
theoretically, the mapping rate can be directly taken from the kallisto output

@kfuku52 what's your opinion on this?

amalgkit batch jobs

It seems you are interested in running amalgkit in parallel; Here is the script with which I downloaded/quantified 161 SRAs using amalgkit.
amalgkit.sh.txt

now rownames in curate outputs

Currently, curate output tsv files do not contain gene IDs as rownames. I'll fix it later today.

Implement quant

The metadata command should be followed by the quant command which isn't implemented yet. The command should look like:
amalgkit quant --id SRR6337924 --fastp yes --threads 8 --ref Anolis_carolinensis_genes.fasta

quant should deal with the following steps for one SRA entry at a time:

Downloading a SRA file by parallel-fastq-dump or pfastq-dump
Data cleaning by fastp
Quantification by kallisto (other tools can be supported in future)

We have a prototype bash file for this command.

getfastq --id is currently broken

when just looking for an SRA ID instead of the metadata.tsv, I get this error:

amalgkit getfastq --threads 8 --id SRR7699519 -e [email protected]
amalgkit getfastq: start
pigz found. It will be used for compression/decompression in read name formatting.
--id is specified. Downloading SRA metadata from Entrez.
Traceback (most recent call last):
File "/Users/s229181/anaconda/anaconda3/envs/dev/bin/amalgkit", line 254, in
args.handler(args)
File "/Users/s229181/anaconda/anaconda3/envs/dev/bin/amalgkit", line 31, in command_getfastq
getfastq_main(args)
File "/Users/s229181/anaconda/anaconda3/envs/dev/lib/python3.7/site-packages/amalgkit/getfastq.py", line 517, in getfastq_main
metadata = getfastq_metadata(args)
File "/Users/s229181/anaconda/anaconda3/envs/dev/lib/python3.7/site-packages/amalgkit/getfastq.py", line 477, in getfastq_metadata
search_term = getfastq_search_term(sra_id, args.entrez_additional_search_term)
NameError: name 'sra_id' is not defined

Looking through the code, "sra_id" is only assigned by accessing the metadata.tsv, which is mutually exclusive to --id. I think the easiest solution would be as it was before the "metadata update", by creating a new metadata.tsv with just a single entry and have the rest of the code run as it is right now.

Ah, also we need to add PigZ to the dependencies.

layout annotation error?

SRR9020424 is annotated as a paired-end sequencing, but its fastq-dump output looks like those generated from single-end SRA. Such SRA will be detected/treated as single-end with a warning. I'm working on it.

consistent colors between curate and csca

Tissue colors do not match between curate and csca pdf files.

cstmm single copy extraction

Currently, in cstmm, single-copy orthologs are obtained using OrthoFinder's Orthogroups_SingleCopyOrthologues.txt. But this doesn't allow users to add more species to OrthoFinder for better accuracy and to extract genes that are single-copy in amalgkit input species, but not necessarily among all species. Orthogroups.GeneCount.tsv will allow such procedure.

AttributeError: 'NoneType' object has no attribute 'decode'

I'll fix this.

[quant] fragment length distribution is truncated gaussian with mean = 200, sd = 20
[index] k-mer length: 31
[index] number of targets: 23,471
[index] number of k-mers: 18,099,550
[index] number of equivalence classes: 37,507
[quant] running in single-end mode
[quant] will process file 1: /lustre6/home/lustre1/kfuku/my_project/daniel_friedman/20210402_amalgkit/amalgkit_out/getfastq/SRR1036398/SRR1036398.amalgkit.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 11,911,736 reads, 0 reads pseudoaligned
[~warn] no reads pseudoaligned.
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 52 rounds
[~warn] Warning, zero reads pseudoaligned check your input files and index

Traceback (most recent call last):
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/bin/amalgkit", line 326, in <module>
    args.handler(args)
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/bin/amalgkit", line 43, in command_quant
    quant_main(args)
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/lib/python3.8/site-packages/amalgkit/quant.py", line 106, in quant_main
    call_kallisto(args, in_files, metadata,sra_id,output_dir,index)
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/lib/python3.8/site-packages/amalgkit/quant.py", line 58, in call_kallisto
    assert (kallisto_out.returncode == 0), "kallisto did not finish safely: {}".format(kallisto_out.stdout.decode('utf8'))
AttributeError: 'NoneType' object has no attribute 'decode'

ValueError: Cannot convert non-finite values (NA or inf) to integer

I'll fix this.

Traceback (most recent call last):
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/bin/amalgkit", line 326, in <module>
    args.handler(args)
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/bin/amalgkit", line 33, in command_getfastq
    getfastq_main(args)
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/lib/python3.8/site-packages/amalgkit/getfastq.py", line 663, in getfastq_main
    total_sra_bp = int(metadata.df['total_bases'].astype(int).sum())
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/lib/python3.8/site-packages/pandas/core/generic.py", line 5874, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 631, in astype
    return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 427, in apply
    applied = getattr(b, f)(**kwargs)
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 673, in astype
    values = astype_nansafe(vals1d, dtype, copy=True)
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/lib/python3.8/site-packages/pandas/core/dtypes/cast.py", line 1068, in astype_nansafe
    raise ValueError("Cannot convert non-finite values (NA or inf) to integer")
ValueError: Cannot convert non-finite values (NA or inf) to integer

AssertionError: None of expected extensions (.amalgkit.fastq.gz .rename.fastq.gz .fastp.fastq.gz .fastq.gz) found

fixed in feca4a6

Traceback (most recent call last):
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/bin/amalgkit", line 326, in <module>
    args.handler(args)
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/bin/amalgkit", line 43, in command_quant
    quant_main(args)
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/lib/python3.8/site-packages/amalgkit/quant.py", line 78, in quant_main
    ext = get_newest_intermediate_file_extension(sra_stat, work_dir=output_dir_getfastq)
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/lib/python3.8/site-packages/amalgkit/util.py", line 51, in get_newest_intermediate_file_extension
    assert 'ext_out' in locals(), 'None of expected extensions ('+' '.join(extensions)+') found in '+work_dir
AssertionError: None of expected extensions (.amalgkit.fastq.gz .rename.fastq.gz .fastp.fastq.gz .fastq.gz) found in /lustre6/home/lustre1/kfuku/my_project/daniel_friedman/20210402_amalgkit/amalgkit_out/getfastq/SRR584147

getfastq and prefetch 2.8.0.

currently doing a fresh install of amalgkit and dependencies on a virtual machine and it looks like getfastq is currently incompatible with the latest version of prefetch.

it looks like prefetch got rid of the --output_directory parameter, now throwing an error in getfastq saying it doesn't know the parameter. The fix is simple, we just have to remove the --ouptut_directory parameter from amalgkit.

The question is, do we update amalgkit to the latest prefetch version, or do we just require an old version of prefetch instead?

cstmm inputs

It looks like cstmm is decoupled from merge, because merge doesn't produce a directory which can be specified as --count in cstmm. Do you manually edit file names and directory structures?

curate output files

curate generates all output files under --work_dir. They should be organized into sub-directories under amalgkit_out/curate/.

num_read_fastp

I couldn't identify when num_read_fastp is added to the metadata table, and its lack caused an error in the curate step. Am I missing something?

integrating private fastq files

Is there any streamlined way to integrate your own fastq files not derived from SRA? If not, I'll implement a subcommand for it but I assume there is, as you work on both public and private data. Please let me know.

conda recipe

Currently, it's not a high-priority issue, but we would need this upon publication.

getfastq download option

Since NCBI migrated to AWS/GCP, the ASPERA/FTP downloads are not working. We should deprecate related options.

I'm not sure if prefetch tries downloading from NCBI's server only, but if so, we should add new options for the https download from AWS and GCP. AWS download was significantly faster than NCBI, at least in this data.
https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=ERR3638806

cstmm R package dependency

Currently, cstmm depends on these:

library(limma)
library(tidyr)
library(dplyr)

Do they all really necessary? In principle, dep packages should be minimal, and we should use base R functions instead of packages if the same process can be accomplished in a not-so-complicated way. we do need edgeR but how do we use limma-specific functions here? tidyr and dplyr speed up our exploratory data analysis, but for programs like amalgkit, they are more like so-called "technical debt" because they are not essential but increase the installation complexity.

Amalgkit 0.6.0.0 changelog

This is going to be a bigger update, affecting multiple currently open issues. So I'll post the changelog in here and refer to this from the other issues.

Implement curate

After quant, quantified expression data should be subjected to iterative deletions of strange RNA-seq data and the removal of hidden biases. We would implement this step in a new command curate.

Because the prototype script is already written in R, we can just wrap it using the python interface.
https://github.com/kfuku52/amalgkit/blob/master/util/transcriptome_curation.r

The minimum command should look like:
amalgkit curate --metadata metadata_all_species.tsv --species "Homo sapiens" --threads 8

In this step, all RNA-seq data within a species should be handled simultaneously. curate should be the final step of amalgkit and the curated data should be ready for downstream analyses such as tree-based Ornstein-Uhlenbeck modeling of expression level.

Introduce a natural language processing for metadata curation

Currently metadata command relies on the huge keyword library in the .config files. This could be largely replaced/automatically-generated by a natural language processing technique. Even if we successfully replaced it, the .config files shouldn't be completely abandoned because it provides a platform of manual tweaking by users.

free keyword config

Currently the config files lack the one for free keywords (i.e., keywords that are not associated with a particular Entrez search field), but search_term_tissue.config can be used as is, so why don't we rename it as search_term_keyword.config?

add developer

Add you as a developer in setup.py

restarting with preexisting directory

When I use getfastq, create_run_dir() always tries to make a new directory. Could you implement an option to use a specified preexisting directory so that amalgkit can detect some files like .sra and skip time-consuming steps?

ModuleNotFoundError: No module named 'nltk'

Hi Matthias: I got this error in using getfastq. Could you update setup.py and README for this new dependency?

Traceback (most recent call last):
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/bin/amalgkit", line 220, in <module>
    args.handler(args)
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/bin/amalgkit", line 28, in command_getfastq
    from amalgkit.getfastq import getfastq_main
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/lib/python3.6/site-packages/amalgkit/getfastq.py", line 2, in <module>
    from amalgkit.metadata import Metadata
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/lib/python3.6/site-packages/amalgkit/metadata.py", line 5, in <module>
    from nltk.stem import WordNetLemmatizer
ModuleNotFoundError: No module named 'nltk'

TMM normalization

Here is the script I used for TMM normalization in this work.

multispecies_normalization.ipynb.zip

Could you integrate it in the curate step or as a new subcommand?

species directory name

The directory names of the merge outputs look like this:/merge/GENUS SPECIES
A space in PATH is a cause of trouble. Could you change to this? /merge/GENUS_SPECIES

BUSCO for cstmm

Typically, more single-copy genes can be detected with BUSCO compared with Orthofinder. It doesn't hold the highest priority, but the BUSCO input for cstmm could be a future improvement.

deprecated --threads

Why did you remove --threads from quant in 970b0ca?

KeyError: 0

Traceback (most recent call last):
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1625, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1632, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/bin/amalgkit", line 326, in <module>
    args.handler(args)
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/bin/amalgkit", line 43, in command_quant
    quant_main(args)
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/lib/python3.8/site-packages/amalgkit/quant.py", line 95, in quant_main
    call_kallisto(args, in_files, metadata,sra_id,output_dir,index)
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/lib/python3.8/site-packages/amalgkit/quant.py", line 17, in call_kallisto
    nominal_length = metadata.df['nominal_length'][0]
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/lib/python3.8/site-packages/pandas/core/series.py", line 851, in __getitem__
    return self._get_value(key)
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/lib/python3.8/site-packages/pandas/core/series.py", line 959, in _get_value
    loc = self.index.get_loc(label)
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3082, in get_loc
    raise KeyError(key) from err
KeyError: 0

new amalgkit subcommand for cross-species correlation analysis

--batch base

Currently, --batch takes zero-based index, so, on SGE for example, array jobs should be submitted like --batch $[${SGE_TASK_ID}-1]. I'm thinking if it would be better if the index starts from one, not zero. The one-based index is also easy to locate data when browsing the metadata table with Excel. @Hego-CCTB If there is no issue, I'd like to change it.

Adding amalgkit sanity

Ideas for adding a new functionality to amalgkit, called amalgkit sanity. The original idea was having a functionality that's basically an automatic check-list. Looking for presence/absence of required inputs and outputs of the various core amalgkit functionality (metadata, getfastq, quant and curate). In other issues here, we found other purposes for this as well, so here is a list of things that can be handled by amalgkit sanity:

Checking presence/absence of inputs/outputs (absence would throw a warning)
amalgkit quant --index "infer" may call upon amalgkit sanity (as mentioned in #28), although this is could be handled by quant alone.
Auto generation of metadata for private fastq files (as mentioned in #27)

I think this is more than enough to justify a separate functionality, rather than expanding existing ones. Any other ideas for tasks that could be handled by sanity?

curate plot overhaul with ggplot2

Currently, curate generates a diagnostic plot (heatmap, dendrogram....) using R's base function with many packages. At some point, this should be replaced with ggplot2's framework for better extensibility and unified package dependency.

amalgkit/amalgkit/transcriptome_curation.r

Lines 4 to 13 in 62ac91b

 library(pcaMethods, quietly = TRUE) 

 library(colorspace, quietly = TRUE) 

 library(RColorBrewer, quietly = TRUE) 

 library(sva, quietly = TRUE) 

 library(MASS, quietly = TRUE) 

 library(NMF, quietly = TRUE) 

 library(dendextend, quietly = TRUE) 

 library(amap, quietly = TRUE) 

 library(pvclust, quietly = TRUE) 

 library(Rtsne, quietly = TRUE)

new SVA

FPKM transformation may be incorrect.

In transcriptome_curation.r, I found that 1 is added to count values in FPKM calculation. This wouldn't be correct.

res = exp(log(counts + 1) - log(effective_lengths) - log(sum(counts)) + log(1e+09))

	library(pcaMethods, quietly = TRUE)
	library(colorspace, quietly = TRUE)
	library(RColorBrewer, quietly = TRUE)
	library(sva, quietly = TRUE)
	library(MASS, quietly = TRUE)
	library(NMF, quietly = TRUE)
	library(dendextend, quietly = TRUE)
	library(amap, quietly = TRUE)
	library(pvclust, quietly = TRUE)
	library(Rtsne, quietly = TRUE)