PyKofamSearch

A fast implementation of KofamScan, built on PyHMMER and optimized for high-memory systems. PyKofamSearch accepts FASTA input in uncompressed or gzip format and databases in either HMM or Python pickle serialized format. No intermediate files are created.
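
To illustrate what accepting both uncompressed and gzipped FASTA involves, here is a minimal standard-library sketch. It is not the code PyKofamSearch uses; the function names `open_maybe_gzip` and `read_fasta` are hypothetical, and the choice to detect gzip by its magic bytes rather than by file extension is an assumption made for this example.

```python
import gzip


def open_maybe_gzip(path):
    """Open a text file, decompressing it on the fly if it is gzipped."""
    # Detect gzip by its two magic bytes instead of trusting the extension.
    with open(path, "rb") as f:
        is_gzip = f.read(2) == b"\x1f\x8b"
    return gzip.open(path, "rt") if is_gzip else open(path, "rt")


def read_fasta(path):
    """Yield (identifier, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open_maybe_gzip(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                # Keep only the first whitespace-delimited token as the ID.
                fields = line[1:].split()
                header, chunks = (fields[0] if fields else ""), []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

Calling `list(read_fasta("test/test.faa.gz"))` would load every record into memory, which is in keeping with the high-memory design described above.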

Benchmarking

Database       Tool           Single Threaded  12 Threads
Full Database  PyKofamSearch  21:34            2:45
Full Database  KofamScan      21:53            3:40
Enzymes        PyKofamSearch  7:39             0:56

* Times are in minutes:seconds for the 4977 proteins in test/test.faa.gz.
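
The speedups implied by the table can be worked out with a few lines of arithmetic. This sketch assumes the times are in minutes:seconds and that the KofamScan row refers to the full database; the helper names `to_seconds` and `speedup` are hypothetical.

```python
def to_seconds(mmss):
    """Convert a 'minutes:seconds' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def speedup(slower, faster):
    """Return how many times faster `faster` is than `slower`."""
    return to_seconds(slower) / to_seconds(faster)


# (database, tool) -> (single-threaded time, 12-thread time), from the table
benchmarks = {
    ("Full Database", "PyKofamSearch"): ("21:34", "2:45"),
    ("Full Database", "KofamScan"): ("21:53", "3:40"),
    ("Enzymes", "PyKofamSearch"): ("7:39", "0:56"),
}

for (database, tool), (single, multi) in benchmarks.items():
    print(f"{database}\t{tool}\t{speedup(single, multi):.2f}x with 12 threads")

# 12-thread PyKofamSearch relative to 12-thread KofamScan on the full database
print(f"PyKofamSearch vs KofamScan: {speedup('3:40', '2:45'):.2f}x")
```

Under those assumptions, going from 1 to 12 threads gives roughly 7.84x for PyKofamSearch and 5.97x for KofamScan on the full database, and 8.20x for PyKofamSearch on the enzyme subset. At 12 threads PyKofamSearch is about 1.33x faster than KofamScan.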

Official benchmarking for the hmmsearch algorithm implemented in PyHMMER against HMMER, from Larralde et al. 2023:

[Figure: PyHMMER vs. HMMER hmmsearch benchmark, Larralde et al. 2023]

Usage:

PyKofamSearch is recommended for systems with 1) high RAM; 2) a large number of threads; and/or 3) metered disk reads/writes (e.g., AWS EFS). It is also useful when querying a large number of proteins.

  • Using the official KOfam database files:

    # Download database
    DATABASE_DIRECTORY=/path/to/database_directory/
    mkdir -p ${DATABASE_DIRECTORY}/Annotate/KOFAM/
    wget -v -O - ftp://ftp.genome.jp/pub/db/kofam/ko_list.gz | gzip -d > ${DATABASE_DIRECTORY}/Annotate/KOFAM/ko_list
    wget -v -c ftp://ftp.genome.jp/pub/db/kofam/profiles.tar.gz -O - | tar -xz -C ${DATABASE_DIRECTORY}/Annotate/KOFAM/
    
    # Run PyKofamSearch
    pykofamsearch.py -i test/test.faa.gz -o output.tsv -d ${DATABASE_DIRECTORY}/Annotate/KOFAM -p=-1
  • Using the serialized database files:

    The database can be an uncompressed or gzipped pickle.

    # Full database
    pykofamsearch.py -i test/test.faa.gz  -o output.tsv -b ~/Databases/KOFAM/database.pkl.gz -p=-1
    
    # Enzymes only
    pykofamsearch.py -i test/test.faa.gz  -o output.enzymes.tsv -b ~/Databases/KOFAM/database.enzymes.pkl.gz -p=-1
  • Grouping hits by query protein:

    reformat_pykofamsearch.py -i pykofamsearch_output.tsv -o pykofamsearch_output.reformatted.tsv
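
As a rough illustration of how a serialized database in either form (uncompressed or gzipped pickle) can be read with one code path, consider the sketch below. The function name `load_serialized_database` is hypothetical, and the README does not describe the internal structure of the pickled object, so nothing here assumes one. Loading the real database presumably requires PyHMMER to be importable, since the pickle would hold its objects.

```python
import gzip
import pickle


def load_serialized_database(path):
    """Load a pickled object from an uncompressed or gzipped file."""
    # Detect gzip by its two magic bytes instead of trusting the extension.
    with open(path, "rb") as f:
        is_gzip = f.read(2) == b"\x1f\x8b"
    opener = gzip.open if is_gzip else open
    with opener(path, "rb") as handle:
        # Unpickling can execute arbitrary code: only load trusted files.
        return pickle.load(handle)
```

Because the whole object is deserialized into memory at once, this approach trades RAM for the cost of re-parsing individual HMM files on every run.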

Options:

$ pykofamsearch.py -h
usage: pykofamsearch.py -i <proteins.fasta> -o <output.tsv> -d

    Running: pykofamsearch.py v2024.4.18 via Python v3.10.14 | /Users/jolespin/miniconda3/envs/kofamscan_env/bin/python3.10

options:
-h, --help            show this help message and exit

I/O arguments:
-i PROTEINS, --proteins PROTEINS
                        path/to/proteins.fasta. stdin does not stream and loads everything into memory. [Default: stdin]
-o OUTPUT, --output OUTPUT
                        path/to/output.tsv [Default: stdout]
--no_header           No header

Utility arguments:
-p N_JOBS, --n_jobs N_JOBS
                        Number of threads to use [Default: 1]

HMMSearch arguments:
-e EVALUE, --evalue EVALUE
                        E-value threshold [Default: 0.1]
-a, --all_hits        Return all hits and do not use curated threshold. Not recommended for large queries.

Database arguments:
-d DATABASE_DIRECTORY, --database_directory DATABASE_DIRECTORY
                        path/to/kofam_database_directory/ cannot be used with -b/-serialized_database
-b SERIALIZED_DATABASE, --serialized_database SERIALIZED_DATABASE
                        path/to/database.pkl cannot be used with -d/--database_directory
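
The -a/--all_hits description implies that, by default, hits are filtered by a curated per-KO threshold. The sketch below shows one way such a filter could be applied after the fact to an output table. It assumes the output is tab-delimited with the header columns listed in the Outputs section, and that a hit passes when its score is at least the threshold. The function name `filter_by_threshold` and the decision to skip rows without a numeric threshold are choices made for this example, not documented behavior.

```python
import csv


def filter_by_threshold(handle):
    """Yield rows whose score meets or exceeds the curated threshold."""
    for row in csv.DictReader(handle, delimiter="\t"):
        try:
            threshold = float(row["threshold"])
        except (TypeError, ValueError):
            # No numeric threshold for this KO: skipped in this sketch.
            continue
        if float(row["score"]) >= threshold:
            yield row
```

A typical use would be `with open("output.tsv") as f: kept = list(filter_by_threshold(f))` on a table produced with -a.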

Outputs:

  • From pykofamsearch.py:

    id_protein id_ko threshold score e-value definition
    SRR13615825__k127_135326_1 K00012 377.73 6.245e+02 7.34282e-188 UDPglucose 6-dehydrogenase [EC:1.1.1.22]
    SRR13615825__k127_87070_1 K00012 377.73 4.751e+02 1.15847e-142 UDPglucose 6-dehydrogenase [EC:1.1.1.22]
    SRR13615825__k127_278295_3 K00020 348.7 3.639e+02 5.39377e-109 3-hydroxyisobutyrate dehydrogenase [EC:1.1.1.31]
    SRR13615825__k127_23043_1 K00033 157.6 3.598e+02 8.14325e-108 6-phosphogluconate dehydrogenase [EC:1.1.1.44 1.1.1.343]
    SRR13615825__k127_278295_3 K00042 389.27 3.941e+02 2.58098e-118 2-hydroxy-3-oxopropionate reductase [EC:1.1.1.60]
  • From reformat_pykofamsearch.py:

    id_protein number_of_hits ids names evalues scores
    SRR13615825__k127_135326_1 1 ['K00012'] ['UDPglucose 6-dehydrogenase [EC:1.1.1.22]'] [7.34282e-188] [624.5]
    SRR13615825__k127_87070_1 1 ['K00012'] ['UDPglucose 6-dehydrogenase [EC:1.1.1.22]'] [1.15847e-142] [475.1]
    SRR13615825__k127_278295_3 2 ['K00020', 'K00042'] ['3-hydroxyisobutyrate dehydrogenase [EC:1.1.1.31]', '2-hydroxy-3-oxopropionate reductase [EC:1.1.1.60]'] [5.39377e-109, 2.58098e-118] [363.9, 394.1]
    SRR13615825__k127_23043_1 1 ['K00033'] ['6-phosphogluconate dehydrogenase [EC:1.1.1.44 1.1.1.343]'] [8.14325e-108] [359.8]
    SRR13615825__k127_72951_1 1 ['K00053'] ['ketol-acid reductoisomerase [EC:1.1.1.86]'] [1.54591e-108] [362.2]
  • From reformat_pykofamsearch.py with -b/--best_hits_only:

    id_protein id name evalue score
    SRR13615825__k127_135326_1 K00012 UDPglucose 6-dehydrogenase [EC:1.1.1.22] 7.34282e-188 624.5
    SRR13615825__k127_87070_1 K00012 UDPglucose 6-dehydrogenase [EC:1.1.1.22] 1.15847e-142 475.1
    SRR13615825__k127_278295_3 K00042 2-hydroxy-3-oxopropionate reductase [EC:1.1.1.60] 2.58098e-118 394.1
    SRR13615825__k127_23043_1 K00033 6-phosphogluconate dehydrogenase [EC:1.1.1.44 1.1.1.343] 8.14325e-108 359.8
    SRR13615825__k127_72951_1 K00053 ketol-acid reductoisomerase [EC:1.1.1.86] 1.54591e-108 362.2
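
The relationship between the three tables above can be sketched in a few lines: hits are grouped by query protein, and the best-hits view keeps one hit per protein. This is not the code of reformat_pykofamsearch.py. It assumes a tab-delimited input with the header shown for pykofamsearch.py, and it assumes "best" means highest score, which agrees with the example rows but is not stated in the README. The names `group_hits` and `best_hits` are hypothetical.

```python
import csv
import io
from collections import defaultdict


def group_hits(handle):
    """Map each protein to its list of (id_ko, definition, evalue, score)."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        grouped[row["id_protein"]].append(
            (row["id_ko"], row["definition"],
             float(row["e-value"]), float(row["score"]))
        )
    return dict(grouped)


def best_hits(grouped):
    """Keep the highest-scoring hit per protein (assumed definition of best)."""
    return {protein: max(hits, key=lambda hit: hit[3])
            for protein, hits in grouped.items()}


# Two rows from the example output above, joined with tabs.
example = io.StringIO(
    "id_protein\tid_ko\tthreshold\tscore\te-value\tdefinition\n"
    "SRR13615825__k127_278295_3\tK00020\t348.7\t3.639e+02\t5.39377e-109\t"
    "3-hydroxyisobutyrate dehydrogenase [EC:1.1.1.31]\n"
    "SRR13615825__k127_278295_3\tK00042\t389.27\t3.941e+02\t2.58098e-118\t"
    "2-hydroxy-3-oxopropionate reductase [EC:1.1.1.60]\n"
)
grouped = group_hits(example)
print(best_hits(grouped)["SRR13615825__k127_278295_3"][0])  # K00042
```

For SRR13615825__k127_278295_3 this selects K00042, matching the best-hits table.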

If you use this tool, please cite the following sources:

  • Aramaki T, Blanc-Mathieu R, Endo H, Ohkubo K, Kanehisa M, Goto S, Ogata H. KofamKOALA: KEGG Ortholog assignment based on profile HMM and adaptive score threshold. Bioinformatics. 2020 Apr 1;36(7):2251-2252. doi: 10.1093/bioinformatics/btz859. PMID: 31742321; PMCID: PMC7141845.

  • Eddy SR. Accelerated Profile HMM Searches. PLoS Comput Biol. 2011 Oct;7(10):e1002195. doi: 10.1371/journal.pcbi.1002195. Epub 2011 Oct 20. PMID: 22039361; PMCID: PMC3197634.

  • Larralde M, Zeller G. PyHMMER: a Python library binding to HMMER for efficient sequence analysis. Bioinformatics. 2023 May 4;39(5):btad214. doi: 10.1093/bioinformatics/btad214. PMID: 37074928; PMCID: PMC10159651.

Notes:

PyKofamSearch output differs slightly from KofamScan output. For example, in the test case KofamScan reports 1188 significant hits while PyKofamSearch reports 1190. All hits from KofamScan are present in the PyKofamSearch output.
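
A comparison like this can be checked with set operations once both outputs are reduced to (protein, KO) pairs. The sketch below uses toy data only; parsing each tool's output into pairs is left out, and the function name `compare_hits` is hypothetical.

```python
def compare_hits(kofamscan_hits, pykofamsearch_hits):
    """Compare two collections of (id_protein, id_ko) pairs."""
    a, b = set(kofamscan_hits), set(pykofamsearch_hits)
    return {
        "shared": a & b,
        "kofamscan_only": a - b,
        "pykofamsearch_only": b - a,
    }


# Toy pairs for illustration; these are not results from the test case.
kofamscan = [("protein_1", "K00012"), ("protein_2", "K00020")]
pykofamsearch = [("protein_1", "K00012"), ("protein_2", "K00020"),
                 ("protein_3", "K00033")]

result = compare_hits(kofamscan, pykofamsearch)
# An empty "kofamscan_only" set means every KofamScan hit was recovered.
print(len(result["shared"]), len(result["kofamscan_only"]),
      len(result["pykofamsearch_only"]))
```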

License:

The code for PyKofamSearch is licensed under the MIT License.

Please contact [email protected] regarding any licensing concerns.
