big-scape's People

Contributors

adraismawur, emzodls, jorgecnavarrom, satriaphd

big-scape's Issues

Implement pyhmmer

When a user runs BiG-SCAPE
Then pyhmmer should be used for finding and aligning protein domains
So that a user does not need to install these dependencies beforehand

Tasks:

  • Choose a native HMMER implementation
  • Implement HMMER in Python
  • Replace native hmmscan
  • Replace native hmmalign

Implement BiG-SLICE pre-processing

Integrate as much of BiG-SLICE as possible in the application. There are a number of things in BiG-SLICE that can improve the application:

  • Use SQLite for data storage
    • Load GBKs
    • Do hmmscan
    • Do hmmalign
    • Generate distances
    • Generate networks
  • Generate features using BiG-SLICE data
  • Try pre-filtering

Clean up schema

From #25 task "Clean up schema"

Clean up the SQLite schema by removing any tables or indices that are not used in the actual workflow.
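
As a starting point for this cleanup, the tables and indices present in a database can be listed from SQLite's built-in `sqlite_master` catalog and compared against the names the workflow is known to use. This is a minimal sketch; the function names and the `used_names` set are hypothetical, and the real schema is not shown in this issue.

```python
import sqlite3


def list_schema_objects(con):
    """Return (type, name, tbl_name) for every user-defined table and index."""
    rows = con.execute(
        "SELECT type, name, tbl_name FROM sqlite_master "
        "WHERE type IN ('table', 'index') "
        "AND substr(name, 1, 7) != 'sqlite_' "  # skip SQLite's internal objects
        "ORDER BY type, name"
    )
    return rows.fetchall()


def find_unused(con, used_names):
    """Names that exist in the schema but are not in the set the workflow uses.

    `used_names` is an assumption: it has to be collected by hand (or by
    searching the code base) before this check means anything.
    """
    return [
        name
        for _type, name, _table in list_schema_objects(con)
        if name not in used_names
    ]
```

Anything this reports is only a candidate for removal; each name still needs a manual check before it is dropped.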

Something reports "Ignored unknown character X (seen 1 times)" to console

A module used during distance calculation prints a message directly to the console. The message does not appear in the log file.

Excerpt from log:

2022-02-03 13:29:46,924 INFO      NRPS (616 BGCs)
2022-02-03 13:29:46,925 INFO       Writing annotation files
2022-02-03 13:29:46,929 INFO       Calculating all pairwise distances
2022-02-03 13:30:42,530 INFO       Removing 550 non-relevant MIBiG BGCs
2022-02-03 13:30:42,530 INFO       Writing output files
2022-02-03 13:30:42,870 INFO      Calling Gene Cluster Families
2022-02-03 13:30:43,421 INFO      Cutoff: 0.3
**Ignored unknown character X (seen 1 times)**
2022-02-03 13:30:45,030 INFO      Others (671 BGCs)
2022-02-03 13:30:45,031 INFO       Writing annotation files
2022-02-03 13:30:45,035 INFO       Calculating all pairwise distances
2022-02-03 13:31:03,199 INFO       Removing 558 non-relevant MIBiG BGCs
2022-02-03 13:31:03,200 INFO       Writing output files
2022-02-03 13:31:03,600 INFO      Calling Gene Cluster Families
2022-02-03 13:31:04,457 INFO      Cutoff: 0.3

This message should be written to the log at INFO level instead. It is also printed when quiet mode is enabled.
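
One possible way to route such output into the log is to capture standard output around the offending call and forward each line to a logger. The sketch below assumes the message is written through Python's `sys.stdout`; if it comes from native code writing to the file descriptor directly, this approach would not catch it. The name `stdout_to_log` is hypothetical.

```python
import contextlib
import io
import logging


@contextlib.contextmanager
def stdout_to_log(logger, level=logging.INFO):
    """Capture anything printed inside the block and forward it to `logger`."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            yield
    finally:
        # Forward captured lines even if the wrapped call raised.
        for line in buffer.getvalue().splitlines():
            if line.strip():
                logger.log(level, line.strip())
```

Wrapping only the call that produces the message keeps the rest of the console output untouched, and the logger's own level and handlers then decide whether the line is shown, which also makes it respect quiet mode.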

Measure RAM consumption

When the application is running
Then the application should measure the RAM consumed during the process
So that users may see the load on the system, and
So that developers may spot increases, decreases or no change in memory consumption during development

Tasks:

  • Find a memory consumption tracker, preferably a native one
  • Implement tracker
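
A minimal sketch of such a tracker, using the standard library's `tracemalloc` as one candidate for a "native" option. Note the assumption: `tracemalloc` only sees allocations made through Python's allocator, so memory used by native libraries may be under-reported. The name `track_peak_memory` is hypothetical.

```python
import contextlib
import logging
import tracemalloc


@contextlib.contextmanager
def track_peak_memory(label, logger=None):
    """Measure Python-level memory allocated while the block runs."""
    logger = logger or logging.getLogger(__name__)
    stats = {}
    tracemalloc.start()
    try:
        yield stats
    finally:
        current, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        stats["current_bytes"] = current
        stats["peak_bytes"] = peak
        logger.info("%s: peak Python allocations %.1f MiB", label, peak / 2**20)
```

Logging the peak per step would let developers compare runs across commits and spot increases or decreases in memory consumption.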

Finalizing tasks

Tasks to finish before 30-06-2022

  • Resolve differences in results with master branch
  • PEP-8 pass
  • Reorganize files where necessary
  • Test flows
    • Global
    • Glocal
    • Mix
    • Query
  • Fix database progress checking (hmmalign)
  • Optimize data IO
  • Docstring for each method
  • Docstring for each class
  • Clean up schema
  • Comment pass

Reduce thread downtime between BGC families

Currently, some single-threaded work runs between BGC families during pairwise comparison.

It probably comes mostly from these three components:

  • Pre-processing of data. Potential fix: process all data with multiple threads before doing pairwise comparison.
  • Post-processing of data. Potential fix: first collect all comparisons, then do post-processing.
  • Clustering. Potential fix: cluster after collecting all pairwise comparisons instead of per family, using multiple threads.
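
The "collect first, post-process afterwards" idea can be sketched as follows: pairs from every family are submitted to one shared pool, so workers never wait at a family boundary, and results are grouped per family only once everything is done. This is a sketch under assumptions: `families` (a mapping of family name to members) and `distance` are hypothetical stand-ins, and a thread pool is used for brevity; for CPU-bound distance functions a process pool would be the more likely choice.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def all_pairs(families):
    """Yield (family, a, b) for every within-family pair, across all families."""
    for family, members in families.items():
        for a, b in combinations(members, 2):
            yield family, a, b


def compare_all(families, distance, workers=4):
    """Run every pairwise comparison in one pool, then group by family."""
    tasks = list(all_pairs(families))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of `tasks`.
        distances = list(pool.map(lambda t: distance(t[1], t[2]), tasks))

    # Post-processing happens once, after all comparisons are collected.
    results = {}
    for (family, a, b), dist in zip(tasks, distances):
        results.setdefault(family, []).append((a, b, dist))
    return results
```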

BGC data overhaul

Development task

BGC data is currently scattered. Most of it lives in BgcInfo since 0781e72, but some still remains in bgctools.BgcData.

Either unify this or migrate over to the storage method BiG-SLICE is using.

Tasks:

  • Implement BiG-SLICE style storage

    • Domain hit information to SQLite
    • MSA information to SQLite
    • Biosynthetic gene info to SQLite
    • Taxonomy information to SQLite
    • Rework ArrowerSVG to use database
    • Rework generate_network to use database
    • Remove any unused code
    • Comment changes
    • Test other workflows
  • Refactor BgcData

  • Deduplicate data

  • Optimize complexities (lists instead of sets & vice versa)
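
To illustrate what moving domain hit information to SQLite could look like, here is a minimal sketch with two tables. The table and column names (`bgc`, `hsp`, `bitscore`) and the helper functions are assumptions made for illustration; they are not the actual BiG-SLICE or BiG-SCAPE schema.

```python
import sqlite3

# Hypothetical schema: one row per BGC, one row per domain hit.
SCHEMA = """
CREATE TABLE IF NOT EXISTS bgc (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS hsp (
    id INTEGER PRIMARY KEY,
    bgc_id INTEGER NOT NULL REFERENCES bgc(id),
    domain TEXT NOT NULL,
    bitscore REAL NOT NULL
);
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the tables exist."""
    con = sqlite3.connect(path)
    con.executescript(SCHEMA)
    return con


def add_domain_hits(con, bgc_name, hits):
    """Store (domain, bitscore) hits for one BGC in a single transaction."""
    with con:
        con.execute("INSERT OR IGNORE INTO bgc (name) VALUES (?)", (bgc_name,))
        (bgc_id,) = con.execute(
            "SELECT id FROM bgc WHERE name = ?", (bgc_name,)
        ).fetchone()
        con.executemany(
            "INSERT INTO hsp (bgc_id, domain, bitscore) VALUES (?, ?, ?)",
            [(bgc_id, domain, score) for domain, score in hits],
        )


def domains_for(con, bgc_name):
    """Return the domains of one BGC in insertion order."""
    rows = con.execute(
        "SELECT hsp.domain FROM hsp JOIN bgc ON bgc.id = hsp.bgc_id "
        "WHERE bgc.name = ? ORDER BY hsp.id",
        (bgc_name,),
    )
    return [row[0] for row in rows]
```

Keeping each kind of information in one table is also one way to address the deduplication task, since every consumer reads from the same place.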

Commit to python3

Development task:

Commit to using Python 3. This mostly means removing any backwards compatibility with Python 2.

Tasks:

  • Remove backwards compatibility
  • Use f-strings

Automatically download MIBiG files

When a user indicates that MIBiG files are used, and a MIBiG version is set or implied,
Then a set of GBK files for that MIBiG version should be downloaded and extracted
So that the users are not required to do this manually, and
So that the repository does not need to retain the MIBiG files

Tasks:

  • Automatically download MIBiG tgz
  • Extract MIBiG files
  • Decide how to handle discrepancies between the downloaded and the included MIBiG files
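
The download and extraction tasks could be sketched with the standard library alone. The download URL is deliberately left as a parameter because this issue does not specify one; the function names are hypothetical. The extraction step writes only `.gbk` members, flattened into the target directory, which also avoids writing outside of it.

```python
import tarfile
import urllib.request
from pathlib import Path


def download(url, target):
    """Download `url` to `target` (the MIBiG URL itself is not given here)."""
    target = Path(target)
    target.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(target))
    return target


def extract_gbks(archive, dest):
    """Extract every .gbk file from a .tar.gz archive into `dest`."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar.getmembers():
            if not (member.isfile() and member.name.endswith(".gbk")):
                continue
            # Keep only the file name so nothing lands outside `dest`.
            out_path = dest / Path(member.name).name
            with tar.extractfile(member) as src, open(out_path, "wb") as dst:
                dst.write(src.read())
            extracted.append(out_path)
    return sorted(extracted)
```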

Implement proper logger

When the application is executed, and
When the user has specified a log level or one is implied (default = warning)
Then a proper logger should be used to display logs on the CLI and to write a log to a file
So that users can choose what severity of logs to be informed of

Tasks:

  • Choose a logger
  • Implement logger
    • Logging to terminal
    • Logging to file
    • Choose log level
  • Replace print statements
  • Replace empty sys.exit or sys.exit(0) with sys.exit(1) where there is an error case
  • Replace sys.exit with log.error + sys.exit(1)
  • Implement quiet mode
  • Handle Sklearn warning
  • Implement log directory option
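
Several of these tasks map onto Python's standard `logging` module. The sketch below is one possible setup, not the project's actual logger: the names `init_logger` and `fail`, the logger name, and the log file name are all assumptions. It covers terminal and file output, a selectable level with the default of warning, quiet mode, a log directory option, and the `log.error` plus `sys.exit(1)` pattern.

```python
import logging
import sys
from pathlib import Path


def init_logger(name="bigscape", level="WARNING", log_dir=None, quiet=False):
    """Configure a logger that writes to the terminal and optionally a file."""
    logger = logging.getLogger(name)
    logger.setLevel(getattr(logging, level.upper()))
    logger.handlers.clear()
    logger.propagate = False
    fmt = logging.Formatter("%(asctime)s %(levelname)-8s %(message)s")

    if not quiet:
        # Quiet mode simply omits the terminal handler.
        terminal = logging.StreamHandler(sys.stderr)
        terminal.setFormatter(fmt)
        logger.addHandler(terminal)

    if log_dir is not None:
        Path(log_dir).mkdir(parents=True, exist_ok=True)
        file_handler = logging.FileHandler(Path(log_dir) / f"{name}.log")
        file_handler.setFormatter(fmt)
        logger.addHandler(file_handler)

    if not logger.handlers:
        # Without any handler, logging would fall back to printing warnings.
        logger.addHandler(logging.NullHandler())
    return logger


def fail(logger, message):
    """Log an error and exit with a non-zero status."""
    logger.error(message)
    sys.exit(1)
```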

Optimize threading

Development task

Distance calculation:

  • Replace distance calculation threading with in/out queue
  • Optimize thread number
  • #14
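
An in/out queue arrangement for distance calculation might look like the sketch below: workers pull pairs from an input queue and push results onto an output queue until they receive a stop marker. Threads and `queue.Queue` are used here only to keep the example short; the names are hypothetical, and it assumes `distance` does not raise.

```python
import queue
import threading

_STOP = object()  # sentinel telling a worker to exit


def _worker(in_queue, out_queue, distance):
    """Consume pairs until the sentinel arrives, emitting (a, b, distance)."""
    while True:
        item = in_queue.get()
        if item is _STOP:
            break
        a, b = item
        out_queue.put((a, b, distance(a, b)))


def run_with_queues(pairs, distance, workers=4):
    """Compute distances for all pairs; result order is not guaranteed."""
    in_queue, out_queue = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=_worker, args=(in_queue, out_queue, distance))
        for _ in range(workers)
    ]
    for thread in threads:
        thread.start()

    submitted = 0
    for pair in pairs:
        in_queue.put(pair)
        submitted += 1
    for _ in threads:
        in_queue.put(_STOP)  # one sentinel per worker

    results = [out_queue.get() for _ in range(submitted)]
    for thread in threads:
        thread.join()
    return results
```

Because workers stay alive for the whole run, the cost of starting them is paid once, and the `workers` argument gives a single place to tune the thread number.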

Hmm

Implement filters

Implement two types of filter:

  • Jaccard filter: assumes that pairs with a Jaccard index < threshold are unrelated
  • Feature filter: calculates BiG-SLICE-like features from Pfam-A domain bitscores and compiles them into a distance. Assumes that pairs with a feature distance > threshold are unrelated

Tasks:

  • Implement Jaccard filter
  • Add Jaccard filter to configuration
  • Generate features from Pfam-A
  • Filter using Pfam-A features
  • Add Pfam-A feature filter to configuration
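
The Jaccard filter can be sketched in a few lines. The Jaccard index of two domain sets is the size of their intersection divided by the size of their union; pairs scoring below the threshold are dropped before the more expensive distance calculation. The function names and the `domains` mapping are hypothetical, and treating two empty sets as index 0 is an assumption of this sketch.

```python
def jaccard_index(domains_a, domains_b):
    """|A ∩ B| / |A ∪ B| over the sets of domains; 0.0 if both are empty."""
    a, b = set(domains_a), set(domains_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def jaccard_filter(pairs, domains, threshold):
    """Keep only pairs whose Jaccard index is at least `threshold`."""
    return [
        (x, y)
        for x, y in pairs
        if jaccard_index(domains[x], domains[y]) >= threshold
    ]
```

For example, domain sets {A, B, C} and {B, C, D} share two of four distinct domains, giving an index of 0.5.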

Automatically download Pfam

When a storage location for the Pfam files is specified or implied (default [big_scape_dir]/pfam)
When a version of Pfam is specified or implied (default 31.0)
Then that version of the Pfam .hmm file should be downloaded and extracted
So that the users are not required to do this manually

Tasks:

  • Implement checking for Pfam at the specified directory
  • Implement Pfam version command line argument
  • Implement downloading the relevant Pfam release
  • Implement extracting Pfam to the directory
  • Implement running hmmpress on the downloaded Pfam file
  • Remove downloading from the Dockerfile
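
The directory check and the hmmpress step could be sketched as below. `hmmpress` writes four index files next to the `.hmm` file (`.h3f`, `.h3i`, `.h3m`, `.h3p`), so their presence can be used to tell whether pressing is still needed. The file name `Pfam-A.hmm` and the function names are assumptions, and `press` requires the `hmmpress` executable to be on the PATH.

```python
import subprocess
from pathlib import Path

# Index files that hmmpress creates alongside the .hmm file.
PRESSED_SUFFIXES = (".h3f", ".h3i", ".h3m", ".h3p")


def pfam_status(pfam_dir, hmm_name="Pfam-A.hmm"):
    """Return 'missing', 'unpressed' or 'pressed' for the given directory."""
    hmm = Path(pfam_dir) / hmm_name
    if not hmm.is_file():
        return "missing"
    if all(Path(str(hmm) + suffix).is_file() for suffix in PRESSED_SUFFIXES):
        return "pressed"
    return "unpressed"


def press(pfam_dir, hmm_name="Pfam-A.hmm"):
    """Run hmmpress on the .hmm file; raises if the command fails."""
    subprocess.run(["hmmpress", str(Path(pfam_dir) / hmm_name)], check=True)
```

A caller could then download and extract only when the status is "missing", and run `press` only when it is "unpressed".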
