cazy_webscraper's People

Contributors

hobnobmancer, widdowquinn

cazy_webscraper's Issues

handling blank pdb accession from uniprot

Please complete this report in full and as much detail as possible. It will help with getting the bug fixed far sooner!

Describe the bug

With the addition of AlphaFold-predicted 3D structures to the structures table retrieved from UniProt, the PDB accession column is empty for the predicted AlphaFold structure(s). The null value in the PDB accession column is added to the local database, but it should not be.

To Reproduce

cw_get_uniprot_data <path to local database> --pdb

Expected behavior

If no PDB accession is included in the PDB accession cell of the UniProt structure table, no PDB accession should be added to the local CAZyme database.
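
A minimal sketch of the expected filtering, using hypothetical helper and column names rather than the real cazy_webscraper internals:

def filter_pdb_accessions(structure_rows):
    """Keep only rows of the UniProt structures table with a real PDB accession."""
    kept = []
    for row in structure_rows:
        # AlphaFold-predicted entries leave the PDB accession cell empty;
        # skip None, "" and whitespace-only values instead of storing them
        pdb_accession = (row.get("pdb_accession") or "").strip()
        if pdb_accession:
            kept.append(row)
    return kept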

ModuleNotFoundError: No module named 'cazy_webscraper.cazy_webscraper'

Describe the bug

I finished installing the cazy-webscraper package using 'pip install cazy-webscraper'. When I try to use it, I get the following error: "ModuleNotFoundError: No module named 'cazy_webscraper.cazy_webscraper'". I then tried to install from the source code and got the same error. I don't know the reason. Can anyone assist me? Thanks!

Where is the downloaded data?

Describe the bug

Issuing the command:

cazy_webscraper --families GH169

downloads no data and creates no database - only log files:

CAZy_connection_failures_CW_2021-07-13--06-48-12.log
Format_and_parsing_errors_CW_2021-07-13--06-48-12.log
SQL_errors_CW_2021-07-13--06-48-12.log

I expected there to be some downloaded data in a local database.

Using the command:

cazy_webscraper --families GH169 --verbose -o cazydb

places the log files and an SQLite3 database in the directory cazydb:

$ ls -1 cazydb
CAZy_connection_failures_CW_2021-07-13--06-49-43.log
Format_and_parsing_errors_CW_2021-07-13--06-49-43.log
SQL_errors_CW_2021-07-13--06-49-43.log
cazy_scrape_2021-07-13--06-49-43.db
$ sqlite3 cazydb/cazy_scrape_2021-07-13--06-49-43.db 
SQLite version 3.36.0 2021-06-18 18:36:39
Enter ".help" for usage hints.
sqlite> .tables
cazymes           cazymes_pdbs      genbanks          taxs            
cazymes_ecs       cazymes_uniprots  kingdoms          uniprots        
cazymes_families  ecs               logs            
cazymes_genbanks  families          pdbs            
sqlite> SELECT * FROM families;
1|GH169|

Expected behavior

A local database should be created containing the requested download data.

BLAST database from GenBank retrieval

To save time for the user, it could be helpful to have the option to enable cazy_webscraper to build a local BLAST database from the sequences retrieved from GenBank.

If download is interrupted, no intermediate results are stored.

Downloading significant amounts of data may take some time. If the download is interrupted for any reason, the script stops, and none of the gathered data is available to the user. This could be extremely frustrating and discourage reuse.

Some options to provide kinder behaviour could include:

  • place all downloaded data in a local SQLite3 database (my preferred option; SQLite is transactional, so a transaction must complete for the database to update - this avoids partial data dumps; the database is also persistent and can readily be targeted by expansions/plugins for other tools) - see the sketch after this list
  • write all data to a growing file/files (this could end up with syncing issues if the program is interrupted part-way through a write)
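
A minimal sketch of the transactional approach, assuming a simplified single-table schema (the real database has many more tables):

import sqlite3

connection = sqlite3.connect("cazy_scrape.db")  # hypothetical database path
connection.execute(
    "CREATE TABLE IF NOT EXISTS cazymes (accession TEXT PRIMARY KEY, family TEXT)"
)

def store_batch(rows):
    # Each batch is one transaction: either the whole batch lands in the
    # database or none of it does, so an interrupted download keeps every
    # batch committed before the interruption.
    with connection:  # commits on success, rolls back on exception
        connection.executemany(
            "INSERT OR IGNORE INTO cazymes (accession, family) VALUES (?, ?)",
            rows,
        )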

Use uniprot accessions in `get_uniprot_data`

Is your feature request related to a problem? Please describe.

At present get_uniprot_data only accepts a list of GenBank accessions. However, the UniProt accessions of the proteins of interest may already be in the local CAZyme database, and additional data and/or updated data from UniProt is desired.

Describe the solution you'd like

Provide an option to supply a list of UniProt accessions. This would skip the retrieval of UniProt IDs, and data from UniProt would be retrieved for the corresponding protein records in the local CAZyme database.

Describe alternatives you've considered

The present workaround is to query the SQL database to retrieve the corresponding GenBank accession for each UniProt accession.
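
For illustration, the workaround query might look like the following; the join-table and column names are assumptions based on the schema shown elsewhere in this tracker, not a verbatim copy of the real database:

SELECT uniprots.uniprot_accession, genbanks.genbank_accession
FROM uniprots
JOIN cazymes_uniprots ON cazymes_uniprots.uniprot_id = uniprots.uniprot_id
JOIN cazymes_genbanks ON cazymes_genbanks.cazyme_id = cazymes_uniprots.cazyme_id
JOIN genbanks ON genbanks.genbank_id = cazymes_genbanks.genbank_id
WHERE uniprots.uniprot_accession IN ('P12345', 'Q67890');  -- example accessions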

Add command-line options for class, family, species, etc.

The config file approach is great if you want or need to preserve or repeat your query.

If you want a "quick" search (e.g. when testing) then it would be good to have the option to specify options directly. It also lowers the barrier of entry for new users (not everyone knows how to create, or debug, a YAML file).

For instance:

cazy_webscraper -g [email protected] -o outdir --class GH
cazy_webscraper -g [email protected] -o outdir --family GH1
cazy_webscraper -g [email protected] -o outdir --class GH --species Acinetobacter
cazy_webscraper -g [email protected] -o outdir --family GH1 --species "Pectobacterium atrosepticum" -p pdb

Tutorials

To encourage usage and broaden the pool of potential users, a set of video and written tutorials should be created to walk users through the program.

These tutorials can also walk users through how each of the cmd-line flags alters the behaviour of the scraper.

Use NCBI Tax IDs

At the moment, each taxonomy lineage retrieved from NCBI Taxonomy is assigned a unique internal ID. However, each NCBI Taxonomy ID is already unique, and should be used instead to prevent redundancy.

Crashes when retrieving NCBI seqs: http.client.IncompleteRead

Crashes when retrieving protein sequences from NCBI.

Describe the bug

  • Retrieving the protein sequences for all CBM CAZymes from NCBI
  • Crashed when handling NCBI accessions that had previously failed to be found in NCBI
  • Could also cache the accessions of proteins for which sequences were retrieved while download is progressing

To Reproduce

  1. Build a local CAZyme database: cazy_webscraper [email protected] -o cazydb
  2. Retrieve protein seqs for CBMs: cw_get_genbank_seqs cazydb [email protected] --classes 'CBM'

Error:

Traceback (most recent call last):
  File "/opt/anaconda/lib/python3.8/http/client.py", line 555, in _get_chunk_left
    chunk_left = self._read_next_chunk_size()
  File "/opt/anaconda/lib/python3.8/http/client.py", line 522, in _read_next_chunk_size
    return int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/anaconda/lib/python3.8/http/client.py", line 587, in _readinto_chunked
    chunk_left = self._get_chunk_left()
  File "/opt/anaconda/lib/python3.8/http/client.py", line 557, in _get_chunk_left
    raise IncompleteRead(b'')
http.client.IncompleteRead: IncompleteRead(0 bytes read)

The tool also entered what appears to be an infinite loop for one protein accession, which kept raising:

Runtime error raised when batch quering
Possible result of a accessions not being in NCBI
Attempt identification of the causal accession later

Expected behavior

  • The tool should not crash
  • Cache the accessions of proteins for which protein sequences were retrieved (see the sketch after this list)
  • Also cache the protein sequences themselves, and possibly delete them afterwards - this would allow a download to continue after a crash, instead of having to restart from scratch
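
A hedged sketch of batch retrieval that tolerates dropped connections; the function is hypothetical, not the real cazy_webscraper code, and the caller would cache each successfully handled batch so an interrupted run can resume:

import http.client
import time

from Bio import Entrez

Entrez.email = "user@example.com"  # required by NCBI; use a real address

def fetch_batch_with_retries(accessions, retries=3):
    """Fetch one batch of protein records, retrying on dropped connections."""
    for attempt in range(1, retries + 1):
        try:
            handle = Entrez.efetch(
                db="protein",
                id=",".join(accessions),
                rettype="fasta",
                retmode="text",
            )
            data = handle.read()
            handle.close()
            return data
        except http.client.IncompleteRead:
            if attempt == retries:
                return None  # give up on this batch; caller records it as failed
            time.sleep(10 * attempt)  # back off before retrying the batch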

Use Semantic Versioning

Is your feature request related to a problem? Please describe.

The pre_version_1_release version naming is non-standard and not easily understood by users or developers.

Describe the solution you'd like

We should use semantic versioning to make it easier for ourselves and users to understand what versions mean.

With this scheme, pre_version_1_release might become, for example:

  • 1.0-alpha
  • 1.0-beta
  • 1.0-rc1
  • 1.0-rc2

etc. depending on stage.

No PDB accessions matched and retrieving no protein structure files

I tried to run basic commands from README and Documentation. As my primary goal is to retrieve the PDB files, I started creating a local database with

cazy_webscraper <e-mail> --families GH -o GH.db

And next, some command to get pdb structures

cw_get_pdb_structures GH.db --classes GH pdb

However, I got the following output:

Using default CAZy class synonyms
Applying CAZy class filter(s)
Retrieving GenBank accessions for selected CAZy classes:   0%| | 0/1 [00:00<?, ?Retrieving CAZymes for CAZy class GH
Retrieving GenBank accessions for selected CAZy classes: 100%|█| 1/1 [00:44<00:0
Retrieving GenBank accessions for selected CAZy families: 0it [00:00, ?it/s]
Applying no taxonomic filters
Loading existing PDB db records: 0it [00:00, ?it/s]
Loading existing Genbank_Pdbs db records: 0it [00:00, ?it/s]
No PDB accessions matched the criteria provided.
Retrieving no protein structure files from PDB

I ran different settings with PL, GH, and GT, and I got the same result.

My system configuration:

Linux 5.19.0-35-generic x86_64
Ubuntu 22.04.2 LTS
conda 23.1.0

Potentially streamlined scraping

@widdowquinn I think I've thought of another way to increase the rate of scraping CAZy.

At the moment, when parsing a protein (the current working protein):

  1. The scraper checks if the protein is already present in the local database by querying by the primary GenBank accession.
  2. If the current working protein has already been stored in the local database (identified by its primary GenBank accession), the scraper checks that all UniProt, PDB and non-primary GenBank accessions listed for the current working protein are associated with it in the local database.
  3. If any of the listed UniProt, PDB and non-primary GenBank accessions are not associated with the current working protein then they are added to the local database.

Could we assume that, for every family a protein appears in, its associated data (UniProt, PDB and non-primary GenBank accessions, EC numbers and source organism) are the same? That is, the data presented in the row of the HTML table for a given protein would be identical in every CAZy family HTML table the protein appears in.
By applying this assumption, the number of queries to the local CAZy database can be significantly reduced: the scraper would only need to check whether the current working protein is associated with the current working CAZy family.

An alternative is to add in a streamline-scraping mode that is enabled at the cmd-line and applies this assumption. When invoking this method of scraping, the user would be warned at the beginning that this mode may not retrieve all UniProt, PDB and non-primary GenBank accessions etc., in case of potential inconsistencies in the CAZy dataset or previous errors when retrieving data for the current working protein.

An advancement on the 'streamline-scraping' mode would be to enable users to customise to what extent the streamlining is applied. The user could specify the criteria against which the streamlining is applied; for example, --streamline uniprot,pdb,ec would apply the assumption to UniProt accessions, PDB accessions and EC numbers, but not to source organisms and GenBank accessions. To make life easier, I could add a full option that applies the streamlining to UniProt, PDB and GenBank accessions, EC numbers and source organisms, saving the user from writing out all five. A sketch of the option parsing follows below.
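
A sketch of how the option might be parsed with argparse; the criterion names are taken from the proposal above, but none of this is the actual implementation:

import argparse

STREAMLINE_CRITERIA = {"uniprot", "pdb", "ec", "genbank", "organism"}

def parse_streamline(value):
    """Turn '--streamline uniprot,pdb,ec' into a set of criteria."""
    if value == "full":
        return set(STREAMLINE_CRITERIA)  # apply the assumption everywhere
    criteria = {item.strip() for item in value.split(",")}
    unknown = criteria - STREAMLINE_CRITERIA
    if unknown:
        raise argparse.ArgumentTypeError(f"unknown streamline criteria: {unknown}")
    return criteria

parser = argparse.ArgumentParser()
parser.add_argument("--streamline", type=parse_streamline, default=set())
args = parser.parse_args(["--streamline", "uniprot,pdb,ec"])  # example invocation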

Expand NCBI taxonomy data

Is your feature request related to a problem? Please describe.

Only the name of the species (genus, species and strain) and the taxonomy of the species are retrieved from CAZy.
For improved accessibility to data and increased ability to compare between different levels of taxa, it may be useful to add an option to expand the taxonomy data included in a local CAZyme database.
For example, call to NCBI to retrieve:

  • Taxonomy ID
  • Phylum
  • Class
  • Order
  • Family

Describe the solution you'd like

For each scientific name in a local CAZyme database, use the expand module to retrieve NCBI taxonomy data.
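
A minimal sketch of the retrieval with Bio.Entrez (illustrative only; the real expand module may work differently):

from Bio import Entrez

Entrez.email = "user@example.com"  # required by NCBI; use a real address

def taxid_for_name(scientific_name):
    """Look up the NCBI Taxonomy ID for a scientific name."""
    handle = Entrez.esearch(db="taxonomy", term=scientific_name)
    result = Entrez.read(handle)
    handle.close()
    return result["IdList"][0] if result["IdList"] else None

def lineage_for_taxid(taxid):
    """Return the Taxonomy ID plus phylum/class/order/family for a Tax ID."""
    handle = Entrez.efetch(db="taxonomy", id=str(taxid), retmode="xml")
    record = Entrez.read(handle)[0]
    handle.close()
    wanted = {"phylum", "class", "order", "family"}
    lineage = {
        item["Rank"]: item["ScientificName"]
        for item in record["LineageEx"]
        if item["Rank"] in wanted
    }
    lineage["taxid"] = record["TaxId"]
    return lineage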

Describe alternatives you've considered

N/A

Additional context

N/A

Bio.Entrez NotXMLError

Please complete this report in full and as much detail as possible. It will help with getting the bug fixed far sooner!

Describe the bug

While retrieving protein sequences from NCBI, if the Bio.Entrez NotXMLError is raised, the tool crashes and does not retrieve any of the remaining protein sequences.

To Reproduce

Please include the specific steps (including all code) you performed, so that we can check if the behaviour can be reproduced:

Command: cw_get_genbank_seqs all_cazy_2022-08-22.db <email> --families GH50

Error:

Traceback (most recent call last):
  File "/home/user/anaconda3/.../cw_get_genbank_seqs", line 33, in <module>
    sys.exit(load_entry_point('cazy-webscraper', 'console_scripts', 'cw_get_genbank_seqs')())
  File "/home/user/.../cazy_webscraper/expand/genbank/sequences/get_genbank_sequences.py", line 160, in main
    seq_dict, no_seq = get_sequences(genbank_accessions, args)  # {gbk_accession: seq}
  File "/home/user/.../cazy_webscraper/expand/genbank/sequences/get_genbank_sequences.py", line 297, in get_sequences
    seq_dict, success_accessions, failed_accessions = retry_failed_queries(
  File "/home/user/.../cazy_webscraper/expand/genbank/sequences/get_genbank_sequences.py", line 366, in retry_failed_queries
    new_seq_dict, no_seq = get_sequences(query, args, retry=True)
  File "/home/user/.../cazy_webscraper/expand/genbank/sequences/get_genbank_sequences.py", line 223, in get_sequences
    epost_webenv, epost_query_key = bulk_query_ncbi(query_list, args)
  File "/home/user/.../cazy_webscraper/expand/genbank/sequences/get_genbank_sequences.py", line 337, in bulk_query_ncbi
    epost_result = Entrez.read(
  File "/home/user/anaconda3/.../Bio/Entrez/__init__.py", line 508, in read
    record = handler.read(handle)
  File "/home/user/anaconda3/.../Bio/Entrez/Parser.py", line 345, in read
    raise NotXMLError(e) from None
Bio.Entrez.Parser.NotXMLError: Failed to parse the XML data (no element found: line 1, column 0). Please make sure that the input data are in XML format.

Expected behavior

cazy_webscraper should be able to handle this error and continue retrieving the rest of the protein sequences.
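
A hedged sketch of the expected handling, wrapping the ePost step shown in the traceback; batch bookkeeping is left to the caller:

from Bio import Entrez
from Bio.Entrez.Parser import NotXMLError

def safe_epost(accessions):
    """ePost a batch of accessions, skipping the batch on malformed replies."""
    try:
        handle = Entrez.epost(db="protein", id=",".join(accessions))
        return Entrez.read(handle)
    except NotXMLError:
        # NCBI occasionally returns an empty or non-XML body; record the
        # failed batch and move on so the remaining sequences are retrieved
        return None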

Missing `.gitignore`

Installing cazy_webscraper for development dumps a number of files into the working tree that are not in the git repository and show up as unmanaged files.

A .gitignore file should accompany the repository to avoid inclusion of these unnecessary files.
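
A minimal .gitignore covering the kinds of files reported in this tracker (illustrative; adjust to the repository's layout):

__pycache__/
*.py[cod]
*.egg-info/
build/
dist/
.eggs/
*.log
*.db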

Add PyPI distribution

Instructions are provided for installation via conda, but many people prefer to use pip as their package manager for Python.

Please could cazy_webscraper be distributed via PyPI? (See, e.g. PyPI docs)

Fails to retrieve data from UniProt

Please complete this report in full and as much detail as possible. It will help with getting the bug fixed far sooner!

Describe the bug

Crashes when trying to retrieve data from UniProt.

Raises UnboundLocalError: local variable 'response' referenced before assignment

To Reproduce

Please include the specific steps (including all code) you performed, so that we can check if the behaviour can be reproduced:

  1. Create a local CAZyme database
  2. Retrieve data from UniProt for specific families: cw_get_uniprot_data cazy.db --families CE4,GH28,GH30,PL1,PL3,PL4 --pdb

Traceback

Traceback (most recent call last):
  File "/home/.../.local/bin/cw_get_uniprot_data", line 8, in <module>
    sys.exit(main())
  File "/home/.../.local/lib/python3.8/site-packages/cazy_webscraper/expand/uniprot/get_uniprot_data.py", line 180, in main
    uniprot_gkb_dict = get_uniprot_accessions(gbk_dict, args)  # {uniprot_acc: {'gbk_acc': str, 'db_id': int}}
  File "/home/.../.local/lib/python3.8/site-packages/saintBioutils/uniprot/__init__.py", line 110, in get_uniprot_accessions
    uniprot_batch_response = response.decode('utf-8')
UnboundLocalError: local variable 'response' referenced before assignment
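
The trace suggests response is only assigned on a successful request, so a failed connection leaves the name unbound. A hedged sketch of the defensive pattern, using urllib for illustration (the real saintBioutils code differs):

import urllib.request
from urllib.error import URLError

def query_uniprot(url, retries=3):
    response = None  # ensure the name is always bound
    for _ in range(retries):
        try:
            with urllib.request.urlopen(url) as reply:
                response = reply.read()
            break
        except URLError:
            continue  # retry the request
    if response is None:
        # Previously execution fell through to response.decode(...) and
        # raised UnboundLocalError; fail with an explicit error instead
        raise ConnectionError(f"UniProt query failed after {retries} attempts: {url}")
    return response.decode("utf-8")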

Setup

Please provide a brief summary of your setup/computer you are using. For example:

Desktop (please complete the following information):

  • OS: Ubuntu
  • Version 16.04.5 LTS

Update to `sqlalchemy` 2.x

As mentioned in PR #102, cazy_webscraper was designed to use sqlalchemy version 1.4.20.

At some point cazy_webscraper should be updated to use sqlalchemy version 2.x.
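
For illustration, the main change is moving from the legacy Query API to 2.0-style select() statements; the model below is a hypothetical stand-in for the real tables:

from sqlalchemy import String, create_engine, select
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column

class Base(DeclarativeBase):  # 2.0-style declarative base
    pass

class Genbank(Base):  # hypothetical stand-in for the real Genbanks table
    __tablename__ = "Genbanks"
    genbank_id: Mapped[int] = mapped_column(primary_key=True)
    genbank_accession: Mapped[str] = mapped_column(String, unique=True)

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    # 1.4 style: session.query(Genbank).filter_by(genbank_accession="WP_1").all()
    stmt = select(Genbank).where(Genbank.genbank_accession == "WP_1")
    records = session.scalars(stmt).all()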

Handling CAZymes with no GenBank accession

Is your feature request related to a problem? Please describe.
Not a technical 'bug', but working with proteins without a GenBank accession is difficult in downstream analysis.

To avoid adding duplicate protein records, and to associate all proteomic data with the respective protein, each unique protein is identified by its GenBank protein accession number.

This is fine for proteins that have a GenBank accession number; however, many proteins don't. In these cases, cazy_webscraper pools all these records together and assigns them the GenBank accession 'NA'.

The problem with this is that it dissociates the taxonomic data of the proteins. Proteins with no GenBank accession are all assigned the same taxonomy, CAZy families, PDB accessions, UniProt accessions, etc., which is incorrect.

Describe the solution you'd like

Most proteins that do not have a GenBank accession have a UniProt accession. Therefore, in cases where a protein does not have a GenBank accession, it should be identified by its primary UniProt accession. This would make working with UniProt-sourced proteins a lot easier!

This may be added to cazy_webscraper soon after the release of version 1.

log table contents

Please complete this report in full and as much detail as possible. It will help with getting the bug fixed far sooner!

Describe the bug

The data written to the log table is not clear or human-readable.

Expected behavior

The data written to the log table should be:

  • clear
  • concise
  • readily understood

More (and more useful) information with `-v`/`--verbose`

Describe the solution you'd like

The information provided by -v/--verbose is by default not informative in the way I would expect.

When downloading a CAZy family, I would expect to see the following information:

  • a summary of the command-line options used
  • information about the number of CAZy family members found at CAZy
  • information about the number of CAZy family members downloaded (and why/why not)
  • an indication of where to find the downloaded data

What is provided is:

$ cazy_webscraper --families GH169 -v
[WARNING] [scraper.utilities.parse_configuration]: Using default CAZy class synonyms
Parsing protein pages for GH169: 100%|█████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:10<00:00,  1.30s/it]
Parsing protein pages for GH169: 0it [00:00, ?it/s]                                                                              | 0/172 [00:00<?, ?it/s]
Parsing Glycoside Hydrolases (GHs) families: 100%|█████████████████████████████████████████████████████████████████████| 172/172 [00:20<00:00,  8.30it/s]
Parsing CAZy classes: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:31<00:00, 31.72s/it]
=====================cazy_webscraper=====================
Finished scraping CAZy
Scrape initated at 2021-07-13 06:40:13
Scrape finished at 2021-07-13 06:40:55
Total run time: 0 days 00:00:42
Version: v0.1.6
Thank you for using cazy_webscraper. Expected academic practise is to cite the work of others.
Please cite cazy_webscraper in your work:
Hobbs, Emma E. M.; Pritchard, Leighton; Chapman, Sean; Gloster, Tracey M. (2021):cazy_webscraper Microbiology Society Annual Conference 2021 poster. figshare. Poster.
https://doi.org/10.6084/m9.figshare.14370860.v7

Add CAZy download log information to local database table

The current model stores log information describing the scraping actions that populated a database in "sidecar" log files. To ensure reproducibility, these need to be ported around with the database they refer to.

It may be more robust to add a table to the database that duplicates some of this data - such as date of download and command-lines - so that in isolation users can reconstruct how the data was collected.
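
An illustrative shape for such a table (column names are assumptions, not the real schema):

CREATE TABLE IF NOT EXISTS Logs (
    log_id INTEGER PRIMARY KEY,
    date TEXT NOT NULL,               -- date and time of the download
    cazy_webscraper_version TEXT,     -- version of the tool used
    command TEXT NOT NULL             -- full command-line invocation
);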

Do not make complete download of CAZy the default operation

Is your feature request related to a problem? Please describe.

By making the default operation of cazy_webscraper be "download all of CAZy" it is very easy for users to overwhelm the service, denying it to others.

Describe the solution you'd like

Either restrict operation only to download of specific classes/families, or make it more difficult to specify download of the complete CAZy database.

Whatever solution is used, downloading the entire database should be a conscious act for the user, not the default when running the tool with no arguments.

selecting proteins matching criteria

Please complete this report in full and as much detail as possible. It will help with getting the bug fixed far sooner!

Describe the bug

Sometimes cazy_webscraper fails to retrieve CAZymes matching the user's criteria when multiple filters are applied. The --ec_filter option needs particular attention.

To Reproduce

cw_extract_db_seqs genbank --families GH1,GH2 --ec_filter 3.2.1.*

Workaround

Interrogate the database via native SQL to retrieve a list of protein accessions that match the desired criteria, then use this list to define the CAZymes of interest. For example:
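
An illustrative version of that workaround query; the table and column names are assumptions based on the schema shown elsewhere in this tracker:

SELECT DISTINCT genbanks.genbank_accession
FROM genbanks
JOIN cazymes_genbanks ON cazymes_genbanks.genbank_id = genbanks.genbank_id
JOIN cazymes_families ON cazymes_families.cazyme_id = cazymes_genbanks.cazyme_id
JOIN families ON families.family_id = cazymes_families.family_id
JOIN cazymes_ecs ON cazymes_ecs.cazyme_id = cazymes_genbanks.cazyme_id
JOIN ecs ON ecs.ec_id = cazymes_ecs.ec_id
WHERE families.family IN ('GH1', 'GH2')
  AND ecs.ec_number LIKE '3.2.1.%';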

Failing to retrieve UniProt data

Describe the bug

When using cw_get_uniprot_data to retrieve data from UniProt, no data is retrieved or added to the local CAZyme database.

To Reproduce

  1. Build a local CAZyme database: cazy_webscraper <email> -o cazy.db
  2. Retrieve UniProt data: cw_get_uniprot_data cazy.db --families 20 --pdb

Output:

Built output directory: .cazy_webscraper_2022-11-18_20-03-08/uniprot_data_retrieval
Using default CAZy class synonyms
Retrieving GenBank accessions for selected CAZy classes: 0it [00:00, ?it/s]
Applying CAZy family filter(s)
Retrieving GenBank accessions for selected CAZy families:   0%|                                                           | 0/1 [00:00<?, ?it/s]Retrieving CAZymes for CAZy family PL20
Retrieving GenBank accessions for selected CAZy families: 100%|███████████████████████████████████████████████████| 1/1 [00:00<00:00, 20.02it/s]
Applying no taxonomic filters
Retrieving UniProt data for 76
Batch retrieving UniProt IDs: 11it [00:00, 15.03it/s]                                                                                           
Batch retrieving protein data from UniProt: 0it [00:00, ?it/s]
Adding data to the local CAZyme database
Retrieving existing UniProt records from db: 0it [00:00, ?it/s]
Separating new and existing records: 0it [00:00, ?it/s]
Loading existing PDB db records: 0it [00:00, ?it/s]
Identifying new PDBs to add to db: 0it [00:00, ?it/s]
Loading existing Genbank_Pdbs db records: 0it [00:00, ?it/s]
Identifying new protein-PDB relationships to add to db: 0it [00:00, ?it/s]

No data is retrieved from UniProt.

Expected behavior

Retrieve data from UniProt and add to the local CAZyme database

Incorrect DB schema

Describe the bug

The ER model of the SQLite3 database created by cazy_webscraper defines a many-to-many relationship between cazymes and genbanks (GenBank accessions). However, there should be a one-to-many relationship between cazymes and genbanks:

cazymes 1 - - * genbanks

Expected behaviour

  • The table cazymes_genbanks should not be present
  • The table genbanks should contain the column cazyme_id, which is a foreign key from the table cazymes
  • The table cazymes should back-populate onto the table genbanks (as sketched below).
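
A sketch of the expected one-to-many mapping in SQLAlchemy 1.4 declarative style (column names assumed, not the real models):

from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Cazyme(Base):
    __tablename__ = "cazymes"
    cazyme_id = Column(Integer, primary_key=True)
    # one cazyme -> many genbanks; no cazymes_genbanks link table required
    genbanks = relationship("Genbank", back_populates="cazyme")

class Genbank(Base):
    __tablename__ = "genbanks"
    genbank_id = Column(Integer, primary_key=True)
    genbank_accession = Column(String)
    cazyme_id = Column(Integer, ForeignKey("cazymes.cazyme_id"))
    cazyme = relationship("Cazyme", back_populates="genbanks")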

Unexpected error message when retrieving AA UniProt sequences

First I built the database with:
cazy_webscraper <email> --classes AA

Then I tried:
cw_get_uniprot_data <path_to_db> --families AA17 -s

And I got this output:

Built output directory: .cazy_webscraper_2023-04-25_17-38-06\uniprot_data_retrieval
Using default CAZy class synonyms
Retrieving GenBank accessions for selected CAZy classes: 0it [00:00, ?it/s]
Applying CAZy family filter(s)
Retrieving GenBank accessions for selected CAZy families:   0%|                                                                                  | 0/1 [00:00<?, ?it/s]Retrieving CAZymes for CAZy family AA17
Retrieving GenBank accessions for selected CAZy families: 100%|██████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 12.69it/s]
Applying no taxonomic filters
Retrieving UniProt data for 418
Retrieving data for 418 proteins
[['CCD28157.1', 'ETN25003.1', 'EGZ04327.1', 'EQC34366.1', 'ETM55527.1', 'EEY56117.1', 'ETL32367.1', 'ETL25332.1', 'ETO77111.1', 'ETK81747.1', 'ETO67284.1', 'ETL88378.1', 'ETN06075.1', 'ETI38762.1', 'ETK71917.1', 'ETO83077.1', 'CCA20830.1', 'EGZ04492.1', 'KDO27085.1', 'ETM02278.1', 'ETP30231.1', 'EQC39423.1', 'ETO81104.1', 'ETK81734.1', 'ETM00651.1', 'CCI47381.1', 'ETI42112.1', 'ETL41683.1', 'ETN20052.1', 'EEY58933.1', 'ETP25841.1', 'ETI48323.1', 'EGZ23522.1', 'ETP18130.1', 'EEY69088.1', 'ETN11075.1', 'EEY61639.1', 'EEY61638.1', 'ETK90881.1', 'ETK81744.1', 'KDO29253.1', 'ETK73643.1', 'ETN20049.1', 'ETP39574.1', 'ETO70369.1', 'ETL27076.1', 'ETK95850.1', 'ETN10519.1', 'ETK88975.1', 'EGZ05857.1', 'KDO26913.1', 'ETO83083.1', 'EGZ27273.1', 'ETI35283.1', 'ETI31514.1', 'ETL32408.1', 'UIZ22004.1', 'ETL85648.1', 'ETM00659.1', 'ETK81732.1', 'ETI41722.1', 'ETI41730.1', 'AHO49056.1', 'ETK95846.1', 'ETP53850.1', 'ETK88582.1', 'ETN10510.1', 'ETO77816.1', 'ETI41706.1', 'ETI32208.1', 'ETI50994.1', 'EGZ10739.1', 'EQC34755.1', 'ETP36460.1', 'KDO27086.1', 'ETK81751.1', 'EEY60789.1', 'EEY58927.1', 'UIZ27392.1', 'ETI41723.1', 'ETK88280.1', 'ETW01779.1', 'ETP53138.1', 'EGZ21313.1', 'CCA13926.1', 'EGZ10738.1', 'ETN25011.1', 'EQC33678.1', 'ETN20074.1', 'ETK95840.1', 'EGZ21309.1', 'ETP39568.1', 'ETL27074.1', 'ETI41715.1', 'ETI35061.1', 'ETL49218.1', 'ETL41675.1', 'ETM32606.1', 'ETN20054.1', 'ETO70332.1', 'ETM55516.1', 'ETI48679.1', 'ETN20047.1', 'KDO27114.1', 'ETN14473.1', 'ETI38761.1', 'ETO71194.1', 'ETI41714.1', 'ETM31823.1', 'EEY68485.1', 'ETM48357.1', 'ETN14460.1', 'ETN10508.1', 'ETM55515.1', 'ETP39587.1', 'ETL80314.1', 'EGZ23520.1', 'ETL85650.1', 'AIG55447.1', 'EEY58936.1', 'EGZ08731.1', 'EGZ21314.1', 'CCA17179.1', 'ETI32192.1', 'ETN10809.1', 'UIZ26027.1', 'EEY58932.1', 'EGZ27342.1', 'ETL88376.1', 'ETN24019.1', 'ETL49222.1', 'ETI56033.1', 'EQC42132.1', 'ETP53858.1', 'ETL35141.1', 'ETI56043.1', 'ETN19254.1', 'EGZ08727.1', 'ETP08653.1', 'ETP03147.1', 'ETI54329.1', 'ETK88959.1', 'ETP53851.1', 'ETM55520.1', 'ETK95844.1', 'EQC25604.1', 'EGZ21312.1', 'ETL25321.1', 'ETO84778.1', 'ETO84785.1'], ['ETK88962.1', 'UIZ24201.1', 'AIG55787.1', 'ETO62052.1', 'EQC25608.1', 'ETM48060.1', 'ETI41729.1', 'EEY54090.1', 'ETI56037.1', 'ETP52131.1', 'ETP08632.1', 'ETN25010.1', 'EEY58944.1', 'EGZ05561.1', 'ETO70366.1', 'EEY58939.1', 'EEY68486.1', 'ETN20053.1', 'ETL35158.1', 'ETL35153.1', 'ETI56084.1', 'ETK78975.1', 'ETM42059.1', 'ETK75541.1', 'ETM32479.1', 'ETL47559.1', 'ETP02046.1', 'AIG55790.1', 'ETN11038.1', 'ETL49225.1', 'EGZ07203.1', 'ETO70367.1', 'ETP25842.1', 'UIZ28766.1', 'ETM97400.1', 'AIG55491.1', 'ETP29504.1', 'EQC24790.1', 'UIZ25173.1', 'ETP53849.1', 'ETO83080.1', 'ETL35151.1', 'ETN24018.1', 'ETP11451.1', 'KAF4046070.1', 'ETK78977.1', 'ETN20064.1', 'EGZ21311.1', 'ETL94825.1', 'ETI54331.1', 'ETN14463.1', 'ETP30207.1', 'ETK72575.1', 'EEY59753.1', 'ETK78714.1', 'ETM38805.1', 'UIZ26903.1', 'ETI42600.1', 'ETI41724.1', 'ETM41652.1', 'ETL35135.1', 'ETV73941.1', 'ETP25843.1', 'ETL35162.1', 'UIZ21835.1', 'ETL35152.1', 'EGZ23516.1', 'ETP53854.1', 'UIZ26907.1', 'EEY55873.1', 'ETI31519.1', 'ETO70368.1', 'ETV73994.1', 'ETP36689.1', 'ETO67480.1', 'UIZ27394.1', 'EGZ05551.1', 'ETM02264.1', 'ETI54320.1', 'EGZ07202.1', 'AIG56201.1', 'EQC24659.1', 'ETI38512.1', 'ETV73954.1', 'EGZ05560.1', 'ETM55519.1', 'ETO67483.1', 'ETN14469.1', 'ETM31809.1', 'ETO70335.1', 'ETO67485.1', 'EQC26776.1', 'ETL32386.1', 'ETL88833.1', 'UIZ22002.1', 'ETM00653.1', 'EGZ08733.1', 'ETL94840.1', 'EGZ17951.1', 'ETI48337.1', 'EGZ07231.1', 'ETO77414.1', 'ETM41632.1', 
'ETL79229.1', 'ETN20068.1', 'ETP52112.1', 'KDO29254.1', 'EEY58934.1', 'ETN00173.1', 'ETI31960.1', 'CCI46093.1', 'ETI44371.1', 'ETL88406.1', 'ETK95841.1', 'ETL95131.1', 'ETV73947.1', 'EQC24789.1', 'ETP46069.1', 'ETM02263.1', 'EQC25603.1', 'ETK81737.1', 'ETN20056.1', 'EGZ06334.1', 'ETP11442.1', 'ETV71159.1', 'EEY58926.1', 'EQC34358.1', 'ETW03002.1', 'ETN01693.1', 'ETP29981.1', 'AIG55448.1', 'ETK82650.1', 'EGZ27343.1', 'EEY67612.1', 'ETK81749.1', 'ETI41709.1', 'ETP39589.1', 'ETP18135.1', 'ETL78558.1', 'ETP24138.1', 'ETI48336.1', 'ETP11452.1', 'EGZ08724.1', 'ETN19515.1', 'ETI55312.1', 'ETP11455.1', 'ETI38498.1', 'ETP39588.1', 'EGZ27341.1', 'AIG56266.1'], ['ETP11448.1', 'ETP24139.1', 'UIZ26906.1', 'ETP46800.1', 'ETL49223.1', 'ETO59016.1', 'ETM41638.1', 'EGZ08732.1', 'ETI52336.1', 'ETK81748.1', 'UIZ24199.1', 'AIG55788.1', 'EGZ08725.1', 'ETM02269.1', 'EEY58931.1', 'ETL78541.1', 'ETL35539.1', 'ETN00163.1', 'ETN19290.1', 'ETP24148.1', 'AHO49057.1', 'EEY68484.1', 'ETI35285.1', 'ETK82180.1', 'ETP12314.1', 'ETM41647.1', 'ETO83056.1', 'ETO70334.1', 'EGZ05859.1', 'EGZ08736.1', 'ETP02071.1', 'ETN16020.1', 'ETO60944.1', 'ETP01315.1', 'ETO60224.1', 'EGZ08730.1', 'ETI35284.1', 'ETL27075.1', 'KDO27087.1', 'ETK78976.1', 'ETI56040.1', 'ETP18136.1', 'ETL88399.1', 'ETW02996.1', 'AIG55708.1', 'ETV73948.1', 'ETO70333.1', 'EGZ08739.1', 'ETP08651.1', 'EEY56114.1', 'EEY58928.1', 'EGZ04444.1', 'ETI56038.1', 'EEY67611.1', 'ETP33252.1', 'EEY58943.1', 'KDO29252.1', 'ETI41731.1', 'ETI38720.1', 'EGZ05556.1', 'ETI38760.1', 'ETL32154.1', 'UIZ25176.1', 'ETP11454.1', 'ETI42158.1', 'ETO63823.1', 'EGZ06395.1', 'ETK81741.1', 'EGZ08734.1', 'ETK88276.1', 'ETI31524.1', 'ETL95552.1', 'ETK71888.1', 'EGZ05562.1', 'ETW02997.1', 'EGZ08128.1', 'ETO70341.1', 'ETL94839.1', 'ETK71904.1', 'ETI56039.1', 'ETI41716.1', 'ETM48076.1', 'ETI31959.1', 'ETL80307.1', 'ETL88392.1', 'ETP11446.1', 'EEY65096.1', 'KDO30778.1', 'ETM41641.1', 'ETO70735.1', 'KDO27115.1', 'ETM32478.1', 'ETI48338.1', 'EQC42107.1', 'ETO70329.1', 'ETN20072.1', 'ETP28200.1', 'ETI54319.1', 'EGZ05552.1', 'ETN14458.1', 'ETP39581.1', 'KDO27089.1', 'ETI54333.1', 'ETO83057.1', 'ETK81755.1', 'ETP46074.1', 'EGZ27274.1', 'ETL47567.1', 'ETP25849.1', 'EGZ08747.1', 'ETM55513.1', 'ETI41700.1', 'ETP50074.1', 'EGZ06330.1', 'AIG55793.1', 'ETP25844.1', 'ETI41704.1', 'ETO84776.1']]
Batch retrieving protein data from UniProt:   0%|                                                                                                | 0/3 [00:00<?, ?it/s]WARNING [bioservices.UniProt:596]:  status is not ok with Forbidden
Batch retrieving protein data from UniProt:   0%|                                                                                                | 0/3 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\alexs\anaconda3\envs\ai\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\alexs\anaconda3\envs\ai\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\alexs\anaconda3\envs\ai\Scripts\cw_get_uniprot_data.exe\__main__.py", line 7, in <module>
    sys.exit(main())
  File "C:\Users\alexs\anaconda3\envs\ai\lib\site-packages\cazy_webscraper\expand\uniprot\get_uniprot_data.py", line 147, in main
    downloaded_uniprot_data, all_ecs = get_uniprot_data(gbk_data_to_download, cache_dir, args)
  File "C:\Users\alexs\anaconda3\envs\ai\lib\site-packages\cazy_webscraper\expand\uniprot\get_uniprot_data.py", line 348, in get_uniprot_data
    uniprot_df = UniProt().get_df(entries=query, limit=args.uniprot_batch_size)
  File "C:\Users\alexs\anaconda3\envs\ai\lib\site-packages\bioservices\uniprot.py", line 851, in get_df
    res = self.search(
  File "C:\Users\alexs\anaconda3\envs\ai\lib\site-packages\bioservices\uniprot.py", line 744, in search
    batch = batch.split("\n")[1:]
AttributeError: 'int' object has no attribute 'split'

Remove unnecessary files

Several unnecessary files are contained in the repository. These appear to have been carried through after local installation, as they weren't ignored by a .gitignore file (see #8). These should be removed.

cazy_webscraper.egg-info/PKG-INFO
cazy_webscraper.egg-info/SOURCES.txt
cazy_webscraper.egg-info/top_level.txt
scraper/__pycache__/__init__.cpython-38.pyc
scraper/__pycache__/cazy_webscraper.cpython-38.pyc
scraper/file_io/__pycache__/__init__.cpython-38.pyc
scraper/parse/__pycache__/__init__.cpython-38.pyc
scraper/utilities/__pycache__/__init__.cpython-38.pyc

Allow the user to determine the software version

Is your feature request related to a problem? Please describe.

cazy_webscraper appears to have no way to determine the current version of the tool from the command-line.

Describe the solution you'd like

Add a cazy_webscraper --version command-line flag.
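
A minimal sketch with argparse; reading the installed version via importlib.metadata avoids hard-coding it (assumes the package is installed under the name cazy-webscraper):

import argparse
from importlib import metadata

parser = argparse.ArgumentParser(prog="cazy_webscraper")
parser.add_argument(
    "--version",
    action="version",
    version=f"%(prog)s {metadata.version('cazy-webscraper')}",
)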

KeyError when converting CAZy class abbreviations

When using the --classes filter a KeyError is raised.

...site-packages/cazy_webscraper/sql/sql_interface/get_selected_gbks.py", line 205, in get_class_fam_genbank_accessions
class_abbrev = CLASS_ABBREVIATIONS[cazy_class]
KeyError: 'CE'
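
A hedged sketch of a defensive lookup; the real mapping lives in get_selected_gbks.py and evidently did not accept the already-abbreviated form 'CE' at the time of this report:

CLASS_ABBREVIATIONS = {
    "Glycoside Hydrolases": "GH",
    "GlycosylTransferases": "GT",
    "Polysaccharide Lyases": "PL",
    "Carbohydrate Esterases": "CE",
    "Auxiliary Activities": "AA",
    "Carbohydrate-Binding Modules": "CBM",
}

def class_abbrev_for(cazy_class):
    # Accept either a full class name or an already-abbreviated form such
    # as 'CE', which previously raised KeyError
    if cazy_class in CLASS_ABBREVIATIONS.values():
        return cazy_class
    return CLASS_ABBREVIATIONS[cazy_class]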

Fix logger inheritance

When using the --verbose/-V flag, the logger level should be changed to INFO. However, the loggers are not correctly inheriting this change in level, and child loggers remain at level WARNING.
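
A minimal sketch of the intended inheritance: child loggers left at level NOTSET defer to the nearest configured ancestor, so setting INFO once on the package logger is enough.

import logging

logging.basicConfig(format="%(levelname)s [%(name)s]: %(message)s")

# Set INFO once on the package logger when --verbose is given...
logging.getLogger("cazy_webscraper").setLevel(logging.INFO)

# ...and every child logger inherits it, because its own level is NOTSET
child = logging.getLogger("cazy_webscraper.expand.uniprot")
child.info("visible: level inherited from the parent logger")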

Non-standard database table naming format

Describe the bug

Currently, the tables in the SQL database created using cazy_webscraper are named with a lower-case first letter, whereas standard SQL practice dictates that table names should start with an upper-case letter.

Expected behaviour

Therefore, the table cazymes should be Cazymes or, to match standard writing practice for the acronym, CAZymes.

Tidying `conda`, `pip`, and `setup.py`

  1. The requirements in the three dependency lists do not currently agree with each other.

The conda and pip package managers do not overlap exactly. Some packages are available only in pip and others only in conda/bioconda. The repository contains an environment.yml for conda and a requirements.txt for pip. Additionally, the setup.py file will install biopython and pandas.

These three descriptions of code requirements should - ideally - agree, but do not:

  • setup.py pins biopython at >= 1.76 and pandas at >= 1.0.3
  • requirements.txt does not specify any versions
  • environment.yml pins biopython at 1.78, and pandas at 1.1.2

If packages are to be pinned, they should really be pinned at the same versions, unless there is some specific intent otherwise (and this should be flagged). It is fine to have a specific intent (e.g. environment.yml specifying a very detailed known-good environment), but that may be overkill here.

  2. The requirements could be organised differently - and possibly be more easily maintained.

The setup.py file should state which packages are necessary to run, in the install_requires argument, though setuptools does not make any attempt to ensure that the installed packages are mutually compatible. Any version pins here should perhaps take their lead from a known good configuration?

The environment.yml file only works with conda, but it does allow version pinning and also installation of pip-specific packages. The current environment.yml appears to be autogenerated (hence the very specific pins) and includes many packages that are implied dependencies of the required packages. The environment.yml file could be trimmed down considerably to contain only those packages that are necessary to run the script/module (and these then imply the other dependencies). Any pip-only modules can be specified here under a pip field.
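
A trimmed environment.yml of that shape might look like this; the package list is taken from the requirements discussed here, while the Python pin and channel choices are assumptions:

name: cazy_webscraper
channels:
  - conda-forge
  - bioconda
dependencies:
  - python>=3.8
  - biopython>=1.76
  - pandas>=1.0.3
  - requests
  - pyyaml
  - tqdm
  - pip
  - pip:
      - mechanicalsoup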

The requirements.txt file mixes dependencies necessary for the module/script with those required only for development. These could be split into a requirements.txt (corresponding to install_requires from setup.py) and a requirements-dev.txt file, so that packages which are not required for the module/script are not installed unnecessarily.

  3. The installation instructions can then be streamlined, and separate instructions given for those who are only using, and those who are also maintaining/developing, the tool.

The four routes for a user should be something like:

  • python setup.py install (takes dependencies from setup.py's install_requires)
  • conda env create --file environment.yml (takes dependencies from environment.yml, optionally including pip-specific requirements files)
  • conda install --file requirements.txt
  • pip install -r requirements.txt

then for a developer/maintainer

  • conda install --file requirements-dev.txt
  • pip install -r requirements-dev.txt

(followed by pip install -e .)
are good options for installing things like flake8/pylint etc.

  4. Some necessary modules were missing from one or more dependency definitions:
  • requests
  • mechanicalsoup
  • pyyaml
  • tqdm

Conda installation instruction correction

The conda installation instructions currently read:

[use] the following command at the command-line in the terminal: `conda asdasd cazy_webscraper`

which should be conda install cazy_webscraper (with bioconda in your channels) or conda install -c bioconda cazy_webscraper.

When collecting all sequences for a family, use the last page number to set upper limit on progress bar

Currently, the progress bar for commands like:

cazy_webscraper.py -v -g me@my_domain -l test.log -o outdir

doesn't know how long it has to go, so the user doesn't know either:

Retrieving proteins from GH1: 0it [00:00, ?it/s]       cazy_webscraper: 2020-12-03 14:27:10,010 - Retrieving proteins from http://www.cazy.org/GH1_all.html?debut_PRINC=2000#pagination_PRINC
Retrieving proteins from GH1: 1it [00:04,  4.01s/it]   cazy_webscraper: 2020-12-03 14:27:14,230 - Retrieving proteins from http://www.cazy.org/GH1_all.html?debut_PRINC=3000#pagination_PRINC
Retrieving proteins from GH1: 1001it [00:11,  2.81s/it]cazy_webscraper: 2020-12-03 14:27:20,435 - Retrieving proteins from http://www.cazy.org/GH1_all.html?debut_PRINC=4000#pagination_PRINC

The total number of pages for a family (in combination with a taxon level) is found on the first page, so this could be set in the range that tqdm sees - as it seems to be for the outer loops (families, classes)? That could then give the user an idea of how long they might need to wait.
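
A sketch of the fix: pass the page count parsed from the family's first page as tqdm's total (URL pattern taken from the log above; the page count and parsing step are hypothetical):

from tqdm import tqdm

last_page = 40  # parsed from the pagination links on the family's first page
page_urls = [
    f"http://www.cazy.org/GH1_all.html?debut_PRINC={i * 1000}#pagination_PRINC"
    for i in range(last_page)
]

# With total= set, tqdm shows a percentage and ETA instead of a bare count
for url in tqdm(page_urls, total=last_page, desc="Retrieving proteins from GH1"):
    pass  # fetch and parse the protein table on this page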

fix parsing `families` configuration

Please complete this report in full and as much detail as possible. It will help with getting the bug fixed far sooner!

Describe the bug

CBM families are identified as invalid CAZy families, but should be recognised as valid.

To Reproduce

cazy_webscraper <user_email> --families CBM5

Expected behavior

Should not raise any issues.

Crashes when retrieving seqs from NCBI

Describe the bug

Crashes when adding downloaded sequences to the local CAZyme database; a function call is missing arguments.

It also fails to identify valid IDs in downloaded NCBI sequence record IDs. For example, it fails to extract the ID from prf||2109195A and sp|B2FSW8.1|EALGL_STRMK.

To Reproduce

  1. cazy_webscraper [email protected] -o cazy.db --classes AA,CE
  2. cw_get_genbank_seqs cazy.db [email protected]

`--families` option does not work as stated in tutorial

Describe the bug

The tutorial suggests the following command should download a subset of CAZy families:

cazy_webscraper.py --families GH2,PL5,CE1,CE2,AA10

but it produces an error:

$ cazy_webscraper.py --families GH2,PL5,CE1,CE2,AA10
usage: cazy_webscraper.py [-h] [-c config file] [-d {None,class,family}] [-f] [-g Email address of user] [-l log file name] [-n] [-o output file name]
                          [-genbank_output output file name] [-pdb_output output file name] [-p {None,mmCif,pdb,xml,mmtf,bundle}] [-s] [-v]
cazy_webscraper.py: error: unrecognized arguments: --families GH2,PL5,CE1,CE2,AA10

To Reproduce

  • Install cazy_webscraper v1.0
  • Issue the command cazy_webscraper.py

Expected behavior

CAZy family data is downloaded

Basic validation against public CAZy site

It's hard to do a full validation of scraping without having access to the remote backend of CAZy (which no-one seems to get to see).

A way to sanity-check that the scraping captured all the data it needed to might be to check the counts of each family at the CAZy website (see image) against locally downloaded records for each CAZy family. This would need to account for deduplication and the handful of records that are discarded because they are incomplete.

[Screenshot (2021-02-04): family counts table from the CAZy website]

When writing output files, create missing parent directories

Is your feature request related to a problem? Please describe.

It is tedious for a user to need to create parent directories. The tutorial however indicates:

When requesting cazy_webscraper make an output directory, the parent of the directory we wish to make must already exist.

Describe the solution you'd like

It would improve the user interface if, for example:

cazy_webscraper.py -o dir1/dir2/dir3/mydb

created dir1, dir2, and dir3 for the user. This is trivial with Path(dirname).mkdir(parents=True, exist_ok=True).

Add GUI

Is your feature request related to a problem? Please describe.

A GUI for cazy_webscraper would improve accessibility.

Additionally, this could improve the packaging and installation of cazy_webscraper by diminishing the need to use the command-line.

Describe the solution you'd like

Use Python Gooey to create a GUI and package the tool for distribution.
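
A minimal sketch with Gooey, which wraps an argparse-style interface in a generated GUI (argument names are illustrative, not the real interface):

from gooey import Gooey, GooeyParser

@Gooey(program_name="cazy_webscraper")
def main():
    parser = GooeyParser(description="Scrape the CAZy database")
    parser.add_argument("email", help="User email address (required by NCBI)")
    parser.add_argument("--families", help="Comma-separated CAZy families")
    args = parser.parse_args()
    # ...invoke the scraper with args here...

if __name__ == "__main__":
    main()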

Use `cazy_webscraper` rather than `cazy_webscraper.py` as the command.

Describe the solution you'd like

We can currently use, e.g.:

cazy_webscraper.py -o mydb

but the .py is redundant, easy to forget and looks unusual in a POSIX context. The command would be cleaner as:

cazy_webscraper -o mydb

Additional context

As users may have incorporated the tool into pipelines already, the old command cazy_webscraper.py will also need to be retained. This could be handled by having two commands point to the same entry point in setup.py, as sketched below.
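
A sketch of the setup.py change; the module path is assumed from the repository layout (scraper/cazy_webscraper.py) and the main function name is hypothetical:

from setuptools import find_packages, setup

setup(
    name="cazy_webscraper",
    packages=find_packages(),
    entry_points={
        "console_scripts": [
            # both command names resolve to the same entry point
            "cazy_webscraper=scraper.cazy_webscraper:main",
            "cazy_webscraper.py=scraper.cazy_webscraper:main",  # legacy alias
        ],
    },
)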

API missing option to include 'kingdoms' in output

Please complete this report in full and as much detail as possible. It will help with getting the bug fixed far sooner!

Describe the bug

The --include flag is missing the option 'kingdom' to include the taxonomic kingdom in the output file.
