Comments (3)
> Could we assume that for every family a protein appears in, its associated data (UniProt, PDB and non-primary GenBank accessions, EC numbers and source organism) are the same?
I can't think of a reason why there should be any difference, in the sense of "the GenBank accession should be associated with the same (for example) UniProt accession."
Where there is a difference, I'd expect it to be because one or more (say) UniProt accessions is not present in the row at the CAZy database; I'm not convinced that the CAZy.org schema has entirely avoided this sort of integrity problem - but if you're confident…
> By applying this assumption, the number of queries to the local CAZy database can be significantly reduced. The scraper would only need to check whether the current working protein is associated with the current working CAZy family.
If you're not wanting to double-check the associated accessions, that could save some time. But I wonder what proportion of proteins this would be, and if the trade-off of - say - avoiding 1% of checks is worth it at the expense of a possible loss of data integrity, if the CAZy.org tables are inconsistent?
Overall I suspect the web/network IO is the slow step, and in-memory database queries are fast by comparison, so the speed-up is possibly not so great. Do you have any numbers to show how impactful this change might be?
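(A rough sketch of the two local checks being weighed up here, assuming a SQLite database queried through SQLAlchemy; the table and column names are placeholders, not the actual cazy_webscraper schema. Under the assumption, only the cheap family-membership check runs for a previously seen protein; without it, the full accession comparison runs as well, which is where the extra local queries come from.)

```python
# Illustrative only: placeholder schema, not the real cazy_webscraper tables.
from sqlalchemy import create_engine, text


def protein_in_family(conn, genbank_accession: str, family: str) -> bool:
    """Cheap check: is this protein already linked to the current CAZy family?"""
    row = conn.execute(
        text(
            "SELECT 1 FROM proteins p "
            "JOIN protein_families pf ON pf.protein_id = p.protein_id "
            "JOIN families f ON f.family_id = pf.family_id "
            "WHERE p.genbank_accession = :acc AND f.family = :fam"
        ),
        {"acc": genbank_accession, "fam": family},
    ).first()
    return row is not None


def annotations_match(conn, genbank_accession: str, scraped: dict) -> bool:
    """Expensive check: compare every stored UniProt/PDB accession and EC number
    against what the current CAZy family page lists for this protein."""
    stored = conn.execute(
        text(
            "SELECT uniprot_accession, pdb_accession, ec_number "
            "FROM protein_annotations WHERE genbank_accession = :acc"
        ),
        {"acc": genbank_accession},
    ).fetchall()
    stored_sets = {
        "uniprot": {r[0] for r in stored if r[0]},
        "pdb": {r[1] for r in stored if r[1]},
        "ec": {r[2] for r in stored if r[2]},
    }
    return all(stored_sets[k] == set(scraped.get(k, [])) for k in stored_sets)


with create_engine("sqlite:///cazy.db").connect() as conn:
    seen_in_family = protein_in_family(conn, "ABC12345.1", "GH5")
```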
> I can't think of a reason why there should be any difference, in the sense of "the GenBank accession should be associated with the same (for example) UniProt accession."
> Where there is a difference, I'd expect it to be because one or more (say) UniProt accessions is not present in the row at the CAZy database; I'm not convinced that the CAZy.org schema has entirely avoided this sort of integrity problem - but if you're confident…
I agree that it's not best practice for data integrity, but I haven't been able to find an instance in CAZy that breaks the rule. In theory there shouldn't be one, because the PDB, UniProt and GenBank accessions and EC numbers should all be associated with one another, or at least with a single protein record, which should then be presented in exactly the same way each time it appears in a CAZy protein table. But that relies on a lot of assumptions about the CAZy schema.
> If you're not wanting to double-check the associated accessions, that could save some time. But I wonder what proportion of proteins this would be, and if the trade-off of - say - avoiding 1% of checks is worth it at the expense of a possible loss of data integrity, if the CAZy.org tables are inconsistent?
I added this assumption so the user can 'customise' it. The user can define which combination of UniProt, GenBank and PDB accessions and EC numbers to presume are the same each time a given protein appears in CAZy. I set my run to presume the UniProt and PDB accessions and EC numbers are the same, but to always check that all GenBank accessions are retrieved for each instance a given protein appears in CAZy, as it's the GenBank accessions that are so important for pyrewton.
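Illustratively, that per-data-type choice could be expressed like this inside a scraper (the names here are hypothetical, not the actual cazy_webscraper options):

```python
# Hypothetical sketch: which annotation types are presumed consistent on repeat
# sightings of a protein, and which are always re-checked.
PRESUME_CONSISTENT = {"uniprot", "pdb", "ec"}  # skip re-checking these
ALWAYS_RECHECK = {"genbank"}                   # always re-parse/verify these


def fields_to_parse(seen_before: bool) -> set:
    """Return the data types that must be parsed for the current protein row."""
    if not seen_before:
        # New protein: everything is parsed and stored.
        return PRESUME_CONSISTENT | ALWAYS_RECHECK
    # Previously scraped protein: only the fields excluded from the assumption.
    return set(ALWAYS_RECHECK)
```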
- Applying this assumption, I was able to scrape 350,835 proteins in 16 hours 16 minutes (scraping CAZy families GH1-GH22, in that order).
- Without the assumption (scraping the same families, GH1-GH22, in the same order), it took over 30 hours.
The rate of scraping varies from family to family. The time required to parse a protein entry from CAZy is shortest when adding a new protein and longest when adding data to existing proteins, so families with a higher ratio of previously scraped to newly scraped proteins scrape at a significantly slower rate. This makes it extremely difficult to predict the total time to scrape CAZy accurately. But, approximately, without the assumption that the UniProt and PDB accessions and EC numbers are the same each time a given protein is scraped from CAZy, it was looking like the scraper would take 7-8 days (at least!) to scrape the entirety of CAZy. When applying these assumptions (and only double-checking the GenBank accessions), I'm looking at under 4 days.
I think it's a behaviour worth keeping in the scraper, but strongly highlighting in the documentation that it isn't best practice for the integrity of the data covered by the assumption. It should only be used when performing an extremely large scrape, such as the whole database, multiple CAZy classes, or any scrape with a total scraping time of over a week, and the assumption should be applied minimally (e.g. only to data that is not absolutely essential to, or utilised in, downstream processing).
The advantage for extremely large scrapes (e.g. the whole of CAZy) is that it significantly reduces the number of proteins that take one or more seconds to scrape/parse. There are ~2,000,000 protein entries in CAZy to be parsed; if half of those take 1 second each, that's already ~11 days just to scrape those 1,000,000 entries.
At the moment:
- While assuming the UniProt and PDB accessions and EC numbers are the same, I'm averaging (taking into account the data also scraped while writing this) 1 CAZy entry per 0.169 s.
- Without applying the assumption, I was averaging 1 CAZy entry per 0.25-0.29 s.
At the time of not applying the assumption, I had only scraped GH, which is half the database, and it had taken over 3.5 days; the average was also increasing with each CAZy family. I would expect the second half to be potentially slower because of an increase in the ratio of previously scraped to new proteins, as this is the consistent behaviour I've found. At an average of 1 CAZy entry per 0.3 s the scrape will take at least 7 days; anything greater than that and it's over a week to scrape CAZy. I also had to restart the scrape because I realised the behaviour wasn't correct when reattempting to connect to CAZy after a connection timed out: I wasn't convinced it was parsing the page after a previous timeout, which would have led to a partial scrape of CAZy. It's all sorted now, but I don't want the scrape to take a week or more.
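For reference, those per-entry rates scale out to whole-database estimates roughly as follows (back-of-the-envelope only, using the ~2,000,000-entry figure mentioned above):

```python
# Quick projection of total scrape time from the observed per-entry rates.
TOTAL_ENTRIES = 2_000_000  # approximate number of CAZy protein entries

for label, secs_per_entry in [("with assumption", 0.169), ("without assumption", 0.30)]:
    days = TOTAL_ENTRIES * secs_per_entry / 86_400  # 86,400 seconds per day
    print(f"{label}: ~{days:.1f} days")

# with assumption:    ~3.9 days (matches "under 4 days")
# without assumption: ~6.9 days, before any slow-down from re-checking existing proteins
```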
The scraper does allow the user to add data to an existing database. Therefore, you can apply the assumption when performing the large scrape, and then, if there are specific subsets for which you want to be sure you have, for instance, all the PDB accessions, you can re-scrape just those subsets, adding the data to the database created by the large CAZy scrape. With the logging system in the database, if you share the database with someone, they can see exactly what you did: whether or not you applied the assumption, and how it was applied.
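As a purely hypothetical illustration of that idea (this is not the actual cazy_webscraper log schema), the record of which assumptions were applied could be as simple as:

```python
# Hypothetical logging sketch: record what was scraped and which consistency
# assumptions were applied, so anyone receiving the database can see both.
import datetime
import sqlite3


def log_scrape_config(db_path: str, families: list, presumed_consistent: set) -> None:
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS scrape_log "
            "(date TEXT, families TEXT, presumed_consistent TEXT)"
        )
        conn.execute(
            "INSERT INTO scrape_log VALUES (?, ?, ?)",
            (
                datetime.date.today().isoformat(),
                ",".join(families),
                ",".join(sorted(presumed_consistent)) or "none",
            ),
        )
```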
It's a poor workaround for speeding up the scraper, but it gets the GenBank accessions I need for pyrewton. Longer term, a better solution for increasing the rate of large scrapes of CAZy that doesn't have a potentially negative effect on data integrity might be worth looking into.
> Overall I suspect the web/network IO is the slow step
It certainly is when the CAZy server is having a slow day and it can take 10 or more attempts, each timing out at 45 seconds, to reach CAZy.
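The retry behaviour described here is essentially the following (a sketch only; the attempt count and pause are illustrative, and the key point is that the page body is only returned once a request actually succeeds):

```python
# Sketch of retrying a CAZy page when the connection times out at 45 seconds.
import time

import requests


def fetch_with_retries(url: str, max_attempts: int = 10, timeout: int = 45) -> str:
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return response.text  # only parse the page once it was actually retrieved
        except requests.RequestException as err:
            print(f"Attempt {attempt}/{max_attempts} failed: {err}")
            time.sleep(10)  # brief pause before reattempting the connection
    raise ConnectionError(f"Could not reach {url} after {max_attempts} attempts")
```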
> I haven't been able to find an instance in CAZy that breaks the rule. In theory there shouldn't be one, because the PDB, UniProt and GenBank accessions and EC numbers should all be associated with one another, or at least with a single protein record, which should then be presented in exactly the same way each time it appears in a CAZy protein table. But that relies on a lot of assumptions about the CAZy schema.
I think you'd be able to divine this from your scraped data, if you scraped a "raw" version of the website tables.
It's probably OK to assume that all the data is consistent, but as it's an assumption, it should be clearly stated (and ideally it would be checked, too…)
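A minimal sketch of what that check could look like over a raw scrape, assuming the data were flattened to one row per (protein, family) sighting and using hypothetical column names:

```python
# Check the consistency assumption from a raw scrape: for each GenBank accession,
# count how many distinct annotation values appear across the families it is listed in.
import pandas as pd

raw = pd.read_csv("raw_cazy_scrape.csv")  # one row per (protein, family) sighting

inconsistent = (
    raw.groupby("genbank_accession")[["uniprot_accession", "pdb_accessions", "ec_numbers"]]
    .nunique()
    .query("uniprot_accession > 1 or pdb_accessions > 1 or ec_numbers > 1")
)
print(f"{len(inconsistent)} proteins have annotations that differ between families")
```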
With that in mind, I'm kind of annoyed with myself that I didn't suggest scraping the site in its entirety as it stands, as a local version on-disk so that you don't have to wait for network connections while developing the database and other logic :(
Related Issues (20)
- Fix logger inheritance
- Use NCBI Tax IDs
- Bio.Entrez NotXMLError HOT 1
- Fails to retrieve data from UniProt
- Crashes when retrieving NCBI seqs: http.client.IncompleteRead
- Failing to retrieve UniProt data HOT 4
- API missing opt to include 'kingdoms' in output
- Update to `sqlalchemy` 2.x HOT 1
- Crashes when retrieving seqs from NCBI
- No PDB acessions matched and Retrieving no protein structure files HOT 3
- Unexpected error message when retrieving AA UniProt sequences HOT 2
- Add subcommands HOT 1
- Increase unit test coverage
- Crashing when retrieving taxs from NCBI HOT 6
- Incomplete read error HOT 3
- Reduced memory demand HOT 1
- Crashing when retrieving taxs from NCBI - perhaps related to #120 HOT 1
- cazy_webscraper - error downloading database HOT 1
- Error during retrieving taxa info from NCBI HOT 1
- Incorrect parsing of NCBI protein version accession