
Comments (3)

widdowquinn commented on June 23, 2024

Could we assume that for every family a protein appears in, its associated data (UniProt, PDB and non-primary GenBank accessions, EC numbers and source organism) are the same?

I can't think of a reason why there should be any difference, in the sense of "the GenBank accession should be associated with the same (for example) UniProt accession."

Where there is a difference, I'd expect it to be because one or more (say) UniProt accessions are not present in the row at the CAZy database; I'm not convinced that the CAZy.org schema has entirely avoided this sort of integrity problem - but if you're confident…

By applying this assumption, the number of queries to the local CAZy database can be significantly reduced. The scraper would only need to check if the current working protein is associated with the current working CAZy family.
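For concreteness, a minimal sketch of the reduced check, assuming a hypothetical SQLite schema with proteins and proteins_families tables (the real cazy_webscraper schema may differ):

```python
import sqlite3

def protein_in_family(conn: sqlite3.Connection, genbank_accession: str, family: str) -> bool:
    """Return True if this protein is already linked to this CAZy family.

    Hypothetical schema for illustration only.
    """
    cur = conn.execute(
        "SELECT 1 FROM proteins_families pf "
        "JOIN proteins p ON p.protein_id = pf.protein_id "
        "WHERE p.genbank_accession = ? AND pf.family = ? LIMIT 1",
        (genbank_accession, family),
    )
    return cur.fetchone() is not None

# Under the assumption, a repeat appearance of a known protein needs only this
# single membership check, rather than also re-querying and comparing the
# stored UniProt/PDB accessions and EC numbers.
```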

If you're not wanting to double-check the associated accessions, that could save some time. But I wonder what proportion of proteins this would be, and if the trade-off of - say - avoiding 1% of checks is worth it at the expense of a possible loss of data integrity, if the CAZy.org tables are inconsistent?

Overall I suspect the web/network IO is the slow step, and in-memory database queries are fast by comparison, so the speed-up is possibly not so great. Do you have any numbers to show how impactful this change might be?


HobnobMancer commented on June 23, 2024

I can't think of a reason why there should be any difference, in the sense of "the GenBank accession should be associated with the same (for example) UniProt accession."
Where there is a difference, I'd expect it to be because one or more (say) UniProt accessions are not present in the row at the CAZy database; I'm not convinced that the CAZy.org schema has entirely avoided this sort of integrity problem - but if you're confident…

I agree that it's not best practice for data integrity. But I haven't been able to find an instance in CAZy that breaks the rule. In theory, there shouldn't be one, as the PDB, UniProt and GenBank accessions and EC numbers should all be associated with one another, or at least with a single protein record, which should then be presented in exactly the same way each time it appears in a CAZy protein table. But that relies on a lot of assumptions about the CAZy schema.

If you're not wanting to double-check the associated accessions, that could save some time. But I wonder what proportion of proteins this would be, and if the trade-off of - say - avoiding 1% of checks is worth it at the expense of a possible loss of data integrity, if the CAZy.org tables are inconsistent?

I added in this assumption so the user can 'customise' it. The user can define which combination of UniProt, GenBank, and PDB accessions and EC numbers to presume are the same each time a given protein appears in CAZy. I set my run to presume UniProt and PDB accessions and EC numbers are the same, but to always check that all GenBank accessions are retrieved for each instance a given protein appears in CAZy, as it's the GenBank accessions that are so important for pyrewton.
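As a sketch only (the names are illustrative, not cazy_webscraper's actual options), that per-field choice amounts to something like:

```python
# Hypothetical sketch of the per-field choice; not cazy_webscraper's API.
ALL_FIELDS = {"uniprot", "pdb", "ec", "genbank"}
ASSUME_CONSISTENT = {"uniprot", "pdb", "ec"}  # presumed identical on repeat appearances

def fields_to_recheck() -> set[str]:
    """Fields re-queried each time a known protein reappears in a family.

    GenBank is deliberately never assumed here, because complete GenBank
    accessions are what pyrewton depends on downstream.
    """
    return ALL_FIELDS - ASSUME_CONSISTENT  # {"genbank"}
```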

  • Applying this assumption, I was able to scrape 350,835 proteins in 16 hours 16 minutes (scraping CAZy families GH1-GH22 in that order).
  • When not applying the assumption (scraping CAZy families GH1-GH22 in the same order), it took over 30 hours. A quick per-protein rate calculation follows this list.
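For reference, the per-protein rates those two runs imply:

```python
# Back-of-envelope rates for the two GH1-GH22 runs above
proteins = 350_835
with_assumption_s = 16 * 3600 + 16 * 60   # 16 h 16 min = 58,560 s
without_assumption_s = 30 * 3600          # "over 30 hours", so a lower bound

print(with_assumption_s / proteins)       # ~0.167 s per protein
print(without_assumption_s / proteins)    # ~0.308 s per protein, at least
```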

The rate of scraping varies from family to family. The time required to parse a protein entry from CAZy is shortest when adding a new protein and longest when adding data to existing proteins, so families with a higher ratio of previously scraped to newly scraped proteins scrape at a significantly slower rate. This makes it extremely difficult to accurately predict the total time to scrape CAZy. But approximately: without the assumption that UniProt and PDB accessions and EC numbers are the same each time a given protein is scraped from CAZy, it was looking like the scraper would take 7-8 days (at least!) to scrape the entirety of CAZy. When applying these assumptions (and only double-checking the GenBank accessions) I'm looking at under 4 days.

I think it's a behaviour to keep in the scraper, but to strongly highlight in the documentation that it isn't best practice for the integrity of the data included in the assumption. It should only be used when performing an extremely large scrape, such as the whole database, multiple CAZy classes, or a total scraping time of over a week, and the assumptions should be kept to a minimum (e.g. only applied to data that is not absolutely essential to, or utilised in, downstream processing).

The advantage for extremely large scrapes (e.g. the whole of CAZy) is that it significantly reduces the number of proteins that take one or more seconds to scrape/parse. There are ~2,000,000 protein entries in CAZy to be parsed. If half of those take 1 second each, then that's already 11 days to scrape those 1,000,000 entries.

At the moment:

  • While assuming UniProt and PDB accessions and EC numbers are the same, I'm averaging (taking into account the data also scraped while writing this) 1 CAZy entry per 0.169 s.
  • Without applying the assumption, I was averaging between 1 CAZy entry per 0.25 s and 0.29 s (extrapolated to the whole database after this list).
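Extrapolating those per-entry rates to the ~2,000,000 entries in CAZy, and checking the 1,000,000-entries-at-1-second figure above:

```python
entries = 2_000_000
seconds_per_day = 86_400

print(entries * 0.169 / seconds_per_day)   # ~3.9 days with the assumption
print(entries * 0.30 / seconds_per_day)    # ~6.9 days without it
print(1_000_000 * 1 / seconds_per_day)     # ~11.6 days if half the entries take 1 s each
```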

When not applying the assumption, I had only scraped GH, which is half the database; it had taken over 3.5 days, and the average was increasing with each CAZy family. I would expect the second half to potentially be slower because of an increase in the ratio of previously scraped to new proteins, as this is the consistent behaviour I've found. At an average of 1 CAZy entry per 0.3 s the scrape would take at least 7 days; anything greater than that and it's over a week to scrape CAZy. I also had to restart the scrape because I realised the behaviour wasn't correct when reattempting to connect to CAZy after a connection timed out: I wasn't convinced it was parsing the page after the connection had previously timed out, which would have led to a partial scrape of CAZy. That's all sorted now, but I don't want it to take a week or more to run.

The scraper does allow the user to add data to an existing database. Therefore, you can apply the assumption when performing the large data scrape, then, if there are specific subsets for which you want to ensure you have, for instance, all the PDB accessions, you can re-scrape just those subsets, adding the data to the database created by the large CAZy scrape. With the logging system in the database, if you share the database with someone, they can see exactly what you did: whether or not you applied the assumption, and how it was applied.
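As a minimal sketch of that re-scrape-and-top-up pattern, again assuming a hypothetical schema (the real database layout may differ):

```python
import sqlite3

def add_missing_pdb_accessions(conn: sqlite3.Connection, genbank_accession: str, pdb_accessions: list[str]) -> None:
    """Top up PDB accessions for a protein already in the database.

    Assumes a unique constraint on (protein_id, pdb_accession) so that
    INSERT OR IGNORE skips rows the large scrape already stored.
    """
    row = conn.execute(
        "SELECT protein_id FROM proteins WHERE genbank_accession = ?",
        (genbank_accession,),
    ).fetchone()
    if row is None:
        return  # not in the existing database; nothing to top up
    for pdb in pdb_accessions:
        conn.execute(
            "INSERT OR IGNORE INTO pdbs (protein_id, pdb_accession) VALUES (?, ?)",
            (row[0], pdb),
        )
    conn.commit()
```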

It's a poor workaround for speeding up the scraper. But it gets the GenBank accessions I need for pyrewton. Longer-term, a better solution for increasing the rate of performing large scrapes of CAZy that doesn't have a potentially negative effect on data integrity might be worth looking into.

Overall I suspect the web/network IO is the slow step

It certainly is when the CAZy server is having a slow day: it can take 10 or more attempts, each with the connection timing out at 45 seconds, to reach CAZy.
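The retry loop amounts to something like this sketch (the real logic lives in cazy_webscraper; only the 45 s timeout is taken from above, the rest is illustrative):

```python
import time
import requests

def get_cazy_page(url: str, retries: int = 10, timeout: int = 45) -> str:
    """Fetch a CAZy page, retrying when the connection times out."""
    last_err = None
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return response.text  # ensure the page is parsed even after earlier timeouts
        except requests.RequestException as err:
            last_err = err
            time.sleep(min(2 ** attempt, 60))  # back off before retrying
    raise ConnectionError(f"Could not reach {url} after {retries} attempts") from last_err
```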


widdowquinn commented on June 23, 2024

I haven't been able to find an instance in CAZy that breaks the rule. In theory, there shouldn't be one, as the PDB, UniProt and GenBank accessions and EC numbers should all be associated with one another, or at least with a single protein record, which should then be presented in exactly the same way each time it appears in a CAZy protein table. But that relies on a lot of assumptions about the CAZy schema.

I think you'd be able to divine this from your scraped data, if you scraped a "raw" version of the website tables.

It's probably OK to assume that all the data is consistent, but as it's an assumption, it should be clearly stated (and ideally it would be checked, too…)
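For example, with the raw rows saved one per (protein, family) appearance, the check could be as simple as this sketch (column names are hypothetical placeholders):

```python
import pandas as pd

# One row per appearance of a protein in a family table.
raw = pd.read_csv("raw_cazy_rows.csv")  # hypothetical dump of the raw scrape

fields = ["uniprot", "pdb", "ec_numbers", "organism"]
per_protein = raw.groupby("genbank_accession")[fields].nunique()

# Any protein whose associated data differs between family tables breaks
# the consistency assumption.
inconsistent = per_protein[(per_protein > 1).any(axis=1)]
print(f"{len(inconsistent)} of {len(per_protein)} proteins inconsistent")
```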

With that in mind, I'm kind of annoyed with myself that I didn't suggest scraping the site in its entirety as it stands, as a local version on-disk so that you don't have to wait for network connections while developing the database and other logic :(

