
Comments (6)

MartinKolarik commented on June 3, 2024

Since we only need to list files in the root of the repo, it seems all the requests for different name variations could be replaced with a single call like this: https://api.github.com/repos/algolia/npm-search/git/trees/HEAD, reducing the number of requests from 19 to 1 per repo. I'm not sure what the regular rate of processed updates per hour is, though (@bodinsamuel, I'm sure you have stats for this), and the 5k API limit would probably still be too low, even though this approach would be a lot more efficient.
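To make the idea concrete, here is a minimal sketch of the single-request approach. The function names (`listRootFiles`, `findChangelog`) and the filename pattern are mine for illustration, not npm-search's actual code:

```typescript
// Matches common changelog filename variations; illustrative only,
// not npm-search's real candidate list.
const CHANGELOG_RE = /^(changelog|changes|history|release-?notes)(\.(md|markdown|txt))?$/i;

export function findChangelog(rootFiles: string[]): string | undefined {
  return rootFiles.find((name) => CHANGELOG_RE.test(name));
}

// One Git Trees API call per repo instead of one request per candidate name.
export async function listRootFiles(owner: string, repo: string): Promise<string[]> {
  const res = await fetch(`https://api.github.com/repos/${owner}/${repo}/git/trees/HEAD`);
  if (!res.ok) throw new Error(`GitHub API returned ${res.status}`);
  const body = (await res.json()) as { tree: Array<{ path: string; type: string }> };
  // "blob" entries are files; "tree" entries are directories.
  return body.tree.filter((e) => e.type === "blob").map((e) => e.path);
}
```

Something like `findChangelog(await listRootFiles("algolia", "npm-search"))` would then replace the 19 per-name probes with a single API request counted against the quota.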

A while back, we were actually able to obtain an increased limit on API calls for jsDelivr services (but only for the API, not the raw endpoint), so if the requirement for the API approach were only a little over the regular limit, this could be the way to go. Another option would be combining the approaches: using up to 5k requests via the API and falling back to brute force only after the API limit is reached.

from npm-search.

Haroenv commented on June 3, 2024

Is there a different avenue that allows us to access multiple files at once, so we'd go from multiple requests per package to only one? https://github.com/algolia/npm-search/blob/master/src/changelog.ts#L137 Would that reduce the rate enough to be acceptable, @bk2204?

Thanks for contacting us in advance!


bodinsamuel commented on June 3, 2024

Hey, thanks for proactively reaching out.
The tool indexes the npm registry into Algolia and provides the data to many consumers for free (CodeSandbox, Yarn, jsDelivr, etc.).
From time to time we need to do a full reindex, for example because the schema changed.
Reindexing more than 2 million packages with additional information like the changelog takes a very long time.

Is there a different avenue that does allow us to access multiple files at once, so we'd go from multiple files per package to only one?

We already optimised this by querying jsDelivr to get the file list and fetching the changelog directly. But there are always packages that are not yet on jsDelivr, for which we don't know where the changelog is.
Ideally, it would be great to have our IP in an allowlist, or an alternative bucket to hit.

I'll see what I can do about handling the rate limit, but since we don't want to index partial information, we'd have to pause the entire indexing for 30 minutes each time, which adds up pretty quickly in our process.


NB: @Haroenv the changelog is really there for the Yarn frontend, and it has always bothered me to store that much info in the JSON. If we removed it entirely, it would be even better. wdyt?


bk2204 commented on June 3, 2024

You can use the API to get access to those files, but you're also going to be limited to 5000 requests per hour (if authenticated), just like you would be to the raw file endpoints. If you needed access to an entire revision, you could use shallow clone, partial clone, or a tarball, but I don't think the demand is sufficient in your case to request an entire full revision, since you probably just need a few files. Pulling a full revision in that case would likely produce worse results for both you and us.
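For completeness, here is a hedged sketch of the "partial clone" option mentioned above: a blobless, shallow clone transfers only commit and tree metadata up front and defers file contents. The helper below just assembles the git arguments; the function name is mine, not anything from this thread:

```typescript
import { spawnSync } from "node:child_process";

// Build arguments for a shallow, blobless clone:
//   --depth=1 limits history to the latest commit,
//   --filter=blob:none defers blob (file content) downloads until checkout.
export function partialCloneArgs(repoUrl: string, dir: string): string[] {
  return ["clone", "--depth=1", "--filter=blob:none", repoUrl, dir];
}

// Usage (not executed here): git fetches only the blobs needed for the
// working tree at checkout time.
// spawnSync("git", partialCloneArgs("https://github.com/algolia/npm-search.git", "/tmp/npm-search"));
```

Note that checkout still downloads every blob in the working tree, so this only pays off when a whole revision is actually needed, which matches the caveat above.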

We don't have a way to provide large bulk data requests that isn't rate limited, and that's intentional, since we need to prevent users from tipping over our services.

If you did nothing here, you'd continue to get 429 responses, and your data would be incomplete. The only particular problem we have at this moment is that we're seeing alerts for serving more 429s than we expect. However, it might cause problems in the future, and we'd definitely appreciate a gentler approach here.


MartinKolarik commented on June 3, 2024

@bk2204 I believe this has been partially improved already by not making the requests in some cases, but I'm looking into adding a further rate limit here. Do you have any suggestion as to what number of requests per second to https://raw.githubusercontent.com/ would be acceptable and avoid the 429 responses?


bk2204 commented on June 3, 2024

The number of acceptable requests per hour per IP to that service is 5000. That number includes requests to raw.githubusercontent.com as well as autogenerated archives (tarballs and zipballs). Beyond that, requests may be served a 429, and the client may be automatically blocked for a period (which I believe is 30 minutes) once that happens.

Your 5000 requests can be spread out over the hour or in a short burst over a few minutes; we're not very particular. Since there are 3600 seconds in an hour, that's about 83 per minute if you want to measure it that way.
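The budget described above can be enforced client-side with a small sliding-window limiter. This is a sketch; the class name and API are mine, not anything GitHub or npm-search provides:

```typescript
// Tracks request timestamps and reports how long to wait before the next
// request fits inside the window (e.g. 5000 requests per 3_600_000 ms).
export class SlidingWindowLimiter {
  private timestamps: number[] = [];

  constructor(private readonly limit: number, private readonly windowMs: number) {}

  // Milliseconds to wait before a slot is free (0 = send now).
  delayUntilNextSlot(now: number = Date.now()): number {
    // Drop timestamps that have aged out of the rolling window.
    this.timestamps = this.timestamps.filter((t) => now - t < this.windowMs);
    if (this.timestamps.length < this.limit) return 0;
    // The oldest tracked request expires at timestamps[0] + windowMs.
    return this.timestamps[0] + this.windowMs - now;
  }

  // Call once per request actually sent.
  record(now: number = Date.now()): void {
    this.timestamps.push(now);
  }
}
```

With `new SlidingWindowLimiter(5000, 3_600_000)` this matches the policy described above: short bursts are fine, but never more than 5000 requests in any rolling hour.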

