
Comments (6)

MartinKolarik commented on June 3, 2024

Since we only need to list files in the root of the repo, it seems all the requests for different name variations could be replaced with a single call like this: https://api.github.com/repos/algolia/npm-search/git/trees/HEAD, reducing the number of requests from 19 to 1 per repo. I'm not sure what the regular rate of processed updates per hour is, though (@bodinsamuel, I'm sure you have stats for this), and the 5k API limit would probably still be too low, even though this approach would be a lot more efficient.
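To make the idea concrete, here is a minimal sketch of the single-request approach. The function names (`listRootFiles`, `findChangelog`) and the filename pattern are mine for illustration, not npm-search's actual code:

```typescript
// Matches common changelog filename variations; illustrative only,
// not npm-search's real candidate list.
const CHANGELOG_RE = /^(changelog|changes|history|release-?notes)(\.(md|markdown|txt))?$/i;

export function findChangelog(rootFiles: string[]): string | undefined {
  return rootFiles.find((name) => CHANGELOG_RE.test(name));
}

// One Git Trees API call per repo instead of one request per candidate name.
export async function listRootFiles(owner: string, repo: string): Promise<string[]> {
  const res = await fetch(`https://api.github.com/repos/${owner}/${repo}/git/trees/HEAD`);
  if (!res.ok) throw new Error(`GitHub API returned ${res.status}`);
  const body = (await res.json()) as { tree: Array<{ path: string; type: string }> };
  // "blob" entries are files; "tree" entries are directories.
  return body.tree.filter((e) => e.type === "blob").map((e) => e.path);
}
```

Something like `findChangelog(await listRootFiles("algolia", "npm-search"))` would then replace the 19 per-name probes with a single API request counted against the quota.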

A while back, we were actually able to obtain an increased limit on API calls for jsDelivr services (but only for the API, not the raw endpoint), so if the requirement for the API approach were only a little over the regular limit, this could be the way to go. Another option would be combining the approaches: using up to 5k requests via the API and falling back to brute force only after the API limit is reached.

from npm-search.

Haroenv commented on June 3, 2024

Is there a different avenue that allows us to access multiple files at once, so we'd go from multiple requests per package to only one? https://github.com/algolia/npm-search/blob/master/src/changelog.ts#L137 Would that reduce the rate enough to be acceptable, @bk2204?

Thanks for contacting us in advance!


bodinsamuel commented on June 3, 2024

Hey, thanks for proactively reaching out.
The tool indexes the npm registry into Algolia and provides the data to many consumers for free (CodeSandbox, Yarn, jsDelivr, etc.).
From time to time we need to do a full reindex, for example because the schema changed.
Reindexing more than 2 million packages with additional information like the changelog takes a very long time.

Is there a different avenue that does allow us to access multiple files at once, so we'd go from multiple files per package to only one?

We already optimised this by querying jsDelivr to get the file list and fetching the changelog directly. But there are always packages that are not yet on jsDelivr, for which we don't know where the changelog is.
Ideally, it would be great to have our IP in an allowlist, or an alternative bucket to hit.

I'll see what I can do about handling the rate limit, but since we don't want to index partial information, we'd have to pause the entire indexing for 30 minutes each time, which adds up pretty quickly in our process.


NB: @Haroenv the changelog is really there for the Yarn frontend, and it has always bothered me to store that much info in the JSON. If we removed it entirely, it would be even better. wdyt?


bk2204 commented on June 3, 2024

You can use the API to get access to those files, but you're also going to be limited to 5000 requests per hour (if authenticated), just like you would be to the raw file endpoints. If you needed access to an entire revision, you could use shallow clone, partial clone, or a tarball, but I don't think the demand is sufficient in your case to request an entire full revision, since you probably just need a few files. Pulling a full revision in that case would likely produce worse results for both you and us.
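For completeness, here is a hedged sketch of the "partial clone" option mentioned above: a blobless, shallow clone transfers only commit and tree metadata up front and defers file contents. The helper below just assembles the git arguments; the function name is mine, not anything from this thread:

```typescript
import { spawnSync } from "node:child_process";

// Build arguments for a shallow, blobless clone:
//   --depth=1 limits history to the latest commit,
//   --filter=blob:none defers blob (file content) downloads until checkout.
export function partialCloneArgs(repoUrl: string, dir: string): string[] {
  return ["clone", "--depth=1", "--filter=blob:none", repoUrl, dir];
}

// Usage (not executed here): git fetches only the blobs needed for the
// working tree at checkout time.
// spawnSync("git", partialCloneArgs("https://github.com/algolia/npm-search.git", "/tmp/npm-search"));
```

Note that checkout still downloads every blob in the working tree, so this only pays off when a whole revision is actually needed, which matches the caveat above.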

We don't have a way to provide large bulk data requests that isn't rate limited, and that's intentional, since we need to prevent users from tipping over our services.

If you did nothing here, you'd continue to get 429 responses, and your data would be incomplete. The only particular problem we have at this moment is that we're seeing alerts for serving more 429s than we expect. However, it might cause problems in the future, and we'd definitely appreciate a gentler approach here.


MartinKolarik commented on June 3, 2024

@bk2204 I believe this has been partially improved already by not making the requests in some cases, but I'm looking into adding a further rate limit here. Do you have any suggestion as to what number of requests per second to https://raw.githubusercontent.com/ would be acceptable and avoid the 429 responses?


bk2204 commented on June 3, 2024

The number of acceptable requests per hour per IP to that service is 5000. That number includes requests to raw.githubusercontent.com as well as autogenerated archives (tarballs and zipballs). Beyond that, requests may be served a 429, and the client may be automatically blocked for a period (which I believe is 30 minutes) once that happens.

Your 5000 requests can be spread out over the hour or in a short burst over a few minutes; we're not very particular. Since there are 3600 seconds in an hour, that's about 83 per minute if you want to measure it that way.
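The budget described above can be enforced client-side with a small sliding-window limiter. This is a sketch; the class name and API are mine, not anything GitHub or npm-search provides:

```typescript
// Tracks request timestamps and reports how long to wait before the next
// request fits inside the window (e.g. 5000 requests per 3_600_000 ms).
export class SlidingWindowLimiter {
  private timestamps: number[] = [];

  constructor(private readonly limit: number, private readonly windowMs: number) {}

  // Milliseconds to wait before a slot is free (0 = send now).
  delayUntilNextSlot(now: number = Date.now()): number {
    // Drop timestamps that have aged out of the rolling window.
    this.timestamps = this.timestamps.filter((t) => now - t < this.windowMs);
    if (this.timestamps.length < this.limit) return 0;
    // The oldest tracked request expires at timestamps[0] + windowMs.
    return this.timestamps[0] + this.windowMs - now;
  }

  // Call once per request actually sent.
  record(now: number = Date.now()): void {
    this.timestamps.push(now);
  }
}
```

With `new SlidingWindowLimiter(5000, 3_600_000)` this matches the policy described above: short bursts are fine, but never more than 5000 requests in any rolling hour.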

