
Comments (30)

jywarren avatar jywarren commented on August 11, 2024 3

Thanks everyone for your help!!

from spectral-workbench.

Mr0grog avatar Mr0grog commented on August 11, 2024 2

And I did all the /contributors?page=<n> URLs yesterday, too, just not via sheets. I had an old batch script I wrote when the Save Page Now 2 API was in beta, and pulled it out for this to see if it worked any better. It’s kinda six of one, half a dozen of the other.


Mr0grog avatar Mr0grog commented on August 11, 2024 2

Update: tracking URL for the spectrums_080000-159998 sheet I just submitted is https://archive.org/services/wayback-gsheets/check?job_id=5070d061-49ec-494e-b8bb-64ebeb7a4e8b&google_sheet_url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1hV0R_d9iX0TrFd1hmCFQoQ-McgVnLZDlSd8jo4m49Rc%2Fedit%23gid%3D452795522


bensheldon avatar bensheldon commented on August 11, 2024 2

Update: tracking URL for the spectrums_159999-239997 sheet I just submitted is https://archive.org/services/wayback-gsheets/check?job_id=072cb2f7-6bfa-49c6-8501-a12063981c2b&google_sheet_url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1XBPHX92QuSAJz_Nc1d5FIieYKgoKoPDhMT4QfpA3_LE%2Fedit%23gid%3D325879669


Mr0grog avatar Mr0grog commented on August 11, 2024 2

Started a second round on the rows that failed from spectrums_080000-159998. There are 493 that were 5xx errors or that failed because of something in SPN. I did not include a lot of rows that had a 502 error accessing favicon.ico, since it doesn’t seem critical and was surely captured at a recent time already.

Tracking URL: https://archive.org/services/wayback-gsheets/check?job_id=713ddd55-36bd-4c3a-8624-52af82558e0d&google_sheet_url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1359MIeGzwH1Chl4R33ZhPjqyBdhDRVjUXhDgPG9L7ac%2Fedit%3Fusp%3Dsharing


Mr0grog avatar Mr0grog commented on August 11, 2024 1

Happy to help if this is still needed. As a side note, I noticed https://spectralworkbench.org/sets?page=115 seems to get me a 500 error, so it looks like there’s something broken that prevents the last page of results from rendering.


Mr0grog avatar Mr0grog commented on August 11, 2024 1

Here are some Wayback-compatible sheets for the 4 types that don’t need a DB dump. I broke up the spectrums into batches of 10,000, but did it with a quick script, so I can re-do them easily if another batch size seems reasonable.

https://drive.google.com/drive/folders/1q3p6k5Q5fy0KqxFy-Tav20_LZSVAIUx7?usp=sharing


Mr0grog avatar Mr0grog commented on August 11, 2024 1

> does 114 work?

Yep, works fine.

> can you allow anyone to be editor?

Done. Didn’t think of this before, sorry!

> If you're able to re-batch the spectrums into 79,999-long batches i think that's the ideal size?

Also done! In the same folder:

I also ran the sheet for all the /sets/<n> URLs last night.


Mr0grog avatar Mr0grog commented on August 11, 2024 1

> Am I understanding that @Mr0grog you are seeing much better success yields for your submissions?

I got slightly better results (there were still ~64k that were already captured). I wasn’t worrying about these since, if they were already captured a zillion times in the same day, the job was done.

> Re the favicon, is it clear that the rest of the page archived fine?

I’m not totally sure. IIRC from talking with Vangelis (who did the SPN2 rewrite), the capture would still get saved in this situation, but I may be misremembering or things may have changed.

> Ah, i'm now getting this: This host has been already captured 100,153.0 times today. Please try again tomorrow. Please email us at "[email protected]" if you would like to discuss this more.

My second run was entirely this. Given how high that number is (!) and the fact that things sped up, it definitely sounds like other folks at the archive or elsewhere are on this.

I’d consider switching gears to taking the lists of all the pages you want and using the CDX index to make sure they’ve been archived. Then just make a smaller list of what’s missing (if anything).

You can use the wayback Python package to do this more easily (disclosure: I’m the maintainer):

import wayback
from datetime import date

client = wayback.WaybackClient()

# List the time, status, and URL captured for everything since 8/10
for record in client.search('https://spectralworkbench.org/*', from_date=date(2022, 8, 10), limit=10_000):
    print(f"{record.timestamp}: ({record.status_code}) {record.url}")

# Outputs:
# 2022-08-18 00:00:20+00:00: (200) https://spectralworkbench.org/
# 2022-08-18 00:00:39+00:00: (503) https://spectralworkbench.org/
# 2022-08-17 19:14:39+00:00: (301) http://spectralworkbench.org/analyze/spectrum/4474
# 2022-08-16 23:27:13+00:00: (None) https://spectralworkbench.org/assets/adapterjs/publish/adapter.min-0c17431f9d1a50badfff11e14667aeda1023bfebbccfc27893d88cb46cbc9687.js
# 2022-08-17 07:26:48+00:00: (200) https://spectralworkbench.org/assets/analyze-ddc787ced325eab2b23f319d4886faa8dbb53581999f65967b30fe0d93fc3527.js
# etc...

And probably check that the ones with status codes >=200 and <300, or >=400 and <500, cover every URL you are concerned about.
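For the "make a smaller list of what's missing" step, here is a minimal sketch. It assumes you have already collected (url, status_code) pairs from a CDX search like the one above. The helper names (`normalize`, `is_good_capture`, `find_missing`) and the sample data are hypothetical, and the normalization is deliberately crude: it only ignores the scheme and a trailing slash.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Crude key for comparing URLs: drop the scheme and any trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.netloc.lower()}{path}{query}"


def is_good_capture(status_code):
    """Count 2xx and 4xx captures as done; None, 3xx, and 5xx do not count."""
    return status_code is not None and (
        200 <= status_code < 300 or 400 <= status_code < 500
    )


def find_missing(wanted_urls, captures):
    """Return the wanted URLs that have no acceptable capture."""
    captured = {normalize(url) for url, status in captures if is_good_capture(status)}
    return [url for url in wanted_urls if normalize(url) not in captured]


# Hypothetical sample data standing in for real CDX search results.
wanted = [f"https://spectralworkbench.org/spectrums/{n}" for n in (1, 2, 3, 4)]
captures = [
    ("https://spectralworkbench.org/spectrums/1", 200),
    ("http://spectralworkbench.org/spectrums/2/", 404),
    ("https://spectralworkbench.org/spectrums/3", 503),
]
missing = find_missing(wanted, captures)
print(missing)
```

In this sample, spectrum 3 only has a 503 capture and spectrum 4 has none, so those two are the ones left to resubmit.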


jywarren avatar jywarren commented on August 11, 2024

I see - does 114 work?

Oh excellent. I just made a script to generate the spectrums one - adding the spreadsheet above. Some spectra will have been deleted and won't exist, and I can try running a query to find exactly which do exist... but this is a good backup approach if we don't finish in time. And I should try doing that one in 100k batches i think... or maybe 3x 80k batches.

I see the contributors and sets too, will add links above. HUGE APPRECIATION THANK U!


jywarren avatar jywarren commented on August 11, 2024

Ah, can you allow anyone to be editor? Apparently the script needs that. Or, possibly if you share GSheets privileges with the Archive service it can read them if you begin the process; if you do, please share the in-progress status link which I believe should be world-readable. Maybe try the sets/# collection first?

If you're able to re-batch the spectrums into 79,999-long batches i think that's the ideal size? The limits are listed differently in different places but I believe 80k is the lowest limit I've seen.

Thanks again, this is super helpful!


jywarren avatar jywarren commented on August 11, 2024

I'll begin submitting my own initial 79,999 list of spectra starting at https://spectralworkbench.org/spectrums/1 up to https://spectralworkbench.org/spectrums/79999, since I have that ready to submit.
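For anyone re-creating these lists, a minimal sketch of splitting the spectrum IDs into 79,999-long batches might look like this. The function names and the `LAST_ID` value are placeholders for illustration (the real highest spectrum ID isn't stated in this thread), and the label format is only modeled on the sheet names mentioned elsewhere in this thread.

```python
def batch_ranges(first_id, last_id, batch_size=79_999):
    """Yield inclusive (start, end) ID ranges of at most batch_size IDs each."""
    start = first_id
    while start <= last_id:
        end = min(start + batch_size - 1, last_id)
        yield start, end
        start = end + 1


def batch_urls(start, end, base="https://spectralworkbench.org/spectrums/"):
    """Build the list of spectrum URLs for one inclusive ID range."""
    return [f"{base}{n}" for n in range(start, end + 1)]


# LAST_ID is a placeholder, not the real highest spectrum ID.
LAST_ID = 239_997
ranges = list(batch_ranges(1, LAST_ID))
for start, end in ranges:
    print(f"spectrums_{start:06d}-{end:06d}: {end - start + 1} URLs")
```

Each list from `batch_urls` can then be pasted into one sheet, one URL per row.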


jywarren avatar jywarren commented on August 11, 2024

Lol fwiw I see this on the first 80k records: Velocity: ~7 rows/min. ETA: ~212 hours. That's about 9 days!
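As a rough sanity check on that estimate, here is the arithmetic as a small sketch. It assumes a steady rate and 79,999 rows. The displayed figures are rounded, so they don't agree exactly: a flat 7 rows/min works out to roughly 190 hours, while a 212-hour ETA implies closer to 6.3 rows/min.

```python
def eta_hours(rows, rows_per_minute):
    """Estimated hours to process `rows` at a steady `rows_per_minute`."""
    return rows / rows_per_minute / 60


rows = 79_999
hours = eta_hours(rows, 7)
print(f"{hours:.0f} hours, or {hours / 24:.1f} days, at 7 rows/min")

# Rate implied by the 212-hour ETA shown in the status page.
implied_rate = rows / (212 * 60)
print(f"{implied_rate:.1f} rows/min implied by a 212-hour ETA")
```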


jywarren avatar jywarren commented on August 11, 2024

So, if you're able to submit the second set of 80k, we'll need all the time we can get! Although, if it takes that long, it means we can likely submit more than one batch at a time, assuming the SpectralWorkbench.org server can handle all the requests.


jywarren avatar jywarren commented on August 11, 2024

Thanks all! I found that in spectrums 0-80000, about 8k succeeded and 72k failed. But I also learned that Archive Team has been independently trying, so it's possible they started theirs and we hit the server too hard by doubling up? A lot of mine have 502s, which suggests the server was overloaded or responding too slowly. You can see the results here, which I sorted by success/fail in tab 2: https://docs.google.com/spreadsheets/d/1-lmKewkurcqQ5sVJbBjIhepcjDm3eNovaJQ3s3TpmaI/edit#gid=1660752805

On the upside, it took only 17 hours, not 9 days.

Am I understanding that @Mr0grog you are seeing much better success yields for your submissions?

Thank you @bensheldon !!!! Appreciate it!


jywarren avatar jywarren commented on August 11, 2024

Re the favicon, is it clear that the rest of the page archived fine? Thanks!!


jywarren avatar jywarren commented on August 11, 2024

I'm marking each comment with 👍 if I've added it to the list above. Let me know if I miss anything!

Also marked items in the list at top which will be captured via outlinks from spectra, which is one good reason to focus on just getting all the spectra, since all other pages are within 1 outlink hop from those.

Thanks everyone!!! ❤️


jywarren avatar jywarren commented on August 11, 2024

Ah, i'm now getting this:

This host has been already captured 100,153.0 times today. Please try again tomorrow. Please email us at "[email protected]" if you would like to discuss this more.

I will email them. You may get similar errors in your batches, if this is a system-wide limit...


Mr0grog avatar Mr0grog commented on August 11, 2024

(Note that limit=10_000 doesn’t cap the total number of results, just the number of results per page. Beyond that, the client automatically iterates through every page, so you don’t need to worry about anything other than setting it. You have to set it to something if you want to get all the results in a large set. It’s definitely a design problem that needs fixing, and comes from some funky behavior I didn’t understand originally in Wayback’s APIs. 😞)


jywarren avatar jywarren commented on August 11, 2024

Hi all, just checking in to say:

  1. i asked folks at the Archive if they could lift the limit, it seems they maybe can...
  2. i haven't heard back from Archive Team re coordination, yet! I reached out on twitter too.
  3. haven't gotten clarification yet on if the favicon error means the page was/wasn't archived. i'll try asking that next.

Thanks all!


jywarren avatar jywarren commented on August 11, 2024

Hi all, updating -- MapKnitter is essentially done, just about 50 maps left to double-check. Circling back here today and tomorrow. I just resubmitted the 0-79k for a 3rd pass, since the second had hit the 100k per host limit. Updated above. Hopefully we get a better sense of the yield on this run.


jywarren avatar jywarren commented on August 11, 2024

Only 10k succeeded in that last run on 0-79k (3rd pass), leaving 53k from the first batch. But we hit the cap with ~35k of the 62.9k total requests. That still means that only about 1/3 of the remainder succeeded :-/

If I can get the 1st and 2nd batch down below 40k I'll start combining them. I can also try to narrow the required ones by checking which have been archived successfully; that won't hit our server as hard.


jywarren avatar jywarren commented on August 11, 2024

I sent the latest submission of the second batch for a check to see what had already been done, perhaps by the Archive Team:

https://archive.org/services/wayback-gsheets/check?job_id=d885c81f-5849-4083-bdd8-70bec5ac2528&google_sheet_url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1359MIeGzwH1Chl4R33ZhPjqyBdhDRVjUXhDgPG9L7ac%2Fedit%23gid%3D452795522

https://docs.google.com/spreadsheets/d/1359MIeGzwH1Chl4R33ZhPjqyBdhDRVjUXhDgPG9L7ac/edit#gid=452795522

However it's not clear to me that it will create a new tab to show what it completed so we can read/sort it... let's see.


jywarren avatar jywarren commented on August 11, 2024

Just an update that I have about 80k left. I am skipping some that went past the daily host limit but still showed 200 Success, even though there's occasionally potential for those not to be a complete backup; I sampled a number of them and they were OK, though.

We are under some pressure to wrap up asap so I am moving as fast as I can. Thanks!


jywarren avatar jywarren commented on August 11, 2024

Only 28k remaining, running now.


jywarren avatar jywarren commented on August 11, 2024

Only 7400 left now!


jywarren avatar jywarren commented on August 11, 2024

1284 left, re-running. Very close.


jywarren avatar jywarren commented on August 11, 2024

77 left: https://docs.google.com/spreadsheets/d/1hCpccXa3xmH4D09jufccZtY7WH6mzoEkBJts50__JKE/edit#gid=273693883

https://archive.org/services/wayback-gsheets/check?job_id=e0236c65-bad1-4f6a-bbf6-ab8c0b65943f&google_sheet_url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1hCpccXa3xmH4D09jufccZtY7WH6mzoEkBJts50__JKE%2Fedit%3Fusp%3Dsharing


jywarren avatar jywarren commented on August 11, 2024

OK, I think we got all but 8; unfortunately these seem not to have worked for some reason:

https://spectralworkbench.org/spectrums/80 | New capture |   | Service Unavailable for https://spectralworkbench.org/spectrums/80 (HTTP status=503).
https://spectralworkbench.org/spectrums/123 | New capture |   | Service Unavailable for https://spectralworkbench.org/spectrums/123 (HTTP status=503).
https://spectralworkbench.org/spectrums/143 | New capture |   | Internal Server Error for https://spectralworkbench.org/spectrums/143 (HTTP status=500).
https://spectralworkbench.org/spectrums/1208 | New capture |   | Service Unavailable for https://spectralworkbench.org/spectrums/1208 (HTTP status=503).
https://spectralworkbench.org/spectrums/1205 | New capture |   | Service Unavailable for https://spectralworkbench.org/spectrums/1205 (HTTP status=503).
https://spectralworkbench.org/spectrums/113 | New capture |   | Service Unavailable for https://spectralworkbench.org/spectrums/113 (HTTP status=503).
https://spectralworkbench.org/spectrums/114 | New capture |   | Service Unavailable for https://spectralworkbench.org/spectrums/114 (HTTP status=503).
https://spectralworkbench.org/spectrums/22 | Already captured | - | Internal Server Error for https://spectralworkbench.org/spectrums/22 (HTTP status=500).
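If it helps with sorting out leftovers like these, here is a minimal sketch that pulls the HTTP status out of each result message and picks the rows worth retrying. It assumes rows are pipe-separated as pasted above and that the message always ends with the "(HTTP status=NNN)" pattern seen here; `parse_row` is a hypothetical helper, not part of any Wayback tool.

```python
import re
from collections import Counter

# Matches the "(HTTP status=503)" fragment seen in the result messages.
STATUS_RE = re.compile(r"HTTP status=(\d{3})")


def parse_row(line):
    """Split one pipe-separated result row into (url, capture_state, http_status)."""
    cells = [cell.strip() for cell in line.split("|")]
    url, state, message = cells[0], cells[1], cells[-1]
    match = STATUS_RE.search(message)
    return url, state, int(match.group(1)) if match else None


rows = [
    "https://spectralworkbench.org/spectrums/80 | New capture |   | Service Unavailable for https://spectralworkbench.org/spectrums/80 (HTTP status=503).",
    "https://spectralworkbench.org/spectrums/143 | New capture |   | Internal Server Error for https://spectralworkbench.org/spectrums/143 (HTTP status=500).",
    "https://spectralworkbench.org/spectrums/22 | Already captured | - | Internal Server Error for https://spectralworkbench.org/spectrums/22 (HTTP status=500).",
]
parsed = [parse_row(row) for row in rows]

# Tally failures by status code, then keep only 5xx rows for another pass.
by_status = Counter(status for _, _, status in parsed)
retry = [url for url, _, status in parsed if status is not None and status >= 500]
print(by_status)
print(retry)
```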


jywarren avatar jywarren commented on August 11, 2024

Going to shut things down as soon as we can now!

