Comments (30)
Thanks everyone for your help!!
from spectral-workbench.
And I did all the /contributors?page=<n> URLs yesterday, too, just not via sheets. I had an old batch script I wrote when the Save Page Now 2 API was in beta, and pulled it out for this to see if it worked any better. It’s kinda six of one, half a dozen of the other.
from spectral-workbench.
Update: tracking URL for the spectrums_080000-159998 sheet I just submitted is https://archive.org/services/wayback-gsheets/check?job_id=5070d061-49ec-494e-b8bb-64ebeb7a4e8b&google_sheet_url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1hV0R_d9iX0TrFd1hmCFQoQ-McgVnLZDlSd8jo4m49Rc%2Fedit%23gid%3D452795522
from spectral-workbench.
Update: tracking URL for the spectrums_159999-239997 sheet I just submitted is https://archive.org/services/wayback-gsheets/check?job_id=072cb2f7-6bfa-49c6-8501-a12063981c2b&google_sheet_url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1XBPHX92QuSAJz_Nc1d5FIieYKgoKoPDhMT4QfpA3_LE%2Fedit%23gid%3D325879669
from spectral-workbench.
Started a second round on the rows that failed from spectrums_080000-159998. There are 493 that were 5xx errors or that failed because of something in SPN. I did not include a lot of rows that had a 502 error accessing favicon.ico, since it doesn’t seem critical and was surely captured at a recent time already.
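A rough sketch of how that retry selection could be scripted. It only covers the HTTP 5xx part of the rule (not the other SPN failures), and it assumes the results sheet has been exported as (submitted URL, message) pairs with messages shaped like the ones quoted elsewhere in this thread, e.g. "Service Unavailable for <url> (HTTP status=503)." The favicon message in the usage example is a guess at the format, and `needs_retry` and `rows_to_retry` are made-up names.

```python
import re

# Matches SPN messages of the form quoted in this thread:
#   "Service Unavailable for https://.../spectrums/80 (HTTP status=503)."
MESSAGE_RE = re.compile(r"for (?P<url>\S+) \(HTTP status=(?P<status>\d+)\)")


def needs_retry(message):
    """Return True for a 5xx failure on anything other than favicon.ico."""
    match = MESSAGE_RE.search(message)
    if match is None:
        return False  # not an HTTP failure message this sketch recognizes
    is_server_error = 500 <= int(match["status"]) < 600
    # Assumption: the message names the resource that failed, so a favicon
    # failure shows the favicon URL rather than the page URL.
    is_favicon = match["url"].endswith("/favicon.ico")
    return is_server_error and not is_favicon


def rows_to_retry(rows):
    """rows: iterable of (submitted_url, message) pairs from the results sheet."""
    return [url for url, message in rows if needs_retry(message)]
```

For example, given one row with a 503 on the page itself and one row with a 502 on favicon.ico, only the first URL would come back for resubmission.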
from spectral-workbench.
Happy to help if this is still needed. As a side note, I noticed https://spectralworkbench.org/sets?page=115 seems to get me a 500 error, so it looks like there’s something broken that prevents the last page of results from rendering.
from spectral-workbench.
Here are some Wayback-compatible sheets for the 4 types that don’t need a DB dump. I broke up the spectrums into batches of 10,000, but did it with a quick script, so I can re-do them easily if another batch size seems reasonable.
https://drive.google.com/drive/folders/1q3p6k5Q5fy0KqxFy-Tav20_LZSVAIUx7?usp=sharing
from spectral-workbench.
does 114 work?
Yep, works fine.
can you allow anyone to be editor?
Done. Didn’t think of this before, sorry!
If you're able to re-batch the spectrums into 79,999-long batches i think that's the ideal size?
Also done! In the same folder:
- spectrums_000001-079999 (You’re already doing this one, though)
- spectrums_080000-159998 (Just started this on my archive.org account)
- spectrums_159999-239997
- spectrums_239998-251374
I also ran the sheet for all the /sets/<n> URLs last night.
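For anyone who wants to redo the batching, here is a minimal sketch of a script that would reproduce the sheet names above. It is not the script actually used; the function names are invented, the last ID (251374) is taken from the final sheet name, and the URL pattern is the /spectrums/<n> one used in this thread.

```python
def batch_ranges(first_id, last_id, batch_size):
    """Yield inclusive (start, end) ID ranges holding at most batch_size IDs."""
    start = first_id
    while start <= last_id:
        end = min(start + batch_size - 1, last_id)
        yield start, end
        start = end + 1


def sheet_name(start, end):
    """Name a batch the way the sheets in the shared folder are named."""
    return f"spectrums_{start:06d}-{end:06d}"


def spectrum_urls(start, end):
    """List the spectrum page URLs for one inclusive ID range."""
    base = "https://spectralworkbench.org/spectrums/"
    return [f"{base}{n}" for n in range(start, end + 1)]


names = [sheet_name(s, e) for s, e in batch_ranges(1, 251374, 79_999)]
print(names)
# ['spectrums_000001-079999', 'spectrums_080000-159998',
#  'spectrums_159999-239997', 'spectrums_239998-251374']
```

Changing the third argument to `batch_ranges` is all it takes to try a different batch size.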
from spectral-workbench.
Am I understanding that @Mr0grog you are seeing much better success yields for your submissions?
I got slightly better results (there were still ~64k that were already captured). I wasn’t worrying about these since, if they were already captured a zillion times in the same day, the job was done.
Re the favicon, is it clear that the rest of the page archived fine?
I’m not totally sure. IIRC from talking with Vangelis (who did the SPN2 rewrite), the capture would still get saved in this situation, but I may be misremembering or things may have changed.
Ah, i'm now getting this: This host has been already captured 100,153.0 times today. Please try again tomorrow. Please email us at "[email protected]" if you would like to discuss this more.
My second run was entirely this. Given how high that number is (!) and the fact that things sped up, it definitely sounds like other folks at the archive or elsewhere are on this.
I’d consider switching gears to taking the lists of all the pages you want and using the CDX index to make sure they’ve been archived. Then just make a smaller list of what’s missing (if anything).
You can use the wayback Python package to do this more easily (disclosure: I’m the maintainer):
import wayback
from datetime import date
client = wayback.WaybackClient()
# List the time, status, and URL captured for everything since 8/10
for record in client.search('https://spectralworkbench.org/*', from_date=date(2022, 8, 10), limit=10_000):
    print(f"{record.timestamp}: ({record.status_code}) {record.url}")
# Outputs:
# 2022-08-18 00:00:20+00:00: (200) https://spectralworkbench.org/
# 2022-08-18 00:00:39+00:00: (503) https://spectralworkbench.org/
# 2022-08-17 19:14:39+00:00: (301) http://spectralworkbench.org/analyze/spectrum/4474
# 2022-08-16 23:27:13+00:00: (None) https://spectralworkbench.org/assets/adapterjs/publish/adapter.min-0c17431f9d1a50badfff11e14667aeda1023bfebbccfc27893d88cb46cbc9687.js
# 2022-08-17 07:26:48+00:00: (200) https://spectralworkbench.org/assets/analyze-ddc787ced325eab2b23f319d4886faa8dbb53581999f65967b30fe0d93fc3527.js
# etc...
Then check that the captures with a 2xx status (>=200 and <300) or a 4xx status (>=400 and <500) cover every URL you are concerned about.
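One way that "make a smaller list of what's missing" step might look, as a sketch rather than a tested tool. The helper names are made up; the only `wayback` call is the same `client.search(...)` shown above, and the comparison ignores http vs. https and trailing slashes, which is an assumption about what should count as the same page. It does not handle other URL variations such as explicit ports.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Drop the scheme and trailing slash so http/https captures compare equal."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.netloc.lower()}{path}{query}"


def is_good_capture(status_code):
    """2xx means the page was saved; 4xx means it is gone (e.g. a deleted spectrum)."""
    return status_code is not None and (
        200 <= status_code < 300 or 400 <= status_code < 500
    )


def find_missing(wanted_urls, records):
    """Return the wanted URLs that have no good capture among the records."""
    captured = {normalize(r.url) for r in records if is_good_capture(r.status_code)}
    return [url for url in wanted_urls if normalize(url) not in captured]


def search_captures(since):
    """Query the CDX index with the same search call shown above."""
    import wayback  # imported lazily so the helpers above work without it

    client = wayback.WaybackClient()
    return client.search(
        "https://spectralworkbench.org/*", from_date=since, limit=10_000
    )
```

Usage would be something like `find_missing(wanted, search_captures(date(2022, 8, 10)))`, where `wanted` is the full list of URLs from the sheets; whatever comes back is the smaller list to resubmit.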
from spectral-workbench.
I see - does 114 work?
Oh excellent. I just made a script to generate the spectrums one - adding the spreadsheet above. Some spectra will have been deleted and won't exist, and I can try running a query to find exactly which do exist... but this is a good backup approach if we don't finish in time. And I should try doing that one in 100k batches i think... or maybe 3x 80k batches.
I see the contributors and sets too, will add links above. HUGE APPRECIATION THANK U!
from spectral-workbench.
Ah, can you allow anyone to be editor? Apparently the script needs that. Or, possibly if you share GSheets privileges with the Archive service it can read them if you begin the process; if you do, please share the in-progress status link which I believe should be world-readable. Maybe try the sets/# collection first?
If you're able to re-batch the spectrums into 79,999-long batches i think that's the ideal size? The limits are listed differently in different places but I believe 80k is the shortest limit I've seen.
Thanks again, this is super helpful!
from spectral-workbench.
I'll begin submitting my own initial 79,999 list of spectra starting at https://spectralworkbench.org/spectrums/1 up to https://spectralworkbench.org/spectrums/79999, since I have that ready to submit.
from spectral-workbench.
Lol fwiw I see this on the first 80k records: Velocity: ~7 rows/min. ETA: ~212 hours.
That's about 9 days!
from spectral-workbench.
So, if you're able to submit the second set of 80k, we'll need all the time we can get! Although, if it takes that long, it means we can likely submit more than one batch at a time, assuming the SpectralWorkbench.org server can handle all the requests.
from spectral-workbench.
Thanks all! I found that in spectrums 0-80000, about 8k succeeded and 72k failed. But, I also learned that Archive Team has been independently trying, so it's possible they started theirs and we hit the server too hard by doubling up? A lot of mine have 502s which is the server responding too slow. You can see the results here, which I sorted by success/fail in tab 2: https://docs.google.com/spreadsheets/d/1-lmKewkurcqQ5sVJbBjIhepcjDm3eNovaJQ3s3TpmaI/edit#gid=1660752805
On the upside, it took only 17 hours, not 9 days.
Am I understanding that @Mr0grog you are seeing much better success yields for your submissions?
Thank you @bensheldon !!!! Appreciate it!
from spectral-workbench.
Re the favicon, is it clear that the rest of the page archived fine? Thanks!!
from spectral-workbench.
I'm marking each comment with 👍 if I've added it to the list above. Let me know if I miss anything!
Also marked items in the list at top which will be gotten by outlinks of spectra, which is one good reason to focus on just getting all the spectra since all other pages are within 1 outlink hop from those.
Thanks everyone!!! ❤️
from spectral-workbench.
Ah, i'm now getting this:
This host has been already captured 100,153.0 times today. Please try again tomorrow. Please email us at "[email protected]" if you would like to discuss this more.
I will email them. You may get similar errors in your batches, if this is a system-wide limit...
from spectral-workbench.
(Note the limit=10_000 doesn’t limit the number of results, just the number of results per page (beyond this, it’ll automatically iterate through every page, so you don’t need to worry about anything other than setting it). You have to set it to something if you want to get all the results in a large set. It’s definitely a design problem that needs fixing, and comes from some funky behavior I didn’t understand originally in Wayback’s APIs. 😞)
from spectral-workbench.
Hi all, just checking in to say:
- i asked folks at the Archive if they could lift the limit, it seems they maybe can...
- i haven't heard back from Archive Team re coordination, yet! I reached out on twitter too.
- haven't gotten clarification yet on if the favicon error means the page was/wasn't archived. i'll try asking that next.
Thanks all!
from spectral-workbench.
Hi all, updating -- MapKnitter is essentially done, just about 50 maps left to double-check. Circling back here today and tomorrow. I just resubmitted the 0-79k for a 3rd pass, since the second had hit the 100k per host limit. Updated above. Hopefully we get a better sense of the yield on this run.
from spectral-workbench.
Only 10k succeeded in that last run on 0-79k (3rd pass), leaving 53k from the first batch. But we hit the cap with ~35k of the 62.9k total requests. That still means that only about 1/3 of the remainder succeeded :-/
If I can get the 1st and 2nd batch down below 40k I'll start combining them. I can also try to narrow the required ones by checking which have been archived successfully; that won't hit our server as hard.
from spectral-workbench.
I sent the second batch latest submission for a check to see what had been already done, perhaps by the Archive Team:
However it's not clear to me that it will create a new tab to show what it completed so we can read/sort it... let's see.
from spectral-workbench.
Just an update that I have about 80k left. I am skipping some that went past the daily host limit but showed 200 Success, even though there's occasionally potential for those not to be a complete backup; I sampled a number of them and they were OK, though.
We are under some pressure to wrap up asap so I am moving as fast as I can. Thanks!
from spectral-workbench.
Only 28k remaining, running now.
from spectral-workbench.
Only 7400 left now!
from spectral-workbench.
1284 left, re-running. Very close.
from spectral-workbench.
OK, I think we got all but 8; unfortunately these seem not to have worked for some reason:
URL | Capture type | | Error
-- | -- | -- | --
https://spectralworkbench.org/spectrums/80 | New capture | | Service Unavailable for https://spectralworkbench.org/spectrums/80 (HTTP status=503).
https://spectralworkbench.org/spectrums/123 | New capture | | Service Unavailable for https://spectralworkbench.org/spectrums/123 (HTTP status=503).
https://spectralworkbench.org/spectrums/143 | New capture | | Internal Server Error for https://spectralworkbench.org/spectrums/143 (HTTP status=500).
https://spectralworkbench.org/spectrums/1208 | New capture | | Service Unavailable for https://spectralworkbench.org/spectrums/1208 (HTTP status=503).
https://spectralworkbench.org/spectrums/1205 | New capture | | Service Unavailable for https://spectralworkbench.org/spectrums/1205 (HTTP status=503).
https://spectralworkbench.org/spectrums/113 | New capture | | Service Unavailable for https://spectralworkbench.org/spectrums/113 (HTTP status=503).
https://spectralworkbench.org/spectrums/114 | New capture | | Service Unavailable for https://spectralworkbench.org/spectrums/114 (HTTP status=503).
https://spectralworkbench.org/spectrums/22 | Already captured | - | Internal Server Error for https://spectralworkbench.org/spectrums/22 (HTTP status=500).
from spectral-workbench.
Going to shut things down as soon as we can now!
from spectral-workbench.