
wayback-machine-scraper's Issues

Error with setup

When I run pip install wayback-machine-scraper, I get the following error:

error: command 'D:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\BIN\\x86_amd64\\link.exe' failed with exit status 1158

Would it be possible to include a pre-built wheel file, so we don't have to compile this project ourselves?

Seems to be non-functional

I have to admit I haven't spent any time troubleshooting, but it does look like this doesn't function as is anymore.

wayback-machine-scraper -f 20080623 -t 20080623 news.ycombinator.com
2019-03-21 11:50:11 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: scrapybot)
2019-03-21 11:50:11 [scrapy.utils.log] INFO: Versions: lxml 4.3.2.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.7.2 (default, Feb 21 2019, 17:35:59) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1b  26 Feb 2019), cryptography 2.6.1, Platform Windows-10-10.0.17134-SP0
2019-03-21 11:50:11 [scrapy.crawler] INFO: Overridden settings: {'AUTOTHROTTLE_ENABLED': True, 'AUTOTHROTTLE_START_DELAY': 1, 'AUTOTHROTTLE_TARGET_CONCURRENCY': 10.0, 'LOG_LEVEL': 'INFO', 'USER_AGENT': 'Wayback Machine Scraper/1.0.7 (+https://github.com/sangaline/scrapy-wayback-machine)'}
2019-03-21 11:50:11 [scrapy.extensions.telnet] INFO: Telnet Password: 
2019-03-21 11:50:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.throttle.AutoThrottle']
2019-03-21 11:50:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy_wayback_machine.WaybackMachineMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-03-21 11:50:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-03-21 11:50:11 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-03-21 11:50:11 [scrapy.core.engine] INFO: Spider opened
2019-03-21 11:50:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-03-21 11:50:11 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-03-21 11:50:11 [scrapy.core.scraper] ERROR: Error downloading <GET http://news.ycombinator.com>
Traceback (most recent call last):
  File "c:\tools\anaconda3\envs\wayback\lib\site-packages\twisted\internet\defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "c:\tools\anaconda3\envs\wayback\lib\site-packages\scrapy\core\downloader\middleware.py", line 37, in process_request
    response = yield method(request=request, spider=spider)
  File "c:\tools\anaconda3\envs\wayback\lib\site-packages\scrapy_wayback_machine\__init__.py", line 64, in process_request
    return self.build_cdx_request(request)
  File "c:\tools\anaconda3\envs\wayback\lib\site-packages\scrapy_wayback_machine\__init__.py", line 91, in build_cdx_request
    cdx_url = self.cdx_url_template.format(url=pathname2url(request.url))
  File "c:\tools\anaconda3\envs\wayback\lib\nturl2path.py", line 65, in pathname2url
    raise OSError(error)
OSError: Bad path: http://news.ycombinator.com
2019-03-21 11:50:12 [scrapy.core.engine] INFO: Closing spider (finished)
2019-03-21 11:50:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
 'downloader/exception_type_count/builtins.OSError': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 3, 21, 16, 50, 12, 40860),
 'log_count/ERROR': 1,
 'log_count/INFO': 9,
 'retry/count': 2,
 'retry/max_reached': 1,
 'retry/reason_count/builtins.OSError': 2,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2019, 3, 21, 16, 50, 11, 827644)}
2019-03-21 11:50:12 [scrapy.core.engine] INFO: Spider closed (finished)

Would it be possible to add functionality to download screenshots?

I would like to know whether it would be possible to add functionality that downloads screenshots of each version of a website over time, based on Wayback Machine data. It would be very useful for us, because reviewing screenshots (by a human) is much faster than reviewing HTML.

Following image links

Thanks so much for this scraper. It works so much better than the other wayback scraper tools I've found.

I'm trying to scrape all snapshots of an old site and I've noticed that this scraper doesn't grab images, though the images are definitely stored on archive.org in the Wayback Machine for each snapshot.

Is there any way to get this scraper to also grab images and other linked assets like CSS?
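
A hedged sketch of the kind of thing I mean, in case it helps: a Scrapy spider callback could pull asset URLs (images, stylesheets, scripts) out of each snapshot and request them as well. The class and callback names below are hypothetical and not part of this project.

import scrapy

class AssetFollowingSpider(scrapy.Spider):
    # Illustrative only; the real mirror spider would decide where assets are saved.
    name = 'asset_following_sketch'

    def parse(self, response):
        asset_urls = (
            response.css('img::attr(src)').getall()
            + response.css('link[rel="stylesheet"]::attr(href)').getall()
            + response.css('script::attr(src)').getall()
        )
        for url in asset_urls:
            # response.follow() resolves relative URLs against the page URL.
            yield response.follow(url, callback=self.save_asset)

    def save_asset(self, response):
        # Persisting the asset body is left out; this just confirms the fetch.
        self.logger.info('Fetched asset %s (%d bytes)', response.url, len(response.body))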

Snapshot functionality for a full site at a given time?

Hi, thanks for an interesting and useful project which has helped me make a start on reconstructing a site that would be really useful for a research project. I'm new to scrapy so it's been an interesting way to start learning about that.

I've been trying to make a snapshot of the whole site (or as much of it as is contained in the Wayback Machine) at a particular time, following the instructions here to set the from and to timestamps to the same value. However, when I do this, I only get a very incomplete snapshot of the site. If I open up the from and to range, I get many more pages (but also a lot of snapshots I'm not interested in!).

I've looked at the logic in the filter_snapshots function and it all makes sense: essentially it keeps each snapshot before time_range[0] in a holding variable, initial_snapshot, and if the filtered_snapshots list is still empty when time_range[1] is reached, that snapshot goes into the filtered_snapshots list as the only snapshot.
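
For reference, here is a rough sketch of that logic as I read it. This is my own reconstruction from the description above, not the project's actual source, and it assumes snapshots are records with a comparable 'timestamp' field.

def filter_snapshots(snapshots, time_range):
    # Keep snapshots inside the requested range; if nothing falls inside it,
    # fall back to the last snapshot taken before the range started.
    filtered_snapshots = []
    initial_snapshot = None
    for snapshot in snapshots:
        timestamp = snapshot['timestamp']
        if timestamp < time_range[0]:
            initial_snapshot = snapshot          # held in reserve
        elif timestamp <= time_range[1]:
            filtered_snapshots.append(snapshot)  # inside the requested range
        else:
            break                                # past the end of the range
    if not filtered_snapshots and initial_snapshot is not None:
        filtered_snapshots.append(initial_snapshot)
    return filtered_snapshots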

Have you seen any problems like this before? Possibly related: even if I expand the time range, some pages don't get picked up and I have to re-run with a more specific URL to retrieve some subfolders of the site. The behaviour is consistent between runs, so I don't think it's timing out or anything; it just isn't crawling to those pages for some reason. I've tried setting DEPTH_LIMIT in __main__.py, and when I run the command line it echoes the setting back to me, but that doesn't seem to make any difference.

Compatibility?

It's a sad state of affairs that I have to ask this for projects that don't make it explicitly clear. So, sorry! :)

I'm just wondering whether this is compatible with Python 3? Or would it need some love and contribution to bring it up to date?

Thanks!

Error 429 + Scraper gives up

Many moons ago, the Internet Archive added some rate limiting that also seems to affect the Wayback Machine.

(See the discussion on a similar project here: buren/wayback_archiver#32)

The scraper scrapes too fast and gets IP-banned for 5 minutes by the Wayback Machine.

As a result, all the remaining URLs in the pipeline fail repeatedly; Scrapy gives up on all of them and says "we're done!"

...
2023-11-09 22:09:57 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://web.archive.org/cdx/search/cdx?url=www.example.com/blog/stuff&output=json&fl=timestamp,original,statuscode,digest> (failed 3 times): 429 Unknown Status
2023-11-09 22:09:57 [scrapy.core.engine] INFO: Closing spider (finished)

I see two issues here:

  1. Add a global rate limit (I don't think the concurrency flag covers this?).
    1.b. If we get a 429, increase the delay? (Ideally this should not occur, as the limit appears to be constant? Although this page https://web.archive.org/429.html suggests that the error can occur randomly if Wayback is getting a lot of traffic from other people.)
    Also, if we get a 429, that seems to mean the IP has been banned for 5 minutes, so we should just pause the scraper for that time (making any requests during this time may extend the block?); see the sketch after this list.
  2. (Unnecessary if the previous points are handled?) Increase the retry limit from 3 to something much higher? Again, this may not be needed if we approach scraping with a proper backoff.
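
A rough sketch of the 5-minute pause idea from point 1.b, for discussion: this middleware is hypothetical, not part of the project, and time.sleep() blocks the whole Twisted reactor, so it only makes sense at very low concurrency. A cleaner version would use a deferred delay instead.

import time
from scrapy.downloadermiddlewares.retry import RetryMiddleware

class WaybackBackoffMiddleware(RetryMiddleware):
    # Assumption from this thread: a 429 means the IP is blocked for roughly 5 minutes.
    PAUSE_SECONDS = 5 * 60

    def process_response(self, request, response, spider):
        if response.status == 429:
            spider.logger.warning('Got 429, pausing %s s before retrying', self.PAUSE_SECONDS)
            time.sleep(self.PAUSE_SECONDS)  # blocks the reactor; crude but simple
            return self._retry(request, '429 Too Many Requests', spider) or response
        return super().process_response(request, response, spider)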

TODO:

  1. Find out exactly what the rate limit is: may be 5 per minute, or may be 15 per minute? (A 12 s or 4 s delay, respectively.)
    They seem to have changed it several times. Not sure if there are official numbers.
    https://archive.org/details/toomanyrequests_20191110
    This page says it's 15. It only mentions submitting URLs, but it appears to cover retrievals too.
  2. Find out if this project already does rate limiting. Edit: sort of, but not entirely sufficient for this use case? (e.g. no 5-minute backoff on 429, AutoThrottle does not guarantee <X/minute, etc.)

The project seems to be using Scrapy's AutoThrottle, so the fix may be as simple as updating the start delay and default concurrency in __main__.py:

'AUTOTHROTTLE_START_DELAY': 4, # aiming for 15 per minute

and

parser.add_argument('-c', '--concurrency', default=1.0, help=(

This doesn't seem to be sufficient to limit to 15/minute though, as I am getting mostly >15/min with these settings (and as high as 29 sometimes). But Wayback did not complain, so it seems the limit is higher than that.

More work needed. May report back later.

Edit: the AutoThrottle docs say AUTOTHROTTLE_TARGET_CONCURRENCY represents the average, not the maximum, which means that if Wayback has a hard limit of X req/s, setting X as the target would by definition lead to exceeding that limit about half the time.
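
Given that, a hard cap rather than AutoThrottle might be the safer route. A hedged sketch of settings that would do that, using standard Scrapy setting names (the values are just the 15-per-minute guess from above, and they could be merged into the settings dict that __main__.py already builds):

custom_settings = {
    'AUTOTHROTTLE_ENABLED': False,      # AutoThrottle targets an average, not a ceiling
    'CONCURRENT_REQUESTS': 1,           # one request in flight at a time
    'DOWNLOAD_DELAY': 4,                # a fixed 4 s gap, roughly 15 requests/minute
    'RANDOMIZE_DOWNLOAD_DELAY': False,  # keep the gap constant instead of 0.5x-1.5x
    'RETRY_HTTP_CODES': [429, 500, 502, 503, 504],
    'RETRY_TIMES': 10,                  # raised from the default of 2
}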

'ExecutionEngine' object has no attribute 'schedule'

Command Run: wayback-machine-scraper -f 20231201 -t 20231220 http://breitbart.com/ads.txt

Output:

2024-01-20 09:40:43 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: scrapybot)
2024-01-20 09:40:43 [scrapy.utils.log] INFO: Versions: lxml 4.9.2.0, libxml2 2.9.13, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.9.12 (main, Apr  5 2022, 01:52:34) - [Clang 12.0.0 ], pyOpenSSL 22.0.0 (OpenSSL 1.1.1o  3 May 2022), cryptography 37.0.1, Platform macOS-13.4-arm64-arm-64bit
2024-01-20 09:40:43 [scrapy.addons] INFO: Enabled addons:
[]
2024-01-20 09:40:43 [py.warnings] WARNING: /opt/miniconda3/lib/python3.9/site-packages/scrapy/utils/request.py:254: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

2024-01-20 09:40:43 [scrapy.extensions.telnet] INFO: Telnet Password: b9b190d843dfdca9
2024-01-20 09:40:43 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.throttle.AutoThrottle']
2024-01-20 09:40:43 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
 'AUTOTHROTTLE_START_DELAY': 1,
 'AUTOTHROTTLE_TARGET_CONCURRENCY': 10.0,
 'LOG_LEVEL': 'INFO',
 'USER_AGENT': 'Wayback Machine Scraper/1.0.8 '
               '(+https://github.com/sangaline/scrapy-wayback-machine)'}
2024-01-20 09:40:44 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy_wayback_machine.WaybackMachineMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-01-20 09:40:44 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-01-20 09:40:44 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-01-20 09:40:44 [scrapy.core.engine] INFO: Spider opened
2024-01-20 09:40:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-01-20 09:40:44 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-01-20 09:40:44 [scrapy.core.scraper] ERROR: Error downloading <GET https://web.archive.org/cdx/search/cdx?url=http%3A//breitbart.com/ads.txt&output=json&fl=timestamp,original,statuscode,digest>
Traceback (most recent call last):
  File "/opt/miniconda3/lib/python3.9/site-packages/twisted/internet/defer.py", line 1697, in _inlineCallbacks
    result = context.run(gen.send, result)
  File "/opt/miniconda3/lib/python3.9/site-packages/scrapy/core/downloader/middleware.py", line 68, in process_response
    method(request=request, response=response, spider=spider)
  File "/opt/miniconda3/lib/python3.9/site-packages/scrapy_wayback_machine/__init__.py", line 83, in process_response
    self.crawler.engine.schedule(snapshot_request, spider)
AttributeError: 'ExecutionEngine' object has no attribute 'schedule'
2024-01-20 09:40:44 [scrapy.core.engine] INFO: Closing spider (finished)
2024-01-20 09:40:44 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 370,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 2366,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.54403,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2024, 1, 20, 17, 40, 44, 757760, tzinfo=datetime.timezone.utc),
 'httpcompression/response_bytes': 9831,
 'httpcompression/response_count': 1,
 'log_count/ERROR': 1,
 'log_count/INFO': 10,
 'log_count/WARNING': 1,
 'memusage/max': 69681152,
 'memusage/startup': 69681152,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2024, 1, 20, 17, 40, 44, 213730, tzinfo=datetime.timezone.utc)}
2024-01-20 09:40:44 [scrapy.core.engine] INFO: Spider closed (finished)
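
For what it's worth, ExecutionEngine.schedule() was removed in recent Scrapy releases, so a patch in scrapy_wayback_machine along these lines might work. This is only an untested sketch, not a confirmed fix:

# Around scrapy_wayback_machine/__init__.py line 83 (sketch, untested):
if hasattr(self.crawler.engine, 'schedule'):
    self.crawler.engine.schedule(snapshot_request, spider)  # older Scrapy
else:
    self.crawler.engine.crawl(snapshot_request)             # newer Scrapy without schedule()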

Import Error: No module named request

Hi there

When running this I get the error:

Traceback (most recent call last):
  File "/usr/local/bin/wayback-machine-scraper", line 5, in <module>
    from wayback_machine_scraper.__main__ import main
  File "/usr/local/lib/python2.7/site-packages/wayback_machine_scraper/__main__.py", line 7, in <module>
    from .mirror_spider import MirrorSpider
  File "/usr/local/lib/python2.7/site-packages/wayback_machine_scraper/mirror_spider.py", line 7, in <module>
    from scrapy_wayback_machine import WaybackMachineMiddleware
  File "/usr/local/lib/python2.7/site-packages/scrapy_wayback_machine/__init__.py", line 3, in <module>
    from urllib.request import pathname2url
ImportError: No module named request

It looks like it's pointing to Python 2.7, but when I type "python --version" I get Python 3.7. Shouldn't this run with either, though? Thanks.

Crashes (includes fix)

Since I can't commit to your project, here are two fixes that I had to make in order to get the scraper to run:

In mirror_spider.py, line 50, there is no check that the output path is valid. The URL can contain ? characters, which causes the script to crash.

Here's my solution; it's just a quick fix and may require elaboration for other characters and for Linux/Windows compatibility:

url_parts = response.url.split('://')[1].split('/')
parent_directory = os.path.join(self.directory, *url_parts)
parent_directory = parent_directory.replace("?", "_q_")  # replace '?', which is not allowed in Windows paths

os.makedirs(parent_directory, exist_ok=True)
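
A slightly more general version of that quick fix might cover the other characters Windows rejects. This is purely illustrative (sanitize_path_component is a name I made up), and it assumes the same surrounding context as the snippet above:

import os
import re

def sanitize_path_component(component):
    # Replace characters that are invalid in Windows file names; they are
    # legal on Linux, so this is mainly for cross-platform safety.
    return re.sub(r'[<>:"\\|?*]', '_', component)

url_parts = response.url.split('://')[1].split('/')
parent_directory = os.path.join(
    self.directory, *(sanitize_path_component(part) for part in url_parts)
)
os.makedirs(parent_directory, exist_ok=True)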

There is another bug, in your other project scrapy_wayback_machine (which is imported here), that causes a crash.

It's in __init__.py, line 91:

cdx_url = self.cdx_url_template.format(url=pathname2url(request.url))

At this point, request.url is something like http://website.com.
But pathname2url will look for a colon : and require that anything before it is only one letter (since it is meant for regular file system paths, like C:\mypath).

When I removed the call to pathname2url, it worked for me, but I don't know which other cases may break:

cdx_url = self.cdx_url_template.format(url=request.url)
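
An alternative that keeps the URL percent-encoded without going through pathname2url (which behaves differently on Windows) might be urllib.parse.quote. Again, just a sketch; I haven't tested it against everything the CDX endpoint accepts:

from urllib.parse import quote

# quote() with its default safe='/' matches what pathname2url does on POSIX
# systems, but without the Windows drive-letter check that triggers the crash.
cdx_url = self.cdx_url_template.format(url=quote(request.url))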
