wayback-machine-scraper's Introduction

The Wayback Machine Scraper Logo

The Wayback Machine Scraper

The repository consists of a command-line utility wayback-machine-scraper that can be used to scrape or download website data as it appears in archive.org's Wayback Machine. It crawls through historical snapshots of a website and saves the snapshots to disk. This can be useful when you're trying to scrape a site that has scraping measures that make direct scraping impossible or prohibitively slow. It's also useful if you want to scrape a website as it appeared at some point in the past or to scrape information that changes over time.

The command-line utility is highly configurable in terms of what it scrapes but it only saves the unparsed content of the pages on the site. If you're interested in parsing data from the pages that are crawled then you might want to check out scrapy-wayback-machine instead. It's a downloader middleware that handles all of the tricky parts and passes normal response objects to your Scrapy spiders with archive timestamp information attached. The middleware is very unobtrusive and should work seamlessly with existing Scrapy middlewares, extensions, and spiders. It's what wayback-machine-scraper uses behind the scenes and it offers more flexibility for advanced use cases.
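
As a quick illustration of the middleware route, here is a minimal sketch of a custom spider. It is not part of this repository; the middleware class path, the WAYBACK_MACHINE_TIME_RANGE setting, and the wayback_machine_time meta key are taken from the scrapy-wayback-machine documentation, and the td.title selector is only a guess at the Hacker News markup, so treat it as a starting point rather than a drop-in example.

import scrapy


class HackerNewsSnapshotSpider(scrapy.Spider):
    name = 'hn_snapshots'
    start_urls = ['https://news.ycombinator.com']

    custom_settings = {
        # enable the Wayback Machine downloader middleware
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_wayback_machine.WaybackMachineMiddleware': 5,
        },
        # crawl every snapshot taken during 2007
        'WAYBACK_MACHINE_TIME_RANGE': (20070101, 20080101),
    }

    def parse(self, response):
        # the middleware attaches the snapshot time to the response meta
        snapshot_time = response.meta['wayback_machine_time']
        yield {
            'timestamp': snapshot_time.isoformat(),
            'story_count': len(response.css('td.title').getall()),
        }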

Installation

The package can be installed using pip.

pip install wayback-machine-scraper

Command-Line Interface

Writing a custom Scrapy spider and using the WaybackMachine middleware is the preferred way to use this project, but a command line interface for basic mirroring is also included. The usage information can be printed by running wayback-machine-scraper -h.

usage: wayback-machine-scraper [-h] [-o DIRECTORY] [-f TIMESTAMP]
                               [-t TIMESTAMP] [-a REGEX] [-d REGEX]
                               [-c CONCURRENCY] [-u] [-v]
                               DOMAIN [DOMAIN ...]

Mirror all Wayback Machine snapshots of one or more domains within a specified
time range.

positional arguments:
  DOMAIN                Specify the domain(s) to scrape. Can also be a full
                        URL to specify starting points for the crawler.

optional arguments:
  -h, --help            show this help message and exit
  -o DIRECTORY, --output DIRECTORY
                        Specify the directory where the mirrored snapshots
                        will be saved. (default: website)
  -f TIMESTAMP, --from TIMESTAMP
                        The timestamp for the beginning of the range to
                        scrape. Can either be YYYYmmdd, YYYYmmddHHMMSS, or a
                        Unix timestamp. (default: 10000101)
  -t TIMESTAMP, --to TIMESTAMP
                        The timestamp for the end of the range to scrape. Use
                        the same timestamp as `--from` to specify a single
                        point in time. (default: 30000101)
  -a REGEX, --allow REGEX
                        A regular expression that all scraped URLs must match.
                        (default: ())
  -d REGEX, --deny REGEX
                        A regular expression to exclude matched URLs.
                        (default: ())
  -c CONCURRENCY, --concurrency CONCURRENCY
                        Target concurrency for crawl requests. The crawl rate
                        will be automatically adjusted to match this target.
                        Use values less than 1 to be polite and higher values
                        to scrape more quickly. (default: 10.0)
  -u, --unix            Save snapshots as `UNIX_TIMESTAMP.snapshot` instead of
                        the default `YYYYmmddHHMMSS.snapshot`. (default:
                        False)
  -v, --verbose         Turn on debug logging. (default: False)

Examples

The usage can perhaps be made clearer with a couple of concrete examples.

A Single Page Over Time

One of the key advantages of wayback-machine-scraper over other projects, such as wayback-machine-downloader, is that it offers the capability to download all available archive.org snapshots. This can be extremely useful if you're interested in analyzing how pages change over time.

For example, say that you would like to analyze many snapshots of the Hacker News front page as I did writing Reverse Engineering the Hacker News Algorithm. This can be done by running

wayback-machine-scraper -a 'news.ycombinator.com$' news.ycombinator.com

where the --allow regular expression news.ycombinator.com$ limits the crawl to the front page. This produces a file structure of

website/
└── news.ycombinator.com
    ├── 20070221033032.snapshot
    ├── 20070226001637.snapshot
    ├── 20070405032412.snapshot
    ├── 20070405175109.snapshot
    ├── 20070406195336.snapshot
    ├── 20070601184317.snapshot
    ├── 20070629033202.snapshot
    ├── 20070630222527.snapshot
    ├── 20070630222818.snapshot
    └── etc.

with each snapshot file containing the full HTML body of the front page.
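
For the analysis step, here is a small sketch of how these files might be read back in chronological order; it assumes the default YYYYmmddHHMMSS.snapshot naming and the website/ output directory shown above, and uses nothing beyond the Python standard library.

from datetime import datetime
from pathlib import Path

snapshot_dir = Path('website/news.ycombinator.com')

# collect (timestamp, html) pairs in chronological order
snapshots = []
for path in sorted(snapshot_dir.glob('*.snapshot')):
    timestamp = datetime.strptime(path.stem, '%Y%m%d%H%M%S')
    snapshots.append((timestamp, path.read_text(errors='replace')))

for timestamp, html in snapshots:
    print(timestamp.isoformat(), len(html))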

A series of snapshots for any page can be obtained in this way as long as suitable regular expressions and start URLs are constructed. If we are interested in a page other than the homepage then we should use it as the start URL instead. To get all of the snapshots for a specific story we could run

wayback-machine-scraper -a 'id=13857086$' 'news.ycombinator.com/item?id=13857086'

which produces

website/
└── news.ycombinator.com
    └── item?id=13857086
        ├── 20170313225853.snapshot
        ├── 20170313231755.snapshot
        ├── 20170314043150.snapshot
        ├── 20170314165633.snapshot
        └── 20170320205604.snapshot

A Full Site Crawl at One Point In Time

If the goal is to take a snapshot of an entire site at once then this can also be easily achieved. Specifying both the --from and --to options as the same point in time will ensure that only one snapshot is saved for each URL. Running

wayback-machine-scraper -f 20080623 -t 20080623 news.ycombinator.com

produces a file structure of

website
└── news.ycombinator.com
    ├── 20080621143814.snapshot
    ├── item?id=221868
    │   └── 20080622151531.snapshot
    ├── item?id=222157
    │   └── 20080622151822.snapshot
    ├── item?id=222341
    │   └── 20080620221102.snapshot
    └── etc.

with a single snapshot for each page in the crawl as it appeared on June 23, 2008.

wayback-machine-scraper's Issues

Compatibility?

It's a sad state of affairs that I have to ask this for projects that don't make it explicitly clear. So, sorry! :)

I'm just wondering whether this is compatible with Python 3? Or, would it need some love and contribution to bring it up to date?

Thanks!

Error 429 + Scraper gives up

Many moons ago, Internet Archive added some rate limiting that seems to also affect Wayback Machine.

( See discussion on similar project here buren/wayback_archiver#32 )

The scraper scrapes too fast, and gets IP banned for 5 minutes by Wayback Machine.

As a result, all the remaining URLs in the pipeline fail repeatedly; Scrapy gives up on all of them and says "we're done!"

...
2023-11-09 22:09:57 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://web.archive.org/cdx/search/cdx?url=www.example.com/blog/stuff&output=json&fl=timestamp,original,statuscode,digest> (failed 3 times): 429 Unknown Status
2023-11-09 22:09:57 [scrapy.core.engine] INFO: Closing spider (finished)

I see two issues here:

  1. Add a global rate limit (I don't think the concurrency flag covers this?)
    1.b. If we get a 429, increase the delay? (Ideally should not occur, as the limit appears to be constant? Although this page https://web.archive.org/429.html suggests that the error can occur randomly if Wayback is getting a lot of traffic from other people.)
    Also, if we get a 429, that seems to mean the IP has been banned for 5 minutes, so we should just pause the scraper for that time? (Making any requests during this time may possibly extend the block?)
  2. (Unnecessary if the previous points are handled?) Increase the retry limit from 3 to something much higher? Again, this may not be needed if we approach scraping with a "backoff" strategy.

TODO:

  1. Find out exactly what the rate limit is: May be 5 per minute, or may be 15 per minute? (12 or 4s delay respectively.)
    They seem to have changed it several times. Not sure if there are official numbers.
    https://archive.org/details/toomanyrequests_20191110
    This page says it's 15. It only mentions submitting URLs, but it appears to cover retrievals too.
  2. Find out if this project already does rate limiting. Edit: Sorta, but not entirely sufficient for this use case? (e.g. no 5-minute backoff on 429, autothrottle does not guarantee <X/minute, etc.)

Seems to be using Scrapy's autothrottle, so the fix may be as simple as updating the start delay and default concurrency:
__main__.py

'AUTOTHROTTLE_START_DELAY': 4, # aiming for 15 per minute

and

parser.add_argument('-c', '--concurrency', default=1.0, help=(

This doesn't seem to be sufficient to limit to 15/minute though, as I am getting mostly >15/min with these settings (and as high as 29 sometimes). But Wayback did not complain, so it seems the limit is higher than that.

More work needed. May report back later.

Edit: AutoThrottle docs say AUTOTHROTTLE_TARGET_CONCURRENCY represents the average, not the maximum, which means that if Wayback has a hard limit of X req/sec, setting X as the target would by definition lead to exceeding that limit roughly half the time.
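
For reference, here is a sketch of the kind of settings that could be passed to the CrawlerProcess in __main__.py to enforce a fixed, conservative crawl rate. These are all standard Scrapy settings rather than options this project currently exposes, the numbers are guesses based on the limits discussed above, and a true "pause five minutes on 429" behaviour would still need a custom retry middleware.

# candidate settings for a fixed crawl rate (untested sketch)
RATE_LIMIT_SETTINGS = {
    'AUTOTHROTTLE_ENABLED': False,   # use a fixed delay instead of a moving average
    'CONCURRENT_REQUESTS': 1,        # one request in flight at a time
    'DOWNLOAD_DELAY': 5,             # ~12 requests per minute, below the reported limits
    'RETRY_TIMES': 10,               # keep retrying through a temporary block
    'RETRY_HTTP_CODES': [429, 500, 502, 503, 504],
}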

Error with setup

when I run pip install wayback-machine-scraper

I get the error

error: command 'D:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\BIN\\x86_amd64\\link.exe' failed with exit status 1158

Is it possible for you to include a pre-built wheel file, so we don't have to compile this project ourselves?

Import Error: No module named request

Hi there

When running this I get the error:

Traceback (most recent call last):
  File "/usr/local/bin/wayback-machine-scraper", line 5, in <module>
    from wayback_machine_scraper.__main__ import main
  File "/usr/local/lib/python2.7/site-packages/wayback_machine_scraper/__main__.py", line 7, in <module>
    from .mirror_spider import MirrorSpider
  File "/usr/local/lib/python2.7/site-packages/wayback_machine_scraper/mirror_spider.py", line 7, in <module>
    from scrapy_wayback_machine import WaybackMachineMiddleware
  File "/usr/local/lib/python2.7/site-packages/scrapy_wayback_machine/__init__.py", line 3, in <module>
    from urllib.request import pathname2url
ImportError: No module named request

It looks like it's pointing to Python 2.7...but when I type "python --version" I get Python 3.7. Shouldn't this run with either though? Thanks.
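
A likely explanation (an assumption, not confirmed by the maintainer): the wayback-machine-scraper entry-point script was installed by Python 2.7's pip, so its shebang points at that interpreter regardless of what python resolves to on your PATH. Reinstalling the package with the Python 3 interpreter's pip should regenerate the script with a Python 3 shebang, for example:

python3 -m pip install --force-reinstall wayback-machine-scraper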

'ExecutionEngine' object has no attribute 'schedule'

Command Run: wayback-machine-scraper -f 20231201 -t 20231220 http://breitbart.com/ads.txt

Output:

2024-01-20 09:40:43 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: scrapybot)
2024-01-20 09:40:43 [scrapy.utils.log] INFO: Versions: lxml 4.9.2.0, libxml2 2.9.13, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.9.12 (main, Apr  5 2022, 01:52:34) - [Clang 12.0.0 ], pyOpenSSL 22.0.0 (OpenSSL 1.1.1o  3 May 2022), cryptography 37.0.1, Platform macOS-13.4-arm64-arm-64bit
2024-01-20 09:40:43 [scrapy.addons] INFO: Enabled addons:
[]
2024-01-20 09:40:43 [py.warnings] WARNING: /opt/miniconda3/lib/python3.9/site-packages/scrapy/utils/request.py:254: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

2024-01-20 09:40:43 [scrapy.extensions.telnet] INFO: Telnet Password: b9b190d843dfdca9
2024-01-20 09:40:43 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.throttle.AutoThrottle']
2024-01-20 09:40:43 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
 'AUTOTHROTTLE_START_DELAY': 1,
 'AUTOTHROTTLE_TARGET_CONCURRENCY': 10.0,
 'LOG_LEVEL': 'INFO',
 'USER_AGENT': 'Wayback Machine Scraper/1.0.8 '
               '(+https://github.com/sangaline/scrapy-wayback-machine)'}
2024-01-20 09:40:44 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy_wayback_machine.WaybackMachineMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-01-20 09:40:44 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-01-20 09:40:44 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-01-20 09:40:44 [scrapy.core.engine] INFO: Spider opened
2024-01-20 09:40:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-01-20 09:40:44 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-01-20 09:40:44 [scrapy.core.scraper] ERROR: Error downloading <GET https://web.archive.org/cdx/search/cdx?url=http%3A//breitbart.com/ads.txt&output=json&fl=timestamp,original,statuscode,digest>
Traceback (most recent call last):
  File "/opt/miniconda3/lib/python3.9/site-packages/twisted/internet/defer.py", line 1697, in _inlineCallbacks
    result = context.run(gen.send, result)
  File "/opt/miniconda3/lib/python3.9/site-packages/scrapy/core/downloader/middleware.py", line 68, in process_response
    method(request=request, response=response, spider=spider)
  File "/opt/miniconda3/lib/python3.9/site-packages/scrapy_wayback_machine/__init__.py", line 83, in process_response
    self.crawler.engine.schedule(snapshot_request, spider)
AttributeError: 'ExecutionEngine' object has no attribute 'schedule'
2024-01-20 09:40:44 [scrapy.core.engine] INFO: Closing spider (finished)
2024-01-20 09:40:44 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 370,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 2366,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.54403,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2024, 1, 20, 17, 40, 44, 757760, tzinfo=datetime.timezone.utc),
 'httpcompression/response_bytes': 9831,
 'httpcompression/response_count': 1,
 'log_count/ERROR': 1,
 'log_count/INFO': 10,
 'log_count/WARNING': 1,
 'memusage/max': 69681152,
 'memusage/startup': 69681152,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2024, 1, 20, 17, 40, 44, 213730, tzinfo=datetime.timezone.utc)}
2024-01-20 09:40:44 [scrapy.core.engine] INFO: Spider closed (finished)
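
The traceback points at the scrapy_wayback_machine middleware rather than this repository: ExecutionEngine.schedule() no longer exists in recent Scrapy releases (2.11 in the log above). A possible local patch, offered only as a sketch and worth checking against both your Scrapy version and newer releases of scrapy-wayback-machine, is to replace the removed call with engine.crawl(), which on recent Scrapy takes just the request:

# scrapy_wayback_machine/__init__.py, inside process_response()
# old call that fails on modern Scrapy:
#     self.crawler.engine.schedule(snapshot_request, spider)
# possible replacement:
self.crawler.engine.crawl(snapshot_request)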

Seems to be non functional

I have to admit I haven't spent any time troubleshooting, but it does look like this doesn't function as is anymore.

wayback-machine-scraper -f 20080623 -t 20080623 news.ycombinator.com
2019-03-21 11:50:11 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: scrapybot)
2019-03-21 11:50:11 [scrapy.utils.log] INFO: Versions: lxml 4.3.2.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.7.2 (default, Feb 21 2019, 17:35:59) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1b  26 Feb 2019), cryptography 2.6.1, Platform Windows-10-10.0.17134-SP0
2019-03-21 11:50:11 [scrapy.crawler] INFO: Overridden settings: {'AUTOTHROTTLE_ENABLED': True, 'AUTOTHROTTLE_START_DELAY': 1, 'AUTOTHROTTLE_TARGET_CONCURRENCY': 10.0, 'LOG_LEVEL': 'INFO', 'USER_AGENT': 'Wayback Machine Scraper/1.0.7 (+https://github.com/sangaline/scrapy-wayback-machine)'}
2019-03-21 11:50:11 [scrapy.extensions.telnet] INFO: Telnet Password: 
2019-03-21 11:50:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.throttle.AutoThrottle']
2019-03-21 11:50:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy_wayback_machine.WaybackMachineMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-03-21 11:50:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-03-21 11:50:11 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-03-21 11:50:11 [scrapy.core.engine] INFO: Spider opened
2019-03-21 11:50:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-03-21 11:50:11 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-03-21 11:50:11 [scrapy.core.scraper] ERROR: Error downloading <GET http://news.ycombinator.com>
Traceback (most recent call last):
  File "c:\tools\anaconda3\envs\wayback\lib\site-packages\twisted\internet\defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "c:\tools\anaconda3\envs\wayback\lib\site-packages\scrapy\core\downloader\middleware.py", line 37, in process_request
    response = yield method(request=request, spider=spider)
  File "c:\tools\anaconda3\envs\wayback\lib\site-packages\scrapy_wayback_machine\__init__.py", line 64, in process_request
    return self.build_cdx_request(request)
  File "c:\tools\anaconda3\envs\wayback\lib\site-packages\scrapy_wayback_machine\__init__.py", line 91, in build_cdx_request
    cdx_url = self.cdx_url_template.format(url=pathname2url(request.url))
  File "c:\tools\anaconda3\envs\wayback\lib\nturl2path.py", line 65, in pathname2url
    raise OSError(error)
OSError: Bad path: http://news.ycombinator.com
2019-03-21 11:50:12 [scrapy.core.engine] INFO: Closing spider (finished)
2019-03-21 11:50:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
 'downloader/exception_type_count/builtins.OSError': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 3, 21, 16, 50, 12, 40860),
 'log_count/ERROR': 1,
 'log_count/INFO': 9,
 'retry/count': 2,
 'retry/max_reached': 1,
 'retry/reason_count/builtins.OSError': 2,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2019, 3, 21, 16, 50, 11, 827644)}
2019-03-21 11:50:12 [scrapy.core.engine] INFO: Spider closed (finished)

Following image links

Thanks so much for this scraper. It works so much better than the other wayback scraper tools I've found.

I'm trying to scrape all snapshots of an old site and I've noticed that this scraper doesn't grab images, though the images are definitely stored on archive.org in the wayback machine for each snapshot.

Is there any way to get this scraper to also grab images and other linked assets like CSS?

snapshot functionality for a full site at a given time?

Hi, thanks for an interesting and useful project which has helped me make a start on reconstructing a site that would be really useful for a research project. I'm new to scrapy so it's been an interesting way to start learning about that.

I've been trying to make a snapshot of the whole site (or as much of it as is contained in the Wayback Machine) at a particular time, following the instructions here to set the from and to timestamps to the same value. However, when I do this, I only get a very incomplete snapshot of the site. If I open up the from and to range I get many more pages (but also a lot of snapshots I'm not interested in!)

I've looked at the logic in the filter_snapshots function and it all makes sense: essentially it keeps each snapshot before time_range[0] in a holding variable initial_snapshot, and if the filtered_snapshots list is still empty when time_range[1] is reached, then that snapshot goes into the filtered_snapshots list as the only one.

Have you seen any problems like this before? Possibly related is that even if I expand the time range, some pages don't get picked up and I have to re-run with a more specific URL to retrieve some subfolders of the site. The behaviour is consistent between runs, so I don't think it's timing out or anything; it's just not crawling to those pages for some reason. I've tried setting DEPTH_LIMIT in __main__.py, and when I run the command line it echoes the setting back to me, but that doesn't seem to make any difference.

Would it be possible to add a functionality to download a screenshot?

I would like to know if it would be possible to add functionality that downloads screenshots of each version of a website over time, based on Wayback Machine data. It would be very useful for us, because checking screenshots (by a human) is much faster than checking HTML.

Crashes (includes fix)

Since I can't commit to your project, here are two fixes that I had to make in order to get the scraper to run:

In mirror_spider.py, line 50, there is no check whether the output path is valid. The URL can contain ? characters which causes the script to crash.

Here's my solution, it's just a quick fix and may require elaboration for different characters and Linux/Windows compatibility:

url_parts = response.url.split('://')[1].split('/')
parent_directory = os.path.join(self.directory, *url_parts)
parent_directory = parent_directory.replace("?", "_q_")  # normalize path

os.makedirs(parent_directory, exist_ok=True)
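
A slightly more general variant of the same idea, offered only as an untested sketch: sanitize every path component so that any character Windows disallows in filenames is replaced, rather than handling ? alone.

import os
import re


def sanitize_component(component):
    # replace characters that are not allowed in Windows filenames
    return re.sub(r'[<>:"\\|?*]', '_', component)


url_parts = response.url.split('://')[1].split('/')
parent_directory = os.path.join(self.directory, *map(sanitize_component, url_parts))
os.makedirs(parent_directory, exist_ok=True)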

There is another bug in your other project scrapy_wayback_machine which is imported here, that causes a crash.

It's in __init__.py, line 91:

cdx_url = self.cdx_url_template.format(url=pathname2url(request.url))

At this point, request.url is something like http://website.com.
But pathname2url will look for a colon : and require that anything before it is only one letter (since it is designed for regular paths like C:\mypath).

When I removed the call to pathname2url, it worked for me, but I don't know which other cases may break:

cdx_url = self.cdx_url_template.format(url=request.url)
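
If dropping the encoding entirely turns out to break other URLs, an alternative sketch (my suggestion, not something from the maintainers) is to percent-encode the URL as a query-string value with urllib.parse.quote, which carries none of the Windows drive-letter assumptions of nturl2path:

from urllib.parse import quote

# percent-encode the URL for the url= query parameter, leaving slashes intact
cdx_url = self.cdx_url_template.format(url=quote(request.url))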
