jsvine / waybackpack Goto Github PK

Download the entire Wayback Machine archive for a given URL.

License: MIT License

Python 98.80% Makefile 1.20%

waybackpack's Introduction

waybackpack

Waybackpack is a command-line tool that lets you download the entire Wayback Machine archive for a given URL.

For instance, to download every copy of the Department of Labor's homepage through 1996 (which happens to be the first year the site was archived), you'd run:

waybackpack http://www.dol.gov/ -d ~/Downloads/dol-wayback --to-date 1996

Result:

~/Downloads/dol-wayback/
├── 19961102145216
│   └── www.dol.gov
│       └── index.html
├── 19961103063843
│   └── www.dol.gov
│       └── index.html
├── 19961222171647
│   └── www.dol.gov
│       └── index.html
└── 19961223193614
    └── www.dol.gov
        └── index.html

Or, just to print the URLs of all archived snapshots:

waybackpack http://www.dol.gov/ --list

Installation

pip install waybackpack

Usage

usage: waybackpack [-h] [--version] (-d DIR | --list) [--raw] [--root ROOT]
                   [--from-date FROM_DATE] [--to-date TO_DATE]
                   [--user-agent USER_AGENT] [--follow-redirects]
                   [--uniques-only] [--collapse COLLAPSE] [--ignore-errors]
                   [--max-retries MAX_RETRIES] [--no-clobber] [--quiet]
                   [--progress] [--delay DELAY] [--delay-retry DELAY_RETRY]
                   url

positional arguments:
  url                   The URL of the resource you want to download.

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -d DIR, --dir DIR     Directory to save the files. Will create this
                        directory if it doesn't already exist.
  --list                Instead of downloading the files, only print the list
                        of snapshots.
  --raw                 Fetch file in its original state, without any
                        processing by the Wayback Machine or waybackpack.
  --root ROOT           The root URL from which to serve snapshotted
                        resources. Default: 'https://web.archive.org'
  --from-date FROM_DATE
                        Timestamp-string indicating the earliest snapshot to
                        download. Should take the format YYYYMMDDhhss, though
                        you can omit as many of the trailing digits as you
                        like. E.g., '201501' is valid.
  --to-date TO_DATE     Timestamp-string indicating the latest snapshot to
                        download. Should take the format YYYYMMDDhhss, though
                        you can omit as many of the trailing digits as you
                        like. E.g., '201604' is valid.
  --user-agent USER_AGENT
                        The User-Agent header to send along with your requests
                        to the Wayback Machine. If possible, please include
                        the phrase 'waybackpack' and your email address. That
                        way, if you're battering their servers, they know who
                        to contact. Default: 'waybackpack'.
  --follow-redirects    Follow redirects.
  --uniques-only        Download only the first version of duplicate files.
  --collapse COLLAPSE   An archive.org `collapse` parameter. Cf.: https://gith
                        ub.com/internetarchive/wayback/blob/master/wayback-
                        cdx-server/README.md#collapsing
  --ignore-errors       Don't crash on non-HTTP errors e.g., the requests
                        library's ChunkedEncodingError. Instead, log error and
                        continue. Cf.
                        https://github.com/jsvine/waybackpack/issues/19
  --max-retries MAX_RETRIES
                        How many times to try accessing content with 4XX or
                        5XX status code before skipping?
  --no-clobber          If a file is already present (and >0 filesize), don't
                        download it again.
  --quiet               Don't log progress to stderr.
  --progress            Print a progress bar. Mutes the default logging.
                        Requires `tqdm` to be installed.
  --delay DELAY         Sleep X seconds between each fetch.
  --delay-retry DELAY_RETRY
                        Sleep X seconds between each post-error retry.

Support

Waypackback is written in pure Python, depends only on requests, and should work wherever Python works. Requires Python 3.3+.

Thanks

Many thanks to the following users for catching bugs, fixing typos, and proposing useful features:

waybackpack's People

Contributors

Stargazers

Watchers

Forkers

ndsteve goofwear uslic001 adamg tspannhw john-kuo kubatyszko majkelo schollz rlugojr whitcrrd cclauss wflk security-geeks lucabongiorni pkthebud oliveiracwb mediaeater drkarl szydan nicholerhodes dmfury chigga102 designtips kalanicraig zihua sillyjw landsurveyorsunited adammendoza acamtech techcollections luisriverag lovromar fhriley devslashnull mongonauta gee-gee manosx lngvietthang nhittt fgregg pombredanne phenri samyoyo chrisemoulton ksmaheshkumar brandontict darg0001 digideskio ognz shekkbuilder awesome-archive jeffreychung twyyk javarockstar modulexcite anukat2015 k9team3 olivierh59500 atanida todun quikilr land-surveyors-united marcdacosta elnuno samatt alexxnica swwest e-orlov gsdu8g9 mgcfish stacyannj therealnixxer jwsmithers lokkdokk netsharec unforeseenocean h0r57 sumdp klark1kent www3838438 ianmadlenya mutchako seabreg leucotic dvlop sar2901 ykankaya afcarl totalgood chouaibhm bbhunter alexewd basic612 sahwar rand338 njbooher github-userx shalevy1 bradparks

waybackpack's Issues

Status code 403 causes JSONDecodeError

$ waybackpack --list http://lcamtuf.coredump.cx/
INFO:waybackpack.session: HTTP status code: 403
Traceback (most recent call last):
  File "/home/jwilk/.local/bin/waybackpack", line 11, in <module>
    load_entry_point('waybackpack==0.3.5', 'console_scripts', 'waybackpack')()
  File "/home/jwilk/.local/lib/python3.5/site-packages/waybackpack/cli.py", line 86, in main
    collapse=args.collapse
  File "/home/jwilk/.local/lib/python3.5/site-packages/waybackpack/cdx.py", line 19, in search
    "collapse": collapse
  File "/home/jwilk/.local/lib/python3.5/site-packages/requests/models.py", line 888, in json
    self.content.decode(encoding), **kwargs
  File "/usr/lib/python3.5/json/__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.5/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.5/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I guess waybackpack shouldn't try to parse error pages as JSON…

generate list failed with `to_json': source sequence is illegal/malformed utf-8

/build/lib/wayback_machine_downloader.rb:182:in `to_json': source sequence is illegal/malformed utf-8 (JSON::GeneratorError)
        from /build/lib/wayback_machine_downloader.rb:182:in `block in list_files'
        from /build/lib/wayback_machine_downloader.rb:181:in `each'
        from /build/lib/wayback_machine_downloader.rb:181:in `list_files'
        from /build/bin/wayback_machine_downloader:70:in `<top (required)>'
        from /usr/local/bundle/bin/wayback_machine_downloader:17:in `load'
        from /usr/local/bundle/bin/wayback_machine_downloader:17:in `<main>'

seems some url contain some invalid % escape sequence

Multi-threaded downloading to combat very limited speeds of archive server

Give an option to use external download manager like aria2c or an option to convinvce us. Without multipart downloading archive is too slow

'Invalid Syntax Error'

Hi! I am having an issue just running the simplest command - I think I am doing something wrong that is so stupid I can't find any resources for help...

import waybackpack
import requests
waybackpack https://www.metopera.org/season/2017-18-season/elektra-strauss-tickets -d ~/Downloads --user-agent

Gives me an invalid Syntax error. I'm working in a .py file in Atom Mac OS with Python 3. Any suggestions?

Saving url query string in filename

Currently if you try to save a resource with the template of:

www.site.com/news?385

It will save it as merely "news" instead of something like news@385 (like what wget does).

I looked through the code and couldn't find the part that is handling the url query, but if one is saving a large amounts of files in that format, it becomes less userfriendly to simply have a thousand files labeled "news".

Awesome program by the way.

Is there a way to limit the downloads to just zip files?

There is a site I want to download, but I don't want any of the site pages - just the zip files therein. Is there a way to limit the results?

It seems like the script only downloads the index.html pages. I'm looking for site///*.zip

Update readme version #

Housekeeping: Readme reflects wrong version number. (v0.3.4 vs v0.3.5)

Is the project still alive?

The question is why there have been no updates for more than 3 years, besides when trying to download a site just download an empty index.html.
As a replacement I found wayback-machine-scraper although it seems that it only downloads the html forgetting the images or CSS. There are also other projects about Wayback Machine in this topic but only a few are for saving snapshots.

TypeError: a bytes-like object is required, not 'NoneType' on 5xx error

Getting an error when downloading something and getting a server error (I think). Even with --ignore-errors it still attempts to write a None object to the file.

Traceback (most recent call last):
  File "C:\Users\Astrid\scoop\apps\python\current\Scripts\waybackpack-script.py", line 11, in <module>
    load_entry_point('waybackpack==0.3.6', 'console_scripts', 'waybackpack')()
  File "c:\users\astrid\scoop\apps\python\3.8.5\lib\site-packages\waybackpack\cli.py", line 104, in main
    pack.download_to(
  File "c:\users\astrid\scoop\apps\python\3.8.5\lib\site-packages\waybackpack\pack.py", line 91, in download_to
    f.write(content)
TypeError: a bytes-like object is required, not 'NoneType'

Adding if content: around the last with block in pack.py fixes this, but I'm not sure whether this is the correct solution or not.

Follow redirects

Hey there. Using waybackpack for shots and loving it!

I think there should be a way to follow redirects, at the very least as an option. I'd enable it by default though.

Empty index.html with "found capture at xxxxxxxxxx"

All downloaded index.html files are have this single line:
found capture at xxxxxx

$ waybackpack -d . www.masrawy.com
INFO:waybackpack.pack: Fetching www.masrawy.com @ 20000621155312
INFO:waybackpack.session: HTTP status code: 302
INFO:waybackpack.pack: Writing to ./20000621155312/www.masrawy.com/index.html

INFO:waybackpack.pack: Fetching www.masrawy.com @ 20000711031156
INFO:waybackpack.session: HTTP status code: 302
INFO:waybackpack.pack: Writing to ./20000711031156/www.masrawy.com/index.html

INFO:waybackpack.pack: Fetching www.masrawy.com @ 20001018073034
INFO:waybackpack.session: HTTP status code: 302
INFO:waybackpack.pack: Writing to ./20001018073034/www.masrawy.com/index.html

INFO:waybackpack.pack: Fetching www.masrawy.com @ 20001109072800
INFO:waybackpack.session: HTTP status code: 302
INFO:waybackpack.pack: Writing to ./20001109072800/www.masrawy.com/index.html

Not able to download a website from wayback

Hi,

I have installed this software and I am trying to first list and then download a website from wayback machine.

D:\Projects\wayb1\waybackpack-master\waybackpack-master>waybackpack www.amrutpha
rm.co.in --list --user-agent waybackpack

I am getting the following error.
INFO:waybackpack.session: HTTP status code: 503
INFO:waybackpack.session: Waiting 1 second before retrying.

Can someone please tell me a solution to resolve this problem?

Thanks
Sachin

Download HTML with archive.org URLs for assets

Is it possible to add an argument to be able to download the HTML of an archived page from archive.org, with the https://web.archive.org full path for assets?

For example, right now, an image is being referenced in the HTML as:

<img src="/assets/image-sprites.png">

as it only downloads the .html, that image is broken.

What I am asking is to be able to download the HTML with full URL paths such as:

https://web.archive.org/web/20130909175810im_/https://domain.com/assets/image-sprites.png

This would apply to all the external resources: Images, CSS, JS etc.

That way, I would be able to see the page, exactly as I am seeing it from Wayback Machine page.

INFO:waybackpack.session: HTTP status code: 302

INFO:waybackpack.pack: Fetching acasaredonda.com.br @ 20200812061150
INFO:waybackpack.session: HTTP status code: 302
INFO:waybackpack.pack: Writing to /home/vz/acr_wbm_snapshot/20200812061150/acasaredonda.com.br/index.html

Only recreates 0-bytes index.html files in directories for each snapshot and returns the 302 HTTP STATUS CODE.

Wildcard url queries

Thanks for the great library.

I hacked up a way to download all the content prefixed by certain path. I.e.

http://www.oalj.dol.gov/PUBLIC/ARB/REFERENCES/CASELIST/*

This is obviously related to #8, but it's not quite the same thing. I wanted to see if you would be interested in seeing a PR for this?

Download URL only

It’d be pretty cool to have an option that allows downloading only the URL specified and not the entire archive for all sites under that domain. So for example if I want to download https://www.cnn.com/2022/02/02/us/confederate-monuments-removed-2021-whose-heritage/index.html, I just want that page and that page only, not all CNN urls.

thank you!

question: able to download a website historically while only saving the 1st successful page?

any change to get a feature where we can download a site from a range of dates? for example 2015-Today to try and get every copy of a URL, but only save the most successful download?

the use case is im trying to get a website, but some pages are "blocked by cloudflare" on certain versions of archive.org

thanks!

Waybackpack + matchType

Wayback API had matchType option, example:
https://web.archive.org/cdx/search/cdx?url=https://twitter.com/jack/statuses&matchType=prefix

Which returns:

com,twitter)/jack/statuses/"/antarnisti/status/245078986827386880" 20121223123338 https://twicom,twitter)/jack/statuses/"/antarnisti/status/245078986827386880" 20121223123338 https://twitter.com/jack/statuses/%22/Antarnisti/status/245078986827386880%22 text/html 404 VNL4UHLBLX2UYNDIOZZ7ZR3CFYURIVND 5296
com,twitter)/jack/statuses/"/antarnisti/status/245078986827386880" 20130203195805 https://twitter.com/jack/statuses/%22/Antarnisti/status/245078986827386880%22 warc/revisit - VNL4UHLBLX2UYNDIOZZ7ZR3CFYURIVND 1042
com,twitter)/jack/statuses/"/antarnisti/status/245078986827386880" 20130312144230 https://twitter.com/jack/statuses/%22/Antarnisti/status/245078986827386880%22 warc/revisit - VNL4UHLBLX2UYNDIOZZ7ZR3CFYURIVND 1035
com,twitter)/jack/statuses/"/antarnisti/status/245078986827386880" 20130326132131 https://twitter.com/jack/statuses/%22/Antarnisti/status/245078986827386880%22 text/html 404 BMAXRTF3OVX3HL22WUMYLBYT2UJV3HT3 9317
com,twitter)/jack/statuses/"/antarnisti/status/245078986827386880" 20130402123359 https://twitter.com/jack/statuses/%22/Antarnisti/status/245078986827386880%22 warc/revisit - BMAXRTF3OVX3HL22WUMYLBYT2UJV3HT3 1030tter.com/jack/statuses/%22/Antarnisti/status/245078986827386880%22 text/html 404 VNL4UHLBLX2UYNDIOZZ7ZR3CFYURIVND 5296
com,twitter)/jack/statuses/"/antarnisti/status/245078986827386880" 20130203195805 https://twitter.com/jack/statuses/%22/Antarnisti/status/245078986827386880%22 warc/revisit - VNL4UHLBLX2UYNDIOZZ7ZR3CFYURIVND 1042
com,twitter)/jack/statuses/"/antarnisti/status/245078986827386880" 20130312144230 https://twitter.com/jack/statuses/%22/Antarnisti/status/245078986827386880%22 warc/revisit - VNL4UHLBLX2UYNDIOZZ7ZR3CFYURIVND 1035
com,twitter)/jack/statuses/"/antarnisti/status/245078986827386880" 20130326132131 https://twitter.com/jack/statuses/%22/Antarnisti/status/245078986827386880%22 text/html 404 BMAXRTF3OVX3HL22WUMYLBYT2UJV3HT3 9317
com,twitter)/jack/statuses/"/antarnisti/status/245078986827386880" 20130402123359 https://twitter.com/jack/statuses/%22/Antarnisti/status/245078986827386880%22 warc/revisit - BMAXRTF3OVX3HL22WUMYLBYT2UJV3HT3 1030

Is it possible to download all of this urls? Because waybackpack will trim url based on cli input.

I have try to add new matchType parametr to the cdx file, i get valid response, but waybackpack still trim url based on cli input

I need to download all SWF files from Archive.

Hi, I installed Waybackpack in hopes I could download all the swf files located here. "http://flashvortex.com/images/"
In the Wayback Machine, I can access them here. "https://web.archive.org/web/*/http://flashvortex.com/images/*"
This is all great, but the waybackpack only saves HTML files. Anyway I can download all the SWF files? Thanks!
P.S. This is critical for me to download these.

Short argument names

Thank you for the tool.
Do you think we should add short names for arguments used frequently?
Like:
--list -> -l
--uniques-only] -> -u

Short sleep on 503

IA sends 503s if their servers are overloaded. If I understand your code correctly, you'll continue rapidly sending requests in that case. Better to sleep for a second. It's probably an even better idea to do the sleep for any 5xx return code.

Downloading one snapshot per day

Hi,

Is there a way to do it, e.g. download only the first or the last snapshot of the day?

Connection errors aren't caught

requests.exceptions.ChunkedEncodingError is raised all too often.

Traceback (most recent call last):
  File "~/virt_env/bin/waybackpack", line 9, in <module>
    load_entry_point('waybackpack==0.3.2', 'console_scripts', 'waybackpack')()
  File "~/virt_env/lib/python2.7/site-packages/waybackpack/cli.py", line 88, in main
    root=args.root,
  File "~/virt_env/lib/python2.7/site-packages/waybackpack/pack.py", line 63, in download_to
    root=root
  File "~/virt_env/lib/python2.7/site-packages/waybackpack/asset.py", line 45, in fetch
    res = session.get(url)
  File "~/virt_env/lib/python2.7/site-packages/waybackpack/session.py", line 20, in get
    **kwargs
  File "~/virt_env/lib/python2.7/site-packages/requests/api.py", line 71, in get
    return request('get', url, params=params, **kwargs)
  File "~/virt_env/lib/python2.7/site-packages/requests/api.py", line 57, in request
    return session.request(method=method, url=url, **kwargs)
  File "~/virt_env/lib/python2.7/site-packages/requests/sessions.py", line 475, in request
    resp = self.send(prep, **send_kwargs)
  File "~/virt_env/lib/python2.7/site-packages/requests/sessions.py", line 617, in send
    r.content
  File "~/virt_env/lib/python2.7/site-packages/requests/models.py", line 741, in content
    self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()
  File "~/virt_env/lib/python2.7/site-packages/requests/models.py", line 667, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))

503 Service Unavailable

Getting snapshot pagesC:/Ruby32-x64/lib/ruby/3.2.0/open-uri.rb:369:in `open_http': 503 Service Unavailable (OpenURI::HTTPError)
        from C:/Ruby32-x64/lib/ruby/3.2.0/open-uri.rb:760:in `buffer_open'
        from C:/Ruby32-x64/lib/ruby/3.2.0/open-uri.rb:214:in `block in open_loop'
        from C:/Ruby32-x64/lib/ruby/3.2.0/open-uri.rb:212:in `catch'
        from C:/Ruby32-x64/lib/ruby/3.2.0/open-uri.rb:212:in `open_loop'
        from C:/Ruby32-x64/lib/ruby/3.2.0/open-uri.rb:153:in `open_uri'
        from C:/Ruby32-x64/lib/ruby/3.2.0/open-uri.rb:740:in `open'
        from C:/Ruby32-x64/lib/ruby/gems/3.2.0/gems/wayback_machine_downloader-2.3.1/lib/wayback_machine_downloader/archive_api.rb:13:in `get_raw_list_from_api'
        from C:/Ruby32-x64/lib/ruby/gems/3.2.0/gems/wayback_machine_downloader-2.3.1/lib/wayback_machine_downloader.rb:88:in `get_all_snapshots_to_consider'
        from C:/Ruby32-x64/lib/ruby/gems/3.2.0/gems/wayback_machine_downloader-2.3.1/lib/wayback_machine_downloader.rb:105:in `get_file_list_curated'
        from C:/Ruby32-x64/lib/ruby/gems/3.2.0/gems/wayback_machine_downloader-2.3.1/lib/wayback_machine_downloader.rb:164:in `get_file_list_by_timestamp'
        from C:/Ruby32-x64/lib/ruby/gems/3.2.0/gems/wayback_machine_downloader-2.3.1/lib/wayback_machine_downloader.rb:309:in `file_list_by_timestamp'
        from C:/Ruby32-x64/lib/ruby/gems/3.2.0/gems/wayback_machine_downloader-2.3.1/lib/wayback_machine_downloader.rb:192:in `download_files'
        from C:/Ruby32-x64/lib/ruby/gems/3.2.0/gems/wayback_machine_downloader-2.3.1/bin/wayback_machine_downloader:72:in `<top (required)>'
        from C:/Ruby32-x64/bin/wayback_machine_downloader:32:in `load'
        from C:/Ruby32-x64/bin/wayback_machine_downloader:32:in `<main>'

Tried on both Linux (ubuntu) and windows, whenever I try to download a website it shows me this error, any ideas?

Blank files

Hey,

I'm running a somewhat simple command:

wayback_machine_downloader absglobal.com --all-timestamps --from 20110101000000 --to 20221231235959 --concurrency 5 --only "/(\/$|\.(html|htm|aspx)$)/i" --all

The downloader somewhat works. I get quite a few errors like so:

Failed to open TCP connection to web.archive.org:443 (Connection refused - connect(2) for "web.archive.org" port 443)

I accept this. But the real problem seems to be:

I get a folder structure like so: 20110101152320/contact-us/europe/index.html but the html is just blank?

Uniques only

Feature suggestions:

only retrieve unique snapshots
only retrieve snapshots closest to a particular set of dates (e.g. 1 July of each year).

Keep up the good work.

IndexError: list index out of range with v0.3.0

waybackpack dol.gov -d /Users/scoffman/shots/sobig/resources/shots/pages --to-date 1996

Results in:

INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): web.archive.org
Traceback (most recent call last):
  File "/Users/scoffman/.virtualenvs/shots/bin/waybackpack", line 11, in <module>
    sys.exit(main())
  File "/Users/scoffman/.virtualenvs/shots/lib/python2.7/site-packages/waybackpack/cli.py", line 73, in main
    collapse=args.collapse
  File "/Users/scoffman/.virtualenvs/shots/lib/python2.7/site-packages/waybackpack/cdx.py", line 21, in search
    fields = cdx[0]
IndexError: list index out of range

I am using Python 2.7.10 with waybackpack=0.3.0 and requests=2.10.0

Only Download Index.html file

Hi jsvine,
when i want to download the spesific snapshot of whole pages of my site it only download the index.html file and the folder of the snapshot, the command is: waybackpack example.com -d downfolder/ --follow-redirects --from-date 201901

Ability download extensions based

Can download everything from https://example.com but if it is possible to download only the files for give extension with a flag like;
-f .js

https://example.com/new/jobs/file.js
https://example.com/staffs/fired.js

downloaded all the js files from example.com

http 302 error

Hi, I am not sure why I keep getting HTTP 302 error for any website I try:

INFO:waybackpack.pack: Fetching cnn.com @ 20190101001034
INFO:waybackpack.session: HTTP status code: 302

Here's the command I used:
waybackpack cnn.com -d ~/Downloads/cnn --from-date 201901 --to-date 202007 --user-agent 'Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/10.0'

directory traversal via crafted timestamps

Waybackpack does not validate timestamps it receives from the Wayback Machine.
If the server went rogue, it could put "../" sequences in the timestamp, tricking waybackpack into writing outside the destination directory.

Download URL only

thank you!

Date argument not working correctly

sample code

rss_url="https://indianexpress.com/section/lifestyle/health/feed/"
waybackpack $rss_url -d $download_dir --from-date 2020

This downloads rss files from the year 2014, instead of 2020.

Support other sources?

archive.is?

Tags 0.6.1, 0.6.0 missing

The latest versions on https://pypi.org/project/waybackpack/ are 0.6.1 and 0.6.0, but there are no such tags in this git repo.

pip deprecation warning:

DEPRECATION: waybackpack is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at pypa/pip#8559

Save all HTML files in one place

Hello jsvine, hope your health is okay.

I noticed that waybackpack saves HTML files in nested folders. I think it's pretty neat, and it has its set of advantages (esp. with organization). However, it can also be cumbersome and add another step to viewing these HTML files.

Is there a way to save all HTML files in an event in the same folder where waybackpack was ran?

Thank you very much.

Download URL only

thank you!

generate list failed with `to_json': source sequence is illegal/malformed utf-8

/build/lib/wayback_machine_downloader.rb:182:in `to_json': source sequence is illegal/malformed utf-8 (JSON::GeneratorError)
        from /build/lib/wayback_machine_downloader.rb:182:in `block in list_files'
        from /build/lib/wayback_machine_downloader.rb:181:in `each'
        from /build/lib/wayback_machine_downloader.rb:181:in `list_files'
        from /build/bin/wayback_machine_downloader:70:in `<top (required)>'
        from /usr/local/bundle/bin/wayback_machine_downloader:17:in `load'
        from /usr/local/bundle/bin/wayback_machine_downloader:17:in `<main>'

seems some url contain some invalid % escape sequence

URLs with operators and hyphens

Hi, I noticed that when I try to print the URLs of all archived snapshots for the SecureWorks RSS feed link I get an error. The link is causing waybackpack to think I am passing wrong arguments. I guess this happens because the link includes operators such as '&':

waybackpack https://www.secureworks.com/rss?feed=research&category=threat-analysis --list >> ./wbp/secureworks.txt

and the error message:

waybackpack: error: one of the arguments -d/--dir --list is required
'category' is not recognized as an internal or external command,
operable program or batch file.

Is there a way to avoid this error and get results for an URL like that?

How to run waybackpack in Python shell?

Hello jsvine, I hope you are feeling well.

How do I run waybackpack in Python shell? I would like to call it, but unfortunately the only way I can do that is with os.system, which isn't great. Can I run waybackpack in a way similar to requests or tqdm, and if so, how?

I tinkered around with the following, and I got an error saying the module is not callable.

    waybackpack("https://cnn.com")
TypeError: 'module' object is not callable

Thank you.

`--ignore-errors` and `--max-retries` do nothing against `ConnectionRefusedError`

Hello,

I've tried Waybackpack, and the Internet Archive sometimes refuses connexions for apparently no reason, as the refused link can be openned in the browser.

I hoped that --ignore-errors would help Waybackpack ignore that error; and that --max-retries would retry to connect; but to no avail.

Best regards.

Request returning no error code and only Index.html file

I'm trying to execute the wayback backpack command to download the 3rd July 2023 snapshot of the "https://projects.ttlexceeded.com/" web page, with no success. The command returns no errors and only downloads a single index.html. When visiting the snapshot on the browser through Web archive I can see the full web page perfectly. Can you help me out? I'm using the '--follow-redirects' switch and don't understand what's happening. Thanks!!

Inserting a sleep between each fetch request

I would like to be nice to the Wayback Machine and space out my requests.

An option to insert a delay of x seconds between fetching each page would allow me to reduce the load.

It looks like the Wayback Machine does have a rate limiter, which causes the current non-delayed fetch to grind to a halt.

Syntax error in Collab and Jupyter?

Attempting to run in either Google Colab or a Jupyter notebook. It installs, but returns a syntax error with any URL:

File "<ipython-input-10-e826c76cce5b>", line 1 waybackpack dol.gov -d ~/Downloads/dol-wayback --to-date 2020 ^ SyntaxError: invalid syntax

Download whole site

Add a param to download whole site with assets, not just pages
(as of right now it only captures html page of the site)

Handle 503 when not from server overload

Currently cannot download all files for this url: https://www.amazon.com/Art-Gathering-How-Meet-Matters/dp/1594634920

The problem is that some of the pages return a 503 (I assume because amazon returned that robot check) so the bot just gets stuck with them.

It would be good to add a flag about how to handle 503 or something of the sort. Either skip 503s, download anyway, or only retry X times. Something like that could help with this.

Here's an example of the page that will always be 503: https://web.archive.org/web/20190506092829/https://www.amazon.com/Art-Gathering-How-Meet-Matters/dp/1594634920

Thanks for making this awesome project :)

jsvine / waybackpack Goto Github PK

waybackpack's Introduction

waybackpack

Installation

Usage

Support

Thanks

waybackpack's People

Contributors

Stargazers

Watchers

Forkers

waybackpack's Issues

Recommend Projects

Recommend Topics

Recommend Org