
auto-archiver's Issues

Add Browsertrix support when using docker image

Issue: browsertrix-crawler is executed via docker (docker run ...) and it uses volumes to

  1. pass the profile.tar.gz file
  2. save the results of its execution

If the auto-archiver is itself running inside Docker, we have a docker-in-docker situation, which can be problematic.
One workaround is to share the daemon of the host machine with the auto-archiver docker container via /var/run/docker.sock, for example:

docker run --rm -v /var/run/docker.sock:/var/run/docker.sock -v $PWD/secrets:/app/secrets -e SHARED_PATH=$PWD/secrets/crawls aa --config secrets/config-docker.yaml

However, doing this means that the -v volumes passed when running docker run -v ... browsertrix-crawler are shared with the host (and not with the Docker container running the archiver), so the profile file and the results of the extraction are host-path dependent, which adds a layer of complexity.

Additionally, using /var/run/docker.sock is generally undesirable from a security standpoint, as it gives a lot of permissions to the code in the container.

Challenge: can we find a secure and easy-to-use approach (both inside and outside Docker) for browsertrix-crawler? Would that be a docker-compose setup with 2 services communicating? A new service that responds to browsertrix-crawl requests?

Use Entry Number for the folder in Google Storage

This is probably an issue for me to implement in the Google Drive storage.

e.g. instead of a folder name like: https-www-youtube-com-watch-v-wlahzurxrjy-list-pl7a55eb715fbb2940-index-7

I'd like it to be the entry number, e.g. AA001, which is taken from the Google spreadsheet.

Perhaps patch it in by changing filename_generator: static to filename_generator: entry_number.
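
A minimal sketch of what such a generator could look like, assuming the entry number (e.g. AA001) is read from the spreadsheet row and handed to the storage layer; the function name and hook are hypothetical, not the current auto-archiver API:

from slugify import slugify  # python-slugify, already used elsewhere in the project


def entry_number_filename_generator(entry_number: str, original_filename: str) -> str:
    """Hypothetical 'entry_number' generator: use the spreadsheet entry number
    (e.g. 'AA001') as the folder, keeping the original file name inside it."""
    folder = slugify(entry_number).upper()  # e.g. 'AA001'
    return f"{folder}/{original_filename}"  # e.g. 'AA001/video.mp4'


print(entry_number_filename_generator("AA001", "video.mp4"))  # -> AA001/video.mp4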

Project should be easier to set up and run

Currently, running this project in an automated way requires creating a Digital Ocean Spaces bucket and manually managing cron jobs on a Linux server. Ideally, this would be simpler to deploy so that a new archiving spreadsheet could be set up in a user friendly way, even for non-programmers.

One promising possibility for moving in this direction is as a Google Sheets Add On, but other ideas can also be explored and evaluated.

"failed: no archiver" in google sheet although can download the screenshot.

I attempted to set up the auto-archiver by following this instructional video (https://www.youtube.com/watch?v=VfAhcuV2tLQ).

Initially, the code was running and the archive status in the Google sheet showed "Archive in progress," but at the end, it displayed "failed: no archiver".

The logs indicate that I have successfully scraped some data. I also downloaded some screenshots (YouTube and video), and the YouTube video in webm format. However, I am unsure why the data is not updated in the Google Sheet.

Also, is it necessary to use browsertrix-crawler? I have downloaded Docker Desktop, and my machine can run browsertrix-crawler, but the error persists.

The error messages are as below:
ERROR | main:process_sheet:138 - Got unexpected error in row 2 with twitter for url='https://twitter.com/anwaribrahim/status/1642750503422685187?cxt=HHwWhsDTsaK4nMwtAAAA': [Errno 2] No such file or directory: '/Users/usr/Documents/python/archiver/browsertrix/crawls/profile.tar.gz'
Traceback (most recent call last):
File "/Users/usr/Documents/python/archiver/auto_archive.py", line 133, in process_sheet
result = archiver.download(url, check_if_exists=c.check_if_exists)
File "/Users/usr/Documents/python/archiver/archivers/twitter_archiver.py", line 42, in download
wacz = self.get_wacz(url)
File "/Users/usr/Documents/python/archiver/archivers/base_archiver.py", line 234, in get_wacz
shutil.copyfile(self.browsertrix.profile, os.path.join(browsertrix_home, "profile.tar.gz"))
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/shutil.py", line 264, in copyfile
with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/usr/Documents/python/archiver/browsertrix/crawls/profile.tar.gz'

2023-04-13 14:47:20.800 | SUCCESS | main:process_sheet:167 - Finished worksheet Sheet1

Auto Tweet the Hash

After every successful archive, post a Tweet with the hash so that we can prove that on this day the picture/video was as it is.

Need to translate this older code to the new codebase. Probably use an enricher.

https://github.com/djhmateer/auto-archiver/blob/main/auto_archive.py#L183

This code uses a SQL database to act as a queue, and another service runs every x minutes to poll the queue to see if anything should be tweeted. It checks the NextTweetTime so as not to bombard the API. It has been working for a few months on the free Twitter API, which allows 1500 Tweets per month.
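
A rough sketch of what that polling service could look like, assuming a simple SQLite queue table and tweepy for the Twitter API v2; the table schema and credential placeholders are illustrative, not taken from the linked code:

import sqlite3
from datetime import datetime

import tweepy  # assumes Twitter API v2 credentials with write access

# hypothetical queue table: tweet_queue(id, url, hash, next_tweet_time, tweeted)
DB = "tweet_queue.db"


def poll_and_tweet(client: tweepy.Client):
    conn = sqlite3.connect(DB)
    row = conn.execute(
        "SELECT id, url, hash FROM tweet_queue "
        "WHERE tweeted = 0 AND next_tweet_time <= ? ORDER BY next_tweet_time LIMIT 1",
        (datetime.utcnow().isoformat(),),
    ).fetchone()
    if row:
        item_id, url, file_hash = row
        # one tweet per poll, spacing them out so the free tier (1500/month) is not exhausted
        client.create_tweet(text=f"Archived {url} on {datetime.utcnow():%Y-%m-%d} - hash: {file_hash}")
        conn.execute("UPDATE tweet_queue SET tweeted = 1 WHERE id = ?", (item_id,))
        conn.commit()
    conn.close()


if __name__ == "__main__":
    client = tweepy.Client(
        consumer_key="...", consumer_secret="...",
        access_token="...", access_token_secret="...",
    )
    poll_and_tweet(client)  # run this every x minutes via cron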

Test and Demo spreadsheet with urls which test all aspects

A first step to testable code could be a spreadsheet which has expected output columns. This would make it easy to see if anything isn't working (regression testing).

I've got something already and will improve and post here.

This would also be useful for demo purposes, so users can see what is happening (and what should happen).

In order of usefulness to clients / most common links

Twitter

  • 1 image (should work)
  • multiple images (should work)
  • tweet with media sensitive image(s)
  • tweet that brings a login prompt (trick is to get rid of part of the url)
  • check tweet image size is max resolution
  • tweet that contains a non-Twitter video URL, as the intent is probably to get images from the tweet

then video:

  • 1 video
  • multiple videos will not work

Facebook

  • 1 image - will not work
  • multiple images - will not work

then video:

  • 1 video. Handled by youtubedl
  • multiple videos - unusual and won't work

etc...

In order of usefulness to developers

  • TelethonArchiver (Telegrams API)
  • TikTokArchiver (always getting invalid URL so far)
  • TwitterAPIArchiver (handles all tweets if API key is there)
  • YoutubeDLArchiver (handles youtube, and facebook video)
  • TelegramArchiver (backup if telethon doesn't work which is common)
  • TwitterArchiver (only if Twitter API not working)
  • VkArchiver
  • WaybackArchiver

TikTok downloader stalling when video is unavailable - can't reproduce

A bug that I can't reliably reproduce, but it sometimes stalls the whole archiver for many hours until the archiver is restarted.

Given this URL:
https://www.tiktok.com/@jusscomfyyy/video/7090483393586089222

The tiktok downloader stalls.

https://github.com/krypton-byte/tiktok-downloader

The test app https://tkdown.herokuapp.com/ correctly reports an invalid URL.

class TiktokArchiver(Archiver):
    name = "tiktok"

    def download(self, url, check_if_exists=False):
        if 'tiktok.com' not in url:
            return False

        status = 'success'

        try:
            # really slow for some videos here 25minutes plus or stalls
            info = tiktok_downloader.info_post(url)
            key = self.get_key(f'{info.id}.mp4')
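
Until the root cause is found, one possible mitigation (a sketch, not tested against this exact issue) is to run the tiktok_downloader call with a hard timeout so that a single stalled video cannot block the whole run:

import concurrent.futures

import tiktok_downloader


def info_post_with_timeout(url: str, timeout: int = 120):
    """Run tiktok_downloader.info_post(url) but give up after `timeout` seconds.
    Note: the worker thread may keep running in the background after a timeout,
    but the archiver itself can move on to the next row."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(tiktok_downloader.info_post, url)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        return None
    finally:
        executor.shutdown(wait=False)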

Improvement suggestions for WaybackArchiver

Hi!
I'm a member of Team Wayback at the Internet Archive.
I have some improvement suggestions for

class WaybackArchiver(Archiver):

  1. You could use the Wayback Machine Availability API to easily get capture info about a captured URL: https://archive.org/help/wayback_api.php. https://web.archive.org/web/<URL> is not recommended because its purpose is to play back the latest capture. You don't need to load the whole data of the latest capture of a URL; you just need to know whether it's available or not.
  2. Save Page Now API has a lot of useful options https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA/edit

if_not_archived_within=<timedelta> should be useful in your case.

Capture web page only if the latest existing capture at the Archive is older than the limit. Its format could be any datetime expression like “3d 5h 20m” or just a number of seconds, e.g. “120”. If there is a capture within the defined timedelta, SPN2 returns that as a recent capture. The system default is 30 min.

Cheers!
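
A small sketch of how both suggestions could look with requests; the Availability API endpoint is documented at the link above, and the SPN2 call assumes an authenticated POST to https://web.archive.org/save with the if_not_archived_within parameter (exact auth header format should be checked against the SPN2 doc above):

import requests


def latest_capture(url: str):
    """Check if a URL is already captured, using the Wayback Machine Availability API."""
    r = requests.get("https://archive.org/wayback/available", params={"url": url}, timeout=30)
    return r.json().get("archived_snapshots", {}).get("closest")  # None if not captured


def save_page_now(url: str, access_key: str, secret: str):
    """Ask SPN2 to capture, but only if the latest capture is older than 1 day (assumed credentials)."""
    return requests.post(
        "https://web.archive.org/save",
        headers={"Accept": "application/json", "Authorization": f"LOW {access_key}:{secret}"},
        data={"url": url, "if_not_archived_within": "1d"},
        timeout=30,
    ).json()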

UX bug: archiving fails if the "url" is replaced with its linked title text

When pasting a URL in Sheets, a helpful little dialog appears and suggests that you "Replace URL with its title" as linked text (see image below). Auto-archiver doesn't know how to handle this format and returns "nothing archived". However, it should be possible to detect and extract the URL when the cell value is a link.

[Screenshot 2023-08-01 at 16:43:13]
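
A possible workaround, as a sketch: assuming gspread is used to read the sheet and the cell ends up holding a =HYPERLINK(...) formula, requesting the cell with value_render_option='FORMULA' exposes the URL; rich-text links may instead require the full Sheets API (reading the cell's textFormatRuns).

import re

import gspread


def cell_url(worksheet: gspread.Worksheet, label: str) -> str:
    """Return the URL in a cell, whether it holds a plain URL or a =HYPERLINK() link."""
    raw = worksheet.acell(label, value_render_option="FORMULA").value or ""
    match = re.match(r'=HYPERLINK\("([^"]+)"', raw, re.IGNORECASE)
    return match.group(1) if match else raw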

Facebook image archiving

Archiving of image(s) on Facebook is not supported yet and would be very useful.

Placeholder issue to collect ideas for how it could potentially be done.

Background

https://github.com/djhmateer/auto-archiver#archive-logic has a list of what works and doesn't. Facebook video works using youtube_dlp

In the fork above, to get a Facebook screenshot I am using automation to click through the accept-cookies page, as we don't want the cookie popup in the screenshot.

To get a Facebook post link

"Each Facebook post has a timestamp on the top (it may be something like Just now, 3 mins or Yesterday). This timestamp contains the link to your post. So, to copy it, simply hover your mouse over the timestamp, right click, then copy link address"

Example

As an example of Facebook images which we would like to archive:

https://www.facebook.com/chelseymateerbeautician/posts/pfbid0mhimrwfeBpWKwBUFna28Q3RfaEK8HETcEpk1QXoEeFXHVwaa7oxLxKTHbBqu5nPpl

https://gist.github.com/pcardune/1332911 - potentially this may help.

#26 - @msramalho talked about the potential of https://archive.ph/

include exif metadata

Add an additional data point containing EXIF metadata where available; for now I can only think of Telegram media.
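
A minimal sketch of what the enricher could do per image, using Pillow (how the result attaches to the metadata is left out):

from PIL import ExifTags, Image


def exif_metadata(image_path: str) -> dict:
    """Return human-readable EXIF tags for an image, or an empty dict if none are present."""
    with Image.open(image_path) as img:
        exif = img.getexif()
        return {ExifTags.TAGS.get(tag_id, tag_id): str(value) for tag_id, value in exif.items()}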

new enricher: store HTTP headers and SSL certificate

Create a new enricher that can

  • (optionally) store the HTTP request headers - need to think of use cases and how to be comprehensive
  • (optionally) store the HTTP response headers - same as above
  • (optionally) store the SSL certificate of https connections

The display in the html_formatter should probably be initially hidden.

Another option is to design a high-level activity log that captures all actions and logs and appends them to the end of the HTML report.
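
A rough sketch of how such an enricher could collect that data (names are illustrative; wiring it into the existing enricher interface and Metadata object is left out):

import ssl
from urllib.parse import urlparse

import requests


def collect_http_and_tls_info(url: str) -> dict:
    """Gather request/response headers and, for https URLs, the server certificate (PEM)."""
    resp = requests.get(url, timeout=30)
    info = {
        "request_headers": dict(resp.request.headers),
        "response_headers": dict(resp.headers),
        "status_code": resp.status_code,
    }
    parsed = urlparse(url)
    if parsed.scheme == "https":
        info["ssl_certificate_pem"] = ssl.get_server_certificate((parsed.hostname, parsed.port or 443))
    return info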

Google service account credentials usage doesn't match README instructions

Hi there! The README says the Google service account credentials should be placed at ~/.config/gspread/service_account.json, which is where gspread looks by default. However, line 17 of auto_auto_archive.py specifies a credentials file in the same directory as the script, so the script won't run unless you drop service_account.json in the same directory.

So it seems that either auto_auto_archive.py should be changed to call gspread.service_account without a filename parameter; or the README should be changed to specify that service_account.json be created in the application root. If you have a preference, I can open a pull request and change it either way.

Google Drive bug with leading /

In telethon, if there are subdirectories that need creating, sometimes a key is passed with a leading / character, which confuses the join.

I've worked around it by adding a catch, but need to clean up

  • find the root cause
  • keep the catch in, logging to the error log so we know if it happens again

https://github.com/bellingcat/auto-archiver/blob/main/storages/gd_storage.py

    def uploadf(self, file: str, key: str, **_kwargs):
        """
        1. for each sub-folder in the path check if exists or create
        2. upload file to root_id/other_paths.../filename
        """
        # doesn't work if key starts with / which can happen from telethon todo fix
        if key.startswith('/'):
            # remove first character ie /
            key = key[1:]

An example is: https://t.me/witnessdaily/169265

Archive non-video media (images and sound)

Currently, auto-archiver relies on youtube-dl to download media, which only finds video sources. It would be a significant improvement to download images, and possibly audio as well.

YouTube playlists - probably not intentional from user

https://podrobnosti.ua/2443817-na-kivschin-cherez-vorozhij-obstrl-vinikla-pozhezha.html

This site contains a live link at the top which is a link to a YouTube playlist with 1 item which is a live stream.

Currently the 'is_live' check doesn't catch it because it is a playlist, so the archiver will proceed to download 3.7GB of the stream, then create thousands of thumbnails.

I propose a simple fix in youtubedl_archiver.py to stop downloading of playlists, which struck me as probably not what the user would want:

        if info.get('is_live', False):
            logger.warning("Live streaming media, not archiving now")
            return ArchiveResult(status="Streaming media")

        # added this catch below
        infotype = info.get('_type', False)
        if infotype is not False:
            if 'playlist' in infotype:
                logger.info('found a youtube playlist - this probably is not intended. Have put in this as edge case of a live stream which is a single item in a playlist')
                return ArchiveResult(status="Playlist")

There is probably a much more elegant way to express this!
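
One possibly more compact way to express the same check (a sketch; behaviour should match the snippet above):

        # skip playlists: an accidental playlist link (e.g. a live stream wrapped
        # in a single-item playlist) is probably not what the user intended
        if "playlist" in (info.get("_type") or ""):
            logger.info("found a youtube playlist - probably not intended, skipping")
            return ArchiveResult(status="Playlist")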

Can submit a PR if you agree.

Multiple instances of auto-archiver and Proxmox / Azure

I'm hosting 3 instances of the auto-archiver on 3 separate VMs. I've allocated 4GB of RAM to each and the systems work well.

Is anyone running multiple instances on a single VM, and have you found any issues with simultaneous calls to ffmpeg / Gecko drivers? That is what concerns me the most.

Running Python in a virtual env, e.g. pipenv run python auto_archive.py, should segregate that side, I guess.

Scrape Youtube comments, livechats

youtube-dl, or at least yt-dlp, is capable of downloading and dumping live chat data, machine and manual transcripts of videos, and so on. YouTube comments can be grabbed with the official API with some difficulty, or by this project relatively easily. I've found having all of these around useful for some OSINT tasks.
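
For reference, a sketch of the yt-dlp options that (as I understand the yt-dlp docs) pull comments and live chat alongside the video; option names should be double-checked against the yt-dlp version in use:

import yt_dlp

ydl_opts = {
    "writeinfojson": True,                  # comments end up inside the .info.json
    "getcomments": True,                    # equivalent of --write-comments
    "writesubtitles": True,
    "writeautomaticsub": True,              # machine transcripts
    "subtitleslangs": ["en", "live_chat"],  # live chat is exposed as a 'subtitle' track
    "skip_download": True,                  # only grab the metadata/transcripts here
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=VIDEO_ID"])  # placeholder URL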

archive facebook with archive.ph

https://archive.ph/ does not have an API like the Internet Archive's Wayback Machine, although it can archive Facebook pages, which the Wayback Machine cannot. Could we use Selenium to submit links via the archive.ph UI and thus successfully archive them?
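
A very rough Selenium sketch of that idea (assuming the archive.ph landing page exposes a submit form whose input is named 'url'; the page layout, possible captchas and rate limiting would all need to be verified):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait


def submit_to_archive_ph(url: str) -> str:
    """Submit a URL to archive.ph via its web form and return the resulting page URL."""
    driver = webdriver.Firefox()
    try:
        driver.get("https://archive.ph/")
        box = driver.find_element(By.NAME, "url")  # assumed field name
        box.send_keys(url)
        box.send_keys(Keys.RETURN)
        # wait until we have navigated away from the landing page (archiving can take a while)
        WebDriverWait(driver, 120).until(lambda d: d.current_url != "https://archive.ph/")
        return driver.current_url
    finally:
        driver.quit()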

generating WACZ without Docker - wacz not working

Getting a proxy connection failure from the wacz_archiver_enricher on all URLs.

First time I've set this up, so probably something simple / maybe I've missed something.

The next step for me is to set up a local dev version and debug it, but this issue may be useful for others at the same stage as me.

I have the profile set up in secrets/profile.tar.gz, which I created via:

# create a new profile
docker run -p 6080:6080 -p 9223:9223 -v $PWD/crawls/profiles:/crawls/profiles/ -it webrecorder/browsertrix-crawler create-login-profile --url "https://twitter.com/"

Output of the run is:

docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --config secrets/orchestration.yaml

2023-08-22 10:50:17.450 | INFO     | auto_archiver.core.config:parse:111 - FEEDER: gsheet_feeder
2023-08-22 10:50:17.450 | INFO     | auto_archiver.core.config:parse:112 - ENRICHERS: ['hash_enricher', 'wacz_archiver_enricher']
2023-08-22 10:50:17.450 | INFO     | auto_archiver.core.config:parse:113 - ARCHIVERS: ['wacz_archiver_enricher']
2023-08-22 10:50:17.450 | INFO     | auto_archiver.core.config:parse:114 - DATABASES: ['gsheet_db']
2023-08-22 10:50:17.450 | INFO     | auto_archiver.core.config:parse:115 - STORAGES: ['local_storage']
2023-08-22 10:50:17.450 | INFO     | auto_archiver.core.config:parse:116 - FORMATTER: html_formatter
2023-08-22 10:50:24.319 | INFO     | auto_archiver.feeders.gsheet_feeder:__iter__:48 - Opening worksheet ii=0: wks.title='Sheet1' header=1
2023-08-22 10:50:26.275 | WARNING  | auto_archiver.databases.gsheet_db:started:28 - STARTED Metadata(status='no archiver', metadata={'_processed_at': datetime.datetime(2023, 8, 22, 10, 50, 26, 274503), 'url': 'https://twitter.com/dave_mateer/status/1505876265504546817'}, media=[])
2023-08-22 10:50:26.916 | INFO     | auto_archiver.core.orchestrator:archive:85 - Trying archiver wacz_archiver_enricher for https://twitter.com/dave_mateer/status/1505876265504546817
2023-08-22 10:50:26.916 | DEBUG    | auto_archiver.enrichers.wacz_enricher:enrich:52 - generating WACZ without Docker for url='https://twitter.com/dave_mateer/status/1505876265504546817'
2023-08-22 10:50:26.916 | INFO     | auto_archiver.enrichers.wacz_enricher:enrich:96 - Running browsertrix-crawler: crawl --url https://twitter.com/dave_mateer/status/1505876265504546817 --scopeType page --generateWACZ --text --screenshot fullPage --collection 5e60e6e9 --id 5e60e6e9 --saveState never --behaviors autoscroll,autoplay,autofetch,siteSpecific --behaviorTimeout 120 --timeout 120 --profile /app/secrets/profile.tar.gz
{"logLevel":"info","timestamp":"2023-08-22T10:50:27.983Z","context":"general","message":"Browsertrix-Crawler 0.10.3 (with warcio.js 1.6.2 pywb 2.7.4)","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:50:27.983Z","context":"general","message":"Seeds","details":[{"url":"https://twitter.com/dave_mateer/status/1505876265504546817","include":[],"exclude":[],"scopeType":"page","sitemap":false,"allowHash":false,"maxExtraHops":0,"maxDepth":1000000}]}
{"logLevel":"info","timestamp":"2023-08-22T10:50:27.983Z","context":"general","message":"With Browser Profile","details":{"url":"/app/secrets/profile.tar.gz"}}
{"logLevel":"info","timestamp":"2023-08-22T10:50:30.205Z","context":"worker","message":"Creating 1 workers","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:50:30.205Z","context":"worker","message":"Worker starting","details":{"workerid":0}}
{"logLevel":"info","timestamp":"2023-08-22T10:50:30.263Z","context":"browser","message":"Disabling Service Workers for profile","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:50:30.269Z","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://twitter.com/dave_mateer/status/1505876265504546817"}}
{"logLevel":"info","timestamp":"2023-08-22T10:50:30.270Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":null,"total":null,"pending":1,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2023-08-22T10:50:30.206Z\",\"url\":\"https://twitter.com/dave_mateer/status/1505876265504546817\",\"added\":\"2023-08-22T10:50:28.119Z\",\"depth\":0}"]}}
{"logLevel":"info","timestamp":"2023-08-22T10:50:32.373Z","context":"general","message":"Awaiting page load","details":{"page":"https://twitter.com/dave_mateer/status/1505876265504546817","workerid":0}}
{"logLevel":"error","timestamp":"2023-08-22T10:51:02.379Z","context":"general","message":"Page Load Error, skipping page","details":{"msg":"net::ERR_TIMED_OUT at https://twitter.com/dave_mateer/status/1505876265504546817","page":"https://twitter.com/dave_mateer/status/1505876265504546817","workerid":0}}
{"logLevel":"error","timestamp":"2023-08-22T10:51:02.379Z","context":"worker","message":"Unknown exception","details":{"type":"exception","message":"net::ERR_TIMED_OUT at https://twitter.com/dave_mateer/status/1505876265504546817","stack":"Error: net::ERR_TIMED_OUT at https://twitter.com/dave_mateer/status/1505876265504546817\n    at navigate (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Frame.js:98:23)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async Deferred.race (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/util/Deferred.js:79:20)\n    at async Frame.goto (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Frame.js:64:21)\n    at async CDPPage.goto (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:578:16)\n    at async Crawler.loadPage (file:///app/crawler.js:1062:20)\n    at async Crawler.default [as driver] (file:///app/defaultDriver.js:3:3)\n    at async Crawler.crawlPage (file:///app/crawler.js:451:5)\n    at async PageWorker.timedCrawlPage (file:///app/util/worker.js:151:7)\n    at async PageWorker.runLoop (file:///app/util/worker.js:192:9)","workerid":0}}
{"logLevel":"warn","timestamp":"2023-08-22T10:51:02.380Z","context":"pageStatus","message":"Page Load Failed","details":{"loadState":0,"page":"https://twitter.com/dave_mateer/status/1505876265504546817","workerid":0}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:02.386Z","context":"worker","message":"Worker exiting, all tasks complete","details":{"workerid":0}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:02.483Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":1,"total":1,"pending":0,"failed":1,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:02.483Z","context":"general","message":"Waiting to ensure pending data is written to WARCs...","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:02.485Z","context":"general","message":"Generating WACZ","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:02.485Z","context":"general","message":"Num WARC Files: 0","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:02.485Z","context":"general","message":"Crawl status: done","details":{}}
2023-08-22 10:51:02.489 | WARNING  | auto_archiver.enrichers.wacz_enricher:enrich:108 - Unable to locate and upload WACZ  filename='collections/5e60e6e9/5e60e6e9.wacz'
2023-08-22 10:51:02.490 | DEBUG    | auto_archiver.enrichers.hash_enricher:enrich:31 - calculating media hashes for url='https://twitter.com/dave_mateer/status/1505876265504546817' (using SHA3-512)
2023-08-22 10:51:02.490 | DEBUG    | auto_archiver.enrichers.wacz_enricher:enrich:52 - generating WACZ without Docker for url='https://twitter.com/dave_mateer/status/1505876265504546817'
2023-08-22 10:51:02.490 | INFO     | auto_archiver.enrichers.wacz_enricher:enrich:96 - Running browsertrix-crawler: crawl --url https://twitter.com/dave_mateer/status/1505876265504546817 --scopeType page --generateWACZ --text --screenshot fullPage --collection c851aa3f --id c851aa3f --saveState never --behaviors autoscroll,autoplay,autofetch,siteSpecific --behaviorTimeout 120 --timeout 120 --profile /app/secrets/profile.tar.gz
{"logLevel":"info","timestamp":"2023-08-22T10:51:03.099Z","context":"general","message":"Browsertrix-Crawler 0.10.3 (with warcio.js 1.6.2 pywb 2.7.4)","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:03.100Z","context":"general","message":"Seeds","details":[{"url":"https://twitter.com/dave_mateer/status/1505876265504546817","include":[],"exclude":[],"scopeType":"page","sitemap":false,"allowHash":false,"maxExtraHops":0,"maxDepth":1000000}]}
{"logLevel":"info","timestamp":"2023-08-22T10:51:03.100Z","context":"general","message":"With Browser Profile","details":{"url":"/app/secrets/profile.tar.gz"}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:03.653Z","context":"worker","message":"Creating 1 workers","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:03.653Z","context":"worker","message":"Worker starting","details":{"workerid":0}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:03.703Z","context":"browser","message":"Disabling Service Workers for profile","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:03.710Z","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://twitter.com/dave_mateer/status/1505876265504546817"}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:03.710Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":null,"total":null,"pending":1,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2023-08-22T10:51:03.654Z\",\"url\":\"https://twitter.com/dave_mateer/status/1505876265504546817\",\"added\":\"2023-08-22T10:51:03.157Z\",\"depth\":0}"]}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:05.392Z","context":"general","message":"Awaiting page load","details":{"page":"https://twitter.com/dave_mateer/status/1505876265504546817","workerid":0}}
{"logLevel":"error","timestamp":"2023-08-22T10:51:05.398Z","context":"general","message":"Page Load Error, skipping page","details":{"msg":"net::ERR_PROXY_CONNECTION_FAILED at https://twitter.com/dave_mateer/status/1505876265504546817","page":"https://twitter.com/dave_mateer/status/1505876265504546817","workerid":0}}
{"logLevel":"error","timestamp":"2023-08-22T10:51:05.399Z","context":"worker","message":"Unknown exception","details":{"type":"exception","message":"net::ERR_PROXY_CONNECTION_FAILED at https://twitter.com/dave_mateer/status/1505876265504546817","stack":"Error: net::ERR_PROXY_CONNECTION_FAILED at https://twitter.com/dave_mateer/status/1505876265504546817\n    at navigate (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Frame.js:98:23)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async Deferred.race (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/util/Deferred.js:79:20)\n    at async Frame.goto (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Frame.js:64:21)\n    at async CDPPage.goto (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:578:16)\n    at async Crawler.loadPage (file:///app/crawler.js:1062:20)\n    at async Crawler.default [as driver] (file:///app/defaultDriver.js:3:3)\n    at async Crawler.crawlPage (file:///app/crawler.js:451:5)\n    at async PageWorker.timedCrawlPage (file:///app/util/worker.js:151:7)\n    at async PageWorker.runLoop (file:///app/util/worker.js:192:9)","workerid":0}}
{"logLevel":"warn","timestamp":"2023-08-22T10:51:05.399Z","context":"pageStatus","message":"Page Load Failed","details":{"loadState":0,"page":"https://twitter.com/dave_mateer/status/1505876265504546817","workerid":0}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:05.409Z","context":"worker","message":"Worker exiting, all tasks complete","details":{"workerid":0}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:05.540Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":1,"total":1,"pending":0,"failed":1,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:05.540Z","context":"general","message":"Waiting to ensure pending data is written to WARCs...","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:05.542Z","context":"general","message":"Generating WACZ","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:05.543Z","context":"general","message":"Num WARC Files: 0","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:05.543Z","context":"general","message":"Crawl status: done","details":{}}
2023-08-22 10:51:05.549 | WARNING  | auto_archiver.enrichers.wacz_enricher:enrich:108 - Unable to locate and upload WACZ  filename='collections/c851aa3f/c851aa3f.wacz'
2023-08-22 10:51:05.549 | DEBUG    | auto_archiver.formatters.html_formatter:format:37 - [SKIP] FORMAT there is no media or metadata to format: url='https://twitter.com/dave_mateer/status/1505876265504546817'
2023-08-22 10:51:05.549 | SUCCESS  | auto_archiver.databases.gsheet_db:done:46 - DONE https://twitter.com/dave_mateer/status/1505876265504546817
2023-08-22 10:51:06.365 | SUCCESS  | auto_archiver.feeders.gsheet_feeder:__iter__:79 - Finished worksheet Sheet1

and orchestration.yaml is:

steps:
  # only 1 feeder allowed
  feeder: gsheet_feeder # defaults to cli_feeder
  archivers: # order matters, uncomment to activate
    # - vk_archiver
    # - telethon_archiver
    # - telegram_archiver
    # - twitter_archiver
    #- twitter_api_archiver
    # - instagram_tbot_archiver
    # - instagram_archiver
    # - tiktok_archiver
    # - youtubedl_archiver
    # - wayback_archiver_enricher
    - wacz_archiver_enricher
  enrichers:
    - hash_enricher
    # - metadata_enricher
    # - screenshot_enricher
    # - thumbnail_enricher
    # - wayback_archiver_enricher
    - wacz_archiver_enricher
    # - pdq_hash_enricher # if you want to calculate hashes for thumbnails, include this after thumbnail_enricher
  formatter: html_formatter # defaults to mute_formatter
  storages:
    - local_storage
    # - s3_storage
    # - gdrive_storage
  databases:
    #- console_db
    # - csv_db
    - gsheet_db
    # - mongo_db

configurations:
  gsheet_feeder:
    sheet: "AA Demo Main"
    header: 1
    service_account: "secrets/service_account.json"
    # allow_worksheets: "only parse this worksheet"
    # block_worksheets: "blocked sheet 1,blocked sheet 2"
    use_sheet_names_in_stored_paths: false
    columns:
      url: link
      status: archive status
      folder: destination folder
      archive: archive location
      date: archive date
      thumbnail: thumbnail
      timestamp: upload timestamp
      title: upload title
      text: textual content
      screenshot: screenshot
      hash: hash
      pdq_hash: perceptual hashes
      wacz: wacz
      replaywebpage: replaywebpage
  instagram_tbot_archiver:
    api_id: "TELEGRAM_BOT_API_ID"
    api_hash: "TELEGRAM_BOT_API_HASH"
    # session_file: "secrets/anon"
  telethon_archiver:
    api_id: "TELEGRAM_BOT_API_ID"
    api_hash: "TELEGRAM_BOT_API_HASH"
    # session_file: "secrets/anon"
    join_channels: false
    channel_invites: # if you want to archive from private channels
      - invite: https://t.me/+123456789
        id: 0000000001
      - invite: https://t.me/+123456788
        id: 0000000002

  twitter_api_archiver:
    # either bearer_token only
    # bearer_token: "TWITTER_BEARER_TOKEN"
   

  instagram_archiver:
    username: "INSTAGRAM_USERNAME"
    password: "INSTAGRAM_PASSWORD"
    # session_file: "secrets/instaloader.session"

  vk_archiver:
    username: "or phone number"
    password: "vk pass"
    session_file: "secrets/vk_config.v2.json"

  screenshot_enricher:
    width: 1280
    height: 2300
  wayback_archiver_enricher:
    timeout: 10
    key: "wayback key"
    secret: "wayback secret"
  hash_enricher:
    algorithm: "SHA3-512" # can also be SHA-256
  wacz_archiver_enricher:
    profile: secrets/profile.tar.gz
  local_storage:
    save_to: "./local_archive"
    save_absolute: true
    filename_generator: static
    path_generator: flat
  s3_storage:
    bucket: your-bucket-name
    region: reg1
    key: S3_KEY
    secret: S3_SECRET
    endpoint_url: "https://{region}.digitaloceanspaces.com"
    cdn_url: "https://{bucket}.{region}.cdn.digitaloceanspaces.com/{key}"
    # if private:true S3 urls will not be readable online
    private: false
    # with 'random' you can generate a random UUID for the URL instead of a predictable path, useful to still have public but unlisted files; the alternative is 'default', or omit from config
    key_path: random
  gdrive_storage:
    path_generator: url
    filename_generator: random
    root_folder_id: folder_id_from_url
    oauth_token: secrets/gd-token.json # needs to be generated with scripts/create_update_gdrive_oauth_token.py
    service_account: "secrets/service_account.json"
  csv_db:
    csv_file: "./local_archive/db.csv"

Detect columns from headers

Rather than specifying the columns to use for the archive URL, timestamp, etc. as command-line flags, these should be determined from headers in the Google Sheet itself.
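
A sketch of how the header row could drive the column mapping via gspread (header names would mirror the ones in the sheet itself):

import gspread


def detect_columns(worksheet: gspread.Worksheet, header_row: int = 1) -> dict:
    """Map lower-cased header names to 0-based column indices, e.g. {'link': 0, 'archive status': 1}."""
    headers = worksheet.row_values(header_row)
    return {name.strip().lower(): idx for idx, name in enumerate(headers) if name.strip()}


# usage sketch (credentials file and sheet name are placeholders)
# gc = gspread.service_account(filename="service_account.json")
# ws = gc.open("My archiving sheet").sheet1
# cols = detect_columns(ws)
# url_col = cols.get("link")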

Whitelist and Blacklist of Worksheet

We have a spreadsheet with multiple worksheets and I'd like to whitelist or blacklist based on the title.

The reason is that one of the worksheets is an exact copy of the worksheet that I want archived, with the same column names. So the archiver picks it up when we don't want it.

I propose adding 2 extra config items, something like this:

execution:
  # spreadsheet name - can be overwritten with CMD --sheet=
  sheet: "Test Hashing"

  # worksheet to blacklist. Leave blank which is default for none. Useful if users want a MASTERSHEET exact copy of the 
  # working worksheet
  worksheet_blacklist: MASTERSHEET
  # only check this worksheet rather than iterating through all worksheets in the spreadsheet. If whitelist is used 
  # then blacklist is ignored as whitelist is most restrictive.
  worksheet_whitelist: Sheet1

I only need a single item in the 'lists'.

Happy to code this up and do a PR.
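
A sketch of the selection logic (whitelist wins over blacklist, as proposed above):

def worksheets_to_process(all_titles: list, whitelist: list = None, blacklist: list = None) -> list:
    """Return worksheet titles to archive; a non-empty whitelist takes precedence over the blacklist."""
    if whitelist:
        return [t for t in all_titles if t in whitelist]
    blacklist = blacklist or []
    return [t for t in all_titles if t not in blacklist]


# usage sketch
print(worksheets_to_process(["Sheet1", "MASTERSHEET"], blacklist=["MASTERSHEET"]))  # ['Sheet1']
print(worksheets_to_process(["Sheet1", "MASTERSHEET"], whitelist=["Sheet1"]))       # ['Sheet1']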

Support Instagram image posts

It would be helpful to have support for archiving Instagram image posts as well as videos (which are currently archived with youtube-dl). This would likely require additional authentication credentials, like a cookie, to be specified in the configuration file.
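
One possible route is instaloader, which handles image posts and sidecars; a sketch, assuming an Instagram login (matching the credential requirement mentioned above) and a placeholder shortcode:

import instaloader

L = instaloader.Instaloader(download_videos=False, save_metadata=True)
L.login("INSTAGRAM_USERNAME", "INSTAGRAM_PASSWORD")  # or load a saved session file

# shortcode is the part of the post URL after /p/, e.g. https://www.instagram.com/p/<shortcode>/
post = instaloader.Post.from_shortcode(L.context, "SHORTCODE")
L.download_post(post, target="instagram_archive")  # saves the image(s) plus metadata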

When .html is in the path, the screenshot saves as .html

http://brokenlinkcheckerchecker.com/pagea.html

The screenshot PNG would be saved as wayback_pageb-2022-11-11t10-55-09-277235.html

A simple fix is in base_archiver.py - I commented out a line at the bottom of the function:

    def _get_key_from_url(self, url, with_extension: str = None, append_datetime: bool = False):
        """
        Receives a URL and returns a slugified version of the URL path
        if a string is passed in @with_extension the slug is appended with it if there is no "." in the slug
        if @append_date is true, the key adds a timestamp after the URL slug and before the extension
        """
        url_path = urlparse(url).path
        path, ext = os.path.splitext(url_path)
        slug = slugify(path)
        if append_datetime:
            slug += "-" + slugify(datetime.datetime.utcnow().isoformat())
        if len(ext):
            slug += ext
        if with_extension is not None:
            # I have a url with .html in the path, and want the screenshot to be .png
            # eg http://brokenlinkcheckerchecker.com/pageb.html
            # am happy with .html.png as a file extension
            # commented out the follow line to fix
            # unsure as to why this is here 
            # if "." not in slug:
                slug += with_extension
        return self.get_key(slug)

which then gives wayback_pageb-2022-11-11t10-55-09-277235.html.png

Happy to do a PR if I haven't misunderstood anything here.

Add a timestamp authority client Step

Following information from this timestamp-authority repo (RFC 3161 Timestamp Authority), implement a Step which connects to a timestamp authority server. One example that can be tested right away is
https://freetsa.org/index_en.php

Taken from there, a full example with SHA-512 is:

###########################################################
# 1. create a tsq file (SHA 512)
###########################################################
openssl ts -query -data file.png -no_nonce -sha512 -out file.tsq

# Option -cert: FreeTSA is expected to include its signing certificate (Root + Intermediate Certificates) in the response. (Optional)
# If the tsq was created with the option "-cert", its verification does not require "-untrusted".
#$ openssl ts -query -data file.png -no_nonce -sha512 -cert -out file.tsq


# How to make Timestamps of many files?

# To timestamp multiple files, create a text file with all their SHA-512 hashes and timestamp it.
# Alternatively, you may pack all the files to be timestamped in a zip/rar/img/tar, etc file and timestamp it.

# Generate a text file with all the hashes of the /var/log/ files
$ find /var/log/ -type f -exec sha512sum {} + > compilation.txt

###########################################################
# 2. cURL Time Stamp Request Input (HTTP / HTTPS)
###########################################################

# HTTP 2.0 in cURL: Get the latest cURL release and use this command: curl --http2.
curl -H "Content-Type: application/timestamp-query" --data-binary '@file.tsq' https://freetsa.org/tsr > file.tsr

# Using the Tor-network.
#$ curl -k --socks5-hostname 127.0.0.1:9050 -H "Content-Type: application/timestamp-query" --data-binary '@file.tsq' https://4bvu5sj5xok272x6cjx4uurvsbsdigaxfmzqy3n3eita272vfopforqd.onion/tsr > file.tsr

# tsget is very useful to stamp multiple time-stamp-queries: https://www.openssl.org/docs/manmaster/apps/tsget.html
#$ tsget -h https://freetsa.org/tsr file1.tsq file2.tsq file3.tsq

###########################################################
# 3. Verify tsr file
###########################################################

wget https://freetsa.org/files/tsa.crt
wget https://freetsa.org/files/cacert.pem

# Timestamp Information.
openssl ts -reply -in file.tsr -text

# Verify (two different ways).
# openssl ts -verify -data file -in file.tsr -CAfile cacert.pem -untrusted tsa.crt 
openssl ts -verify -in file.tsr -queryfile file.tsq -CAfile cacert.pem -untrusted tsa.crt
# Verification: OK
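
If this becomes a Step, a thin Python wrapper around exactly those openssl commands could look like the sketch below (paths and the freetsa endpoint as in the shell example above; error handling and CA file management are simplified):

import subprocess

import requests

TSA_URL = "https://freetsa.org/tsr"


def timestamp_file(path: str) -> str:
    """Create <path>.tsq with openssl, send it to the TSA, and write the response to <path>.tsr."""
    tsq, tsr = f"{path}.tsq", f"{path}.tsr"
    subprocess.run(
        ["openssl", "ts", "-query", "-data", path, "-no_nonce", "-sha512", "-out", tsq],
        check=True,
    )
    with open(tsq, "rb") as f:
        resp = requests.post(TSA_URL, data=f.read(),
                             headers={"Content-Type": "application/timestamp-query"}, timeout=60)
    resp.raise_for_status()
    with open(tsr, "wb") as f:
        f.write(resp.content)
    return tsr


def verify_timestamp(path: str, cacert: str = "cacert.pem", tsa_cert: str = "tsa.crt") -> bool:
    """Verify <path>.tsr against <path>.tsq using the CA files downloaded from freetsa.org."""
    result = subprocess.run(
        ["openssl", "ts", "-verify", "-in", f"{path}.tsr", "-queryfile", f"{path}.tsq",
         "-CAfile", cacert, "-untrusted", tsa_cert],
        capture_output=True, text=True,
    )
    return result.returncode == 0 and "Verification: OK" in (result.stdout + result.stderr)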

Discussion topics

  • If we do this for a final document like the HTML report, it's enough to do it once, but then the timestamp cannot be saved within the HTML report, as that would break the hashing (unless we create a link to the tsq and tsr files that are only created after the HTML report is written and hashed). It does not matter if those files don't exist yet: they can be created later, since their content is the actual verification, and they don't need hashes of their own because they contain the HTML hash even though the HTML references them.
  • If we do it for each file it can lead to a lot of overhead, which does not sound like a great approach.

Given the cyclical nature of this, I wonder what the best way to implement it is: it needs to run after the HtmlFormatter, which can only happen as a database step, meaning the formatter should only include/display the links if they actually exist.

Archive links from the Discord server

Giancarlo spoke to the Bellingcat Community Discord yesterday, including about how Bellingcat has an auto-archiver that works with links dropped manually into a Google Sheet. This seems like it could be easily extended to auto-archive any link posted in specific channels on the Discord. This might also turn out to be a faster way to use the auto-archiver, for any researchers who are using Discord in whatever capacity. The idea is that it gobbles up any URL posted anywhere in a message in an entire channel that matches one of the configured archivers.

If you want this and it wouldn't cost too much to run on specific channels, then I'm happy to build it. There are a few options for implementation that someone might be able to provide some guidance on.

Bot vs batch

I've started creating a bot. I figure that's probably better in terms of not exceeding the Discord API limits by doing a "get everything on this channel" frequently, but depending on how you guys like to run these archivers, there may be disadvantages in that it kinda has to stay running all the time. I suppose there'd be no harm in doing both, a bot that on startup reads the channel histories up to a max # of messages, and then waits quietly for new messages. Where do you stand on that?

How to trigger the actual archiving

You could:

  • Add it to a Google Sheet, and that's it. Let the existing scheduled archiver take care of any added URLs. A fair bit simpler.
  • On receiving a message with a link in it, add it to the sheet and schedule an archive to occur in the same discord_archive.py program. This is cooler because then the bot could whack a little react emoji on messages to indicate the archive status (a rough sketch follows below).
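
A minimal sketch of the bot side (discord.py; the sheet-append helper and channel names are stand-ins for however the gsheet feeder ends up being wired in):

import re

import discord

URL_RE = re.compile(r"https?://\S+")
WATCHED_CHANNELS = {"archive-me"}  # hypothetical channel names to watch

intents = discord.Intents.default()
intents.message_content = True
client = discord.Client(intents=intents)


def add_to_sheet(url: str):
    """Stand-in: append the URL to the archiving Google Sheet (e.g. via gspread append_row)."""
    print(f"would archive: {url}")


@client.event
async def on_message(message: discord.Message):
    if message.author.bot or getattr(message.channel, "name", None) not in WATCHED_CHANNELS:
        return
    for url in URL_RE.findall(message.content):
        add_to_sheet(url)
        await message.add_reaction("🗃️")  # signal that the link was queued


client.run("DISCORD_BOT_TOKEN")  # placeholder token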

Deduplication

Considering this will be adding a bunch of new links to the archives, I would be worried about whether it's going to clobber previously archived pages in the S3 backend. This is something the archivers themselves are meant to detect, right? I don't think the Twitter one does this. Does DigitalOcean's S3 support version history, just in case? And the archivers don't overwrite anything if they hit a 404, right?


[Attachment: discordbot.mov]

YoutubeDL can return non-video content

Currently this isn't handled well by the archiver. For example, if YoutubeDL returns a PDF, it will attempt to run ffmpeg on it, resulting in an error. We should be ready to handle non-video content coming from YoutubeDL (and more generally, ref #3).
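
A small sketch of a guard that could sit before the ffmpeg/thumbnail step (using mimetypes; the surrounding flow is simplified):

import mimetypes


def is_video(filename: str) -> bool:
    """Only run ffmpeg/thumbnail generation on files that look like video."""
    mime, _ = mimetypes.guess_type(filename)
    return bool(mime) and mime.startswith("video/")


# usage sketch
print(is_video("report.pdf"))  # False -> store as-is, skip ffmpeg
print(is_video("clip.mp4"))    # True  -> safe to generate thumbnails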

Handle pages with multiple videos

Youtube-dl supports pages that contain multiple videos, but the auto-archiver does not. Instead, these URLs will be skipped over with a notification to the user that pages with multiple videos are not currently supported.

In the current user interface, there is a 1:1 relationship between a row in the spreadsheet and a file to be archived. This needs to be generalized to 1:many in order to support pages with multiple videos. One possible way to do this would be to generate HTML index pages that link/include all videos archived for a page, similar to the way that thumbnail contact sheets are generated.
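
For reference, a sketch of how the 1:many case surfaces in yt-dlp's info dict; each entry could then become its own archived item or a row in a generated index page:

import yt_dlp

with yt_dlp.YoutubeDL({"skip_download": True}) as ydl:
    info = ydl.extract_info("https://example.com/page-with-videos", download=False)  # placeholder URL

if info.get("_type") == "playlist":
    for entry in info.get("entries") or []:
        print(entry.get("title"), entry.get("webpage_url"))  # one archive target per entry
else:
    print("single video:", info.get("title"))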

AttributeError: 'HashEnricher' object has no attribute 'algorithm'

I'm running into an interesting error when archiving a simple URL locally on my computer (macOS) (command is python3 -m auto_archiver --config orch.yaml --cli_feeder.urls="https://miles.land", version is 0.4.4): AttributeError: 'HashEnricher' object has no attribute 'algorithm'.

Here is my config:

steps:
  # only 1 feeder allowed
  feeder: cli_feeder
  archivers: # order matters, uncomment to activate
    # - vk_archiver
    # - telethon_archiver
    - telegram_archiver
    - twitter_archiver
    # - twitter_api_archiver
    # - instagram_tbot_archiver
    # - instagram_archiver
    - tiktok_archiver
    - youtubedl_archiver
    # - wayback_archiver_enricher
  enrichers:
    - hash_enricher
    - screenshot_enricher
    - thumbnail_enricher
    # - wayback_archiver_enricher
    - wacz_enricher

  formatter: html_formatter # defaults to mute_formatter
  storages:
    - local_storage
    # - s3_storage
    # - gdrive_storage
  databases:
    - console_db
    - csv_db
    # - gsheet_db
    # - mongo_db

configurations:
  screenshot_enricher:
    width: 1280
    height: 2300
  hash_enricher:
    algorithm: "SHA-256" # can also be SHA-256
  local_storage:
    save_to: "./local_archive"
    save_absolute: true
    filename_generator: static
    path_generator: flat

And here is the full log:

% python3 -m auto_archiver --config orch.yaml --cli_feeder.urls="https://miles.land"
2023-03-14 11:52:44.363 | INFO     | auto_archiver.core.config:parse:108 - FEEDER: cli_feeder
2023-03-14 11:52:44.363 | INFO     | auto_archiver.core.config:parse:109 - ENRICHERS: ['hash_enricher', 'screenshot_enricher', 'thumbnail_enricher', 'wacz_enricher']
2023-03-14 11:52:44.363 | INFO     | auto_archiver.core.config:parse:110 - ARCHIVERS: ['telegram_archiver', 'twitter_archiver', 'tiktok_archiver', 'youtubedl_archiver']
2023-03-14 11:52:44.363 | INFO     | auto_archiver.core.config:parse:111 - DATABASES: ['console_db', 'csv_db']
2023-03-14 11:52:44.363 | INFO     | auto_archiver.core.config:parse:112 - STORAGES: ['local_storage']
2023-03-14 11:52:44.363 | INFO     | auto_archiver.core.config:parse:113 - FORMATTER: html_formatter
2023-03-14 11:52:44.363 | DEBUG    | auto_archiver.feeders.cli_feeder:__iter__:28 - Processing https://miles.land
2023-03-14 11:52:44.364 | DEBUG    | auto_archiver.core.orchestrator:archive:66 - result.rearchivable=True for url='https://miles.land'
2023-03-14 11:52:44.364 | WARNING  | auto_archiver.databases.console_db:started:22 - STARTED Metadata(status='no archiver', _processed_at=datetime.datetime(2023, 3, 14, 18, 52, 44, 364021), metadata={'url': 'https://miles.land', 'folder': 'cli', 'tmp_dir': './tmpjdhyhj5x'}, media=[], rearchivable=True)
2023-03-14 11:52:44.364 | INFO     | auto_archiver.core.orchestrator:archive:87 - Trying archiver telegram_archiver for https://miles.land
2023-03-14 11:52:44.364 | INFO     | auto_archiver.core.orchestrator:archive:87 - Trying archiver twitter_archiver for https://miles.land
2023-03-14 11:52:44.364 | INFO     | auto_archiver.core.orchestrator:archive:87 - Trying archiver tiktok_archiver for https://miles.land
2023-03-14 11:52:44.364 | INFO     | auto_archiver.core.orchestrator:archive:87 - Trying archiver youtubedl_archiver for https://miles.land
[generic] Extracting URL: https://miles.land
[generic] miles: Downloading webpage
WARNING: [generic] Falling back on generic information extractor
[generic] miles: Extracting information
ERROR: Unsupported URL: https://miles.land
2023-03-14 11:52:45.051 | DEBUG    | auto_archiver.archivers.youtubedl_archiver:download:37 - No video - Youtube normal control flow: ERROR: Unsupported URL: https://miles.land
2023-03-14 11:52:45.051 | DEBUG    | auto_archiver.enrichers.hash_enricher:enrich:30 - calculating media hashes for url='https://miles.land' (using SHA-256)
2023-03-14 11:52:45.051 | DEBUG    | auto_archiver.enrichers.screenshot_enricher:enrich:27 - Enriching screenshot for url='https://miles.land'
2023-03-14 11:52:53.272 | DEBUG    | auto_archiver.enrichers.thumbnail_enricher:enrich:23 - generating thumbnails
2023-03-14 11:52:53.273 | DEBUG    | auto_archiver.enrichers.wacz_enricher:enrich:35 - generating WACZ for url='https://miles.land'
2023-03-14 11:52:53.273 | INFO     | auto_archiver.enrichers.wacz_enricher:enrich:61 - Running browsertrix-crawler: docker run --rm -v /Users/miles/Desktop/tmpjdhyhj5x:/crawls/ webrecorder/browsertrix-crawler crawl --url https://miles.land --scopeType page --generateWACZ --text --collection dd5fef44 --behaviors autoscroll,autoplay,autofetch,siteSpecific --behaviorTimeout 90 --timeout 90
{"logLevel":"info","timestamp":"2023-03-14T18:52:53.832Z","context":"general","message":"Page context being used with 1 worker","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:53.833Z","context":"general","message":"Set netIdleWait to 15 seconds","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:53.833Z","context":"general","message":"Seeds","details":[{"url":"https://miles.land/","include":[],"exclude":[],"scopeType":"page","sitemap":false,"allowHash":false,"maxExtraHops":0,"maxDepth":99999}]}
{"logLevel":"info","timestamp":"2023-03-14T18:52:54.094Z","context":"state","message":"Storing state in memory","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:54.416Z","context":"general","message":"Text Extraction: Enabled","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:54.515Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":0,"total":1,"pending":1,"limit":{"max":0,"hit":false},"pendingPages":["{\"url\":\"https://miles.land/\",\"seedId\":0,\"depth\":0,\"started\":\"2023-03-14T18:52:54.448Z\"}"]}}
{"logLevel":"error","timestamp":"2023-03-14T18:52:58.314Z","context":"general","message":"Invalid Seed \"mailto:[email protected]\" - URL must start with http:// or https://","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.338Z","context":"behavior","message":"Behaviors started","details":{"behaviorTimeout":90,"page":"https://miles.land/"}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.339Z","context":"behavior","message":"Run Script Started","details":{"frameUrl":"https://miles.land/","page":"https://miles.land/"}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.340Z","context":"behaviorScript","message":"Behavior log","details":{"state":{"segments":1},"msg":"Skipping autoscroll, page seems to not be responsive to scrolling events","page":"https://miles.land/"}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.341Z","context":"behaviorScript","message":"Behavior log","details":{"state":{"segments":1},"msg":"done!","page":"https://miles.land/"}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.341Z","context":"behavior","message":"Run Script Finished","details":{"frameUrl":"https://miles.land/","page":"https://miles.land/"}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.341Z","context":"behavior","message":"Behaviors finished","details":{"finished":1,"page":"https://miles.land/"}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.341Z","context":"pageStatus","message":"Page finished","details":{"page":"https://miles.land/"}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.391Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":1,"total":1,"pending":0,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.391Z","context":"general","message":"Waiting to ensure pending data is written to WARCs...","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.396Z","context":"general","message":"Generating WACZ","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.398Z","context":"general","message":"Num WARC Files: 8","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.700Z","context":"general","message":"Validating passed pages.jsonl file\nReading and Indexing All WARCs\nWriting archives...\nWriting logs...\nGenerating page index from passed pages...\nHeader detected in the passed pages.jsonl file\nGenerating datapackage.json\nGenerating datapackage-digest.json\n","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.737Z","context":"general","message":"Crawl status: done","details":{}}
2023-03-14 11:52:58.886 | ERROR    | auto_archiver.core.orchestrator:feed_item:44 - Got unexpected error on item Metadata(status='no archiver', _processed_at=datetime.datetime(2023, 3, 14, 18, 52, 44, 364021), metadata={'url': 'https://miles.land', 'folder': 'cli', 'tmp_dir': './tmpjdhyhj5x'}, media=[Media(filename='./tmpjdhyhj5x/screenshot_8df3af2d.png', key=None, urls=[], _mimetype='image/png', properties={'id': 'screenshot'}), Media(filename='/Users/miles/Desktop/tmpjdhyhj5x/collections/dd5fef44/dd5fef44.wacz', key=None, urls=[], _mimetype=None, properties={'id': 'browsertrix'})], rearchivable=True): 'HashEnricher' object has no attribute 'algorithm'
Traceback (most recent call last):
  File "/Users/miles/.asdf/installs/python/3.10.1/lib/python3.10/site-packages/auto_archiver/core/orchestrator.py", line 37, in feed_item
    return self.archive(item)
  File "/Users/miles/.asdf/installs/python/3.10.1/lib/python3.10/site-packages/auto_archiver/core/orchestrator.py", line 110, in archive
    s.store(m, result)  # modifies media
  File "/Users/miles/.asdf/installs/python/3.10.1/lib/python3.10/site-packages/auto_archiver/storages/storage.py", line 46, in store
    self.set_key(media, item)
  File "/Users/miles/.asdf/installs/python/3.10.1/lib/python3.10/site-packages/auto_archiver/storages/storage.py", line 78, in set_key
    he = HashEnricher({"algorithm": "SHA-256", "chunksize": 1.6e7})
  File "/Users/miles/.asdf/installs/python/3.10.1/lib/python3.10/site-packages/auto_archiver/enrichers/hash_enricher.py", line 18, in __init__
    assert self.algorithm in algo_choices, f"Invalid hash algorithm selected, must be one of {algo_choices} (you selected {self.algorithm})."
AttributeError: 'HashEnricher' object has no attribute 'algorithm'

2023-03-14 11:52:58.887 | ERROR    | auto_archiver.databases.console_db:failed:25 - FAILED Metadata(status='no archiver', _processed_at=datetime.datetime(2023, 3, 14, 18, 52, 44, 364021), metadata={'url': 'https://miles.land', 'folder': 'cli', 'tmp_dir': './tmpjdhyhj5x'}, media=[Media(filename='./tmpjdhyhj5x/screenshot_8df3af2d.png', key=None, urls=[], _mimetype='image/png', properties={'id': 'screenshot'}), Media(filename='/Users/miles/Desktop/tmpjdhyhj5x/collections/dd5fef44/dd5fef44.wacz', key=None, urls=[], _mimetype=None, properties={'id': 'browsertrix'})], rearchivable=True)
2023-03-14 11:52:58.887 | SUCCESS  | auto_archiver.feeders.cli_feeder:__iter__:30 - Processed 1 URL(s)

I can try to investigate and submit a PR, but figured I'd open the issue just to have.

Specify hash algorithm in config

I needed to specify SHA3_512 rather than SHA256. I have a PR coming which passes this through.

  # in base_archiver use SHA256 or SHA3_512
  hash_algorithm: SHA3_512
  # hash_algorithm: SHA256
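
For context, the underlying hashing is just hashlib, so passing the algorithm through config boils down to something like this sketch:

import hashlib

ALGORITHMS = {"SHA256": hashlib.sha256, "SHA3_512": hashlib.sha3_512}


def hash_file(path: str, algorithm: str = "SHA3_512", chunksize: int = 16_000_000) -> str:
    """Hash a file with the configured algorithm, reading in chunks to keep memory use flat."""
    h = ALGORITHMS[algorithm]()
    with open(path, "rb") as f:
        while chunk := f.read(chunksize):
            h.update(chunk)
    return h.hexdigest()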
