
auto-archiver's Issues

Add Browsertrix support when using docker image

Issue: browsertrix-crawler is executed via docker (docker run ...) and it uses volumes to

  1. pass the profile.tar.gz file
  2. save the results of its execution

If the auto-archiver is itself running inside Docker, we have a docker-in-docker situation, which can be problematic.
One workaround is to share the daemon of the host machine with the auto-archiver docker container via /var/run/docker.sock, for example:

docker run --rm -v /var/run/docker.sock:/var/run/docker.sock -v $PWD/secrets:/app/secrets -e SHARED_PATH=$PWD/secrets/crawls aa --config secrets/config-docker.yaml

However, doing this means that the -v volumes passed when running docker run -v ... browsertrix-crawler are shared with the host (and not with the Docker container running the archiver), so the profile file and the results of the extraction are host-path dependent, which adds a layer of complexity.

Additionally, using /var/run/docker.sock is generally undesirable from a security standpoint, as it gives a lot of permissions to the code in the container.

Challenge: can we find a secure and easy-to-use approach (both inside and outside Docker) for browsertrix-crawler? Would that be a docker-compose setup with 2 services communicating? A new service that responds to browsertrix-crawl requests?

Use Entry Number for the folder in Google Storage

This is probably an issue for me to implement in the Google Drive storage.

e.g. instead of a folder name like: https-www-youtube-com-watch-v-wlahzurxrjy-list-pl7a55eb715fbb2940-index-7

I'd like it to be the entry number, e.g. AA001, which is taken from the Google spreadsheet.

Perhaps patch it in by changing filename_generator: static to filename_generator: entry_number.
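
A minimal sketch of what such a generator could look like, assuming the entry number (e.g. AA001) is read from the spreadsheet row and handed to the storage layer; the function name and hook are hypothetical, not the current auto-archiver API:

from slugify import slugify  # python-slugify, already used elsewhere in the project


def entry_number_filename_generator(entry_number: str, original_filename: str) -> str:
    """Hypothetical 'entry_number' generator: use the spreadsheet entry number
    (e.g. 'AA001') as the folder, keeping the original file name inside it."""
    folder = slugify(entry_number).upper()  # e.g. 'AA001'
    return f"{folder}/{original_filename}"  # e.g. 'AA001/video.mp4'


print(entry_number_filename_generator("AA001", "video.mp4"))  # -> AA001/video.mp4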

Project should be easier to set up and run

Currently, running this project in an automated way requires creating a Digital Ocean Spaces bucket and manually managing cron jobs on a Linux server. Ideally, this would be simpler to deploy so that a new archiving spreadsheet could be set up in a user friendly way, even for non-programmers.

One promising possibility for moving in this direction is as a Google Sheets Add On, but other ideas can also be explored and evaluated.

"failed: no archiver" in google sheet although can download the screenshot.

I attempted to set up the auto-archiver by following this instructional video (https://www.youtube.com/watch?v=VfAhcuV2tLQ).

Initially, the code was running and the archive status in the Google sheet showed "Archive in progress," but at the end, it displayed "failed: no archiver".

The logs indicate that I have successfully scraped some data. I also downloaded some screenshots (YouTube and video), and the YouTube video in webm format. However, I am unsure why the data is not updated in the Google Sheet.

Also, is it necessary to use browsertrix-crawler? I have downloaded Docker Desktop, and my machine can run browsertrix-crawler, but the error persists.

The error messages are as below:
ERROR | main:process_sheet:138 - Got unexpected error in row 2 with twitter for url='https://twitter.com/anwaribrahim/status/1642750503422685187?cxt=HHwWhsDTsaK4nMwtAAAA': [Errno 2] No such file or directory: '/Users/usr/Documents/python/archiver/browsertrix/crawls/profile.tar.gz'
Traceback (most recent call last):
File "/Users/usr/Documents/python/archiver/auto_archive.py", line 133, in process_sheet
result = archiver.download(url, check_if_exists=c.check_if_exists)
File "/Users/usr/Documents/python/archiver/archivers/twitter_archiver.py", line 42, in download
wacz = self.get_wacz(url)
File "/Users/usr/Documents/python/archiver/archivers/base_archiver.py", line 234, in get_wacz
shutil.copyfile(self.browsertrix.profile, os.path.join(browsertrix_home, "profile.tar.gz"))
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/shutil.py", line 264, in copyfile
with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/usr/Documents/python/archiver/browsertrix/crawls/profile.tar.gz'

2023-04-13 14:47:20.800 | SUCCESS | main:process_sheet:167 - Finished worksheet Sheet1

Auto Tweet the Hash

After every successful archive, post a Tweet with the hash so that we can prove that on this day the picture/video was as it is.

Need to translate this older code to the new codebase. Probably use an enricher.

https://github.com/djhmateer/auto-archiver/blob/main/auto_archive.py#L183

This code uses a SQL database to act as a queue, and another service runs every x minutes to poll the queue to see if anything should be tweeted. It checks the NextTweetTime so as not to bombard the API. It has been working for a few months on the free Twitter API, which allows 1500 Tweets per month.
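
A rough sketch of what that polling service could look like, assuming a simple SQLite queue table and tweepy for the Twitter API v2; the table schema and credential placeholders are illustrative, not taken from the linked code:

import sqlite3
from datetime import datetime

import tweepy  # assumes Twitter API v2 credentials with write access

# hypothetical queue table: tweet_queue(id, url, hash, next_tweet_time, tweeted)
DB = "tweet_queue.db"


def poll_and_tweet(client: tweepy.Client):
    conn = sqlite3.connect(DB)
    row = conn.execute(
        "SELECT id, url, hash FROM tweet_queue "
        "WHERE tweeted = 0 AND next_tweet_time <= ? ORDER BY next_tweet_time LIMIT 1",
        (datetime.utcnow().isoformat(),),
    ).fetchone()
    if row:
        item_id, url, file_hash = row
        # one tweet per poll, spacing them out so the free tier (1500/month) is not exhausted
        client.create_tweet(text=f"Archived {url} on {datetime.utcnow():%Y-%m-%d} - hash: {file_hash}")
        conn.execute("UPDATE tweet_queue SET tweeted = 1 WHERE id = ?", (item_id,))
        conn.commit()
    conn.close()


if __name__ == "__main__":
    client = tweepy.Client(
        consumer_key="...", consumer_secret="...",
        access_token="...", access_token_secret="...",
    )
    poll_and_tweet(client)  # run this every x minutes via cron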

Test and Demo spreadsheet with urls which test all aspects

A first step to testable code could be a spreadsheet which has expected output columns. This would make it easy to see if anything isn't working (regression testing).

I've got something already and will improve and post here.

This would also be useful for demo purposes, so users can see what is happening (and what should happen).

In order of usefulness to clients / most common links

Twitter

  • 1 image (should work)
  • multiple images (should work)
  • tweet with media sensitive image(s)
  • tweet that brings a login prompt (trick is to get rid of part of the url)
  • check tweet image size is max resolution
  • tweet that contains a non-Twitter video URL, as the intent is probably to get images from the tweet

then video:

  • 1 video
  • multiple videos will not work

Facebook

  • 1 image - will not work
  • multiple images - will not work

then video:

  • 1 video. Handled by youtubedl
  • multiple videos - unusual and won't work

etc...

In order of usefulness to developers

  • TelethonArchiver (Telegrams API)
  • TikTokArchiver (always getting invalid URL so far)
  • TwitterAPIArchiver (handles all tweets if API key is there)
  • YoutubeDLArchiver (handles youtube, and facebook video)
  • TelegramArchiver (backup if telethon doesn't work which is common)
  • TwitterArchiver (only if Twitter API not working)
  • VkArchiver
  • WaybackArchiver

TikTok downloader stalling when video is unavailable - can't reproduce

A bug that I can't reliably reproduce, but it sometimes stalls the whole archiver for many hours until the archiver is restarted.

Given this URL:
https://www.tiktok.com/@jusscomfyyy/video/7090483393586089222

The tiktok downloader stalls.

https://github.com/krypton-byte/tiktok-downloader

The test app https://tkdown.herokuapp.com/ correctly reports an invalid URL.

class TiktokArchiver(Archiver):
    name = "tiktok"

    def download(self, url, check_if_exists=False):
        if 'tiktok.com' not in url:
            return False

        status = 'success'

        try:
            # really slow for some videos here 25minutes plus or stalls
            info = tiktok_downloader.info_post(url)
            key = self.get_key(f'{info.id}.mp4')
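
Until the root cause is found, one possible mitigation (a sketch, not tested against this exact issue) is to run the tiktok_downloader call with a hard timeout so that a single stalled video cannot block the whole run:

import concurrent.futures

import tiktok_downloader


def info_post_with_timeout(url: str, timeout: int = 120):
    """Run tiktok_downloader.info_post(url) but give up after `timeout` seconds.
    Note: the worker thread may keep running in the background after a timeout,
    but the archiver itself can move on to the next row."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(tiktok_downloader.info_post, url)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        return None
    finally:
        executor.shutdown(wait=False)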

Improvement suggestions for WaybackArchiver

Hi!
I'm a member of Team Wayback at the Internet Archive.
I have some improvement suggestions for

class WaybackArchiver(Archiver):

  1. You could use the Wayback Machine Availability API to easily get capture info about a captured URL: https://archive.org/help/wayback_api.php. https://web.archive.org/web/<URL> is not recommended because its purpose is to play back the latest capture. You don't need to load the whole data of the latest capture of a URL; you just need to know whether it's available or not.
  2. Save Page Now API has a lot of useful options https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA/edit

if_not_archived_within=<timedelta> should be useful in your case.

Capture web page only if the latest existing capture at the Archive is older than the limit. Its format could be any datetime expression like “3d 5h 20m” or just a number of seconds, e.g. “120”. If there is a capture within the defined timedelta, SPN2 returns that as a recent capture. The system default is 30 min.

Cheers!
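
A small sketch of how both suggestions could look with requests; the Availability API endpoint is documented at the link above, and the SPN2 call assumes an authenticated POST to https://web.archive.org/save with the if_not_archived_within parameter (exact auth header format should be checked against the SPN2 doc above):

import requests


def latest_capture(url: str):
    """Check if a URL is already captured, using the Wayback Machine Availability API."""
    r = requests.get("https://archive.org/wayback/available", params={"url": url}, timeout=30)
    return r.json().get("archived_snapshots", {}).get("closest")  # None if not captured


def save_page_now(url: str, access_key: str, secret: str):
    """Ask SPN2 to capture, but only if the latest capture is older than 1 day (assumed credentials)."""
    return requests.post(
        "https://web.archive.org/save",
        headers={"Accept": "application/json", "Authorization": f"LOW {access_key}:{secret}"},
        data={"url": url, "if_not_archived_within": "1d"},
        timeout=30,
    ).json()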

UX bug: archiving fails if the "url" is replaced with its linked title text

When pasting a URL in Sheets, a helpful little dialog appears and suggests that you "Replace URL with its title" as linked text (see image below). Auto-archiver doesn't know how to handle this format and returns "nothing archived". However, it should be possible to detect and extract the URL when the cell value is a link.

[Screenshot 2023-08-01 at 16:43:13]
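
A possible workaround, as a sketch: assuming gspread is used to read the sheet and the cell ends up holding a =HYPERLINK(...) formula, requesting the cell with value_render_option='FORMULA' exposes the URL; rich-text links may instead require the full Sheets API (reading the cell's textFormatRuns).

import re

import gspread


def cell_url(worksheet: gspread.Worksheet, label: str) -> str:
    """Return the URL in a cell, whether it holds a plain URL or a =HYPERLINK() link."""
    raw = worksheet.acell(label, value_render_option="FORMULA").value or ""
    match = re.match(r'=HYPERLINK\("([^"]+)"', raw, re.IGNORECASE)
    return match.group(1) if match else raw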

Facebook image archiving

Archiving of image(s) on Facebook is not supported yet and would be very useful.

Placeholder issue to collect ideas for how it could potentially be done.

Background

https://github.com/djhmateer/auto-archiver#archive-logic has a list of what works and doesn't. Facebook video works using youtube_dlp

In the fork above, to get a Facebook screenshot I am using automation to click through the accept-cookies page, as we don't want the cookie popup in the screenshot.

To get a Facebook post link

"Each Facebook post has a timestamp on the top (it may be something like Just now, 3 mins or Yesterday). This timestamp contains the link to your post. So, to copy it, simply hover your mouse over the timestamp, right click, then copy link address"

Example

As an example of Facebook images which we would like to archive:

https://www.facebook.com/chelseymateerbeautician/posts/pfbid0mhimrwfeBpWKwBUFna28Q3RfaEK8HETcEpk1QXoEeFXHVwaa7oxLxKTHbBqu5nPpl

https://gist.github.com/pcardune/1332911 - potentially this may help.

#26 - @msramalho talked about the potential of https://archive.ph/

include exif metadata

Add an additional data point containing EXIF metadata where available; for now I can only think of Telegram media.
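
A minimal sketch of what the enricher could do per image, using Pillow (how the result attaches to the metadata is left out):

from PIL import ExifTags, Image


def exif_metadata(image_path: str) -> dict:
    """Return human-readable EXIF tags for an image, or an empty dict if none are present."""
    with Image.open(image_path) as img:
        exif = img.getexif()
        return {ExifTags.TAGS.get(tag_id, tag_id): str(value) for tag_id, value in exif.items()}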

new enricher: store HTTP headers and SSL certificate

Create a new enricher that can

  • (optionally) store the HTTP request headers - need to think of use cases and how to be comprehensive
  • (optionally) store the HTTP response headers - same as above
  • (optionally) store the SSL certificate of https connections

The display in the html_formatter should probably be initially hidden.

Another option is to design a high-level activity log that captures all actions and logs and appends them to the end of the HTML report.
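
A rough sketch of how such an enricher could collect that data (names are illustrative; wiring it into the existing enricher interface and Metadata object is left out):

import ssl
from urllib.parse import urlparse

import requests


def collect_http_and_tls_info(url: str) -> dict:
    """Gather request/response headers and, for https URLs, the server certificate (PEM)."""
    resp = requests.get(url, timeout=30)
    info = {
        "request_headers": dict(resp.request.headers),
        "response_headers": dict(resp.headers),
        "status_code": resp.status_code,
    }
    parsed = urlparse(url)
    if parsed.scheme == "https":
        info["ssl_certificate_pem"] = ssl.get_server_certificate((parsed.hostname, parsed.port or 443))
    return info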

Google service account credentials usage doesn't match README instructions

Hi there! The README says the Google service account credentials should be placed at ~/.config/gspread/service_account.json, which is where gspread looks by default. However, line 17 of auto_auto_archive.py specifies a credentials file in the same directory as the script, so the script won't run unless you drop service_account.json in the same directory.

So it seems that either auto_auto_archive.py should be changed to call gspread.service_account without a filename parameter; or the README should be changed to specify that service_account.json be created in the application root. If you have a preference, I can open a pull request and change it either way.

Google Drive bug with leading /

In telethon, if there are subdirectories that need creating, sometimes a key is passed with a leading / character, which confuses the join.

I've worked around it by adding a catch, but need to clean up

  • find the root cause
  • keep the catch in, logging to the error log so we know if it happens again

https://github.com/bellingcat/auto-archiver/blob/main/storages/gd_storage.py

    def uploadf(self, file: str, key: str, **_kwargs):
        """
        1. for each sub-folder in the path check if exists or create
        2. upload file to root_id/other_paths.../filename
        """
        # doesn't work if key starts with / which can happen from telethon todo fix
        if key.startswith('/'):
            # remove first character ie /
            key = key[1:]

An example is: https://t.me/witnessdaily/169265

Archive non-video media (images and sound)

Currently, auto-archiver relies on youtube-dl to download media, which only finds video sources. It would be a significant improvement to download images, and possibly audio as well.

YouTube playlists - probably not intentional from user

https://podrobnosti.ua/2443817-na-kivschin-cherez-vorozhij-obstrl-vinikla-pozhezha.html

This site contains a live link at the top which is a link to a YouTube playlist with 1 item which is a live stream.

Currently the 'is_live' check doesn't catch it because it is a playlist, so the archiver will proceed to download 3.7GB of the stream, then create thousands of thumbnails.

I propose a simple fix in youtubedl_archiver.py to stop downloading of playlists, which struck me as probably not what the user would want:

        if info.get('is_live', False):
            logger.warning("Live streaming media, not archiving now")
            return ArchiveResult(status="Streaming media")

        # added this catch below
        infotype = info.get('_type', False)
        if infotype is not False:
            if 'playlist' in infotype:
                logger.info('found a youtube playlist - this probably is not intended. Have put in this as edge case of a live stream which is a single item in a playlist')
                return ArchiveResult(status="Playlist")

There is probably a much more elegant way to express this!
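
One possibly more compact way to express the same check (a sketch; behaviour should match the snippet above):

        # skip playlists: an accidental playlist link (e.g. a live stream wrapped
        # in a single-item playlist) is probably not what the user intended
        if "playlist" in (info.get("_type") or ""):
            logger.info("found a youtube playlist - probably not intended, skipping")
            return ArchiveResult(status="Playlist")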

Can submit a PR if you agree.

Multiple instances of auto-archiver and Proxmox / Azure

I'm hosting 3 instances of the auto-archiver on 3 separate VMs. I've allocated 4GB of RAM to each and the systems work well.

Is anyone running multiple instances on a single VM, and have you found any issues with simultaneous calls to ffmpeg / Gecko drivers? That is what concerns me the most.

Running Python in a virtual env, e.g. pipenv run python auto_archive.py, should segregate that side, I guess.

Scrape Youtube comments, livechats

youtube-dl, or at least yt-dlp, is capable of downloading and dumping live chat data, machine and manual transcripts of videos, and so on. YouTube comments can be grabbed with the official API with some difficulty, or by this project relatively easily. I've found having all of these around useful for some OSINT tasks.
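
For reference, a sketch of the yt-dlp options that (as I understand the yt-dlp docs) pull comments and live chat alongside the video; option names should be double-checked against the yt-dlp version in use:

import yt_dlp

ydl_opts = {
    "writeinfojson": True,                  # comments end up inside the .info.json
    "getcomments": True,                    # equivalent of --write-comments
    "writesubtitles": True,
    "writeautomaticsub": True,              # machine transcripts
    "subtitleslangs": ["en", "live_chat"],  # live chat is exposed as a 'subtitle' track
    "skip_download": True,                  # only grab the metadata/transcripts here
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=VIDEO_ID"])  # placeholder URL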

archive facebook with archive.ph

https://archive.ph/ does not have an API like the Internet Archive's Wayback Machine, although it can archive Facebook pages, which the Wayback Machine cannot. Could we use Selenium to submit links via the archive.ph UI and thus successfully archive them?
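
A very rough Selenium sketch of that idea (assuming the archive.ph landing page exposes a submit form whose input is named 'url'; the page layout, possible captchas and rate limiting would all need to be verified):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait


def submit_to_archive_ph(url: str) -> str:
    """Submit a URL to archive.ph via its web form and return the resulting page URL."""
    driver = webdriver.Firefox()
    try:
        driver.get("https://archive.ph/")
        box = driver.find_element(By.NAME, "url")  # assumed field name
        box.send_keys(url)
        box.send_keys(Keys.RETURN)
        # wait until we have navigated away from the landing page (archiving can take a while)
        WebDriverWait(driver, 120).until(lambda d: d.current_url != "https://archive.ph/")
        return driver.current_url
    finally:
        driver.quit()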

generating WACZ without Docker - wacz not working

Getting a proxy connection failure from the wacz_archiver_enricher on all URLs.

First time I've set this up, so probably something simple / maybe I've missed something.

The next step for me is to set up a local dev version and debug it, but this issue may be useful for others at the same stage as me.

I have the profile set up in secrets/profile.tar.gz, which I created via:

# create a new profile
docker run -p 6080:6080 -p 9223:9223 -v $PWD/crawls/profiles:/crawls/profiles/ -it webrecorder/browsertrix-crawler create-login-profile --url "https://twitter.com/"

Output of the run is:

docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --config secrets/orchestration.yaml

2023-08-22 10:50:17.450 | INFO     | auto_archiver.core.config:parse:111 - FEEDER: gsheet_feeder
2023-08-22 10:50:17.450 | INFO     | auto_archiver.core.config:parse:112 - ENRICHERS: ['hash_enricher', 'wacz_archiver_enricher']
2023-08-22 10:50:17.450 | INFO     | auto_archiver.core.config:parse:113 - ARCHIVERS: ['wacz_archiver_enricher']
2023-08-22 10:50:17.450 | INFO     | auto_archiver.core.config:parse:114 - DATABASES: ['gsheet_db']
2023-08-22 10:50:17.450 | INFO     | auto_archiver.core.config:parse:115 - STORAGES: ['local_storage']
2023-08-22 10:50:17.450 | INFO     | auto_archiver.core.config:parse:116 - FORMATTER: html_formatter
2023-08-22 10:50:24.319 | INFO     | auto_archiver.feeders.gsheet_feeder:__iter__:48 - Opening worksheet ii=0: wks.title='Sheet1' header=1
2023-08-22 10:50:26.275 | WARNING  | auto_archiver.databases.gsheet_db:started:28 - STARTED Metadata(status='no archiver', metadata={'_processed_at': datetime.datetime(2023, 8, 22, 10, 50, 26, 274503), 'url': 'https://twitter.com/dave_mateer/status/1505876265504546817'}, media=[])
2023-08-22 10:50:26.916 | INFO     | auto_archiver.core.orchestrator:archive:85 - Trying archiver wacz_archiver_enricher for https://twitter.com/dave_mateer/status/1505876265504546817
2023-08-22 10:50:26.916 | DEBUG    | auto_archiver.enrichers.wacz_enricher:enrich:52 - generating WACZ without Docker for url='https://twitter.com/dave_mateer/status/1505876265504546817'
2023-08-22 10:50:26.916 | INFO     | auto_archiver.enrichers.wacz_enricher:enrich:96 - Running browsertrix-crawler: crawl --url https://twitter.com/dave_mateer/status/1505876265504546817 --scopeType page --generateWACZ --text --screenshot fullPage --collection 5e60e6e9 --id 5e60e6e9 --saveState never --behaviors autoscroll,autoplay,autofetch,siteSpecific --behaviorTimeout 120 --timeout 120 --profile /app/secrets/profile.tar.gz
{"logLevel":"info","timestamp":"2023-08-22T10:50:27.983Z","context":"general","message":"Browsertrix-Crawler 0.10.3 (with warcio.js 1.6.2 pywb 2.7.4)","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:50:27.983Z","context":"general","message":"Seeds","details":[{"url":"https://twitter.com/dave_mateer/status/1505876265504546817","include":[],"exclude":[],"scopeType":"page","sitemap":false,"allowHash":false,"maxExtraHops":0,"maxDepth":1000000}]}
{"logLevel":"info","timestamp":"2023-08-22T10:50:27.983Z","context":"general","message":"With Browser Profile","details":{"url":"/app/secrets/profile.tar.gz"}}
{"logLevel":"info","timestamp":"2023-08-22T10:50:30.205Z","context":"worker","message":"Creating 1 workers","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:50:30.205Z","context":"worker","message":"Worker starting","details":{"workerid":0}}
{"logLevel":"info","timestamp":"2023-08-22T10:50:30.263Z","context":"browser","message":"Disabling Service Workers for profile","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:50:30.269Z","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://twitter.com/dave_mateer/status/1505876265504546817"}}
{"logLevel":"info","timestamp":"2023-08-22T10:50:30.270Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":null,"total":null,"pending":1,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2023-08-22T10:50:30.206Z\",\"url\":\"https://twitter.com/dave_mateer/status/1505876265504546817\",\"added\":\"2023-08-22T10:50:28.119Z\",\"depth\":0}"]}}
{"logLevel":"info","timestamp":"2023-08-22T10:50:32.373Z","context":"general","message":"Awaiting page load","details":{"page":"https://twitter.com/dave_mateer/status/1505876265504546817","workerid":0}}
{"logLevel":"error","timestamp":"2023-08-22T10:51:02.379Z","context":"general","message":"Page Load Error, skipping page","details":{"msg":"net::ERR_TIMED_OUT at https://twitter.com/dave_mateer/status/1505876265504546817","page":"https://twitter.com/dave_mateer/status/1505876265504546817","workerid":0}}
{"logLevel":"error","timestamp":"2023-08-22T10:51:02.379Z","context":"worker","message":"Unknown exception","details":{"type":"exception","message":"net::ERR_TIMED_OUT at https://twitter.com/dave_mateer/status/1505876265504546817","stack":"Error: net::ERR_TIMED_OUT at https://twitter.com/dave_mateer/status/1505876265504546817\n    at navigate (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Frame.js:98:23)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async Deferred.race (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/util/Deferred.js:79:20)\n    at async Frame.goto (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Frame.js:64:21)\n    at async CDPPage.goto (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:578:16)\n    at async Crawler.loadPage (file:///app/crawler.js:1062:20)\n    at async Crawler.default [as driver] (file:///app/defaultDriver.js:3:3)\n    at async Crawler.crawlPage (file:///app/crawler.js:451:5)\n    at async PageWorker.timedCrawlPage (file:///app/util/worker.js:151:7)\n    at async PageWorker.runLoop (file:///app/util/worker.js:192:9)","workerid":0}}
{"logLevel":"warn","timestamp":"2023-08-22T10:51:02.380Z","context":"pageStatus","message":"Page Load Failed","details":{"loadState":0,"page":"https://twitter.com/dave_mateer/status/1505876265504546817","workerid":0}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:02.386Z","context":"worker","message":"Worker exiting, all tasks complete","details":{"workerid":0}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:02.483Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":1,"total":1,"pending":0,"failed":1,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:02.483Z","context":"general","message":"Waiting to ensure pending data is written to WARCs...","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:02.485Z","context":"general","message":"Generating WACZ","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:02.485Z","context":"general","message":"Num WARC Files: 0","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:02.485Z","context":"general","message":"Crawl status: done","details":{}}
2023-08-22 10:51:02.489 | WARNING  | auto_archiver.enrichers.wacz_enricher:enrich:108 - Unable to locate and upload WACZ  filename='collections/5e60e6e9/5e60e6e9.wacz'
2023-08-22 10:51:02.490 | DEBUG    | auto_archiver.enrichers.hash_enricher:enrich:31 - calculating media hashes for url='https://twitter.com/dave_mateer/status/1505876265504546817' (using SHA3-512)
2023-08-22 10:51:02.490 | DEBUG    | auto_archiver.enrichers.wacz_enricher:enrich:52 - generating WACZ without Docker for url='https://twitter.com/dave_mateer/status/1505876265504546817'
2023-08-22 10:51:02.490 | INFO     | auto_archiver.enrichers.wacz_enricher:enrich:96 - Running browsertrix-crawler: crawl --url https://twitter.com/dave_mateer/status/1505876265504546817 --scopeType page --generateWACZ --text --screenshot fullPage --collection c851aa3f --id c851aa3f --saveState never --behaviors autoscroll,autoplay,autofetch,siteSpecific --behaviorTimeout 120 --timeout 120 --profile /app/secrets/profile.tar.gz
{"logLevel":"info","timestamp":"2023-08-22T10:51:03.099Z","context":"general","message":"Browsertrix-Crawler 0.10.3 (with warcio.js 1.6.2 pywb 2.7.4)","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:03.100Z","context":"general","message":"Seeds","details":[{"url":"https://twitter.com/dave_mateer/status/1505876265504546817","include":[],"exclude":[],"scopeType":"page","sitemap":false,"allowHash":false,"maxExtraHops":0,"maxDepth":1000000}]}
{"logLevel":"info","timestamp":"2023-08-22T10:51:03.100Z","context":"general","message":"With Browser Profile","details":{"url":"/app/secrets/profile.tar.gz"}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:03.653Z","context":"worker","message":"Creating 1 workers","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:03.653Z","context":"worker","message":"Worker starting","details":{"workerid":0}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:03.703Z","context":"browser","message":"Disabling Service Workers for profile","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:03.710Z","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://twitter.com/dave_mateer/status/1505876265504546817"}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:03.710Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":null,"total":null,"pending":1,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2023-08-22T10:51:03.654Z\",\"url\":\"https://twitter.com/dave_mateer/status/1505876265504546817\",\"added\":\"2023-08-22T10:51:03.157Z\",\"depth\":0}"]}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:05.392Z","context":"general","message":"Awaiting page load","details":{"page":"https://twitter.com/dave_mateer/status/1505876265504546817","workerid":0}}
{"logLevel":"error","timestamp":"2023-08-22T10:51:05.398Z","context":"general","message":"Page Load Error, skipping page","details":{"msg":"net::ERR_PROXY_CONNECTION_FAILED at https://twitter.com/dave_mateer/status/1505876265504546817","page":"https://twitter.com/dave_mateer/status/1505876265504546817","workerid":0}}
{"logLevel":"error","timestamp":"2023-08-22T10:51:05.399Z","context":"worker","message":"Unknown exception","details":{"type":"exception","message":"net::ERR_PROXY_CONNECTION_FAILED at https://twitter.com/dave_mateer/status/1505876265504546817","stack":"Error: net::ERR_PROXY_CONNECTION_FAILED at https://twitter.com/dave_mateer/status/1505876265504546817\n    at navigate (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Frame.js:98:23)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async Deferred.race (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/util/Deferred.js:79:20)\n    at async Frame.goto (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Frame.js:64:21)\n    at async CDPPage.goto (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:578:16)\n    at async Crawler.loadPage (file:///app/crawler.js:1062:20)\n    at async Crawler.default [as driver] (file:///app/defaultDriver.js:3:3)\n    at async Crawler.crawlPage (file:///app/crawler.js:451:5)\n    at async PageWorker.timedCrawlPage (file:///app/util/worker.js:151:7)\n    at async PageWorker.runLoop (file:///app/util/worker.js:192:9)","workerid":0}}
{"logLevel":"warn","timestamp":"2023-08-22T10:51:05.399Z","context":"pageStatus","message":"Page Load Failed","details":{"loadState":0,"page":"https://twitter.com/dave_mateer/status/1505876265504546817","workerid":0}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:05.409Z","context":"worker","message":"Worker exiting, all tasks complete","details":{"workerid":0}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:05.540Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":1,"total":1,"pending":0,"failed":1,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:05.540Z","context":"general","message":"Waiting to ensure pending data is written to WARCs...","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:05.542Z","context":"general","message":"Generating WACZ","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:05.543Z","context":"general","message":"Num WARC Files: 0","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:05.543Z","context":"general","message":"Crawl status: done","details":{}}
2023-08-22 10:51:05.549 | WARNING  | auto_archiver.enrichers.wacz_enricher:enrich:108 - Unable to locate and upload WACZ  filename='collections/c851aa3f/c851aa3f.wacz'
2023-08-22 10:51:05.549 | DEBUG    | auto_archiver.formatters.html_formatter:format:37 - [SKIP] FORMAT there is no media or metadata to format: url='https://twitter.com/dave_mateer/status/1505876265504546817'
2023-08-22 10:51:05.549 | SUCCESS  | auto_archiver.databases.gsheet_db:done:46 - DONE https://twitter.com/dave_mateer/status/1505876265504546817
2023-08-22 10:51:06.365 | SUCCESS  | auto_archiver.feeders.gsheet_feeder:__iter__:79 - Finished worksheet Sheet1

and orchestration.yaml is:

steps:
  # only 1 feeder allowed
  feeder: gsheet_feeder # defaults to cli_feeder
  archivers: # order matters, uncomment to activate
    # - vk_archiver
    # - telethon_archiver
    # - telegram_archiver
    # - twitter_archiver
    #- twitter_api_archiver
    # - instagram_tbot_archiver
    # - instagram_archiver
    # - tiktok_archiver
    # - youtubedl_archiver
    # - wayback_archiver_enricher
    - wacz_archiver_enricher
  enrichers:
    - hash_enricher
    # - metadata_enricher
    # - screenshot_enricher
    # - thumbnail_enricher
    # - wayback_archiver_enricher
    - wacz_archiver_enricher
    # - pdq_hash_enricher # if you want to calculate hashes for thumbnails, include this after thumbnail_enricher
  formatter: html_formatter # defaults to mute_formatter
  storages:
    - local_storage
    # - s3_storage
    # - gdrive_storage
  databases:
    #- console_db
    # - csv_db
    - gsheet_db
    # - mongo_db

configurations:
  gsheet_feeder:
    sheet: "AA Demo Main"
    header: 1
    service_account: "secrets/service_account.json"
    # allow_worksheets: "only parse this worksheet"
    # block_worksheets: "blocked sheet 1,blocked sheet 2"
    use_sheet_names_in_stored_paths: false
    columns:
      url: link
      status: archive status
      folder: destination folder
      archive: archive location
      date: archive date
      thumbnail: thumbnail
      timestamp: upload timestamp
      title: upload title
      text: textual content
      screenshot: screenshot
      hash: hash
      pdq_hash: perceptual hashes
      wacz: wacz
      replaywebpage: replaywebpage
  instagram_tbot_archiver:
    api_id: "TELEGRAM_BOT_API_ID"
    api_hash: "TELEGRAM_BOT_API_HASH"
    # session_file: "secrets/anon"
  telethon_archiver:
    api_id: "TELEGRAM_BOT_API_ID"
    api_hash: "TELEGRAM_BOT_API_HASH"
    # session_file: "secrets/anon"
    join_channels: false
    channel_invites: # if you want to archive from private channels
      - invite: https://t.me/+123456789
        id: 0000000001
      - invite: https://t.me/+123456788
        id: 0000000002

  twitter_api_archiver:
    # either bearer_token only
    # bearer_token: "TWITTER_BEARER_TOKEN"
   

  instagram_archiver:
    username: "INSTAGRAM_USERNAME"
    password: "INSTAGRAM_PASSWORD"
    # session_file: "secrets/instaloader.session"

  vk_archiver:
    username: "or phone number"
    password: "vk pass"
    session_file: "secrets/vk_config.v2.json"

  screenshot_enricher:
    width: 1280
    height: 2300
  wayback_archiver_enricher:
    timeout: 10
    key: "wayback key"
    secret: "wayback secret"
  hash_enricher:
    algorithm: "SHA3-512" # can also be SHA-256
  wacz_archiver_enricher:
    profile: secrets/profile.tar.gz
  local_storage:
    save_to: "./local_archive"
    save_absolute: true
    filename_generator: static
    path_generator: flat
  s3_storage:
    bucket: your-bucket-name
    region: reg1
    key: S3_KEY
    secret: S3_SECRET
    endpoint_url: "https://{region}.digitaloceanspaces.com"
    cdn_url: "https://{bucket}.{region}.cdn.digitaloceanspaces.com/{key}"
    # if private:true S3 urls will not be readable online
    private: false
    # with 'random' you can generate a random UUID for the URL instead of a predictable path, useful to still have public but unlisted files; the alternative is 'default', or omit from config
    key_path: random
  gdrive_storage:
    path_generator: url
    filename_generator: random
    root_folder_id: folder_id_from_url
    oauth_token: secrets/gd-token.json # needs to be generated with scripts/create_update_gdrive_oauth_token.py
    service_account: "secrets/service_account.json"
  csv_db:
    csv_file: "./local_archive/db.csv"

Detect columns from headers

Rather than specifying the columns to use for the archive URL, timestamp, etc. as command-line flags, these should be determined from headers in the Google Sheet itself.
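
A sketch of how the header row could drive the column mapping via gspread (header names would mirror the ones in the sheet itself):

import gspread


def detect_columns(worksheet: gspread.Worksheet, header_row: int = 1) -> dict:
    """Map lower-cased header names to 0-based column indices, e.g. {'link': 0, 'archive status': 1}."""
    headers = worksheet.row_values(header_row)
    return {name.strip().lower(): idx for idx, name in enumerate(headers) if name.strip()}


# usage sketch (credentials file and sheet name are placeholders)
# gc = gspread.service_account(filename="service_account.json")
# ws = gc.open("My archiving sheet").sheet1
# cols = detect_columns(ws)
# url_col = cols.get("link")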

Whitelist and Blacklist of Worksheet

We have a spreadsheet with multiple worksheets and I'd like to whitelist or blacklist based on the title.

The reason is that one of the worksheets is an exact copy of the worksheet that I want archived, with the same column names. So the archiver picks it up when we don't want it.

I propose adding 2 extra config items, something like this:

execution:
  # spreadsheet name - can be overwritten with CMD --sheet=
  sheet: "Test Hashing"

  # worksheet to blacklist. Leave blank which is default for none. Useful if users want a MASTERSHEET exact copy of the 
  # working worksheet
  worksheet_blacklist: MASTERSHEET
  # only check this worksheet rather than iterating through all worksheets in the spreadsheet. If whitelist is used 
  # then blacklist is ignored as whitelist is most restrictive.
  worksheet_whitelist: Sheet1

I only need a single item in the 'lists'.

Happy to code this up and do a PR.
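
A sketch of the selection logic (whitelist wins over blacklist, as proposed above):

def worksheets_to_process(all_titles: list, whitelist: list = None, blacklist: list = None) -> list:
    """Return worksheet titles to archive; a non-empty whitelist takes precedence over the blacklist."""
    if whitelist:
        return [t for t in all_titles if t in whitelist]
    blacklist = blacklist or []
    return [t for t in all_titles if t not in blacklist]


# usage sketch
print(worksheets_to_process(["Sheet1", "MASTERSHEET"], blacklist=["MASTERSHEET"]))  # ['Sheet1']
print(worksheets_to_process(["Sheet1", "MASTERSHEET"], whitelist=["Sheet1"]))       # ['Sheet1']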

Support Instagram image posts

It would be helpful to have support for archiving Instagram image posts as well as videos (which are currently archived with youtube-dl). This would likely require additional authentication credentials, like a cookie, to be specified in the configuration file.
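
One possible route is instaloader, which handles image posts and sidecars; a sketch, assuming an Instagram login (matching the credential requirement mentioned above) and a placeholder shortcode:

import instaloader

L = instaloader.Instaloader(download_videos=False, save_metadata=True)
L.login("INSTAGRAM_USERNAME", "INSTAGRAM_PASSWORD")  # or load a saved session file

# shortcode is the part of the post URL after /p/, e.g. https://www.instagram.com/p/<shortcode>/
post = instaloader.Post.from_shortcode(L.context, "SHORTCODE")
L.download_post(post, target="instagram_archive")  # saves the image(s) plus metadata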

When .html is in the path, the screenshot saves as .html

http://brokenlinkcheckerchecker.com/pagea.html

The screenshot PNG would be saved as wayback_pageb-2022-11-11t10-55-09-277235.html

A simple fix is in base_archiver.py - I commented out a line at the bottom of the function:

    def _get_key_from_url(self, url, with_extension: str = None, append_datetime: bool = False):
        """
        Receives a URL and returns a slugified version of the URL path
        if a string is passed in @with_extension the slug is appended with it if there is no "." in the slug
        if @append_date is true, the key adds a timestamp after the URL slug and before the extension
        """
        url_path = urlparse(url).path
        path, ext = os.path.splitext(url_path)
        slug = slugify(path)
        if append_datetime:
            slug += "-" + slugify(datetime.datetime.utcnow().isoformat())
        if len(ext):
            slug += ext
        if with_extension is not None:
            # I have a url with .html in the path, and want the screenshot to be .png
            # eg http://brokenlinkcheckerchecker.com/pageb.html
            # am happy with .html.png as a file extension
            # commented out the follow line to fix
            # unsure as to why this is here 
            # if "." not in slug:
                slug += with_extension
        return self.get_key(slug)

which then gives wayback_pageb-2022-11-11t10-55-09-277235.html.png

Happy to do a PR if I haven't misunderstood anything here.

Add a timestamp authority client Step

Following information from this timestamp-authority repo (RFC 3161 Timestamp Authority), implement a Step which connects to a timestamp authority server. One example that can be tested right away is
https://freetsa.org/index_en.php

Taken from there, a full example with SHA-512 is:

###########################################################
# 1. create a tsq file (SHA 512)
###########################################################
openssl ts -query -data file.png -no_nonce -sha512 -out file.tsq

# Option -cert: FreeTSA is expected to include its signing certificate (Root + Intermediate Certificates) in the response. (Optional)
# If the tsq was created with the option "-cert", its verification does not require "-untrusted".
#$ openssl ts -query -data file.png -no_nonce -sha512 -cert -out file.tsq


# How to make Timestamps of many files?

# To timestamp multiple files, create a text file with all their SHA-512 hashes and timestamp it.
# Alternatively, you may pack all the files to be timestamped in a zip/rar/img/tar, etc file and timestamp it.

# Generate a text file with all the hashes of the /var/log/ files
$ find /var/log/ -type f -exec sha512sum {} + > compilation.txt

###########################################################
# 2. cURL Time Stamp Request Input (HTTP / HTTPS)
###########################################################

# HTTP 2.0 in cURL: Get the latest cURL release and use this command: curl --http2.
curl -H "Content-Type: application/timestamp-query" --data-binary '@file.tsq' https://freetsa.org/tsr > file.tsr

# Using the Tor-network.
#$ curl -k --socks5-hostname 127.0.0.1:9050 -H "Content-Type: application/timestamp-query" --data-binary '@file.tsq' https://4bvu5sj5xok272x6cjx4uurvsbsdigaxfmzqy3n3eita272vfopforqd.onion/tsr > file.tsr

# tsget is very useful to stamp multiple time-stamp-queries: https://www.openssl.org/docs/manmaster/apps/tsget.html
#$ tsget -h https://freetsa.org/tsr file1.tsq file2.tsq file3.tsq

###########################################################
# 3. Verify tsr file
###########################################################

wget https://freetsa.org/files/tsa.crt
wget https://freetsa.org/files/cacert.pem

# Timestamp Information.
openssl ts -reply -in file.tsr -text

# Verify (two different ways).
# openssl ts -verify -data file -in file.tsr -CAfile cacert.pem -untrusted tsa.crt 
openssl ts -verify -in file.tsr -queryfile file.tsq -CAfile cacert.pem -untrusted tsa.crt
# Verification: OK
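
If this becomes a Step, a thin Python wrapper around exactly those openssl commands could look like the sketch below (paths and the freetsa endpoint as in the shell example above; error handling and CA file management are simplified):

import subprocess

import requests

TSA_URL = "https://freetsa.org/tsr"


def timestamp_file(path: str) -> str:
    """Create <path>.tsq with openssl, send it to the TSA, and write the response to <path>.tsr."""
    tsq, tsr = f"{path}.tsq", f"{path}.tsr"
    subprocess.run(
        ["openssl", "ts", "-query", "-data", path, "-no_nonce", "-sha512", "-out", tsq],
        check=True,
    )
    with open(tsq, "rb") as f:
        resp = requests.post(TSA_URL, data=f.read(),
                             headers={"Content-Type": "application/timestamp-query"}, timeout=60)
    resp.raise_for_status()
    with open(tsr, "wb") as f:
        f.write(resp.content)
    return tsr


def verify_timestamp(path: str, cacert: str = "cacert.pem", tsa_cert: str = "tsa.crt") -> bool:
    """Verify <path>.tsr against <path>.tsq using the CA files downloaded from freetsa.org."""
    result = subprocess.run(
        ["openssl", "ts", "-verify", "-in", f"{path}.tsr", "-queryfile", f"{path}.tsq",
         "-CAfile", cacert, "-untrusted", tsa_cert],
        capture_output=True, text=True,
    )
    return result.returncode == 0 and "Verification: OK" in (result.stdout + result.stderr)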

Discussion topics

  • If we do this for a final document like the HTML report, it's enough to do it once, but then the timestamp cannot be saved within the HTML report, as that would break the hashing (unless we create a link to the tsq and tsr files that are only created after the HTML report is written and hashed). It does not matter if those files don't exist yet: they can be created later, since their content is the actual verification, and they don't need hashes of their own because they contain the HTML hash even though the HTML references them.
  • If we do it for each file it can lead to a lot of overhead, which does not sound like a great approach.

Given the cyclical nature of this, I wonder what the best way to implement it is: it needs to run after the HtmlFormatter, which can only happen as a database step, meaning the formatter should only include/display the links if they actually exist.

Archive links from the Discord server

Giancarlo spoke to the Bellingcat Community Discord yesterday, including about how Bellingcat has an auto-archiver that works with links dropped manually into a Google Sheet. This seems like it could be easily extended to auto-archive any link posted in specific channels on the Discord. This might also turn out to be a faster way to use the auto-archiver, for any researchers who are using Discord in whatever capacity. The idea is that it gobbles up any URL posted anywhere in a message in an entire channel that matches one of the configured archivers.

If you want this and it wouldn't cost too much to run on specific channels, then I'm happy to build it. There are a few options for implementation that someone might be able to provide some guidance on.

Bot vs batch

I've started creating a bot. I figure that's probably better in terms of not exceeding the Discord API limits by doing a "get everything on this channel" frequently, but depending on how you guys like to run these archivers, there may be disadvantages in that it kinda has to stay running all the time. I suppose there'd be no harm in doing both, a bot that on startup reads the channel histories up to a max # of messages, and then waits quietly for new messages. Where do you stand on that?

How to trigger the actual archiving

You could:

  • Add it to a Google Sheet, and that's it. Let the existing scheduled archiver take care of any added URLs. A fair bit simpler.
  • On receiving a message with a link in it, add it to the sheet and schedule an archive to occur in the same discord_archive.py program. This is cooler because then the bot could whack a little react emoji on messages to indicate the archive status (a rough sketch follows below).
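
A minimal sketch of the bot side (discord.py; the sheet-append helper and channel names are stand-ins for however the gsheet feeder ends up being wired in):

import re

import discord

URL_RE = re.compile(r"https?://\S+")
WATCHED_CHANNELS = {"archive-me"}  # hypothetical channel names to watch

intents = discord.Intents.default()
intents.message_content = True
client = discord.Client(intents=intents)


def add_to_sheet(url: str):
    """Stand-in: append the URL to the archiving Google Sheet (e.g. via gspread append_row)."""
    print(f"would archive: {url}")


@client.event
async def on_message(message: discord.Message):
    if message.author.bot or getattr(message.channel, "name", None) not in WATCHED_CHANNELS:
        return
    for url in URL_RE.findall(message.content):
        add_to_sheet(url)
        await message.add_reaction("🗃️")  # signal that the link was queued


client.run("DISCORD_BOT_TOKEN")  # placeholder token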

Deduplication

Considering this will be adding a bunch of new links to the archives, I would be worried about whether it's going to clobber previously archived pages in the S3 backend. This is something the archivers themselves are meant to detect, right? I don't think the Twitter one does this. Does DigitalOcean's S3 support version history, just in case? And the archivers don't overwrite anything if they hit a 404, right?


[Attachment: discordbot.mov]

YoutubeDL can return non-video content

Currently this isn't handled well by the archiver. For example, if YoutubeDL returns a PDF, it will attempt to run ffmpeg on it, resulting in an error. We should be ready to handle non-video content coming from YoutubeDL (and more generally, ref #3).
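
A small sketch of a guard that could sit before the ffmpeg/thumbnail step (using mimetypes; the surrounding flow is simplified):

import mimetypes


def is_video(filename: str) -> bool:
    """Only run ffmpeg/thumbnail generation on files that look like video."""
    mime, _ = mimetypes.guess_type(filename)
    return bool(mime) and mime.startswith("video/")


# usage sketch
print(is_video("report.pdf"))  # False -> store as-is, skip ffmpeg
print(is_video("clip.mp4"))    # True  -> safe to generate thumbnails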

Handle pages with multiple videos

Youtube-dl supports pages that contain multiple videos, but the auto-archiver does not. Instead, these URLs will be skipped over with a notification to the user that pages with multiple videos are not currently supported.

In the current user interface, there is a 1:1 relationship between a row in the spreadsheet and a file to be archived. This needs to be generalized to 1:many in order to support pages with multiple videos. One possible way to do this would be to generate HTML index pages that link/include all videos archived for a page, similar to the way that thumbnail contact sheets are generated.
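
For reference, a sketch of how the 1:many case surfaces in yt-dlp's info dict; each entry could then become its own archived item or a row in a generated index page:

import yt_dlp

with yt_dlp.YoutubeDL({"skip_download": True}) as ydl:
    info = ydl.extract_info("https://example.com/page-with-videos", download=False)  # placeholder URL

if info.get("_type") == "playlist":
    for entry in info.get("entries") or []:
        print(entry.get("title"), entry.get("webpage_url"))  # one archive target per entry
else:
    print("single video:", info.get("title"))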

AttributeError: 'HashEnricher' object has no attribute 'algorithm'

I'm running into an interesting error when archiving a simple URL locally on my computer (macOS) (command is python3 -m auto_archiver --config orch.yaml --cli_feeder.urls="https://miles.land", version is 0.4.4): AttributeError: 'HashEnricher' object has no attribute 'algorithm'.

Here is my config:

steps:
  # only 1 feeder allowed
  feeder: cli_feeder
  archivers: # order matters, uncomment to activate
    # - vk_archiver
    # - telethon_archiver
    - telegram_archiver
    - twitter_archiver
    # - twitter_api_archiver
    # - instagram_tbot_archiver
    # - instagram_archiver
    - tiktok_archiver
    - youtubedl_archiver
    # - wayback_archiver_enricher
  enrichers:
    - hash_enricher
    - screenshot_enricher
    - thumbnail_enricher
    # - wayback_archiver_enricher
    - wacz_enricher

  formatter: html_formatter # defaults to mute_formatter
  storages:
    - local_storage
    # - s3_storage
    # - gdrive_storage
  databases:
    - console_db
    - csv_db
    # - gsheet_db
    # - mongo_db

configurations:
  screenshot_enricher:
    width: 1280
    height: 2300
  hash_enricher:
    algorithm: "SHA-256" # can also be SHA-256
  local_storage:
    save_to: "./local_archive"
    save_absolute: true
    filename_generator: static
    path_generator: flat

And here is the full log:

% python3 -m auto_archiver --config orch.yaml --cli_feeder.urls="https://miles.land"
2023-03-14 11:52:44.363 | INFO     | auto_archiver.core.config:parse:108 - FEEDER: cli_feeder
2023-03-14 11:52:44.363 | INFO     | auto_archiver.core.config:parse:109 - ENRICHERS: ['hash_enricher', 'screenshot_enricher', 'thumbnail_enricher', 'wacz_enricher']
2023-03-14 11:52:44.363 | INFO     | auto_archiver.core.config:parse:110 - ARCHIVERS: ['telegram_archiver', 'twitter_archiver', 'tiktok_archiver', 'youtubedl_archiver']
2023-03-14 11:52:44.363 | INFO     | auto_archiver.core.config:parse:111 - DATABASES: ['console_db', 'csv_db']
2023-03-14 11:52:44.363 | INFO     | auto_archiver.core.config:parse:112 - STORAGES: ['local_storage']
2023-03-14 11:52:44.363 | INFO     | auto_archiver.core.config:parse:113 - FORMATTER: html_formatter
2023-03-14 11:52:44.363 | DEBUG    | auto_archiver.feeders.cli_feeder:__iter__:28 - Processing https://miles.land
2023-03-14 11:52:44.364 | DEBUG    | auto_archiver.core.orchestrator:archive:66 - result.rearchivable=True for url='https://miles.land'
2023-03-14 11:52:44.364 | WARNING  | auto_archiver.databases.console_db:started:22 - STARTED Metadata(status='no archiver', _processed_at=datetime.datetime(2023, 3, 14, 18, 52, 44, 364021), metadata={'url': 'https://miles.land', 'folder': 'cli', 'tmp_dir': './tmpjdhyhj5x'}, media=[], rearchivable=True)
2023-03-14 11:52:44.364 | INFO     | auto_archiver.core.orchestrator:archive:87 - Trying archiver telegram_archiver for https://miles.land
2023-03-14 11:52:44.364 | INFO     | auto_archiver.core.orchestrator:archive:87 - Trying archiver twitter_archiver for https://miles.land
2023-03-14 11:52:44.364 | INFO     | auto_archiver.core.orchestrator:archive:87 - Trying archiver tiktok_archiver for https://miles.land
2023-03-14 11:52:44.364 | INFO     | auto_archiver.core.orchestrator:archive:87 - Trying archiver youtubedl_archiver for https://miles.land
[generic] Extracting URL: https://miles.land
[generic] miles: Downloading webpage
WARNING: [generic] Falling back on generic information extractor
[generic] miles: Extracting information
ERROR: Unsupported URL: https://miles.land
2023-03-14 11:52:45.051 | DEBUG    | auto_archiver.archivers.youtubedl_archiver:download:37 - No video - Youtube normal control flow: ERROR: Unsupported URL: https://miles.land
2023-03-14 11:52:45.051 | DEBUG    | auto_archiver.enrichers.hash_enricher:enrich:30 - calculating media hashes for url='https://miles.land' (using SHA-256)
2023-03-14 11:52:45.051 | DEBUG    | auto_archiver.enrichers.screenshot_enricher:enrich:27 - Enriching screenshot for url='https://miles.land'
2023-03-14 11:52:53.272 | DEBUG    | auto_archiver.enrichers.thumbnail_enricher:enrich:23 - generating thumbnails
2023-03-14 11:52:53.273 | DEBUG    | auto_archiver.enrichers.wacz_enricher:enrich:35 - generating WACZ for url='https://miles.land'
2023-03-14 11:52:53.273 | INFO     | auto_archiver.enrichers.wacz_enricher:enrich:61 - Running browsertrix-crawler: docker run --rm -v /Users/miles/Desktop/tmpjdhyhj5x:/crawls/ webrecorder/browsertrix-crawler crawl --url https://miles.land --scopeType page --generateWACZ --text --collection dd5fef44 --behaviors autoscroll,autoplay,autofetch,siteSpecific --behaviorTimeout 90 --timeout 90
{"logLevel":"info","timestamp":"2023-03-14T18:52:53.832Z","context":"general","message":"Page context being used with 1 worker","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:53.833Z","context":"general","message":"Set netIdleWait to 15 seconds","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:53.833Z","context":"general","message":"Seeds","details":[{"url":"https://miles.land/","include":[],"exclude":[],"scopeType":"page","sitemap":false,"allowHash":false,"maxExtraHops":0,"maxDepth":99999}]}
{"logLevel":"info","timestamp":"2023-03-14T18:52:54.094Z","context":"state","message":"Storing state in memory","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:54.416Z","context":"general","message":"Text Extraction: Enabled","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:54.515Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":0,"total":1,"pending":1,"limit":{"max":0,"hit":false},"pendingPages":["{\"url\":\"https://miles.land/\",\"seedId\":0,\"depth\":0,\"started\":\"2023-03-14T18:52:54.448Z\"}"]}}
{"logLevel":"error","timestamp":"2023-03-14T18:52:58.314Z","context":"general","message":"Invalid Seed \"mailto:[email protected]\" - URL must start with http:// or https://","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.338Z","context":"behavior","message":"Behaviors started","details":{"behaviorTimeout":90,"page":"https://miles.land/"}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.339Z","context":"behavior","message":"Run Script Started","details":{"frameUrl":"https://miles.land/","page":"https://miles.land/"}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.340Z","context":"behaviorScript","message":"Behavior log","details":{"state":{"segments":1},"msg":"Skipping autoscroll, page seems to not be responsive to scrolling events","page":"https://miles.land/"}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.341Z","context":"behaviorScript","message":"Behavior log","details":{"state":{"segments":1},"msg":"done!","page":"https://miles.land/"}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.341Z","context":"behavior","message":"Run Script Finished","details":{"frameUrl":"https://miles.land/","page":"https://miles.land/"}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.341Z","context":"behavior","message":"Behaviors finished","details":{"finished":1,"page":"https://miles.land/"}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.341Z","context":"pageStatus","message":"Page finished","details":{"page":"https://miles.land/"}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.391Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":1,"total":1,"pending":0,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.391Z","context":"general","message":"Waiting to ensure pending data is written to WARCs...","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.396Z","context":"general","message":"Generating WACZ","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.398Z","context":"general","message":"Num WARC Files: 8","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.700Z","context":"general","message":"Validating passed pages.jsonl file\nReading and Indexing All WARCs\nWriting archives...\nWriting logs...\nGenerating page index from passed pages...\nHeader detected in the passed pages.jsonl file\nGenerating datapackage.json\nGenerating datapackage-digest.json\n","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.737Z","context":"general","message":"Crawl status: done","details":{}}
2023-03-14 11:52:58.886 | ERROR    | auto_archiver.core.orchestrator:feed_item:44 - Got unexpected error on item Metadata(status='no archiver', _processed_at=datetime.datetime(2023, 3, 14, 18, 52, 44, 364021), metadata={'url': 'https://miles.land', 'folder': 'cli', 'tmp_dir': './tmpjdhyhj5x'}, media=[Media(filename='./tmpjdhyhj5x/screenshot_8df3af2d.png', key=None, urls=[], _mimetype='image/png', properties={'id': 'screenshot'}), Media(filename='/Users/miles/Desktop/tmpjdhyhj5x/collections/dd5fef44/dd5fef44.wacz', key=None, urls=[], _mimetype=None, properties={'id': 'browsertrix'})], rearchivable=True): 'HashEnricher' object has no attribute 'algorithm'
Traceback (most recent call last):
  File "/Users/miles/.asdf/installs/python/3.10.1/lib/python3.10/site-packages/auto_archiver/core/orchestrator.py", line 37, in feed_item
    return self.archive(item)
  File "/Users/miles/.asdf/installs/python/3.10.1/lib/python3.10/site-packages/auto_archiver/core/orchestrator.py", line 110, in archive
    s.store(m, result)  # modifies media
  File "/Users/miles/.asdf/installs/python/3.10.1/lib/python3.10/site-packages/auto_archiver/storages/storage.py", line 46, in store
    self.set_key(media, item)
  File "/Users/miles/.asdf/installs/python/3.10.1/lib/python3.10/site-packages/auto_archiver/storages/storage.py", line 78, in set_key
    he = HashEnricher({"algorithm": "SHA-256", "chunksize": 1.6e7})
  File "/Users/miles/.asdf/installs/python/3.10.1/lib/python3.10/site-packages/auto_archiver/enrichers/hash_enricher.py", line 18, in __init__
    assert self.algorithm in algo_choices, f"Invalid hash algorithm selected, must be one of {algo_choices} (you selected {self.algorithm})."
AttributeError: 'HashEnricher' object has no attribute 'algorithm'

2023-03-14 11:52:58.887 | ERROR    | auto_archiver.databases.console_db:failed:25 - FAILED Metadata(status='no archiver', _processed_at=datetime.datetime(2023, 3, 14, 18, 52, 44, 364021), metadata={'url': 'https://miles.land', 'folder': 'cli', 'tmp_dir': './tmpjdhyhj5x'}, media=[Media(filename='./tmpjdhyhj5x/screenshot_8df3af2d.png', key=None, urls=[], _mimetype='image/png', properties={'id': 'screenshot'}), Media(filename='/Users/miles/Desktop/tmpjdhyhj5x/collections/dd5fef44/dd5fef44.wacz', key=None, urls=[], _mimetype=None, properties={'id': 'browsertrix'})], rearchivable=True)
2023-03-14 11:52:58.887 | SUCCESS  | auto_archiver.feeders.cli_feeder:__iter__:30 - Processed 1 URL(s)

I can try to investigate and submit a PR, but figured I'd open the issue just to have.

Specify hash algorithm in config

I needed to specify SHA3_512 rather than SHA256. I have a PR coming which passes this through.

  # in base_archiver use SHA256 or SHA3_512
  hash_algorithm: SHA3_512
  # hash_algorithm: SHA256
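
For context, the underlying hashing is just hashlib, so passing the algorithm through config boils down to something like this sketch:

import hashlib

ALGORITHMS = {"SHA256": hashlib.sha256, "SHA3_512": hashlib.sha3_512}


def hash_file(path: str, algorithm: str = "SHA3_512", chunksize: int = 16_000_000) -> str:
    """Hash a file with the configured algorithm, reading in chunks to keep memory use flat."""
    h = ALGORITHMS[algorithm]()
    with open(path, "rb") as f:
        while chunk := f.read(chunksize):
            h.update(chunk)
    return h.hexdigest()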
