promyloph / crocoite
Web archiving using Google Chrome
Home Page: https://6xq.net/crocoite/
License: MIT License
It seems that Chrome sometimes sends a request twice and then messes up the order in which it sends events, i.e. requestWillBeSent, requestWillBeSent (same id), loadingFinished. This results in the following crash:
Traceback (most recent call last):
File "[…]/lib/python3.6/site-packages/crocoite/cli.py", line 136, in single
loop.run_until_complete(run)
File "/usr/lib64/python3.6/asyncio/base_events.py", line 484, in run_until_complete
return future.result()
File "[…]/lib/python3.6/site-packages/crocoite/controller.py", line 259, in run
handle.result ()
File "[…]/lib/python3.6/site-packages/crocoite/controller.py", line 198, in processQueue
async for item in l:
File "[…]/lib/python3.6/site-packages/crocoite/browser.py", line 348, in __aiter__
result = t.result ()
File "[…]/lib/python3.6/site-packages/crocoite/browser.py", line 448, in _loadingFinished
item.fromLoadingFinished (kwargs)
File "[…]/lib/python3.6/site-packages/crocoite/browser.py", line 213, in fromLoadingFinished
self.response.bytesReceived = data['encodedDataLength']
AttributeError: 'NoneType' object has no attribute 'bytesReceived'
This happens frequently on Twitter when two iframes load the same resource.
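One way to tolerate the reordered events is to make the loadingFinished handler defensive when no response has been recorded yet. The sketch below is hypothetical: the class and method names mirror the traceback above, not crocoite's actual code.

```python
from types import SimpleNamespace

class Item:
    """Hypothetical sketch of an out-of-order-tolerant request item."""

    def __init__(self):
        self.response = None  # normally set by a prior responseReceived event

    def fromResponseReceived(self, kwargs):
        self.response = SimpleNamespace(**kwargs['response'])

    def fromLoadingFinished(self, kwargs):
        if self.response is None:
            # loadingFinished arrived without a prior response event
            # (duplicate request id, reordered events): drop it instead
            # of crashing with AttributeError
            return
        self.response.bytesReceived = kwargs['encodedDataLength']
```

Whether dropping the event or merging it into the duplicate request is the right policy depends on which of the two requests ends up in the WARC.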
Both <picture> and <img> support resolution-based image loading. We’d like to fetch all images instead of just the one Google Chrome picked for us based on the current resolution. Right now there’s the EmulateScreenMetrics behavior script. Make sure it works and fetches every single image.
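To fetch every candidate rather than relying on screen-metrics emulation, one would need the full list of URLs from each srcset attribute. A naive parser is sketched below; it is an assumption-laden simplification (it ignores that srcset URLs may themselves contain commas) and not part of crocoite.

```python
def srcset_candidates(srcset):
    """Naive srcset parser: return the candidate URLs of a srcset attribute.

    Assumes candidate URLs contain no commas; the real srcset grammar is
    more permissive.
    """
    urls = []
    for candidate in srcset.split(','):
        fields = candidate.strip().split()
        if fields:
            urls.append(fields[0])  # first field is the URL, the rest are descriptors
    return urls
```

For example, `srcset_candidates("small.jpg 1x, big.jpg 2x")` yields `['small.jpg', 'big.jpg']`, i.e. both resolutions could be queued for fetching.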
Dear developers, it seems to me that the binary google-chrome-stable is hardcoded in crocoite, and I believe that binary does not exist on macOS. I installed the Homebrew package chrome-cli and symlinked google-chrome-stable to chrome-cli; however, crocoite still won’t run on my Mac ("Exception: Chrome died on us.\n").
Will crocoite run at all on a Mac, or is it Linux-only?
Thanks.
Some sites, like Twitter, use scrollable <div> elements, which the screenshot behavior script cannot handle right now. Chrome’s Page.captureScreenshot operates on the page’s surface, which is not expanded in this scenario, resulting in truncated screenshots like this one:
On the other hand there are sites with a floating banner at the top or bottom. If we’d scroll these and capture just the viewport (not the surface), stitching images would be impossible (or at least ugly).
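Page.captureScreenshot does accept a clip rectangle, so one option is to capture the surface in horizontal slices instead of scrolling the viewport, which sidesteps the floating-banner stitching problem (though not the scrollable-<div> one). The tile computation can be sketched independently of the protocol plumbing; the function below is illustrative only.

```python
def screenshot_tiles(content_height, tile_height, width):
    """Compute clip rectangles for Page.captureScreenshot's `clip` parameter.

    Capturing the surface slice by slice avoids scrolling the viewport, so
    floating banners are captured once in place. Scrollable inner <div>s
    still need separate handling.
    """
    tiles = []
    y = 0
    while y < content_height:
        h = min(tile_height, content_height - y)
        tiles.append({'x': 0, 'y': y, 'width': width, 'height': h, 'scale': 1})
        y += h
    return tiles
```

For a 2500 px tall surface and 1000 px tiles this yields three clips (heights 1000, 1000, 500) that stitch without overlap.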
We should handle these errors, even though they should not happen (Chrome exits before we attempt to rmtree):
Traceback (most recent call last):
File "/crocoite/crocoite/cli.py", line 102, in single
loop.run_until_complete(run)
File "/usr/lib64/python3.6/asyncio/base_events.py", line 468, in run_until_complete
return future.result()
File "/crocoite/crocoite/controller.py", line 234, in run
handle.cancel ()
File "/crocoite/crocoite/devtools.py", line 326, in __aexit__
shutil.rmtree (self.userDataDir)
File "/crocoite-sandbox/lib64/python3.6/shutil.py", line 480, in rmtree
_rmtree_safe_fd(fd, path, onerror)
File "/crocoite-sandbox/lib64/python3.6/shutil.py", line 422, in _rmtree_safe_fd
onerror(os.rmdir, fullname, sys.exc_info())
File "/crocoite-sandbox/lib64/python3.6/shutil.py", line 420, in _rmtree_safe_fd
os.rmdir(name, dir_fd=topfd)
OSError: [Errno 39] Directory not empty: 'Default'
This URL will never change to the idle state and thus never time out due to idling:
https://www.youtube.com/channel/UCYIrN60b1wvM27mOmGje89A/videos?disable_polymer=1
Preliminary investigation indicates that idle VarChangeEvent is properly fired and propagated to other threads, but not to idleProc. Thus the controller never returns from asyncio.wait and thus never processes the event. Very strange.
Turns out asyncio.ensure_future will not actually run a task, but merely schedule them, so idle.wait() might be called long after we created the future, which means it might miss an idle event.
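The race disappears with a latching primitive such as asyncio.Event: its wait() returns immediately even if set() ran before the scheduled waiter got its first chance to execute. The demo below illustrates the point; it is a minimal sketch, not crocoite's idle handling.

```python
import asyncio

async def demo():
    ev = asyncio.Event()
    # ensure_future only schedules the coroutine; ev.wait() has not run yet
    waiter = asyncio.ensure_future(ev.wait())
    ev.set()  # Event latches its state, so the late-starting waiter still completes
    await asyncio.wait_for(waiter, timeout=1)
    return True
```

A non-latching notification (a bare condition-variable signal, for example) would lose the wakeup in exactly this situation, which matches the missed idle event described above.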
It’s unlikely we’ll ever have a method to replay these, but it might be good to capture them nonetheless. See curl’s options -b and -c; the Set-Cookie format should be fine.
In some websites that embed videos, the video is loaded using HTTP Live Streaming (HLS).
The process from the point of view of the website creator is quite simple: they create an M3U8 playlist that describes the different segments of the video and use the playlist as the source of the HTML video element. The first step is usually done automatically by the encoder; FFmpeg has support for this.
There are some characteristics to take into account, such as whether the M3U8 playlist describes a live stream or VOD, and the different quality levels available for adaptive streaming.
As a minimum and by default, downloading the highest quality should be an acceptable implementation.
A better approach would be to offer a choice: download all qualities, the highest only, or skip the media referenced by the M3U8 playlist.
This would bring support for all websites that have content distributed using HLS, Twitter posts with videos are an example.
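Extracting the variant streams from a master playlist is straightforward, since RFC 8216 places each variant's URI on the first non-comment line after its #EXT-X-STREAM-INF tag. The function below is a minimal sketch; picking the highest quality would additionally require parsing the BANDWIDTH/RESOLUTION attributes.

```python
def variant_uris(master_playlist):
    """Extract variant stream URIs from an HLS master playlist.

    Minimal sketch: each #EXT-X-STREAM-INF tag is followed by the variant's
    URI on the next non-comment line (RFC 8216).
    """
    uris = []
    expect_uri = False
    for line in master_playlist.splitlines():
        line = line.strip()
        if line.startswith('#EXT-X-STREAM-INF'):
            expect_uri = True
        elif expect_uri and line and not line.startswith('#'):
            uris.append(line)
            expect_uri = False
    return uris
```

Each returned URI is itself a media playlist listing the actual segments, which would then be fetched and written to the WARC like any other resource.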
Add a command/mode that reads an external URL list and passes each URL to -recursive, like archivebot’s !a <
The API is not exactly pretty and it’s easy to mess things up. There are no plausibility checks and no validation. We want:
click.js should be unit-testable, so we can easily identify breaking layout changes.
sites object to the Python world
Matching CSS selectors is becoming increasingly difficult, as big sites usually obfuscate their CSS class names or use random names. Matching text could provide a way out. Additionally, matching text could be used as a heuristic: look for “load more”; if it’s clickable and at the end of a container (say, a <div>), click it, and revert (i.e. reject the navigation request) if necessary.
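The text heuristic itself can be a trivial pattern match. The phrase list below is an assumption ("load more" and "Show more replies" appear elsewhere in this tracker, the rest is guesswork), so treat it as a starting point rather than a tuned classifier.

```python
import re

# Heuristic sketch: phrases that commonly label pagination buttons.
# The word list is an assumption, not taken from crocoite.
_MORE_RE = re.compile(r'\b(load|show|view)\s+more\b', re.IGNORECASE)

def looks_like_load_more(text):
    """Return True if an element's text suggests a 'load more' button."""
    return _MORE_RE.search(text) is not None
```

Combined with the positional check (clickable, at the end of its container), this would let click.js target buttons without depending on obfuscated class names.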
Support matching hosted software like Disqus. It can run on any domain, so whitelisting them is not an option.
It should be possible to add ignored URL patterns (regex) to recursive crawls. They should be updateable (add/remove patterns) while the job is running.
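A mutable pattern set is enough to sketch the core of this feature; the class below is illustrative (thread safety, persistence, and wiring into the crawl loop are out of scope, and the name is made up).

```python
import re

class IgnorePatterns:
    """Mutable set of regex URL filters; patterns can be added or removed
    while a job is running. Hypothetical sketch, not crocoite code."""

    def __init__(self):
        self._patterns = {}

    def add(self, pattern):
        self._patterns[pattern] = re.compile(pattern)

    def remove(self, pattern):
        self._patterns.pop(pattern, None)

    def ignored(self, url):
        return any(p.search(url) for p in self._patterns.values())
```

The crawler would consult ignored() just before enqueueing a discovered link, so pattern changes take effect for all not-yet-queued URLs.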
Tweets can now apparently have a button that reads "Show more replies". I've never seen that before yesterday.
For example, scrolling to the end on this Tweet:
The button can be matched using button.ThreadedConversation-showMoreThreadsButton.
Instead of logging some information here and there we should dump (selected) internal state, so the dashboard can recover the current state more easily. Right now it just replays the last n messages on startup, but “forgets” everything before that.
Old sites (before history.pushState was invented) use this for navigation. Decide whether this is recursion (probably yes) and how to handle it.
Examples:
Add a command-line option that allows a) replacing the default click settings (click.yaml) and b) adding more of them at runtime.
--click-data=click.yaml
--click-match="^example\.com"
--click-selector="div.foo span.bar"
Use a library for all the URL-related stuff. yarl is used by aiohttp and looks reasonable.
Otherwise the whole grab will fail.
Traceback (most recent call last):
File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/yarl/__init__.py", line 659, in _encode_host
ip = ip_address(ip)
File "/usr/lib64/python3.6/ipaddress.py", line 54, in ip_address
address)
ValueError: 'neue_preise_f' does not appear to be an IPv4 or IPv6 address
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/yarl/__init__.py", line 662, in _encode_host
host = idna.encode(host, uts46=True).decode("ascii")
File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/idna/core.py", line 358, in encode
s = alabel(label)
File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/idna/core.py", line 270, in alabel
ulabel(label)
File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/idna/core.py", line 304, in ulabel
check_label(label)
File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/idna/core.py", line 261, in check_label
raise InvalidCodepoint('Codepoint {0} at position {1} of {2} not allowed'.format(_unot(cp_value), pos+1, repr(label)))
idna.core.InvalidCodepoint: Codepoint U+005F at position 5 of 'neue_preise_f%c3%bcr_zahnimplantate_k%c3%b6nnten_sie_%c3%bcberraschen' not allowed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/data/home/chromebot/crocoite-sandbox/lib64/python3.6/encodings/idna.py", line 167, in encode
raise UnicodeError("label too long")
UnicodeError: label too long
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/data/home/chromebot/crocoite/crocoite/cli.py", line 102, in single
loop.run_until_complete(run)
File "/usr/lib64/python3.6/asyncio/base_events.py", line 468, in run_until_complete
return future.result()
File "/data/home/chromebot/crocoite/crocoite/controller.py", line 223, in run
async for item in b.onfinish ():
File "/data/home/chromebot/crocoite/crocoite/behavior.py", line 351, in onfinish
yield ExtractLinksEvent (list (set (map (URL, result['result']['value']))))
File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/yarl/__init__.py", line 168, in __new__
val.username, val.password, host, port, encode=True
File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/yarl/__init__.py", line 676, in _make_netloc
ret = cls._encode_host(host)
File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/yarl/__init__.py", line 664, in _encode_host
host = host.encode("idna").decode("ascii")
UnicodeError: encoding with 'idna' codec failed (UnicodeError: label too long)
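A later traceback in this tracker shows a mapOrIgnore helper in place of the plain map seen here, which points at the fix: skip URLs that yarl cannot encode instead of failing the whole grab. A sketch matching that name follows; the actual crocoite implementation may differ.

```python
def mapOrIgnore(func, iterable):
    """Apply func to each item, silently skipping items that raise.

    Sketch only: e.g. yarl.URL raising UnicodeError on a host label that
    is too long or contains disallowed codepoints would drop that one link
    rather than aborting the crawl.
    """
    for item in iterable:
        try:
            yield func(item)
        except Exception:
            pass
```

Dropping such links loses them from recursion, so logging the skipped values would be advisable in a real implementation.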
Sometimes crocoite-single simply gets stuck for no apparent reason. In theory it should still time out, but it does not. Trace using [*map(asyncio.Task.print_stack, asyncio.Task.all_tasks())]: trace.txt, trace2.txt
Nothing in the WARC specs, but see https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Type instead.
Errata required.
Bugs happen; no amount of testing can avoid that completely. And if we mess up, we need a way to tell which WARCs produced by crocoite (and its tools) are affected by a bug and, possibly, fix them. For that we need:
Running extract-links.js can fail, resulting in this traceback:
Traceback (most recent call last):
File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/crocoite/cli.py", line 136, in single
loop.run_until_complete(run)
File "/usr/lib64/python3.6/asyncio/base_events.py", line 468, in run_until_complete
return future.result()
File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/crocoite/controller.py", line 279, in run
await behavior.finish ()
File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/crocoite/controller.py", line 159, in finish
await self._runon ('finish')
File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/crocoite/controller.py", line 165, in _runon
async for item in f ():
File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/crocoite/behavior.py", line 416, in onfinish
yield ExtractLinksEvent (list (set (mapOrIgnore (URL, result['result']['value']))))
KeyError: 'value'
Apparently a site reload or redirect to a downloadable resource triggers this issue. The current browsing frame is cleared (i.e. injected scripts are removed), but no frameNavigated event is sent (because the navigation fails), and thus the browser frame just stays empty.
When a site reloads itself (<meta> tag, window.location change) behavior scripts must be injected again.