promyloph / crocoite
Web archiving using Google Chrome
Home Page: https://6xq.net/crocoite/
License: MIT License
It seems that Chrome sometimes sends a request twice and then messes up the order in which it sends events, i.e. requestWillBeSent, requestWillBeSent (same id), loadingFinished. This results in the following crash:
Traceback (most recent call last):
File "[…]/lib/python3.6/site-packages/crocoite/cli.py", line 136, in single
loop.run_until_complete(run)
File "/usr/lib64/python3.6/asyncio/base_events.py", line 484, in run_until_complete
return future.result()
File "[…]/lib/python3.6/site-packages/crocoite/controller.py", line 259, in run
handle.result ()
File "[…]/lib/python3.6/site-packages/crocoite/controller.py", line 198, in processQueue
async for item in l:
File "[…]/lib/python3.6/site-packages/crocoite/browser.py", line 348, in __aiter__
result = t.result ()
File "[…]/lib/python3.6/site-packages/crocoite/browser.py", line 448, in _loadingFinished
item.fromLoadingFinished (kwargs)
File "[…]/lib/python3.6/site-packages/crocoite/browser.py", line 213, in fromLoadingFinished
self.response.bytesReceived = data['encodedDataLength']
AttributeError: 'NoneType' object has no attribute 'bytesReceived'
This happens frequently on Twitter when two iframes load the same resource.
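One way to tolerate the reordered events is to make the loadingFinished handler defensive when no response has been recorded yet. The sketch below is hypothetical: the class and method names mirror the traceback above, not crocoite's actual code.

```python
from types import SimpleNamespace

class Item:
    """Hypothetical sketch of an out-of-order-tolerant request item."""

    def __init__(self):
        self.response = None  # normally set by a prior responseReceived event

    def fromResponseReceived(self, kwargs):
        self.response = SimpleNamespace(**kwargs['response'])

    def fromLoadingFinished(self, kwargs):
        if self.response is None:
            # loadingFinished arrived without a prior response event
            # (duplicate request id, reordered events): drop it instead
            # of crashing with AttributeError
            return
        self.response.bytesReceived = kwargs['encodedDataLength']
```

Whether dropping the event or merging it into the duplicate request is the right policy depends on which of the two requests ends up in the WARC.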
Both <picture> and <img> support resolution-based image loading. We’d like to fetch all images instead of just the one Google Chrome picked for us based on the current resolution. Right now there’s the EmulateScreenMetrics behavior script. Make sure it works and fetches every single image.
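To fetch every candidate rather than relying on screen-metrics emulation, one would need the full list of URLs from each srcset attribute. A naive parser is sketched below; it is an assumption-laden simplification (it ignores that srcset URLs may themselves contain commas) and not part of crocoite.

```python
def srcset_candidates(srcset):
    """Naive srcset parser: return the candidate URLs of a srcset attribute.

    Assumes candidate URLs contain no commas; the real srcset grammar is
    more permissive.
    """
    urls = []
    for candidate in srcset.split(','):
        fields = candidate.strip().split()
        if fields:
            urls.append(fields[0])  # first field is the URL, the rest are descriptors
    return urls
```

For example, `srcset_candidates("small.jpg 1x, big.jpg 2x")` yields `['small.jpg', 'big.jpg']`, i.e. both resolutions could be queued for fetching.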
Dear developers, it seems to me that the binary google-chrome-stable is hardcoded in crocoite, and I believe that binary does not exist on macOS. I installed the Homebrew package chrome-cli and symlinked google-chrome-stable to chrome-cli; however, crocoite still won’t run on my Mac ("Exception: Chrome died on us.\n").
Will crocoite run at all on a Mac, or is it Linux-only?
Thanks.
Some sites, like Twitter, use scrollable <div> elements, which the screenshot behavior script cannot handle right now. Chrome’s Page.captureScreenshot operates on the page’s surface, which is not expanded in this scenario, resulting in truncated screenshots like this one:
On the other hand there are sites with a floating banner at the top or bottom. If we’d scroll these and capture just the viewport (not the surface), stitching images would be impossible (or at least ugly).
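Page.captureScreenshot does accept a clip rectangle, so one option is to capture the surface in horizontal slices instead of scrolling the viewport, which sidesteps the floating-banner stitching problem (though not the scrollable-<div> one). The tile computation can be sketched independently of the protocol plumbing; the function below is illustrative only.

```python
def screenshot_tiles(content_height, tile_height, width):
    """Compute clip rectangles for Page.captureScreenshot's `clip` parameter.

    Capturing the surface slice by slice avoids scrolling the viewport, so
    floating banners are captured once in place. Scrollable inner <div>s
    still need separate handling.
    """
    tiles = []
    y = 0
    while y < content_height:
        h = min(tile_height, content_height - y)
        tiles.append({'x': 0, 'y': y, 'width': width, 'height': h, 'scale': 1})
        y += h
    return tiles
```

For a 2500 px tall surface and 1000 px tiles this yields three clips (heights 1000, 1000, 500) that stitch without overlap.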
We should handle these errors, even though they should not happen (Chrome exits before we attempt to rmtree):
Traceback (most recent call last):
File "/crocoite/crocoite/cli.py", line 102, in single
loop.run_until_complete(run)
File "/usr/lib64/python3.6/asyncio/base_events.py", line 468, in run_until_complete
return future.result()
File "/crocoite/crocoite/controller.py", line 234, in run
handle.cancel ()
File "/crocoite/crocoite/devtools.py", line 326, in __aexit__
shutil.rmtree (self.userDataDir)
File "/crocoite-sandbox/lib64/python3.6/shutil.py", line 480, in rmtree
_rmtree_safe_fd(fd, path, onerror)
File "/crocoite-sandbox/lib64/python3.6/shutil.py", line 422, in _rmtree_safe_fd
onerror(os.rmdir, fullname, sys.exc_info())
File "/crocoite-sandbox/lib64/python3.6/shutil.py", line 420, in _rmtree_safe_fd
os.rmdir(name, dir_fd=topfd)
OSError: [Errno 39] Directory not empty: 'Default'
This URL will never change to the idle state and thus never time out due to idling:
https://www.youtube.com/channel/UCYIrN60b1wvM27mOmGje89A/videos?disable_polymer=1
Preliminary investigation indicates that idle VarChangeEvent is properly fired and propagated to other threads, but not to idleProc. Thus the controller never returns from asyncio.wait and thus never processes the event. Very strange.
Turns out asyncio.ensure_future will not actually run a task, but merely schedule them, so idle.wait() might be called long after we created the future, which means it might miss an idle event.
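The race disappears with a latching primitive such as asyncio.Event: its wait() returns immediately even if set() ran before the scheduled waiter got its first chance to execute. The demo below illustrates the point; it is a minimal sketch, not crocoite's idle handling.

```python
import asyncio

async def demo():
    ev = asyncio.Event()
    # ensure_future only schedules the coroutine; ev.wait() has not run yet
    waiter = asyncio.ensure_future(ev.wait())
    ev.set()  # Event latches its state, so the late-starting waiter still completes
    await asyncio.wait_for(waiter, timeout=1)
    return True
```

A non-latching notification (a bare condition-variable signal, for example) would lose the wakeup in exactly this situation, which matches the missed idle event described above.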
It’s unlikely we’ll ever have a method to replay these, but it might be good to capture them nonetheless. See curl’s options -b and -c; the Set-Cookie format should be fine.
In some websites that embed videos, the video is loaded using HTTP Live Streaming (HLS).
The process from the point of view of the website creator is quite simple: they create an M3U8 playlist that describes the different segments of the video and use the playlist as the source of the HTML video element. The first step is usually done automatically by the encoder; FFmpeg has support for this.
There are some characteristics to take into account, such as whether the M3U8 playlist describes a live stream or VOD, and the different quality levels available for adaptive streaming.
As a minimum and by default, downloading the highest quality should be an acceptable implementation.
A better approach would be to offer a choice: download all qualities, the highest only, or skip the media referenced by the M3U8 playlist.
This would bring support for all websites that have content distributed using HLS, Twitter posts with videos are an example.
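Extracting the variant streams from a master playlist is straightforward, since RFC 8216 places each variant's URI on the first non-comment line after its #EXT-X-STREAM-INF tag. The function below is a minimal sketch; picking the highest quality would additionally require parsing the BANDWIDTH/RESOLUTION attributes.

```python
def variant_uris(master_playlist):
    """Extract variant stream URIs from an HLS master playlist.

    Minimal sketch: each #EXT-X-STREAM-INF tag is followed by the variant's
    URI on the next non-comment line (RFC 8216).
    """
    uris = []
    expect_uri = False
    for line in master_playlist.splitlines():
        line = line.strip()
        if line.startswith('#EXT-X-STREAM-INF'):
            expect_uri = True
        elif expect_uri and line and not line.startswith('#'):
            uris.append(line)
            expect_uri = False
    return uris
```

Each returned URI is itself a media playlist listing the actual segments, which would then be fetched and written to the WARC like any other resource.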
Add a command/mode that reads an external URL list and passes each URL to -recursive, like archivebot’s !a <
The API is not exactly pretty and it’s easy to mess things up. There are no plausibility checks and no validation. We want:
click.js should be unit-testable, so we can easily identify breaking layout changes.
sites object to the Python world
Matching CSS selectors is becoming increasingly difficult, as big sites usually obfuscate their CSS class names or use random names. Matching text could provide a way out. Additionally, matching text could be used as a heuristic: look for “load more”; if it’s clickable and at the end of a container (say, a <div>), click it, and revert (i.e. reject the navigation request) if necessary.
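The text heuristic itself can be a trivial pattern match. The phrase list below is an assumption ("load more" and "Show more replies" appear elsewhere in this tracker, the rest is guesswork), so treat it as a starting point rather than a tuned classifier.

```python
import re

# Heuristic sketch: phrases that commonly label pagination buttons.
# The word list is an assumption, not taken from crocoite.
_MORE_RE = re.compile(r'\b(load|show|view)\s+more\b', re.IGNORECASE)

def looks_like_load_more(text):
    """Return True if an element's text suggests a 'load more' button."""
    return _MORE_RE.search(text) is not None
```

Combined with the positional check (clickable, at the end of its container), this would let click.js target buttons without depending on obfuscated class names.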
Support matching hosted software like Disqus. It can run on any domain, so whitelisting them is not an option.
It should be possible to add ignored URL patterns (regex) to recursive crawls. They should be updateable (add/remove patterns) while the job is running.
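A mutable pattern set is enough to sketch the core of this feature; the class below is illustrative (thread safety, persistence, and wiring into the crawl loop are out of scope, and the name is made up).

```python
import re

class IgnorePatterns:
    """Mutable set of regex URL filters; patterns can be added or removed
    while a job is running. Hypothetical sketch, not crocoite code."""

    def __init__(self):
        self._patterns = {}

    def add(self, pattern):
        self._patterns[pattern] = re.compile(pattern)

    def remove(self, pattern):
        self._patterns.pop(pattern, None)

    def ignored(self, url):
        return any(p.search(url) for p in self._patterns.values())
```

The crawler would consult ignored() just before enqueueing a discovered link, so pattern changes take effect for all not-yet-queued URLs.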
Tweets can now apparently have a button that reads "Show more replies". I've never seen that before yesterday.
For example, scrolling to the end on this Tweet:
The button can be matched using button.ThreadedConversation-showMoreThreadsButton.
Instead of logging some information here and there we should dump (selected) internal state, so the dashboard can recover the current state more easily. Right now it just replays the last n messages on startup, but “forgets” everything before that.
Old sites (before history.pushState was invented) use this for navigation. Decide whether this is recursion (probably yes) and how to handle it.
Examples:
Add a command-line option that allows a) replacing the default click settings (click.yaml) and b) adding more of them at runtime.
--click-data=click.yaml
--click-match="^example\.com"
--click-selector="div.foo span.bar"
Use a library for all the URL-related stuff. yarl is used by aiohttp and looks reasonable.
Otherwise the whole grab will fail.
Traceback (most recent call last):
File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/yarl/__init__.py", line 659, in _encode_host
ip = ip_address(ip)
File "/usr/lib64/python3.6/ipaddress.py", line 54, in ip_address
address)
ValueError: 'neue_preise_f' does not appear to be an IPv4 or IPv6 address
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/yarl/__init__.py", line 662, in _encode_host
host = idna.encode(host, uts46=True).decode("ascii")
File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/idna/core.py", line 358, in encode
s = alabel(label)
File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/idna/core.py", line 270, in alabel
ulabel(label)
File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/idna/core.py", line 304, in ulabel
check_label(label)
File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/idna/core.py", line 261, in check_label
raise InvalidCodepoint('Codepoint {0} at position {1} of {2} not allowed'.format(_unot(cp_value), pos+1, repr(label)))
idna.core.InvalidCodepoint: Codepoint U+005F at position 5 of 'neue_preise_f%c3%bcr_zahnimplantate_k%c3%b6nnten_sie_%c3%bcberraschen' not allowed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/data/home/chromebot/crocoite-sandbox/lib64/python3.6/encodings/idna.py", line 167, in encode
raise UnicodeError("label too long")
UnicodeError: label too long
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/data/home/chromebot/crocoite/crocoite/cli.py", line 102, in single
loop.run_until_complete(run)
File "/usr/lib64/python3.6/asyncio/base_events.py", line 468, in run_until_complete
return future.result()
File "/data/home/chromebot/crocoite/crocoite/controller.py", line 223, in run
async for item in b.onfinish ():
File "/data/home/chromebot/crocoite/crocoite/behavior.py", line 351, in onfinish
yield ExtractLinksEvent (list (set (map (URL, result['result']['value']))))
File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/yarl/__init__.py", line 168, in __new__
val.username, val.password, host, port, encode=True
File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/yarl/__init__.py", line 676, in _make_netloc
ret = cls._encode_host(host)
File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/yarl/__init__.py", line 664, in _encode_host
host = host.encode("idna").decode("ascii")
UnicodeError: encoding with 'idna' codec failed (UnicodeError: label too long)
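A later traceback in this tracker shows a mapOrIgnore helper in place of the plain map seen here, which points at the fix: skip URLs that yarl cannot encode instead of failing the whole grab. A sketch matching that name follows; the actual crocoite implementation may differ.

```python
def mapOrIgnore(func, iterable):
    """Apply func to each item, silently skipping items that raise.

    Sketch only: e.g. yarl.URL raising UnicodeError on a host label that
    is too long or contains disallowed codepoints would drop that one link
    rather than aborting the crawl.
    """
    for item in iterable:
        try:
            yield func(item)
        except Exception:
            pass
```

Dropping such links loses them from recursion, so logging the skipped values would be advisable in a real implementation.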
Sometimes crocoite-single simply gets stuck for no apparent reason. In theory it should still time out, but it does not. Trace using [*map(asyncio.Task.print_stack, asyncio.Task.all_tasks())]: trace.txt, trace2.txt
Nothing in the WARC specs, but see https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Type instead.
Errata required.
Bugs happen; no amount of testing can avoid that completely. And if we mess up, we need a way to tell which WARCs produced by crocoite (and its tools) are affected by a bug and, possibly, fix them. For that we need:
Running extract-links.js can fail, resulting in this traceback:
Traceback (most recent call last):
File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/crocoite/cli.py", line 136, in single
loop.run_until_complete(run)
File "/usr/lib64/python3.6/asyncio/base_events.py", line 468, in run_until_complete
return future.result()
File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/crocoite/controller.py", line 279, in run
await behavior.finish ()
File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/crocoite/controller.py", line 159, in finish
await self._runon ('finish')
File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/crocoite/controller.py", line 165, in _runon
async for item in f ():
File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/crocoite/behavior.py", line 416, in onfinish
yield ExtractLinksEvent (list (set (mapOrIgnore (URL, result['result']['value']))))
KeyError: 'value'
Apparently a site reload or redirect to a downloadable resource triggers this issue. The current browsing frame is cleared (i.e. injected scripts are removed), but no frameNavigated event is sent (because the navigation fails), and thus the browser frame just stays empty.
When a site reloads itself (<meta> tag, window.location change) behavior scripts must be injected again.