That makes more sense. I'll redo my RSS parser pull request to not muck around with this area, and work on adding test cases for how `--parser=auto` should work (and make it work that way).

Still thinking about the complicated interactions of `--parser`, `--depth`, input from stdin, and URLs vs. files on the command line. Right now I'm thinking that it may make sense to add some more explicit commands that don't try to be so clever, and then `add` can be as messy as it wants without cluttering up the underlying implementation as much.
from archivebox.
(Thanks for the sponsorship, it was very unexpected!)
This is a long comment, and gets longer every time I try to make it shorter. The tl;dr is that the way archivebox tries to do things now leaves ambiguity, and I fear building a tower of workarounds that is going to implode. You can see that in this:

```shell
$ echo '~/Downloads/some/rss/feed.xml' | archivebox add --depth=0 --parser=rss
$ cat '~/Downloads/some/rss/feed.xml' | archivebox add --depth=0 --parser=rss
```

Either stdin is parsed as an RSS file, or the filenames read from stdin are parsed as RSS; I'm not sure you can really make a reliable determination without doing some hacky and dangerous things.¹

I pushed test cases² that implement all of the scenarios you outlined in this comment, and with the current dev branch, only 4 of the 19 pass. After fixing `parser == 'auto'` to favor the first successful parser, that only improves to 9 of 19. Making the change to interpret `--parser` as I outline below gets us to 15 of 19. The four that still fail are due to the ambiguous behavior above.³ (I haven't committed that change because I'm not happy with it yet.)
Here's how I am thinking about it:

We start with an empty list of URLs to archive.⁴

We get a list of zero or more filenames or URLs on the command line. That's `args = Optional[list[str]]`.

If `not args`, we read stdin and run it through the parser. It could be RSS, JSON, or just a blob of text with URLs that the `generic_txt` parser will figure out when `parser = "auto"`, and so we add those to our list of URLs to archive.⁵
Now, `for arg in args`, we have to figure out what each argument is.

If `arg` is a file, we don't have a URL, so it can't be added to our list of URLs; instead we parse the contents and add the URLs contained within to our list of URLs to archive.

If `arg` looks like a URL, now we have to figure out the user's intent. It could either be a resource that the user wanted to archive, or something like an RSS file where they want us to parse it and archive all the URLs mentioned.

```shell
$ archivebox add https://example.com/feed.rss https://example.com/content.html
```

The intent is "obvious" for an RSS feed, but ambiguous for an HTML file, because maybe it's an export file of bookmarks.

I don't think we want to always download and parse the resource, because with `--parser=auto` we'll probably always find something in any URL we're given, but for most URLs the user probably really intends to archive that specific resource. (This also would conflict with `--index-only`, which I think shows an intent to not download URLs absent other signals.)

That means that `archivebox add https://example.com/index.rss` will archive `https://example.com/index.rss` and not parse the RSS file (and archive the URLs it finds).

However, I think when the user has specified a parser (`parser != "auto"`), we can use that as a signal that we really want to run every URL in `args` through the parser. So when `--parser=whatever` is specified, we don't archive each URL; we download it and run it through the parser.⁷
Rough code outline without error checking or lots of other option handling:

```python
urls_to_archive = []
if not args:
    urls_to_archive += parse_file(stdin, parser)
for arg in args:
    if parser == "auto" and re.findall(URL_REGEX, arg):
        urls_to_archive += [arg]
    else:
        urls_to_archive += parse_url_or_filename(arg, parser)
for url in urls_to_archive:
    queue_for_archive(url)
```
Note that `--depth` isn't part of this yet. To add that, thinking about it recursively makes more sense to me. Let's wrap all of the above in a function called `add(args: List[str], stdin: Optional[IO] = None, depth: int = 0, parser: str = "auto")`.

Now we can turn our last loop into:

```python
for url in urls_to_archive:
    if already_queued(url):
        continue  # skip URLs we've already queued
    queue_for_archive(url)
    if depth > 0:
        add(args=parse_url_or_filename(url), depth=depth - 1, parser="auto")
```

We can also use `depth` as an additional signal of intent from the user on what to do with arguments they originally gave on the command line, so we can expand that check to `re.findall(URL_REGEX, arg) and (parser != "auto" or depth > 0)`.
This can support any arbitrary depth, but we'll want to add some options to control two parts of that: filtering URLs (so we can restrict the crawl to a specific site, for example) and deciding which leaves of the tree we're walking we actually archive. Back to the RSS example: when we do `archivebox add --depth=1 https://example.org/test.rss` we don't want to archive the RSS page, but when we do `archivebox add --depth=1 https://example.org/page1.html` we probably do. Pushing things out further, when we do `archivebox add --depth=2 https://example.org/subscriptions.opml` (where that's an OPML file of RSS subscriptions), we are probably okay with just archiving the `depth = 0` links, but if we were to do `archivebox add --depth=2 https://example.org/` we probably want to archive every page, so maybe we need an option something like `--archive-depth=N`.
I haven't really captured the problematic interplay of `--depth` and `--parser` here either, nor how it works with stdin. Under this sketch, if you run `echo -e "https://example.com/1.json\nhttps://example.com/2.json" | archivebox add --parser=json`, it will fall over because stdin is not in JSON format. I think that `parser` wants to be a stack. With that, I could do something like `archivebox add --depth=2 --parser=opml,rss https://example.com/feeds.opml` to archive every link in each RSS file listed in the OPML file. Or `echo -e "https://example.com/1.json\nhttps://example.com/2.json" | archivebox add --parser=url_list,json`.
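A rough sketch of how the parser stack could thread through the hops (purely illustrative; `next_parser` is a hypothetical helper, not an existing archivebox function): each hop consumes the next parser from the list and falls back to `auto` once the stack is exhausted.

```python
from typing import List, Tuple

def next_parser(parser_stack: List[str]) -> Tuple[str, List[str]]:
    """Pop the parser for the current hop; fall back to 'auto' when exhausted."""
    if parser_stack:
        return parser_stack[0], parser_stack[1:]
    return "auto", []

# e.g. --parser=opml,rss --depth=2:
stack = "opml,rss".split(",")
hop0, stack = next_parser(stack)   # parse the OPML file itself
hop1, stack = next_parser(stack)   # parse each RSS feed it lists
hop2, stack = next_parser(stack)   # anything deeper falls back to auto
```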
Footnotes

1. The `save_text_as_source` function is part of the culprit here. As it works now, it gets stdin as a string, splits it into lines, reads the contents of any line that has a path that exists, and then writes back out a file that is stdin with every file it read appended. This explains that hack in the JSON parser. But this is just madness and means multiple files on the command line or read via stdin just can't work, or a mix of files or URLs, or whatever. I haven't yet added test cases that show these failures. ↩

2. Very much a work in progress. In particular, the tests really should be making better use of fixtures to simplify the individual test cases and dump more useful information when they fail. ↩

3. And one (`test_add_file_json`) only succeeds because of the combined hacks of `save_text_as_source` and the JSON parser handling a first line of junk. If that test case listed two JSON files on the command line, it would fail. ↩

4. In my thinking about this, and in tests, I'm using `archivebox add --index-only` because I think of `archivebox add` as two steps: collect the URLs to add to the archive, and then add them to a queue of URLs that need the actual snapshot to be taken or updated. That means I don't have to think about `--overwrite`, `--update`, `--update-all`, and `--extract`, because those all impact the second step, not the first. ↩

5. If we want to also handle filenames via stdin, we could add a parser that is only used when parsing stdin that is like `url_list` for files. (This is kind of what happens now inside `save_text_as_source`, but I would say it only really works accidentally in narrow cases.) Call it `file_list`, or `url_or_file_list` if you want to handle both URLs and filenames on stdin at the same time. But the more I think about it, the more I think handling filenames on stdin is not worth the headaches.⁶ ↩

6. Filenames can have spaces. And imagine `cat compromised-file | archivebox add` where someone has figured out how to inject `/etc/passwd` or some other should-be-secure path into some automated pipeline. It seems like gratuitously dangerous behavior to leave as a default. ↩

7. We could distinguish between the unspecified parser and `--parser=auto`, so that `archivebox add --parser=auto http://example.com/resource` added the list of URLs parsed from `http://example.com/resource` but not that URL.⁸ ↩

8. This gets back to my idea that this really should be two commands: `archivebox archive [url]`, which only ever archives the URLs (and only URLs) specified (and also takes a list of URLs, and only a list, on stdin) and knows nothing about "parsers," and `archivebox import [url_or_filename_or_stdin]`, which is like what I'm suggesting `archivebox add --parser=auto` could do.⁹ ↩

9. Another idea to reduce ambiguity: don't implicitly read stdin when we have no arguments; require `-` to be supplied as a filename. ↩
---
Okay, that makes more sense. I think it would help to explicitly throw an error when `--parser` is specified and we get one or more URLs passed on the command line, so I'll look into creating a patch that does that and write some test cases.

(I started down this rabbit hole when I was trying to write test cases for my patch for #1171. That should be easy now that I understand how it is supposed to work!)
---
Another behavior I question that I ran across in writing tests: `--parser=auto` tries all of the parsers and picks the one that returns the most URLs, but because the `generic_txt` parser just hoovers up anything that looks like a URL, it gets used even if a more specific parser is successful. This contradicts the following comment in the code:

ArchiveBox/archivebox/parsers/__init__.py, lines 138 to 141 in f02b279

This is problematic with RSS feeds or other XML formats, for example, since they may include URLs for namespaces that the user probably doesn't really intend to archive.

Looks like this behavior was introduced with the commit for the new `generic_html` parser. I would have thought it would be problematic there, too, since `generic_txt` would pick up anything URL-like in the HTML file, which will be more than `generic_html` will find in just the `<a href="">` tags.
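A toy demonstration of the clobbering (the two parsers below are crude stand-ins for `generic_txt` and the RSS parser, not the real implementations): a regex-over-everything scan of an RSS snippet also swallows the `xmlns` namespace URL, so under "most URLs wins" it beats the stricter parser.

```python
import re
import xml.etree.ElementTree as ET

rss = """<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <item><link>https://example.com/post-1</link></item>
  </channel>
</rss>"""

# crude stand-in for generic_txt: grab anything URL-shaped
txt_urls = re.findall(r'https?://[^\s"<]+', rss)

# crude stand-in for the RSS parser: only <link> elements count
rss_urls = [el.text for el in ET.fromstring(rss).iter("link")]

# generic_txt "wins" on count only because it also swallowed the xmlns URL
assert len(txt_urls) > len(rss_urls)
assert "http://www.w3.org/2005/Atom" in txt_urls
```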
---
Here's how it should work:

**`--depth=0` + `--parser=None` + passed blob containing URL(s) via stdin/args**

```
# treats provided args/stdin as a single any-format blob containing URLs
archivebox add --depth=0 'https://website.one' 'https://website.two'
echo -e 'https://website.one\nhttps://website.two' | archivebox add --depth=0
echo -e '<rss>...some urls</rss>' | archivebox add --depth=0

>>> archivebox.main.add('https://website.one\nhttps://website.two', depth=0)
... saves raw args/stdin into local `sources/import_stdin.txt` first
... parses sources/import_stdin.txt using the autodetected best parser
... then archives each url found as a new Snapshot
```

**`--depth=0` + `--parser=None` + passed local file path(s) via stdin/args**

```
# parse provided args/stdin as file paths pointing to any-format blobs containing URLs
archivebox add --depth=0 ~/Downloads/some/rss.xml
echo '~/Downloads/some/rss.xml' | archivebox add --depth=0

>>> archivebox.main.add('~/Downloads/some/rss.xml', depth=0)
... saves raw args/stdin into local `sources/import_stdin.txt` first
... copies each local file into its own sources/some_file.txt
... for each new sources/ file, parses using the autodetected best parser
... then archives each url found as a new Snapshot
```

**`--depth=0` + `--parser=rss` + passed URL(s) or local file path(s) via stdin/args**

```
# parse provided args/stdin as URLs or local file paths (treat as reference(s) to some RSS)
archivebox add --depth=0 --parser=rss 'https://dropbox.tech/feed'
archivebox add --depth=0 --parser=rss '~/Downloads/some/rss/feed.xml'
echo 'https://dropbox.tech/feed' | archivebox add --depth=0 --parser=rss
echo '~/Downloads/some/rss/feed.xml' | archivebox add --depth=0 --parser=rss

... saves raw args/stdin string into a local `sources/import__stdin.txt` first
... if args/stdin are raw URLs or file paths, downloads/copies the content each arg *points to* into a corresponding file, e.g. `sources/example_com_some_rss_feed.xml`
... then parses each new sources/ file as rss to get a list of urls to archive
... then archives each first-hop url as a new Snapshot
```

**`--depth=0` + `--parser=rss` + passed a single raw RSS blob directly via stdin**

```
# parse provided stdin as RSS directly (using RSS parser to find URLs within)
curl -s 'https://dropbox.tech/feed' | archivebox add --depth=0 --parser=rss
cat ~/Downloads/some/rss/feed.xml | archivebox add --depth=0 --parser=rss
# not allowed: archivebox add --depth=0 --parser=rss '<rss>...</rss>'

>>> archivebox.main.add('<rss>...</rss>', depth=0, parser='rss')
... saves stdin into a sources/import_stdin.txt file
... parses sources/import_stdin.txt as rss to get a list of URLs
... creates a Snapshot for each found URL
```

**`--depth=1` + `--parser=rss` + passed reference(s) to RSS file(s) via stdin/args**

```
# parse provided args/stdin as URLs or local file paths (treat as reference(s) to some RSS)
archivebox add --depth=1 --parser=rss 'https://dropbox.tech/feed'
archivebox add --depth=1 --parser=rss ~/Downloads/some/rss/feed.xml
echo 'https://dropbox.tech/feed' | archivebox add --depth=1 --parser=rss
echo '~/Downloads/some/rss/feed.xml' | archivebox add --depth=1 --parser=rss

>>> archivebox.main.add('https://example.com/some/rss/feed.xml', depth=1, parser='rss')
... saves raw args/stdin string into a local `sources/import__urls.txt` first
... if args/stdin are raw URLs or file paths, downloads/copies the content each arg *points to* into a corresponding file, e.g. `sources/example_com_some_rss_feed.xml`
... then parses each new sources/ file as rss to get a list of urls to archive
... then archives each first-hop url as a new Snapshot
... then crawls each url to get a list of URLs a second hop away (depth=1)
... then archives all those subsequent URLs
```

**`--depth=1` + `--parser=rss` + passed a single raw RSS blob directly via stdin**

```
# parse provided stdin as RSS blob directly (using RSS parser to find URLs within)
curl -s 'https://dropbox.tech/feed' | archivebox add --depth=1 --parser=rss
cat ~/Downloads/some/rss/feed.xml | archivebox add --depth=1 --parser=rss
# not allowed: archivebox add --depth=1 --parser=rss '<rss>...</rss>' (this should fail)

>>> archivebox.main.add('<rss xmlns:content...>...</rss>', depth=1, parser='rss')
... saves stdin into a local file first: sources/stdin_urls.rss
... parses the local file as rss to get a list of urls to archive
... then archives each first-hop url as a new Snapshot
... then crawls each url to get a list of URLs a second hop away (depth=1)
... then archives all those subsequent URLs
```
This code is brittle and has had many fixes over the years. There are many different edge cases around the weird ways users can pass things (e.g. passing local XML file paths inside a single TXT file whose path is passed via stdin to archivebox).

I should really document all the behaviors and update the test cases at some point... I wrote some tests for it, but at this point some are outdated / missing coverage of some edge cases.

I intend to clean it up someday when I add `--depth=2` and `--filter=<filter pattern>` (e.g. `url.netloc == 'example.com'`) support.
---
Ah yeah, I know about this shortcoming of `generic_txt` superseding everything else, but at the time we were getting more issues about URLs failing to parse than about too many URLs being parsed, so this was the result of a hasty "fuckit, parse everything at all costs" bad decision.

Great job spelunking through the git history to find the context of that change! And I greatly appreciate your willingness to stick with debugging this even though it's clearly one of the poorest-designed / highest-technical-debt areas of the codebase. I sent you a GitHub Sponsorship as a token of my appreciation, since you're going above and beyond the casual contributor level here.

I think now that the dedicated parsers are improving, we can walk back that clobbery design decision and return it to the "first one that succeeds" approach.

Assumptions to verify:

- a parser returning 0 URLs should count as a failure, even if it didn't throw an Exception
- there are no cases where a parser might only return a subset of the URLs without erroring, while a later parser might find more (excluding garbage atom/schema URLs, etc. found in XML headers)
- the parser list is sorted from most-specific to least-specific
- manually specifying a parser should try only that one, with no fallbacks
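A sketch of what "first one that succeeds" could look like with those assumptions encoded (the registry and toy parsers are hypothetical, not archivebox's actual parser API):

```python
from typing import Callable, Dict, List

# registry ordered most-specific -> least-specific (assumption 3)
PARSERS: Dict[str, Callable[[str], List[str]]] = {}

def parse(text: str, parser: str = "auto") -> List[str]:
    if parser != "auto":
        # an explicitly chosen parser gets no fallback (assumption 4)
        return PARSERS[parser](text)
    for name, fn in PARSERS.items():
        try:
            urls = fn(text)
        except Exception:
            continue          # an exception counts as failure
        if urls:              # zero URLs counts as failure too (assumption 1)
            return urls
    return []

# toy parsers for illustration only:
PARSERS["rss"] = lambda t: [] if "<rss" not in t else ["https://example.com/from-rss"]
PARSERS["generic_txt"] = lambda t: ["https://example.com/greedy"]
```

With this ordering, `generic_txt` only runs when every more specific parser has either raised or come back empty, which is exactly the walk-back from "most URLs wins."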
---
After thinking about this a bunch I agree with all of it, especially these:

> We could distinguish between the unspecified parser and `--parser=auto`, so that `archivebox add --parser=auto http://example.com/resource` added the list of URLs parsed from `http://example.com/resource` but not that URL.

> This gets back to my idea that this really should be two commands: `archivebox archive [url]`, which only ever archives the URLs (and only URLs) specified (and also takes a list of URLs, and only a list, on stdin) and knows nothing about "parsers," and `archivebox import [url_or_filename_or_stdin]`, which is like what I'm suggesting `archivebox add --parser=auto` could do.

Firstly, I think I'm going to banish all filesystem paths being used as input unless they start with `file://data/sources/` or `file://data/archive/`. This allows us to keep piping filesystem paths as an internal mechanism without the security risk of importing from anywhere in the filesystem. If people want to import other paths, they can always pipe them in manually using `<` stdin redirection.

I also think we should mandate that args can only ever be lists of URIs or snapshot_ids; you can never pass text to be parsed as an arg. These two assertions can be enforced at runtime, and should greatly simplify all our logic.
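Those two runtime assertions could look roughly like this (the allowed prefixes come from above; the snapshot-id shape and the function name are assumptions for illustration):

```python
import re

ALLOWED_FILE_PREFIXES = ("file://data/sources/", "file://data/archive/")
URI_RE = re.compile(r'^https?://\S+$')
SNAPSHOT_ID_RE = re.compile(r'^[0-9a-f]{8,64}$')  # assumed id shape, for illustration

def validate_arg(arg: str) -> str:
    """Reject any arg that isn't a URL, an allowed file:// path, or a snapshot id."""
    if URI_RE.match(arg) or SNAPSHOT_ID_RE.match(arg):
        return arg
    if arg.startswith(ALLOWED_FILE_PREFIXES):
        return arg
    raise ValueError(
        f"not a URI or snapshot_id (pipe file contents via stdin instead): {arg!r}"
    )
```

Failing loudly here means an injected `/etc/passwd` line never silently becomes a file read, which addresses the footnote-6 concern directly.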
I've written a new extractor called `out_links` that parses the hrefs/urls in a bytes blob into `out_links.json` during archiving, allowing a snapshot of a URL to be the starting point to get its contents instead of using a helper func to download remote files. We can move all the parsers to be extractors based on this base extractor (e.g. `rss_outlinks`, `md_outlinks`, `jsonl_outlinks`).

I propose creating three new commands, each with a smaller scope, designed to be chained with unix pipes:

1. `archivebox snapshot [url1] [url2] [url3]`
   (e.g. `echo -e 'https://example.com/a\nhttps://example.com/b\n' | archivebox snapshot`)
   Takes only a url_list as args, or JSONL†/url_list via stdin; creates Snapshots (and Tags, if needed, as a side effect) for each, and echoes out one created snapshot id per line.

2. `archivebox extract --extractors=[all]|screenshot|pdf|...|rss_outlinks|html_outlinks|... --csv=url,timestamp,title,tags [snapshot_id1] [snapshot_id2] [snapshot_id3]`
   Takes a single snapshot_id or a list of snapshot ids as args or stdin, runs the specified extractors on them, and saves the output files into the snapshot directory for each (creating an ArchiveResult for each output file). Echoes out each snapshot id after running extractors on it (so it can be easily chained with other commands).

3. `archivebox crawl [--parser=auto|html|rss|jsonl|...] [url1|snapshot_id1] [url2|snapshot_id2]`
   Takes a single starting-point url (or snapshot_id) as an arg and returns outlink URLs. Runs:

   ```shell
   # if a url is passed instead of a snapshot_id, create a snapshot for it:
   hop_0_snapshot_id=$(archivebox snapshot [url] --csv=id)
   # if parsed outlinks are not present in the snapshot dir, extract them using the chosen parser:
   archivebox extract --extract=auto_outlinks $hop_0_snapshot_id
   # return parsed outlinks as jsonl or url_list:
   cat ./archive/$hop_0_snapshot_id/auto_outlinks.jsonl
   # does not handle recursing, only echoes out URLs one hop away from the starting point
   ```

This lets us keep `archivebox add`, but make it just a shortcut for running combos of the above commands.
For example, `archivebox add 'http://example.com'` can become:

```shell
archivebox snapshot 'https://example.com' > hop_0_snapshot_ids.txt
archivebox extract --extractors=all < hop_0_snapshot_ids.txt
# aka: archivebox snapshot 'https://example.com' | archivebox extract
```

`archivebox add --depth=1 'http://example.com'` can become:

```shell
archivebox crawl --parser=auto 'https://example.com' > hop_0_outlinks.txt
archivebox snapshot < hop_0_outlinks.txt > hop_0_snapshot_ids.txt
archivebox extract --extractors=all < hop_0_snapshot_ids.txt
# aka: archivebox crawl 'https://example.com' | archivebox snapshot | archivebox extract
```

`archivebox add --parser=rss 'https://example.com/some/feed.xml'` can become:

```shell
archivebox crawl --parser=rss 'https://example.com/some/feed.xml' > hop_0_outlinks.txt
archivebox snapshot < hop_0_outlinks.txt > hop_0_snapshot_ids.txt
archivebox extract --extractors=all < hop_0_snapshot_ids.txt
# aka: archivebox crawl --parser=rss 'https://example.com/some/feed.xml' | archivebox snapshot | archivebox extract
```

`archivebox add --parser=rss --depth=1 < some_feed.xml` can become:

```shell
# save stdin to file://data/sources/add_stdin_rss.txt
archivebox crawl --parser=rss file://data/sources/add_stdin_rss.txt > hop_0_outlinks.txt
archivebox snapshot < hop_0_outlinks.txt > hop_0_snapshot_ids.txt
archivebox extract --extractors=all < hop_0_snapshot_ids.txt
for snapshot_id in $(cat hop_0_snapshot_ids.txt); do
    archivebox crawl file://data/archive/$snapshot_id/outlinks.json > hop_1_outlinks.txt
    archivebox snapshot < hop_1_outlinks.txt > hop_1_snapshot_ids.txt
    archivebox extract --extractors=all < hop_1_snapshot_ids.txt
done

# aka:
archivebox crawl --parser=rss 'file://data/sources/add_stdin_rss.txt' | # outputs urls found in file
  archivebox snapshot | # creates a snapshot for each first-hop url
  archivebox extract |  # saves all the content for each first hop, outputs snapshot ids
  archivebox crawl |    # gets outlink urls from first-hop snapshots
  archivebox snapshot | # creates snapshots for each outlink
  archivebox extract    # saves the rest of the content for each outlink
```
Does that sound reasonable?

†: I'm imagining all these commands can yield/ingest bare snapshot IDs or URLs, but it would be nice if they also supported a normalized JSONL format for archivebox objects, like this, so we can pass metadata (like tags, bookmarked_ts, etc.) with each line:

```json
{"url": "https://example.com/some/article.html", "tags": "sometagxyz,sometagabc"}
{"type": "Snapshot", "url": "https://example.com/some/article.html", "tags": "sometagxyz,sometagabc", "id": "124235353425"}
{"type": "Tag", "name": "sometagabc"}
```

A format like this is extensible as we pass more metadata/object-types in the future, it's streamable (each line can be processed independently), and it's easy to debug with `jq`.
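On the ingestion side, a minimal sketch of reading that JSONL format (field names are taken from the examples above; defaulting bare lines without a `type` to `Snapshot` is an assumption for illustration):

```python
import json
from typing import Dict, Iterable, List

def ingest_jsonl(lines: Iterable[str]) -> List[Dict]:
    """Parse one JSON object per line; bare lines without 'type' default to Snapshot."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip blank lines between records
        obj = json.loads(line)
        obj.setdefault("type", "Snapshot")
        records.append(obj)
    return records

records = ingest_jsonl([
    '{"url": "https://example.com/some/article.html", "tags": "sometagxyz,sometagabc"}',
    '{"type": "Tag", "name": "sometagabc"}',
])
```

Because each line is independent, the same function works whether the input is a file, a pipe, or a socket stream.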
---