That makes more sense. I'll redo my RSS parser pull request to not muck around with this area, and work on adding test cases for how `--parser=auto` should work (and make it work that way).

Still thinking about the complicated interactions of `--parser`, `--depth`, input from stdin, and URLs vs. files on the command line. Right now I'm thinking that it may make sense to add some more explicit commands that don't try to be so clever, and then `add` can be as messy as it wants without cluttering up the underlying implementation as much.
from archivebox.
(Thanks for the sponsorship, it was very unexpected!)
This is a long comment, and gets longer every time I try to make it shorter. The tl;dr is that the way archivebox tries to do things now leaves ambiguity, and I fear building a tower of workarounds that is going to implode. You can see that in this:

```shell
$ echo '~/Downloads/some/rss/feed.xml' | archivebox add --depth=0 --parser=rss
$ cat '~/Downloads/some/rss/feed.xml' | archivebox add --depth=0 --parser=rss
```

Either stdin is parsed as an RSS file, or the filenames read from stdin are parsed as RSS; I'm not sure you can really make a reliable determination without doing some hacky and dangerous things.¹

I pushed test cases² that implement all of the scenarios you outlined in this comment, and with the current dev branch, only 4 of the 19 pass. After fixing `parser == 'auto'` to favor the first successful parser, that only improves to 9 of 19. Making the change to interpret `--parser` as I outline below gets us to 15 of 19. The four that still fail are due to the ambiguous behavior above.³ (I haven't committed that change because I'm not happy with it yet.)
Here's how I am thinking about it:

We start with an empty list of URLs to archive.⁴

We get a list of zero or more filenames or URLs on the command line. That's `args = Optional[list[str]]`.

If `not args`, we read stdin and run it through the parser. It could be RSS, JSON, or just a blob of text with URLs that the `generic_txt` parser will figure out when `parser = "auto"`, and so we add those to our list of URLs to archive.⁵
Now, `for arg in args`, we have to figure out what each argument is.

If `arg` is a file, we don't have a URL, so it can't be added to our list of URLs; instead we parse the contents and add the URLs contained within to our list of URLs to archive.

If `arg` looks like a URL, now we have to figure out the user's intent. It could either be a resource that the user wanted to archive, or something like an RSS file where they want us to parse it and archive all the URLs mentioned.

```shell
$ archivebox add https://example.com/feed.rss https://example.com/content.html
```

The intent is "obvious" for an RSS feed, but ambiguous for an HTML file, because maybe it's an export file of bookmarks.

I don't think we want to always download and parse the resource, because with `--parser=auto` we'll probably always find something in any URL we're given, but for most URLs the user probably really intends to archive that specific resource. (This also would conflict with `--index-only`, which I think shows an intent to not download URLs absent other signals.)

That means that `archivebox add https://example.com/index.rss` will archive `https://example.com/index.rss` and not parse the RSS file (and archive the URLs it finds).

However, I think when the user has specified a parser (`parser != "auto"`), we can use that as a signal that we really want to run every URL in `args` through the parser. So when `--parser=whatever` is specified, we don't archive each URL; we download it and run it through the parser.⁷
Rough code outline without error checking or lots of other option handling:

```python
urls_to_archive = []
if not args:
    urls_to_archive += parse_file(stdin, parser)
for arg in args:
    if parser == "auto" and re.findall(URL_REGEX, arg):
        urls_to_archive += [arg]
    else:
        urls_to_archive += parse_url_or_filename(arg, parser)
for url in urls_to_archive:
    queue_for_archive(url)
```
Note that `--depth` isn't part of this yet. To add that, thinking about it recursively makes more sense to me. Let's wrap all of the above in a function called `add(args: List[str], stdin: Optional[IO] = None, depth: int = 0, parser: str = "auto")`.

Now we can turn our last loop into:

```python
for url in urls_to_archive:
    if already_queued(url):
        continue  # skip URLs we've already queued
    queue_for_archive(url)
    if depth > 0:
        add(args=parse_url_or_filename(url), depth=depth - 1, parser="auto")
```

We can also use `depth` as an additional signal of intent from the user on what to do with arguments they originally gave on the command line, so we can expand that check to `re.findall(URL_REGEX, arg) and (parser != "auto" or depth > 0)`.
This can support any arbitrary depth, but we'll want to add some options to control two parts of that: filtering URLs (so we can restrict the crawl to a specific site, for example) and deciding which leaves of the tree we're walking we actually archive. Back to the RSS example: when we do `archivebox add --depth=1 https://example.org/test.rss` we don't want to archive the RSS page, but when we do `archivebox add --depth=1 https://example.org/page1.html` we probably do. Pushing things out further, when we do `archivebox add --depth=2 https://example.org/subscriptions.opml` (where that's an OPML file of RSS subscriptions), we are probably okay with just archiving the `depth = 0` links, but if we were to do `archivebox add --depth=2 https://example.org/` we probably want to archive every page, so maybe we need an option something like `--archive-depth=N`.
I haven't really captured the problematic interplay of `--depth` and `--parser` here either, nor how it works with stdin. Under this sketch, if you run `echo -e "https://example.com/1.json\nhttps://example.com/2.json" | archivebox add --parser=json`, it will fall over because stdin is not in JSON format. I think that `parser` wants to be a stack. With that, I could do something like `archivebox add --depth=2 --parser=opml,rss https://example.com/feeds.opml` to archive every link in each RSS file listed in the OPML file. Or `echo -e "https://example.com/1.json\nhttps://example.com/2.json" | archivebox add --parser=url_list,json`.
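A rough sketch of how the parser stack could thread through the hops (purely illustrative; `next_parser` is a hypothetical helper, not an existing archivebox function): each hop consumes the next parser from the list and falls back to `auto` once the stack is exhausted.

```python
from typing import List, Tuple

def next_parser(parser_stack: List[str]) -> Tuple[str, List[str]]:
    """Pop the parser for the current hop; fall back to 'auto' when exhausted."""
    if parser_stack:
        return parser_stack[0], parser_stack[1:]
    return "auto", []

# e.g. --parser=opml,rss --depth=2:
stack = "opml,rss".split(",")
hop0, stack = next_parser(stack)   # parse the OPML file itself
hop1, stack = next_parser(stack)   # parse each RSS feed it lists
hop2, stack = next_parser(stack)   # anything deeper falls back to auto
```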
Footnotes

1. The `save_text_as_source` function is part of the culprit here. As it works now, it gets stdin as a string, splits it into lines, reads the contents of any line that has a path that exists, and then writes back out a file that is stdin with every file it read appended. This explains that hack in the JSON parser. But this is just madness and means multiple files on the command line or read via stdin just can't work, or a mix of files or URLs, or whatever. I haven't yet added test cases that show these failures. ↩

2. Very much a work in progress. In particular, the tests really should be making better use of fixtures to simplify the individual test cases and dump more useful information when they fail. ↩

3. And one (`test_add_file_json`) only succeeds because of the combined hacks of `save_text_as_source` and the JSON parser handling a first line of junk. If that test case listed two JSON files on the command line, it would fail. ↩

4. In my thinking about this, and in tests, I'm using `archivebox add --index-only` because I think of `archivebox add` as two steps: collect the URLs to add to the archive, and then add them to a queue of URLs that need the actual snapshot to be taken or updated. That means I don't have to think about `--overwrite`, `--update`, `--update-all`, and `--extract`, because those all impact the second step, not the first. ↩

5. If we want to also handle filenames via stdin, we could add a parser that is only used when parsing stdin that is like `url_list` for files. (This is kind of what happens now inside `save_text_as_source`, but I would say it only really works accidentally in narrow cases.) Call it `file_list`, or `url_or_file_list` if you want to handle both URLs and filenames on stdin at the same time. But the more I think about it, the more I think handling filenames on stdin is not worth the headaches.⁶ ↩

6. Filenames can have spaces. And imagine `cat compromised-file | archivebox add` where someone has figured out how to inject `/etc/passwd` or some other should-be-secure path into some automated pipeline. It seems like gratuitously dangerous behavior to leave as a default. ↩

7. We could distinguish between the unspecified parser and `--parser=auto`, so that `archivebox add --parser=auto http://example.com/resource` added the list of URLs parsed from `http://example.com/resource` but not that URL.⁸ ↩

8. This gets back to my idea that this really should be two commands: `archivebox archive [url]`, which only ever archives the URLs (and only URLs) specified (and also takes a list of URLs, and only a list, on stdin) and knows nothing about "parsers," and `archivebox import [url_or_filename_or_stdin]`, which is like what I'm suggesting `archivebox add --parser=auto` could do.⁹ ↩

9. Another idea to reduce ambiguity: don't implicitly read stdin when we have no arguments; require `-` to be supplied as a filename. ↩
---
Okay, that makes more sense. I think it would help to explicitly throw an error when `--parser` is specified and we get one or more URLs passed on the command line, so I'll look into creating a patch that does that and write some test cases.

(I started down this rabbit hole when I was trying to write test cases for my patch for #1171. That should be easy now that I understand how it is supposed to work!)
---
Another behavior I question that I ran across in writing tests: `--parser=auto` tries all of the parsers and picks the one that returns the most URLs, but because the `generic_txt` parser just hoovers up anything that looks like a URL, it gets used even if a more specific parser is successful. This contradicts the following comment in the code:

ArchiveBox/archivebox/parsers/__init__.py, lines 138 to 141 in f02b279

This is problematic with RSS feeds or other XML formats, for example, since they may include URLs for namespaces that the user probably doesn't really intend to archive.

Looks like this behavior was introduced with the commit for the new `generic_html` parser. I would have thought it would be problematic there, too, since `generic_txt` would pick up anything URL-like in the HTML file, which will be more than `generic_html` will find in just the `<a href="">` tags.
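A toy demonstration of the clobbering (the two parsers below are crude stand-ins for `generic_txt` and the RSS parser, not the real implementations): a regex-over-everything scan of an RSS snippet also swallows the `xmlns` namespace URL, so under "most URLs wins" it beats the stricter parser.

```python
import re
import xml.etree.ElementTree as ET

rss = """<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <item><link>https://example.com/post-1</link></item>
  </channel>
</rss>"""

# crude stand-in for generic_txt: grab anything URL-shaped
txt_urls = re.findall(r'https?://[^\s"<]+', rss)

# crude stand-in for the RSS parser: only <link> elements count
rss_urls = [el.text for el in ET.fromstring(rss).iter("link")]

# generic_txt "wins" on count only because it also swallowed the xmlns URL
assert len(txt_urls) > len(rss_urls)
assert "http://www.w3.org/2005/Atom" in txt_urls
```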
---
Here's how it should work:

**`--depth=0` + `--parser=None` + passed blob containing URL(s) via stdin/args**

```
# treats provided args/stdin as a single any-format blob containing URLs
archivebox add --depth=0 'https://website.one' 'https://website.two'
echo -e 'https://website.one\nhttps://website.two' | archivebox add --depth=0
echo -e '<rss>...some urls</rss>' | archivebox add --depth=0

>>> archivebox.main.add('https://website.one\nhttps://website.two', depth=0)
... saves raw args/stdin into local `sources/import_stdin.txt` first
... parses sources/import_stdin.txt using the autodetected best parser
... then archives each url found as a new Snapshot
```

**`--depth=0` + `--parser=None` + passed local file path(s) via stdin/args**

```
# parse provided args/stdin as file paths pointing to any-format blobs containing URLs
archivebox add --depth=0 ~/Downloads/some/rss.xml
echo '~/Downloads/some/rss.xml' | archivebox add --depth=0

>>> archivebox.main.add('~/Downloads/some/rss.xml', depth=0)
... saves raw args/stdin into local `sources/import_stdin.txt` first
... copies each local file into its own sources/some_file.txt
... for each new sources/ file, parses using the autodetected best parser
... then archives each url found as a new Snapshot
```

**`--depth=0` + `--parser=rss` + passed URL(s) or local file path(s) via stdin/args**

```
# parse provided args/stdin as URLs or local file paths (treat as reference(s) to some RSS)
archivebox add --depth=0 --parser=rss 'https://dropbox.tech/feed'
archivebox add --depth=0 --parser=rss '~/Downloads/some/rss/feed.xml'
echo 'https://dropbox.tech/feed' | archivebox add --depth=0 --parser=rss
echo '~/Downloads/some/rss/feed.xml' | archivebox add --depth=0 --parser=rss

... saves raw args/stdin string into a local `sources/import__stdin.txt` first
... if args/stdin are raw URLs or file paths, downloads/copies the content each arg *points to* into a corresponding file, e.g. `sources/example_com_some_rss_feed.xml`
... then parses each new sources/ file as rss to get a list of urls to archive
... then archives each first-hop url as a new Snapshot
```

**`--depth=0` + `--parser=rss` + passed a single raw RSS blob directly via stdin**

```
# parse provided stdin as RSS directly (using RSS parser to find URLs within)
curl -s 'https://dropbox.tech/feed' | archivebox add --depth=0 --parser=rss
cat ~/Downloads/some/rss/feed.xml | archivebox add --depth=0 --parser=rss
# not allowed: archivebox add --depth=0 --parser=rss '<rss>...</rss>'

>>> archivebox.main.add('<rss>...</rss>', depth=0, parser='rss')
... saves stdin into a sources/import_stdin.txt file
... parses sources/import_stdin.txt as rss to get a list of URLs
... creates a Snapshot for each found URL
```

**`--depth=1` + `--parser=rss` + passed reference(s) to RSS file(s) via stdin/args**

```
# parse provided args/stdin as URLs or local file paths (treat as reference(s) to some RSS)
archivebox add --depth=1 --parser=rss 'https://dropbox.tech/feed'
archivebox add --depth=1 --parser=rss ~/Downloads/some/rss/feed.xml
echo 'https://dropbox.tech/feed' | archivebox add --depth=1 --parser=rss
echo '~/Downloads/some/rss/feed.xml' | archivebox add --depth=1 --parser=rss

>>> archivebox.main.add('https://example.com/some/rss/feed.xml', depth=1, parser='rss')
... saves raw args/stdin string into a local `sources/import__urls.txt` first
... if args/stdin are raw URLs or file paths, downloads/copies the content each arg *points to* into a corresponding file, e.g. `sources/example_com_some_rss_feed.xml`
... then parses each new sources/ file as rss to get a list of urls to archive
... then archives each first-hop url as a new Snapshot
... then crawls each url to get a list of URLs a second hop away (depth=1)
... then archives all those subsequent URLs
```

**`--depth=1` + `--parser=rss` + passed a single raw RSS blob directly via stdin**

```
# parse provided stdin as RSS blob directly (using RSS parser to find URLs within)
curl -s 'https://dropbox.tech/feed' | archivebox add --depth=1 --parser=rss
cat ~/Downloads/some/rss/feed.xml | archivebox add --depth=1 --parser=rss
# not allowed: archivebox add --depth=1 --parser=rss '<rss>...</rss>' (this should fail)

>>> archivebox.main.add('<rss xmlns:content...>...</rss>', depth=1, parser='rss')
... saves stdin into a local file first: sources/stdin_urls.rss
... parses the local file as rss to get a list of urls to archive
... then archives each first-hop url as a new Snapshot
... then crawls each url to get a list of URLs a second hop away (depth=1)
... then archives all those subsequent URLs
```
This code is brittle and has had many fixes over the years. There are many different edge cases around the weird ways users can pass things (e.g. passing local XML file paths inside a single TXT file whose path is passed via stdin to archivebox).

I should really document all the behaviors and update the test cases at some point... I wrote some tests for it, but at this point some are outdated / missing coverage of some edge cases.

I intend to clean it up someday when I add `--depth=2` and `--filter=<filter pattern>` (e.g. `url.netloc == 'example.com'`) support.
---
Ah yeah, I know about this shortcoming of `generic_txt` superseding everything else, but at the time we were getting more issues about URLs failing to parse than about too many URLs being parsed, so this was the result of a hasty "fuckit, parse everything at all costs" bad decision.

Great job spelunking through the git history to find the context of that change! And I greatly appreciate your willingness to stick with debugging this even though it's clearly one of the poorest-designed / highest-technical-debt areas of the codebase. I sent you a GitHub Sponsorship as a token of my appreciation, since you're going above and beyond the casual contributor level here.

I think now that the dedicated parsers are improving, we can walk back that clobbery design decision and return it to the "first one that succeeds" approach.

Assumptions to verify:

- a parser returning 0 URLs should count as a failure, even if it didn't throw an Exception
- there are no cases where a parser might only return a subset of the URLs without erroring, while a later parser might find more (excluding garbage atom/schema URLs, etc. found in XML headers)
- the parser list is sorted from most-specific to least-specific
- manually specifying a parser should try only that one, with no fallbacks
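A sketch of what "first one that succeeds" could look like with those assumptions encoded (the registry and toy parsers are hypothetical, not archivebox's actual parser API):

```python
from typing import Callable, Dict, List

# registry ordered most-specific -> least-specific (assumption 3)
PARSERS: Dict[str, Callable[[str], List[str]]] = {}

def parse(text: str, parser: str = "auto") -> List[str]:
    if parser != "auto":
        # an explicitly chosen parser gets no fallback (assumption 4)
        return PARSERS[parser](text)
    for name, fn in PARSERS.items():
        try:
            urls = fn(text)
        except Exception:
            continue          # an exception counts as failure
        if urls:              # zero URLs counts as failure too (assumption 1)
            return urls
    return []

# toy parsers for illustration only:
PARSERS["rss"] = lambda t: [] if "<rss" not in t else ["https://example.com/from-rss"]
PARSERS["generic_txt"] = lambda t: ["https://example.com/greedy"]
```

With this ordering, `generic_txt` only runs when every more specific parser has either raised or come back empty, which is exactly the walk-back from "most URLs wins."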
---
After thinking about this a bunch I agree with all of it, especially these:

> We could distinguish between the unspecified parser and `--parser=auto`, so that `archivebox add --parser=auto http://example.com/resource` added the list of URLs parsed from `http://example.com/resource` but not that URL.

> This gets back to my idea that this really should be two commands: `archivebox archive [url]`, which only ever archives the URLs (and only URLs) specified (and also takes a list of URLs, and only a list, on stdin) and knows nothing about "parsers," and `archivebox import [url_or_filename_or_stdin]`, which is like what I'm suggesting `archivebox add --parser=auto` could do.

Firstly, I think I'm going to banish all filesystem paths being used as input unless they start with `file://data/sources/` or `file://data/archive/`. This allows us to keep piping filesystem paths as an internal mechanism without the security risk of importing from anywhere in the filesystem. If people want to import other paths, they can always pipe them in manually using `<` stdin redirection.

I also think we should mandate that args can only ever be lists of URIs or snapshot_ids; you can never pass text to be parsed as an arg. These two assertions can be enforced at runtime, and should greatly simplify all our logic.
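Those two runtime assertions could look roughly like this (the allowed prefixes come from above; the snapshot-id shape and the function name are assumptions for illustration):

```python
import re

ALLOWED_FILE_PREFIXES = ("file://data/sources/", "file://data/archive/")
URI_RE = re.compile(r'^https?://\S+$')
SNAPSHOT_ID_RE = re.compile(r'^[0-9a-f]{8,64}$')  # assumed id shape, for illustration

def validate_arg(arg: str) -> str:
    """Reject any arg that isn't a URL, an allowed file:// path, or a snapshot id."""
    if URI_RE.match(arg) or SNAPSHOT_ID_RE.match(arg):
        return arg
    if arg.startswith(ALLOWED_FILE_PREFIXES):
        return arg
    raise ValueError(
        f"not a URI or snapshot_id (pipe file contents via stdin instead): {arg!r}"
    )
```

Failing loudly here means an injected `/etc/passwd` line never silently becomes a file read, which addresses the footnote-6 concern directly.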
I've written a new extractor called `out_links` that parses the hrefs/urls in a bytes blob into `out_links.json` during archiving, allowing a snapshot of a URL to be the starting point to get its contents instead of using a helper func to download remote files. We can move all the parsers to be extractors based on this base extractor (e.g. `rss_outlinks`, `md_outlinks`, `jsonl_outlinks`).

I propose creating three new commands, each with a smaller scope, designed to be chained with unix pipes:

1. `archivebox snapshot [url1] [url2] [url3]`
   (e.g. `echo -e 'https://example.com/a\nhttps://example.com/b\n' | archivebox snapshot`)
   Takes only a url_list as args, or JSONL†/url_list via stdin; creates Snapshots (and Tags, if needed, as a side effect) for each, and echoes out one created snapshot id per line.

2. `archivebox extract --extractors=[all]|screenshot|pdf|...|rss_outlinks|html_outlinks|... --csv=url,timestamp,title,tags [snapshot_id1] [snapshot_id2] [snapshot_id3]`
   Takes a single snapshot_id or a list of snapshot ids as args or stdin, runs the specified extractors on them, and saves the output files into the snapshot directory for each (creating an ArchiveResult for each output file). Echoes out each snapshot id after running extractors on it (so it can be easily chained with other commands).

3. `archivebox crawl [--parser=auto|html|rss|jsonl|...] [url1|snapshot_id1] [url2|snapshot_id2]`
   Takes a single starting-point url (or snapshot_id) as an arg and returns outlink URLs. Runs:

   ```shell
   # if a url is passed instead of a snapshot_id, create a snapshot for it:
   hop_0_snapshot_id=$(archivebox snapshot [url] --csv=id)
   # if parsed outlinks are not present in the snapshot dir, extract them using the chosen parser:
   archivebox extract --extract=auto_outlinks $hop_0_snapshot_id
   # return parsed outlinks as jsonl or url_list:
   cat ./archive/$hop_0_snapshot_id/auto_outlinks.jsonl
   # does not handle recursing, only echoes out URLs one hop away from the starting point
   ```

This lets us keep `archivebox add`, but make it just a shortcut for running combos of the above commands.
For example, `archivebox add 'http://example.com'` can become:

```shell
archivebox snapshot 'https://example.com' > hop_0_snapshot_ids.txt
archivebox extract --extractors=all < hop_0_snapshot_ids.txt
# aka: archivebox snapshot 'https://example.com' | archivebox extract
```

`archivebox add --depth=1 'http://example.com'` can become:

```shell
archivebox crawl --parser=auto 'https://example.com' > hop_0_outlinks.txt
archivebox snapshot < hop_0_outlinks.txt > hop_0_snapshot_ids.txt
archivebox extract --extractors=all < hop_0_snapshot_ids.txt
# aka: archivebox crawl 'https://example.com' | archivebox snapshot | archivebox extract
```

`archivebox add --parser=rss 'https://example.com/some/feed.xml'` can become:

```shell
archivebox crawl --parser=rss 'https://example.com/some/feed.xml' > hop_0_outlinks.txt
archivebox snapshot < hop_0_outlinks.txt > hop_0_snapshot_ids.txt
archivebox extract --extractors=all < hop_0_snapshot_ids.txt
# aka: archivebox crawl --parser=rss 'https://example.com/some/feed.xml' | archivebox snapshot | archivebox extract
```

`archivebox add --parser=rss --depth=1 < some_feed.xml` can become:

```shell
# save stdin to file://data/sources/add_stdin_rss.txt
archivebox crawl --parser=rss file://data/sources/add_stdin_rss.txt > hop_0_outlinks.txt
archivebox snapshot < hop_0_outlinks.txt > hop_0_snapshot_ids.txt
archivebox extract --extractors=all < hop_0_snapshot_ids.txt
for snapshot_id in $(cat hop_0_snapshot_ids.txt); do
    archivebox crawl file://data/archive/$snapshot_id/outlinks.json > hop_1_outlinks.txt
    archivebox snapshot < hop_1_outlinks.txt > hop_1_snapshot_ids.txt
    archivebox extract --extractors=all < hop_1_snapshot_ids.txt
done

# aka:
archivebox crawl --parser=rss 'file://data/sources/add_stdin_rss.txt' | # outputs urls found in file
  archivebox snapshot | # creates a snapshot for each first-hop url
  archivebox extract |  # saves all the content for each first hop, outputs snapshot ids
  archivebox crawl |    # gets outlink urls from first-hop snapshots
  archivebox snapshot | # creates snapshots for each outlink
  archivebox extract    # saves the rest of the content for each outlink
```
Does that sound reasonable?

†: I'm imagining all these commands can yield/ingest bare snapshot IDs or URLs, but it would be nice if they also supported a normalized JSONL format for archivebox objects, like this, so we can pass metadata (like tags, bookmarked_ts, etc.) with each line:

```json
{"url": "https://example.com/some/article.html", "tags": "sometagxyz,sometagabc"}
{"type": "Snapshot", "url": "https://example.com/some/article.html", "tags": "sometagxyz,sometagabc", "id": "124235353425"}
{"type": "Tag", "name": "sometagabc"}
```

A format like this is extensible as we pass more metadata/object-types in the future, it's streamable (each line can be processed independently), and it's easy to debug with `jq`.
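On the ingestion side, a minimal sketch of reading that JSONL format (field names are taken from the examples above; defaulting bare lines without a `type` to `Snapshot` is an assumption for illustration):

```python
import json
from typing import Dict, Iterable, List

def ingest_jsonl(lines: Iterable[str]) -> List[Dict]:
    """Parse one JSON object per line; bare lines without 'type' default to Snapshot."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip blank lines between records
        obj = json.loads(line)
        obj.setdefault("type", "Snapshot")
        records.append(obj)
    return records

records = ingest_jsonl([
    '{"url": "https://example.com/some/article.html", "tags": "sometagxyz,sometagabc"}',
    '{"type": "Tag", "name": "sometagabc"}',
])
```

Because each line is independent, the same function works whether the input is a file, a pipe, or a socket stream.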
---