internetarchive / warctools

Command line tools and libraries for handling and manipulating WARC files (and HTTP contents)

License: MIT License

warctools's Introduction

Warctools

WARC (Web ARChive) file tools for Python 2/3, based on the WARC 1.0 spec and compatible with the Internet Archive's ARC file format. The tools were originally developed by Hanzo Archives.

Install

pip install warctools

Python Usage

from hanzo import warctools

Python Examples

Write a WARC file:

import os

from hanzo import warctools


def write():
    headers = [
        (b'WARC-Type', b'warcinfo'),
        (b'WARC-Date', b'2019-11-19T23:08:51.182451Z'),
        (b'WARC-Filename', b'CRAWL-20191119230851-00000-hostname.warc.gz'),
        (b'WARC-Record-ID', b'<urn:uuid:8cc5dcae-0b21-11ea-842b-525476278032>')
    ]
    content_type = b'application/warc-fields'
    content = 'This\nis\nonly\na\ntest\n'.encode()
    fname = 'test.warc.gz'

    mode = 'ab'
    if not os.path.exists(fname):
        mode = 'wb'

    with open(fname, mode) as _fh:
        content = (content_type, content)
        record = warctools.WarcRecord(headers=headers, content=content)
        record.write_to(_fh, gzip="record")
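
The example above hard-codes WARC-Date and WARC-Record-ID. A minimal sketch of generating them per run, using only the standard library (Python 3), might look like the following. The helper name make_warcinfo_headers is hypothetical and not part of warctools; the returned list is in the same (name, value) bytes form that the example passes as headers.

```python
import uuid
from datetime import datetime, timezone


def make_warcinfo_headers(filename):
    """Build warcinfo headers as (name, value) byte pairs (hypothetical helper)."""
    # WARC-Date is a UTC timestamp in ISO 8601 form with a trailing 'Z'.
    now = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')
    # WARC-Record-ID is a URI in angle brackets; urn:uuid is a common choice.
    record_id = '<urn:uuid:%s>' % uuid.uuid4()
    return [
        (b'WARC-Type', b'warcinfo'),
        (b'WARC-Date', now.encode('ascii')),
        (b'WARC-Filename', filename.encode('utf-8')),
        (b'WARC-Record-ID', record_id.encode('ascii')),
    ]
```

The result could be passed as headers=make_warcinfo_headers(fname) in place of the literal list.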

Command-line Usage

warcvalid

Returns 0 if the arguments are all valid W/ARC files, non-zero on error.

[warctools] $ warcvalid -h
Usage: warcvalid [options] warc warc warc

Options:
  -h, --help            show this help message and exit
  -l LIMIT, --limit=LIMIT
  -I INPUT_FORMAT, --input=INPUT_FORMAT
  -L LOG_LEVEL, --log-level=LOG_LEVEL
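
Because warcvalid reports its result through the exit status, it can be driven from a script. The sketch below is one possible wrapper, assuming warcvalid is on the PATH; the function name all_valid and its command parameter are illustrative only.

```python
import subprocess


def all_valid(paths, command=('warcvalid',)):
    """Return True if the validator exits with status 0 for the given files.

    `command` defaults to the warcvalid executable; it is a parameter so
    that a different path (or a stand-in during testing) can be supplied.
    """
    result = subprocess.run(list(command) + list(paths))
    return result.returncode == 0
```

For example, all_valid(['a.warc.gz', 'b.warc.gz']) would be True only when warcvalid exits with 0 for that invocation.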

warcdump

Writes a human-readable summary of WARC files. Autodetects the input format when filenames are passed, e.g. record-gzip vs. plaintext, WARC vs. ARC. Assumes uncompressed WARC on stdin if no arguments are given.

[warctools] $ warcdump -h
Usage: warcdump [options] warc warc warc

Options:
  -h, --help            show this help message and exit
  -l LIMIT, --limit=LIMIT
  -I INPUT_FORMAT, --input=INPUT_FORMAT
  -L LOG_LEVEL, --log-level=LOG_LEVEL

warcfilter

Searches all headers for a regex pattern. Autodetects the input format and falls back to stdin in the same way as warcdump. Prints matching records in WARC format by default. Use -i to invert the search, -U to constrain it to the URL, -T to constrain it to the record type, and -C to constrain it to the content type.

$ warcfilter -h
Usage: warcfilter [options] pattern warc warc warc

Options:
  -h, --help            show this help message and exit
  -l LIMIT, --limit=LIMIT
                        limit (ignored)
  -I INPUT_FORMAT, --input=INPUT_FORMAT
                        input format (ignored)
  -i, --invert          invert match
  -U, --url             match on url
  -T, --type            match on (warc) record type
  -C, --content-type    match on (warc) record content type
  -H, --http-content-type
                        match on http payload content type
  -D, --warc-date       match on WARC-Date header
  -L LOG_LEVEL, --log-level=LOG_LEVEL
                        log level(ignored)
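
The matching logic can be pictured with a small standalone sketch. This is not warcfilter's code: the function header_matches is hypothetical, and it assumes the pattern is searched in header values only, with headers held as (name, value) byte pairs as in the Python example earlier.

```python
import re


def header_matches(headers, pattern, invert=False, only=None):
    """Return True if `pattern` is found in any selected header value.

    headers: list of (name, value) byte pairs.
    only:    optional header name to restrict the search to (compared
             case-insensitively), loosely analogous to -U / -T / -C.
    invert:  flip the result, analogous to -i.
    """
    regex = re.compile(pattern)
    found = any(
        regex.search(value)
        for name, value in headers
        if only is None or name.lower() == only.lower()
    )
    return found != invert
```

With headers containing a WARC-Target-URI ending in .pdf, header_matches(headers, rb'\.pdf$') is true, and passing invert=True selects the records that do not match.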

warc2warc

Autodetects compression on file arguments. Assumes uncompressed stdin if none are given. Use -Z to write compressed output, e.g. warc2warc -Z input > input.gz. Should ignore buggy records in the input.

[warctools] $ warc2warc -h
Usage: warc2warc [options] url (url ...)

Options:
  -h, --help            show this help message and exit
  -o OUTPUT, --output=OUTPUT
                        output warc file
  -l LIMIT, --limit=LIMIT
  -I INPUT_FORMAT, --input=INPUT_FORMAT
                        (ignored)
  -Z, --gzip            compress output, record by record
  -D, --decode_http     decode http messages (strip chunks, gzip)
  -L LOG_LEVEL, --log-level=LOG_LEVEL
  --wget-chunk-fix      skip transfer-encoding headers in http records, when
                        decoding them (-D)
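
"Record by record" compression means each record is written as its own gzip member and the members are concatenated, so a reader can start decompressing at any record offset. The sketch below illustrates the idea with the standard library only; gzip_records and read_record_at are hypothetical names, and the byte strings stand in for serialized WARC records.

```python
import gzip
import io
import zlib


def gzip_records(records):
    """Compress each record as a separate gzip member; return (blob, offsets)."""
    offsets, out = [], io.BytesIO()
    for record in records:
        offsets.append(out.tell())        # where this member starts
        out.write(gzip.compress(record))  # one complete gzip member per record
    return out.getvalue(), offsets


def read_record_at(blob, offset):
    """Decompress only the gzip member that starts at `offset`."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer;
    # decompression stops at the end of that single member.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    return decompressor.decompress(blob[offset:])
```

The concatenated blob is still a valid gzip file: gzip.decompress(blob) returns all records joined together.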

arc2warc

Creates a crappy WARC file from ARC files on input. A handful of headers are preserved. Use -Z to write compressed output, e.g. arc2warc -Z input.arc > input.warc.gz

[warctools] $ arc2warc -h
Usage: arc2warc [options] arc (arc ...)

Options:
  -h, --help            show this help message and exit
  -o OUTPUT, --output=OUTPUT
                        output warc file
  -l LIMIT, --limit=LIMIT
  -Z, --gzip            compress
  -L LOG_LEVEL, --log-level=LOG_LEVEL
  --description=DESCRIPTION
  --operator=OPERATOR
  --publisher=PUBLISHER
  --audience=AUDIENCE
  --resource=RESOURCE
  --response=RESPONSE

warcindex

DEPRECATED, use CDX-writer branch.

#WARC-filename offset warc-type warc-subject-uri warc-record-id content-type content-length
warccrap/mywarc.warc 1196018 request /images/slides/hanzo_markm__wwwoh.pdf <urn:uuid:fd1255a8-d07c-11df-b125-12313b0a18c6> application/http;msgtype=request 193
warccrap/mywarc.warc 1196631 response http://www.hanzoarchives.com/images/slides/hanzo_markm__wwwoh.pdf <urn:uuid:fd2614f8-d07c-11df-b125-12313b0a18c6> application/http;msgtype=response 3279474
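
For consumers of this output, a line can be split into the seven columns named in the header comment. The parser below is a sketch that assumes fields are whitespace-separated and contain no embedded spaces, as in the sample lines; IndexEntry and parse_index_line are hypothetical names.

```python
from collections import namedtuple

IndexEntry = namedtuple('IndexEntry', [
    'filename', 'offset', 'warc_type', 'subject_uri',
    'record_id', 'content_type', 'content_length',
])


def parse_index_line(line):
    """Parse one warcindex output line; return None for '#' comment lines."""
    if line.startswith('#'):
        return None
    fields = line.split()
    if len(fields) != len(IndexEntry._fields):
        raise ValueError('expected %d fields, got %d'
                         % (len(IndexEntry._fields), len(fields)))
    entry = IndexEntry(*fields)
    # Offsets and lengths are byte counts, so convert them to integers.
    return entry._replace(offset=int(entry.offset),
                          content_length=int(entry.content_length))
```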

Notes

arc2warc uses the conversion rules from the earlier arc2warc.c as a starting point for converting the headers.
I haven't profiled the code yet (and don't plan to until it falls over).
warcvalid covers only part of the ISO standard. Missing checks:
- strict whitespace
- required headers check
- MIME quoted-printable header encoding
- treating headers as UTF-8

ToDo

Lots more testing
Support pre-1.0 WARC files
Add more documentation
Support more commandline options for output and filenames
S3 urls

Credits

Originally developed by "tef".

warctools's Issues

record dumper assumes content type and content length

As stated in [http://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/#content-type]

All records with a non-empty block (non-zero Content-Length), except ‘continuation’ records, should have a Content-Type field. Only if the media type is not given by a Content-Type field, a reader may attempt to guess the media type via inspection of its content and/or the name extension(s) of the URI used to identify the resource. If the media type remains unknown, the reader should treat it as type “application/octet-stream”.

This is a "should", not a "must". The record dumper should not assume that a record has a content type or a content length. It currently crashes on such records; it should handle them instead.
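
One way to avoid the crash is to look the fields up defensively and fall back as the spec suggests. The helpers below are a sketch, not a patch against warctools: the names block_media_type and block_length are hypothetical, and headers are assumed to be (name, value) byte pairs.

```python
def block_media_type(headers, default=b'application/octet-stream'):
    """Return the Content-Type value, or the spec's fallback type if absent."""
    for name, value in headers:
        if name.lower() == b'content-type':
            return value
    return default


def block_length(headers):
    """Return Content-Length as an int, or None if the header is absent."""
    for name, value in headers:
        if name.lower() == b'content-length':
            return int(value)
    return None
```

A dumper built on these would print the fallback type, and skip the length, rather than raise on a record that omits either field.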

warcfilter.py needs to import logging

It's used in parse_http_response. Could you update asap? We are using it but it fails because it has not imported logging.

Drop the hanzo namespace so as not to collide with the original hanzo-warc-tools.

Suggestions for new namespace? I am happy with warctools

TypeError (str/bytes) in warc.py error path

In production at IA, probably caused by petabox downtime or a network error, I got the following exception and stack trace:

TypeError: sequence item 0: expected str instance, bytes found
  File "extraction_ungrobided.py", line 272, in <module>
    MRExtractUnGrobided.run()
  File "mrjob/job.py", line 424, in run
    mr_job.execute()
  File "mrjob/job.py", line 433, in execute
    self.run_mapper(self.options.step_num)
  File "mrjob/job.py", line 517, in run_mapper
    for out_key, out_value in mapper(key, value) or ():
  File "extraction_ungrobided.py", line 228, in mapper
    info, status = self.extract(info)
  File "extraction_ungrobided.py", line 143, in extract
    info['file:cdx']['c_size'])
  File "extraction_ungrobided.py", line 126, in fetch_warc_content
    gwb_record = rstore.load_resource(warc_uri, offset, c_size)
  File "wayback/resourcestore.py", line 65, in load_resource
    return create_resource(loader.load_block(bstart, blen))
  File "wayback/resource.py", line 583, in create_resource
    record, errors, offset = parser.parse(rs, 0, line)
  File "hanzo/warctools/warc.py", line 223, in parse
    % (",".join(self.KNOWN_VERSIONS)),

self.KNOWN_VERSIONS is defined as bytes at https://github.com/internetarchive/warctools/blob/master/hanzo/warctools/warc.py#L177, but is being joined with a string.

One fix, though I'm not sure it would work in Python 2.7, would be:

",".join([s.decode('utf-8') for s in self.KNOWN_VERSIONS])

There's probably a more idiomatic way, but I can submit a patch for that.

While we're at it, might want to make it a join on ", ", not ","?
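
A variant that should behave the same on Python 2.7 and Python 3 is to join with a bytes separator first and decode once at the end. The sketch below uses illustrative version values, not necessarily the ones in warc.py, and known_versions_text is a hypothetical helper.

```python
# Illustrative values only; the real set lives in hanzo/warctools/warc.py.
KNOWN_VERSIONS = (b'1.0', b'0.17', b'0.18')


def known_versions_text(versions=KNOWN_VERSIONS):
    """Join byte-string versions with ', ' and decode for use in a message."""
    # bytes.join works on bytes items in both Python 2 and 3, so the
    # str/bytes mismatch from the traceback above cannot occur here.
    return b', '.join(sorted(versions)).decode('ascii')
```

The result can then be interpolated into the error message with the usual % formatting.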

Cannot use warc library to open WAT files

Hello,
I am trying to use the warctools library to open a WAT file, which is a special type of WARC file that contains metadata. When I try to read in a file I get this:

('Reading file: ', 'sampleWAT.wat')
Traceback (most recent call last):
  File "createReports.py", line 104, in <module>
    main(sys.argv[1])
  File "createReports.py", line 101, in main
    readWARC(argv)
  File "createReports.py", line 18, in readWARC
    warcFile = warc.open(fileName)
  File "/usr/local/lib/python2.7/dist-packages/warc/__init__.py", line 38, in open
    raise IOError("Don't know how to open '%s' files"%format)
IOError: Don't know how to open 'unknown' files

Is there a way to open files with a *.wat extension without this error and without having to change the filename extension? Perhaps a flag that turns off file validation? I would appreciate this.

ArcParser raises exception instead of returning error info as WarcParser does

ArcParser raises an exception when it encounters a malformed record:

Traceback (most recent call last):
  File "../../fixarc.py", line 30, in <module>
    offset, record, errors = a._read_record(True)
  File "/home/kenji/projects/wbm/hanzo/warctools/stream.py", line 125, in _read_record
    self.record_parser.parse(self.gz, offset=None)
  File "/home/kenji/projects/wbm/hanzo/warctools/arc.py", line 140, in parse
    headers = self.parse_header_list(line)
  File "/home/kenji/projects/wbm/hanzo/warctools/arc.py", line 185, in parse_header_list
    raise StandardError('missing headers %s %s'%(",".join(values), ",".join(self.headers)))
StandardError: missing headers http://rservicespb.ru/robots.txt,91.218.228.14,20130908181048,08Sep201318:10-1:47GMT URL,IP-address,Archive-date,Content-type,Archive-length

WarcParser, on the other hand, returns a tuple with error info in the second element and the record offset in the third, which is useful for locating the troublesome record (and possibly for automating the repair process).

It would be useful if ArcParser behaved the same way as WarcParser on error.
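
The requested behaviour could look roughly like the sketch below, which reports a header mismatch through the return value instead of raising. It is a standalone illustration, not the ArcParser code: parse_arc_header_line is a hypothetical function, and the (headers, errors, offset) shape mirrors the tuple order described above.

```python
ARC_V1_FIELDS = ('URL', 'IP-address', 'Archive-date',
                 'Content-type', 'Archive-length')


def parse_arc_header_line(line, offset, fields=ARC_V1_FIELDS):
    """Return (headers, errors, offset); headers is None when parsing fails."""
    values = line.split()
    if len(values) != len(fields):
        # Report the problem instead of raising, so the caller can log the
        # offset and continue with (or repair) the next record.
        error = ('missing headers', ','.join(values), ','.join(fields))
        return None, [error], offset
    return list(zip(fields, values)), [], offset
```

A caller can then check errors and use offset to locate the bad record in the file.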

Use tox instead of nose in setup.py

$ python3 setup.py test
running test
WARNING: Testing via this command is deprecated and will be removed in a future version. 
Users looking for a generic test entry point independent of test runner are encouraged to use tox.
Searching for nose
Best match: nose 1.3.7

Python 3.10 / Ubuntu 22.04

Hello,

Apologies if I just missed this answered elsewhere or if there's an easy solution I'm just not skilled enough to figure out, but it seems that since Ubuntu whatever-the-hell versions and/or python 3.something made such drastic changes to the site/dist packages, these tools have a cascading series of modules that aren't found. I tried just updating those that had the same functionality, but changed names, but there are some that are changed completely, and that I lack the skills to fix. I ended up just copying the entirety of the old library over to my new system, but it was not graceful, so if this project is still live, it would be great to get a version that natively works well with the latest.

Thanks!

warcpayload.py missing from entry_points

warcpayload.py is missing from entry_points in setup.py

Extract entire WARC file?

I have several multi-gigabyte WARC files that I need to completely unpack

Is this possible or not?

Running warcextract on the file just gives me some basic information about the WARC

Apparently it wants an offset but I want to extract the entire thing

Streaming interface to warc files.

Avoid parsing entirety of warc file
Don't parse http records inside

Any improvements we can make to mean that large and gargantuan warc files can be read and processed speedily
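
As a starting point for discussion, a lazy reader can be written as a generator that yields one record at a time and leaves the block as raw bytes. The sketch below is not warctools code: iter_warc_records is a hypothetical name, and it assumes an uncompressed WARC in a seekable binary stream, single-line headers, and a Content-Length on each record (0 is assumed if missing).

```python
def iter_warc_records(stream):
    """Lazily yield (offset, version, headers, block) from a binary stream."""
    while True:
        offset = stream.tell()
        version = stream.readline()
        if not version:
            return  # end of stream
        if not version.strip():
            continue  # skip the blank lines that separate records
        headers = []
        for line in iter(stream.readline, b''):
            if not line.strip():
                break  # a blank line ends the header block
            name, _, value = line.partition(b':')
            headers.append((name.strip(), value.strip()))
        lookup = dict((name.lower(), value) for name, value in headers)
        length = int(lookup.get(b'content-length', 0))
        # The block is returned as raw bytes; any HTTP message inside is
        # deliberately left unparsed.
        block = stream.read(length)
        yield offset, version.strip(), headers, block
```

Only one record is held in memory at a time, so a loop such as "for offset, version, headers, block in iter_warc_records(open(path, 'rb'))" can stop early without reading the rest of the file.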

`pip install` is broken on Linux and OSX

On Ubuntu:

$ uname -a
Linux vm-home0.archive.org 3.5.0-32-generic #53~precise1-Ubuntu SMP Wed May 29 20:33:37 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

$ pip install warctools
Downloading/unpacking warctools
  Could not find a version that satisfies the requirement warctools (from versions: 4.7.macosx-10.8-intel)

On OS X:

$ uname -a
Darwin rajs-MacBook-Air-2.local 12.5.0 Darwin Kernel Version 12.5.0: Mon Jul 29 16:33:49 PDT 2013; root:xnu-2050.48.11~1/RELEASE_X86_64 x86_64


$ pip install warctools
Downloading/unpacking warctools
  Downloading warctools-4.7.macosx-10.8-intel.tar.gz (54kB): 54kB downloaded
  Running setup.py egg_info for package warctools
    Traceback (most recent call last):
      File "<string>", line 16, in <module>
    IOError: [Errno 2] No such file or directory: '/private/tmp/testenv/build/warctools/setup.py'
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):

  File "<string>", line 16, in <module>

IOError: [Errno 2] No such file or directory: '/private/tmp/testenv/build/warctools/setup.py'

G-Zip Content-Length

Warctools uses the Content-Length field to determine the length of the body when validating and reading WARC files. In Common Crawl WARC files, bodies that were originally gzipped are stored uncompressed, so these messages are not parsed in full.
#14 fixes this and allows Common Crawl WARC files to be parsed properly.

Error when running warcfilter

Any insight into this error? Data related or code/version?

/opt/warctools/bin/warcfilter -H text /home/test/test.warc
Traceback (most recent call last):
  File "/opt/warctools/bin/warcfilter", line 8, in <module>
    sys.exit(run())
  File "/opt/warctools/lib/python2.7/site-packages/hanzo/warcfilter.py", line 121, in run
    sys.exit(main(sys.argv))
  File "/opt/warctools/lib/python2.7/site-packages/hanzo/warcfilter.py", line 71, in main
    filter_archive(fh, options, pattern,out)
  File "/opt/warctools/lib/python2.7/site-packages/hanzo/warcfilter.py", line 95, in filter_archive
    code, content_type, message = parse_http_response(record)
  File "/opt/warctools/lib/python2.7/site-packages/hanzo/warcfilter.py", line 34, in parse_http_response
    logging.warning('trailing data in http response for %s'% record.url)
NameError: global name 'logging' is not defined

/opt/warctools/bin/warcfilter --help
Usage: warcfilter [options] pattern warc warc warc

Options:
  -h, --help            show this help message and exit
  -l LIMIT, --limit=LIMIT
                        limit (ignored)
  -I INPUT_FORMAT, --input=INPUT_FORMAT
                        input format (ignored)
  -i, --invert          invert match
  -U, --url             match on url
  -T, --type            match on (warc) record type
  -C, --content-type    match on (warc) record content type
  -H, --http-content-type
                        match on http payload content type
  -D, --warc-date       match on WARC-Date header
  -L LOG_LEVEL, --log-level=LOG_LEVEL
                        log level(ignored)

Any help would be appreciated!