warctools's People

Contributors

donrichards, lljrsr, nlevitt, pmyteh, stevejones, tef

warctools's Issues

Python 3.10 / Ubuntu 22.04

Hello,

Apologies if this has been answered elsewhere, or if there's an easy solution I'm not skilled enough to figure out. Since recent Ubuntu and Python 3 releases made drastic changes to the site/dist packages, these tools hit a cascading series of modules that aren't found. I tried updating the ones that kept the same functionality under a new name, but some have changed completely, and I lack the skills to fix those. I ended up copying the entire old library over to my new system, which was not graceful. If this project is still live, it would be great to get a version that works natively with the latest releases.

Thanks!

Use tox instead of nose in setup.py

$ python3 setup.py test
running test
WARNING: Testing via this command is deprecated and will be removed in a future version. 
Users looking for a generic test entry point independent of test runner are encouraged to use tox.
Searching for nose
Best match: nose 1.3.7

G-Zip Content-Length

Warctools uses the Content-Length field to determine the length of the body when validating and reading WARC files. In Common Crawl WARC files, bodies that were originally gzipped are no longer gzipped, so gzipped messages are not parsed in full.
#14 fixes this and allows proper parsing of Common Crawl WARC files.
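
A minimal sketch of one way to avoid the problem, not the actual change in #14. It assumes the mismatch arises because the HTTP Content-Length header still describes the compressed size, and that the record block has already been cut to the WARC record's own Content-Length. The function name `http_body_from_block` is hypothetical.

```python
def http_body_from_block(block):
    """Return the HTTP body from a WARC response block.

    `block` is the record block, already limited to the WARC Content-Length.
    The HTTP Content-Length header is deliberately ignored, because it may
    describe the compressed size while the stored body is decompressed.
    """
    # HTTP headers end at the first blank line.
    header_end = block.find(b"\r\n\r\n")
    if header_end == -1:
        # No header terminator found: treat the whole block as headers.
        return b""
    return block[header_end + 4:]


block = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Encoding: gzip\r\n"
    b"Content-Length: 5\r\n"  # stale value: size of the compressed body
    b"\r\n"
    b"hello, decompressed world"
)
body = http_body_from_block(block)
```

Here `body` holds the full stored payload even though the HTTP header claims five bytes.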

Extract entire WARC file?

I have several multi-gigabyte WARC files that I need to unpack completely.

Is this possible?

Running warcextract on the file just gives me some basic information about the WARC.

Apparently it wants an offset, but I want to extract the entire thing.
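
As a rough illustration of what a full extraction involves, here is a standard-library sketch that does not use warctools at all. It walks every record and writes each block to a numbered file. The names `iter_warc_records` and `extract_all` are hypothetical, and the parser is deliberately simplistic: it trusts the WARC Content-Length header and reads each block into memory.

```python
import gzip
import os


def iter_warc_records(stream):
    """Yield (headers, block) for each record in a binary WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines separating records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header section
            name, _, value = line.partition(b":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get(b"content-length", b"0"))
        yield headers, stream.read(length)


def extract_all(path, out_dir):
    """Write every record block in `path` to a numbered file in `out_dir`."""
    opener = gzip.open if path.endswith(".gz") else open
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    with opener(path, "rb") as stream:
        for count, (headers, block) in enumerate(iter_warc_records(stream), 1):
            with open(os.path.join(out_dir, "%06d.bin" % count), "wb") as out:
                out.write(block)
    return count
```

For response records the written block still begins with the HTTP headers; stripping them would be a separate step.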

record dumper assumes content type and content length

As stated in the WARC 1.0 specification (http://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/#content-type):

All records with a non-empty block (non-zero Content-Length), except ‘continuation’ records, should have a Content-Type field. Only if the media type is not given by a Content-Type field, a reader may attempt to guess the media type via inspection of its content and/or the name extension(s) of the URI used to identify the resource. If the media type remains unknown, the reader should treat it as type “application/octet-stream”.

This is a "should", not a "must". The record dumper should not assume that a record has a content type or content length. It currently crashes on such records, but it should handle them.
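
A sketch of the defensive lookup this would need, assuming headers are available as a list of (name, value) byte-string pairs. The helper name `describe_record` is hypothetical and is not part of warctools.

```python
def describe_record(headers):
    """Return (content_type, content_length) without assuming either exists.

    `headers` is a list of (name, value) byte-string pairs.
    """
    lookup = {name.lower(): value for name, value in headers}
    # The spec says an unknown media type is treated as application/octet-stream.
    content_type = lookup.get(b"content-type", b"application/octet-stream")
    raw_length = lookup.get(b"content-length")
    try:
        content_length = int(raw_length) if raw_length is not None else None
    except ValueError:
        content_length = None  # header present but not a number
    return content_type, content_length
```

A dumper built on this can print a placeholder when the length is `None` instead of crashing.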

TypeError (str/bytes) in warc.py error path

In production at IA, probably caused by petabox downtime or a network error, I got the following exception and stack trace:

TypeError: sequence item 0: expected str instance, bytes found
  File "extraction_ungrobided.py", line 272, in <module>
    MRExtractUnGrobided.run()
  File "mrjob/job.py", line 424, in run
    mr_job.execute()
  File "mrjob/job.py", line 433, in execute
    self.run_mapper(self.options.step_num)
  File "mrjob/job.py", line 517, in run_mapper
    for out_key, out_value in mapper(key, value) or ():
  File "extraction_ungrobided.py", line 228, in mapper
    info, status = self.extract(info)
  File "extraction_ungrobided.py", line 143, in extract
    info['file:cdx']['c_size'])
  File "extraction_ungrobided.py", line 126, in fetch_warc_content
    gwb_record = rstore.load_resource(warc_uri, offset, c_size)
  File "wayback/resourcestore.py", line 65, in load_resource
    return create_resource(loader.load_block(bstart, blen))
  File "wayback/resource.py", line 583, in create_resource
    record, errors, offset = parser.parse(rs, 0, line)
  File "hanzo/warctools/warc.py", line 223, in parse
    % (",".join(self.KNOWN_VERSIONS)),

self.KNOWN_VERSIONS is defined as bytes at https://github.com/internetarchive/warctools/blob/master/hanzo/warctools/warc.py#L177, but is being joined with a string.

One fix, though I'm not sure it would work in Python 2.7, would be:

",".join(s.decode('utf-8') for s in self.KNOWN_VERSIONS)

There's probably a more idiomatic way, but I can submit a patch for that.

While we're at it, might want to make it a join on ", ", not ","?
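
One possible shape for that fix, as a sketch only: join the byte strings first and decode once, which should behave the same on Python 2.7 and Python 3. The `KNOWN_VERSIONS` values, the helper `format_versions`, and the message text are illustrative assumptions; the real definitions live in hanzo/warctools/warc.py.

```python
# Hypothetical stand-in for the parser's class attribute.
KNOWN_VERSIONS = (b"1.0", b"0.17", b"0.18")


def format_versions(versions):
    """Join byte-string versions into a text string for an error message."""
    # Join as bytes, then decode once; sorted() keeps the output stable.
    return b", ".join(sorted(versions)).decode("ascii")


# Illustrative message text, not the wording used by warctools.
message = "version should be one of: %s" % format_versions(KNOWN_VERSIONS)
```

Sorting is optional; it only makes the message deterministic if the versions are stored in a set.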

ArcParser raises exception instead of returning error info as WarcParser does

ArcParser raises an exception when it encounters a malformed record:

Traceback (most recent call last):
  File "../../fixarc.py", line 30, in <module>
    offset, record, errors = a._read_record(True)
  File "/home/kenji/projects/wbm/hanzo/warctools/stream.py", line 125, in _read_record
    self.record_parser.parse(self.gz, offset=None)
  File "/home/kenji/projects/wbm/hanzo/warctools/arc.py", line 140, in parse
    headers = self.parse_header_list(line)
  File "/home/kenji/projects/wbm/hanzo/warctools/arc.py", line 185, in parse_header_list
    raise StandardError('missing headers %s %s'%(",".join(values), ",".join(self.headers)))
StandardError: missing headers http://rservicespb.ru/robots.txt,91.218.228.14,20130908181048,08Sep201318:10-1:47GMT URL,IP-address,Archive-date,Content-type,Archive-length

WarcParser, on the other hand, returns a tuple with error info in the second element and the record offset in the third, which is useful for locating the offending record (and possibly automating the repair process).

It would be useful if ArcParser behaved the same way as WarcParser on error.
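
A sketch of the requested behaviour, using the header names shown in the traceback above. The function `parse_arc_header_line` is hypothetical and much simpler than the real parser; it only illustrates returning a (result, errors, offset) tuple instead of raising.

```python
ARC_HEADERS = ("URL", "IP-address", "Archive-date", "Content-type", "Archive-length")


def parse_arc_header_line(line, offset):
    """Parse one ARC header line and return (headers, errors, offset).

    Follows the tuple shape described for WarcParser: error info in the
    second element and the record offset in the third.
    """
    values = line.split(" ")
    if len(values) != len(ARC_HEADERS):
        error = ("missing headers", ",".join(values), ",".join(ARC_HEADERS))
        return None, [error], offset
    return list(zip(ARC_HEADERS, values)), [], offset


# A line with only four fields is reported, not raised.
headers, errors, offset = parse_arc_header_line(
    "http://example.com/robots.txt 192.0.2.1 20130908181048 text/plain", 1024
)
```

A caller can then log `offset`, skip the bad record, and continue with the next one.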

`pip install` is broken on Linux and OSX

On Ubuntu:

$ uname -a
Linux vm-home0.archive.org 3.5.0-32-generic #53~precise1-Ubuntu SMP Wed May 29 20:33:37 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

$ pip install warctools
Downloading/unpacking warctools
  Could not find a version that satisfies the requirement warctools (from versions: 4.7.macosx-10.8-intel)

On OS X:

$ uname -a
Darwin rajs-MacBook-Air-2.local 12.5.0 Darwin Kernel Version 12.5.0: Mon Jul 29 16:33:49 PDT 2013; root:xnu-2050.48.11~1/RELEASE_X86_64 x86_64


$ pip install warctools
Downloading/unpacking warctools
  Downloading warctools-4.7.macosx-10.8-intel.tar.gz (54kB): 54kB downloaded
  Running setup.py egg_info for package warctools
    Traceback (most recent call last):
      File "<string>", line 16, in <module>
    IOError: [Errno 2] No such file or directory: '/private/tmp/testenv/build/warctools/setup.py'
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 16, in <module>
    IOError: [Errno 2] No such file or directory: '/private/tmp/testenv/build/warctools/setup.py'

Cannot use warc library to open WAT files

Hello,
I am trying to use the warctools library to open a WAT file, which is a special type of WARC file that contains metadata. When I try to read in a file I get this:

('Reading file: ', 'sampleWAT.wat')
Traceback (most recent call last):
  File "createReports.py", line 104, in <module>
    main(sys.argv[1])
  File "createReports.py", line 101, in main
    readWARC(argv)
  File "createReports.py", line 18, in readWARC
    warcFile = warc.open(fileName)
  File "/usr/local/lib/python2.7/dist-packages/warc/__init__.py", line 38, in open
    raise IOError("Don't know how to open '%s' files"%format)
IOError: Don't know how to open 'unknown' files

Is there a way to open files with a *.wat extension without this error and without having to rename them to *.warc? Perhaps a flag that turns off file validation? I would appreciate this.
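
Since a WAT file uses the WARC record format, one workaround is to decide how to open a file from its first bytes, not its name. The sketch below uses only the standard library; `sniff_warc` is a hypothetical helper and not part of any WARC package.

```python
import gzip


def sniff_warc(path):
    """Return True if the file at `path` starts like a WARC record.

    Looks at content, not the filename extension, so .wat files are
    recognised as well.
    """
    with open(path, "rb") as raw:
        magic = raw.read(2)
    # gzip streams start with the bytes 1f 8b.
    opener = gzip.open if magic == b"\x1f\x8b" else open
    with opener(path, "rb") as stream:
        return stream.read(5) == b"WARC/"
```

If the check passes, the file can be handed to whichever reader accepts an already-open stream or an explicit format argument.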

Error when running warcfilter

Any insight into this error? Is it data-related, or a code/version problem?

/opt/warctools/bin/warcfilter -H text /home/test/test.warc
Traceback (most recent call last):
  File "/opt/warctools/bin/warcfilter", line 8, in <module>
    sys.exit(run())
  File "/opt/warctools/lib/python2.7/site-packages/hanzo/warcfilter.py", line 121, in run
    sys.exit(main(sys.argv))
  File "/opt/warctools/lib/python2.7/site-packages/hanzo/warcfilter.py", line 71, in main
    filter_archive(fh, options, pattern,out)
  File "/opt/warctools/lib/python2.7/site-packages/hanzo/warcfilter.py", line 95, in filter_archive
    code, content_type, message = parse_http_response(record)
  File "/opt/warctools/lib/python2.7/site-packages/hanzo/warcfilter.py", line 34, in parse_http_response
    logging.warning('trailing data in http response for %s'% record.url)
NameError: global name 'logging' is not defined

/opt/warctools/bin/warcfilter --help
Usage: warcfilter [options] pattern warc warc warc

Options:
  -h, --help            show this help message and exit
  -l LIMIT, --limit=LIMIT
                        limit (ignored)
  -I INPUT_FORMAT, --input=INPUT_FORMAT
                        input format (ignored)
  -i, --invert          invert match
  -U, --url             match on url
  -T, --type            match on (warc) record type
  -C, --content-type    match on (warc) record content type
  -H, --http-content-type
                        match on http payload content type
  -D, --warc-date       match on WARC-Date header
  -L LOG_LEVEL, --log-level=LOG_LEVEL
                        log level(ignored)

Any help would be appreciated!

Streaming interface to warc files.

  • Avoid parsing entirety of warc file
  • Don't parse http records inside

Any improvements that let very large WARC files be read and processed quickly.
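
A sketch of what such an interface could look like, assuming an uncompressed, seekable file. It yields only the WARC headers and the offset of each record, and skips every block with `seek()` so no payload or inner HTTP message is read. The name `iter_record_headers` is hypothetical.

```python
import io


def iter_record_headers(stream):
    """Lazily yield (offset, headers) for each record in a seekable stream.

    Record blocks are skipped with seek(), so no payload is read or parsed.
    """
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines separating records
        if not line.startswith(b"WARC/"):
            raise ValueError("not a WARC record at offset %d" % offset)
        headers = {}
        for line in iter(stream.readline, b""):
            if not line.strip():
                break  # a blank line ends the header section
            name, _, value = line.partition(b":")
            headers[name.strip().lower()] = value.strip()
        # Jump over the block without reading it.
        stream.seek(int(headers.get(b"content-length", b"0")), io.SEEK_CUR)
        yield offset, headers
```

Because it is a generator, a caller can stop after the first few records without touching the rest of the file. On gzip-compressed input, seeking is emulated by decompressing, so the saving there would be smaller.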
