Comments (7)

ikreymer commented on May 27, 2024

Hi, was the WARC created in uncompressed form and then gzipped afterwards?
I believe I've seen this error when using a plain .warc file that was later gzipped.

The .warc.gz files created by archiving tools (such as wget, warcprox, Heritrix, etc.) contain individually gzipped records, which are then concatenated together. This allows pywb to seek to each record without unzipping the whole file.

If the whole WARC is gzipped as a single stream, there is unfortunately no way to seek to a record in the middle of the file, so indexing it this way is not useful and is not supported.
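
To illustrate the difference, here is a minimal sketch in Python using only the standard library. It is not pywb's actual code; the function names `write_records_gzipped` and `read_record_at` are made up for this example. Each record is written as its own gzip member, and its starting offset is remembered, which is roughly the information a CDX index stores.

```python
import gzip
import io
import zlib


def write_records_gzipped(records):
    """Compress each record as its own gzip member and concatenate them.

    Returns the resulting bytes and the offset at which each member starts.
    """
    out = io.BytesIO()
    offsets = []
    for record in records:
        offsets.append(out.tell())
        out.write(gzip.compress(record))
    return out.getvalue(), offsets


def read_record_at(blob, offset):
    """Decompress only the gzip member that starts at `offset`."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    # Decompression stops at the end of this member; whatever follows
    # is left untouched in decomp.unused_data.
    return decomp.decompress(blob[offset:])


records = [b"first record\r\n\r\n", b"second record\r\n\r\n"]
blob, offsets = write_records_gzipped(records)
second = read_record_at(blob, offsets[1])
```

With a file compressed as one gzip stream there is only one valid starting offset (0), so reaching the second record means decompressing everything before it.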

But pywb should definitely detect such a WARC and give a better error message.

from pywb.

vschiavoni commented on May 27, 2024

The input warc.gz can be found here:

http://lemurproject.org/clueweb12/specs.php

(see the link to this file http://lemurproject.org/clueweb12/0013wb-88.warc.gz )

ikreymer commented on May 27, 2024

Taking a look at the WARC, it does indeed appear that it is not compressed record by record, but is a single gzip stream. For now, I'd say you should use the uncompressed version. Perhaps the WARC was uncompressed and then recompressed at some point? That would cause this issue. I've also checked this WARC against other tools and they also return errors.

I'll leave this open to improve the error messaging when dealing with such a WARC in pywb.
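
As a rough idea of what such a check could look like, here is a sketch in Python (standard library only, not pywb code; `count_gzip_members` is a hypothetical name). It counts the gzip members in a file without keeping the decompressed data. A large multi-record WARC that reports exactly one member was most likely gzipped as a whole. This is only a heuristic, since a valid .warc.gz holding a single record also has one member.

```python
import gzip
import io
import zlib


def count_gzip_members(fileobj, chunk_size=64 * 1024):
    """Count concatenated gzip members in a binary file object."""
    count = 0
    decomp = None
    buf = b""
    while True:
        if not buf:
            buf = fileobj.read(chunk_size)
            if not buf:
                break
        if decomp is None:
            # 16 + MAX_WBITS: expect a gzip header and trailer.
            decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
        decomp.decompress(buf)  # decompressed output is discarded
        if decomp.eof:
            # This member ended; leftover bytes belong to the next one.
            count += 1
            buf = decomp.unused_data
            decomp = None
        else:
            buf = b""
    if decomp is not None:
        raise ValueError("file ends in the middle of a gzip member")
    return count


per_record = gzip.compress(b"record one") + gzip.compress(b"record two")
whole_file = gzip.compress(b"record one" + b"record two")
members_per_record = count_gzip_members(io.BytesIO(per_record))
members_whole_file = count_gzip_members(io.BytesIO(whole_file))
```

For a file on disk, the same function can be called with `open(path, "rb")`.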

Also, there should probably be a separate utility to help you compress the WARC properly, as running 'gzip' on it will not work. Maybe gzip-warc or something like that, which would compress each record individually.

ikreymer commented on May 27, 2024

I took a brief look at CreateClueWeb12B13Dataset.java from http://lemurproject.org/clueweb12/ClueWeb12-CreateB13.php
If this was the code used to create these WARCs, I believe it is causing the issue.
It looks like it's building a buffer of records and putting them into a single gzip file, which causes the problem. Instead, it needs to create a GZIPOutputStream per record and then write each one to the file.
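
The per-record approach can be sketched as follows. This is Python rather than the Java of the original tool, and it is not the ClueWeb12 code: `iter_warc_records` and `recompress_per_record` are hypothetical helpers, and the parser assumes every record has a well-formed `Content-Length` header, which real-world WARCs do not always guarantee.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield each raw record from an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return
        if line in (b"\r\n", b"\n"):
            continue  # blank lines separating records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line")
        header = [line]
        length = None
        while True:
            line = stream.readline()
            if not line:
                raise ValueError("file ended inside WARC headers")
            header.append(line)
            if line in (b"\r\n", b"\n"):
                break  # end of headers
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip())
        if length is None:
            raise ValueError("record has no Content-Length")
        block = stream.read(length)
        if len(block) != length:
            raise ValueError("file ended inside a record block")
        # Each record is terminated by two CRLFs.
        yield b"".join(header) + block + b"\r\n\r\n"


def recompress_per_record(src, dst):
    """Write every record from `src` to `dst` as its own gzip member."""
    count = 0
    for record in iter_warc_records(src):
        dst.write(gzip.compress(record))
        count += 1
    return count


sample = (
    b"WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n"
    b"WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: 3\r\n\r\nabc\r\n\r\n"
)
output = io.BytesIO()
written = recompress_per_record(io.BytesIO(sample), output)
```

The key point is that the compressor is created once per record instead of once per file, so every record starts at an offset that can be decompressed on its own.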

ikreymer commented on May 27, 2024

Oh, the warctools package (https://github.com/internetarchive/warctools) has such a tool, called warc2warc, which will properly gzip the WARC:

gunzip /tmp/0013wb-88.warc.gz
warc2warc /tmp/0013wb-88.warc > /tmp/0013wb-88.warc.gz
cdx-indexer -s /tmp/0013wb-88.warc.gz

You can use this to fix the WARCs and still have a compressed version.

anarcat commented on May 27, 2024

i'm having the same issue, but when indexing a bunch of files added at once - and i don't know which one is triggering the fault. is there a way to find out?

anarcat commented on May 27, 2024

i've moved that conversation to #411, sorry for the noise.