n0tan3rd / node-warc

Parse And Create Web ARChive (WARC) files with node.js

License: MIT License

JavaScript 100.00%

webarchive webarchiving web-archives warc-files warc web-archiving pupeteer chrome-remote-interface

node-warc's Issues

Typescript definitions not working

I'm trying to use your library with TypeScript (v3.6), and my project fails to compile. I think the problem is that the provided definitions in index.d.ts:

use default parameter values in function declarations
reference puppeteer, which is only a dev dependency even though it should be a dependency

At least that's what I gathered from a quick scan of the following error log:

node_modules/node-warc/index.d.ts:2:23 - error TS2688: Cannot find type definition file for 'puppeteer'.

2 /// <reference types="puppeteer" />
                        ~~~~~~~~~

node_modules/node-warc/index.d.ts:8:29 - error TS2307: Cannot find module 'puppeteer'.

8 import * as puppeteer  from 'puppeteer'
                              ~~~~~~~~~~~

node_modules/node-warc/index.d.ts:221:41 - error TS1015: Parameter cannot have question mark and initializer.

221     constructor (page?: puppeteer.Page, requestEvent?: string = 'request');
                                            ~~~~~~~~~~~~

node_modules/node-warc/index.d.ts:221:41 - error TS2371: A parameter initializer is only allowed in a function or constructor implementation.

221     constructor (page?: puppeteer.Page, requestEvent?: string = 'request');
                                            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

node_modules/node-warc/index.d.ts:222:35 - error TS1015: Parameter cannot have question mark and initializer.

222     attach (page: puppeteer.Page, requestEvent?: string = 'request'): void;
                                      ~~~~~~~~~~~~

node_modules/node-warc/index.d.ts:222:35 - error TS2371: A parameter initializer is only allowed in a function or constructor implementation.

222     attach (page: puppeteer.Page, requestEvent?: string = 'request'): void;
                                      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

node_modules/node-warc/index.d.ts:223:35 - error TS1015: Parameter cannot have question mark and initializer.

223     detach (page: puppeteer.Page, requestEvent?: string = 'request'): void;
                                      ~~~~~~~~~~~~

node_modules/node-warc/index.d.ts:223:35 - error TS2371: A parameter initializer is only allowed in a function or constructor implementation.

223     detach (page: puppeteer.Page, requestEvent?: string = 'request'): void;
                                      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

node_modules/node-warc/index.d.ts:251:35 - error TS1015: Parameter cannot have question mark and initializer.

251     constructor (page?: CRIEPage, requestEvent?: string = 'request');
                                      ~~~~~~~~~~~~

node_modules/node-warc/index.d.ts:251:35 - error TS2371: A parameter initializer is only allowed in a function or constructor implementation.

251     constructor (page?: CRIEPage, requestEvent?: string = 'request');
                                      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

node_modules/node-warc/index.d.ts:252:29 - error TS1015: Parameter cannot have question mark and initializer.

252     attach (page: CRIEPage, requestEvent?: string = 'request'): void;
                                ~~~~~~~~~~~~

node_modules/node-warc/index.d.ts:252:29 - error TS2371: A parameter initializer is only allowed in a function or constructor implementation.

252     attach (page: CRIEPage, requestEvent?: string = 'request'): void;
                                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

node_modules/node-warc/index.d.ts:253:29 - error TS1015: Parameter cannot have question mark and initializer.

253     detach (page: CRIEPage, requestEvent?: string = 'request'): void;
                                ~~~~~~~~~~~~

node_modules/node-warc/index.d.ts:253:29 - error TS2371: A parameter initializer is only allowed in a function or constructor implementation.

253     detach (page: CRIEPage, requestEvent?: string = 'request'): void;
                                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The content in the warcRecord includes the trailing \r\n

The 'content' ArrayBuffer in the record appears to include the trailing \r\n.
I tested this with compressed WARCs; it may not be the case for uncompressed ones.
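
Until the parser handles this itself, one possible workaround is to cut the content down to the length declared in the record's Content-Length header. The sketch below assumes the content is a Node.js Buffer; trimToContentLength is a made-up helper, not part of node-warc.

```javascript
// Hypothetical helper, not part of node-warc: cut a record's content block
// down to the length declared in its WARC Content-Length header.
function trimToContentLength (content, declaredLength) {
  const length = Number.parseInt(declaredLength, 10)
  // Leave the buffer alone if the header is missing, invalid, or not shorter
  // than the content we were given.
  if (Number.isNaN(length) || length < 0 || length >= content.length) {
    return content
  }
  // subarray returns a view on the same memory, so nothing is copied.
  return content.subarray(0, length)
}

const raw = Buffer.from('hello world\r\n')
const trimmed = trimToContentLength(raw, '11')
console.log(JSON.stringify(trimmed.toString('utf8')))
```

Trimming by declared length rather than stripping a literal \r\n avoids damaging payloads that legitimately end with a line break.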

Every response has the same Record ID

I was testing the new features of the library, especially the Puppeteer request capturer and the WARC generator, along with headless-chrome-crawler, using the following script:

const HCCrawler = require('headless-chrome-crawler')
const { PuppeteerCapturer, PuppeteerWARCGenerator } = require('node-warc')

const warc = new PuppeteerWARCGenerator()
warc.initWARC('./test.warc', {appending: true})

const run = async () => {
  const crawler = await HCCrawler.launch({
    customCrawl: async (page, crawl) => {
      const capture = new PuppeteerCapturer(page)
      await page.setRequestInterception(true)

      page.on('request', request => {
        capture.requestWillBeSent(request)
        request.continue()
      })

      const result = await crawl()

      for (let req of capture.iterateRequests()) {
        await warc.generateWarcEntry(req)
      }

      return result
    },
    maxDepth: 0
  })

  await crawler.queue({url: 'http://books.toscrape.com', skipDuplicates: true})
  await crawler.onIdle()
  warc.end()
  await crawler.close()
}


run()

It creates the WARC file without any errors, but when you look into it, all the WARC-Record-ID fields are the same.
Because of this, all the WARC-Concurrent-To fields are the same too.

One way to fix it is to create the generator, init it, write the request and close it for each request like this:

for (let req of capture.iterateRequests()) {
    const warc = new PuppeteerWARCGenerator()
    warc.initWARC('./test5.warc', {appending: true})
    await warc.generateWarcEntry(req)
    warc.end()
}

But that is very inefficient, since the WARC file is reopened and closed for every request.
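
For reference, the expected behaviour can be sketched independently of the library: every record asks for a fresh ID at the moment its headers are built. The helper names below are made up for illustration, the second URL is invented, and nothing here claims to be how node-warc generates IDs internally.

```javascript
const { randomUUID } = require('crypto') // needs Node.js 14.17 or newer

// Hypothetical helper: format a new UUID the way WARC-Record-ID values
// appear in the records quoted on this page.
function newRecordId () {
  return `<urn:uuid:${randomUUID()}>`
}

// Build header objects for several records, asking for a new ID each time
// rather than reusing one generated earlier.
function headersFor (targetURIs) {
  return targetURIs.map(uri => ({
    'WARC-Type': 'response',
    'WARC-Record-ID': newRecordId(),
    'WARC-Target-URI': uri
  }))
}

const headers = headersFor([
  'http://books.toscrape.com/',
  'http://books.toscrape.com/static/style.css' // made-up second URL
])
console.log(headers.map(h => h['WARC-Record-ID']))
```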

httpHeaders Set-Cookie is single string

When a server response in a WARC record contains multiple Set-Cookie headers, the value of httpHeaders['Set-Cookie'] is always the last one in the list. It should be returned as an array of the Set-Cookie headers if that change doesn't break other things, or there should be another method to get all of the cookies from the headers block. Another option would be to keep the line endings (\n) between the values, so the result is still a string but can be split if needed.

Example WARC record (minus the content block):

WARC/1.0
WARC-Type: request
WARC-Date: 2019-06-15T21:54:45Z
WARC-Record-ID: <urn:uuid:1e7aaba9-c5b9-49cd-b0a8-6a4d7460c9b3>
Content-Length: 296
Content-Type: application/http; msgtype=request
WARC-Warcinfo-ID: <urn:uuid:07d8abda-2416-492c-b139-8fb526d5f792>
WARC-IP-Address: 95.216.246.36
WARC-Target-URI: https://www.bpazar.com/index.php?route=product/search&search=Sarj&page=4

GET /index.php?route=product/search&search=Sarj&page=4 HTTP/1.1
User-Agent: CCBot/2.0 (https://commoncrawl.org/faq/)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Host: www.bpazar.com
Connection: Keep-Alive
Accept-Encoding: gzip



WARC/1.0
WARC-Type: response
WARC-Date: 2019-06-15T21:54:45Z
WARC-Record-ID: <urn:uuid:3f3d6e43-9e5d-42ba-a111-43fcd90dd633>
Content-Length: 1043231
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:07d8abda-2416-492c-b139-8fb526d5f792>
WARC-Concurrent-To: <urn:uuid:1e7aaba9-c5b9-49cd-b0a8-6a4d7460c9b3>
WARC-IP-Address: 95.216.246.36
WARC-Target-URI: https://www.bpazar.com/index.php?route=product/search&search=Sarj&page=4
WARC-Payload-Digest: sha1:N2WQFAUYKKXT6MRWSCXCQC7FOZRQCLTI
WARC-Block-Digest: sha1:S3FKWWFJ7LCYFOHUZ4RBPFAMYNQSVQMH
WARC-Identified-Payload-Type: text/html

HTTP/1.1 200 OK
Server: nginx
Date: Sat, 15 Jun 2019 21:54:44 GMT
Content-Type: text/html; charset=UTF-8
X-Crawler-Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
Set-Cookie: OCSESSID=d4163e3479bec29a507792acc4; path=/
Set-Cookie: OCSESSID=57bfbd42e2fe9d4d5af66485f7; path=/
Set-Cookie: language=tr-tr; expires=Mon, 15-Jul-2019 21:54:40 GMT; Max-Age=2592000; path=/; domain=tr-tr
Set-Cookie: currency=TRY; expires=Mon, 15-Jul-2019 21:54:40 GMT; Max-Age=2592000; path=/; domain=www.bpazar.com
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
X-Nginx-Cache-Status: BYPASS
X-Server-Powered-By: Engintron
X-Crawler-Content-Encoding: gzip

Response from console.log(record.httpHeaders); when used in the record callback:

{ Server: 'nginx',
  Date: 'Sat, 15 Jun 2019 21:54:44 GMT',
  'Content-Type': 'text/html; charset=UTF-8',
  'X-Crawler-Transfer-Encoding': 'chunked',
  Connection: 'keep-alive',
  Vary: 'Accept-Encoding',
  'Set-Cookie':
   'currency=TRY; expires=Mon, 15-Jul-2019 21:54:40 GMT; Max-Age=2592000; path=/; domain=www.bpazar.com',
  'X-XSS-Protection': '1; mode=block',
  'X-Content-Type-Options': 'nosniff',
  'X-Nginx-Cache-Status': 'BYPASS',
  'X-Server-Powered-By': 'Engintron',
  'X-Crawler-Content-Encoding': 'gzip' }
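
The array behaviour requested above could look roughly like the sketch below. parseHttpHeaders is a hypothetical function written for this illustration, not node-warc's parser, and it expects only the header lines (no status or request line).

```javascript
// Hypothetical parser, not node-warc's implementation: turn a block of
// "Name: value" lines into an object, keeping every repeated header.
function parseHttpHeaders (block) {
  const headers = Object.create(null) // no prototype, so any name is safe
  for (const line of block.split(/\r?\n/)) {
    const sep = line.indexOf(':')
    if (sep <= 0) continue // skip blank lines and lines without a name
    const name = line.slice(0, sep).trim()
    const value = line.slice(sep + 1).trim()
    if (!(name in headers)) {
      headers[name] = value // first occurrence stays a plain string
    } else if (Array.isArray(headers[name])) {
      headers[name].push(value)
    } else {
      headers[name] = [headers[name], value] // second occurrence: upgrade to array
    }
  }
  return headers
}

const parsed = parseHttpHeaders([
  'Server: nginx',
  'Set-Cookie: OCSESSID=d4163e3479bec29a507792acc4; path=/',
  'Set-Cookie: OCSESSID=57bfbd42e2fe9d4d5af66485f7; path=/',
  'X-XSS-Protection: 1; mode=block'
].join('\r\n'))
console.log(parsed['Set-Cookie'])
```

A header that appears once stays a string, so existing callers would mostly keep working; the trade-off is that callers must check for both types. Node's own http module sidesteps this by always exposing set-cookie as an array.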

AutoWARCParser is not a constructor

I'm trying to use AutoWARCParser, but when I run the example from the readme.md:

const {AutoWARCParser} = require('node-warc');

const parser = new AutoWARCParser('/path/to/my.warc.gz');
parser.on('record', record => { console.log(record); });
parser.on('done', () => { console.log('finished'); });
parser.on('error', error => { console.error(error); });
parser.start();

I get the following error when running the test_warc.js file:

/home/test_warc.js:4
const parser = new AutoWARCParser(
               ^

TypeError: AutoWARCParser is not a constructor
    at Object.<anonymous> (/home/test_warc.js:4:16)
    at Module._compile (internal/modules/cjs/loader.js:776:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:787:10)
    at Module.load (internal/modules/cjs/loader.js:653:32)
    at tryModuleLoad (internal/modules/cjs/loader.js:593:12)
    at Function.Module._load (internal/modules/cjs/loader.js:585:3)
    at Function.Module.runMain (internal/modules/cjs/loader.js:829:12)
    at startup (internal/bootstrap/node.js:283:19)
    at bootstrapNodeJSCore (internal/bootstrap/node.js:622:3)

# node -v
v10.16.0

# npm -v
6.10.1

# lsb_release -a
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.2 LTS
Release:        18.04
Codename:       bionic

WARC Parsing sometimes results in truncated records.

The WARC parsing sometimes results in records being truncated.

This might be due to the parser continuing to look for newlines/read one line at a time, even when parsing the content body, and might be happening if there is a \r\n\r\n encountered in the body of the record.

The issue can be seen by running:

const { AutoWARCParser } = require('node-warc');
  
(async () => {
  for await (const record of new AutoWARCParser('test1.warc.gz')) {
    console.log(record.content.toString('utf-8'));
  }
})();

With these example files:
test1.warc.gz
(last couple of bytes are cut-off)

test2.warc.gz
(most of the file is cut-off after initial comment)

For comparison, the warcio version prints the full record:

from warcio import ArchiveIterator
  
for record in ArchiveIterator(open('./test1.warc.gz', 'rb')):
    print(record.content_stream().read().decode('utf-8'))
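
If the line-by-line guess above is right, a length-based approach would avoid the problem: find the end of the WARC header once, then take exactly Content-Length bytes as the body regardless of what they contain. The following is only a sketch over an already decompressed buffer; readRecord is a hypothetical function, not the library's parser.

```javascript
const HEADER_END = Buffer.from('\r\n\r\n')

// Hypothetical reader, not node-warc's parser: read one record starting at
// `offset` in an uncompressed WARC buffer.
function readRecord (buf, offset = 0) {
  // The first blank line ends the WARC header; nothing after it is scanned.
  const headerEnd = buf.indexOf(HEADER_END, offset)
  if (headerEnd === -1) return null
  const headerText = buf.toString('utf8', offset, headerEnd)
  const match = /^Content-Length:\s*(\d+)\s*$/im.exec(headerText)
  if (!match) throw new Error('record has no Content-Length header')
  const bodyStart = headerEnd + HEADER_END.length
  const bodyEnd = bodyStart + Number(match[1])
  if (bodyEnd > buf.length) throw new Error('record is incomplete')
  return {
    headerText,
    // Exactly Content-Length bytes, even if they contain \r\n\r\n.
    content: buf.subarray(bodyStart, bodyEnd),
    // Each record is followed by two CRLFs, which are skipped here.
    next: bodyEnd + HEADER_END.length
  }
}

const body = 'HTTP/1.1 200 OK\r\n\r\nfirst part\r\n\r\nsecond part'
const warc = Buffer.from(
  `WARC/1.0\r\nWARC-Type: response\r\nContent-Length: ${Buffer.byteLength(body)}\r\n\r\n${body}\r\n\r\n`
)
console.log(readRecord(warc).content.toString('utf8') === body)
```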

Add re-gzip or re-deflate to serialization

see N0taN3rd/Squidwarc#14

TypeError on Puppeteer example: Cannot read property 'Page' of undefined

After copying and pasting the Puppeteer example from the readme, replacing only the URL, I get UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'Page' of undefined.

I've tried Puppeteer v1.2.x and v2.0.0 with the same error. Pardon any Puppeteer ignorance on my end, but I don't think the library exports an Events object.

However, const { Events } = require('chrome-remote-interface-extra') seems to do the trick: it brings in the needed Events, which puppeteer does not.

WARCStreamTransform can experience ERR_MULTIPLE_CALLBACK error

node-warc/lib/parsers/warcStreamTransform.js

Lines 110 to 115 in be38971

  _flush (done) {
    if (this.buffered) {
      this._consumeChunk(this.buffered, done, true)
    }
    done()
  }

Since _consumeChunk calls done, I think this should be:

  _flush (done) {
    if (this.buffered) {
      this._consumeChunk(this.buffered, done, true)
    } else {
      done()
    }
  }
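
The one-callback rule can be checked in isolation with a small stand-in class. BufferingTransform and countFlushCallbacks below are invented for this sketch and only mimic the shape of the library code: _consumeChunk calls done itself, so _flush must not call it again.

```javascript
const { Transform } = require('stream')

// Hypothetical stand-in for WARCStreamTransform: it only buffers input and
// emits everything on flush, so the callback handling is easy to follow.
class BufferingTransform extends Transform {
  constructor () {
    super()
    this.buffered = null
  }

  _transform (chunk, encoding, done) {
    this.buffered = this.buffered ? Buffer.concat([this.buffered, chunk]) : chunk
    done()
  }

  // Like _consumeChunk in the issue, this helper calls done itself.
  _consumeChunk (chunk, done, last) {
    this.push(chunk)
    if (last) this.buffered = null
    done()
  }

  // Exactly one path reaches done: via _consumeChunk or directly.
  _flush (done) {
    if (this.buffered) {
      this._consumeChunk(this.buffered, done, true)
    } else {
      done()
    }
  }
}

// Count how often the flush callback fires for a given instance.
function countFlushCallbacks (transform) {
  let calls = 0
  transform._flush(() => { calls++ })
  return calls
}

const withData = new BufferingTransform()
withData.buffered = Buffer.from('leftover bytes')
console.log(countFlushCallbacks(withData), countFlushCallbacks(new BufferingTransform()))
```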

Capturing two URLs are not being properly read by Webrecorder Player?

I successfully (I think) captured and generated a warc file using https://electronjs.org/docs/api/debugger.

I tried a simple site: www.drupal.org

If I capture the first load, it seems to work nicely; Webrecorder Player shows it perfectly.

However, if I navigate to "Developers" and then store both the homepage and this page in the WARC file, it doesn't seem to work, although I can see the data in the WARC file.

I guess something is missing in the WARC file, or I am missing something. Any ideas?

Other than that, I am super happy to see this working. It might even be worth contributing this WARC generator to this package.

Investigate generating WARCS through headless Chromium / puppeteer

https://github.com/GoogleChrome/puppeteer/ seems to be the hot thing today for rendering web pages in headless/remote-controlled environments. It would be great to add this as an option for generating WARCs, i.e. a PuppeteerWARCGenerator. One main advantage is that it installs seamlessly through npm and runs on headless boxes, so the setup effort is minimized.

I will investigate this in my current project and may come back with a PR. I just wanted to collect any thoughts that may already have been given to this.

Some WARCs lack bookmarks according to Webrecorder player

Archived the following URL: https://www.facebook.com/socialdemokraternailjusdal/photos/a.1409766429240263/2084031858480380/?type=3&__xts__%5B0%5D=68.ARA3blS6QVatnljfKg2ED3yFCSVs2fEjWVC085o9H1oNPpiSDeld4Iu5HfWS59RvuteqLBXXBZZj0oN9I8r0S7RxjC_W77aYdiOtyPeaVCRfYm0O1rgzzqnYDIZTXDJEYPG-XJ0dpOoaGR8JI0JbP6NPCTYXaKKEPUUUKg1XihsVouag0W91ra3-Rqr-TpDrPm96rVOvgjIy8oe5Kse0ZV50kJ65pwWhKvBxm7bMoyTo1fsXAkK6sYdaM_iQhbT7PO25qk6VUbbrTlHSu5i7a3idF2huVM4KM7s-LaOZMPztninlNYMFjCjbJpOeK8wgNUrcXdzwPLsS3iYZ4-D4RcYwPSsU&__tn__=-R

Opening the resulting WARC in Webrecorder Player shows "No bookmarks available in the table".

Maybe the URL is too long or not escaped properly? Other (shorter) URLs seem to work fine.

Example zipped WARC file below.

fbtest.warc.zip

Feature request: Add fields to warcinfo

The spec says:

Allowable fields include, but are not limited to...

Currently it only accepts isPartOf, UserAgent and description. It would be useful to add software-related fields, to change the robots field, and to add some extra info to the software line.

node-warc/lib/writers/warcFields.js

Lines 175 to 189 in 09a8a0b

function warcInfoContent ({ version, isPartOfV, warcInfoDescription, ua }) {
  const base = [
    `software: node-warc/${version}${CRLF}format: WARC File Format ${WARCV}${CRLF}robots: ignore${CRLF}`
  ]
  if (isPartOfV != null) {
    base.push(`isPartOf: ${isPartOfV}${CRLF}`)
  }
  if (warcInfoDescription != null) {
    base.push(`description: ${warcInfoDescription}${CRLF}`)
  }
  if (ua != null) {
    base.push(`http-header-user-agent: ${ua}${CRLF}`)
  }
  return base.join('')
}
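
One way the request could be met is sketched below. The software, robots and extraFields options are invented for this illustration and are not part of the current signature; CRLF and WARCV are redefined locally so the snippet runs on its own, with WARCV set to an assumed value.

```javascript
const CRLF = '\r\n'
const WARCV = '1.0' // assumed value; the library defines its own constant

// Hypothetical variant of warcInfoContent: `software`, `robots` and
// `extraFields` are made-up options, not part of the current signature.
function warcInfoContent ({
  version,
  isPartOfV,
  warcInfoDescription,
  ua,
  software = `node-warc/${version}`, // lets callers extend the software line
  robots = 'ignore', // same default as the existing function
  extraFields = {} // any further "name: value" pairs
}) {
  const base = [
    `software: ${software}${CRLF}format: WARC File Format ${WARCV}${CRLF}robots: ${robots}${CRLF}`
  ]
  if (isPartOfV != null) {
    base.push(`isPartOf: ${isPartOfV}${CRLF}`)
  }
  if (warcInfoDescription != null) {
    base.push(`description: ${warcInfoDescription}${CRLF}`)
  }
  if (ua != null) {
    base.push(`http-header-user-agent: ${ua}${CRLF}`)
  }
  // Append caller-supplied fields, skipping ones without a value.
  for (const [name, value] of Object.entries(extraFields)) {
    if (value != null) base.push(`${name}: ${value}${CRLF}`)
  }
  return base.join('')
}

console.log(warcInfoContent({
  version: '1.0.0', // placeholder version string
  robots: 'classic',
  extraFields: { operator: 'Example Archive' }
}))
```

Keeping the existing defaults means callers that pass none of the new options would get the same output as before.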

WARCWriterBase Record-ID and Concurrent-To should be the same

A few days ago I was trying to use headless-chrome-crawler along with node-warc to generate WARCs from headless Chrome, as #2 suggests. I ran a few tests, started tinkering with the code, and ended up exposing the WARCWriterBase class and using it in the code that can be seen in yujiosaka/headless-chrome-crawler#118 (comment).

But when the WARCs were generated I noticed that the WARC-Concurrent-To of the request didn't match the WARC-Record-ID of the response, and that the writeResponseRecord method generates its own record ID.
So shouldn't this line use the object's record ID?

node-warc/lib/writers/warcWriterBase.js

Line 157 in a9438cd

rid: uuid(),

That way writeRequestRecord could use the record ID of the response as the WARC-Concurrent-To field and generate its own record ID.

Every Concurrent-To field is null.

I don't know if you've seen my latest comment in issue #18, so I'm opening this new one just in case.

The fix for #18 broke the WARC-Concurrent-To field, making it <urn:uuid:null>. I used the same code to test this as in #18.
This issue may be related to #4.

According to ISO 28500:

This field may be used to associate records of types 'request', 'response', 'resource', 'metadata', and 'revisit' with one another when they arise from a single capture event. (When so used, any WARC-Concurrent-To association shall be considered bidirectional even if the header only appears on one)

So it should be the same if the field is either in the 'request' or the 'response' record.
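
A minimal sketch of the intended linkage, assuming both IDs are created before either record is written: the request's WARC-Concurrent-To carries the response's WARC-Record-ID, and since the association is bidirectional the response needs no such header. pairHeaders is a made-up helper, not a node-warc function.

```javascript
const { randomUUID } = require('crypto') // needs Node.js 14.17 or newer

// Hypothetical helper: build the WARC headers for a request/response pair
// that came from a single capture event.
function pairHeaders (targetURI) {
  // Both IDs exist up front, so neither record can end up pointing at null.
  const requestId = `<urn:uuid:${randomUUID()}>`
  const responseId = `<urn:uuid:${randomUUID()}>`
  return {
    request: {
      'WARC-Type': 'request',
      'WARC-Record-ID': requestId,
      'WARC-Concurrent-To': responseId, // points at the response record
      'WARC-Target-URI': targetURI
    },
    response: {
      'WARC-Type': 'response',
      'WARC-Record-ID': responseId,
      'WARC-Target-URI': targetURI
    }
  }
}

const pair = pairHeaders('http://books.toscrape.com/')
console.log(pair.request['WARC-Concurrent-To'] === pair.response['WARC-Record-ID'])
```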
