node-warc's People

Contributors

adrianloer, bubuanabelas, hyl, n0tan3rd, orzfly

node-warc's Issues

Typescript definitions not working

I'm trying to use your library with TypeScript (v3.6), and my project fails to compile. I think the problem is that the definitions provided in index.d.ts:

  1. use default parameters in function declarations
  2. import puppeteer, which is listed as a dev dependency even though it should be a dependency

At least that's what I gathered from a quick scan of the following error log:

node_modules/node-warc/index.d.ts:2:23 - error TS2688: Cannot find type definition file for 'puppeteer'.

2 /// <reference types="puppeteer" />
                        ~~~~~~~~~

node_modules/node-warc/index.d.ts:8:29 - error TS2307: Cannot find module 'puppeteer'.

8 import * as puppeteer  from 'puppeteer'
                              ~~~~~~~~~~~

node_modules/node-warc/index.d.ts:221:41 - error TS1015: Parameter cannot have question mark and initializer.

221     constructor (page?: puppeteer.Page, requestEvent?: string = 'request');
                                            ~~~~~~~~~~~~

node_modules/node-warc/index.d.ts:221:41 - error TS2371: A parameter initializer is only allowed in a function or constructor implementation.

221     constructor (page?: puppeteer.Page, requestEvent?: string = 'request');
                                            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

node_modules/node-warc/index.d.ts:222:35 - error TS1015: Parameter cannot have question mark and initializer.

222     attach (page: puppeteer.Page, requestEvent?: string = 'request'): void;
                                      ~~~~~~~~~~~~

node_modules/node-warc/index.d.ts:222:35 - error TS2371: A parameter initializer is only allowed in a function or constructor implementation.

222     attach (page: puppeteer.Page, requestEvent?: string = 'request'): void;
                                      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

node_modules/node-warc/index.d.ts:223:35 - error TS1015: Parameter cannot have question mark and initializer.

223     detach (page: puppeteer.Page, requestEvent?: string = 'request'): void;
                                      ~~~~~~~~~~~~

node_modules/node-warc/index.d.ts:223:35 - error TS2371: A parameter initializer is only allowed in a function or constructor implementation.

223     detach (page: puppeteer.Page, requestEvent?: string = 'request'): void;
                                      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

node_modules/node-warc/index.d.ts:251:35 - error TS1015: Parameter cannot have question mark and initializer.

251     constructor (page?: CRIEPage, requestEvent?: string = 'request');
                                      ~~~~~~~~~~~~

node_modules/node-warc/index.d.ts:251:35 - error TS2371: A parameter initializer is only allowed in a function or constructor implementation.

251     constructor (page?: CRIEPage, requestEvent?: string = 'request');
                                      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

node_modules/node-warc/index.d.ts:252:29 - error TS1015: Parameter cannot have question mark and initializer.

252     attach (page: CRIEPage, requestEvent?: string = 'request'): void;
                                ~~~~~~~~~~~~

node_modules/node-warc/index.d.ts:252:29 - error TS2371: A parameter initializer is only allowed in a function or constructor implementation.

252     attach (page: CRIEPage, requestEvent?: string = 'request'): void;
                                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

node_modules/node-warc/index.d.ts:253:29 - error TS1015: Parameter cannot have question mark and initializer.

253     detach (page: CRIEPage, requestEvent?: string = 'request'): void;
                                ~~~~~~~~~~~~

node_modules/node-warc/index.d.ts:253:29 - error TS2371: A parameter initializer is only allowed in a function or constructor implementation.

253     detach (page: CRIEPage, requestEvent?: string = 'request'): void;
                                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Every response has the same Record ID

I was testing the new features of the library, especially the Puppeteer request capturer and the WARC generator, along with headless-chrome-crawler, using the following script:

const HCCrawler = require('headless-chrome-crawler')
const { PuppeteerCapturer, PuppeteerWARCGenerator } = require('node-warc')

const warc = new PuppeteerWARCGenerator()
warc.initWARC('./test.warc', {appending: true})

const run = async () => {
  const crawler = await HCCrawler.launch({
    customCrawl: async (page, crawl) => {
      const capture = new PuppeteerCapturer(page)
      await page.setRequestInterception(true)

      page.on('request', request => {
        capture.requestWillBeSent(request)
        request.continue()
      })

      const result = await crawl()

      for (let req of capture.iterateRequests()) {
        await warc.generateWarcEntry(req)
      }

      return result
    },
    maxDepth: 0
  })

  await crawler.queue({url: 'http://books.toscrape.com', skipDuplicates: true})
  await crawler.onIdle()
  warc.end()
  await crawler.close()
}


run()

It creates the WARC file without any errors, but when you look into it, all the WARC-Record-ID values are the same.
Because of this, all the WARC-Concurrent-To fields are the same too.

One way to fix it is to create the generator, init it, write the request and close it for each request like this:

for (let req of capture.iterateRequests()) {
    const warc = new PuppeteerWARCGenerator()
    warc.initWARC('./test5.warc', {appending: true})
    await warc.generateWarcEntry(req)
    warc.end()
}

But that is very inefficient.

httpHeaders Set-Cookie is single string

When a server response in a WARC record contains multiple Set-Cookie headers, the value of httpHeaders.Set-Cookie is always the last one in the list. It should be returned as an array of the Set-Cookie headers, if that change doesn't break other things, or there should be another method to get all of the cookies from the headers block. Another option would be to keep the line endings (\n) in the value, so it is still a string but can be split if needed.

Example WARC record (minus the content block):

WARC/1.0
WARC-Type: request
WARC-Date: 2019-06-15T21:54:45Z
WARC-Record-ID: <urn:uuid:1e7aaba9-c5b9-49cd-b0a8-6a4d7460c9b3>
Content-Length: 296
Content-Type: application/http; msgtype=request
WARC-Warcinfo-ID: <urn:uuid:07d8abda-2416-492c-b139-8fb526d5f792>
WARC-IP-Address: 95.216.246.36
WARC-Target-URI: https://www.bpazar.com/index.php?route=product/search&search=Sarj&page=4

GET /index.php?route=product/search&search=Sarj&page=4 HTTP/1.1
User-Agent: CCBot/2.0 (https://commoncrawl.org/faq/)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Host: www.bpazar.com
Connection: Keep-Alive
Accept-Encoding: gzip



WARC/1.0
WARC-Type: response
WARC-Date: 2019-06-15T21:54:45Z
WARC-Record-ID: <urn:uuid:3f3d6e43-9e5d-42ba-a111-43fcd90dd633>
Content-Length: 1043231
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:07d8abda-2416-492c-b139-8fb526d5f792>
WARC-Concurrent-To: <urn:uuid:1e7aaba9-c5b9-49cd-b0a8-6a4d7460c9b3>
WARC-IP-Address: 95.216.246.36
WARC-Target-URI: https://www.bpazar.com/index.php?route=product/search&search=Sarj&page=4
WARC-Payload-Digest: sha1:N2WQFAUYKKXT6MRWSCXCQC7FOZRQCLTI
WARC-Block-Digest: sha1:S3FKWWFJ7LCYFOHUZ4RBPFAMYNQSVQMH
WARC-Identified-Payload-Type: text/html

HTTP/1.1 200 OK
Server: nginx
Date: Sat, 15 Jun 2019 21:54:44 GMT
Content-Type: text/html; charset=UTF-8
X-Crawler-Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
Set-Cookie: OCSESSID=d4163e3479bec29a507792acc4; path=/
Set-Cookie: OCSESSID=57bfbd42e2fe9d4d5af66485f7; path=/
Set-Cookie: language=tr-tr; expires=Mon, 15-Jul-2019 21:54:40 GMT; Max-Age=2592000; path=/; domain=tr-tr
Set-Cookie: currency=TRY; expires=Mon, 15-Jul-2019 21:54:40 GMT; Max-Age=2592000; path=/; domain=www.bpazar.com
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
X-Nginx-Cache-Status: BYPASS
X-Server-Powered-By: Engintron
X-Crawler-Content-Encoding: gzip

Response from console.log(record.httpHeaders); when used in the record callback:

{ Server: 'nginx',
  Date: 'Sat, 15 Jun 2019 21:54:44 GMT',
  'Content-Type': 'text/html; charset=UTF-8',
  'X-Crawler-Transfer-Encoding': 'chunked',
  Connection: 'keep-alive',
  Vary: 'Accept-Encoding',
  'Set-Cookie':
   'currency=TRY; expires=Mon, 15-Jul-2019 21:54:40 GMT; Max-Age=2592000; path=/; domain=www.bpazar.com',
  'X-XSS-Protection': '1; mode=block',
  'X-Content-Type-Options': 'nosniff',
  'X-Nginx-Cache-Status': 'BYPASS',
  'X-Server-Powered-By': 'Engintron',
  'X-Crawler-Content-Encoding': 'gzip' }
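The array-based behaviour requested above could look roughly like the following. This is a minimal sketch, not node-warc's actual parser: `parseHttpHeaders` is a hypothetical helper that takes the raw header block as a string and, when a header name repeats, keeps every value in an array instead of overwriting the earlier ones.

```javascript
// Hypothetical helper (not part of node-warc): parse a raw HTTP header block
// and keep every value of a repeated header instead of only the last one.
function parseHttpHeaders (rawHeaderBlock) {
  // No prototype, so header names such as "constructor" cannot collide
  const headers = Object.create(null)
  for (const line of rawHeaderBlock.split(/\r?\n/)) {
    const sep = line.indexOf(':')
    if (sep === -1) continue // status line or blank line, not a header
    const name = line.slice(0, sep).trim()
    const value = line.slice(sep + 1).trim()
    if (!(name in headers)) {
      headers[name] = value // first occurrence stays a plain string
    } else if (Array.isArray(headers[name])) {
      headers[name].push(value) // third and later occurrences
    } else {
      headers[name] = [headers[name], value] // second occurrence: upgrade to array
    }
  }
  return headers
}

// Usage with a shortened version of the response headers shown above
const cookieDemo = parseHttpHeaders([
  'HTTP/1.1 200 OK',
  'Server: nginx',
  'Set-Cookie: OCSESSID=d4163e3479bec29a507792acc4; path=/',
  'Set-Cookie: currency=TRY; path=/'
].join('\r\n'))
console.log(cookieDemo['Set-Cookie'])
```

Headers that appear once stay plain strings here, so existing callers would only see a different type for headers that actually repeat.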

AutoWARCParser is not a constructor

I'm trying to use AutoWARCParser, but when I use the example from the readme.md

const {AutoWARCParser} = require('node-warc');

const parser = new AutoWARCParser('/path/to/my.warc.gz');
parser.on('record', record => { console.log(record); });
parser.on('done', () => { console.log('finished'); });
parser.on('error', error => { console.error(error); });
parser.start();

I get the following error when running the test_warc.js file:

/home/test_warc.js:4
const parser = new AutoWARCParser(
               ^

TypeError: AutoWARCParser is not a constructor
    at Object.<anonymous> (/home/test_warc.js:4:16)
    at Module._compile (internal/modules/cjs/loader.js:776:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:787:10)
    at Module.load (internal/modules/cjs/loader.js:653:32)
    at tryModuleLoad (internal/modules/cjs/loader.js:593:12)
    at Function.Module._load (internal/modules/cjs/loader.js:585:3)
    at Function.Module.runMain (internal/modules/cjs/loader.js:829:12)
    at startup (internal/bootstrap/node.js:283:19)
    at bootstrapNodeJSCore (internal/bootstrap/node.js:622:3)

# node -v
v10.16.0

# npm -v
6.10.1

# lsb_release -a
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.2 LTS
Release:        18.04
Codename:       bionic

WARC Parsing sometimes results in truncated records.

The WARC parsing sometimes results in records being truncated.

This might be because the parser keeps looking for newlines (reading one line at a time) even while parsing the content body, so it may happen whenever a \r\n\r\n sequence occurs in the body of the record.

The issue can be seen by running:

const { AutoWARCParser } = require('node-warc');
  
(async () => {
  for await (const record of new AutoWARCParser('test1.warc.gz')) {
    console.log(record.content.toString('utf-8'));
  }
})();

With these example files:
test1.warc.gz
(last couple of bytes are cut-off)

test2.warc.gz
(most of the file is cut-off after initial comment)

For comparison, the warcio version prints the full record:

from warcio import ArchiveIterator
  
for record in ArchiveIterator(open('./test1.warc.gz', 'rb')):
    print(record.content_stream().read().decode('utf-8'))
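If the line-by-line hypothesis above is right, one alternative is to stop scanning for newlines once the WARC headers end and take exactly Content-Length bytes as the record block. The sketch below illustrates that idea only. `readRecord` is a hypothetical helper, it works on already-decompressed bytes, and it is not how node-warc is implemented.

```javascript
const HEADER_END = Buffer.from('\r\n\r\n')

// Hypothetical helper: read one uncompressed WARC record starting at `offset`.
// Only the WARC header section is split into lines. The block is sliced by
// byte count, so a \r\n\r\n inside the body cannot end the record early.
function readRecord (buf, offset = 0) {
  const headerEnd = buf.indexOf(HEADER_END, offset)
  if (headerEnd === -1) return null
  const lines = buf.toString('utf8', offset, headerEnd).split('\r\n')
  const headers = {}
  for (const line of lines.slice(1)) { // lines[0] is the "WARC/1.0" version line
    const sep = line.indexOf(':')
    if (sep === -1) continue
    headers[line.slice(0, sep).trim()] = line.slice(sep + 1).trim()
  }
  const length = Number(headers['Content-Length'])
  if (!Number.isInteger(length)) return null
  const blockStart = headerEnd + HEADER_END.length
  const content = buf.subarray(blockStart, blockStart + length)
  // Each record is followed by two CRLFs before the next record starts
  return { headers, content, next: blockStart + length + HEADER_END.length }
}

// Usage: the HTTP body deliberately contains a blank line (\r\n\r\n)
const httpBody = 'HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nfirst part\r\n\r\nsecond part'
const rawRecord = Buffer.from(
  'WARC/1.0\r\n' +
  'WARC-Type: response\r\n' +
  `Content-Length: ${Buffer.byteLength(httpBody)}\r\n\r\n` +
  httpBody + '\r\n\r\n'
)
const demoRecord = readRecord(rawRecord)
console.log(demoRecord.content.toString('utf8') === httpBody)
```

The returned `next` offset lets a caller loop over consecutive records in the same buffer.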

TypeError on Puppeteer example: Cannot read property 'Page' of undefined

After copying and pasting the Puppeteer example from the readme and replacing only the URL, I get UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'Page' of undefined.

I've tried Puppeteer v1.2.x and v2.0.0 with the same error. Pardon any Puppeteer ignorance on my end, but I don't think the library is exporting an Events object.

However, const { Events } = require('chrome-remote-interface-extra') seems to do the trick: it brings in the Events object that is needed, which puppeteer does not.

Capturing two URLs are not being properly read by Webrecorder Player?

I successfully (I think) captured and generated a warc file using https://electronjs.org/docs/api/debugger.

I tried a simple site: www.drupal.org

If I capture only the first load, it seems to work nicely, and Webrecorder Player shows it perfectly.

However, if I navigate to "Developers" and then store both the homepage and this page in the WARC file, it doesn't seem to work. I can see the data in the WARC file, though.

I guess something is missing from the WARC file or I am missing something. Any ideas?

Other than that, I am super happy to see this working. It might even be worth contributing this WARC generator to this package.

Investigate generating WARCS through headless Chromium / puppeteer

https://github.com/GoogleChrome/puppeteer/ seems to be the hot thing today for rendering web pages in headless/remote-controlled environments. It would be great to add this as an option to generate WARCs, i.e. a PuppeteerWARCGenerator. One main advantage is that it installs completely seamlessly through npm and runs on headless boxes, so the setup effort is minimized.

I will investigate this in my current project and may come back with a PR - just wanted to collect any thoughts that may have been given to this already beforehand.

Some WARCs lack bookmarks according to Webrecorder player

Feature request: Add fields to warcinfo

The spec says:

Allowable fields include, but are not limited to...

Currently it only accepts isPartOf, the user agent and description. It would be useful to be able to add software-related fields, to change the robots field and to add some extra info to the software line.

function warcInfoContent ({ version, isPartOfV, warcInfoDescription, ua }) {
  const base = [
    `software: node-warc/${version}${CRLF}format: WARC File Format ${WARCV}${CRLF}robots: ignore${CRLF}`
  ]
  if (isPartOfV != null) {
    base.push(`isPartOf: ${isPartOfV}${CRLF}`)
  }
  if (warcInfoDescription != null) {
    base.push(`description: ${warcInfoDescription}${CRLF}`)
  }
  if (ua != null) {
    base.push(`http-header-user-agent: ${ua}${CRLF}`)
  }
  return base.join('')
}
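The request could be met in several ways. The following is only a sketch of one possible shape, not a change that exists in node-warc. The `robots` and `extraFields` options are hypothetical names introduced for illustration, and the values of `CRLF` and `WARCV` are assumed here so that the snippet runs on its own.

```javascript
const CRLF = '\r\n'
const WARCV = '1.0' // assumed value, matching the WARC/1.0 records quoted on this page

// Sketch only: `robots` and `extraFields` are hypothetical options
function warcInfoContent ({
  version,
  isPartOfV,
  warcInfoDescription,
  ua,
  robots = 'ignore', // keeps the current default but allows overriding it
  extraFields = {} // any additional "name: value" lines for the warcinfo body
}) {
  const base = [
    `software: node-warc/${version}${CRLF}`,
    `format: WARC File Format ${WARCV}${CRLF}`,
    `robots: ${robots}${CRLF}`
  ]
  if (isPartOfV != null) {
    base.push(`isPartOf: ${isPartOfV}${CRLF}`)
  }
  if (warcInfoDescription != null) {
    base.push(`description: ${warcInfoDescription}${CRLF}`)
  }
  if (ua != null) {
    base.push(`http-header-user-agent: ${ua}${CRLF}`)
  }
  // Append caller-supplied fields, skipping null and undefined values
  for (const [name, value] of Object.entries(extraFields)) {
    if (value != null) {
      base.push(`${name}: ${value}${CRLF}`)
    }
  }
  return base.join('')
}

// Usage: "operator" is one of the warcinfo fields named in the WARC spec
const infoDemo = warcInfoContent({
  version: '3.0.0',
  robots: 'obey',
  extraFields: { operator: 'Example Archive' }
})
console.log(infoDemo)
```

Calling it with only the original four options produces the same text as the function quoted above, so existing callers would be unaffected.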

WARCWriterBase Record-ID and Concurrent-To should be the same

A few days ago I was trying to use headless-chrome-crawler along with node-warc to generate WARCs from headless Chrome, as #2 suggests. I ran a few tests, started tinkering with the code and ended up exposing the WARCWriterBase class and using it in the code that can be seen in yujiosaka/headless-chrome-crawler#118 (comment).

But when the WARCs were generated, I noticed that the WARC-Concurrent-To of the request didn't match the WARC-Record-ID of the response, and that the writeResponseRecord method generates its own record ID.
So shouldn't this line use the object record id?


That way writeRequestRecord could use the record ID of the response as the WARC-Concurrent-To field and generate its own record ID.

Every Concurrent-To field is null.

I don't know if you've seen my latest comment on issue #18, so I'm opening this new one just in case.

The fix for #18 broke the WARC-Concurrent-To field, making it <urn:uuid:null>. I used the same code to test this as in #18.
This issue may be related to #4.

According to ISO 28500:

This field may be used to associate records of types 'request', 'response', 'resource', 'metadata', and 'revisit' with one another when they arise from a single capture event. (When so used, any WARC-Concurrent-To association shall be considered bidirectional even if the header only appears on one)

So it should be the same if the field is either in the 'request' or the 'response' record.
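As an illustration of the pairing described above (a sketch under assumptions, not node-warc's writer code): if both record IDs are generated before either record is written, neither side can end up as null and the two headers always agree. `recordId` and `pairedHeaders` are hypothetical helpers, and `crypto.randomUUID` requires Node.js 14.17 or later.

```javascript
const { randomUUID } = require('crypto')

// Hypothetical helper: a fresh WARC record ID in the <urn:uuid:...> form
function recordId () {
  return `<urn:uuid:${randomUUID()}>`
}

// Hypothetical helper: build the ID-related WARC headers for one
// request/response pair arising from a single capture event.
function pairedHeaders (targetURI) {
  // Both IDs exist before any record is written
  const requestId = recordId()
  const responseId = recordId()
  return {
    request: {
      'WARC-Type': 'request',
      'WARC-Record-ID': requestId,
      'WARC-Concurrent-To': responseId,
      'WARC-Target-URI': targetURI
    },
    response: {
      'WARC-Type': 'response',
      'WARC-Record-ID': responseId,
      'WARC-Concurrent-To': requestId,
      'WARC-Target-URI': targetURI
    }
  }
}

// Usage
const pairDemo = pairedHeaders('http://books.toscrape.com')
console.log(pairDemo.request['WARC-Concurrent-To'] === pairDemo.response['WARC-Record-ID'])
```

Because the quoted text treats the association as bidirectional, writing the header on only one of the two records would also be acceptable. The sketch writes it on both to make the link explicit.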
