n0tan3rd / node-warc
Parse And Create Web ARChive (WARC) files with node.js
License: MIT License
I'm trying to use your library with TypeScript (v3.6) and it fails to compile my project. I think the problem is in the provided definitions in index.d.ts: they reference puppeteer, which is listed as a dev dependency even though it should be a dependency.
At least that's what I gathered from a quick scan of the following error log:
node_modules/node-warc/index.d.ts:2:23 - error TS2688: Cannot find type definition file for 'puppeteer'.
2 /// <reference types="puppeteer" />
~~~~~~~~~
node_modules/node-warc/index.d.ts:8:29 - error TS2307: Cannot find module 'puppeteer'.
8 import * as puppeteer from 'puppeteer'
~~~~~~~~~~~
node_modules/node-warc/index.d.ts:221:41 - error TS1015: Parameter cannot have question mark and initializer.
221 constructor (page?: puppeteer.Page, requestEvent?: string = 'request');
~~~~~~~~~~~~
node_modules/node-warc/index.d.ts:221:41 - error TS2371: A parameter initializer is only allowed in a function or constructor implementation.
221 constructor (page?: puppeteer.Page, requestEvent?: string = 'request');
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
node_modules/node-warc/index.d.ts:222:35 - error TS1015: Parameter cannot have question mark and initializer.
222 attach (page: puppeteer.Page, requestEvent?: string = 'request'): void;
~~~~~~~~~~~~
node_modules/node-warc/index.d.ts:222:35 - error TS2371: A parameter initializer is only allowed in a function or constructor implementation.
222 attach (page: puppeteer.Page, requestEvent?: string = 'request'): void;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
node_modules/node-warc/index.d.ts:223:35 - error TS1015: Parameter cannot have question mark and initializer.
223 detach (page: puppeteer.Page, requestEvent?: string = 'request'): void;
~~~~~~~~~~~~
node_modules/node-warc/index.d.ts:223:35 - error TS2371: A parameter initializer is only allowed in a function or constructor implementation.
223 detach (page: puppeteer.Page, requestEvent?: string = 'request'): void;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
node_modules/node-warc/index.d.ts:251:35 - error TS1015: Parameter cannot have question mark and initializer.
251 constructor (page?: CRIEPage, requestEvent?: string = 'request');
~~~~~~~~~~~~
node_modules/node-warc/index.d.ts:251:35 - error TS2371: A parameter initializer is only allowed in a function or constructor implementation.
251 constructor (page?: CRIEPage, requestEvent?: string = 'request');
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
node_modules/node-warc/index.d.ts:252:29 - error TS1015: Parameter cannot have question mark and initializer.
252 attach (page: CRIEPage, requestEvent?: string = 'request'): void;
~~~~~~~~~~~~
node_modules/node-warc/index.d.ts:252:29 - error TS2371: A parameter initializer is only allowed in a function or constructor implementation.
252 attach (page: CRIEPage, requestEvent?: string = 'request'): void;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
node_modules/node-warc/index.d.ts:253:29 - error TS1015: Parameter cannot have question mark and initializer.
253 detach (page: CRIEPage, requestEvent?: string = 'request'): void;
~~~~~~~~~~~~
node_modules/node-warc/index.d.ts:253:29 - error TS2371: A parameter initializer is only allowed in a function or constructor implementation.
253 detach (page: CRIEPage, requestEvent?: string = 'request'): void;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The 'content' ArrayBuffer in the record appears to include the trailing \r\n.
I tested this with compressed WARCs; it may not be the case for uncompressed ones.
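As a stopgap on the caller's side, the trailing separator could be trimmed by hand. The following is a minimal sketch, assuming the record content is a Buffer (or Uint8Array); stripTrailingCRLF is a hypothetical helper and not part of node-warc. It cannot tell a separator from a payload that legitimately ends in \r\n, so slicing by the record's Content-Length is likely the more robust fix.

```javascript
// Hypothetical helper, not part of node-warc: drop one trailing CRLF pair
// from a record's content buffer, if one is present.
function stripTrailingCRLF (content) {
  const len = content.length
  // 0x0d is '\r' and 0x0a is '\n'
  if (len >= 2 && content[len - 2] === 0x0d && content[len - 1] === 0x0a) {
    return content.subarray(0, len - 2)
  }
  return content
}

// Assumed example payload, for illustration only.
const example = Buffer.from('<html></html>\r\n')
console.log(JSON.stringify(stripTrailingCRLF(example).toString('utf-8')))
```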
I was testing the new features of the library, especially the Puppeteer request capturer and the WARC generator, along with headless-chrome-crawler, using the following script:
const HCCrawler = require('headless-chrome-crawler')
const { PuppeteerCapturer, PuppeteerWARCGenerator } = require('node-warc')

const warc = new PuppeteerWARCGenerator()
warc.initWARC('./test.warc', {appending: true})

const run = async () => {
  const crawler = await HCCrawler.launch({
    customCrawl: async (page, crawl) => {
      // Declared with const so it does not leak as an implicit global
      const capture = new PuppeteerCapturer(page)
      await page.setRequestInterception(true)
      page.on('request', request => {
        capture.requestWillBeSent(request)
        request.continue()
      })
      const result = await crawl()
      for (let req of capture.iterateRequests()) {
        await warc.generateWarcEntry(req)
      }
      return result
    },
    maxDepth: 0
  })
  await crawler.queue({url: 'http://books.toscrape.com', skipDuplicates: true})
  await crawler.onIdle()
  warc.end()
  await crawler.close()
}

run()
It creates the WARC file without any errors, but when you look into it, all the WARC-Record-ID fields are the same.
Because of this, all the WARC-Concurrent-To fields are the same too.
One way to fix it is to create the generator, init it, write the request and close it for each request like this:
for (let req of capture.iterateRequests()) {
  const warc = new PuppeteerWARCGenerator()
  warc.initWARC('./test5.warc', {appending: true})
  await warc.generateWarcEntry(req)
  warc.end()
}
But that is highly inefficient.
When a server response in a WARC record contains multiple Set-Cookie headers, the value of httpHeaders.Set-Cookie is always the last one in the list. This should be returned as an array of the Set-Cookie headers, if that change doesn't break other things, or there should be another method to get all of the cookies from the headers block. Another option would be to keep the line endings (\n) for the response, so it is still a string but you can split it if you want.
Example WARC record (minus the content block):
WARC/1.0
WARC-Type: request
WARC-Date: 2019-06-15T21:54:45Z
WARC-Record-ID: <urn:uuid:1e7aaba9-c5b9-49cd-b0a8-6a4d7460c9b3>
Content-Length: 296
Content-Type: application/http; msgtype=request
WARC-Warcinfo-ID: <urn:uuid:07d8abda-2416-492c-b139-8fb526d5f792>
WARC-IP-Address: 95.216.246.36
WARC-Target-URI: https://www.bpazar.com/index.php?route=product/search&search=Sarj&page=4
GET /index.php?route=product/search&search=Sarj&page=4 HTTP/1.1
User-Agent: CCBot/2.0 (https://commoncrawl.org/faq/)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Host: www.bpazar.com
Connection: Keep-Alive
Accept-Encoding: gzip
WARC/1.0
WARC-Type: response
WARC-Date: 2019-06-15T21:54:45Z
WARC-Record-ID: <urn:uuid:3f3d6e43-9e5d-42ba-a111-43fcd90dd633>
Content-Length: 1043231
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:07d8abda-2416-492c-b139-8fb526d5f792>
WARC-Concurrent-To: <urn:uuid:1e7aaba9-c5b9-49cd-b0a8-6a4d7460c9b3>
WARC-IP-Address: 95.216.246.36
WARC-Target-URI: https://www.bpazar.com/index.php?route=product/search&search=Sarj&page=4
WARC-Payload-Digest: sha1:N2WQFAUYKKXT6MRWSCXCQC7FOZRQCLTI
WARC-Block-Digest: sha1:S3FKWWFJ7LCYFOHUZ4RBPFAMYNQSVQMH
WARC-Identified-Payload-Type: text/html
HTTP/1.1 200 OK
Server: nginx
Date: Sat, 15 Jun 2019 21:54:44 GMT
Content-Type: text/html; charset=UTF-8
X-Crawler-Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
Set-Cookie: OCSESSID=d4163e3479bec29a507792acc4; path=/
Set-Cookie: OCSESSID=57bfbd42e2fe9d4d5af66485f7; path=/
Set-Cookie: language=tr-tr; expires=Mon, 15-Jul-2019 21:54:40 GMT; Max-Age=2592000; path=/; domain=tr-tr
Set-Cookie: currency=TRY; expires=Mon, 15-Jul-2019 21:54:40 GMT; Max-Age=2592000; path=/; domain=www.bpazar.com
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
X-Nginx-Cache-Status: BYPASS
X-Server-Powered-By: Engintron
X-Crawler-Content-Encoding: gzip
Response from console.log(record.httpHeaders); when used in the record callback:
{ Server: 'nginx',
Date: 'Sat, 15 Jun 2019 21:54:44 GMT',
'Content-Type': 'text/html; charset=UTF-8',
'X-Crawler-Transfer-Encoding': 'chunked',
Connection: 'keep-alive',
Vary: 'Accept-Encoding',
'Set-Cookie':
'currency=TRY; expires=Mon, 15-Jul-2019 21:54:40 GMT; Max-Age=2592000; path=/; domain=www.bpazar.com',
'X-XSS-Protection': '1; mode=block',
'X-Content-Type-Options': 'nosniff',
'X-Nginx-Cache-Status': 'BYPASS',
'X-Server-Powered-By': 'Engintron',
'X-Crawler-Content-Encoding': 'gzip' }
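The array-based behaviour requested above could look roughly like the following. This is only a sketch under assumptions: parseHttpHeaders is a hypothetical function, not node-warc's parser, and it assumes the caller hands it the raw header lines as one string.

```javascript
// Hypothetical sketch, not node-warc's parser: collect repeated HTTP header
// fields into arrays instead of letting the last value overwrite the others.
function parseHttpHeaders (headerBlock) {
  const headers = Object.create(null)
  for (const line of headerBlock.split(/\r?\n/)) {
    const sep = line.indexOf(':')
    if (sep === -1) continue // status line or blank line: no "name: value" pair
    const name = line.slice(0, sep).trim()
    const value = line.slice(sep + 1).trim()
    if (!(name in headers)) {
      headers[name] = value // first occurrence stays a plain string
    } else if (Array.isArray(headers[name])) {
      headers[name].push(value)
    } else {
      headers[name] = [headers[name], value] // second occurrence: promote to array
    }
  }
  return headers
}

const parsed = parseHttpHeaders([
  'HTTP/1.1 200 OK',
  'Server: nginx',
  'Set-Cookie: OCSESSID=d4163e3479bec29a507792acc4; path=/',
  'Set-Cookie: OCSESSID=57bfbd42e2fe9d4d5af66485f7; path=/'
].join('\r\n'))
console.log(parsed['Set-Cookie'])
```

Single-valued headers stay strings here, so existing callers reading httpHeaders.Server would be unaffected; only repeated fields change shape.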
I'm trying to use AutoWARCParser, but when using the example from the readme.md:
const {AutoWARCParser} = require('node-warc');
const parser = new AutoWARCParser('/path/to/my.warc.gz');
parser.on('record', record => { console.log(record); });
parser.on('done', () => { console.log('finished'); });
parser.on('error', error => { console.error(error); });
parser.start();
I get the following error when running the test_warc.js file:
/home/test_warc.js:4
const parser = new AutoWARCParser(
^
TypeError: AutoWARCParser is not a constructor
at Object.<anonymous> (/home/test_warc.js:4:16)
at Module._compile (internal/modules/cjs/loader.js:776:30)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:787:10)
at Module.load (internal/modules/cjs/loader.js:653:32)
at tryModuleLoad (internal/modules/cjs/loader.js:593:12)
at Function.Module._load (internal/modules/cjs/loader.js:585:3)
at Function.Module.runMain (internal/modules/cjs/loader.js:829:12)
at startup (internal/bootstrap/node.js:283:19)
at bootstrapNodeJSCore (internal/bootstrap/node.js:622:3)
# node -v
v10.16.0
# npm -v
6.10.1
# lsb_release -a
Distributor ID: Ubuntu
Description: Ubuntu 18.04.2 LTS
Release: 18.04
Codename: bionic
The WARC parsing sometimes results in records being truncated.
This might be because the parser keeps looking for newlines (reading one line at a time) even while parsing the content body, so it may happen whenever a \r\n\r\n is encountered in the body of the record.
The issue can be seen by running:
const { AutoWARCParser } = require('node-warc');

(async () => {
  for await (const record of new AutoWARCParser('test1.warc.gz')) {
    console.log(record.content.toString('utf-8'));
  }
})();
With these example files:
test1.warc.gz (last couple of bytes are cut off)
test2.warc.gz (most of the file is cut off after the initial comment)
For comparison, the warcio version prints the full record:
from warcio import ArchiveIterator

for record in ArchiveIterator(open('./test1.warc.gz', 'rb')):
    print(record.content_stream().read().decode('utf-8'))
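If the truncation really does come from line-oriented scanning inside the content block, one alternative is to read the block by its declared Content-Length. The sketch below is not node-warc's implementation; readRecordContent is a hypothetical function that assumes a single, already decompressed record held in one Buffer.

```javascript
// Hypothetical sketch, not node-warc's implementation: extract a record's
// content block by byte count rather than by scanning for blank lines.
function readRecordContent (recordBytes) {
  // The WARC header ends at the first empty line.
  const headerEnd = recordBytes.indexOf('\r\n\r\n')
  if (headerEnd === -1) throw new Error('incomplete WARC header')
  const header = recordBytes.subarray(0, headerEnd).toString('utf-8')
  const match = /^Content-Length:\s*(\d+)\s*$/im.exec(header)
  if (!match) throw new Error('missing Content-Length')
  const start = headerEnd + 4 // skip the CRLF CRLF that closes the header
  return recordBytes.subarray(start, start + Number(match[1]))
}

// Assumed example record whose body itself contains \r\n\r\n.
const body = 'line one\r\n\r\nline two'
const record = Buffer.from(
  'WARC/1.0\r\n' +
  'WARC-Type: resource\r\n' +
  `Content-Length: ${Buffer.byteLength(body)}\r\n` +
  '\r\n' +
  body +
  '\r\n\r\n'
)
console.log(readRecordContent(record).toString('utf-8') === body)
```

Because the slice is driven by the byte count, a blank line inside the body does not end the record early, and the two CRLFs that follow the block are never included.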
Copying and pasting the Puppeteer example from the readme, replacing only the URL, I get UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'Page' of undefined.
I've tried Puppeteer v1.2.x and v2.0.0 with the same error. Pardon any Puppeteer ignorance on my end, but I don't think the library is exporting an Events object.
However, const { Events } = require('chrome-remote-interface-extra') seems to do the trick and brings in the Events needed, rather than puppeteer.
node-warc/lib/parsers/warcStreamTransform.js
Lines 110 to 115 in be38971
Since _consumeChunk calls done, I think this should be:
_flush (done) {
  if (this.buffered) {
    this._consumeChunk(this.buffered, done, true)
  } else {
    done()
  }
}
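To illustrate why the callback should be invoked exactly once per _flush call, here is a self-contained sketch of the same pattern. LineSplitter and _consume are hypothetical names for this example only, not node-warc's warcStreamTransform; it splits text on \n and ignores multi-byte characters straddling chunk boundaries.

```javascript
const { Transform } = require('stream')

// Hypothetical line splitter showing the pattern: the helper that consumes
// data owns the callback, so _flush must not call done() a second time.
class LineSplitter extends Transform {
  constructor () {
    super({ readableObjectMode: true })
    this.buffered = ''
  }

  _consume (text, done, isFinal) {
    const lines = text.split('\n')
    // Keep the unterminated tail for the next chunk unless the stream is ending.
    this.buffered = isFinal ? '' : lines.pop()
    for (const line of lines) this.push(line)
    done() // the single callback invocation for this path
  }

  _transform (chunk, encoding, done) {
    this._consume(this.buffered + chunk.toString('utf-8'), done, false)
  }

  _flush (done) {
    if (this.buffered) {
      this._consume(this.buffered, done, true) // _consume calls done()
    } else {
      done()
    }
  }
}

const splitter = new LineSplitter()
splitter.on('data', line => console.log(line))
splitter.write('a\nb')
splitter.end('\nc')
```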
I successfully (I think) captured and generated a WARC file using https://electronjs.org/docs/api/debugger.
I tried a simple site: www.drupal.org
If I capture the first load, it seems to work nicely; Webrecorder Player shows it perfectly.
However, if I navigate to "Developers" and then store both the homepage and this page in the WARC file, it doesn't seem to work, although I can see the data in the WARC file.
I guess something is missing in the WARC file, or I am missing something. Any ideas?
Other than that, I am super happy to see this working. It might even be worth contributing this WARC generator to this package.
https://github.com/GoogleChrome/puppeteer/ seems to be the hot thing today for rendering web pages in headless or remote-controlled environments. It would be great to add this as an option to generate WARCs, i.e. a PuppeteerWARCGenerator. One main advantage is that it installs seamlessly through npm and runs on headless boxes, so the setup effort is minimized.
I will investigate this in my current project and may come back with a PR. I just wanted to collect any thoughts that may already have been given to this beforehand.
The spec says:
"Allowable fields include, but are not limited to..."
Currently the writer only accepts isPartOf, UserAgent and description. It would be useful to be able to add software-based fields, change the robots field, and add some extra info to the software line.
node-warc/lib/writers/warcFields.js
Lines 175 to 189 in 09a8a0b
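One way the open-ended field list could be supported is to serialize whatever fields the caller passes. This is a sketch only: buildWarcInfoContent is a hypothetical helper, not node-warc's API, and the field values are made-up examples. It emits "name: value" lines ending in CRLF and does no escaping or validation.

```javascript
// Hypothetical helper, not node-warc's API: serialize arbitrary warcinfo
// fields as "name: value" lines terminated by CRLF.
function buildWarcInfoContent (fields) {
  return Object.entries(fields)
    .filter(([, value]) => value !== undefined && value !== null) // skip unset fields
    .map(([name, value]) => `${name}: ${value}\r\n`)
    .join('')
}

// Assumed example values, for illustration only.
const content = buildWarcInfoContent({
  software: 'my-crawler/1.0.0',
  isPartOf: 'test-collection',
  robots: 'ignore',
  description: 'Example crawl'
})
console.log(content)
```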
A few days ago I was trying to use headless-chrome-crawler along with node-warc to generate WARCs from headless Chrome, as #2 suggests. I made a few tests, started tinkering with the code, and ended up exposing the WARCWriterBase class and using it in the code that can be seen in yujiosaka/headless-chrome-crawler#118 (comment).
But when the WARCs were generated, I noticed that the WARC-Concurrent-To of the request didn't match the WARC-Record-ID of the response, and that the writeResponseRecord method generates its own record ID.
So shouldn't this line use the object's record ID?
node-warc/lib/writers/warcWriterBase.js
Line 157 in a9438cd
writeRequestRecord could use the record ID of the response as the WARC-Concurrent-To field and generate its own record ID.
I don't know if you've seen my latest comment in issue #18, so I'm opening this new one just in case.
The fix for #18 broke the WARC-Concurrent-To field, making it <urn:uuid:null>. I used the same code to test this as in #18.
This issue may be related to #4.
According to ISO 28500:
This field may be used to associate records of types 'request', 'response', 'resource', 'metadata', and 'revisit' with one another when they arise from a single capture event. (When so used, any WARC-Concurrent-To association shall be considered bidirectional even if the header only appears on one)
So it should be the same if the field is either in the 'request' or the 'response' record.
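A simple way to keep the two records linked is to mint both record IDs before writing either record. The sketch below is an illustration only: makeRecordIdPair is a hypothetical helper, not part of node-warc, and it assumes a Node.js version that provides crypto.randomUUID (14.17 or later).

```javascript
const { randomUUID } = require('crypto')

// Hypothetical helper, not part of node-warc: create both IDs up front so
// each record can reference the other and neither ends up as <urn:uuid:null>.
function makeRecordIdPair () {
  const requestId = `<urn:uuid:${randomUUID()}>`
  const responseId = `<urn:uuid:${randomUUID()}>`
  return {
    request: { 'WARC-Record-ID': requestId, 'WARC-Concurrent-To': responseId },
    response: { 'WARC-Record-ID': responseId, 'WARC-Concurrent-To': requestId }
  }
}

const pair = makeRecordIdPair()
console.log(pair.request['WARC-Concurrent-To'] === pair.response['WARC-Record-ID'])
```

Since the association is bidirectional per the quoted text, writing the field on only one of the two records would also be valid; setting it on both simply makes the link explicit.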