
Home Page: https://n0tan3rd.github.io/Squidwarc/

License: Apache License 2.0

Topics: webarchiving, webarchives, crawler, high-fidelity-preservation, chrome-headless, chrome, puppeteer, headless-chrome, crawling, browser-automation

squidwarc's Introduction

Squidwarc

Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head.

Squidwarc aims to address the need for a high fidelity crawler akin to Heritrix while still being easy enough for the personal archivist to set up and use.

Squidwarc does not seek (at the moment) to dethrone Heritrix as the queen of wide archival crawls; rather, it seeks to address Heritrix's shortcomings, namely:

  • No JavaScript execution
  • Everything is plain text
  • Requiring configuration to know how to preserve the web
  • Setup time and technical knowledge required of its users

For more information about this see

Squidwarc is built using Node.js, node-warc, and chrome-remote-interface.

If running a crawler from the command line is not your thing, then Squidwarc highly recommends warcworker, a web front end for Squidwarc by @peterk.

If you are unable to install Node on your system but have Docker, then you can use the provided Dockerfile or Compose file.

If you have neither, then Squidwarc highly recommends WARCreate or WAIL. WARCreate did this first, and if it had not, Squidwarc would not exist 💕

If recording the web is what you seek, Squidwarc highly recommends Webrecorder.

Out Of The Box Crawls

Page Only

Preserve only the page; no links are followed

Page + Same Domain Links

The Page Only option, plus preservation of all links found on the page that are on the same domain as the page

Page + All internal and external links

The Page + Same Domain Links option, plus all links from other domains

Usage

Squidwarc uses a bootstrapping script to install dependencies. First, get the latest version from source:

$ git clone https://github.com/N0taN3rd/Squidwarc
$ cd Squidwarc

Then run the bootstrapping script to install the dependencies:

$ ./bootstrap.sh

Once the dependencies have been installed you can start a pre-configured (but customizable) crawl with either:

$ ./run-crawler.sh -c conf.json

or:

$ node index.js -c conf.json

Config file

The config file example below is annotated for explanation; note that the annotations (comments) are not valid JSON, so remove them before using the file in a crawl.

For more detailed information about the crawl configuration file and its fields, please consult the manual available online.

{
  "mode": "page-only", // the mode you wish to crawl using
  "depth": 1, // how many hops out do you wish to crawl

  // path to the script you want Squidwarc to run per page. See `userFns.js` for more information
  "script": "./userFns.js",
  // the crawl's starting points
  "seeds": [
    "https://www.instagram.com/visit_berlin/"
  ],

  "warc": {
    "naming": "url", // currently this is the only option supported do not change.....
    "append": false // do you want this crawl to use a save all preserved data to a single WARC or WARC per page
  },

  // The Chrome instance to connect to is running on this host and port.
  // Must match --remote-debugging-port=<port> when Squidwarc is connecting to an already running instance of Chrome.
  // localhost is the default host when only --remote-debugging-port is set
  "connect": {
    "launch": true, // if you want Squidwarc to attempt to launch the version of Chrome already on your system or not
    "host": "localhost",
    "port": 9222
  },

  // time is in milliseconds
  "crawlControl": {
    "globalWait": 60000, // maximum time spent visiting a page
    "inflightIdle": 1000, // how long to wait for until network idle is determined when there are only `numInflight` (no response recieved) requests
    "numInflight": 2, // when there are only N inflight (no response recieved) requests start network idle count down
    "navWait": 8000 // wait at maximum 8 seconds for Chrome to navigate to a page
  }
}
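
The `script` field above points at a per-page user script (`userFns.js`). As a rough sketch only, assuming the script exports an async function that is handed the page being preserved (consult `userFns.js` in the repository for the actual contract), it might look like this:

// userFns.js - hypothetical sketch of a per-page user script.
// Assumption: Squidwarc calls the exported async function once per page,
// passing the Puppeteer page object; see the repo's userFns.js for the
// actual contract.
module.exports = async function (page) {
  // e.g. give late-loading resources a moment to arrive before the WARC
  // for this page is generated
  await new Promise(resolve => setTimeout(resolve, 2000))
}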


squidwarc's People

Contributors

machawk1, mhucka, n0tan3rd, peterk


squidwarc's Issues

Unable to reuse local Chrome user dir/cookies

Are you submitting a bug report or a feature request?

Bug report.

What is the current behavior?

https://github.com/N0taN3rd/Squidwarc/blob/master/manual/configuration.md#userdatadir states that a userDataDir attribute can be specified to reuse the user directory for a system's Chrome. I use a logged-in version of Chrome on my system, so I wanted to leverage my logged-in cookies to crawl content behind authentication using Squidwarc. I specify a config file for Squidwarc:

{ "use": "puppeteer", "headless": true, "script": "./userFns.js", "mode": "page-all-links", "depth": 1, "seeds": [ "https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly" ], "warc": { "naming": "url", "append": true }, "connect": { "launch": true, "host": "localhost", "port": 9222, "userDataDir": "/Users/machawk1/Library/Application Support/Google/Chrome" }, "crawlControl": { "globalWait": 5000, "inflightIdle": 1000, "numInflight": 2, "navWait": 8000 } }
...in an attempt to preserve https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly, a URI that will provide a login page if not authenticated. I get the following result on stdout:
Running Crawl From Config File /Users/machawk1/Desktop/squidwarcWithCookies.json
With great power comes great responsibility!
Squidwarc is not responsible for ill behaved user supplied scripts!

Crawler Operating In page-all-links mode
Crawler Will Be Preserving 1 Seeds
Crawler Will Be Generating WARC Files Using the filenamified url
Crawler Generated WARCs Will Be Placed At /private/tmp/Squidwarc
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly
Running user script
Crawler Generating WARC
Crawler Has 18 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly#column-one
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly#column-one
Running user script
Crawler Generating WARC
Crawler Has 17 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly#searchInput
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly#searchInput
Running user script
Crawler Generating WARC
Crawler Has 16 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php?title=Special:UserLogin&returnto=User%3AMatKelly&returntoquery=
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php?title=Special:UserLogin&returnto=User%3AMatKelly&returntoquery=
Running user script
Crawler Generating WARC
Crawler Has 15 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/Special:Badtitle
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/Special:Badtitle
Running user script
Crawler Generating WARC
Crawler Has 14 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php?title=Special:UserLogin&returnto=User%3AMatKelly
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php?title=Special:UserLogin&returnto=User%3AMatKelly
Running user script
Crawler Generating WARC
Crawler Has 13 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/Main_Page
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/Main_Page
Running user script
Crawler Generating WARC
Crawler Has 12 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/ODU_WS-DL_Wiki:Community_portal
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/ODU_WS-DL_Wiki:Community_portal
Running user script
Crawler Generating WARC
Crawler Has 11 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/ODU_WS-DL_Wiki:Current_events
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/ODU_WS-DL_Wiki:Current_events
Running user script
Crawler Generating WARC
Crawler Has 10 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/Special:RecentChanges
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/Special:RecentChanges
Running user script
Crawler Generating WARC
Crawler Has 9 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/Special:Random
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/Special:Random
Running user script
Crawler Generating WARC
Crawler Has 8 Seeds Left To Crawl
Crawler Navigating To https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Contents
Crawler Navigated To https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Contents
Running user script
Crawler Generating WARC
Crawler Has 7 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/Localhelppage
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/Localhelppage
Running user script
Crawler Generating WARC
Crawler Has 6 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/Special:SpecialPages
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/Special:SpecialPages
Running user script
Crawler Generating WARC
Crawler Has 5 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php?title=Special:Badtitle&printable=yes
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php?title=Special:Badtitle&printable=yes
Running user script
Crawler Generating WARC
Crawler Has 4 Seeds Left To Crawl
Crawler Navigating To https://www.mediawiki.org/
A Fatal Error Occurred
Error: options.stripFragment is renamed to options.stripHash

  • index.js:35 module.exports
    [Squidwarc]/[normalize-url]/index.js:35:9

  • _createHybrid.js:87 wrapper
    [Squidwarc]/[lodash]/_createHybrid.js:87:15

  • puppeteer.js:155 PuppeteerCrawler.navigate
    /private/tmp/Squidwarc/lib/crawler/puppeteer.js:155:11

Please Inform The Maintainer Of This Project About It. Information In package.json

The resulting WARC does not contain any records related to the specified URI, oddly, since anonymous access results in an HTTP 200. The URI https://ws-dl.cs.odu.edu/wiki/index.php/Special:Random, however, is shown in the WARC. Replaying this page shows a login interface, indicating that my browser's cookies were not used.

What is the expected behavior?

Squidwarc uses my local Chrome's cookies and captures the page behind authentication, per the manual.

What's your environment?

macOS 10.14.2
Squidwarc a402335 (current master)
node v10.12.0

Other information

We discussed this informally via Slack. Previously, I experienced this config script borking my Chrome's user directory (i.e., conventionally using Chrome would no longer allow creds to "stick"), but I can no longer replicate this.

Make scalable

Are you submitting a bug report or a feature request?

Feature Request

What is the current behavior?

Does not scale

What is the expected behavior?

To scale

What's your environment?

Other information

Exception immediately thrown on first-run

Are you submitting a bug report or a feature request?

Bug report.

What is the current behavior?

Unable to get started using the instructions in the README. Per the README's Usage section, after cloning the repo I ran npm install then ./run-crawler.sh -c conf.json. The console reported that an exception was thrown:

$ ./run-crawler.sh -c conf.json
internal/modules/cjs/loader.js:582
    throw err;
    ^

Error: Cannot find module '../../node-warc/lib/writers/remoteChrome'
    at Function.Module._resolveFilename (internal/modules/cjs/loader.js:580:15)
    at Function.Module._load (internal/modules/cjs/loader.js:506:25)
    at Module.require (internal/modules/cjs/loader.js:636:17)
    at require (internal/modules/cjs/helpers.js:20:18)
    at Object.<anonymous> (/private/tmp/Squidwarc/lib/crawler/chrome.js:18:35)
    at Module._compile (internal/modules/cjs/loader.js:688:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:699:10)
    at Module.load (internal/modules/cjs/loader.js:598:32)
    at tryModuleLoad (internal/modules/cjs/loader.js:537:12)
    at Function.Module._load (internal/modules/cjs/loader.js:529:3)

What is the expected behavior?

A crawl to be started (per Usage) or some more information to get started.

What's your environment?

  • Squidwarc d7792b8 (latest master)
  • macOS 10.14.1
  • nodejs v10.12.0

Support a simple list of URIs

Are you submitting a bug report or a feature request?

Feature request.

What is the current behavior?

Reads a JSON list of URIs as input.

What is the expected behavior?

Also allow a simple CRLF-delimited list of URIs.

What's your environment?

N/A

Other information

Scenario put forth by @ibnesayeed but I can imagine this being useful for others.
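
As an illustration of the request (this is not an existing Squidwarc feature, and the file names below are placeholders), a newline-delimited seed file could be folded into the current JSON config with a few lines of Node.js:

// makeConfig.js - hypothetical helper, not part of Squidwarc.
// Reads a plain CRLF/LF-delimited list of URIs and writes them into the
// "seeds" array of an existing crawl config.
const fs = require('fs')

const seeds = fs.readFileSync('seeds.txt', 'utf8')
  .split(/\r?\n/)           // accept CRLF or LF line endings
  .map(line => line.trim())
  .filter(line => line.length > 0)

const conf = JSON.parse(fs.readFileSync('conf.json', 'utf8'))
conf.seeds = seeds
fs.writeFileSync('conf.json', JSON.stringify(conf, null, 2))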

add support for one big warc file

When crawling a lot of pages or whole domains, there will be millions of URLs and, with them, millions of WARC files.

There should be support for crawling everything into one WARC file with a fixed maximum size, configurable in the JSON config file.
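
For what it's worth, the sample config in the README above already exposes a warc.append flag that saves all preserved data for a crawl to a single WARC rather than one WARC per page; what this request adds is the configurable maximum file size, which that flag does not appear to cover. The relevant fragment of the sample config:

  "warc": {
    "naming": "url",
    "append": true
  }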

Fatal error when starting crawl

Are you submitting a bug report or a feature request?

Bug report

What is the current behavior?

I get an error when I try to start a crawl. This is the error:

Running Crawl From Config File configurations/social-media.json
Crawler Operating In undefined mode
Crawler Will Be Preserving 2 Seeds
Crawler Will Be Generating WARC Files Using the filenamified url
Crawler Generated WARCs Will Be Placed At warcs
Crawler Is Connecting To Chrome On Host localhost
Crawler Is Connecting To Chrome On Port 9222
Crawler Will Be Waiting At Maximum For Navigation To Happen For 8s
Crawler Will Be Waiting After For 2 inflight requests
Crawler Will Be Generating WARC Files Using the filenamified url
Crawler Will Be Generating WARC Files Using the filenamified url
A Fatal Error Occurred
  TypeError: Cannot read property 'length' of undefined

  - chromeFinder.js:275 Function.findChromeDarwin
    /Users/nastasia/Developer/Squidwarc/lib/launcher/chromeFinder.js:275:20

  - chrome.js:90 async Function.launch
    /Users/nastasia/Developer/Squidwarc/lib/launcher/chrome.js:90:28

  - chrome.js:143 async ChromeCrawler.init
    /Users/nastasia/Developer/Squidwarc/lib/crawler/chrome.js:143:22

  - chromeRunner.js:143 async chromeRunner
    /Users/nastasia/Developer/Squidwarc/lib/runners/chromeRunner.js:143:3

  - index.js:31 async runner
    /Users/nastasia/Developer/Squidwarc/lib/runners/index.js:31:5

This is my configuration file:

{
	"mode": "page-only",
	"depth": 1,
	"seeds": [
		"http://www.facebook.com/nastyvdp",
		"http://www.twitter.com/nvanderperren"
	],
	"warc": {
		"naming": "url",
		"append": "true",
		"output": "warcs"
	},
	"connect": {
		"launch": true,
		"host": "localhost",
		"port": 9222,
		"userDataDir": "/Users/nastasia/Library/Application Support/Google/Chrome"
	},
	"crawlControl": {
		"globalWait": 60000,
		"inflightIdle": 1000,
		"numInflight": 2,
		"navWait": 8000
	}
}	

Because it says that mode is undefined, I also placed mode under crawlControl as suggested in issue #50, but that doesn't solve the issue.

What is the expected behavior?

A starting crawl.

What's your environment?

node v14.12.0
Squidwarc: current master
macOS High Sierra 10.13.6
Chrome version 86.0.4240.80 (official build) (x86_64)

Other information

I don't have this issue if I use puppeteer.

Exception with defaults using Docker in Puppeteer config

Are you submitting a bug report or a feature request?

Bug report.

What is the current behavior?

Exception thrown with defaults using Docker and latest master (a2f1d63). I pulled the repo, changed the directory in the compose file to the working directory root (/tmp/Squidwarc), ran docker-compose build then docker-compose up.

I received the exception:

squidwarc    | Crawler Will Be Generating WARC Files Using the filenamified url
squidwarc    | A Fatal Error Occurred
squidwarc    |   TypeError: Cannot read property 'Disconnected' of undefined
squidwarc    |
squidwarc    |   - puppeteer.js:116 PuppeteerCrawler.init
squidwarc    |     /Squidwarc/lib/crawler/puppeteer.js:116:37
squidwarc    |
squidwarc    |   - next_tick.js:81 processTicksAndRejections
squidwarc    |     internal/process/next_tick.js:81:5
squidwarc    |
squidwarc    |
squidwarc    | Please Inform The Maintainer Of This Project About It. Information In package.json

What is the expected behavior?

For the crawl with the default configuration to complete.

What's your environment?

node v10.12.0 (though may be moot due to Docker), macOS 10.14.2, Squidwarc a2f1d63 (latest master), Docker 18.09.0

TypeError: input.on is not a function

Are you submitting a bug report or a feature request?

Bug report

What is the current behavior?

Tried setting up a Docker image based on the zenika/alpine-chrome image and copied over Squidwarc. Headless Chrome starts correctly, but when trying to run the included conf.json crawl script I get the following error:

/usr/src/app/Squidwarc $ node --harmony index.js -c conf.json
Running Crawl From Config File conf.json
Crawler Operating In page-only mode
Crawler Will Be Preserving 1 Seeds
Crawler Will Be Generating WARC Files Using the filenamified url
Crawler Generated WARCs Will Be Placed At /usr/src/app/Squidwarc
Crawler Is Connecting To Chrome On Host localhost
Crawler Is Connecting To Chrome On Port 9222
Crawler Will Be Waiting At Maximum For Navigation To Happen For 8s
Crawler Will Be Waiting After For 2 inflight requests
A Fatal Error Occurred
  TypeError: input.on is not a function

  - readline.js:189 new Interface
    readline.js:189:11

  - readline.js:69 Object.createInterface
    readline.js:69:10

  - launcher.js:436 Promise
    /usr/src/app/Squidwarc/lib/crawler/launcher.js:436:25

  - new Promise

  - launcher.js:435 waitForWSEndpoint
    /usr/src/app/Squidwarc/lib/crawler/launcher.js:435:10

  - launcher.js:255 Function.launch
    /usr/src/app/Squidwarc/lib/crawler/launcher.js:255:31


Please Inform The Maintainer Of This Project About It. Information In package.json
events.js:167
      throw er; // Unhandled 'error' event
      ^

What is the expected behavior?

The crawl script should execute correctly.

What's your environment?

Alpine Linux 3.7

/usr/src/app/Squidwarc $ node --version
v10.6.0

/usr/src/app/Squidwarc $ npm --version
6.1.0

/usr/src/app/Squidwarc $ chromium-browser --version
Chromium 64.0.3282.168

Other information

Thank you for what looks like a great project!

Feature request: ignore sets

Are you submitting a bug report or a feature request?

Feature request

What is the current behavior?

Squidwarc can't control which URLs should be ignored (e.g. other language versions of a site) when adding to the seed list or the WARC file.

What is the expected behavior?

Being able to control which URLs should be ignored, before and/or while the crawl is running.
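
Purely as an illustration of the request (Squidwarc does not expose this today, and the pattern list below is hypothetical), an ignore set could be a list of patterns checked before a discovered URL is queued:

// Hypothetical link filter - not part of Squidwarc's current API.
// Drops any discovered URL matching one of the configured ignore patterns,
// e.g. other language versions of a site.
const ignorePatterns = [
  /\/de\//,             // German-language pages
  /[?&]lang=(?!en)/     // any non-English ?lang= parameter
]

function shouldQueue (url) {
  return !ignorePatterns.some(pattern => pattern.test(url))
}

// shouldQueue('https://example.com/de/page')  -> false
// shouldQueue('https://example.com/en/page')  -> true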

Add docker file

Are you submitting a bug report or a feature request?

feature request

What is the current behavior?

Create MWE Dockerfile for others to use

Make option for gzipped WARCs?

Feature request: make an option for gzipped WARCs.

Not sure if this should be a feature or if users should gzip WARC files outside of Squidwarc/node-warc?

What is the current behavior?

The WARCs created are currently not gzipped.
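
Until such an option exists, gzipping the generated files after a crawl is straightforward with Node's built-in zlib module. A minimal sketch (file names are placeholders; note this gzips the file as a whole, whereas some tools expect record-at-a-time gzip members):

// gzip-warc.js - compress an existing WARC outside of Squidwarc.
const fs = require('fs')
const zlib = require('zlib')
const { pipeline } = require('stream')

pipeline(
  fs.createReadStream('example.warc'),
  zlib.createGzip(),
  fs.createWriteStream('example.warc.gz'),
  err => {
    if (err) console.error('gzip failed', err)
  }
)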

Generated WARC files fail indexing with OpenWayback cdx-indexer

Are you submitting a bug report or a feature request?

bug report

What is the current behavior?

  1. Set up an instance of Openwayback following their Docker tutorial.
  2. Got the latest version of Squidwarc installed and ran a crawl with the sample conf.json config.
  3. Moved the resulting WARC-file to the Openwayback files directory.
  4. Ran the cdx-indexer command to index the WARC from inside the openwayback Docker container. cdx-indexer /data/files1/instagram.com\!visit_berlin-10-21-2018_1540123477677.warc > /data/index1.cdx
  5. Received error as per below:
Oct 21, 2018 12:08:23 PM org.archive.io.ArchiveReader$ArchiveRecordIterator hasNext
WARNING: Trying skip of failed record cleanup of {reader-identifier=/data/files1/instagram.com!visit_berlin-10-21-2018_1540123477677.warc, absolute-offset=0, WARC-Filename=instagram.com!visit_berlin-10-21-2018_1540123477677.warc, WARC-Date=2018-10-21T12:04:37Z, Content-Length=379, WARC-Record-ID=<urn:uuid:7b817fe0-d529-11e8-8ec6-391ee1fffcc8>, WARC-Type=warcinfo, Content-Type=application/warc-fields}: **Unexpected character 57(Expecting d)**
java.io.IOException: Unexpected character 57(Expecting d)
	at org.archive.io.warc.WARCReader.readExpectedChar(WARCReader.java:80)
	at org.archive.io.warc.WARCReader.gotoEOR(WARCReader.java:70)
	at org.archive.io.ArchiveReader.cleanupCurrentRecord(ArchiveReader.java:176)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.hasNext(ArchiveReader.java:449)
	at org.archive.wayback.resourcestore.indexer.ArchiveReaderCloseableIterator.hasNext(ArchiveReaderCloseableIterator.java:37)
	at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
	at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
	at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
	at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:216)

What is the expected behavior?

The captured WARC file should be properly indexed and replayable.

What's your environment?

Squidwarc: 945e4de / 1.0
Openwayback: c49f8e720 / openwayback-core-2.4.0-SNAPSHOT

Describe requirements in the README

Are you submitting a bug report or a feature request?

Feature request/documentation enhancement

What is the current behavior?

The README is insufficient with regard to the requirements and dependencies a user needs to get up and running. I encountered this when trying to resolve #31 on a fresh Win10 VM with the Linux extensions. npm is not installed by default. The recommendation is to use the bootstrap script, which relies on npm and git. I know how to resolve this, but this barrier may cause some users to give up.

What is the expected behavior?

Provide all the information a user needs to get going. Maybe add a setup section at the top level of the README, at the same scope as Usage, with some more details to help potential users.

What's your environment?

Win10 (for this test)

Trouble running on macOS

./Google\ Chrome.app/Contents/MacOS/Google\ Chrome --headless --disable-gpu --remote-debugging-port=9222

...starts fine (per run-chrome.sh). In a second tab I try:

./run-crawler.sh -c conf.json from a fresh pull of the repo (db9e923) and receive the following output.

$ ./run-crawler.sh -c conf.json 
Running Crawl From Config File conf.json
Crawler Operating In page-only mode
Crawler Will Be Preserving 1 Seeds
Crawler Will Be Generating WARC Files Using the filenamified url
Crawler Generated WARCs Will Be Placed At /Users/machawk1/Downloads/Squidwarc
Crawler Is Connecting To Chrome On Host localhost
Crawler Is Connecting To Chrome On Port 9222
Crawler Will Be Waiting At Maximum For Navigation To Happen For 8s
Crawler Will Be Waiting After Page Load For 7s
Crawler Encountered A Random Error
  Error: getaddrinfo ENOTFOUND localhost localhost:9222
  
  - dns.js:28 errnoException
    dns.js:28:10
  
  - dns.js:76 GetAddrInfoReqWrap.onlookup [as oncomplete]
    dns.js:76:26

Google Chrome v59.0.3071.115
Node v7.4.0
macOS 10.12.5

Squidwarc on Windows 10 (using npm)

I have trouble installing Squidwarc on Windows 10 (using npm).

node-warc directory remains empty - is that correct?

λ node index.js -c conf.json
module.js:550
    throw err;
    ^

Error: Cannot find module '../../node-warc/lib/requestCapturers/remoteChrome'
    at Function.Module._resolveFilename (module.js:548:15)
    at Function.Module._load (module.js:475:25)
    at Module.require (module.js:597:17)
    at require (internal/module.js:11:18)
    at Object.<anonymous> (C:\Users\m\Projekty\crawler\squidwarc\lib\crawler\chrome.js:18:24)
    at Module._compile (module.js:653:30)
    at Object.Module._extensions..js (module.js:664:10)
    at Module.load (module.js:566:32)
    at tryModuleLoad (module.js:506:12)
    at Function.Module._load (module.js:498:3)


Also tried installing with yarn - the same effect.

Syntax error with awk usage on Mac

Are you submitting a bug report or a feature request?

๐Ÿ›

What is the current behavior?

run-crawler.sh crashes on macOS, citing a syntax error with the awk command:

/System/Library/Frameworks/CoreServices.framework/Versions/A/Frameworks/LaunchServices.framework/Versions/A/Support/lsregister -dump | grep -i 'google chrome\( canary\)\?.app$' | awk '{$1="" print $0}'
A Fatal Error Occurred
  Error: Command failed: /System/Library/Frameworks/CoreServices.framework/Versi  ons/A/Frameworks/LaunchServices.framework/Versions/A/Support/lsregister -dump   | grep -i 'google chrome\( canary\)\?.app$' | awk '{$1="" print $0}'
  awk: syntax error at source line 1
   context is
  {$1="" >>>  print <<<  $0}
  awk: illegal statement at source line 1

I am running sh run-crawler.sh -c conf.json. In launcher.js around line 144 there is a command:

let str = await exec(`${LSREGISTER} -dump | grep -i 'google chrome\\( canary\\)\\?.app$' | awk '{$1="" print $0}'`)

On my machine the value that's piped to awk after LSREGISTER resolution is:
/System/Library/Frameworks/CoreServices.framework/Versions/A/Frameworks/LaunchServices.framework/Versions/A/Support/lsregister -dump | grep -i 'google chrome\( canary\)\?.app$'

Running just this (without the final awk piping) produces:

 grep -i 'google chrome\( canary\)\?.app$'
	path:          /Applications/Google Chrome.app

Running awk '{$1="" print $0}' alone without any piping produces a similar error:

$ awk '{$1="" print $0}'
awk: syntax error at source line 1
 context is
	{$1="" >>>  print <<<  $0}
awk: illegal statement at source line 1

What is the expected behavior?

Value is properly extracted using awk.

What's your environment?

macOS 10.13.4, squidwarc 3a44e7c, node v8.9.4

Other information

My awk is rusty!

Original Response headers (i.e., start with X-Archive-Orig-...) are modified

Are you submitting a bug report or a feature request?

A bug report.

What is the current behavior?

Generate a WARC file for https://web.archive.org/web/20170705235134/http://www.cs.odu.edu/~maturban/ .

What is the expected behavior?

The Response headers from requesting https://web.archive.org/web/20170705235134/http://www.cs.odu.edu/~maturban/ should be as follows:

Content-Encoding: gzip
X-App-Server: wwwb-app42
X-location: All
Transfer-Encoding: chunked
X-Archive-Playback: 0
X-Archive-Orig-vary: Accept-Encoding
Memento-Datetime: Wed, 05 Jul 2017 23:51:34 GMT
X-ts: ----
X-Archive-Orig-server: nginx
Server: Tengine/2.1.0
X-Archive-Guessed-Charset: utf-8
Content-Type: text/html; charset=utf-8
Connection: keep-alive
X-Page-Cache: MISS
X-Archive-Orig-connection: close
X-Archive-Orig-date: Wed, 05 Jul 2017 23:51:39 GMT
X-Archive-Orig-content-length: 11603
Link: <http://www.cs.odu.edu/~maturban/>; rel="original", <https://web.archive.org/web/timemap/link/http://www.cs.odu.edu/~maturban/>; rel="timemap"; type="application/link-format", <https://web.archive.org/web/http://www.cs.odu.edu/~maturban/>; rel="timegate", <https://web.archive.org/web/20140917205517/http://www.cs.odu.edu/~maturban/>; rel="first memento"; datetime="Wed, 17 Sep 2014 20:55:17 GMT", <https://web.archive.org/web/20170614104612/http://www.cs.odu.edu/~maturban/>; rel="prev memento"; datetime="Wed, 14 Jun 2017 10:46:12 GMT", <https://web.archive.org/web/20170705235134/http://www.cs.odu.edu/~maturban/>; rel="memento"; datetime="Wed, 05 Jul 2017 23:51:34 GMT", <https://web.archive.org/web/20170710100858/http://www.cs.odu.edu/~maturban>; rel="next memento"; datetime="Mon, 10 Jul 2017 10:08:58 GMT", <https://web.archive.org/web/20170710100917/http://www.cs.odu.edu/~maturban/>; rel="last memento"; datetime="Mon, 10 Jul 2017 10:09:17 GMT"

But we got:

Date: Mon, 14 Aug 2017 03:41:43 GMT
X-App-Server: wwwb-app42
X-location: All
Transfer-Encoding: chunked
X-Archive-Playback: 0
X-Archive-Orig-vary: Accept-Encoding
Memento-Datetime: Wed, 05 Jul 2017 23:51:34 GMT
X-ts: ----
X-Archive-Orig-server: nginx
Server: Tengine/2.1.0
X-Archive-Guessed-Charset: utf-8
Content-Type: text/html; charset=utf-8
Connection: keep-alive
X-Page-Cache: MISS
X-Archive-Orig-connection: close
X-Archive-Orig-date: Wed, 05 Jul 2017 23:51:39 GMT
X-Archive-Orig-Content-Length: 22495
Link: <http://www.cs.odu.edu/~maturban/>; rel="original", <https://web.archive.org/web/timemap/link/http://www.cs.odu.edu/~maturban/>; rel="timemap"; type="application/link-format", <https://web.archive.org/web/http://www.cs.odu.edu/~maturban/>; rel="timegate", <https://web.archive.org/web/20140917205517/http://www.cs.odu.edu/~maturban/>; rel="first memento"; datetime="Wed, 17 Sep 2014 20:55:17 GMT", <https://web.archive.org/web/20170614104612/http://www.cs.odu.edu/~maturban/>; rel="prev memento"; datetime="Wed, 14 Jun 2017 10:46:12 GMT", <https://web.archive.org/web/20170705235134/http://www.cs.odu.edu/~maturban/>; rel="memento"; datetime="Wed, 05 Jul 2017 23:51:34 GMT", <https://web.archive.org/web/20170710100858/http://www.cs.odu.edu/~maturban>; rel="next memento"; datetime="Mon, 10 Jul 2017 10:08:58 GMT", <https://web.archive.org/web/20170710100917/http://www.cs.odu.edu/~maturban/>; rel="last memento"; datetime="Mon, 10 Jul 2017 10:09:17 GMT"

The issue is that the value of one of the original Response headers (i.e., X-Archive-Orig-content-length) has been changed from 11603 to 22495. In general, I think none of the original Response headers (i.e., those starting with "X-Archive-Orig-...") should be modified.

What's your environment?

macOS Sierra

Other information

I think the issue is from the following lines of code:
File: .../node-modules/node-warc/lib/writers/remoteChrome.js
Lines: 767 and 768
The code:

          responseHeaders = responseHeaders.replace(noGZ, '')
          responseHeaders = responseHeaders.replace(replaceContentLen, `Content-Length: ${Buffer.byteLength(resData, 'utf8')}${CRLF}`)
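
A plausible direction for a fix (a sketch only; the actual noGZ and replaceContentLen expressions live in node-warc and are not shown here) is to anchor the content-length match to the start of a header line, so that prefixed copies such as X-Archive-Orig-content-length are left alone:

// Sketch: rewrite only a Content-Length header that starts its own line.
// responseHeaders is assumed to be the raw CRLF-separated header block.
function fixContentLength (responseHeaders, resData) {
  const replaceContentLen = /^Content-Length:.*\r\n/im
  return responseHeaders.replace(
    replaceContentLen,
    `Content-Length: ${Buffer.byteLength(resData, 'utf8')}\r\n`
  )
}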

Error: options.stripFragment is renamed to options.stripHash

Are you submitting a bug report or a feature request?

Bug report

What is the current behavior?

Error while capturing some sites with Squidwarc:

A Fatal Error Occurred
  Error: options.stripFragment is renamed to options.stripHash
  
  - index.js:35 module.exports
    [Squidwarc]/[normalize-url]/index.js:35:9
  
  - _createHybrid.js:87 wrapper
    [Squidwarc]/[lodash]/_createHybrid.js:87:15
  
  - puppeteer.js:155 PuppeteerCrawler.navigate
    /home/sian1468/Squidwarc/lib/crawler/puppeteer.js:155:11
  

Please Inform The Maintainer Of This Project About It. Information In package.json

Example config and logs

What is the expected behavior?

Squidwarc working correctly.

What's your environment?

Fedora 29
node.js v11.4.0
Squidwarc 9bbc461

Setting mode in conf.json is ignored

Are you submitting a bug report or a feature request?

Bug report.

What is the current behavior?

Changing the mode from page-only to page-same-domain as described in the manual won't change the search behavior.

{
  "use": "puppeteer",
  "headless": true,
  "script": "./userFns.js",
  "mode": "page-same-domain",
  "depth": 5,
  "seeds": [
    "...."
  ],
  "warc": {
    "naming": "url",
    "append": "true"
  },
  "connect": {
    "launch": true,
    "host": "localhost",
    "port": 9222
  },
  "crawlControl": {
    "globalWait": 60000,
    "inflightIdle": 1000,
    "numInflight": 2,
    "navWait": 8000
  }
}

What is the expected behavior?

The search mode should change.

How to fix.

If I write the search mode under crawlControl it does change.
{
  "use": "puppeteer",
  "headless": true,
  "script": "./userFns.js",
  "seeds": [
    "....."
  ],
  "warc": {
    "naming": "url",
    "append": "true"
  },
  "connect": {
    "launch": true,
    "host": "localhost",
    "port": 9222
  },
  "crawlControl": {
    "globalWait": 60000,
    "inflightIdle": 1000,
    "numInflight": 2,
    "navWait": 8000,
    "mode": "page-same-domain",
    "depth": 5
  }
}
In the file config.yml the mode is also listed under crawlControl, but it's not in the manual.

Installation fails with missing node-warc submodule

Are you submitting a bug report or a feature request?

Bug report.

What is the current behavior?

Installing as per the instructions gives the following error in the bootstrap.sh script (relating to the node-warc submodule):

Submodule 'node-warc' (https://github.com/N0taN3rd/node-warc.git) registered for path 'node-warc'
Cloning into '/Squidwarc/node-warc'...
error: Server does not allow request for unadvertised object 0de56e6628d1e0e8d18cb9e772ae7871bd8cd926
Fetched in submodule path 'node-warc', but it did not contain 0de56e6628d1e0e8d18cb9e772ae7871bd8cd926. Direct fetching of that commit failed.

What is the expected behavior?

Installation proceeds normally.

Allow crawl config to contain user scripts for a page

Are you submitting a bug report or a feature request?

Feature request

What is the current behavior?

Currently a page is loaded and put into a WARC. It would be great if a crawl config could allow for one or more user scripts to be run by the crawler after page load. This means a user could specify scrolling and click events (e.g. expand comments) to capture a page even better.
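
The script field in the README's sample config above appears to address this. As an illustration of the behaviour described here (selectors and timings are examples only, and the function signature is assumed to match userFns.js), such a script could scroll and expand comments after page load:

// Hypothetical post-load user script: scroll and expand comments.
module.exports = async function (page) {
  // Scroll down a few screens to trigger lazy loading.
  for (let i = 0; i < 5; i++) {
    await page.evaluate(() => window.scrollBy(0, window.innerHeight))
    await new Promise(resolve => setTimeout(resolve, 500))
  }
  // Click every "expand comments" style control, if present
  // ('.expand-comments' is an example selector).
  const expanders = await page.$$('.expand-comments')
  for (const el of expanders) {
    await el.click().catch(() => {})
  }
}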

Instagram pages not recorded properly?

Are you submitting a bug report or a feature request?

bug report

What is the current behavior?

Tried to capture an Instagram page (https://www.instagram.com/visit_berlin/) using the following config:

    {
      "jobid": "973f4eee0c103ddcb3dc1e7d839630d0",
      "headless": true,
      "mode": "page-only",
      "depth": 1,
      "seeds": [
        "https://www.instagram.com/visit_berlin/"
      ],
      "warc": {
        "naming": "url",
        "output": "/archive/973f4eee0c103ddcb3dc1e7d839630d0"
      },
      "connect": {
        "launch": false,
        "host": "localhost",
        "port": 9222
      },
      "crawlControl": {
        "globalWait": 60000,
        "inflightIdle": 1000,
        "numInflight": 2,
        "navWait": 8000
      }
    }

The capture seems to work correctly, but the resulting WARC cannot be played back properly (images are missing). I cannot tell whether the images have been recorded properly in the WARC. Maybe a problem when saving the images?

This is how it looks in Webrecorder Player and pywb: [screenshot not reproduced]

What is the expected behavior?

Captured Instagram page should be able to play back with images.

Running with Chromium 64.0.3282.168 on Alpine Linux 3.7

Add node-warc as a sub-module

To facilitate ease of development for both Squidwarc and node-warc, make node-warc a submodule.
node-warc's remote Chrome request capturer and WARC generator are developed using Squidwarc.
Kill two birds with one stone.

Feature request: set browser accept language

When running Squidwarc on server hosts in other countries, websites will sometimes present the UI in the language relating to the IP address range of the server host (e.g. when I run archiving of Facebook pages from a server in Germany, it presents the Facebook interface in German). If it were possible to set the Chrome accept-language parameter from the job JSON, it would give more control to the archiver.
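
This does not appear to be configurable from the job JSON today. As one possible direction (illustrative only, not Squidwarc's API), Puppeteer can force the header per page with setExtraHTTPHeaders; the language value below is an example:

// Sketch: force the Accept-Language header for pages opened with Puppeteer.
const puppeteer = require('puppeteer')

async function openWithLanguage (url, lang = 'en-US,en;q=0.9') {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.setExtraHTTPHeaders({ 'Accept-Language': lang })
  await page.goto(url)
  return { browser, page }
}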

TypeError in node-warc causes Squidwarc to crash

Are you submitting a bug report or a feature request?

Bug report.

What is the current behavior?

Running Squidwarc 1a19eed (latest master) using docker-compose up. After a long process, I receive a TypeError:

Attaching to squidwarc
squidwarc    | Running Crawl From Config File warcs/conf.json
squidwarc    | With great power comes great responsibility!
squidwarc    | Squidwarc is not responsible for ill behaved user supplied scripts!
squidwarc    |
squidwarc    | Crawler Operating In page-only mode
squidwarc    | Crawler Will Be Preserving 1 Seeds
squidwarc    | Crawler Will Be Generating WARC Files Using the filenamified url
squidwarc    | Crawler Generated WARCs Will Be Placed At /Squidwarc
squidwarc    | Crawler Navigating To https://www.instagram.com/visit_berlin/
squidwarc    | Crawler Navigated To https://www.instagram.com/visit_berlin/
squidwarc    | Running user script
squidwarc    | Crawler Generating WARC
squidwarc    | A Fatal Error Occurred
squidwarc    |   TypeError: Cannot read property 'software' of undefined
squidwarc    |
squidwarc    |   - warcWriterBase.js:204 PuppeteerCDPWARCGenerator.writeWarcInfoRecord
squidwarc    |     [Squidwarc]/[node-warc]/lib/writers/warcWriterBase.js:204:15
squidwarc    |
squidwarc    |   - puppeteer.js:249 PuppeteerCrawler.genWarc
squidwarc    |     /Squidwarc/lib/crawler/puppeteer.js:249:31
squidwarc    |
squidwarc    |   - puppeteerRunner.js:72 puppeteerRunner
squidwarc    |     /Squidwarc/lib/runners/puppeteerRunner.js:72:21
squidwarc    |
squidwarc    |   - next_tick.js:68 process._tickCallback
squidwarc    |     internal/process/next_tick.js:68:7
squidwarc    |
squidwarc    |
squidwarc    | Please Inform The Maintainer Of This Project About It. Information In package.json
squidwarc exited with code 0

What is the expected behavior?

To not crash.

What's your environment?

macOS 10.14.2, node.js 10.12.0, Docker 18.09.0, Squidwarc 1a19eed

Other information

This is an odd error to crash on:

    if (winfo.software == null) {
      winfo.software = `node-warc/${this._version}`
    }

Is this the most reliable way to check if the software property is defined on winfo? Isolating this code in a separate file and fabricating an example does not crash node, but executes gracefully, so I think something else is at play.
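
Since the TypeError is "Cannot read property 'software' of undefined", it is winfo itself, not winfo.software, that is undefined when this runs, which would explain why an isolated, fabricated example executes gracefully. A defensive sketch (illustrative, not the actual node-warc fix):

// Sketch: guard against winfo itself being undefined, not just the
// software property.
function ensureSoftware (winfo, version) {
  const info = winfo || {}
  if (info.software == null) {
    info.software = `node-warc/${version}`
  }
  return info
}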

Recursion

It's not clear from the docs whether page-same-domain is supposed to crawl recursively. From my tests it doesn't seem to, but it's possible I'm doing something wrong. If not, then consider this a feature request for recursion.

Feature request: following links

Are you submitting a bug report or a feature request?

Feature request

What is the current behavior?

Squidwarc can't be configured to follow links if a URL is on the same domain as the seeds.

What is the expected behavior?

Squidwarc can be configured to follow links if a URL is on the same domain as the seeds.
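
For context, the out-of-the-box crawl modes listed in the README above include Page + Same Domain Links, which looks like what is being asked for; selecting it is a matter of the mode field and, per the "Setting mode in conf.json is ignored" issue above, it may need to sit under crawlControl, e.g.:

  "crawlControl": {
    "globalWait": 60000,
    "inflightIdle": 1000,
    "numInflight": 2,
    "navWait": 8000,
    "mode": "page-same-domain",
    "depth": 1
  }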

Append config does not seem to work correctly

Are you submitting a bug report or a feature request?

bug report

What is the current behavior?

Test capture with append config:

{
  "use": "puppeteer",
  "headless": true,
  "script": "./userFns.js",
  "mode": "page-all-links",
  "depth": 1,
  "seeds": [
    "https://www.instagram.com/visit_berlin/"
  ],
  "warc": {
    "naming": "url",
   "append": true
  },
  "connect": {
    "launch": true,
    "host": "localhost",
    "port": 9222
  },
  "crawlControl": {
    "globalWait": 60000,
    "inflightIdle": 1000,
    "numInflight": 2,
    "navWait": 8000
  }
}

After the capture finished (logs), I opened the result (instagram.com_visit_berlin-12-11-2018_1544502298260.zip) with Webrecorder Player, but it shows an empty WARC file:

[screenshot: webrecorder_player_2018-12-11_11-55-23]

What is the expected behavior?

A capture with the append config should produce a correct WARC file.

What's your environment?

Windows 10 v1809
node.js v10.14.1
Squidwarc a402335
