js-wacz

JavaScript module and CLI tool for working with web archive data using the WACZ format specification, similar to Webrecorder's py-wacz.

It can be used to combine a set of .warc / .warc.gz files into a single .wacz file:

... programmatically (Node.js):

import { WACZ } from '@harvard-lil/js-wacz'

const archive = new WACZ({ 
  input: 'collection/*.warc.gz', 
  output: 'collection.wacz',
})

await archive.process() // "collection.wacz" is ready!

... or via the command line:

js-wacz create -f "collection/*.warc.gz" -o "collection.wacz"

js-wacz makes use of workers to process as many WARC files in parallel as the host machine can handle.

Install

js-wacz requires Node.js 18+.

npm can be used to install this package and make the js-wacz command accessible system-wide:

npm install -g @harvard-lil/js-wacz


CLI: create command

The create command helps combine one or multiple .warc or .warc.gz files into a single .wacz file.

js-wacz create -f "collection/*.warc.gz" -o "collection.wacz"

js-wacz accepts the following options and arguments for customizing how the WACZ file is assembled.

--file, -f

This is the only required argument. It indicates which file(s) should be processed and added to the resulting WACZ file.

The target can be a single file, or a glob pattern such as folder/*.warc.gz.

# Single file:
js-wacz create --file archive.warc
# Collection:
js-wacz create --file "collection/*.warc"

Note: When using globs, surround the path with quotation marks so that the shell passes the pattern to js-wacz unexpanded.

--output, -o

Specify where the resulting .wacz file should be created, and what its filename should be.

Defaults to archive.wacz in the current directory if not provided.

js-wacz create --file cool-beans.warc --output cool-beans.wacz

--pages, -p

Pass a specific pages.jsonl file.

If not provided, js-wacz attempts to detect pages in WARC records to build its own pages.jsonl index.

js-wacz create -f "collection/*.warc.gz" --pages collection/pages.jsonl
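To give a rough idea of what a pages.jsonl index contains, the sketch below serializes a list of pages using the line-delimited JSON layout described by the WACZ specification. The buildPagesJsonl helper, the header values, the sequential ids, and the sample page are all assumptions made for illustration; js-wacz's own detection and serialization logic may differ.

```javascript
// Hypothetical helper: not part of js-wacz's API.
// Serializes page entries as JSON Lines: one header line, then one line per page.
function buildPagesJsonl (pages) {
  // Header record (values assumed from the WACZ specification).
  const header = { format: 'json-pages-1.0', id: 'pages', title: 'All Pages' }
  const lines = [JSON.stringify(header)]

  pages.forEach((page, index) => {
    lines.push(JSON.stringify({
      id: String(index + 1), // Placeholder identifier; real tools typically use UUIDs
      url: page.url,
      ts: new Date(page.ts).toISOString(),
      title: page.title ?? page.url // Fall back to the URL when no title is known
    }))
  })

  return lines.join('\n') + '\n'
}

// Sample page, made up for illustration.
const pagesJsonl = buildPagesJsonl([
  { url: 'https://lil.law.harvard.edu/', ts: '2023-02-22T12:00:00.000Z', title: 'Library Innovation Lab' }
])

console.log(pagesJsonl)
```

Each line is an independent JSON document, which is what allows such a file to be read or appended to one entry at a time.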

--cdxj

Pass a directory of existing CDXJ files, rather than indexing from WARCs. Must be used in combination with --pages.

js-wacz create -f "collection/*.warc.gz" --pages collection/pages.jsonl --cdxj collection/indexes/
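For readers unfamiliar with CDXJ, the sketch below parses a single index line, assuming the common layout of a sortable URL key, a 14-digit timestamp, and a JSON block separated by spaces. The parseCdxjLine helper and the sample entry are hypothetical and only illustrate the file format; they are not how js-wacz reads these files.

```javascript
// Hypothetical helper: not part of js-wacz's API.
// Splits a CDXJ line into its three space-separated parts:
// a sortable URL key, a timestamp, and a JSON block.
function parseCdxjLine (line) {
  const firstSpace = line.indexOf(' ')
  const secondSpace = line.indexOf(' ', firstSpace + 1)

  if (firstSpace === -1 || secondSpace === -1) {
    throw new Error('Malformed CDXJ line')
  }

  return {
    key: line.slice(0, firstSpace),
    timestamp: line.slice(firstSpace + 1, secondSpace),
    data: JSON.parse(line.slice(secondSpace + 1)) // Everything after the second space is JSON
  }
}

// Sample entry, made up for illustration.
const entry = parseCdxjLine(
  'edu,harvard,law,lil)/ 20230222120000 {"url": "https://lil.law.harvard.edu/", "mime": "text/html", "status": "200", "filename": "archive.warc.gz"}'
)

console.log(entry.key, entry.timestamp, entry.data.url)
```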

--url

If provided, will be used as the mainPageUrl attribute for datapackage.json.

Must be a valid URL.

js-wacz create -f "collection/*.warc.gz" --url "https://lil.law.harvard.edu"
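When the value for --url comes from user input, it can be checked before js-wacz is invoked. The sketch below is one possible pre-flight check built on the standard URL constructor; isValidUrl is a hypothetical helper, and js-wacz's own validation may be stricter or looser.

```javascript
// Hypothetical pre-flight check: not part of js-wacz's API.
// The WHATWG URL constructor throws a TypeError on input it cannot parse.
function isValidUrl (value) {
  try {
    return Boolean(new URL(value))
  } catch {
    return false
  }
}

console.log(isValidUrl('https://lil.law.harvard.edu')) // true
console.log(isValidUrl('lil.law.harvard.edu')) // false: no scheme
```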

--ts

If provided, will be used as the mainPageDate attribute for datapackage.json.

Can be any value that can be parsed by JavaScript's Date() constructor.

js-wacz create -f "collection/*.warc.gz" --ts "2023-02-22T12:00:00.000Z"
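Because the Date() constructor silently produces an "Invalid Date" object for input it cannot parse, a timestamp can be checked ahead of time. The sketch below shows one way to do so; toIsoTimestamp is a hypothetical helper, and normalizing to an ISO 8601 string is an assumption about what is convenient, not a description of js-wacz's internals.

```javascript
// Hypothetical helper: not part of js-wacz's API.
// Parses a value with Date() and rejects anything that yields an invalid date.
function toIsoTimestamp (value) {
  const date = new Date(value)

  if (Number.isNaN(date.getTime())) {
    throw new Error(`Cannot parse "${value}" as a date`)
  }

  return date.toISOString()
}

console.log(toIsoTimestamp('2023-02-22T12:00:00.000Z')) // 2023-02-22T12:00:00.000Z
console.log(toIsoTimestamp(1677067200000)) // 2023-02-22T12:00:00.000Z (milliseconds since epoch)
```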

--title

If provided, will be used as the title attribute for datapackage.json.

js-wacz create -f "collection/*.warc.gz" --title "My collection."

--desc

If provided, will be used as the description attribute for datapackage.json.

js-wacz create -f "collection/*.warc.gz" --desc "My cool collection of web archives."
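The four options above (--url, --ts, --title, --desc) each map to one attribute of datapackage.json. The sketch below illustrates that mapping only: describeArchive is a hypothetical helper, the ISO 8601 conversion of the date is an assumption, and a real datapackage.json contains additional fields that are omitted here.

```javascript
// Hypothetical helper: not part of js-wacz's API.
// Shows where the --url, --ts, --title and --desc values land in datapackage.json.
function describeArchive ({ url, ts, title, desc }) {
  const datapackage = {}

  if (title) datapackage.title = title
  if (desc) datapackage.description = desc
  if (url) datapackage.mainPageUrl = url
  if (ts) datapackage.mainPageDate = new Date(ts).toISOString() // Assumed normalization

  return datapackage
}

const datapackageFields = describeArchive({
  url: 'https://lil.law.harvard.edu',
  ts: '2023-02-22T12:00:00.000Z',
  title: 'My collection.',
  desc: 'My cool collection of web archives.'
})

console.log(JSON.stringify(datapackageFields, null, 2))
```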

--signing-url

If provided, will be used as an API endpoint for applying a cryptographic signature to the resulting WACZ file.

This endpoint is expected to be authsign-compatible.

js-wacz create -f "collection/*.warc.gz" --signing-url "https://example.com/sign"

--signing-token

Used together with --signing-url when the signing server requires authentication.

js-wacz create -f "collection/*.warc.gz" --signing-url "https://example.com/sign" --signing-token "FOO-BAR"
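To make the role of the two signing options more concrete, the sketch below assembles the kind of HTTP request a client might send to a signing endpoint. Everything about the request shape is an assumption: the buildSigningRequest helper is hypothetical, and the JSON payload fields and the "bearer" authorization scheme should be checked against the documentation of the signing server in use.

```javascript
// Hypothetical helper: not part of js-wacz's API.
// Builds (but does not send) a POST request for a signing endpoint.
function buildSigningRequest (signingUrl, signingToken, payload) {
  const headers = { 'Content-Type': 'application/json' }

  // The token is only attached when one was provided (assumed "bearer" scheme).
  if (signingToken) {
    headers.Authorization = `bearer ${signingToken}`
  }

  return {
    url: signingUrl,
    options: { method: 'POST', headers, body: JSON.stringify(payload) }
  }
}

const signingRequest = buildSigningRequest('https://example.com/sign', 'FOO-BAR', {
  hash: 'sha256:0000', // Placeholder digest, for illustration only
  created: '2023-02-22T12:00:00.000Z'
})

// The request could then be sent with: await fetch(signingRequest.url, signingRequest.options)
console.log(signingRequest.options.headers)
```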

--log-level

Can be used to determine how verbose js-wacz needs to be.

  • Possible values are: silent, trace, debug, info, warn, error
  • Default is: info

js-wacz create -f "collection/*.warc.gz" --log-level trace
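Log levels are typically ordered by severity, with the configured level acting as a threshold. The sketch below illustrates that idea under the assumption that the levels rank from trace (most verbose) to error, with silent suppressing everything; shouldLog is a hypothetical helper and not how js-wacz implements logging.

```javascript
// Assumed ordering, from most to least verbose.
const LEVELS = ['trace', 'debug', 'info', 'warn', 'error', 'silent']

// Hypothetical helper: not part of js-wacz's API.
// Returns true when a message of messageLevel should be printed
// given the configured threshold.
function shouldLog (configuredLevel, messageLevel) {
  const threshold = LEVELS.indexOf(configuredLevel)
  const severity = LEVELS.indexOf(messageLevel)

  if (threshold === -1 || severity === -1) {
    throw new Error('Unknown log level')
  }

  return messageLevel !== 'silent' && severity >= threshold
}

console.log(shouldLog('info', 'debug')) // false
console.log(shouldLog('info', 'warn')) // true
console.log(shouldLog('silent', 'error')) // false
```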


Programmatic use

js-wacz's CLI and underlying logic are decoupled, and it can therefore be consumed as a JavaScript module (currently only with Node.js).

Example: Creating a signed WACZ programmatically

import { WACZ } from '@harvard-lil/js-wacz'

try {
  const archive = new WACZ({ 
    input: 'collection/*.warc.gz',
    output: 'collection.wacz',
    signingUrl: 'https://example.com/sign',
    signingToken: 'FOO-BAR',
  })

  await archive.process()

  // collection.wacz is ready
} catch(err) {
  // ...
}

Although a process() convenience method is made available, every step of said process can be run individually and the archive's state inspected / edited throughout.

Notable affordances

  • WACZ.addPage() allows for manually adding an entry to pages.jsonl.
  • WACZ.addFileToZip() allows for manually adding any additional data to the final WACZ file.
  • The datapackageExtras option allows for adding an arbitrary JSON-serializable object to datapackage.json under extras.
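Since datapackageExtras must be JSON-serializable, it can be worth checking a value before handing it to the constructor. The sketch below is one way to do that; isJsonSerializable is a hypothetical helper, and the sample extras object is made up for illustration.

```javascript
// Hypothetical helper: not part of js-wacz's API.
// JSON.stringify throws on circular references and BigInt values,
// and returns undefined for functions and undefined itself.
function isJsonSerializable (value) {
  try {
    return typeof JSON.stringify(value) === 'string'
  } catch {
    return false
  }
}

// Sample extras, made up for illustration.
const extras = { context: 'Custom metadata', tags: ['example'] }
console.log(isJsonSerializable(extras)) // true

const circular = {}
circular.self = circular
console.log(isJsonSerializable(circular)) // false
```

Note that this check only confirms that serialization succeeds; properties holding functions or undefined are silently dropped by JSON.stringify rather than rejected.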


Feature parity with py-wacz

js-wacz aims for partial feature parity with Webrecorder's py-wacz.

This section lists notable differences in implementation that might affect interoperability.

Main differences in currently implemented features:

  • CLI: create --detect-pages: --detect-pages is implied in js-wacz unless --pages is provided.
  • CLI: create --file: this argument can be implied in py-wacz, but it is always explicit in js-wacz.


Development

Standard JS

This codebase uses the Standard JS coding style.

  • npm run lint can be used to check formatting.
  • npm run lint-autofix can be used to check formatting and automatically edit files accordingly when possible.
  • Most IDEs can be configured to automatically check and enforce this coding style.

JSDoc

JSDoc is used for both documentation and loose type checking purposes on this project.

Testing

This project uses Node.js' built-in test runner.

npm run test

Test-specific environment variables

The following environment variables allow for testing features requiring access to a third-party server.

These are optional, and can be added to a local .env file which will be automatically interpreted by the test runner.

| Name | Description |
| --- | --- |
| TEST_SIGNING_URL | URL of an authsign-compatible endpoint for signing WACZ files. To run such an endpoint locally, use npm run dev-signer, which will overwrite .env and set this variable to http://localhost:5000/sign; see .services/signer. |
| TEST_SIGNING_TOKEN | If required by the server at TEST_SIGNING_URL, an authentication token. |

Available CLI

# Runs test suite
npm run test

# Runs linter
npm run lint

# Runs linter and attempts to automatically fix issues
npm run lint-autofix

# Step-by-step NPM publishing helper
npm run publish-util

# Runs a local instance of wacz-signer for test purposes (see "Testing" section)
npm run dev-signer
