webrecorder / pywb Goto Github PK

View Code? Open in Web Editor NEW

1.3K 61.0 207.0 33.51 MB

Core Python Web Archiving Toolkit for replay and recording of web archives

Home Page: https://pypi.python.org/pypi/pywb

License: GNU General Public License v3.0

Python 38.29% JavaScript 57.98% CSS 0.12% HTML 1.50% Shell 0.10% Arc 0.07% Dockerfile 0.02% Vue 1.92%

python wayback pywb web-archiving web-archives

pywb's Introduction

Conifer

Collect and revisit web pages.

Conifer provides an integrated platform for creating high-fidelity, ISO-compliant web archives in a user-friendly interface, providing access to archived content, and sharing collections.

This repository represents the hosted service running at https://conifer.rhizome.org/, which can also be deployed locally using Docker

This README refers to the 5.x version of Conifer, released in June, 2020. This release includes a new UI and the renaming of Webrecorder.io to Conifer. Other parts of the open source efforts remain at the Webrecorder Project. For more info about this momentous change, read our announcement blog post.

The previous UI is available on the legacy branch.

Frequently asked questions

If you have any questions about how to use Conifer, please see our User Guide.
If you have a question about your account on the hosted service (conifer.rhizome.org), please contact us via email at [email protected]
If you have a previous Conifer installation (version 3.x), see Migration Info for instructions on how to migrate to the latest version.

Using the Conifer Platform

Conifer and related tools are designed to make web archiving more portable and decentralized, as well as to serve users and developers with a broad range of skill levels and requirements. Here are a few ways that Conifer can be used (starting with what probably requires the least technical expertise).

1. Hosted Service

Using our hosted version of Conifer at https://conifer.rhizome.org/, users can sign up for a free account and create their own personal collections of web archives. Captures web content will be available online, either publicly or only privately, under each user account, and can be downloaded by the account owner at any time. Downloaded web archives are available as WARC files. (WARC is the ISO standard file format for web archives.) The hosted service can also be used anonymously and the captured content can be downloaded at the end of a temporary session.

2. Offline Capture and Browsing

The Webrecorder Project is a closely aligned effort that offers OSX/Windows/Linux Electron applications:

Webrecorder Player browse WARCs created by Webrecorder (and other web archiving tools) locally on the desktop.
Webrecorder Desktop a desktop version of the hosted Webrecorder service providing both capture and replay features.

3. Preconfigured Deployment

To deploy the full version of Conifer with Ansible on a Linux machine, the Conifer Deploy workbook can be used to install this repository, configure nginx and other dependencies, such as SSL (via Lets Encrypt). The workbook is used for the https://conifer.rhizome.org deployment.

4. Full Conifer Local Deployment

The Conifer system in this repository can be deployed directly by following the instructions below. Conifer runs entirely in Docker and also requires Docker Compose.

5. Standalone Python Wayback (pywb) Deployment

Finally, for users interested in the core "replay system" and very basic recording capabilities, deploying pywb could also make sense. Conifer is built on top of pywb (Python Wayback/Python Web Archive Toolkit), and the core recording and replay functionality is provided by pywb as a standalone Python library. pywb comes with a Docker image as well.

pywb can be used to deploy your own web archive access service. See the full pywb reference manual for further information on using and deploying pywb.

Running Locally

Conifer can be run on any system that has Docker and Docker Compose installed. To install manually, clone

git clone https://github.com/rhizome-conifer/conifer
cd conifer; bash init-default.sh.
docker-compose build
docker-compose up -d

(The init-default.sh is a convenience script that copies wr_sample.env → wr.env and creates keys for session encryption.)

Point your browser to http://localhost:8089/ to access the locally running Conifer instance.

(Note: you may see a maintenance message briefly while Conifer is starting up. Refresh the page after a few seconds to see the Conifer home page).

Installing Remote Browsers

Remote Browsers are standard browsers like Google Chrome and Mozilla Firefox, encapsulated in Docker containers. This feature allows Conifer to directly use fixed versions of browsers for capturing and accessing web archives, with a more direct connection to the live web and web archives. Remote browsers in many cases can improve the quality of web archives during capture and access. They can be "remote controlled" by users and are launched as needed, and use the same amount of computing and memory resources as they would when just running as regular desktop apps.

Remote Browsers are optional, and can be installed as needed.

Remote Browsers are just Docker images which start with oldweb-today/, and are part of oldweb-today organization on GitHub. Installing the browsers can be as simple as running docker pull on each browser image each as well as additional Docker images for the Remote Desktop system.

To install the Remote Desktop System and all of the officially supported Remote Browsers, run install-browsers.sh

Configuration

Conifer reads its configuration from two files: wr.env, and less-commonly changed system settings in wr.yaml.

The wr.env file contains numerous deployment-specific customization options. In particular, the following options may be useful:

Host Names

By default, Conifer assumes its running on localhost or a single domain, but on different ports for application (the Conifer user interface) and content (material rendered from web archives). This is a security feature preventing archived web sites accessing and possibly changing Conifer's user interface, and other unwanted interactions.

To run Conifer on different domains, the APP_HOST and CONTENT_HOST environment variables should be set.

For best results, the two domains should be two subdomains, both with https enabled.

The SCHEME env var should also be set to SCHEME=https when deploying via https.

Anonymous Mode

By default Conifer disallows anonymous recording. To enable this feature, set ANON_DISABLED=false to the wr.env file and restart.

Note: Previously the default setting was anonymous recording enabled (ANON_DISABLED=false)

Storage

Conifer uses the ./data/ directory for local storage, or an external backend, currently supporting S3.

The DEFAULT_STORAGE option in wr.env configures storage options, which can be DEFAULT_STORAGE=local or DEFAULT_STORAGE=s3

Conifer uses a temporary storage directory for data while it is actively being captured, and temporary collections. Data is moved into the 'permanent' storage when the capturing process is completed or a temporary collection is imported into a user account.

The temporary storage directory is: WARCS_DIR=./data/warcs.

The permanent storage directory is either STORAGE_DIR=./data/storage or local storage.

When using s3, the value of STORAGE_DIR is ignored and data gets placed into S3_ROOT which is an s3:// bucket URL.

Additional s3 auth environment settings must also be set in wr.env or externally.

All data related to Conifer that is not web archive data (WARC and CDXJ) is stored in the Redis instance, which persists data to ./data/dump.rdb. (See Conifer Architecture below.)

Email

Conifer can send confirmation and password recovery emails. By default, a local SMTP server is run in Docker, but can be configured to use a remote server by changing the environment variables EMAIL_SMTP_URL and EMAIL_SMTP_SENDER.

Frontend Options

The react frontend includes a number of additional options useful for debugging. Setting NODE_ENV=development will switch react to development mode with hot reloading on port 8096.

Additional frontend configuration can be found in frontend/src/config.js

Administration tool

The script admin.py provides easy low level management of users. Adding, modifying, or removing users can be done via the command line.

To interactively create a user:

docker exec -it app python -m webrecorder.admin -c

or programmatically add users by supplying the appropriate positional values:

docker exec -it app  python -m webrecorder.admin \
                -c <email> <username> <passwd> <role> '<full name>'

Other arguments:

-m modify a user
-d delete a user
-i create and send a new invite
-l list invited users
-b send backlogged invites

See docker exec -it app python -m webrecorder.admin --help for full details.

Restarting Conifer

When making changes to the Conifer backend app, running

docker-compose kill app; docker-compose up -d app

will stop and restart the container.

To integrate changes to the frontend app, either set NODE_ENV=development and utilize hot reloading. If you're running production (NODE_ENV=production), run

docker-compose kill frontend; docker-compose up -d frontend

To fully recreate Conifer, deleting old containers (but not the data!) use the ./recreate.sh script.

Conifer Architecture

This repository contains the Docker Compose setup for Conifer, and is the exact system deployed on https://conifer.rhizome.org. The full setup consists of the following components:

/app - The Conifer backend system includes the API, recording and WARC access layers, split into 3 containers:
- app -- The API and data model and rewriting system are found in this container.
- recorder -- The WARC writer is found in this container.
- warcserver -- The WARC loading and lookup is found in this container.

The backend containers run different tools from pywb, the core web archive replay toolkit library.

/frontend - A React-based frontend application, running in Node.js. The frontend is a modern interface for Conifer and uses the backend api. All user access goes through frontend (after nginx).
/nginx - A custom nginx deployment to provide routing and caching.
redis - A Redis instance that stores all of the Conifer state (other than WARC and CDXJ).
dat-share - An experimental component for sharing collections via the Dat protocol
shepherd - An instance of OldWebToday Browser Shepherd for managing remote browsers.
mailserver - A simple SMTP mail server for sending user account management mail
behaviors - Custom automation behaviors
browsertrix - Automated crawling system

Dependencies

Conifer is built using both Python (for backend) and Node.js (for frontend) using a variety of Python and Node open source libraries.

Conifer relies on a few separate repositories in this organization:

The remote browser system uses https://github.com/oldweb-today/ repositories, including:

Contact

Conifer is a project of Rhizome, made possible with generous past support from the Andrew W. Mellon Foundation.

For more info on using Conifer, you can consult our user guide at: https://guide.conifer.rhizome.org

For any general questions/concerns regarding the project or https://conifer.rhizome.org you can:

Open issues on GitHub
Tweet to us at https://twitter.com/rhizomeconifer
Contact us at [email protected]

License

Conifer is Licensed under the Apache 2.0 License. See NOTICE and LICENSE for details.

pywb's People

Stargazers

Watchers

Forkers

nlevitt jcushman rajbot phillipsm pombredanne tilgovi ptrourke akeprojecta machawk1 mnachmi robertknight yarwelp peval danielbicho gema-arta arquivo orbiter gwu-libraries flinkt chdorner hypothesis jeffreychung soedomoto giordanocardillo treora italoadler gvsurenderreddy sonalranjit hscale sebastian-nagel cequencer info-labs anastasia n0tan3rd bytearchive m4rk3r fernando-melo segerberg rebeccacremona leetcodes leonirlopes atomotic ukwa humberthardy tripti825 markosii ekilfeather kris-sigur eszense hubprojects anarcat arunk2 babibubebon halvir peterk nmunro anjackson commoncrawl fish2000 shawnmjones sesas harvard-lil divyank0 whitten surfndez dorsug sts0mrg0 neolithera backwardn openslx yvmarques hyl masterscott giovanisp sahwar fakegit traverseda syzyyp xlee igobranco micronn 5l1v3r1 apoudel1021 daleathan jayvdb xw0078 sohamjadiya solversa c00renut vishalbelsare malexmave bpgallagher e-harvester artur303 at911 arcalex kaij thedatashed ldko cyber-squirrel

pywb's Issues

Too many open files ?

I get this strange error:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/framework/wsgi_wrappers.py", line 98, in handle_methods
    response = wb_router(env)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/framework/proxy.py", line 37, in __call__
    response = super(ProxyArchivalRouter, self).__call__(env)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/framework/archivalrouter.py", line 36, in __call__
    return route.handler(wbrequest)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/webapp/handlers.py", line 75, in __call__
    return self.handle_request(wbrequest)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/webapp/handlers.py", line 133, in handle_request
    cdx_lines, output = self.index_reader.load_for_request(wbrequest)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/webapp/query_handler.py", line 71, in load_for_request
    cdx_iter = self.load_cdx(wbrequest, params)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/webapp/query_handler.py", line 92, in load_cdx
    cdx_iter = self.cdx_server.load_cdx(**params)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/cdx/cdxserver.py", line 77, in load_cdx
    return self._check_cdx_iter(cdx_iter, query)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/cdx/cdxserver.py", line 45, in _check_cdx_iter
    cdx_iter = self.peek_iter(cdx_iter)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/cdx/cdxserver.py", line 85, in peek_iter
    first = next(iterable)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/cdx/cdxops.py", line 110, in <genexpr>
    return (cdx for cdx, _ in itertools.izip(cdx_iter, xrange(limit)))
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/cdx/cdxops.py", line 201, in cdx_filter
    for cdx in cdx_iter:
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/cdx/cdxops.py", line 100, in <genexpr>
    return (cls(line) for line in text_iter)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/cdx/cdxops.py", line 83, in create_merged_cdx_gen
    source_iters = map(lambda src: src.load_cdx(query), sources)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/cdx/cdxops.py", line 83, in <lambda>
    source_iters = map(lambda src: src.load_cdx(query), sources)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/cdx/cdxsource.py", line 31, in load_cdx
    source = open(self.filename)
IOError: [Errno 24] Too many open files: '/opt/clueweb12B13/cdx/1508wb-50.cdx'

130.125.11.134 - - [04/Nov/2014 16:34:01] "GET /clueweb/*/www.repubblica.it HTTP/1.1" 400 2830

Support Vine capture and replay

Several use cases have cropped up for archving Vines, both embedded and from direct.
Vine video is HTML5 based but looks like a few custom rules may still be needed to get the pages to work nicely.

Implement pywb cdx server!

Basic cdx server implementation which will support cdx loading for pywb, and for use as a standalone module

Custom behavior and error page per collection on missing captures

Currently, the error page is a global setting. It would be helpful to have an error page per collection.

In proxy mode, there should be the option to just pipe requests through to the web.

Better handling of content when Content-Type is wrong or absent.

Currently, pywb relies on Content-Type to determine the type of rewriting, if any needs to be performed. However, content-type may be wrong or missing.

Some possible ideas:

For text types:

If chardet returns 0.0 confidence and no mime type, assume binary and skip rewriting.
maybe: use https://pypi.python.org/pypi/binaryornot/0.2.0 to verify
either keep wrong content-type and hope client ignores it, or use https://github.com/ahupp/python-magic to attempt to determine correct type.

For binary or non-rewritable content-type:

no checking, or use https://pypi.python.org/pypi/binaryornot/0.2.0 to ensure binary/text

For no content-type:

either serve without content-type
use https://github.com/ahupp/python-magic to determine correct type.

Cleanup rewriting response

Currently, html rewriting is fully buffered, and css/js/xml rewriting is buffered.
Cleanup the interface to allow all rewriting to be either buffered fully (and served with Content-Length) or streamed (w/o Content-Length).

Route based on PATH_INFO?

Speaking of REQUEST_URI, would it make sense to do the routing part with PATH_INFO instead of the full request_uri? For example, my main wsgi file routes /warc urls to pywb:

from werkzeug.wsgi import DispatcherMiddleware
application = DispatcherMiddleware(
    get_wsgi_application(), # Django
    {
        '/warc': warc_application # pywb
    }
)

Then a request to /warc/foo/bar?a=b comes in with env = {'SCRIPT_NAME': '/warc', 'PATH_INFO': '/foo/bar', 'QUERY_STRING': 'a=b'}

If my pywb routes then match against PATH_INFO, I can change the location of the whole application without needing to edit the routes.

Support Url Wildcard Query

eg. /pywb/example.com* to return results from multiple urls under example.com, in some fashion.

Better general rewriting of css generated in JS

A few sites generate css in JS, currently have been adding explicit rewrite rules for these.. (eg. wikimedia blackout, instagram).. However, it may make sense to add a general JS rule if possible..

Luckily, the urls are generally of the form url(\/\example.com) or url(//example.com) so may be possible to detect the css url() wrapper in JS, but need to be careful.. May still want to apply on a per-site basis.

Note: this is needed as intercepting a style as its being set has proven to be quite difficult, so best to rewrite the style string itself.

Would like custom port

Pywb always runs on port 8080, I'd like to be able to change that via config.yaml and/or a command line parameter.

CDXSource can't return unicode strings

I'm not sure if this is a bug or expected behavior, but if a CDXSource object returns unicode strings from load_cdx, it seems to cause errors in the rewrite phase for some files:

Traceback (most recent call last):
  File "/home/vagrant/.virtualenvs/perma/local/lib/python2.7/site-packages/pywb/framework/wsgi_wrappers.py", line 98, in handle_methods
    response = wb_router(env)
  File "/vagrant/perma_web/warc_server/pywb_config.py", line 71, in __call__
    return super(Router, self).__call__(env)
  File "/home/vagrant/.virtualenvs/perma/local/lib/python2.7/site-packages/pywb/framework/archivalrouter.py", line 36, in __call__
    return route.handler(wbrequest)
  File "/home/vagrant/.virtualenvs/perma/local/lib/python2.7/site-packages/pywb/webapp/handlers.py", line 75, in __call__
    return self.handle_request(wbrequest)
  File "/home/vagrant/.virtualenvs/perma/local/lib/python2.7/site-packages/pywb/webapp/handlers.py", line 138, in handle_request
    return self.handle_replay(wbrequest, cdx_lines)
  File "/home/vagrant/.virtualenvs/perma/local/lib/python2.7/site-packages/pywb/webapp/handlers.py", line 152, in handle_replay
    cdx_callback)
  File "/home/vagrant/.virtualenvs/perma/local/lib/python2.7/site-packages/pywb/webapp/replay_views.py", line 83, in render_content
    failed_files)
  File "/home/vagrant/.virtualenvs/perma/local/lib/python2.7/site-packages/pywb/webapp/replay_views.py", line 148, in replay_capture
    response_iter)
  File "/home/vagrant/.virtualenvs/perma/local/lib/python2.7/site-packages/pywb/webapp/replay_views.py", line 166, in buffered_response
    for buff in iterator:
  File "/home/vagrant/.virtualenvs/perma/local/lib/python2.7/site-packages/pywb/rewrite/rewrite_content.py", line 250, in rewrite_text_stream_to_gen
    buff = rewrite_func(buff)
  File "/home/vagrant/.virtualenvs/perma/local/lib/python2.7/site-packages/pywb/rewrite/regex_rewriters.py", line 54, in rewrite
    return self.regex.sub(lambda x: self.replace(x), string)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

SCRIPT_NAME environment variable undefined?

index page loads fine.
tried to hit mydomain.com/pywb/*/example.com

Pywb Error

'SCRIPT_NAME'
Error Details:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/pywb/framework/wsgi_wrappers.py", line 62, in __call__
    response = wb_router(env)
  File "/usr/local/lib/python2.7/dist-packages/pywb/framework/proxy.py", line 28, in __call__
    response = super(ProxyArchivalRouter, self).__call__(env)
  File "/usr/local/lib/python2.7/dist-packages/pywb/framework/archivalrouter.py", line 33, in __call__
    result = route(env, self.abs_path)
  File "/usr/local/lib/python2.7/dist-packages/pywb/framework/archivalrouter.py", line 77, in __call__
    wbrequest = self.parse_request(env, use_abs_prefix)
  File "/usr/local/lib/python2.7/dist-packages/pywb/framework/archivalrouter.py", line 90, in parse_request
    rel_prefix = env['SCRIPT_NAME'] + '/' + matched_str + '/'
KeyError: 'SCRIPT_NAME'

I'm running pywb 0.4.7 (installed w/ pip) via uWSGI behind Nginx.

Nginx server block

upstream pywb {
    server 127.0.0.1:8001;
}

server {
    listen 80 default_server;
    listen [::]:80 default_server ipv6only=on;

    location / {
        uwsgi_pass pywb;
        include /etc/nginx/uwsgi_params;
       }
}

contents of uwsgi_params:
https://github.com/phusion/nginx/blob/master/conf/uwsgi_params

command to run uWSGI:
$ /usr/local/bin/uwsgi --ini /etc/pywb/wsgi.ini

contents of /etc/pywb/wsgi.ini

[uwsgi]
socket = :8001
master = true
processes = 10
buffer-size = 65536
die-on-term = true

# specify config file here
env = PYWB_CONFIG_FILE=/etc/pywb/config.yaml
chdir = /usr/local/lib/python2.7/dist-packages/pywb/
wsgi = pywb.apps.wayback

contents of /etc/pywb/config.yaml

# pywb config file
# ========================================
#
# Settings for each collection

collections:
    # <name>: <cdx_path>
    # collection will be accessed via /<name>
    # <cdx_path> is a string or list of:
    #  - string or list of one or more local .cdx file
    #  - string or list of one or more local dirs with .cdx files
    #  - a string value indicating remote http cdx server
    pywb: /my_archive/cdx/

    # ex with filtering: filter CDX lines by filename starting with 'dupe'
    #pywb-filt: {'index_paths': './sample_archive/cdx/', 'filters': ['filename:dupe*']}

# indicate if cdx files are sorted by SURT keys -- eg: com,example)/
# SURT keys are recommended for future indices, but non-SURT cdxs
# are also supported
#
#   * Set to true if cdxs start with surts: com,example)/
#   * Set to false if cdx start with urls: example.com)/
#
# default:
# surt_ordered: true

# list of paths prefixes for pywb look to 'resolve'  WARC and ARC filenames
# in the cdx to their absolute path
#
# if path is:
#   * local dir, use path as prefix
#   * local file, lookup prefix in tab-delimited sorted index
#   * http:// path, use path as remote prefix
#   * redis:// path, use redis to lookup full path for w:<warc> as key

archive_paths: /my_archive/warcs/

# The following are default settings -- uncomment to change
# Set to '' to disable the ui

# ==== UI: HTML/Jinja2 Templates ====

# template for <head> insert into replayed html content
#head_insert_html: ui/head_insert.html

# template to for 'calendar' query,
# eg, a listing of captures  in response to a ../*/<url>
#
# may be a simple listing or a more complex 'calendar' UI
# if omitted, will list raw cdx in plain text
#query_html: ui/query.html

# template for search page, which is displayed when no search url is entered
# in a collection
#search_html: ui/search.html

# template for home page.
# if no other route is set, this will be rendered at /, /index.htm and /index.html
#home_html: ui/index.html


# error page temlpate for may formatting error message and details
# if omitted, a text response is returned
#error_html: ui/error.html

# ==== Other Paths ====

# list of host names that pywb will be running from to detect
# 'fallthrough' requests based on referrer
#
# eg: an incorrect request for http://localhost:8080/image.gif with a referrer
# of http://localhost:8080/pywb/index.html, pywb can correctly redirect
# to http://localhost:8080/pywb/image.gif
#

#hostpaths: ['http://localhost:8080']

# Rewrite urls with absolute paths instead of relative
#absoulte_paths: true

# List of route names:
# <route>: <package or file path>
# default route static/default for pywb defaults
static_routes:
          static/default: pywb/static/

# ==== New / Experimental Settings ====
# Not yet production ready -- used primarily for testing

# Enable simple http proxy mode
enable_http_proxy: true

# enable cdx server api for querying cdx directly (experimental)
enable_cdx_api: true

# custom rules for domain specific matching
# set to false to disable
#domain_specific_rules: rules.yaml

# Memento support, enable
enable_memento: true

# Replay content in an iframe
framed_replay: true

ampersand causes extra semicolon

There's an extra semicolon in the playback for this URL:

https://webrecorder.io/replay/20140317180312/http://www.boston.com/business/news/2013/10/08/jones-lang-lasalle-survey-boylston-seventh-most-expensive-street-for-office-rents/CeaD6LLvrNKyhPegK7RreM/story.html

The menu item "A&E" becomes "A&E;".

I suspect that some stage of the rewrite process is normalizing the HTML and treating "&E" as an HTML entity.

Let me know if the original .warc would be helpful as well. Thanks!

Support remaining query api from wayback-cdx-server

Support roughly same features as documented under:
https://github.com/iipc/openwayback/tree/master/wayback-cdx-server

Templated Head Insert

Support head insert which can be either fixed string or generated by (Jinja2 or other) template, with info about current capture and request available to template.

WARCs of text files on web are indexed but not replayable.

Uploaded WARCs containing captured of pages with only text are indexed with the URIs properly extracted but I receive a sad face icon (see below screenshot) when webrecorder tries to replay the content. This occurs both when the WARC is uploaded or placed somewhere on the web and webrecorder is given the URI.

Example WARC:
http://matkelly.com/temp/20140811154106173.warc

Validity verified:
./jwattools.sh test -e ./20140811154106173.warc
sh cdx-indexer ./20140811154106173.warc

Screenshot:

/cc @phonedude

Add index / better error page.

Currently /index.html results in an error.

Add a placeholder index page that lists all routes
Add a better error page for empty or invalid request.

Documentation of variables available in templates

Need a list of variables available in the jinja2 templates.

Simplified Configuration System for multiple collections

Add an optional 'convention over configuration' system for setting up collections of warcs and cdx files..

Currently, setting up multiple collections can be a bit tedious. For example, configuring two collections, each with custom set of cdx, warcs, search page and banner might look like this:

coll1:
        index_paths: ./coll1/cdx/
        archive_paths: ./coll1/warcs/
        banner_html: ./coll1/templates/banner.html
        search_html: ./coll1/templates/search.html

coll2:
        index_paths: ./coll1/cdx/
        archive_paths: ./coll1/warcs/
        banner_html: ./coll1/templates/banner.html
        search_html: ./coll1/templates/search.html

With the new convention system, the collections can instead be configured implicitly within a collections directory, eg: collections/coll1/, collections/coll2, and each having a
indexes , archive and an optional templates subdir.

Edit: changed cdx -> indexes, warcs -> archive in new config system

Multithread architecture ?

Hi,
i've been profiling a bit the wayback web server, and it seems to be inherently single-threaded. Is this the case ? I've tried with an increasing number of clients (cURL clients) that concurrently issue requests to the same page, and the %time_total reported by cURL starts increasing steadily with the number of clients.
I run the pywb on a 8-core Xeon processor, and during the benchmark execution, only 1 core is pushed to 100% of its capabilities.
Is there some quick hack to improve the performances?
Do you plan to support multi-core architectures ?

Here's the scripts I use to benchmark:

:~/pywb$ cat single_client_pywb.sh 
#!/bin/bash
URL=$1
for i in $(seq 1 1 1000); do 
        curl -w 'Total time: \t%{time_total}\n' $URL -o /dev/null -s; 
done
:~/pywb$ cat bench.sh 
#!/bin/bash
N_CLIENTS=$1
for i in $(seq 1 1 $N_CLIENTS); do 
        ./single_client_pywb.sh http://localhost:8080/clueweb/20120418042230/http://wikitravel.org/en/282001 &
done
wait
```bash

I can provide the warc.gz if required, but any page would work.

Missing file in 0.3.0 release

Looks like the 0.3.0 .tar.gz here ( https://pypi.python.org/pypi/pywb/0.3.0 ) is missing the pywb/rules.yaml file, which is referred to in pywb/utils/dsrules.py .

Make new config system also compatible with/understand bagit directory structure.

Reference: http://en.wikipedia.org/wiki/BagIt
This is related to new config system, #55

As part of config system improvements, it may be very useful to support bagit directory structures.
Figure out what is needed to support this layout, it may just be recursive directory structures, or possible recognizing additional metadata.

Create custom not_found.html template separate from generic error.html

Not found errors are different than other server-side errors. Provide a separate not_found.html template, overridable per collection. Based on discussion in #58

Plaintext content on the live web is not recorded/replayed correctly

This was pointed out to me by @phonedude and applicable to webrecorder (which uses pywb). URIs where the only resource is a text document are "recorded" but when the download button is pressed and Web ARChive (WARC) is selected, the user is returned an error:

WebRecorder.io error
Temporary Warc Error
Please try downloading again.

Example URI:
http://www.cs.odu.edu/~mkelly/semester/2014_summer/bioCopy.txt

Slow response proxying archive.org

The really fun news is that this works:
https://via.hypothes.is/h/https://web.archive.org/web/20141111131547/http://www.theatlantic.com/magazine/archive/1945/07/as-we-may-think/303881/

But I waited probably three minutes for it to succeed. I actually thought it was hung. Notice that annotations that were made at the original page on theatlantic.com over a year ago work! (Because theatlantic uses rel=canonical, and we are paying attention).

Any idea why it might take so long? Did it for you?

Support Memento Protocol

Implement Memento Support for replay, as well as timegate and timemap, as an optional config setting.

Ability to modify html rewriting rules as needed.

Currently, the html rewriting tags is somewhat hardcoded in the htmlparser.

Would be nice to have this more configurable, especially for certain specific use cases, like tags.

A particular use case is not rewriting to work with pywb-hypothesis integration.

ZipNum: add support for compressed cdx

Support for compressed chunked cdx lines with a secondary index

TypeError: unsupported operand type(s) for +=: 'NoneType' and 'str'

The cdx-indexer fails with a message like this:

Traceback (most recent call last):
  File "/usr/local/bin/cdx-indexer", line 9, in <module>
    load_entry_point('pywb==0.6.3', 'console_scripts', 'cdx-indexer')()
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/warc/cdxindexer.py", line 252, in main
    cdx09=cmd.cdx09)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/warc/cdxindexer.py", line 139, in write_multi_cdx_index
    for entry in entry_iter:
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/warc/archiveiterator.py", line 378, in create_index_iter
    for entry in entry_iter:
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/warc/archiveiterator.py", line 206, in create_record_iter
    for record in arcv_iter.iter_records():
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/warc/archiveiterator.py", line 50, in iter_records
    record = self._next_record(next_line)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/warc/archiveiterator.py", line 138, in _next_record
    self.known_format)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/warc/recordloader.py", line 138, in parse_record_stream
    status_headers = self.http_parser.parse(stream)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/utils/statusandheaders.py", line 172, in parse
    value += next_line
TypeError: unsupported operand type(s) for +=: 'NoneType' and 'str'

Input warc file here:
https://www.dropbox.com/s/jy9n5s5479850yd/0704wb-31.warc.gz?dl=0

Client-Side Testing Suite

The server side testing suite is pretty extensive (eg 99%) coverage. However, this does not yet include client side rewriting, and testing of actual page content for replay accuracy.

Need at least some way to test the client side rewriting, and maybe more extensive testing after that.

wombat hooks for missing video

Sometimes embedded videos supported by the yt-dl library are not available for various reasons at the time of archiving: video deleted, copyright violation, user closed their account, and so forth.

If a native video player is not supported at archiving time and a reason is known why the archiving failed, provide javascript hooks in wombat to handle those.

If the native video player can be archived and already shows the reason, this is not required.

Some very dynamic pages do not work with the proxy

The following page looks like a normal static page in the browser: http://readwrite.com/.... However it is using angular. For example using pywb-h/h/http://readwrite.com/... shows an empty page.

I assume this is a proxy problem. Or should this be reported on pywb-hypothesis?

Support YAML Configs

Basic cdx/warc source config should be definable via yaml.

Implement binsearch over text file

Generic binsearch over seekable stream, which supports size(), seek(), readline()
Necessary for further cdx server functionality.

Improved top rewriting

'window.top' needs to be rewritten in framed mode, however it can be tricky to get it right, since variable can be just 'top' and can also be used for other things besides window, eg a local var top.

This is an attempt to improve detecting when top needs to be rewritten.

Incorrect URLS and redirect loop

looking at list of captures:
mydomain.com/pywb/*/<url>

on localhost the links are of the format:
mydomain.com/pywb/<id>/<url>

on my deployed site the links are:
mydomain.com/pywb/<current page url>/<id>/<ur>/

If I manually enter in the correct url for a capture, I get a redirect loop. not a problem on localhost.

my environment is setup the same as described here: #39
this is probably an issue with how I have things set up

cdx-indexer fails to index warc.gz file

I got this:

~/pywb$ cdx-indexer --sort clueweb12b13/cdx/ /tmp/clue/
Traceback (most recent call last):
  File "/usr/local/bin/cdx-indexer", line 9, in <module>
    load_entry_point('pywb==0.6.3', 'console_scripts', 'cdx-indexer')()
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/warc/cdxindexer.py", line 252, in main
    cdx09=cmd.cdx09)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/warc/cdxindexer.py", line 120, in write_multi_cdx_index
    write_cdx_index(outfile, infile, filename, **options)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/warc/cdxindexer.py", line 160, in write_cdx_index
    for entry in entry_iter:
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/warc/archiveiterator.py", line 378, in create_index_iter
    for entry in entry_iter:
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/warc/archiveiterator.py", line 206, in create_record_iter
    for record in arcv_iter.iter_records():
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/warc/archiveiterator.py", line 50, in iter_records
    record = self._next_record(next_line)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/warc/archiveiterator.py", line 138, in _next_record
    self.known_format)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/warc/recordloader.py", line 97, in parse_record_stream
    known_format))
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/warc/recordloader.py", line 171, in _detect_type_load_headers
    raise ArchiveLoadFailed(msg + str(se.statusline))
pywb.warc.recordloader.ArchiveLoadFailed: Invalid WARC record, first line: WARC-Type: response

The /tmp/clue/ directory contains only 1 file, 1000tw-00.warc.gz.
If the file is unzipped, the cdx-indexer works fine.

Avoid masking re-raised errors.

In this example in wbapp.py, and similar spots, it would make for easier debugging to use plain raise instead of raise e:

    except Exception as e:
        logging.exception('*** pywb could not init with settings from {0}.pywb_config()!\n'.format(config_name))
        raise e

That way python will print out a stack trace that goes back to the original source of the error, rather than the point where it was re-raised. Thanks!

Problematic page that shows an advertisement and then loads the content

http://www.businessinsider.com/... shows an advertisement for some time and then redirects to the article. If you visit the page again, it remembers (by using a cookie) that you have seen the ad and goes directly to the page.

This does not work with the proxy http://pywb-h.herokuapp.com/h/http://www.businessinsider.com/....

To see the effect, make sure you have removed cookies for pywb-h.herokuapp.com. If you reload the page after it is stuck in the ad page, it will go directly to the article.

Add Basic Support for an Exclusion/Perms System

Support filtering cdx lines based on a permissions checking module.

Redirect from bare domain to www subdomain throws Self Redirect error.

Suppose you archive http://metafilter.com/ using phantomjs and warcprox. That URL redirects to http://www.metafilter.com/. So you end up with an entry in your .warc for metafilter.com, which is a 301 redirect, and www.metafilter.com, which is the actual contents of the page.

Now suppose you try to replay http://metafilter.com/. replay.py will throw a self-redirect error at this point:

        # Check for self redirect
        if wbresponse.status_headers.statusline.startswith('3'):
            if self.isSelfRedirect(wbrequest, wbresponse.status_headers):
                raise wbexceptions.CaptureException('Self Redirect: ' + str(cdx))

The reason is that surt normalizes http://metafilter.com/ and http://www.metafilter.com/ to be the same thing, so they both come back as the 301 redirect:

>>> surt.surt('http://metafilter.com')
'com,metafilter)/'
>>> surt.surt('http://www.metafilter.com')
'com,metafilter)/'

I have no idea if this is a surt bug, a warcprox bug, or even a pywb bug, but figured you might know ...

encoding issue - failing to playback warc

see stack trace below. we took a warc from our collection, indexed and visited the url in a locally running pywayback. this warc was made by wget (we can send the file via email but it is too big to upload here). other warcs we have tried that we created using webrecorder.io work perfectly. we're on v0.4.5

Pywb Error

'utf8' codec can't decode byte 0xf1 in position 12562: invalid continuation byte
Error Details:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/pywb/framework/wsgi_wrappers.py", line 62, in __call__
    response = wb_router(env)
  File "/usr/local/lib/python2.7/site-packages/pywb/framework/proxy.py", line 28, in __call__
    response = super(ProxyArchivalRouter, self).__call__(env)
  File "/usr/local/lib/python2.7/site-packages/pywb/framework/archivalrouter.py", line 33, in __call__
    result = route(env, self.abs_path)
  File "/usr/local/lib/python2.7/site-packages/pywb/framework/archivalrouter.py", line 78, in __call__
    return self.handler(wbrequest) if wbrequest else None
  File "/usr/local/lib/python2.7/site-packages/pywb/webapp/handlers.py", line 41, in __call__
    cdx_callback)
  File "/usr/local/lib/python2.7/site-packages/pywb/webapp/replay_views.py", line 81, in __call__
    return self.render_content(wbrequest, *args)
  File "/usr/local/lib/python2.7/site-packages/pywb/webapp/replay_views.py", line 159, in render_content
    failed_files)
  File "/usr/local/lib/python2.7/site-packages/pywb/webapp/replay_views.py", line 224, in replay_capture
    response_iter)
  File "/usr/local/lib/python2.7/site-packages/pywb/webapp/replay_views.py", line 242, in buffered_response
    for buff in iterator:
  File "/usr/local/lib/python2.7/site-packages/pywb/rewrite/rewrite_content.py", line 224, in stream_to_gen
    buff = rewrite_func(buff)
  File "/usr/local/lib/python2.7/site-packages/pywb/rewrite/rewrite_content.py", line 151, in do_rewrite
    buff = self._decode_buff(buff, stream, encoding)
  File "/usr/local/lib/python2.7/site-packages/pywb/rewrite/rewrite_content.py", line 179, in _decode_buff
    buff = buff.decode(encoding)
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf1 in position 12562: invalid continuation byte

nytimes.com redirect issues.

@dwhly dug this up with testing on our "via.hypothes.is" installation of PyWB.

If you make a request for an nytimes.com page...such as:
https://via.hypothes.is/h/http://www.nytimes.com/2015/01/12/business/media/pop-music-critic-leaves-the-new-yorker-to-annotate-lyrics-for-a-start-up.html?gwh=D7261152B8A43951CDB507B033AE73A8&gwt=pay&assetType=nyt_now&_r=0
(with or without query parameters)
...the nytimes.com site will send the browser on an endless redirect journey to no where. 😦

It's likely for some user "finger printing" of some kind.

Would love your thoughts on the matter @ikreymer. Thanks!

Ensure IDN urls can be replayed/proxied

Ensure that international domain name sites (IDN) can be live proxied and replayed without issues.

Add support for lxml parser

Optionally support lxml target parser api, if lxml is available.
http://lxml.de/parsing.html#the-target-parser-interface

Provide option to toggle lxml via 'use_lxml_parser'

lxml rewriting error

Something weird happens when using lxml in the first doc in this WARC: https://www.dropbox.com/s/tmb7cusy7vg3u3o/audobon.warc

The URL is http://web4.audubon.org/bird/stateofthebirds/cbid/

When it gets played back and lxml is on, the HTML ends partway through, like this:

<!-- Begin Breadcrumb Nav -->
<tr>
    <td class="breadcrumbnav">
<a href="/warc/Y5UN-LGT2/http://www.audubon.org/bird/stateofthebirds/">State of the Birds</a> &gt;
<a href="/warc/Y5UN-LGT2/http://web4.audubon.org/bird/stateofthebirds/cbid/index.php">Common Birds in Decline</a>
</td></tr></table></td></tr></table></td></tr></table></td></tr></table></body></html>

Seems to work fine without lxml. Any thoughts?

Ensure pywb runs on Windows

Mostly path-url conversion related issues, as well as possible line ending differences?

Support Http Proxy Mode

Should support replay in http proxy mode, with rewriting.
Need to:

Rewrite https -> http urls
Filter encoding related headers

Better reporting of non-chunked gzip warcs/arcs

cdx-indexer will return an obscure message when trying to parse a non-chunked gzip. Detect this better and return a useful error message, something like:
This WARC/ARC is not properly compressed, to use it, please decompress the file first
Raised by #44

Support PathIndexPrefixResolver

Resolve warc paths from a simple, tab-delimitted text file, eg:

file.warc.<tab>file://full/path/to/file.warc
...