Common Crawl Index Server

This project is a deployment of the pywb web archive replay and index server. It provides an index query mechanism for datasets provided by CommonCrawl.

Usage & Installation

To run locally, install the dependencies with: pip install -r requirements.txt

CommonCrawl stores data on Amazon S3 and the index is publicly accessible from S3.

Currently, individual indexes for each crawl can be accessed under: s3://aws-publicdatasets/commoncrawl/cc-index/collections/[CC-MAIN-YYYY-WW]
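As an illustration of the naming scheme above, the following minimal sketch builds the S3 location for one crawl. The helper name collection_s3_path and the validation pattern are hypothetical and not part of this project; the prefix is the one quoted above and may have changed since this README was written.

```python
import re

# Prefix quoted in this README; treat it as an assumption about the bucket layout.
S3_INDEX_PREFIX = "s3://aws-publicdatasets/commoncrawl/cc-index/collections"

# Hypothetical check for the CC-MAIN-YYYY-WW naming pattern.
_CRAWL_ID = re.compile(r"CC-MAIN-\d{4}-\d{2}")


def collection_s3_path(crawl_id: str) -> str:
    """Return the S3 path of the index for a single crawl (illustrative helper)."""
    if not _CRAWL_ID.fullmatch(crawl_id):
        raise ValueError(f"expected an id like CC-MAIN-2015-06, got {crawl_id!r}")
    return f"{S3_INDEX_PREFIX}/{crawl_id}"
```

For example, collection_s3_path("CC-MAIN-2015-06") returns the prefix followed by /CC-MAIN-2015-06.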

Most of the index is served from S3; however, a smaller secondary index must be installed locally for each collection.

This can be done automatically by running install-collections.sh, which installs all available collections locally.

The script uses the s3cmd tool to sync the index.

If successful, there should be a collections directory with at least one index.
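One way to confirm the result is to list what ended up in the collections directory. This is a minimal sketch that assumes each installed collection appears as a subdirectory of collections; the function name list_local_collections is hypothetical and the actual on-disk layout may differ.

```python
from pathlib import Path


def list_local_collections(root="collections"):
    """Return the sorted names of subdirectories under `root`.

    Assumes (not confirmed by this README) that each installed
    collection is a subdirectory of the collections directory.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        # Nothing installed yet, or the script was run from another directory.
        return []
    return sorted(p.name for p in root_path.iterdir() if p.is_dir())
```

An empty list suggests the sync did not complete or that the check was run from a different working directory.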

To start the index server, run cdx-server. Alternatively, run wayback to start the pywb replay system along with the CDX server.

CDX Server API

The API endpoints correspond to the existing index collections in the collections directory.

For example, one currently available index is CC-MAIN-2015-06, and it can be accessed via:

http://localhost:8080/CC-MAIN-2015-06-index?url=commoncrawl.org
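The query URL above can also be assembled and fetched from a script. The sketch below uses only the Python standard library; the function names build_query_url and query_index are hypothetical, and the host and port simply mirror the example URL. It assumes the server returns one index record per line of plain text, which should be checked against the CDX Server API documentation.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(collection, url, host="localhost", port=8080):
    """Build a query URL of the form shown above (illustrative helper)."""
    # The endpoint name is the collection name followed by "-index".
    return f"http://{host}:{port}/{collection}-index?" + urlencode({"url": url})


def query_index(collection, url, host="localhost", port=8080):
    """Fetch the response for a query and return it as a list of lines.

    Assumes a locally running cdx-server and a line-oriented text response.
    """
    with urlopen(build_query_url(collection, url, host, port)) as resp:
        return resp.read().decode("utf-8").splitlines()
```

With a local server running, query_index("CC-MAIN-2015-06", "commoncrawl.org") would issue the same request as the example URL.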

Refer to the CDX Server API documentation for more detailed instructions on the API itself.

The pywb README provides additional information about pywb.

Building the Index

Please see the webarchive-indexing repository for more info on how the index is built.
