cc-index-server's People

Contributors

ikreymer

cc-index-server's Issues

Errors using aws-publicdatasets bucket

Requests for the following objects return Forbidden:

s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-06/metadata.yaml
s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-22/metadata.yaml
s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-35/metadata.yaml
s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-48/metadata.yaml
s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-48/indexes/cluster.idx
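
One way to reproduce this per object is an anonymous HEAD request against each path. The following is a minimal sketch using only the Python 3 standard library (the server in this tracker runs Python 2.7 with boto, so this is a standalone check, not the server's code). `s3_to_https` and `head_status` are hypothetical helper names, and the virtual-hosted URL form is assumed from the `aws-publicdatasets.s3.amazonaws.com` host seen in the server log.

```python
import urllib.error
import urllib.request
from urllib.parse import urlparse


def s3_to_https(s3_uri):
    """Map s3://bucket/key to the virtual-hosted HTTPS URL for that object."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an s3:// URI: %r" % s3_uri)
    return "https://%s.s3.amazonaws.com/%s" % (parsed.netloc, parsed.path.lstrip("/"))


def head_status(s3_uri, timeout=10):
    """Return the HTTP status of an anonymous HEAD request (for example 200 or 403)."""
    request = urllib.request.Request(s3_to_https(s3_uri), method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Error statuses such as 403 Forbidden arrive as exceptions.
        return err.code
```

Calling `head_status` on each path listed above and printing the result would show which objects are refused; it needs network access, so no output is shown here.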

Missing Indexes from 2016

$ s3cmd ls s3://aws-publicdatasets/common-crawl/cc-index/collections/
     DIR   s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2014-52/
     DIR   s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-06/
     DIR   s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-11/
     DIR   s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-14/
     DIR   s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-18/
     DIR   s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-22/
     DIR   s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-27/
     DIR   s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-32/
     DIR   s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-35/
     DIR   s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-40/
     DIR   s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-48/
     DIR   s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2016-07/
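
To see at a glance which years are covered, the listing above can be grouped by crawl year. This is a minimal sketch, assuming the `s3cmd ls` output is available as text; `crawls_by_year` is a hypothetical helper and not part of s3cmd or cc-index-server.

```python
import re
from collections import defaultdict

# Crawl IDs in the listing look like CC-MAIN-<year>-<week>.
CRAWL_ID = re.compile(r"CC-MAIN-(\d{4})-(\d{2})")


def crawls_by_year(listing):
    """Group the crawl IDs found in `s3cmd ls` output by year."""
    years = defaultdict(list)
    for line in listing.splitlines():
        match = CRAWL_ID.search(line)
        if match:
            years[int(match.group(1))].append(match.group(0))
    return dict(years)


# Three lines copied from the listing above.
sample = """\
     DIR   s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-40/
     DIR   s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-48/
     DIR   s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2016-07/
"""

for year, crawls in sorted(crawls_by_year(sample).items()):
    print(year, len(crawls), crawls)
```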

404 Not Found when querying CC-MAIN-2015-48

2016-09-05 01:53:20,671: [DEBUG]: Adding query_html: query.html
2016-09-05 01:53:20,671: [DEBUG]: CDX Surt-Ordered? True
2016-09-05 01:53:20,732: [DEBUG]: CustomCanonilizer? True
2016-09-05 01:53:20,732: [DEBUG]: FuzzyMatcher? True
2016-09-05 01:53:20,733: [DEBUG]: Adding CDX Source: ZipNum Cluster: /cdx/collections/CC-MAIN-2015-40/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7fabfcd11b90>
2016-09-05 01:53:20,733: [DEBUG]: Adding CDX API Handler: CC-MAIN-2015-40-index
2016-09-05 01:53:20,733: [DEBUG]: Adding query_html: query.html
2016-09-05 01:53:20,733: [DEBUG]: CDX Surt-Ordered? True
2016-09-05 01:53:20,792: [DEBUG]: CustomCanonilizer? True
2016-09-05 01:53:20,793: [DEBUG]: FuzzyMatcher? True
2016-09-05 01:53:20,793: [DEBUG]: Adding CDX Source: ZipNum Cluster: /cdx/collections/CC-MAIN-2015-11/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7fabfcd34f50>
2016-09-05 01:53:20,793: [DEBUG]: Adding CDX API Handler: CC-MAIN-2015-11-index
2016-09-05 01:53:20,794: [DEBUG]: Adding query_html: query.html
2016-09-05 01:53:20,794: [DEBUG]: CDX Surt-Ordered? True
2016-09-05 01:53:20,849: [DEBUG]: CustomCanonilizer? True
2016-09-05 01:53:20,849: [DEBUG]: FuzzyMatcher? True
2016-09-05 01:53:20,850: [WARNING]: skipping unrecognized URI: /cdx/collections/CC-MAIN-2015-48/indexes/.s3cmd.Rnuxx_.tmp
2016-09-05 01:53:20,850: [DEBUG]: Adding CDX Source: ZipNum Cluster: /cdx/collections/CC-MAIN-2015-48/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7fabfccee190>
2016-09-05 01:53:20,850: [DEBUG]: Adding CDX API Handler: CC-MAIN-2015-48-index
2016-09-05 01:53:20,850: [DEBUG]: Adding query_html: query.html
2016-09-05 01:53:20,851: [DEBUG]: CDX Surt-Ordered? True
2016-09-05 01:53:20,906: [DEBUG]: CustomCanonilizer? True
2016-09-05 01:53:20,906: [DEBUG]: FuzzyMatcher? True
2016-09-05 01:53:20,907: [DEBUG]: Adding CDX Source: ZipNum Cluster: /cdx/collections/CC-MAIN-2016-07/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7fabfccd0d50>
2016-09-05 01:53:20,907: [DEBUG]: Adding CDX API Handler: CC-MAIN-2016-07-index
2016-09-05 01:53:20,907: [DEBUG]: Adding query_html: query.html
2016-09-05 01:53:20,907: [DEBUG]: CDX Surt-Ordered? True
2016-09-05 01:53:20,963: [DEBUG]: CustomCanonilizer? True
2016-09-05 01:53:20,963: [DEBUG]: FuzzyMatcher? True
2016-09-05 01:53:20,964: [DEBUG]: Adding CDX Source: ZipNum Cluster: /cdx/collections/CC-MAIN-2015-18/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7fabfcc8fa50>
2016-09-05 01:53:20,964: [DEBUG]: Adding CDX API Handler: CC-MAIN-2015-18-index
2016-09-05 01:53:20,964: [DEBUG]: Adding query_html: query.html
2016-09-05 01:53:20,964: [DEBUG]: CDX Surt-Ordered? True
2016-09-05 01:53:21,019: [DEBUG]: CustomCanonilizer? True
2016-09-05 01:53:21,019: [DEBUG]: FuzzyMatcher? True
2016-09-05 01:53:21,019: [DEBUG]: Adding CDX Source: ZipNum Cluster: /cdx/collections/CC-MAIN-2015-06/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7fabfcd3b490>
2016-09-05 01:53:21,019: [DEBUG]: Adding CDX API Handler: CC-MAIN-2015-06-index
2016-09-05 01:53:21,020: [DEBUG]: Adding query_html: query.html
2016-09-05 01:53:21,020: [DEBUG]: CDX Surt-Ordered? True
2016-09-05 01:53:21,074: [DEBUG]: CustomCanonilizer? True
2016-09-05 01:53:21,075: [DEBUG]: FuzzyMatcher? True
2016-09-05 01:53:21,075: [DEBUG]: Adding CDX Source: ZipNum Cluster: /cdx/collections/CC-MAIN-2015-14/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7fabfcd2e3d0>
2016-09-05 01:53:21,075: [DEBUG]: Adding CDX API Handler: CC-MAIN-2015-14-index
2016-09-05 01:53:21,075: [DEBUG]: Adding query_html: query.html
2016-09-05 01:53:21,076: [DEBUG]: CDX Surt-Ordered? True
2016-09-05 01:53:21,131: [DEBUG]: CustomCanonilizer? True
2016-09-05 01:53:21,131: [DEBUG]: FuzzyMatcher? True
2016-09-05 01:53:21,132: [DEBUG]: Adding CDX Source: ZipNum Cluster: /cdx/collections/CC-MAIN-2014-52/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7fabfcc96ed0>
2016-09-05 01:53:21,132: [DEBUG]: Adding CDX API Handler: CC-MAIN-2014-52-index
2016-09-05 01:53:21,132: [DEBUG]: Adding query_html: query.html
2016-09-05 01:53:21,132: [DEBUG]: CDX Surt-Ordered? True
2016-09-05 01:53:21,187: [DEBUG]: CustomCanonilizer? True
2016-09-05 01:53:21,188: [DEBUG]: FuzzyMatcher? True
2016-09-05 01:53:21,188: [DEBUG]: Adding CDX Source: ZipNum Cluster: /cdx/collections/CC-MAIN-2015-22/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7fabfcd7e0d0>
2016-09-05 01:53:21,188: [DEBUG]: Adding CDX API Handler: CC-MAIN-2015-22-index
2016-09-05 01:53:21,189: [DEBUG]: Adding query_html: query.html
2016-09-05 01:53:21,189: [DEBUG]: CDX Surt-Ordered? True
2016-09-05 01:53:21,244: [DEBUG]: CustomCanonilizer? True
2016-09-05 01:53:21,244: [DEBUG]: FuzzyMatcher? True
2016-09-05 01:53:21,245: [DEBUG]: Adding CDX Source: ZipNum Cluster: /cdx/collections/CC-MAIN-2015-32/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7fabfcccc4d0>
2016-09-05 01:53:21,245: [DEBUG]: Adding CDX API Handler: CC-MAIN-2015-32-index
2016-09-05 01:53:21,245: [DEBUG]: Adding query_html: query.html
2016-09-05 01:53:21,245: [DEBUG]: CDX Surt-Ordered? True
2016-09-05 01:53:21,308: [DEBUG]: CustomCanonilizer? True
2016-09-05 01:53:21,308: [DEBUG]: FuzzyMatcher? True
2016-09-05 01:53:21,308: [DEBUG]: Adding CDX Source: ZipNum Cluster: /cdx/collections/CC-MAIN-2015-27/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7fabfcd5fa10>
2016-09-05 01:53:21,308: [DEBUG]: Adding CDX API Handler: CC-MAIN-2015-27-index
2016-09-05 01:53:21,309: [DEBUG]: Adding query_html: query.html
2016-09-05 01:53:21,309: [DEBUG]: CDX Surt-Ordered? True
2016-09-05 01:53:21,365: [DEBUG]: CustomCanonilizer? True
2016-09-05 01:53:21,365: [DEBUG]: FuzzyMatcher? True
2016-09-05 01:53:21,365: [DEBUG]: Adding CDX Source: ZipNum Cluster: /cdx/collections/CC-MAIN-2015-35/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7fabfccf6850>
2016-09-05 01:53:21,366: [DEBUG]: Adding CDX API Handler: CC-MAIN-2015-35-index
2016-09-05 01:53:21,366: [DEBUG]: *** pywb app inited with config from "create_cdx_server_app"!

2016-09-05 01:53:21,367: [INFO]: Starting pywb CDX Index Server on port 8080
2016-09-05 01:53:25,010: [DEBUG]: Loading 5 blocks from s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-48/indexes/cdx-00027.gz:371355599+1218681
2016-09-05 01:53:25,018: [DEBUG]: Retrieving credentials from metadata server.
2016-09-05 01:53:25,020: [ERROR]: Caught exception reading instance data
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "/usr/lib/python2.7/urllib2.py", line 437, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 550, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 475, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 558, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 404: Not Found
2016-09-05 01:53:25,021: [ERROR]: Unable to read instance data, giving up
2016-09-05 01:53:25,021: [DEBUG]: Retrieving credentials from metadata server.
2016-09-05 01:53:25,022: [ERROR]: Caught exception reading instance data
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "/usr/lib/python2.7/urllib2.py", line 437, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 550, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 475, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 558, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 404: Not Found
2016-09-05 01:53:25,022: [ERROR]: Unable to read instance data, giving up
2016-09-05 01:53:25,023: [DEBUG]: path=/
2016-09-05 01:53:25,023: [DEBUG]: auth_path=/aws-publicdatasets/
2016-09-05 01:53:25,023: [DEBUG]: Method: HEAD
2016-09-05 01:53:25,023: [DEBUG]: Path: /
2016-09-05 01:53:25,023: [DEBUG]: Data: 
2016-09-05 01:53:25,023: [DEBUG]: Headers: {}
2016-09-05 01:53:25,024: [DEBUG]: Host: aws-publicdatasets.s3.amazonaws.com
2016-09-05 01:53:25,024: [DEBUG]: Port: 443
2016-09-05 01:53:25,024: [DEBUG]: Params: {}
2016-09-05 01:53:25,024: [DEBUG]: establishing HTTPS connection: host=aws-publicdatasets.s3.amazonaws.com, kwargs={'port': 443, 'timeout': 70}
2016-09-05 01:53:25,024: [DEBUG]: Token: None
2016-09-05 01:53:25,025: [DEBUG]: Final headers: {'Content-Length': 0, 'User-Agent': 'Boto/2.34.0 Python/2.7.9 Linux/3.16.0-4-amd64'}
2016-09-05 01:53:25,160: [DEBUG]: Response headers: [('x-amz-bucket-region', 'us-east-1'), ('x-amz-id-2', 'FHxxloN4AncgEdEcvzG2PPKZBw7Rx+/UUQyRwrCZ73+Mu4TVMLOUfPxIffAWYI8VyQApRAPu6Ew='), ('server', 'AmazonS3'), ('transfer-encoding', 'chunked'), ('x-amz-request-id', 'B1C197BFA572A608'), ('date', 'Mon, 05 Sep 2016 01:53:26 GMT'), ('content-type', 'application/xml')]
2016-09-05 01:53:25,160: [DEBUG]: path=//common-crawl/cc-index/collections/CC-MAIN-2015-48/indexes/cdx-00027.gz
2016-09-05 01:53:25,161: [DEBUG]: auth_path=/aws-publicdatasets//common-crawl/cc-index/collections/CC-MAIN-2015-48/indexes/cdx-00027.gz
2016-09-05 01:53:25,161: [DEBUG]: Method: HEAD
2016-09-05 01:53:25,161: [DEBUG]: Path: /common-crawl/cc-index/collections/CC-MAIN-2015-48/indexes/cdx-00027.gz
2016-09-05 01:53:25,161: [DEBUG]: Data: 
2016-09-05 01:53:25,161: [DEBUG]: Headers: {}
2016-09-05 01:53:25,161: [DEBUG]: Host: aws-publicdatasets.s3.amazonaws.com
2016-09-05 01:53:25,161: [DEBUG]: Port: 443
2016-09-05 01:53:25,161: [DEBUG]: Params: {}
2016-09-05 01:53:25,162: [DEBUG]: Token: None
2016-09-05 01:53:25,162: [DEBUG]: Final headers: {'Content-Length': 0, 'User-Agent': 'Boto/2.34.0 Python/2.7.9 Linux/3.16.0-4-amd64'}
2016-09-05 01:53:25,219: [DEBUG]: Response headers: [('content-length', '414361648'), ('x-amz-id-2', '+vH7cOoY3M4g1bEo2sULeD0cmUIbvUIXEWrLsH7ZEuicRM2AZv26wDCAqmivxkuduWuZMjGO+mo='), ('accept-ranges', 'bytes'), ('server', 'AmazonS3'), ('last-modified', 'Tue, 08 Dec 2015 05:04:41 GMT'), ('x-amz-request-id', 'FD1B95BAE96B41E5'), ('etag', '"3cb568591adb3a8a67138d8bdbc051f8"'), ('date', 'Mon, 05 Sep 2016 01:53:26 GMT'), ('x-amz-version-id', 'null'), ('content-type', 'application/octet-stream')]
2016-09-05 01:53:25,219: [DEBUG]: path=//common-crawl/cc-index/collections/CC-MAIN-2015-48/indexes/cdx-00027.gz
2016-09-05 01:53:25,220: [DEBUG]: auth_path=/aws-publicdatasets//common-crawl/cc-index/collections/CC-MAIN-2015-48/indexes/cdx-00027.gz
2016-09-05 01:53:25,220: [DEBUG]: Method: GET
2016-09-05 01:53:25,220: [DEBUG]: Path: /common-crawl/cc-index/collections/CC-MAIN-2015-48/indexes/cdx-00027.gz
2016-09-05 01:53:25,220: [DEBUG]: Data: 
2016-09-05 01:53:25,220: [DEBUG]: Headers: {'Range': 'bytes=371355599-372574279'}
2016-09-05 01:53:25,220: [DEBUG]: Host: aws-publicdatasets.s3.amazonaws.com
2016-09-05 01:53:25,220: [DEBUG]: Port: 443
2016-09-05 01:53:25,220: [DEBUG]: Params: {}
2016-09-05 01:53:25,221: [DEBUG]: Token: None
2016-09-05 01:53:25,221: [DEBUG]: Final headers: {'Range': 'bytes=371355599-372574279', 'Content-Length': 0, 'User-Agent': 'Boto/2.34.0 Python/2.7.9 Linux/3.16.0-4-amd64'}
2016-09-05 01:53:25,258: [DEBUG]: Response headers: [('content-length', '1218681'), ('x-amz-id-2', 'M3pRoibJ18iBhNrS61/evlp9nU3FJyulAnmNGRQI8P8KSLe4DYS3SUssihbqAVgB3pi9oc/YPFQ='), ('accept-ranges', 'bytes'), ('server', 'AmazonS3'), ('last-modified', 'Tue, 08 Dec 2015 05:04:41 GMT'), ('content-range', 'bytes 371355599-372574279/414361648'), ('x-amz-request-id', 'A6331FC1452CBE4F'), ('etag', '"3cb568591adb3a8a67138d8bdbc051f8"'), ('date', 'Mon, 05 Sep 2016 01:53:26 GMT'), ('x-amz-version-id', 'null'), ('content-type', 'application/octet-stream')]
127.0.0.1 - - [05/Sep/2016 01:53:25] "GET /CC-MAIN-2015-48-index?matchType=domain&filter=statuscode:200&filter=mime:text/html&output=json&fl=url,digest,length,offset,filename&limit=40000&url=bing.com HTTP/1.1" 500 85
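
For reference, the `Range` header in the log follows from the `offset+length` suffix on the "Loading 5 blocks" line. The sketch below reproduces that arithmetic. `range_header` is a hypothetical helper written for illustration, not pywb's actual implementation.

```python
import re

# "<s3 uri>:<offset>+<length>", as printed on the "Loading ... blocks from" log line.
BLOCK_SPEC = re.compile(r"^(?P<uri>s3://.+):(?P<offset>\d+)\+(?P<length>\d+)$")


def range_header(block_spec):
    """Split a block spec into (uri, HTTP Range header value)."""
    match = BLOCK_SPEC.match(block_spec)
    if match is None:
        raise ValueError("unrecognized block spec: %r" % block_spec)
    offset = int(match.group("offset"))
    length = int(match.group("length"))
    # HTTP byte ranges are inclusive on both ends, so the last byte is
    # offset + length - 1.
    return match.group("uri"), "bytes=%d-%d" % (offset, offset + length - 1)


uri, header = range_header(
    "s3://aws-publicdatasets/common-crawl/cc-index/collections/"
    "CC-MAIN-2015-48/indexes/cdx-00027.gz:371355599+1218681"
)
print(header)  # bytes=371355599-372574279, matching the GET request in the log
```

The result matches the logged request, and the response's `content-length` of 1218681 equals the requested length.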

I'm not sure, but it looks like s3://aws-publicdatasets is no longer being maintained and has a lot of inconsistencies.

When using s3://commoncrawl, all files are accessible and CC-MAIN-2015-48 works fine.

What's the difference between these two buckets?

Should we move to s3://commoncrawl? What do you guys think?
Thanks
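
If the move makes sense, the path change could look roughly like this. It is a sketch only, and it assumes the keys under s3://commoncrawl mirror the old layout with the leading `common-crawl/` prefix dropped, which should be checked against the bucket before relying on it. `to_commoncrawl_bucket` is a hypothetical helper name.

```python
OLD_PREFIX = "s3://aws-publicdatasets/common-crawl/"
NEW_PREFIX = "s3://commoncrawl/"  # assumed layout, see the note above


def to_commoncrawl_bucket(uri):
    """Rewrite an aws-publicdatasets URI to the commoncrawl bucket.

    URIs that do not start with the old prefix are returned unchanged.
    """
    if uri.startswith(OLD_PREFIX):
        return NEW_PREFIX + uri[len(OLD_PREFIX):]
    return uri


print(to_commoncrawl_bucket(
    "s3://aws-publicdatasets/common-crawl/cc-index/collections/"
    "CC-MAIN-2015-48/indexes/cluster.idx"
))
```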
