comcrawl's Issues

client = IndexClient() not working

Hi,

I've been using this library to fetch HTML files for quite some time, but today I'm getting an error that says "Failed to establish the connection". I've tried two different internet connections and it fails on both.

Is this an issue with the Common Crawl server?

JSONDecodeError while instantiating `IndexClient`

Hello, I'm trying to run the following code and getting this error. (It's in a virtual environment on a Mac, with Python v3.7.0. I tried on Google Colab and got the same error as well.)

Any idea what I might be doing wrong?

Thanks!

>>> from comcrawl import IndexClient
>>> client = IndexClient(verbose=True)
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): index.commoncrawl.org:443
DEBUG:urllib3.connectionpool:https://index.commoncrawl.org:443 "GET /collinfo.json HTTP/1.1" 502 182
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/Elias/Desktop/Temp/commoncrawl/venv/lib/python3.7/site-packages/comcrawl/core/index_client.py", line 49, in __init__
    self.indexes = fetch_available_indexes()
  File "/Users/Elias/Desktop/Temp/commoncrawl/venv/lib/python3.7/site-packages/comcrawl/utils/initialization.py", line 20, in fetch_available_indexes
    .get("https://index.commoncrawl.org/collinfo.json")
  File "/Users/Elias/Desktop/Temp/commoncrawl/venv/lib/python3.7/site-packages/requests/models.py", line 898, in json
    return complexjson.loads(self.text, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
$ pip freeze
certifi==2020.6.20
chardet==3.0.4
comcrawl==1.0.2
idna==2.10
requests==2.24.0
urllib3==1.25.10
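
A note on the traceback above: the debug log shows the server answering `GET /collinfo.json` with a 502, so the body is presumably an error page rather than JSON, which would explain the `JSONDecodeError`. The following is a minimal sketch of how one might check the response before parsing it. The names `IndexUnavailableError`, `parse_index_list` and `fetch_index_list` are hypothetical helpers, not part of comcrawl, and the assumption that each entry of `collinfo.json` carries an `"id"` field should be checked against the live response.

```python
import json

COLLINFO_URL = "https://index.commoncrawl.org/collinfo.json"  # URL taken from the traceback


class IndexUnavailableError(RuntimeError):
    """Hypothetical error type: the server did not return a usable index list."""


def parse_index_list(status_code, body):
    """Return the index ids from a collinfo.json response body.

    Raises IndexUnavailableError with a readable message instead of letting
    a JSONDecodeError surface when the server sends an error page.
    Assumption: the body is a JSON list of objects with an "id" key.
    """
    if status_code != 200:
        raise IndexUnavailableError(
            f"index server answered HTTP {status_code}; try again later"
        )
    try:
        entries = json.loads(body)
    except json.JSONDecodeError as exc:
        raise IndexUnavailableError("index server returned a non-JSON body") from exc
    return [entry["id"] for entry in entries]


def fetch_index_list(url=COLLINFO_URL, timeout=10):
    """Fetch and parse the index list (needs network access and requests)."""
    import requests  # imported lazily so parse_index_list works without it

    response = requests.get(url, timeout=timeout)
    return parse_index_list(response.status_code, response.text)
```

Calling `fetch_index_list()` before `IndexClient()` would at least turn a temporary server outage into a clear message; it does not fix the outage itself.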

Filter data based on year

I am trying to get data for the years 2019 to 2020 with comcrawl using client = IndexClient(['2019-04', '2020-45']). It looks like this returns data only for the two indexes 2019-04 and 2020-45. Is there a way to filter data by year? Do I have to pass every index for all the years needed as a parameter?
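
One possible workaround, sketched under assumptions: build the full list of index names for the wanted years and pass that list to `IndexClient`. The helper `indexes_for_years` below is hypothetical and not part of comcrawl. It assumes index names have the `YYYY-WW` form used in the call above, and that a list of available names can be obtained somewhere (the traceback in the previous issue suggests `IndexClient` stores one in `self.indexes`, but that is not confirmed here).

```python
import re

# Matches names such as "2019-04"; anything else is skipped.
_WEEKLY_INDEX = re.compile(r"^(\d{4})-(\d{2})$")


def indexes_for_years(available, first_year, last_year):
    """Keep only the index names whose year lies in [first_year, last_year].

    `available` is any iterable of index names; the input order is preserved.
    """
    selected = []
    for name in available:
        match = _WEEKLY_INDEX.match(name)
        if match and first_year <= int(match.group(1)) <= last_year:
            selected.append(name)
    return selected


# Usage sketch (assumes comcrawl is installed and the server is reachable):
#   from comcrawl import IndexClient
#   all_names = IndexClient().indexes           # assumption: attribute holds the names
#   client = IndexClient(indexes_for_years(all_names, 2019, 2020))
```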

How to get TEXT (WET) instead of HTML (WARC)?

From what I understand, the index offsets are different for WET vs WARC. So is there a way to search for the WET files using index.commoncrawl.org, or would we need to download all the indexes and rewrite this script to read from them instead of querying index.commoncrawl.org?
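
A partial sketch, under stated assumptions: Common Crawl WET files normally sit next to their WARC counterparts, with `/warc/` replaced by `/wet/` and `.warc.gz` replaced by `.warc.wet.gz`. If an index result carries the WARC path (assumed here to be in a `filename` field), the matching WET file name can be derived from it. The WARC offset and length would not apply to the WET file, so the file would still have to be downloaded and scanned for the target URL. `wet_path_from_warc` is a hypothetical helper, not part of comcrawl.

```python
def wet_path_from_warc(warc_filename):
    """Derive the WET file path that corresponds to a WARC file path.

    Raises ValueError for paths that do not follow the usual
    ".../warc/<name>.warc.gz" layout, since those may have no WET counterpart.
    """
    suffix = ".warc.gz"
    if "/warc/" not in warc_filename or not warc_filename.endswith(suffix):
        raise ValueError(f"not a regular WARC path: {warc_filename!r}")
    # Split on the last "/warc/" so earlier path components are left alone.
    head, _, tail = warc_filename.rpartition("/warc/")
    return f"{head}/wet/{tail[:-len(suffix)]}.warc.wet.gz"
```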
