comcrawl's Introduction

Software Engineer with a passion for Productivity, Product Management, Design Systems, Component Architectures and AI.

Currently building web stuff at Joyn. Previously at Lateral and ApoSync.


I have several years of experience ranging from working at small-scale startups in the medical and AI industry to a bigger scale-up in the media and entertainment industry.

I care deeply about UX and DX alike and always strive for the code to be just as delightful for developers as the user interfaces built with it should be for the end users.

I value good product management, because if I can see that the product I am working on is solving the right problems in the right way, it really fuels my dedication.

Finally, I am very keen on productivity topics and techniques for focused work and effective communication. Especially when working remotely, I think it is extremely important to pay attention to how information is shared and how time is allocated.

comcrawl's People

Contributors

michaelharms, sarunas-girdenas

comcrawl's Issues

How to get text (WET) instead of HTML (WARC)?

From what I understand, the index offsets are different for WET and WARC files. Is there a way to search for the WET files using index.commoncrawl.org, or would we need to download all the indexes and rewrite this script to read from them instead of querying index.commoncrawl.org?
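One possible workaround, sketched here under assumptions rather than as a comcrawl feature: Common Crawl conventionally publishes a WET file next to each WARC file, with `/warc/` replaced by `/wet/` in the path and `.warc.gz` replaced by `.warc.wet.gz`. If that convention holds for the crawl in question, the `filename` field of an index result can be mapped to its WET counterpart. The helper name `warc_to_wet_path` and the sample path are hypothetical. The offset and length from the index still refer to the WARC file only, so the WET file would have to be downloaded and scanned for the matching URL.

```python
def warc_to_wet_path(warc_path):
    """Map a WARC file path to its WET counterpart.

    Assumes the naming convention .../warc/NAME.warc.gz -> .../wet/NAME.warc.wet.gz.
    """
    if "/warc/" not in warc_path or not warc_path.endswith(".warc.gz"):
        raise ValueError(f"not a recognised WARC path: {warc_path}")
    # Swap the directory segment first, then the file extension.
    wet_path = warc_path.replace("/warc/", "/wet/", 1)
    return wet_path[: -len(".warc.gz")] + ".warc.wet.gz"


# Hypothetical placeholder path, not a real file.
example = "crawl-data/CC-MAIN-2019-51/segments/SEGMENT/warc/EXAMPLE-00000.warc.gz"
print(warc_to_wet_path(example))
```

Because the byte offsets differ between the two formats, this only narrows the search down to one WET file; it does not give direct access to a single record.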

Filter data based on year

I am trying to get data for the years 2019 to 2020 with comcrawl, using client = IndexClient(['2019-04', '2020-45']). It looks like this returns data only for the two indexes 2019-04 and 2020-45. Is there a way to filter data by year? Do I have to pass every index for all the years I need as a parameter?
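A minimal sketch of one way to do this, assuming index names follow the `YYYY-WW` pattern used in the question: filter a list of available index names by year and pass the result to `IndexClient`. The helper `indexes_for_years` and the sample list are hypothetical; in practice the list of available names would have to come from comcrawl or from Common Crawl's collection info.

```python
def indexes_for_years(available, first_year, last_year):
    """Return the index names whose year lies in [first_year, last_year]."""
    selected = []
    for name in available:
        # Index names are assumed to look like "2019-04" (year, then crawl number).
        year = int(name.split("-")[0])
        if first_year <= year <= last_year:
            selected.append(name)
    return selected


# Hypothetical sample of available index names.
available = ["2020-45", "2020-05", "2019-51", "2019-04", "2018-51"]
wanted = indexes_for_years(available, 2019, 2020)
print(wanted)

# With comcrawl installed, the filtered list could then be passed on:
# from comcrawl import IndexClient
# client = IndexClient(wanted)
```

Passing two index names selects exactly those two crawls, not the range between them, which is why every index in the range has to be listed.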

JSONDecodeError while instantiating `IndexClient`

Hello, I'm trying to run the following code and getting the error below. (It's in a virtual environment on a Mac with Python 3.7.0. I tried it on Google Colab and got the same error.)

Any idea what I might be doing wrong?

Thanks!

>>> from comcrawl import IndexClient
>>> client = IndexClient(verbose=True)
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): index.commoncrawl.org:443
DEBUG:urllib3.connectionpool:https://index.commoncrawl.org:443 "GET /collinfo.json HTTP/1.1" 502 182
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/Elias/Desktop/Temp/commoncrawl/venv/lib/python3.7/site-packages/comcrawl/core/index_client.py", line 49, in __init__
    self.indexes = fetch_available_indexes()
  File "/Users/Elias/Desktop/Temp/commoncrawl/venv/lib/python3.7/site-packages/comcrawl/utils/initialization.py", line 20, in fetch_available_indexes
    .get("https://index.commoncrawl.org/collinfo.json")
  File "/Users/Elias/Desktop/Temp/commoncrawl/venv/lib/python3.7/site-packages/requests/models.py", line 898, in json
    return complexjson.loads(self.text, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
$ pip freeze
certifi==2020.6.20
chardet==3.0.4
comcrawl==1.0.2
idna==2.10
requests==2.24.0
urllib3==1.25.10
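The debug log above shows that the request to `/collinfo.json` returned HTTP 502, so the body was an error page rather than JSON, which would explain the `JSONDecodeError`. The following is a hedged sketch, not comcrawl's own code, of checking the status before parsing and retrying a few times. The names `http_get` and `fetch_json_with_retry`, the retry count, and the delay are assumptions made for illustration.

```python
import json
import time
import urllib.error
import urllib.request

COLLINFO_URL = "https://index.commoncrawl.org/collinfo.json"


def http_get(url):
    """Fetch url and return (status_code, body_text), including for HTTP errors."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.status, resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # Error statuses such as 502 arrive here; connection failures still propagate.
        return err.code, err.read().decode("utf-8", errors="replace")


def fetch_json_with_retry(url, fetch=http_get, retries=3, delay=2.0):
    """Parse the body as JSON only when the server answers with status 200."""
    last_status = None
    for attempt in range(retries):
        status, body = fetch(url)
        if status == 200:
            return json.loads(body)
        last_status = status
        if attempt < retries - 1:
            time.sleep(delay)
    raise RuntimeError(f"{url} returned HTTP {last_status} after {retries} attempts")


# Example call (needs network access):
# collections = fetch_json_with_retry(COLLINFO_URL)
```

If the server keeps answering with 502, the final `RuntimeError` at least names the status code instead of surfacing as a JSON parsing failure.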

client = IndexClient() not working

Hi,

I've been using this library to fetch HTML files for quite some time, but today I'm getting an error that says "Failed to establish the connection". I've tried two different internet connections, and it fails on both.

Is this an issue with the Common Crawl server?
