
lazynlp's Introduction

Hi, I'm Chip 👋

I'm a writer and computer scientist. I grew up chasing grasshoppers in a small rice-farming village in Vietnam. I spend a lot of time with chickens and alpacas.

I'm best reached via email. I'm always open to interesting conversations and collaboration.



lazynlp's People

Contributors

cclauss, chiphuyen, monomagentaeggroll, ss18


lazynlp's Issues

License?

Hello,

There are legal problems with code that has no license. Where I work, using code that has no license attached to it is outright banned.

Would you be so kind as to add some sort of license in a file?

It would be very nice of you if it were something permissive, like MIT or Apache 2 or BSD too.

Thank you!

Sum of n-gram counts

Thanks for building this, really nice work!

I was reading through the code and noticed this line

count.update()

Were you looking to iteratively add up the line-ngram-counts? If yes, I can help complete that and raise a PR

Lmk

All the best
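For reference, a minimal sketch of what iteratively adding up per-line n-gram counts could look like. The helper names `line_ngrams` and `count_ngrams` are hypothetical and not taken from lazynlp; the only library behaviour relied on is that `collections.Counter.update` adds counts when given an iterable and does nothing when called with no arguments.

```python
from collections import Counter


def line_ngrams(line, n):
    """Return the word n-grams of one line as a list of tuples (hypothetical helper)."""
    tokens = line.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(lines, n):
    """Sum n-gram counts over all lines (hypothetical helper)."""
    count = Counter()
    for line in lines:
        # Passing the line's n-grams is what makes the counts accumulate;
        # a bare count.update() would leave the counter unchanged.
        count.update(line_ngrams(line, n))
    return count
```

Calling `count_ngrams(["a b a b", "a b"], 2)` would add the bigrams of both lines into a single counter.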

Bugs and Errors Format issue

Hello,

I am reaching out regarding the Python source files in this repository. After running Pyflakes and Pylint on them, I found a few errors in the source code that I felt you could fix or look into.

lazynlp/analytics.py:200:10: C0209: Formatting a regular string which could be an f-string (consider-using-f-string)
lazynlp/analytics.py:231:21: W0613: Unused argument 'file' (unused-argument)
lazynlp/analytics.py:231:27: W0613: Unused argument 'gran' (unused-argument)
lazynlp/analytics.py:231:40: W0613: Unused argument 'max_n' (unused-argument)

lazynlp/cleaner.py:74:8: R1724: Unnecessary "else" after "continue", remove the "else" and de-indent the code inside it (no-else-continue)
lazynlp/crawl.py:96:4: E0633: Attempting to unpack a non-sequence defined at line 65 of tldextract.tldextract (unpacking-non-sequence)
lazynlp/crawl.py:106:0: R0911: Too many return statements (11/6) (too-many-return-statements)

outputLint.txt

These issues cause unnecessary memory use, and the code could be better formatted. I have shown a few of the errors that I found using Pylint. I also added a link to a text file listing the errors/bugs that Pylint reported across all source files. Hope this helps.
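To illustrate two of the messages above, here is a small sketch of the patterns Pylint suggests for C0209 (consider-using-f-string) and R1724 (no-else-continue). The functions `summary` and `keep_long_lines` are made up for illustration and are not the code in analytics.py or cleaner.py.

```python
def summary(name, total):
    """C0209: an f-string in place of '%s: %d lines' % (name, total)."""
    return f"{name}: {total} lines"


def keep_long_lines(lines, min_len=3):
    """R1724: no 'else' after 'continue'; the remaining code is de-indented."""
    kept = []
    for line in lines:
        if len(line) < min_len:
            continue
        # This runs only when the 'continue' above did not fire,
        # so wrapping it in an 'else' block would add nothing.
        kept.append(line)
    return kept
```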

Regards,
Rebal

urllib fails without headers

Hi,
Thanks for this great tool.

I noticed urllib fails with a Forbidden Request error when I call download_page on some links. You can reproduce the error by trying the code below:

import lazynlp
link = "https://punchng.com/"
page = lazynlp.download_page(link, context=None, timeout=None)

This raises a 403 error (screenshot attached in the original issue: Screen Shot 2019-09-16 at 2 09 51 PM).

I've attempted to create a PR that adds headers to the request by default.
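A rough sketch of the idea, using only the standard library. The names `DEFAULT_HEADERS`, `build_request` and `fetch` and the User-Agent string are assumptions made for illustration; they are not lazynlp's API and not necessarily what the PR does. Whether a given site stops returning 403 once a User-Agent is sent depends on that site.

```python
import urllib.request

# Hypothetical default headers; the User-Agent value is only an example.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-crawler)"}


def build_request(url, headers=None):
    """Build a Request that carries default headers, optionally overridden."""
    merged = dict(DEFAULT_HEADERS)
    if headers:
        merged.update(headers)
    return urllib.request.Request(url, headers=merged)


def fetch(url, timeout=30, context=None):
    """Download url and return the raw bytes (performs a network request)."""
    request = build_request(url)
    with urllib.request.urlopen(request, timeout=timeout, context=context) as resp:
        return resp.read()
```

Only `fetch` touches the network; `build_request` can be checked offline by inspecting the headers on the returned object.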

"Bug Report: Pylint Warning W0102 - Dangerous Default Value in download_pages Function"

Hello,

I am reaching out regarding one of your Python source files (crawl.py). After running Pylint on it, I found a few errors in the source code that I felt you could fix or look into.

lazynlp/crawl.py:173:0: W0102: Dangerous default value [] as argument (dangerous-default-value)
lazynlp/crawl.py:173:0: W0102: Dangerous default value [] as argument (dangerous-default-value)
lazynlp/crawl.py:222:8: W0105: String statement has no effect (pointless-string-statement)

outputLint.txt

Possible Fix is:

def download_pages(link_file,
                   folder,
                   timeout=30,
                   default_skip=True,
                   extensions=None,
                   domains=None):
    """
    Your function documentation here.
    """
    # Check if extensions and domains are None, and if so, initialize them to empty lists
    if extensions is None:
        extensions = []
    if domains is None:
        domains = []

    # Your function code continues...

This modification ensures that each call to download_pages() gets its own separate empty list for extensions and domains.
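To see why W0102 matters, here is a small stand-alone sketch. The functions `append_shared` and `append_fresh` are hypothetical and unrelated to crawl.py; they only show that a default list is created once, when the function is defined, and is then shared by every call that relies on the default.

```python
def append_shared(item, bucket=[]):  # the pattern W0102 warns about
    """Append to a default list that is shared across calls."""
    bucket.append(item)
    return bucket


def append_fresh(item, bucket=None):
    """Append to a new list on each call unless the caller passes one."""
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket
```

With `append_shared`, items from earlier calls are still in the list on later calls; with `append_fresh`, each call that omits `bucket` starts from an empty list.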

I have shown a few of the errors that I found using Pylint. I also added a link to a text file listing the errors that Pylint reported across all source files. Hope this helps.

Regards,
Rebal

Bug and Error Report for unused variables

Hello,
I am reaching out regarding your Python code. After running Pylint and Pyflakes, I found a few errors concerning unused variables in the source code that I felt could be worth looking into and fixing:

Pylint:
lazynlp/cleaner.py:21:4: W0612: Unused variable 'e' (unused-variable)
lazynlp/crawl.py:95:4: W0612: Unused variable 'raw_url' (unused-variable)
lazynlp/crawl.py:118:4: W0612: Unused variable 'e' (unused-variable)
Pyflakes:
lazynlp/crawl.py:95:5: local variable 'raw_url' is assigned to but never used
lazynlp/crawl.py:118:5: local variable 'e' is assigned to but never used
lazynlp/crawl.py:121:5: local variable 'e' is assigned to but never used
lazynlp/crawl.py:152:5: local variable 'e' is assigned to but never used
lazynlp/crawl.py:155:5: local variable 'e' is assigned to but never used

outputFlakes.txt

outputLint.txt

There were a few more in the remaining source files, but to keep this message short I have shown only some of the unused-variable errors that I found using Pyflakes and Pylint. I also added links to text files listing the errors/bugs that Pylint and Pyflakes reported across all source files. Hope this helps.
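As an illustration of how warnings like these are commonly resolved, here is a short sketch. The functions `parse_int` and `second_of` are invented for the example and are not the code in cleaner.py or crawl.py; the assumption is that the flagged variables really are unneeded.

```python
def parse_int(text, default=0):
    """Return int(text), or default when text is not a valid integer."""
    try:
        return int(text)
    except ValueError:  # no "as e", since the exception object is never used
        return default


def second_of(pair):
    """Return the second element of a 2-tuple, ignoring the first."""
    # A leading underscore marks a value that is deliberately unused.
    _, value = pair
    return value
```

If the exception is actually worth reporting, the alternative is to keep `as e` and use it, for example by logging it.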

Regards,
Rebal

syntax error near unexpected token

I see a "syntax error near unexpected token `sgp.urls,'" on submitting the following command:
lazynlp.download_pages(sgp.urls, text_docs, timeout = 30, default_skip = True, extensions = [], domains = [])

Am I doing something wrong? sgp.urls contains all the URLs, text_docs is the name of the folder for the outputs, and the rest of the parameters are left at their defaults.
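One possible reading, offered as a guess rather than a diagnosis: "syntax error near unexpected token" is the wording a shell such as bash uses, which would suggest the call was typed at a shell prompt instead of inside Python. If so, a sketch like the following might help. It assumes that sgp.urls is a file path and text_docs a folder name, both passed as quoted strings; the `run` wrapper and the guarded import are additions for illustration only.

```python
# Save as a .py file and run it with the Python interpreter, not in the shell.
try:
    import lazynlp
except ImportError:
    lazynlp = None  # lets this sketch load where lazynlp is not installed

link_file = "sgp.urls"  # assumed: path to the file of URLs, as a string
folder = "text_docs"    # assumed: name of the output folder, as a string
options = {"timeout": 30, "default_skip": True, "extensions": [], "domains": []}


def run():
    """Call download_pages with the arguments from the question."""
    if lazynlp is None:
        raise RuntimeError("lazynlp is not installed")
    return lazynlp.download_pages(link_file, folder, **options)
```

Calling `run()` from Python then issues the same call as in the question, with the two names quoted.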
