livescrape's People

Contributors

kvdveer

Forkers

datahack-ru

livescrape's Issues

Add documentation to QA, somehow

Documentation is a significant part of this project, so it needs CI/QA as well. Some ideas:

  • Spell checking
  • Extract code fragments and check for syntax errors
  • Extract code fragments and try to execute them.
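The second idea could be sketched roughly as follows. This is only an illustration: it assumes the documentation is reStructuredText with fragments in top-level `.. code-block:: python` directives, and the helper names `extract_fragments` and `syntax_errors` are hypothetical. It only parses the fragments and does not execute them.

```python
import ast
import textwrap


def extract_fragments(rst_text):
    """Yield the body of each top-level '.. code-block:: python' directive."""
    lines = rst_text.splitlines()
    i = 0
    while i < len(lines):
        if lines[i].strip() == ".. code-block:: python":
            i += 1
            body = []
            # The directive body is every following blank or indented line.
            while i < len(lines) and (not lines[i].strip() or lines[i].startswith(" ")):
                body.append(lines[i])
                i += 1
            yield textwrap.dedent("\n".join(body)).strip("\n")
        else:
            i += 1


def syntax_errors(rst_text):
    """Return (fragment_number, message) for each fragment that fails to parse."""
    errors = []
    for number, fragment in enumerate(extract_fragments(rst_text), start=1):
        try:
            ast.parse(fragment)
        except SyntaxError as exc:
            errors.append((number, exc.msg))
    return errors


# Made-up sample document: the second fragment contains a stray '@'.
SAMPLE = """\
Intro text.

.. code-block:: python

    x = 1
    print(x)

More text.

.. code-block:: python

    git_repo = @Css("input", attribute="value")
"""

fragments = list(extract_fragments(SAMPLE))
errors = syntax_errors(SAMPLE)
```

Executing the fragments (the third idea) would need more care, since tutorial code typically performs network requests.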

http headers

Hi Koert,

I've had an unsuccessful stab at allowing one to pass user agent strings to the requests call in livescrape. I'd like to be able to instantiate livescrape with something like:

page = ScrapedPage(scrape_url='zzz', headers={'User-agent':'Mozilla something or other'})

Any chance of you having a look at this?

Thanks,

Mat
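A minimal sketch of the requested behaviour, under stated assumptions: `HeaderAwarePage` is a hypothetical stand-in, not livescrape's `ScrapedPage`, and it uses the standard library's `urllib.request` rather than requests so that it stays dependency-free. With requests, the equivalent would be passing the same dictionary as the `headers` argument of `requests.get`.

```python
import urllib.request


class HeaderAwarePage:
    """Hypothetical stand-in for ScrapedPage; not livescrape's actual API."""

    def __init__(self, scrape_url, headers=None):
        self.scrape_url = scrape_url
        # Copy the mapping so later changes by the caller do not leak in.
        self.headers = dict(headers or {})

    def build_request(self):
        # urllib stores header names capitalized, e.g. "User-agent".
        return urllib.request.Request(self.scrape_url, headers=self.headers)

    def fetch(self):
        # Performs the actual HTTP request; not called in this sketch.
        with urllib.request.urlopen(self.build_request()) as response:
            return response.read()


page = HeaderAwarePage(
    "https://example.com/",
    headers={"User-agent": "Mozilla/5.0 (example)"},
)
request = page.build_request()
```

Here `request.get_header("User-agent")` gives back the string supplied by the caller, which is what the server would see.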

Document relation to other scraping projects.

Livescrape is not the only player in the scraping ecosystem. We will need to explain the reasoning behind its creation, and inform users so they can choose between Livescrape and other projects.

We should at the very least address Scrapy.

Tutorial issues

Hi,

Firstly, thanks for writing this, looks really useful.

I'd like to make a couple of suggestions for the tutorial page.

  1. from livescrape import ScrapedPage, Css, CssMulti
    (beginner friendlify it)
  2. class GithubProjectPage(ScrapedPage):
    ^^ your version didn't capitalise Page in GithubProjectPage
  3. project_page = GithubProjectPage("ondergetekende", "livesscrape")
    ( double s in livesscrape)
  4. print (project_page.contents) returns empty dicts. It would improve the tutorial if something came back here.

I appreciate this isn't as interesting as editing code - I'm happy to send over an error-free tutorial page (or figure out git) if it's easier.

Thanks again,

Mat

Typos in tutorial

I'm assuming that you didn't intend to use a decorator here?

from livescrape import ScrapedPage, Css

class GithubProjectpage(ScrapedPage):
    scrape_url = "https://github.com/%(username)s/%(projectname)s/"
    scrape_args = ("username", "projectname")

    git_repo = @Css("input.input-monospace", attribute="value")

page = GithubProjectpage("python", "cpython")
print(page.description)

Figure out how to use decorator syntax with CssGroup

For regular attributes, we can use decorator syntax, but that won't work for CssGroup attributes.

Perhaps something like this?

group = CssGroup(".whatever")

@group
@Css("thing")
def foo(self, value, element):
    return None
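To check that this stacking is at least feasible in plain Python, here is a toy sketch. The `Css` and `CssGroup` classes below are simplified stand-ins written for illustration and do not reflect livescrape's implementation; the `members` registry is an assumption.

```python
class Css:
    """Toy stand-in: remembers a selector and an optional cleanup function."""

    def __init__(self, selector):
        self.selector = selector
        self.cleanup = None

    def __call__(self, func):
        # Used as a decorator: keep the function and return the Css object.
        self.cleanup = func
        return self


class CssGroup:
    """Toy stand-in: collects decorated attributes by function name."""

    def __init__(self, selector):
        self.selector = selector
        self.members = {}

    def __call__(self, attribute):
        # The outer decorator receives the Css object built by the inner one.
        self.members[attribute.cleanup.__name__] = attribute
        return attribute


group = CssGroup(".whatever")


@group
@Css("thing")
def foo(self, value, element):
    return None
```

After this runs, `foo` is the `Css` object and `group.members["foo"]` refers to the same object, so the group knows its members without any extra registration step.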

CssMulti is counter-intuitive

Whereas the ScrapedPage descendants use attribute access to access the members, CssMulti uses dictionary access. This is especially confusing when the CssMulti exposes a CssLink. In that case you end up with access like article_index.posts[0]['author'].homepage. I feel this should look like article_index.posts[0].author.homepage.
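The difference can be illustrated with a small wrapper that exposes dictionary entries as attributes. `Record` is a hypothetical name used only for this sketch, and the nested data is made up.

```python
class Record:
    """Hypothetical wrapper giving attribute access to scraped values."""

    def __init__(self, values):
        self._values = dict(values)

    def __getattr__(self, name):
        # Only called when normal lookup fails; read via __dict__ to avoid
        # recursing if _values has not been set yet.
        values = self.__dict__.get("_values", {})
        if name in values:
            return values[name]
        raise AttributeError(name)

    def __getitem__(self, key):
        # Keep dictionary-style access working for existing callers.
        return self._values[key]


posts = [Record({"author": Record({"homepage": "https://example.com/"})})]

old_style = posts[0]["author"].homepage
new_style = posts[0].author.homepage
```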

Also, the name CssMulti doesn't really cover its intended use. I'm considering a syntax like:

class ExampleIndex(ScrapedPage):
    @CssComposite(".article")
    class article(ScrapedSubPage):
        link = CssLink('a.main')

That way the attribute resolving can be made lazy, which will help performance. This method also allows us to use the decorator syntax, and allows definition of non-scraper things.
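One way lazy resolution could work is a non-data descriptor that computes a value on first access and caches it on the instance. This is a generic sketch, similar in spirit to `functools.cached_property`; `LazyAttribute` and `Article` are hypothetical names, and the returned URL is a placeholder for real scraping work.

```python
class LazyAttribute:
    """Compute a value on first access, then cache it on the instance."""

    def __init__(self, compute):
        self.compute = compute
        self.name = None

    def __set_name__(self, owner, name):
        # Called automatically when the class body is evaluated.
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        value = self.compute(instance)
        # No __set__ is defined, so the instance attribute stored here
        # shadows the descriptor on every later access.
        instance.__dict__[self.name] = value
        return value


class Article:
    def __init__(self):
        self.lookups = 0

    @LazyAttribute
    def link(self):
        self.lookups += 1  # stands in for an expensive selector lookup
        return "https://example.com/article"


article = Article()
first = article.link
second = article.link
```

Both accesses return the same value, but the lookup function runs only once.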

One disadvantage is that because the class transitions from being a class to an attribute, there is no way to avoid violating the PEP 8 naming scheme. We could work around that by using a two-stage definition scheme, but that feels cumbersome to me.
