
Comments (11)

alirezamika avatar alirezamika commented on August 15, 2024 1

Thanks for the examples @PickNickChock. They really help with diagnosing the problems and improving the library.

  1. Yes, there is a known issue with multiple tables that have similar structure and paths.
  2. Maybe you didn't escape the characters in the wanted list? In Python you should write '\\a' instead of '\a' so the backslash stays literal. I checked both cases and found no problem.
  3. For scraping tables, I recommend using the grouped=True parameter. It outputs each column separately without removing duplicates, so you can fine-tune the results. Again, I had no problem with it.
  4. Same as 3.

Also make sure you are using the latest version.
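The escaping issue in point 2 can be illustrated with plain Python string literals (a minimal demonstration, not specific to autoscraper):

import string_demo_not_needed  # no imports needed; see below

# In a regular string literal, '\a' is the ASCII bell character - a single
# character - so it will never match the literal backslash-a text on a page.
assert len('\a') == 1

# Escaping the backslash (or using a raw string) keeps it literal: two characters.
assert '\\a' == r'\a'
assert len('\\a') == 2

So when an item in the wanted list contains a backslash, write it as '\\a' or r'\a'.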

from autoscraper.

PickNickChock avatar PickNickChock commented on August 15, 2024 1

@felipewhitaker

If you want to get tables from a website, why not use pandas?

Just as I mentioned above. I guess the reason is that craine would like to do that with autoscraper. Also, sometimes introducing pandas as another dependency just to grab tables from a site would be overkill.

Furthermore, every table you can see in HTML was retrieved somehow. You might just be able to make a request to the URL.

That would mean you'd need to create a custom parser with BS4 or similar, unless the site provides an endpoint that returns the table data as JSON (I guess that's what req.json() in your message implies). However, again, the point is that craine would like to do this with autoscraper.
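For the record, such a custom parser doesn't have to pull in BS4 at all; here is a minimal sketch using only the standard library's html.parser (the sample HTML is made up for illustration):

from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collects the cell text of every table row it sees."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th'):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ('td', 'th'):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

parser = TableParser()
parser.feed('<table><tr><th>Modifier</th><th>C#</th></tr>'
            '<tr><td>abstract</td><td>Yes</td></tr></table>')
print(parser.rows)  # [['Modifier', 'C#'], ['abstract', 'Yes']]

That said, the whole point of autoscraper is to avoid writing this kind of boilerplate by hand.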

PickNickChock avatar PickNickChock commented on August 15, 2024

I'm 99% sure that this package is not capable of pulling tables (and I'm not sure that it should be), at least in a more or less «pretty» way. In any case, the provided site has several problems:

  1. It tends to block requests from code by IP. That might be solvable with a correct User-Agent header, but after autoscraper returned nothing, I made a request with requests and the response showed that I was blocked.
  2. The table contents are loaded via JavaScript, i.e. by the time autoscraper or another library downloads the page, there is no content to scrape yet.

At the current stage, if you don't want to build a custom scraper with something like BeautifulSoup and the sites you want to scrape have static content, you can try pandas, specifically the read_html function, which will try to extract all the tables on a page.

Also, if you want to scrape content that is loaded by JavaScript, like on the site you provided, you can try something like Selenium to simulate a browser and get the HTML of the page after everything has loaded.

commented on August 15, 2024
  1. It tends to block requests from code by IP. That might be solvable with a correct User-Agent header, but after autoscraper returned nothing, I made a request with requests and the response showed that I was blocked.

It's not blocking the request because of the User-Agent. They have the Incapsula scrape-protection service, which you can see in the request's cookies visid_incap and incap_ses.
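One way to spot this, sketched with only the standard library: check the cookie names the site sets. The name prefixes below come from the comment above; the numeric suffixes and values are made up for illustration.

from http.cookies import SimpleCookie

# Hypothetical cookie string from an Incapsula-protected site.
set_cookie = 'visid_incap_242093=q1w2e3; incap_ses_242093_715072=r4t5y6'

cookies = SimpleCookie()
cookies.load(set_cookie)

# Incapsula cookies start with 'visid_incap' or 'incap_ses'.
incapsula = any(name.startswith(('visid_incap', 'incap_ses')) for name in cookies)
print(incapsula)  # True

With requests, the same check can be run against response.cookies after a real request.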

craine avatar craine commented on August 15, 2024

I'm not focusing specifically on that website alone. I tried a few sites with tables. I know I can go grab stuff with BS or Scrapy but thought your tool was cool as hell and would save me a ton of time.

alirezamika avatar alirezamika commented on August 15, 2024

Can you share the other websites whose tables gave you trouble, so we can check?
Thanks.

PickNickChock avatar PickNickChock commented on August 15, 2024

Personally, I tried out several sites:

  1. Wikipedia, this page, for example. I want to get the Theatre table, so as the wanted_list I enter the items of the first row (in all the other examples I use the first row as well). When I look at the results, I see that the program scraped data from all the tables on the page.
  2. Then I tried this and this site. Each has only one table, so I thought everything should go fine. However, in both cases the program returned None. Probably the problem is that the wanted items include escape characters.
  3. I also tried to scrape one tricky table from this page. As the wanted list I wrote ['Yes', 'No', 'No', 'No', 'No', 'No'] and got ['Entire program', 'Yes', 'No', 'Containing class', 'Current assembly', 'Derived types', 'Derived types within current assembly']. As far as I remember, the program removes duplicates from the results, and in some cases (like this one) that may be undesirable.
  4. Finally, I found a simple table here. Enter the first row, scrape, and voilà: we get ['abstract', 'MustInherit', 'internal', 'Friend', 'new', 'Shadows'] and so on. In this case we kind of get what we want, but not in a very good format, so I guess one would have to reformat the results somehow to work with them further.

craine avatar craine commented on August 15, 2024

Another page was this:
https://www.pro-football-reference.com/teams/buf/2019_advanced.htm
I'd want to grab each table individually.

felipewhitaker avatar felipewhitaker commented on August 15, 2024

If you want to get tables from a website, why not use pandas?

import pandas as pd

io = "https://en.wikipedia.org/wiki/Daisy_Ridley"
dfs = pd.read_html(io)

# now dfs is a list of the tables in {io} - mostly well formatted and ready to be manipulated

print(dfs[1]) # the second table of {io}
# out
Year Title Role Notes
0 2013 Lifesaver Jo Screen debut; interactive short film[75]
1 2013 Blue Season Sarah Short film[75]
2 2013 100% BEEF Girl Short film[76]
3 2013 Crossed Wires Her Short film[77]
4 2014 Under Waitress Short film[75]
5 2015 Scrawl Hannah nan
6 2015 Star Wars: The Force Awakens Rey nan
7 2016 Only Yesterday Taeko Okajima Voice; English dub
8 2016 The Eagle Huntress Narrator Voice; also executive producer
9 2017 Murder on the Orient Express Mary Debenham nan
10 2017 Star Wars: The Last Jedi Rey nan
11 2018 Ophelia Ophelia nan
12 2018 Peter Rabbit Cottontail Rabbit Voice; also featured in a short companion piece named Flopsy Turvy
13 2019 Star Wars: The Rise of Skywalker Rey nan
14 2020 Asteroid Hunters[78] Narrator Voice; post-production
15 2021 Chaos Walking Viola Eade Post-production

Furthermore, every table you can see in HTML was retrieved somehow. You might just be able to make a request to the URL.

import requests

url = ""  # an endpoint that returns the table data as JSON
req = requests.get(url)
data = req.json()

craine avatar craine commented on August 15, 2024

@PickNickChock 100%. I know how to use Scrapy and BS4. The beauty of this tool is simplicity. Just thought it'd be a great feature.

ubalklen avatar ubalklen commented on August 15, 2024

@craine I agree, tables should be auto scrapable.

In the meantime, I created Untable, a tiny module that does exactly that.
