Comments (12)

joenano commented on September 24, 2024

Have implemented this, seems to be working as expected so far.

from rpscrape.

joenano commented on September 24, 2024

The site limits requests per minute, so there are no gains to be made there; the vast majority of the running time is spent waiting for requests to be allowed. I think writing iteratively is also preferable given the precarious nature of web scraping, with its edge cases and connection problems: you don't want to risk losing hours of work held in memory.
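
As a minimal sketch of the iterative-writing idea, the snippet below writes each scraped row to disk as soon as it is produced instead of collecting everything in memory first. The names `scrape_races` and `write_iteratively` are hypothetical and are not taken from rpscrape, and the scraper is a stand-in that makes no real requests.

```python
import csv
from typing import Dict, Iterable, Iterator


def scrape_races(race_ids: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Hypothetical scraper: yields one row per race (no real requests)."""
    for race_id in race_ids:
        yield {"race_id": race_id, "winner": f"horse-{race_id}"}


def write_iteratively(rows: Iterable[Dict[str, str]], path: str) -> int:
    """Write rows one at a time so a crash loses at most the current row."""
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = None
        for row in rows:
            if writer is None:
                # Take the column names from the first row and write the header.
                writer = csv.DictWriter(f, fieldnames=list(row))
                writer.writeheader()
            writer.writerow(row)
            f.flush()  # push the row out of Python's buffer straight away
            count += 1
    return count
```

If the generator raises partway through, the rows already written stay in the file, which is the trade-off discussed here: a partial file versus nothing at all.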

TomMcL commented on September 24, 2024

Yeah, I'm less sure about iterative vs one-off writing being an improvement for those reasons, though I'm also not sure whether an incomplete data set is worth much more than none at all if there's a break!

However, on async vs sync: loading 2019-2020 GB jumps seems to take ~840s sequentially vs <200s when I use asynchronous request calls, with (as far as I can tell) the same output. It's possible (likely?) I'm missing something, but there definitely appears to be an improvement.
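
To illustrate the sequential vs asynchronous comparison, here is a small sketch that is not the code from this thread. `fetch` is a hypothetical stand-in that simulates network latency with `asyncio.sleep`; a real scraper would call an async HTTP client there instead. The `max_concurrent` limit is also an assumption, added so the sketch does not open an unbounded number of requests at once.

```python
import asyncio
import time
from typing import List


async def fetch(url: str, delay: float = 0.05) -> str:
    """Hypothetical stand-in for an HTTP request; only sleeps."""
    await asyncio.sleep(delay)
    return f"response for {url}"


async def fetch_sequentially(urls: List[str]) -> List[str]:
    """Wait for each request to finish before starting the next."""
    return [await fetch(url) for url in urls]


async def fetch_all(urls: List[str], max_concurrent: int = 10) -> List[str]:
    """Run requests concurrently, capped at max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url: str) -> str:
        async with semaphore:
            return await fetch(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(url) for url in urls))


if __name__ == "__main__":
    urls = [f"https://example.invalid/race/{i}" for i in range(20)]
    for runner in (fetch_sequentially, fetch_all):
        start = time.perf_counter()
        asyncio.run(runner(urls))
        print(runner.__name__, round(time.perf_counter() - start, 2), "s")
```

Because the waiting overlaps, the concurrent version finishes in roughly the time of the slowest batch rather than the sum of all requests, provided the server does not throttle.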

joenano commented on September 24, 2024

Interesting, it's possible they have removed the limit or changed it, as it has been a few years since I checked. Do you have a fork with your changes I can test?

joenano commented on September 24, 2024

I ran a little test there for multiple minutes and got no 403 responses, which suggests the limiting has been removed and async will dramatically improve speed, as you have seen. I'll start working on it just now, or you can make a pull request if you've already done it.

TomMcL commented on September 24, 2024

Awesome, glad you could replicate. Sorry, I was about to push my version, then noticed my auto formatter had gone a little rogue with the quotation mark formatting. I can fix it up and make a PR tomorrow if you haven't done so already by then.

TomMcL commented on September 24, 2024

Awesome, and more properly async than I did too! Thanks

gbettle commented on September 24, 2024

Many thanks for the update.

I can fetch yesterday's results fine.

But should I be getting this noise as well?

$ cd rpscrape/scripts/

$ python rpscrape.py -d 2021/04/19

Exception ignored in: <function _ProactorBasePipeTransport.__del__ at 0x0000026D175A7E50>
Traceback (most recent call last):
  File "C:\Users\garry\AppData\Local\Programs\Python\Python38\lib\asyncio\proactor_events.py", line 116, in __del__
    self.close()
  File "C:\Users\garry\AppData\Local\Programs\Python\Python38\lib\asyncio\proactor_events.py", line 108, in close
    self._loop.call_soon(self._call_connection_lost, None)
  File "C:\Users\garry\AppData\Local\Programs\Python\Python38\lib\asyncio\base_events.py", line 719, in call_soon
    self._check_closed()
  File "C:\Users\garry\AppData\Local\Programs\Python\Python38\lib\asyncio\base_events.py", line 508, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed

Finished scraping. 2021_04_19.csv saved in rpscrape/data//all
Exception ignored in: <function _ProactorBasePipeTransport.__del__ at 0x0000026D175A7E50>
Traceback (most recent call last):
  File "C:\Users\garry\AppData\Local\Programs\Python\Python38\lib\asyncio\proactor_events.py", line 116, in __del__
    self.close()
  File "C:\Users\garry\AppData\Local\Programs\Python\Python38\lib\asyncio\proactor_events.py", line 108, in close
    self._loop.call_soon(self._call_connection_lost, None)
  File "C:\Users\garry\AppData\Local\Programs\Python\Python38\lib\asyncio\base_events.py", line 719, in call_soon
    self._check_closed()
  File "C:\Users\garry\AppData\Local\Programs\Python\Python38\lib\asyncio\base_events.py", line 508, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed

...

joenano commented on September 24, 2024

No, you shouldn't be getting that. It seems to be a Windows-specific issue; I will have a look.
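
The thread does not record what change resolved this, so the following is only a commonly used workaround and not necessarily what rpscrape did. On Windows with Python 3.8, the proactor event loop is the default, and switching to the selector event loop before calling `asyncio.run` is one way people avoid the "Event loop is closed" message at shutdown. The helper name `use_selector_loop_on_windows` is hypothetical. Newer Python releases deprecate the event loop policy API, so treat this as version-specific.

```python
import asyncio
import sys


def use_selector_loop_on_windows() -> bool:
    """Switch to the selector event loop on Windows; return True if switched."""
    if sys.platform.startswith("win") and hasattr(
        asyncio, "WindowsSelectorEventLoopPolicy"
    ):
        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
        return True
    # On other platforms the default loop is left untouched.
    return False


async def main() -> str:
    """Placeholder for the scraper's async entry point."""
    await asyncio.sleep(0)
    return "done"


if __name__ == "__main__":
    use_selector_loop_on_windows()
    print(asyncio.run(main()))
```

Making sure every client session or transport is closed before the loop shuts down addresses the same symptom at its source and may be preferable where it is practical.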

gbettle commented on September 24, 2024

Many thanks!

joenano commented on September 24, 2024

I have tested on Windows and it runs fine with no errors. Can you confirm?

gbettle commented on September 24, 2024

Confirmed working now. Thanks again!
