Comments (12)
Have implemented this, seems to be working as expected so far.
from rpscrape.
The site limits requests per minute, so there are no gains to be made there; the vast majority of the running time is spent waiting for requests to be allowed. I think writing iteratively is also preferable given the precarious nature of web scraping, with its edge cases and connection issues: you don't want to risk losing hours of work held in memory.
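To illustrate the point about iterative writing, here is a minimal sketch and not rpscrape's actual code. The `scrape_rows` generator, the column names, and the simulated failure are all hypothetical; the only claim is that rows flushed to disk before a failure are still there afterwards.

```python
import csv
import os
import tempfile


def scrape_rows(n):
    """Hypothetical stand-in for a scraper: yields rows, then fails part-way."""
    for i in range(n):
        if i == 3:
            raise ConnectionError("simulated dropped connection")
        yield {"race_id": i, "winner": f"horse_{i}"}


def write_incrementally(path, rows):
    """Write each row as soon as it is produced; return the number written."""
    written = 0
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["race_id", "winner"])
        writer.writeheader()
        try:
            for row in rows:
                writer.writerow(row)
                f.flush()  # hand the row to the OS so a later crash does not lose it
                written += 1
        except ConnectionError:
            pass  # rows written so far remain on disk
    return written


out_path = os.path.join(tempfile.mkdtemp(), "partial.csv")
rows_written = write_incrementally(out_path, scrape_rows(10))

# Read the file back to confirm the partial results survived the failure.
with open(out_path, newline="") as f:
    saved = list(csv.DictReader(f))
print(rows_written, len(saved))
```

Whether a partial file is useful depends on the use case, but with this pattern a failed run can at least be inspected or resumed rather than started from scratch.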
Yeah, I'm less sure about iterative vs one-off writing being an improvement for those reasons. That said, I'm also not sure an incomplete data set is worth much more than none at all if there's a break!
However, on async vs sync: loading 2019-2020 GB jumps seems to take ~840s sequentially vs <200s when I use asynchronous request calls, with (as far as I can tell) the same output. It's possible (likely?) I'm missing something, but there definitely appears to be an improvement.
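The kind of speed-up described above can be sketched without touching the network. This is an illustration only, not the code from the thread: `fake_fetch`, the URLs, the 0.05 s delay, and the concurrency limit of 5 are all assumptions, and `asyncio.sleep` stands in for a real HTTP request.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.05):
    """Hypothetical stand-in for an HTTP request; sleeps instead of using the network."""
    await asyncio.sleep(delay)
    return f"body of {url}"


async def fetch_sequential(urls):
    """Await each request before starting the next one."""
    return [await fake_fetch(u) for u in urls]


async def fetch_concurrent(urls, limit=5):
    """Run requests concurrently, with at most `limit` in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def bounded(u):
        async with sem:
            return await fake_fetch(u)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(u) for u in urls))


def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


urls = [f"https://example.invalid/race/{i}" for i in range(10)]
seq_result, seq_time = timed(fetch_sequential(urls))
con_result, con_time = timed(fetch_concurrent(urls))
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

The semaphore is there because a site that does rate-limit would likely reject an unbounded burst; the limit would need tuning against the real server.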
Interesting. It's possible they have removed or changed the limit, as it has been a few years since I checked. Do you have a fork with your changes that I can test?
Ran a little test there for several minutes and got no 403 responses, which suggests the limiting has been removed and async will dramatically improve speed, as you have seen. I'll start working on it just now, or you can make a pull request if you've already done it.
Awesome, glad you could replicate. Sorry, I was about to push my version, then noticed my auto formatter had gone a little rogue with the quotation mark formatting. I can fix it up and make a PR tomorrow if you haven't already done so by then.
Awesome, and more properly async than I did too! Thanks
Many thanks for the update.
I can fetch yesterday's results fine.
But should I be getting this noise as well?
$ cd rpscrape/scripts/
$ python rpscrape.py -d 2021/04/19
Exception ignored in: <function _ProactorBasePipeTransport.__del__ at 0x0000026D175A7E50>
Traceback (most recent call last):
  File "C:\Users\garry\AppData\Local\Programs\Python\Python38\lib\asyncio\proactor_events.py", line 116, in __del__
    self.close()
  File "C:\Users\garry\AppData\Local\Programs\Python\Python38\lib\asyncio\proactor_events.py", line 108, in close
    self._loop.call_soon(self._call_connection_lost, None)
  File "C:\Users\garry\AppData\Local\Programs\Python\Python38\lib\asyncio\base_events.py", line 719, in call_soon
    self._check_closed()
  File "C:\Users\garry\AppData\Local\Programs\Python\Python38\lib\asyncio\base_events.py", line 508, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
Finished scraping. 2021_04_19.csv saved in rpscrape/data//all
Exception ignored in: <function _ProactorBasePipeTransport.__del__ at 0x0000026D175A7E50>
Traceback (most recent call last):
  File "C:\Users\garry\AppData\Local\Programs\Python\Python38\lib\asyncio\proactor_events.py", line 116, in __del__
    self.close()
  File "C:\Users\garry\AppData\Local\Programs\Python\Python38\lib\asyncio\proactor_events.py", line 108, in close
    self._loop.call_soon(self._call_connection_lost, None)
  File "C:\Users\garry\AppData\Local\Programs\Python\Python38\lib\asyncio\base_events.py", line 719, in call_soon
    self._check_closed()
  File "C:\Users\garry\AppData\Local\Programs\Python\Python38\lib\asyncio\base_events.py", line 508, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
...
No, you shouldn't be getting that. It seems to be a Windows-specific issue; I will have a look.
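For readers who hit the same traceback: a commonly used workaround on Python 3.8-era Windows setups is to switch asyncio from the default proactor event loop to the selector event loop before running any coroutines. The thread does not say what was actually changed in rpscrape, so treat this as a sketch of one possible fix; the helper name `use_selector_loop_on_windows` is made up for illustration.

```python
import asyncio
import sys


def use_selector_loop_on_windows() -> bool:
    """Switch to the selector event loop on Windows; do nothing elsewhere.

    Returns True if the policy was changed. Hypothetical helper, not part of rpscrape.
    """
    if sys.platform.startswith("win"):
        # WindowsSelectorEventLoopPolicy only exists on Windows builds of Python.
        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
        return True
    return False


async def main():
    await asyncio.sleep(0)  # placeholder for the real async scraping work
    return "done"


policy_changed = use_selector_loop_on_windows()
result = asyncio.run(main())
print(policy_changed, result)
```

The call has to happen before `asyncio.run`, since the policy decides which kind of loop gets created.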
Many thanks!
I have tested on Windows and it runs fine with no errors. Can you confirm?
Confirmed working now. Thanks again!