Comments (18)
I got the same error.
Me too.
Thanks to everyone for reporting this. I'll find some time to look into it and figure out what is going on. To anybody else finding this issue, please feel free to leave a comment confirming it. I have a lot of other things on my plate right now, and knowing that people are actually trying to use this has a big impact on how high of a priority it is for me.
I had similar problems with strange underlying behavior:
Doesn't work:
cat list_of_domains.txt | xargs -I{} wayback-machine-scraper {}
echo -e "wayback-machine-scraper domain1.com\n wayback-machine-scraper domain2.com" > domains.sh ; bash domains.sh
Works:
Pasting ``"wayback-machine-scraper domain1.com` directly into a shell and hitting enter. This works with more than one command.
Something strange is going on here that I can't explain. It's not a big deal for me personally, since I can just copy/paste all commands at once, but it's still a bit cumbersome considering I have several hundred domains to scrape through.
Maybe some child processes are being killed improperly based on an incorrect state poll?
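As a stopgap for the scripted case, a sequential driver along these lines might work (just a sketch, assuming wayback-machine-scraper is on PATH; the file name matches the list_of_domains.txt above):
import subprocess

# Hypothetical workaround: run the scraper once per domain, one at a time,
# reading domains (one per line) from the same list_of_domains.txt as above.
with open("list_of_domains.txt") as f:
    domains = [line.strip() for line in f if line.strip()]

for domain in domains:
    subprocess.run(["wayback-machine-scraper", domain])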
I am facing the same issue. Any solution?
bump?
Is this thing working or not?
The same error here:
(pewing-dev) pm@DESKTOP-43243242 D:\workspace\czytelniamedyczna
$ wayback-machine-scraper -f 20000101 -t 20200101 urologiapolska.pl
2020-03-03 12:52:50 [scrapy.utils.log] INFO: Scrapy 2.0.0 started (bot: scrapybot)
2020-03-03 12:52:50 [scrapy.utils.log] INFO: Versions: lxml 4.4.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.7.5 (default, Oct 31 2019, 15:18:51) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.8, Platform Windows-10-10.0.17763-SP0
2020-03-03 12:52:50 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
'AUTOTHROTTLE_START_DELAY': 1,
'AUTOTHROTTLE_TARGET_CONCURRENCY': 10.0,
'LOG_LEVEL': 'INFO',
'USER_AGENT': 'Wayback Machine Scraper/1.0.7 '
'(+https://github.com/sangaline/scrapy-wayback-machine)'}
2020-03-03 12:52:50 [scrapy.extensions.telnet] INFO: Telnet Password: x
2020-03-03 12:52:50 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.throttle.AutoThrottle']
2020-03-03 12:52:51 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy_wayback_machine.WaybackMachineMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-03-03 12:52:51 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-03-03 12:52:51 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-03-03 12:52:51 [scrapy.core.engine] INFO: Spider opened
2020-03-03 12:52:51 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-03-03 12:52:51 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-03-03 12:52:51 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET http://urologiapolska.pl> (failed 3 times): Bad path: http://urologiapolska.pl
2020-03-03 12:52:51 [scrapy.core.scraper] ERROR: Error downloading <GET http://urologiapolska.pl>
Traceback (most recent call last):
File "c:\users\pm\anaconda3\envs\pewing-dev\lib\site-packages\twisted\internet\defer.py", line 1418, in inlineCallbacks
result = g.send(result)
File "c:\users\pm\anaconda3\envs\pewing-dev\lib\site-packages\scrapy\core\downloader\middleware.py", line 36, in process_request
response = yield deferred_from_coro(method(request=request, spider=spider))
File "c:\users\pm\anaconda3\envs\pewing-dev\lib\site-packages\scrapy_wayback_machine_init.py", line 64, in process_request
return self.build_cdx_request(request)
File "c:\users\pm\anaconda3\envs\pewing-dev\lib\site-packages\scrapy_wayback_machine_init_.py", line 91, in build_cdx_request
cdx_url = self.cdx_url_template.format(url=pathname2url(request.url))
File "c:\users\pm\anaconda3\envs\pewing-dev\lib\nturl2path.py", line 65, in pathname2url
raise OSError(error)
OSError: Bad path: http://urologiapolska.pl
2020-03-03 12:52:51 [scrapy.core.engine] INFO: Closing spider (finished)
2020-03-03 12:52:51 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
'downloader/exception_type_count/builtins.OSError': 3,
'elapsed_time_seconds': 0.226015,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 3, 3, 11, 52, 51, 264106),
'log_count/ERROR': 2,
'log_count/INFO': 10,
'retry/count': 2,
'retry/max_reached': 1,
'retry/reason_count/builtins.OSError': 2,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2020, 3, 3, 11, 52, 51, 38091)}
2020-03-03 12:52:51 [scrapy.core.engine] INFO: Spider closed (finished)
Same as above.
So, how do we solve this problem?
I don't know much about the technicalities here, but I'm thinking maybe this has something to do with Anaconda installs specifically? All the errors in this thread seem to involve Anaconda. I'm having this same issue, and I too am using Anaconda.
I got this working not too long after I first posted here, although it took a lot of editing. I'll see about posting my fork soon.
@weirdalsuperfan Is your fork of this repo or of scrapy-wayback-machine? I just merged a couple of pull requests there that should help resolve some issues.
> I don't know much about the technicalities here, but I'm thinking maybe this has something to do with Anaconda installs specifically? All the errors in this thread seem to involve Anaconda. I'm having this same issue, and I too am using Anaconda.

I was using a standard python.org Python 3.5.4 64-bit distro.
Same problem, hope to see it solved.
It might be platform related. The problem is on line 93 of __init__.py in the middleware repo:
cdx_url = self.cdx_url_template.format(url=pathname2url(request.url))
The method pathname2url() is described in the docs as:
Convert the pathname path from the local syntax for a path to the form used in the path component of a URL. This does not produce a complete URL. The return value will already be quoted using the quote() function.
I guess it's supposed to replace Windows's backslashes "\". However I wasn't really able to create a string with backslashes, because the encoding at some point messes with it.
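For illustration, here's a minimal snippet showing the platform difference (a sketch assuming stock CPython; the Windows branch produces exactly the "Bad path" error from the traceback above):
from urllib.request import pathname2url

url = "http://urologiapolska.pl"
try:
    # POSIX: simply percent-quotes the string -> 'http%3A//urologiapolska.pl'
    print(pathname2url(url))
except OSError as exc:
    # Windows (nturl2path): raises OSError('Bad path: http://urologiapolska.pl')
    print("pathname2url failed:", exc)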
The fastest solution on Windows 10 was to simply remove the pathname2url() call:
cdx_url = self.cdx_url_template.format(url=request.url)
If I set up a Linux OS, will this work okay? Any recommendations on flavor? I'm just looking for something I won't have problems running this with, since that's the reason I'd be setting it up. I was going to try editing the __init__.py file, but on Windows 10 it is 0 KB.
This repository is a command line utility/interface (CLI) that uses this middleware.
I have used the middleware itself along with Scrapy, not the CLI in this repo, so the __init__.py I was talking about is the middleware's; my post wasn't clear on that at all.
I assume this CLI also installs the scrapy-wayback-machine package, so looking at the pip packages folder you should be able to find the mentioned file.
I had no further problems with the middleware + Scrapy, Python 3 on Win10, very simple and clean.
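If it helps, a quick way to locate that file in whatever environment the CLI was installed into (assuming the package is importable there):
# Prints the path of the installed scrapy_wayback_machine/__init__.py
import scrapy_wayback_machine
print(scrapy_wayback_machine.__file__)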
To answer @dayindave's question: I believe that all of these issues stem from differences on Windows, an operating system that I don't have easy access to for testing. I think the correlation with Anaconda is incidental; people are just more likely to use Anaconda on Windows. Things will generally work better on Linux, and I'll be more responsive to issues there by virtue of being able to test things.
The pathname2url() call was failing on Windows because http(s):// is invalid in a Windows path. This should have been resolved by scrapy-wayback-machine/pulls#2, which removes the invalid prefix. I'm going to close this issue because I think it's fixed now. If you're still getting an error from the pathname2url() call on Windows when using wayback-machine-scraper v1.0.8 and scrapy-wayback-machine v1.0.2, please leave a comment here and I'll reopen. If you're experiencing any other issue, please open a new issue to start a fresh discussion.
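For reference, the general shape of that fix is to strip the scheme before calling pathname2url(). Here's a rough sketch of the idea (not the PR's actual code; the function name is just illustrative):
import re
from urllib.request import pathname2url

# Sketch of the prefix-stripping approach: dropping the scheme means the
# Windows implementation no longer sees "http:" as a malformed drive specifier.
def build_cdx_path(url):
    stripped = re.sub(r'^https?://', '', url)
    return pathname2url(stripped)

# build_cdx_path('http://urologiapolska.pl') -> 'urologiapolska.pl'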