Comments (18)
I got the same error.
Me too.
Thanks to everyone for reporting this. I'll find some time to look into it and figure out what is going on. To anybody else finding this issue, please feel free to leave a comment confirming it. I have a lot of other things on my plate right now, and knowing that people are actually trying to use this has a big impact on how high of a priority it is for me.
I had similar problems with strange underlying behavior:
Doesn't work:
cat list_of_domains.txt | xargs -I{} wayback-machine-scraper {}
echo -e "wayback-machine-scraper domain1.com\n wayback-machine-scraper domain2.com" > domains.sh ; bash domains.sh
Works:
Pasting ``"wayback-machine-scraper domain1.com` directly into a shell and hitting enter. This works with more than one command.
Something strange is going on here that I can't explain. It's not a big deal for me personally, since I can just copy/paste all commands at once, but it's still a bit cumbersome considering I have several hundred domains to scrape through.
Maybe some child processes are being killed improperly based on an incorrect state poll?
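As a stopgap for the scripted case, a sequential driver along these lines might work (just a sketch, assuming wayback-machine-scraper is on PATH; the file name matches the list_of_domains.txt above):
import subprocess

# Hypothetical workaround: run the scraper once per domain, one at a time,
# reading domains (one per line) from the same list_of_domains.txt as above.
with open("list_of_domains.txt") as f:
    domains = [line.strip() for line in f if line.strip()]

for domain in domains:
    subprocess.run(["wayback-machine-scraper", domain])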
I am facing the same issue. Any solution?
bump?
Is this thing working or not?
The same error here:
(pewing-dev) pm@DESKTOP-43243242 D:\workspace\czytelniamedyczna
$ wayback-machine-scraper -f 20000101 -t 20200101 urologiapolska.pl
2020-03-03 12:52:50 [scrapy.utils.log] INFO: Scrapy 2.0.0 started (bot: scrapybot)
2020-03-03 12:52:50 [scrapy.utils.log] INFO: Versions: lxml 4.4.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.7.5 (default, Oct 31 2019, 15:18:51) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.8, Platform Windows-10-10.0.17763-SP0
2020-03-03 12:52:50 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
'AUTOTHROTTLE_START_DELAY': 1,
'AUTOTHROTTLE_TARGET_CONCURRENCY': 10.0,
'LOG_LEVEL': 'INFO',
'USER_AGENT': 'Wayback Machine Scraper/1.0.7 '
'(+https://github.com/sangaline/scrapy-wayback-machine)'}
2020-03-03 12:52:50 [scrapy.extensions.telnet] INFO: Telnet Password: x
2020-03-03 12:52:50 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.throttle.AutoThrottle']
2020-03-03 12:52:51 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy_wayback_machine.WaybackMachineMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-03-03 12:52:51 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-03-03 12:52:51 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-03-03 12:52:51 [scrapy.core.engine] INFO: Spider opened
2020-03-03 12:52:51 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-03-03 12:52:51 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-03-03 12:52:51 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET http://urologiapolska.pl> (failed 3 times): Bad path: http://urologiapolska.pl
2020-03-03 12:52:51 [scrapy.core.scraper] ERROR: Error downloading <GET http://urologiapolska.pl>
Traceback (most recent call last):
File "c:\users\pm\anaconda3\envs\pewing-dev\lib\site-packages\twisted\internet\defer.py", line 1418, in inlineCallbacks
result = g.send(result)
File "c:\users\pm\anaconda3\envs\pewing-dev\lib\site-packages\scrapy\core\downloader\middleware.py", line 36, in process_request
response = yield deferred_from_coro(method(request=request, spider=spider))
File "c:\users\pm\anaconda3\envs\pewing-dev\lib\site-packages\scrapy_wayback_machine_init.py", line 64, in process_request
return self.build_cdx_request(request)
File "c:\users\pm\anaconda3\envs\pewing-dev\lib\site-packages\scrapy_wayback_machine_init_.py", line 91, in build_cdx_request
cdx_url = self.cdx_url_template.format(url=pathname2url(request.url))
File "c:\users\pm\anaconda3\envs\pewing-dev\lib\nturl2path.py", line 65, in pathname2url
raise OSError(error)
OSError: Bad path: http://urologiapolska.pl
2020-03-03 12:52:51 [scrapy.core.engine] INFO: Closing spider (finished)
2020-03-03 12:52:51 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
'downloader/exception_type_count/builtins.OSError': 3,
'elapsed_time_seconds': 0.226015,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 3, 3, 11, 52, 51, 264106),
'log_count/ERROR': 2,
'log_count/INFO': 10,
'retry/count': 2,
'retry/max_reached': 1,
'retry/reason_count/builtins.OSError': 2,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2020, 3, 3, 11, 52, 51, 38091)}
2020-03-03 12:52:51 [scrapy.core.engine] INFO: Spider closed (finished)
Same as above.
So, how do we solve this problem?
I don't know much about the technicalities here, but I'm thinking maybe this has something to do with Anaconda installs specifically? All the errors in this thread seem to involve Anaconda. I'm having this same issue, and I too am using Anaconda.
I got this working not too long after I first posted here, although it took a lot of editing. I'll see about posting my fork soon.
@weirdalsuperfan Is your fork of this repo or of scrapy-wayback-machine? I just merged a couple of pull requests there that should help resolve some issues.
> I don't know much about the technicalities here, but I'm thinking maybe this has something to do with Anaconda installs specifically? All the errors in this thread seem to involve Anaconda. I'm having this same issue, and I too am using Anaconda.

I was using a standard python.org Python 3.5.4 64-bit distro.
Same problem, hope to see it solved.
It might be platform related. The problem is on line 93 of __init__.py in the middleware repo:
cdx_url = self.cdx_url_template.format(url=pathname2url(request.url))
The method pathname2url() is described in the docs as:
Convert the pathname path from the local syntax for a path to the form used in the path component of a URL. This does not produce a complete URL. The return value will already be quoted using the quote() function.
I guess it's supposed to replace Windows's backslashes "\". However I wasn't really able to create a string with backslashes, because the encoding at some point messes with it.
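For illustration, here's a minimal snippet showing the platform difference (a sketch assuming stock CPython; the Windows branch produces exactly the "Bad path" error from the traceback above):
from urllib.request import pathname2url

url = "http://urologiapolska.pl"
try:
    # POSIX: simply percent-quotes the string -> 'http%3A//urologiapolska.pl'
    print(pathname2url(url))
except OSError as exc:
    # Windows (nturl2path): raises OSError('Bad path: http://urologiapolska.pl')
    print("pathname2url failed:", exc)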
The fastest solution on Windows 10 was to simply remove the pathname2url() call:
cdx_url = self.cdx_url_template.format(url=request.url)
If I set up a Linux OS, will this work okay? Any recommendations on flavor? I'm just looking for something I won't have problems running this with, since that's the reason I'd be setting it up. I was going to try editing the __init__.py file, but on Windows 10 it is 0 KB.
This repository is a command line utility/interface (CLI) that uses this middleware.
I have used the middleware itself along with Scrapy, not the CLI in this repo, so the __init__.py I was talking about is the middleware's; my post wasn't clear on that at all.
I assume this CLI also installs the scrapy-wayback-machine package, so looking at the pip packages folder you should be able to find the mentioned file.
I had no further problems with the middleware + Scrapy, Python 3 on Win10, very simple and clean.
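If it helps, a quick way to locate that file in whatever environment the CLI was installed into (assuming the package is importable there):
# Prints the path of the installed scrapy_wayback_machine/__init__.py
import scrapy_wayback_machine
print(scrapy_wayback_machine.__file__)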
To answer @dayindave's question: I believe that all of these issues stem from differences on Windows, an operating system that I don't have easy access to for testing. I think the correlation with Anaconda is incidental; people are just more likely to use Anaconda on Windows. Things will generally work better on Linux, and I'll be more responsive to issues there by virtue of being able to test things.
The pathname2url() call was failing on Windows because http(s):// is invalid in a Windows path. This should have been resolved by scrapy-wayback-machine/pulls#2, which removes the invalid prefix. I'm going to close this issue because I think it's fixed now. If you're still getting an error from the pathname2url() call on Windows when using wayback-machine-scraper v1.0.8 and scrapy-wayback-machine v1.0.2, please leave a comment here and I'll reopen. If you're experiencing any other issue, please open a new issue to start a fresh discussion.
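For reference, the general shape of that fix is to strip the scheme before calling pathname2url(). Here's a rough sketch of the idea (not the PR's actual code; the function name is just illustrative):
import re
from urllib.request import pathname2url

# Sketch of the prefix-stripping approach: dropping the scheme means the
# Windows implementation no longer sees "http:" as a malformed drive specifier.
def build_cdx_path(url):
    stripped = re.sub(r'^https?://', '', url)
    return pathname2url(stripped)

# build_cdx_path('http://urologiapolska.pl') -> 'urologiapolska.pl'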