Comments (18)

nise commented on June 6, 2024

I got the same error.

gmonkman commented on June 6, 2024

Me too.

sangaline commented on June 6, 2024

Thanks to everyone for reporting this. I'll find some time to look into it and figure out what is going on. To anybody else finding this issue, please feel free to leave a comment confirming it. I have a lot of other things on my plate right now, and knowing that people are actually trying to use this has a big impact on how high of a priority it is for me.

red-bin commented on June 6, 2024

I had similar problems with strange underlying behavior:

Doesn't work:

```sh
cat list_of_domains.txt | xargs -I{} wayback-machine-scraper {}
echo -e "wayback-machine-scraper domain1.com\n wayback-machine-scraper domain2.com" > domains.sh ; bash domains.sh
```

Works:
Pasting `wayback-machine-scraper domain1.com` directly into a shell and hitting enter. This works with more than one command.

Something strange is going on here that I can't explain. It's not a big deal for me personally, since I can just copy/paste all commands at once, but it's still a bit cumbersome considering I have several hundred domains to scrape through.

Maybe some child processes are being killed improperly based on an incorrect state poll?
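
For batch runs over many domains, one possible workaround is to drive the CLI from Python rather than through xargs. This is a minimal, untested sketch: it assumes `wayback-machine-scraper` is on the PATH and that a hypothetical `domains.txt` lists one domain per line.

```python
import subprocess

# Read one domain per line, skipping blank lines.
with open("domains.txt") as f:
    domains = [line.strip() for line in f if line.strip()]

for domain in domains:
    # Launch each scrape as its own foreground process,
    # much like typing the command into a shell.
    result = subprocess.run(["wayback-machine-scraper", domain])
    if result.returncode != 0:
        print(f"Scrape failed for {domain} (exit code {result.returncode})")
```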

hussnainsheikh commented on June 6, 2024

I am facing the same issue. Any solution?

danielvarab commented on June 6, 2024

bump?

SpongebobSquamirez commented on June 6, 2024

Is this thing working or not?

pawel1981 commented on June 6, 2024

The same error here:

```
(pewing-dev) pm@DESKTOP-43243242 D:\workspace\czytelniamedyczna
$ wayback-machine-scraper -f 20000101 -t 20200101 urologiapolska.pl
2020-03-03 12:52:50 [scrapy.utils.log] INFO: Scrapy 2.0.0 started (bot: scrapybot)
2020-03-03 12:52:50 [scrapy.utils.log] INFO: Versions: lxml 4.4.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.7.5 (default, Oct 31 2019, 15:18:51) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.8, Platform Windows-10-10.0.17763-SP0
2020-03-03 12:52:50 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
 'AUTOTHROTTLE_START_DELAY': 1,
 'AUTOTHROTTLE_TARGET_CONCURRENCY': 10.0,
 'LOG_LEVEL': 'INFO',
 'USER_AGENT': 'Wayback Machine Scraper/1.0.7 '
               '(+https://github.com/sangaline/scrapy-wayback-machine)'}
2020-03-03 12:52:50 [scrapy.extensions.telnet] INFO: Telnet Password: x
2020-03-03 12:52:50 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.throttle.AutoThrottle']
2020-03-03 12:52:51 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy_wayback_machine.WaybackMachineMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-03-03 12:52:51 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-03-03 12:52:51 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-03-03 12:52:51 [scrapy.core.engine] INFO: Spider opened
2020-03-03 12:52:51 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-03-03 12:52:51 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-03-03 12:52:51 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET http://urologiapolska.pl> (failed 3 times): Bad path: http://urologiapolska.pl
2020-03-03 12:52:51 [scrapy.core.scraper] ERROR: Error downloading <GET http://urologiapolska.pl>
Traceback (most recent call last):
  File "c:\users\pm\anaconda3\envs\pewing-dev\lib\site-packages\twisted\internet\defer.py", line 1418, in inlineCallbacks
    result = g.send(result)
  File "c:\users\pm\anaconda3\envs\pewing-dev\lib\site-packages\scrapy\core\downloader\middleware.py", line 36, in process_request
    response = yield deferred_from_coro(method(request=request, spider=spider))
  File "c:\users\pm\anaconda3\envs\pewing-dev\lib\site-packages\scrapy_wayback_machine\__init__.py", line 64, in process_request
    return self.build_cdx_request(request)
  File "c:\users\pm\anaconda3\envs\pewing-dev\lib\site-packages\scrapy_wayback_machine\__init__.py", line 91, in build_cdx_request
    cdx_url = self.cdx_url_template.format(url=pathname2url(request.url))
  File "c:\users\pm\anaconda3\envs\pewing-dev\lib\nturl2path.py", line 65, in pathname2url
    raise OSError(error)
OSError: Bad path: http://urologiapolska.pl
2020-03-03 12:52:51 [scrapy.core.engine] INFO: Closing spider (finished)
2020-03-03 12:52:51 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
 'downloader/exception_type_count/builtins.OSError': 3,
 'elapsed_time_seconds': 0.226015,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 3, 3, 11, 52, 51, 264106),
 'log_count/ERROR': 2,
 'log_count/INFO': 10,
 'retry/count': 2,
 'retry/max_reached': 1,
 'retry/reason_count/builtins.OSError': 2,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2020, 3, 3, 11, 52, 51, 38091)}
2020-03-03 12:52:51 [scrapy.core.engine] INFO: Spider closed (finished)
```

zhangml17 commented on June 6, 2024

Same as above. How can this problem be solved?

JomSpoons commented on June 6, 2024

I don't know much about the technicalities here, but I'm thinking maybe this has something to do with Anaconda installs specifically? All the errors in this thread seem to involve Anaconda. I'm having the same issue, and I too am using Anaconda.

SpongebobSquamirez commented on June 6, 2024

I got this working not too long after I first posted here, although it took a lot of editing. I'll see about posting my fork soon.

sangaline commented on June 6, 2024

@weirdalsuperfan Is your fork of this repo or of scrapy-wayback-machine? I just merged a couple of pull requests there that should help resolve some issues.

gmonkman commented on June 6, 2024

> I don't know much about the technicalities here, but I'm thinking maybe this has something to do with Anaconda installs specifically? All the errors in this thread seem to involve Anaconda. I'm having the same issue, and I too am using Anaconda.

I was using a standard python.org Python 3.5.4 64-bit distro.

alwaysbyx commented on June 6, 2024

Same problem, hoping to see it solved.

hugoaboud commented on June 6, 2024

It might be platform-related. The problem is on line 93 of `__init__.py` in the middleware repo:

```python
cdx_url = self.cdx_url_template.format(url=pathname2url(request.url))
```

The method `pathname2url()` is described in the docs as:

> Convert the pathname path from the local syntax for a path to the form used in the path component of a URL. This does not produce a complete URL. The return value will already be quoted using the quote() function.

I guess it's supposed to replace Windows backslashes ("\"). However, I wasn't really able to create a test string with backslashes, because the encoding messes with it at some point.
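
As a minimal illustration of the failure (assuming a Windows interpreter, where `urllib.request.pathname2url` resolves to `nturl2path.pathname2url`), passing a full URL straight to `pathname2url()` reproduces the error from the tracebacks above:

```python
from urllib.request import pathname2url  # nturl2path.pathname2url on Windows

# On Windows, pathname2url() expects a filesystem path like "C:\foo\bar".
# A URL such as "http://urologiapolska.pl" splits on ":" into something
# other than a drive letter plus a short remainder, so it raises OSError.
try:
    print(pathname2url("http://urologiapolska.pl"))
except OSError as e:
    print(e)  # Bad path: http://urologiapolska.pl
```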

The fastest workaround on Windows 10 was to simply drop the `pathname2url()` call:

```python
cdx_url = self.cdx_url_template.format(url=request.url)
```

dayindave commented on June 6, 2024

If I set up a Linux OS, will this work okay? Any recommendations on flavor? I'm just looking for something that I won't have problems running this with, since that's the reason I'd be setting it up. I was going to try editing the `__init__.py` file, but on Win 10 it is 0 KB.

hugoaboud commented on June 6, 2024

This repository is a command-line utility (CLI) that uses this middleware.

I have used the middleware itself along with Scrapy, not the CLI in this repo. So here's the `__init__.py` I was talking about; my earlier post wasn't clear on that at all.

I assume this CLI also installs the scrapy-wayback-machine package, so if you look in the pip packages folder you should be able to find the file mentioned above.

I had no further problems with the middleware + Scrapy (Python 3 on Win10); very simple and clean.
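
For anyone wiring the middleware into their own Scrapy project, here is a minimal sketch of the settings. The class path is taken from the enabled-middlewares log earlier in this thread; the priority value and the `WAYBACK_MACHINE_TIME_RANGE` setting name are assumptions on my part, so verify them against the scrapy-wayback-machine README.

```python
# settings.py of a Scrapy project using the middleware directly.

DOWNLOADER_MIDDLEWARES = {
    # Class path as reported in the "Enabled downloader middlewares" log above.
    # The priority (5) is an assumption; check the middleware's README.
    'scrapy_wayback_machine.WaybackMachineMiddleware': 5,
}

# Assumed setting name for the snapshot time range (YYYYMMDD timestamps);
# verify against the scrapy-wayback-machine documentation.
WAYBACK_MACHINE_TIME_RANGE = (20000101, 20200101)
```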

sangaline commented on June 6, 2024

To answer @dayindave's question: I believe that all of these issues stem from differences on Windows, an operating system that I don't have easy access to for testing. I think the correlation with Anaconda is incidental; people are simply more likely to use Anaconda on Windows. Things will generally work better on Linux, and I'll be more responsive to issues there by virtue of being able to test them.

The pathname2url() call was failing on Windows because http(s):// is invalid in a Windows path. This should have been resolved by scrapy-wayback-machine#2, which removes the invalid prefix. I'm going to close this issue because I believe it's fixed. If you're still getting an error from the pathname2url() call on Windows when using wayback-machine-scraper v1.0.8 and scrapy-wayback-machine v1.0.2, please leave a comment here and I'll reopen. If you're experiencing any issue other than that, please open a new issue to start a fresh discussion.
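
The fix amounts to stripping the scheme before handing the URL to pathname2url(). Here is a minimal sketch of that approach; the helper name is hypothetical, and the actual merged change lives in scrapy-wayback-machine#2:

```python
import re
from urllib.request import pathname2url

def cdx_safe_url(url: str) -> str:
    """Hypothetical helper sketching the merged fix: remove the
    http(s):// prefix that breaks pathname2url() on Windows, then
    convert/quote the remainder."""
    bare = re.sub(r'^https?://', '', url)
    return pathname2url(bare)

# Example: cdx_safe_url('http://urologiapolska.pl') -> 'urologiapolska.pl'
```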
