medialab / minet
A webmining CLI tool & library for python.
License: GNU General Public License v3.0
From my experience on Linux with python3.5, python3-dev seems to be required by the dragnet install process. Not 100% sure though.
dragnet does not install its pip dependencies. While waiting for them to accept the PR, adding those deps to minet's requirements would help many users.
Finally, since dragnet is a large piece to install, could it be installed only optionally? Something like: the dragnet option catches the specific dragnet module error and displays the command to install it, plus documentation?
This could be valuable if you shuffle the lines for instance to ensure you can distribute domains evenly.
prev_until, that kind of stuff.
eBay refuses HEAD queries (it returns an Allow: GET header) and, when HEAD is used, redirects to a wrong page with infinite redirection.
Example:
https://ebay.us/BUkuxU should get to https://www.ebay.com/itm/253189196428
but it goes to http://pages.ebay.com/messages/page_not_responding.html through https://pages.ebay.com/messages/page_not_responding.html and https://www.ebay.com/n/error?statuscode=500
Gotta love the web!
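One way to cope with servers that reject HEAD is to retry with GET. The sketch below is only an illustration, not minet's code: `resolve_method_fallback` and the `send` callable are hypothetical names, and `send(method, url)` is assumed to return a `(status, headers)` pair where `headers` is a plain dict. Falling back on a 405 status is also an assumption; the report above only mentions the Allow: GET header.

```python
def resolve_method_fallback(send, url):
    """Try HEAD first and retry with GET when the server rejects HEAD.

    `send` is a hypothetical callable taking (method, url) and returning
    (status, headers), with headers as a plain dict.
    """
    status, headers = send("HEAD", url)

    # Parse an "Allow: GET, POST" style header into a set of method names.
    allowed = {
        method.strip().upper()
        for method in headers.get("Allow", "").split(",")
        if method.strip()
    }

    # Fall back when HEAD is explicitly refused (405, an assumption here)
    # or when the Allow header lists methods that do not include HEAD.
    if status == 405 or (allowed and "HEAD" not in allowed):
        status, headers = send("GET", url)
        return "GET", status, headers

    return "HEAD", status, headers
```

The returned method name makes it possible to log which URLs needed the fallback.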
I noticed this:
200 : http://bit.ly/2YupNmj -> https://t.co/OqtIzx9TlI
Which is weird: t.co is also a redirection and should be followed.
Using concepts such as exponential backoff. Might be a bit fun/tricky to implement in a multithreaded fashion with limited memory consumption.
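For reference, exponential backoff can be sketched as a generator of wait times. This is a minimal, single-threaded illustration and does not address the multithreaded or memory concerns mentioned above; `backoff_delays` and all of its parameters are hypothetical names, not part of minet or quenouille.

```python
import random


def backoff_delays(base=1.0, factor=2.0, cap=60.0, retries=5, jitter=False):
    """Yield the wait time (in seconds) to apply before each retry.

    The delay grows geometrically (base * factor ** attempt) and is
    clamped to `cap`. With jitter=True, each delay is drawn uniformly
    between 0 and the computed value to avoid synchronized retries.
    """
    for attempt in range(retries):
        delay = min(cap, base * factor ** attempt)
        if jitter:
            delay = random.uniform(0, delay)
        yield delay
```

A caller would typically sleep for each yielded delay between failed attempts and give up once the generator is exhausted.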
Enumerate should wrap reader, not multithreaded iterator
(note: It might be more appropriate to move this issue to quenouille)
When trying to resolve this url http://www.outremersbeyou.com/talent-de-la-semaine-la-designer-comorienne-aisha-wadaane-je-suis-fiere-de-mes-origines/ we end up with the following surprising stacktrace.
(This is weird, as this url is indeed a redirection, but to a normally encoded (though bad) url: Location: http://www.outremers360.comtalent-de-la-semaine-la-designer-comorienne-aisha-wadaane-je-suis-fiere-de-mes-origines/)
I guess that, because of the missing slash, it considers the TLD to be "comtalent-de-la-semaine-la-designer-comorienne-aisha-wadaane-je-suis-fiere-de-mes-origines" and, since there are dashes inside, it tries to interpret it as punycode...
Traceback (most recent call last):
File "/home/boo/.pyenv/versions/3.6.9/lib/python3.6/encodings/idna.py", line 167, in encode
raise UnicodeError("label too long")
UnicodeError: label too long
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "bin/complete_links_resolving_v2.py", line 100, in <module>
resolve()
File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "bin/complete_links_resolving_v2.py", line 57, in resolve
for res in multithreaded_resolve(urls_to_clear, threads=50, throttle=0.5, max_redirects=15):
File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/quenouille/imap.py", line 353, in output
raise e.with_traceback(trace)
File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/quenouille/imap.py", line 303, in worker
result = func(data)
File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/minet/fetch.py", line 237, in worker
error, stack = resolve(http, url, max=max_redirects, **kwargs)
File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/minet/utils.py", line 187, in resolve
headers_only=True
File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/minet/utils.py", line 167, in request
redirect=redirect
File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/urllib3/request.py", line 68, in request
**urlopen_kw)
File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/urllib3/request.py", line 89, in request_encode_url
return self.urlopen(method, url, **extra_kw)
File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/urllib3/poolmanager.py", line 326, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/urllib3/connectionpool.py", line 603, in urlopen
chunked=chunked)
File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/urllib3/connectionpool.py", line 355, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/home/boo/.pyenv/versions/3.6.9/lib/python3.6/http/client.py", line 1254, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/home/boo/.pyenv/versions/3.6.9/lib/python3.6/http/client.py", line 1300, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/home/boo/.pyenv/versions/3.6.9/lib/python3.6/http/client.py", line 1249, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/home/boo/.pyenv/versions/3.6.9/lib/python3.6/http/client.py", line 1036, in _send_output
self.send(msg)
File "/home/boo/.pyenv/versions/3.6.9/lib/python3.6/http/client.py", line 974, in send
self.connect()
File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/urllib3/connection.py", line 183, in connect
conn = self._new_conn()
File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/urllib3/connection.py", line 160, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw)
File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/urllib3/util/connection.py", line 57, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "/home/boo/.pyenv/versions/3.6.9/lib/python3.6/socket.py", line 745, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
UnicodeError: encoding with 'idna' codec failed (UnicodeError: label too long)
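The failure can be reproduced without any network access. The sketch below uses only the standard library; `oversized_labels` and `is_resolvable_hostname` are hypothetical helper names. It suggests the error may come from the label length rather than from the dashes: Python's idna codec rejects any hostname label longer than 63 characters, and the last label of the malformed Location is well over that.

```python
from urllib.parse import urlsplit

# DNS labels (the dot-separated parts of a hostname) are limited to 63 bytes.
MAX_LABEL_LENGTH = 63


def oversized_labels(url):
    """Return the hostname labels of `url` that exceed the DNS length limit."""
    host = urlsplit(url).hostname or ""
    return [label for label in host.split(".") if len(label) > MAX_LABEL_LENGTH]


def is_resolvable_hostname(url):
    """Return False when the hostname cannot be encoded with the idna codec.

    This mirrors the encoding step that socket.getaddrinfo performs on the
    hostname, so it can be used to skip such urls before issuing a request.
    """
    host = urlsplit(url).hostname
    if not host:
        return False
    try:
        host.encode("idna")
    except UnicodeError:
        return False
    return True
```

Catching UnicodeError around the request itself would be an alternative to pre-validating each redirect target.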
Strip parenthesis tags and such.
If selector gathers multiple elements, should we aggregate retrieved text?
Supports Firefox.
Not passing -o output.csv results in a TypeError: expected str, bytes or os.PathLike object, not NoneType on line 142 of fetch.py -> output_file = open(namespace.output, 'w')
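A possible way to avoid this crash is to fall back to standard output when no path is given. The sketch below is an assumption about how it could be handled, not the fix actually applied in minet; `open_output` is a hypothetical helper name.

```python
import sys
from contextlib import contextmanager


@contextmanager
def open_output(path=None):
    """Yield a writable text file: stdout when `path` is None, else the file.

    stdout is deliberately left open on exit, whereas a real file is
    closed by the inner `with` block.
    """
    if path is None:
        yield sys.stdout
        return

    # newline="" is what the csv module expects for files it writes to.
    with open(path, "w", newline="") as output_file:
        yield output_file
```

It would be used as `with open_output(namespace.output) as output_file:` in place of the bare `open` call.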
I had other issues but you broke paris.demosphere.net so I can't do any further testing.
Examples: OPTIONS, HEAD etc.
Stacks are a mess on sigint
cc @paulgirard
Lol.
By url breakdown, for instance.
And test with OPTIONS and HEAD.
cc @paulgirard