Comments (5)
Thank you for using this library. And bigger THANKS for reporting this issue! I will check it and try to fix it soon.
from urlextract.
I've got the same issue with the following error message:
ValueError                                Traceback (most recent call last)
<ipython-input-158-447ea8a54a6f> in <module>()
      1 text = '[img: http://www.newsisfree.com/images/fark/smh.com.au.gif ([smh.com.au])]'
----> 2 urls = list(set(url_extractor.find_urls(text)))
      3 urls

~/.local/lib/python3.6/site-packages/urlextract/urlextract_core.py in find_urls(self, text, only_unique)
    754         urls = self.gen_urls(text)
    755         urls = set(urls) if only_unique else urls
--> 756         return list(urls)
    757
    758     def has_urls(self, text):

~/.local/lib/python3.6/site-packages/urlextract/urlextract_core.py in gen_urls(self, text)
    737             validated = self._validate_tld_match(text, tld, offset + tld_pos)
    738             if tld_pos != -1 and validated:
--> 739                 tmp_url = self._complete_url(text, offset + tld_pos, tld)
    740                 if tmp_url:
    741                     yield tmp_url

~/.local/lib/python3.6/site-packages/urlextract/urlextract_core.py in _complete_url(self, text, tld_pos, tld)
    558         complete_url = self._remove_enclosure_from_url(
    559             complete_url, tld_pos-start_pos, tld)
--> 560         if not self._is_domain_valid(complete_url, tld):
    561             return ""
    562

~/.local/lib/python3.6/site-packages/urlextract/urlextract_core.py in _is_domain_valid(self, url, tld)
    630         # <scheme>://<authority>/<path>?<query>#<fragment>
    631
--> 632         host = url_parts.gethost()
    633         if not host:
    634             return False

~/.local/lib/python3.6/site-packages/uritools/split.py in gethost(self, default, errors)
    155             return _ip_literal(host[1:-1])
    156         elif host.startswith(self.LBRACKET) or host.endswith(self.RBRACKET):
--> 157             raise ValueError('Invalid host %r' % host)
    158         # TODO: faster check for IPv4 address?
    159         return _ipv4_address(host) or uridecode(host, 'utf-8', errors).lower()

ValueError: Invalid host 'smh.com.au]'
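Code:
from urlextract import URLExtract

url_extractor = URLExtract()  # setup not shown in the original report; assumed default constructor
text = '[img: http://www.newsisfree.com/images/fark/smh.com.au.gif ([smh.com.au])]'
urls = list(set(url_extractor.find_urls(text)))
urls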
I'm using Python 3.6.5 and IPython 6.4.0
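For readers following the traceback: the exception comes from the bracket check visible at lines 155-157 of uritools' split.py, which rejects a host that has a square bracket on only one side. Below is a minimal standalone sketch of that check and of one possible way a caller could treat such a host as "not a valid domain" instead of crashing. The function names check_host and is_domain_valid_sketch are hypothetical, and this is not the fix that was actually applied in urlextract.

```python
def check_host(host: str) -> str:
    """Hypothetical stand-in for the bracket check shown in the traceback."""
    if host.startswith("[") and host.endswith("]"):
        # Fully bracketed hosts are treated as IP literals in the real code;
        # here we only strip the brackets for illustration.
        return host[1:-1]
    if host.startswith("[") or host.endswith("]"):
        # A bracket on one side only is what triggered the reported error.
        raise ValueError("Invalid host %r" % host)
    return host.lower()


def is_domain_valid_sketch(host: str) -> bool:
    """Assumed defensive wrapper: report an invalid host instead of raising."""
    try:
        return bool(check_host(host))
    except ValueError:
        return False


print(is_domain_valid_sketch("smh.com.au]"))  # one-sided bracket, rejected
print(is_domain_valid_sketch("smh.com.au"))   # plain host, accepted
```

Catching the ValueError at the validation step means a single malformed candidate is skipped while the remaining URLs in the text are still returned.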
@dominikstraessle Hi, could you tell me which version of urlextract you are using? I cannot reproduce your error with the provided text and the current version (0.9) of urlextract.
OK, I managed to reproduce the error. Thanks.
The issue should be fixed. I've added both texts to the test files.
@thoppe Right now urlextract does not return the text "et.al.[10]" as a URL. If you are parsing something specific, you can use your own settings that fit your needs, for example by setting the stop characters with set_stop_chars_right and set_stop_chars_left.
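To illustrate the idea behind left and right stop characters, here is a toy sketch that grows a candidate span outward from a TLD position until it reaches a stop character on each side. The helper expand_around and the character sets are hypothetical and do not reflect how urlextract is implemented; the exact arguments accepted by set_stop_chars_left and set_stop_chars_right should be checked against the library's documentation.

```python
def expand_around(text, pos, stop_left, stop_right):
    """Toy helper: expand from pos until a stop character is met on each side."""
    start = pos
    while start > 0 and text[start - 1] not in stop_left:
        start -= 1
    end = pos
    while end < len(text) and text[end] not in stop_right:
        end += 1
    return text[start:end]


text = "see (example.com) or et.al.[10]"
pos = text.index(".com")  # position of the TLD inside the text

# Whitespace only: the enclosing parentheses stay attached to the candidate.
whitespace = set(" \t\n")
print(expand_around(text, pos, whitespace, whitespace))  # (example.com)

# Brackets added as stop characters: the candidate ends at the enclosure.
custom = whitespace | set("()[]")
print(expand_around(text, pos, custom, custom))  # example.com
```

Adding characters to the stop sets narrows what counts as part of a URL, which is why custom settings can help when the input has a known, specific shape.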