Comments (8)
@impredicative Thanks! The forgotten print is removed in 0.12.1.
from urlextract.
OK, closing issue. @Larrax can reopen it if some related bug is found.
Thanks @impredicative for testing! I should not do late night releases ...
Thank you for reporting it. I have to debug it; right now I cannot tell what is causing this issue.
@lipoja Hi, here is a simpler case of a missing URL:
>>> import urlextract
>>> urlextract.__version__
'0.11'
>>> url_extractor = urlextract.URLExtract()
>>> url_extractor.find_urls('https://google.com https://bing.com')
['https://google.com', 'https://bing.com']
>>> text = 'http://medicalxpress.com/news/2017-09-margarine-butter.html https://medicalxpress.com/news/2013-12-healthier-butter-ormargarine.html'
>>> url_extractor.find_urls(text)
['https://medicalxpress.com/news/2017-09-margarine-butter.html']
In the last command above, two URLs were expected, but only one was returned. To get all the URLs, I have to use a workaround such as the one below:
>>> words = [word for word in text.split() if not word.isalnum()]
>>> [url for s in words for url in url_extractor.find_urls(s)]
['https://medicalxpress.com/news/2017-09-margarine-butter.html', 'https://medicalxpress.com/news/2013-12-healthier-butter-ormargarine.html']
Please investigate. Thanks.
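For anyone stuck on an affected version, the workaround above can be wrapped in a small helper. This is only a sketch: the name `find_urls_per_token` and its `find_urls` parameter are hypothetical and not part of urlextract. It takes any callable that maps a string to an iterable of URLs, so the bound method `url_extractor.find_urls` from the session above could be passed in.

```python
from typing import Callable, Iterable, List


def find_urls_per_token(
    text: str, find_urls: Callable[[str], Iterable[str]]
) -> List[str]:
    """Run `find_urls` on each whitespace-separated token and merge the results.

    Hypothetical helper, not part of urlextract. `find_urls` is assumed to be
    a callable such as `urlextract.URLExtract().find_urls`.
    """
    urls: List[str] = []
    for token in text.split():
        # A purely alphanumeric token contains no ':', '/' or '.',
        # so it is skipped, exactly as in the workaround above.
        if token.isalnum():
            continue
        urls.extend(find_urls(token))
    return urls
```

With the objects from the session above, the call would be `find_urls_per_token(text, url_extractor.find_urls)`. Because each token is searched separately, a URL that contains whitespace would be split, so this only fits input where URLs are whitespace-delimited.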
This issue should be fixed as part of the 0.12.0 release.
Thanks. I confirm that at least my reported example is fixed with 0.12.0:
>>> import urlextract
>>> urlextract.__version__
'0.12.0'
>>> url_extractor = urlextract.URLExtract()
>>> url_extractor.find_urls('https://google.com https://bing.com')
['https://google.com', 'https://bing.com']
>>> text = 'http://medicalxpress.com/news/2017-09-margarine-butter.html https://medicalxpress.com/news/2013-12-healthier-butter-ormargarine.html'
>>> url_extractor.find_urls(text)
['http://medicalxpress.com/news/2017-09-margarine-butter.html', 'https://medicalxpress.com/news/2013-12-healthier-butter-ormargarine.html']
I also tested Larrax's example, which now works too.
@lipoja There is just the issue of the print on line 624 in urlextract_core.py.
I am satisfied. I will leave it to @Larrax to also test 0.12.1 and perhaps try to come up with a failing example, if that is even possible.