thp / urlwatch

Watch (parts of) webpages and get notified when something changes via e-mail, on your phone or via other means. Highly configurable.

Home Page: https://thp.io/2008/urlwatch/

License: Other

Python 70.98% Roff 28.69% Dockerfile 0.18% Shell 0.15%

python webpage monitor automation

urlwatch's Introduction

                         _               _       _       ____
              _   _ _ __| |_      ____ _| |_ ___| |__   |___ \
             | | | | '__| \ \ /\ / / _` | __/ __| '_ \    __) |
             | |_| | |  | |\ V  V / (_| | || (__| | | |  / __/
              \__,_|_|  |_| \_/\_/ \__,_|\__\___|_| |_| |_____|

                                  ... monitors webpages for you

urlwatch is intended to help you watch changes in webpages and get notified (via e-mail, in your terminal or through various third party services) of any changes. The change notification will include the URL that has changed and a unified diff of what has changed.

Documentation: https://urlwatch.readthedocs.io/
Website: https://thp.io/2008/urlwatch/
E-Mail: [email protected]

urlwatch's Issues

Don't create .urlwatch if not necessary

I move the config, URL list, hooks file, and cache to other directories because I don't like having my home directory cluttered with unnecessary dot-directories (consider implementing the XDG Base Directory spec instead), but urlwatch insists on creating the .urlwatch directory even if it isn't actually used.

URLs with POST elements not reported identifiably

I'm watching two URLs: one has a POST element and the second is the same URL without the POST element. Change detection seems to work properly.

The problem is with the diff: both pages are reported with the same base URL. The one with POST data should also include that data, or at least indicate that POST data was used, to distinguish it from the same URL without POST data.

UnicodeEncodeError: 'ascii' codec can't encode character '\xfc' in position 6

The following exception occurs when the output of a shell script contains umlauts (ä, ö, ü):

Traceback (most recent call last):
 File "/usr/bin/urlwatch", line 376, in <module>
   main(parser.parse_args())
 File "/usr/bin/urlwatch", line 343, in main
   report.finish()
 File "/usr/lib/python3.5/site-packages/urlwatch/handler.py", line 128, in finish
   ReporterBase.submit_all(self, self.job_states, duration)
 File "/usr/lib/python3.5/site-packages/urlwatch/reporters.py", line 81, in submit_all
   cls(report, cfg, job_states, duration).submit()
 File "/usr/lib/python3.5/site-packages/urlwatch/reporters.py", line 306, in submit
   print(line)
UnicodeEncodeError: 'ascii' codec can't encode character '\xfc' in position 6: ordinal not in range(128)
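
A minimal sketch of one possible workaround, not the fix urlwatch actually adopted: encode each line with the output stream's own encoding and replace characters it cannot represent, so printing never raises. The helper name safe_print is hypothetical.

```python
import io


def safe_print(line, stream):
    """Write line to stream, replacing characters its encoding cannot represent."""
    # Assumption: fall back to UTF-8 if the stream does not report an encoding.
    encoding = getattr(stream, 'encoding', None) or 'utf-8'
    data = line.encode(encoding, errors='replace')
    stream.write(data.decode(encoding) + '\n')


# Simulate a terminal that only accepts ASCII (as with a C/POSIX locale).
buffer = io.BytesIO()
ascii_stream = io.TextIOWrapper(buffer, encoding='ascii', newline='\n')
safe_print('Gr\xfc\xdfe', ascii_stream)
ascii_stream.flush()
```

The unrepresentable characters come out as question marks, which loses information but keeps the report going.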

Add support for creating RSS/Atom feeds out of the changes list

I was wondering if I could get the differences of the webpage in RSS format. This would provide a self-hosted alternative to sites like Feed43, Page2RSS, etc.

Parse and handle robots.txt

urlwatch should handle robots.txt access rules.

Certificate verification error in openSUSE

I get the following when I run urlwatch:

urlwatch
Traceback (most recent call last):
  File "/usr/bin/urlwatch", line 397, in <module>
    main(parser.parse_args())
  File "/usr/bin/urlwatch", line 364, in main
    report.finish()
  File "/usr/lib/python3.4/site-packages/urlwatch/handler.py", line 128, in finish
    ReporterBase.submit_all(self, self.job_states, duration)
  File "/usr/lib/python3.4/site-packages/urlwatch/reporters.py", line 89, in submit_all
    cls(report, cfg, job_states, duration).submit()
  File "/usr/lib/python3.4/site-packages/urlwatch/reporters.py", line 304, in submit
    body = '\n'.join(super().submit())
  File "/usr/lib/python3.4/site-packages/urlwatch/reporters.py", line 220, in submit
    summary_part, details_part = self._format_output(job_state, line_length)
  File "/usr/lib/python3.4/site-packages/urlwatch/reporters.py", line 262, in _format_output
    pretty_summary = ': '.join((job_state.verb.upper(), pretty_name))
TypeError: sequence item 1: expected str instance, NoneType found

What is missing?

Error installing urlwatch on Ubuntu Linux

I am new to Ubuntu Linux, but I am interested in urlwatch, so I installed Ubuntu to try it.
I have installed urlwatch 1.x and have a question:

How do I upgrade urlwatch 1.x to 2?
I tried to install urlwatch 2 from the console by running python setup.py install, but it fails with an error.
Can anyone give me step-by-step instructions for installing urlwatch 2 on Ubuntu Linux?

Atom feed reporter

I'd love to have urlwatch output an Atom feed which I'd then be able to import into my feed reader.

Liferea supports reading feeds generated by a command-line program; it could work great together with urlwatch.

shelljob_security_checks doesn't work on Windows

os.getuid() is not supported on Windows systems.

The entire method relies on the current_uid variable, which cannot be set on Windows. There are multiple options to explore (the pywin32 module?), but maybe the easiest is to disable the check on Windows.

    def shelljob_security_checks(self):
        shelljob_errors = []

        current_uid = os.getuid()

        dirname = os.path.dirname(self.filename) or '.'
        dir_st = os.stat(dirname)
        if (dir_st.st_mode & (stat.S_IWGRP | stat.S_IWOTH)) != 0:
            shelljob_errors.append('%s is group/world-writable' % dirname)
        if dir_st.st_uid != current_uid:
            shelljob_errors.append('%s not owned by %s' % (dirname, get_current_user()))

        file_st = os.stat(self.filename)
        if (file_st.st_mode & (stat.S_IWGRP | stat.S_IWOTH)) != 0:
            shelljob_errors.append('%s is group/world-writable' % self.filename)
        if file_st.st_uid != current_uid:
            shelljob_errors.append('%s not owned by %s' % (self.filename, get_current_user()))

        return shelljob_errors
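
The "disable the check on Windows" option could look roughly like the sketch below. It is a standalone function rather than the method above, it takes the filename as a parameter, and it reports the numeric uid instead of calling get_current_user(); all of that is an assumption made to keep the example self-contained.

```python
import os
import stat


def shelljob_security_checks(filename):
    """Return a list of ownership/permission problems for filename."""
    # Assumption: platforms without os.getuid() (e.g. Windows) skip the check.
    if not hasattr(os, 'getuid'):
        return []

    current_uid = os.getuid()
    errors = []
    dirname = os.path.dirname(filename) or '.'

    # The same two tests are applied to the directory and to the file itself.
    for path in (dirname, filename):
        st = os.stat(path)
        if st.st_mode & (stat.S_IWGRP | stat.S_IWOTH):
            errors.append('%s is group/world-writable' % path)
        if st.st_uid != current_uid:
            errors.append('%s not owned by uid %d' % (path, current_uid))
    return errors
```

Skipping the check silently weakens the protection on Windows, so a real implementation might want to log that the check was not performed.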

Feature-Wish: cli-switches to control output (URLS only / no diffs)

The subject says it all: a command-line switch to suppress the diffs and other decorative output, and print only the URLs of the pages that changed, would be very helpful.

Filter diff results

It is possible to define custom hooks to "prepare" the content of the website for parsing. However, this happens before calculating the diff. I'm wondering if there is an easy way to define hooks that are executed after calculating the diff? For example, sometimes I only want to be informed if content is added to a website and do not care about content that has been removed since the last visit. The simplest way (as I'm doing now with a quickly hacked grep parser) would be to remove any lines starting with "-" from the resulting diff. Is there any way to achieve this cleanly with urlwatch?
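
The grep-style workaround described above can be sketched in Python with difflib. This is only an illustration of post-processing a unified diff, not an urlwatch hook; the function name only_additions is hypothetical, and it assumes a single-file diff.

```python
import difflib


def only_additions(diff_lines):
    """Drop removed lines from a unified diff, keeping headers, context and additions."""
    kept = []
    in_hunk = False
    for line in diff_lines:
        if line.startswith('@@'):
            in_hunk = True
        # Before the first hunk, '--- old' is a file header, not a removed line.
        if in_hunk and line.startswith('-'):
            continue
        kept.append(line)
    return kept


old = ['a', 'b', 'c']
new = ['a', 'c', 'd']
diff = list(difflib.unified_diff(old, new, 'old', 'new', lineterm=''))
filtered = only_additions(diff)
```

Note that the hunk header line counts still describe the unfiltered diff, so the result is meant for reading, not for feeding to patch.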

-s flag is invalid

I'm trying to configure Gmail SMTP with urlwatch but the supplied example, urlwatch -s smtp.example.com:587 -f [email protected] --pass, does not work on Ubuntu Server 15.04. I have all of the requisite dependencies installed, but urlwatch reports error: no such option: -s.

VACUUM locking issues when running two instances concurrently

I have two URL lists with different polling frequencies but they both run at the same time once a day.
When this happens, there is a locking issue when minidb runs VACUUM:

 Traceback (most recent call last):
  File "/usr/lib/python-exec/python3.5/urlwatch", line 375, in <module>
    main(parser.parse_args())
  File "/usr/lib/python-exec/python3.5/urlwatch", line 345, in main
    cache_storage.close()
  File "/usr/lib64/python3.5/site-packages/urlwatch/storage.py", line 299, in close
    self.db.close()
  File "/usr/lib64/python3.5/site-packages/minidb.py", line 170, in close
    self._execute('VACUUM')
  File "/usr/lib64/python3.5/site-packages/minidb.py", line 153, in _execute
    return self.db.execute(sql)
sqlite3.OperationalError: cannot VACUUM - SQL statements in progress

A file-level lock with fcntl.LOCK_EX would fix the problem but I don't really know where it should be used.
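
A minimal sketch of such a file-level lock, assuming a Unix system (fcntl is not available on Windows) and a separate lock file next to the cache. The name exclusive_lock and the lock file path are hypothetical; where to call it in urlwatch is left open, as in the report.

```python
import contextlib
import fcntl


@contextlib.contextmanager
def exclusive_lock(lock_path):
    """Hold an exclusive flock() on lock_path for the duration of the with block."""
    with open(lock_path, 'w') as lock_file:
        # Blocks until any other holder of the lock releases it.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

A second instance would then wrap its database work (including the close that triggers VACUUM) in the same with exclusive_lock(...) block and simply wait its turn.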

Error on adding URLs containing an equal sign

If I try to add a URL containing an equal sign = using the CLI (for example ./urlwatch --add url=http://example.com/index.php?test=sdasd,name="test" ), I get the following error:

  File "./urlwatch", line 375, in <module>
    main(parser.parse_args())
  File "./urlwatch", line 314, in main
    sys.exit(modify_urls(jobs, args.urls, args.add, args.delete))
  File "./urlwatch", line 213, in modify_urls
    d = {k: v for k, v in (item.split('=', 2) for item in add.split(','))}
  File "./urlwatch", line 213, in <dictcomp>
    d = {k: v for k, v in (item.split('=', 2) for item in add.split(','))}
ValueError: too many values to unpack (expected 2)

If the URL is added using the config file, everything works as expected.
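
The traceback points at item.split('=', 2), which allows up to two splits and therefore up to three parts. A hedged sketch of a fix is to split on the first equal sign only; parse_add_argument is a hypothetical standalone helper, not the actual modify_urls code.

```python
def parse_add_argument(add):
    """Parse 'key=value,key=value' into a dict, splitting each item on the first '=' only."""
    # maxsplit=1 yields exactly two parts, so '=' inside the value is preserved.
    return dict(item.split('=', 1) for item in add.split(','))


job = parse_add_argument('url=http://example.com/index.php?test=sdasd,name=test')
```

URLs that contain commas would still be split incorrectly by the outer add.split(','), so this addresses only the equal-sign case.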

SMTP doesn't work

watched 6 URLs in 0 seconds
Traceback (most recent call last):
  File "/usr/local/bin/urlwatch", line 375, in <module>
    main(parser.parse_args())
  File "/usr/local/bin/urlwatch", line 342, in main
    report.finish()
  File "/usr/local/lib/python3.4/dist-packages/urlwatch/handler.py", line 128, in finish
    ReporterBase.submit_all(self, self.job_states, duration)
  File "/usr/local/lib/python3.4/dist-packages/urlwatch/reporters.py", line 81, in submit_all
    cls(report, cfg, job_states, duration).submit()
  File "/usr/local/lib/python3.4/dist-packages/urlwatch/reporters.py", line 340, in submit
    mailer.send(msg)
  File "/usr/local/lib/python3.4/dist-packages/urlwatch/mailer.py", line 57, in send
    s.starttls()
  File "/usr/lib/python3.4/smtplib.py", line 689, in starttls
    server_hostname=server_hostname)
  File "/usr/lib/python3.4/ssl.py", line 364, in wrap_socket
    _context=self)
  File "/usr/lib/python3.4/ssl.py", line 577, in __init__
    self.do_handshake()
  File "/usr/lib/python3.4/ssl.py", line 804, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLError: [SSL: TLSV1_UNRECOGNIZED_NAME] tlsv1 unrecognized name (_ssl.c:600)

I have tried with the Gmail server and my own mail server, on ports 25/465/587, with and without STARTTLS, on multiple servers.

username:password@url domains do not work after Python3k conversion.

After the py3k conversion (./convert-to-python3.sh), urlwatch URLs with the pattern http://username:password@localhost/ do not work.

urlwatch crashes with:

File "/nix/store/sv62w8czrr5pziwwkgi67xw2vbc3swrs-python3.4-urlwatch-1.18/bin/.urlwatch-wrapped", line 307, in <module>
raise exception
File "/nix/store/4z7g0rlc27nwyq1mfn57kiijnnqn2swk-python3-3.4.3/lib/python3.4/concurrent/futures/thread.py", line 54, in run
result = self.fn(*self.args, **self.kwargs)
File "/nix/store/sv62w8czrr5pziwwkgi67xw2vbc3swrs-python3.4-urlwatch-1.18/bin/.urlwatch-wrapped", line 288, in process_job
data = job.retrieve(timestamp, filter_func, headers, log)
File "/nix/store/sv62w8czrr5pziwwkgi67xw2vbc3swrs-python3.4-urlwatch-1.18/lib/python3.4/site-packages/urlwatch/handler.py", line 142, in retrieve
parts.password)).encode('base64').strip())
LookupError: 'base64' is not a text encoding; use codecs.encode() to handle arbitrary codecs

This issue does not occur when using python2 instead.
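
In Python 3, str.encode('base64') no longer exists, which is what the LookupError says. A minimal sketch of building the same Basic auth value with the base64 module follows; basic_auth_header is a hypothetical helper, and UTF-8 is an assumed encoding for the credentials.

```python
import base64


def basic_auth_header(username, password):
    """Return the value of an HTTP Basic 'Authorization' header."""
    credentials = '%s:%s' % (username, password)
    # b64encode works on bytes and returns bytes, so encode before and decode after.
    token = base64.b64encode(credentials.encode('utf-8')).decode('ascii')
    return 'Basic ' + token


header = basic_auth_header('username', 'password')
```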

I initially opened a ticket at NixOS/nixpkgs#12021

Tests fail because of missing urls file

...E..
======================================================================
ERROR: test_handler.test_load_examples
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python3.4/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/tmp/urlwatch-2.0/test/test_handler.py", line 37, in test_load_examples
    txt_jobs = UrlsTxt(os.path.join(os.path.dirname(__file__), 'data', 'urls.txt')).load_secure()
  File "/tmp/urlwatch-2.0/lib/urlwatch/storage.py", line 155, in load_secure
    jobs = self.load()
  File "/tmp/urlwatch-2.0/lib/urlwatch/storage.py", line 202, in load
    return list(self._parse(open(self.filename)))
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/urlwatch-2.0/test/data/urls.txt'

----------------------------------------------------------------------
Ran 6 tests in 0.059s

FAILED (errors=1)

Support for cookie auth

Some webpages need a session cookie, and currently urlwatch does not seem to handle this case.
urllib2 needs cookielib to handle cookies, but requests manages this automatically.

option to debug filter output

It would be nice if urlwatch had an option to show the output it sees after applying the filters. This would simplify the development of custom filters and the setup of the predefined ones.

Running urlwatch on pages that send an HTTP 404 status code

urlwatch always reports an error when checking a page that returns an HTTP 404 status code (page not found).

The response text is as follows:

01. ERROR: 404 test
===========================================================================

---------------------------------------------------------------------------
ERROR: 404 test (http://tickets.sandbagtickets.com/event/UY8831935ji)
---------------------------------------------------------------------------
404 Client Error: Not Found
---------------------------------------------------------------------------


--

However, http://tickets.sandbagtickets.com/event/UY8831935ji is a real page, it's just temporarily not available. Is there any way to have urlwatch watch a page that transmits a 404, and alert when it no longer transmits a 404? As opposed to simply returning an error every time.

--quiet flag

It would be useful if support for the --quiet flag was added back in. I sometimes run urlwatch manually, which is why I want reporting on stdout enabled, but I also run it in a cronjob, where I don't want it to write anything to stdout so that I won't get unnecessary emails from the cron daemon.

This could be worked around by keeping a duplicate config file, but I'd really rather not have one unnecessarily.

How to use sendmail

Hi there,

Since I cannot use the SMTP mailer anymore, because keyring forces me to use an encrypted keystore, I tried to use the sendmail feature.
I configured the sendmail path, and I can send mail from the console with
echo test| sendmail -t [email protected]

But when I try to use urlwatch, I get a "connection refused" error. How do I configure urlwatch with sendmail correctly?

Warn when user has POSIX locale and suggest using UTF-8 instead

If the system locale is set to POSIX (or C), then some features won't work properly. Suggest to the user that they set the environment to UTF-8 (or fall back to UTF-8 in any case?).
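
A rough sketch of how such a warning could be derived from the environment, assuming the standard precedence LC_ALL, then LC_CTYPE, then LANG. Both function names and the message text are hypothetical.

```python
def effective_ctype_locale(env):
    """Return the locale name that governs character encoding for the given environment."""
    # The first variable that is set and non-empty wins.
    for name in ('LC_ALL', 'LC_CTYPE', 'LANG'):
        value = env.get(name)
        if value:
            return value
    # With nothing set, programs run in the C locale.
    return 'C'


def locale_warning(env):
    """Return a warning string for C/POSIX locales, or None if no warning is needed."""
    value = effective_ctype_locale(env)
    if value in ('C', 'POSIX'):
        return 'Locale is %s; consider a UTF-8 locale such as en_US.UTF-8' % value
    return None
```

In a program this would be called as locale_warning(os.environ) at startup.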

Documentation on reporters

I tried to configure urlwatch with a reporter other than the default stdout reporter,
but I could not figure out how to change the reporter. I assume it should be changed in urls.yaml.
The example file does not contain anything on reporters, and I did not find any documentation on this.
Did I miss something?

urlwatch==2.3 package on PyPI is broken

I just wanted to report that the urlwatch package on PyPI was packaged incorrectly. I haven't looked into it much but it's missing a bunch of files. Appears to have been caused by 142a0e5.

Running pip install urlwatch in a fresh virtualenv presents the following:

Collecting urlwatch
  Using cached urlwatch-2.3.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-8gsgi5q8/urlwatch/setup.py", line 26, in <module>
        main_py = open(os.path.join(HERE, 'lib', PACKAGE_NAME, '__init__.py')).read()
    FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-build-8gsgi5q8/urlwatch/lib/urlwatch/__init__.py'

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-8gsgi5q8/urlwatch/

Only report on persistent changes

I'm finding that I'm getting a lot of false positives. urlwatch detects changes in a webpage in a particular run. In the next run, it detects a complete reversion of these changes. I see this on multiple websites, with different sections of the page. I'm not sure what is causing it, but it might be due to corrupted incoming files.

Perhaps urlwatch could have the option of caching the last two versions when it detects changes, and only report when changes persist for two runs?
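
The "persist for two runs" idea can be sketched as a small state machine that remembers the last confirmed content and one pending candidate. This is an illustration of the proposal only; the class name and its in-memory state are assumptions, and a real version would have to store both values in the cache.

```python
class PersistentChangeFilter:
    """Report a change only when the same new content is seen in two consecutive runs."""

    def __init__(self, initial):
        self.confirmed = initial   # last content that was reported (or the baseline)
        self.candidate = None      # new content seen once, not yet confirmed

    def observe(self, content):
        """Feed the content of one run; return True if a change should be reported."""
        if content == self.confirmed:
            # The page reverted (or never changed): forget any pending candidate.
            self.candidate = None
            return False
        if content == self.candidate:
            # Same new content twice in a row: accept and report it.
            self.confirmed = content
            self.candidate = None
            return True
        # First sighting of new content: remember it and wait for the next run.
        self.candidate = content
        return False


watcher = PersistentChangeFilter('A')
results = [watcher.observe(c) for c in ['B', 'A', 'B', 'B', 'B']]
```

The trade-off is that every genuine change is reported one run late.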

Please enter password for encrypted keyring

Hi,
How do I configure urlwatch to send e-mails (my SMTP provider requires TLS) without having to type a password interactively? Currently I get a prompt saying "Please enter password for encrypted keyring:", which means I can't run the command from crontab or otherwise automatically.

Thanks.

Include the URL in the subject of the alert e-mail

From user feedback:

was wondering if there is a way to include the name of the URL changed in the subject of the alert email?

Use Markdown for README

I think the README would benefit from Markdown, here is a simple conversion from yours (I might have missed stuff).
https://gist.github.com/sbraz/4df782cb3225500c63d0

Support for adding HTTP authentication inside an URL

$ urlwatch -e
***************************************************************************
ERROR: http://Student:**REMOVED STH LIKE "aBc"**@annaw.ii.uph.edu.pl/dydaktyka_pliki/PP/
***************************************************************************
InvalidURL("nonnumeric port: '**REMOVED STH LIKE "aBc"**@annaw.ii.uph.edu.pl'",)
nonnumeric port: ''**REMOVED STH LIKE "aBc"**@annaw.ii.uph.edu.pl'
***************************************************************************

What's wrong there?

A workaround is to use curl like this:
| curl 'http://Student:**REMOVED STH LIKE "aBc"**@annaw.ii.uph.edu.pl/dydaktyka_pliki/PP/' 2>/dev/null

error: no such option --edit-config

I have set up a virtualenv on a Raspberry Pi, activated it, and installed urlwatch.
On attempting to edit the config I get the following:

(urlwatch)pi@pi:~ $ urlwatch --edit-config
Usage: urlwatch [options]

Watch web pages and arbitrary URLs for changes

urlwatch: error: no such option: --edit-config
(urlwatch)pi@pi:~ $

Looking to use urlwatch with PushOver.

BeautifulSoup usage?

Hello,

First, thank you for your great product. I created a set of RPMs on AXIVO repository, available for CentOS 6 (and soon for CentOS 7 with BeautifulSoup4):

$ yum --enablerepo=axivo list | egrep 'urlwatch|futures|beautifulsoup'
python-beautifulsoup.noarch              3.2.1-1.el6                    @axivo  
python-futures.noarch                    2.1.6-1.el6                    @axivo  
urlwatch.noarch                          1.17-1.el6                     @axivo

Can you please give an example of how to use BeautifulSoup in hooks.py to filter a specific URL? For example, on your site, I want to check for the latest update using the class="filename" tag. I apologize for my lack of Python knowledge; I see several lines in hooks.py and I'm not sure whether I should remove them all and use only the BeautifulSoup lines you posted on your site. Google returned no information, so I hope you can post a detailed example here.

Thank you for your help.

urlwatch-2.1 new version migration fails

urlwatch 2.x has changed its configuration and cache storage system, but the migration from 1.x fails with:

urlwatch --edit
No jobs file found

        Migrating cache: /home/user/.urlwatch/cache -> /home/user/.urlwatch/cache.db

Removing: 0141a464b24bbc03f2bbba2bf91fe94179572238
Removing: 35ecdde4fad96b22d3fd11cee70f2bb141296e7a
Removing: 6d501928989a55a0325b63f450309da105bc4be8
Removing: a6d62130ca1dae859c462d0b9325bee60e449a21
Removing: 7f8dc223248751c43f5fff47399ad808332bba88
Removing: cdc6c20b4ed496b8b185d37c7966f4e31a6dabc5
Removing: e382ca109a55e69d45da802b58a0d37d37dd5024
Removing: f0f6be08f546f22bbd3b22c9b1650ea426666ea5
Removing: cc3acca41c7ee65dadbd2693d969eb18729472b6
Removing: d952b8a505165d60824ccd73a34a15e7035a4b2e
Removing: b06d40c441961c272fc4bb13f9a0494d7c64be9e
Removing: 86c5f26c01db07316e61516f95626d246d7735c8
Removing: cce3f7a7af3039f374199a849fb7e24d306695d9
Removing: 4242a29b794b46739bcb7144e599182827e5b596
Removing: b3eaf7d9df6af694d067475dc5c3bffd845a5c42
Removing: 56a05b6aa9c51fc2e29bfed42bc33c72ca83aeb2
Removing: 85bf86ea5490220da6be0418c53348b6a2b8b110
Removing: 22c201209094bfa9b508182022933236b5039b3b
Removing: 1921dd46f0c55c2e45ff5f68a9cc483971bee7dd
Removing: 06f09294cf250d4ffa3ebf9371129f678eb3e173
Parsing failed:
======
[Errno 2] No such file or directory: '/usr/lib/python-exec/python3.4/share/urlwatch/examples/urls.yaml.example'
======

The file /home/slavko/.urlwatch/urls.yaml was NOT updated.
Your changes have been saved in /home/user/.urlwatch/urls.edit.yaml

It seems that the urls.yaml.example file is installed into /usr/share/urlwatch/examples/urls.yaml.example, but urlwatch expects it in /usr/lib/python-exec/python3.4...:

A simple symlink /usr/lib/python-exec/python3.4/share/urlwatch -> /usr/share/urlwatch solves the problem.

I have also reported this on the Funtoo bug tracker: https://bugs.funtoo.org/browse/FL-3128

os.getlogin() - Inappropriate ioctl for device

Hello. In handler.py, line 173, you use os.getlogin().
According to the os.getlogin() documentation, it "returns the user logged in to the controlling terminal of the process".

This means that if there is no controlling terminal, for example because urlwatch is launched by cron or by a systemd service, it fails with this error:
OSError: [Errno 25] Inappropriate ioctl for device

You can find a "fix" for a similar issue in the GitPython repository: swallat/GitPython@f362d10
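
One way to avoid depending on a controlling terminal is to fall back to getpass.getuser(), which consults environment variables and the password database. The sketch below is an assumption about how a get_current_user() helper could be written, not the code from the linked commit.

```python
import getpass
import os


def get_current_user():
    """Return a user name even when there is no controlling terminal."""
    try:
        return os.getlogin()
    except OSError:
        # Raised under cron or systemd, where no terminal is attached.
        pass
    try:
        return getpass.getuser()
    except (KeyError, OSError, ImportError):
        # Last resort (e.g. a uid without a passwd entry): use the numeric uid.
        return str(os.getuid()) if hasattr(os, 'getuid') else 'unknown'


user = get_current_user()
```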

Thank you for this program.

urlwatch does not support some pages

When I run

urlwatch

I get

===========================================================================
01. ERROR: https://www.gropp.org/?id=projects&sub=bwm-ng
02. ERROR: https://sourceforge.net/projects/atomiks/files/
03. ERROR: https://sourceforge.net/p/blobandconquer/code/ci/master/tree/
04. ERROR: https://www.7kfans.com/wiki/index.php/Download
05. ERROR: https://sourceforge.net/projects/blobwars/files/
===========================================================================

---------------------------------------------------------------------------
ERROR: https://www.gropp.org/?id=projects&sub=bwm-ng
---------------------------------------------------------------------------
[Errno 21] Is a directory
---------------------------------------------------------------------------


---------------------------------------------------------------------------
ERROR: https://sourceforge.net/projects/atomiks/files/
---------------------------------------------------------------------------
[Errno 21] Is a directory
---------------------------------------------------------------------------


---------------------------------------------------------------------------
ERROR: https://sourceforge.net/p/blobandconquer/code/ci/master/tree/
---------------------------------------------------------------------------
[Errno 21] Is a directory
---------------------------------------------------------------------------


---------------------------------------------------------------------------
ERROR: https://www.7kfans.com/wiki/index.php/Download
---------------------------------------------------------------------------
[Errno 21] Is a directory
---------------------------------------------------------------------------


---------------------------------------------------------------------------
ERROR: https://sourceforge.net/projects/blobwars/files/
---------------------------------------------------------------------------
[Errno 21] Is a directory
---------------------------------------------------------------------------


-- 
urlwatch 2.1, Copyright 2008-2016 Thomas Perl
Website: http://thp.io/2008/urlwatch/
watched 15 URLs in 0 seconds

Could you add support for these pages?

pip uninstall urlwatch deletes /usr/share/

This may have been a fluke, but in removing urlwatch with pip (to use Arch's python-urlwatch instead), pip deleted my whole /usr/share/. I was able to recover, but I don't want that to happen to anyone else. I don't have time to dig into exactly why it happened, but based on a conversation in another thread, there seems to be some issue with where/how urlwatch installs (and uninstalls).

SNI support

When urlwatch is used to monitor an HTTPS site that requires SNI, it fails with the following error message:

got URLError while loading url: <urlopen error [Errno 1] _ssl.c:510: error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error>

Encoding problem

Hi!
I get the following exception:

Traceback (most recent call last):
  File "/usr/bin/urlwatch", line 376, in <module>
    main(parser.parse_args())
  File "/usr/bin/urlwatch", line 343, in main
    report.finish()
  File "/usr/lib/python3.5/site-packages/urlwatch/handler.py", line 128, in finish
    ReporterBase.submit_all(self, self.job_states, duration)
  File "/usr/lib/python3.5/site-packages/urlwatch/reporters.py", line 81, in submit_all
    cls(report, cfg, job_states, duration).submit()
  File "/usr/lib/python3.5/site-packages/urlwatch/reporters.py", line 298, in submit
    print(self._red(line))
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2019' in position 63: ordinal not in range(256)

I traced it back to the following content of the website:

European Union’s
Please note the ’ between n and s causing the hiccup. Maybe all string operations should be moved to UTF-8?

Thanks!

Regex support for AutoMatchFilter

This would be nice and it's something I'll miss if I stop using the legacy filters. Example use case: tracking packages from websites that always share the same base URL.

Failed email reports in Python 3.x

I was attempting to set up email report delivery via SMTP for the first time and experienced the following error:

Traceback (most recent call last):
  File "/home/user/.local/bin/urlwatch", line 375, in <module>
    main(parser.parse_args())
  File "/home/user/.local/bin/urlwatch", line 342, in main
    report.finish()
  File "/home/user/.local/lib/python3.4/site-packages/urlwatch/handler.py", line 128, in finish
    ReporterBase.submit_all(self, self.job_states, duration)
  File "/home/user/.local/lib/python3.4/site-packages/urlwatch/reporters.py", line 81, in submit_all
    cls(report, cfg, job_states, duration).submit()
  File "/home/user/.local/lib/python3.4/site-packages/urlwatch/reporters.py", line 340, in submit
    mailer.send(msg)
  File "/home/user/.local/lib/python3.4/site-packages/urlwatch/mailer.py", line 57, in send
    s.starttls()
  File "/usr/lib/python3.4/smtplib.py", line 688, in starttls
    server_hostname=self._host)
  File "/usr/lib/python3.4/ssl.py", line 365, in wrap_socket
    _context=self)
  File "/usr/lib/python3.4/ssl.py", line 583, in __init__
    self.do_handshake()
  File "/usr/lib/python3.4/ssl.py", line 810, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLError: [SSL: TLSV1_UNRECOGNIZED_NAME] tlsv1 unrecognized name (_ssl.c:600)

I was using smtp.gmail.com:587 but observed the same behavior with other mail servers as well.

I was able to reproduce this behavior in an interpreter with the following code:

import smtplib

s = smtplib.SMTP()
s.connect('smtp.gmail.com', 587)
s.ehlo()
s.starttls()

After some experimentation I was able to remedy it with a call to s._host = 'smtp.gmail.com':

import smtplib

s = smtplib.SMTP()
s.connect('smtp.gmail.com', 587)
s._host = 'smtp.gmail.com'
s.ehlo()
s.starttls()

Anyways, just wondering if there's a particular reason for the explicit call to connect()? connect() is invoked by smtplib.SMTP's __init__ and it takes care of _host and whatnot.

The following change to mailer.py seems to fix things.

52,53c52
<         s = smtplib.SMTP()
<         s.connect(self.smtp_server, self.smtp_port)

---
>         s = smtplib.SMTP(self.smtp_server, self.smtp_port)

Enabling pushover/HowTo?

Hi there,
Is there a more detailed HowTo on enabling Pushover in the config file? I could not identify any entries there that look related to Pushover.
Thanks in advance

Handling large emails

Hi!

After migrating to the new version, the second run (after enabling html2text for all my 300 URLs) resulted in a large email that was rejected by the server:

Traceback (most recent call last):
  File "/usr/bin/urlwatch", line 376, in <module>
    main(parser.parse_args())
  File "/usr/bin/urlwatch", line 343, in main
    report.finish()
  File "/usr/lib/python3.5/site-packages/urlwatch/handler.py", line 128, in finish
    ReporterBase.submit_all(self, self.job_states, duration)
  File "/usr/lib/python3.5/site-packages/urlwatch/reporters.py", line 81, in submit_all
    cls(report, cfg, job_states, duration).submit()
  File "/usr/lib/python3.5/site-packages/urlwatch/reporters.py", line 340, in submit
    mailer.send(msg)
  File "/usr/lib/python3.5/site-packages/urlwatch/mailer.py", line 65, in send
    s.sendmail(msg['From'], [msg['To']], msg.as_string())
  File "/usr/lib/python3.5/smtplib.py", line 857, in sendmail
    raise SMTPSenderRefused(code, resp, from_addr)
smtplib.SMTPSenderRefused: (552, b'5.2.3 Message size exceeds fixed maximum message size (15000000)', '[email protected]')

Maybe it is possible to split large emails into smaller parts?
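
A hedged sketch of what splitting could look like: group report lines into parts whose UTF-8 size stays under a limit, and send each part as its own message. The function name split_report and the size accounting (one extra byte per newline, no headers or transfer encoding) are simplifying assumptions.

```python
def split_report(lines, max_bytes):
    """Group lines into newline-joined parts of at most max_bytes UTF-8 bytes each."""
    parts = []
    current = []
    size = 0
    for line in lines:
        line_size = len(line.encode('utf-8')) + 1  # +1 for the newline
        # Start a new part when adding this line would exceed the limit.
        if current and size + line_size > max_bytes:
            parts.append('\n'.join(current))
            current = []
            size = 0
        current.append(line)
        size += line_size
    if current:
        parts.append('\n'.join(current))
    return parts


parts = split_report(['aaaa', 'bbbb', 'cccc'], max_bytes=10)
```

A single line longer than the limit still ends up in a part of its own that exceeds it, so a real implementation would need a policy for that case.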

Thanks!

sendmail binary for sending email

It would be useful if there was an option to use the sendmail binary instead of SMTP.

Urlwatch as a library

Can library support be added? E.g. call urlwatch and tasks programmatically as part of a django program etc.

watch binary files

It would be nice if urlwatch would be able to watch changes in binary files like images or PDFs by their hashed content.
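
Hash-based watching could be sketched as follows: store a digest of the downloaded bytes and compare it on the next run. SHA-256 and the function names are assumptions chosen for illustration; the request does not name a hash algorithm.

```python
import hashlib


def content_fingerprint(data):
    """Return a hex digest identifying the exact byte content."""
    return hashlib.sha256(data).hexdigest()


def has_changed(previous_fingerprint, data):
    """Compare freshly downloaded bytes against the stored fingerprint."""
    return content_fingerprint(data) != previous_fingerprint


stored = content_fingerprint(b'%PDF-1.4 first version')
```

Such a job can only say that the file changed, not what changed, so the report would show the old and new digests rather than a diff.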

urlwatch doesn't work in crontab mode

Hi,
I use gnome-scheduled to create a crontab entry that executes urlwatch, but it doesn't work (it doesn't even create cache.db).

Here is the output of "crontab -l"

Here is the output of "grep CRON /var/log/syslog"

However, there is no cache.db in /.urlwatch. If I run the command manually, everything is fine.

Does anyone have a solution for this situation?
Thank you very much.

Error when piping to less

Sometimes a stack trace occurs after quitting less (apparently only when there's actual output):

~ > urlwatch |less                                                                     
Traceback (most recent call last):
  File "/usr/bin/urlwatch", line 375, in <module>
    main(parser.parse_args())
  File "/usr/bin/urlwatch", line 342, in main
    report.finish()
  File "/usr/lib/python3.5/site-packages/urlwatch/handler.py", line 128, in finish
    ReporterBase.submit_all(self, self.job_states, duration)
  File "/usr/lib/python3.5/site-packages/urlwatch/reporters.py", line 89, in submit_all
    cls(report, cfg, job_states, duration).submit()
  File "/usr/lib/python3.5/site-packages/urlwatch/reporters.py", line 304, in submit
    print(self._green(line))
BrokenPipeError: [Errno 32] Broken pipe
zsh: exit 1     urlwatch | 
zsh: done       less

urlwatch 2.2
less 481
Arch Linux Arm
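
A minimal sketch of tolerating a closed pipe while printing, assuming the reporter writes line by line to a stream. The helper print_lines is hypothetical and is not how the stdout reporter is structured.

```python
import io


def print_lines(lines, stream):
    """Write lines to stream; return False if the reader closed the pipe."""
    try:
        for line in lines:
            stream.write(line + '\n')
        stream.flush()
    except BrokenPipeError:
        # The pager (e.g. less) quit before reading everything; stop quietly.
        return False
    return True


class ClosedPipe:
    """Stand-in for a pipe whose reading end has gone away."""

    def write(self, text):
        raise BrokenPipeError(32, 'Broken pipe')

    def flush(self):
        pass


sink = io.StringIO()
ok = print_lines(['a', 'b'], sink)
interrupted = print_lines(['a'], ClosedPipe())
```

With a real sys.stdout, Python may raise again when it flushes the stream at exit, so a complete fix would likely also need to redirect or close stdout after catching the error.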

Install dependencies

What are your thoughts on having the hard dependencies (minidb and PyYAML) installed upon installation? This could be especially helpful for people installing or upgrading directly through the Python Package Index as I was. Personally I like to keep my pip dependency tree accurate so that at some later date I'm able to easily check what unused packages I'm able to uninstall without causing problems.

I'm not familiar with how to do this with distutils (or whether it is even possible) so I used setuptools to do the following:

diff --git a/setup.py b/setup.py
index 7d9707c..ee1a825 100644
--- a/setup.py
+++ b/setup.py
@@ -3,7 +3,7 @@
 # Minimalistic, automatic setup.py file for Python modules
 # Copyright (c) 2008-2016 Thomas Perl <thp.io/about>

-from distutils.core import setup
+from setuptools import setup

 import os
 import re
@@ -29,5 +29,6 @@ m['scripts'] = [PACKAGE_NAME]
 m['package_dir'] = {'': 'lib'}
 m['packages'] = ['.'.join(dirname.split(os.sep)[1:]) for dirname, _, files in os.walk('lib') if '__init__.py' in files]
 m['data_files'] = [(dirname, [os.path.join(dirname, fn) for fn in files]) for dirname, _, files in os.walk('share')]
+m['install_requires'] = ['minidb', 'PyYAML']

 setup(**m)

Automatically cleaning up cached content

Does urlwatch automatically clean up its cache? i.e. keep only the latest version of a page and delete any old versions.