webcomics / dosage
dosage is a comic strip downloader and archiver
Home Page: https://dosage.rocks/
License: MIT License
We should aim for a single-exe build for Windows, kinda like youtube-dl. There is a new version of py2exe that works with Python 3: https://pypi.python.org/pypi/py2exe/
When I install dosage with pip (pip install dosage), it still installs version 2.15.
Could you please update it, so that it will install this (new) package? Thanks!!
Running ./dosage -v DumbingOfAge yields:
DumbingOfAge> Get strip URL http://www.dumbingofage.com/2016/comic/book-6/02-that-perfect-girl/leverage-2/
DumbingOfAge> ERROR: URL content of http://www.dumbingofage.com/2016/comic/book-6/02-that-perfect-girl/leverage-2/ with 2561571 bytes exceeds 2097152 bytes.
DumbingOfAge> File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 783, in __bootstrap
DumbingOfAge> self.__bootstrap_inner()
DumbingOfAge> File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
DumbingOfAge> self.run()
DumbingOfAge> File "/Users/jschpp/exclude_from_backup/src/dosage/dosagelib/director.py", line 82, in run
DumbingOfAge> self.getStrips(scraperobj)
DumbingOfAge> File "/Users/jschpp/exclude_from_backup/src/dosage/dosagelib/director.py", line 96, in getStrips
DumbingOfAge> self._getStrips(scraperobj)
DumbingOfAge> File "/Users/jschpp/exclude_from_backup/src/dosage/dosagelib/director.py", line 123, in _getStrips
DumbingOfAge> out.exception(msg)
DumbingOfAge> File "/Users/jschpp/exclude_from_backup/src/dosage/dosagelib/output.py", line 65, in exception
DumbingOfAge> self.writelines(traceback.format_stack(), 1)
DumbingOfAge> File "/Users/jschpp/exclude_from_backup/src/dosage/dosagelib/scraper.py", line 159, in getStrips
DumbingOfAge> for strip in self.getStripsFor(url, maxstrips):
DumbingOfAge> File "/Users/jschpp/exclude_from_backup/src/dosage/dosagelib/scraper.py", line 169, in getStripsFor
DumbingOfAge> data = self.getPage(url)
DumbingOfAge> File "/Users/jschpp/exclude_from_backup/src/dosage/dosagelib/scraper.py", line 353, in getPage
DumbingOfAge> content = getPageContent(url, cls.session)
DumbingOfAge> File "/Users/jschpp/exclude_from_backup/src/dosage/dosagelib/util.py", line 187, in getPageContent
DumbingOfAge> page = urlopen(url, session, max_content_bytes=max_content_bytes)
DumbingOfAge> File "/Users/jschpp/exclude_from_backup/src/dosage/dosagelib/util.py", line 314, in urlopen
DumbingOfAge> check_content_size(url, req.headers, max_content_bytes)
DumbingOfAge> File "/Users/jschpp/exclude_from_backup/src/dosage/dosagelib/util.py", line 333, in check_content_size
DumbingOfAge> raise IOError(msg)
DumbingOfAge> IOError: URL content of http://www.dumbingofage.com/2016/comic/book-6/02-that-perfect-girl/leverage-2/ with 2561571 bytes exceeds 2097152 bytes.
Changing the value at util.py:35 from 2MB to 3MB should solve this.
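The suggested fix amounts to raising a size cap. A minimal sketch of such a check, assuming the limit is compared against the Content-Length response header. The names MAX_CONTENT_BYTES and check_content_size are modeled on the traceback, but the body is an illustration and not dosage's actual code:

```python
# Hypothetical sketch; not dosage's actual util.py.
MAX_CONTENT_BYTES = 3 * 1024 * 1024  # 3MB instead of 2MB (2097152 bytes)


def check_content_size(url, headers, max_content_bytes=MAX_CONTENT_BYTES):
    """Raise IOError if the declared Content-Length exceeds the limit."""
    if "content-length" not in headers:
        return  # size unknown, nothing to check
    size = int(headers["content-length"])
    if size > max_content_bytes:
        raise IOError("URL content of %s with %d bytes exceeds %d bytes."
                      % (url, size, max_content_bytes))
```

With the 3MB default, the 2561571-byte page from the report passes; with the old 2MB limit it raises the same IOError as in the log.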
Looks like they might have updated the site and the URLs have changed. Running from a git pull from a few minutes ago.
Example of the error I get:
Arcamax/Bizarro> Retrieving 1 strip
Arcamax/Bizarro> ERROR: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"(/newspics/[^"]+)"[^>]*zoom[^>]*>'] not found at URL http://www.arcamax.com/thefunnies/bizarro/.
As stated in #55
LoadingArtist> Retrieving 1 strip
LoadingArtist> WARN: found 5 images instead of 1 at http://www.loadingartist.com/ with patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http\\:\\/\\/www\\.loadingartist\\.com\\/wp-content/uploads/\\d+/\\d+/[^"]+)"[^>]*[^>]*>']
LoadingArtist> WARN: choosing image http://www.loadingartist.com/wp-content/uploads/2016/06/garfield_01_large.jpg
LoadingArtist> Saved Comics/LoadingArtist/garfield_01_large.jpg (184.80KB).
As stated in #55
I think this could affect the non-german version of Gaia as well
ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http\\:\\/\\/www\\.sandraandwoo\\.com\\/gaiade\\/comics/\\d+-\\d+-\\d+-[^"]+)"[^>]*[^>]*>'] not found at URL http://www.sandraandwoo.com/gaiade/.
As stated in #55
ERROR: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"(/\\d+-\\d+-\\d+/)"[^>]*STR_Prev[^>]*>'] not found at URL http://dilbert.com/.
As stated in #55
ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(cartoons/strip_\\d+[^"]+)"[^>]*[^>]*>'] not found at URL http://ruthe.de/.
Hi!
I am getting an error when downloading any strips from Arcamax.com
Arcamax/BabyBlues> Retrieving 1 strip
Arcamax/BabyBlues> ERROR: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"(/newspics/[^"]+)"[^>]*zoom[^>]*>'] not found at URL http://www.arcamax.com/thefunnies/babyblues/.
Arcamax/Crankshaft> Retrieving 1 strip
Arcamax/Crankshaft> ERROR: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"(/newspics/[^"]+)"[^>]*zoom[^>]*>'] not found at URL http://www.arcamax.com/thefunnies/crankshaft/.
Arcamax/BeetleBailey> Retrieving 1 strip
Arcamax/BeetleBailey> ERROR: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"(/newspics/[^"]+)"[^>]*zoom[^>]*>'] not found at URL http://www.arcamax.com/thefunnies/beetlebailey/.
Is there an update I'm missing?
Thanks!
I seem to be having issues pulling down the xkcd strips, the last successful comic being on March 27.
I've just run dosage -v xkcd, and have this as the output:
xkcd> Retrieving 1 strip (including adult content)
xkcd> Get strip URL http://xkcd.com/1586/
xkcd> ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://imgs\\.xkcd\\.com/comics/[^"]+)"[^>]*[^>]*>'] not found at URL http://xkcd.com/1586/.
xkcd> File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
xkcd> self.__bootstrap_inner()
xkcd> File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
xkcd> self.run()
xkcd> File "/usr/local/lib/python2.7/dist-packages/dosagelib/director.py", line 82, in run
xkcd> self.getStrips(scraperobj)
xkcd> File "/usr/local/lib/python2.7/dist-packages/dosagelib/director.py", line 96, in getStrips
xkcd> self._getStrips(scraperobj)
xkcd> File "/usr/local/lib/python2.7/dist-packages/dosagelib/director.py", line 111, in _getStrips
xkcd> for strip in scraperobj.getStrips(numstrips):
xkcd> File "/usr/local/lib/python2.7/dist-packages/dosagelib/scraper.py", line 155, in getStrips
xkcd> for strip in self.getStripsFor(url, maxstrips):
xkcd> File "/usr/local/lib/python2.7/dist-packages/dosagelib/scraper.py", line 174, in getStripsFor
xkcd> out.exception(msg)
xkcd> File "/usr/local/lib/python2.7/dist-packages/dosagelib/output.py", line 65, in exception
xkcd> self.writelines(traceback.format_stack(), 1)
xkcd> File "/usr/local/lib/python2.7/dist-packages/dosagelib/scraper.py", line 111, in getComicStrip
xkcd> imageUrls = fetchUrls(url, data, baseUrl, self.imageSearch)
xkcd> File "/usr/local/lib/python2.7/dist-packages/dosagelib/util.py", line 245, in fetchUrls
xkcd> raise ValueError("Patterns %s not found at URL %s." % (patterns, url))
xkcd> ValueError: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://imgs\\.xkcd\\.com/comics/[^"]+)"[^>]*[^>]*>'] not found at URL http://xkcd.com/1586/.
Any ideas on how to fix?
As stated in #55
SafelyEndangered> Retrieving 1 strip
SafelyEndangered> WARN: found 3 images instead of 1 at http://www.safelyendangered.com/ with patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://www\\.safelyendangered\\.com/wp-content/uploads/\\d+/\\d+/[^"]+\\.[a-z]+).*"[^>]*[^>]*>']
SafelyEndangered> WARN: choosing image http://www.safelyendangered.com/wp-content/uploads/2016/01/menulogo-1.png
SafelyEndangered> ERROR: Pattern <\s*[iI][mM][gG]\s+[^>]*http://www\.safelyendangered\.com/wp-content/uploads[^>]*\s+[tT][iI][tT][lL][eE]\s*=\s*"([^"]+)"[^>]*[^>]*> not found at URL http://www.safelyendangered.com/.
...which is unfortunate, because this is the very first time I try using dosage, and I'm not sure how to fix it. I'm entirely open to doing so (and I'll probably blunder through on my own eventually, somehow), but a quick explanation of this error would help.
➜ comics dosage -a AtomicRobo
NuklearPower/AtomicRobo> Retrieving all strips
NuklearPower/AtomicRobo> ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://v\\.cdn\\.nuklearpower\\.com/comics/[^"]+)"[^>]*[^>]*>'] not found at URL http://www.nuklearpower.com/atomic-robo/.
NuklearPower/AtomicRobo> ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://v\\.cdn\\.nuklearpower\\.com/comics/[^"]+)"[^>]*[^>]*>'] not found at URL http://www.nuklearpower.com/2010/08/12/flight-of-the-terror-birds/.
NuklearPower/AtomicRobo> ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://v\\.cdn\\.nuklearpower\\.com/comics/[^"]+)"[^>]*[^>]*>'] not found at URL http://www.nuklearpower.com/2010/08/10/the-lizard-man/.
NuklearPower/AtomicRobo> ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://v\\.cdn\\.nuklearpower\\.com/comics/[^"]+)"[^>]*[^>]*>'] not found at URL http://www.nuklearpower.com/2010/08/03/rescue-mission/.
NuklearPower/AtomicRobo> ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://v\\.cdn\\.nuklearpower\\.com/comics/[^"]+)"[^>]*[^>]*>'] not found at URL http://www.nuklearpower.com/2010/07/31/the-getaway/.
NuklearPower/AtomicRobo> ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://v\\.cdn\\.nuklearpower\\.com/comics/[^"]+)"[^>]*[^>]*>'] not found at URL http://www.nuklearpower.com/2010/07/30/mxii/.
NuklearPower/AtomicRobo> ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://v\\.cdn\\.nuklearpower\\.com/comics/[^"]+)"[^>]*[^>]*>'] not found at URL http://www.nuklearpower.com/2009/10/05/the-yonkers-devil/.
NuklearPower/AtomicRobo> ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://v\\.cdn\\.nuklearpower\\.com/comics/[^"]+)"[^>]*[^>]*>'] not found at URL http://www.nuklearpower.com/2009/07/24/free-comic-book-day-2009/.
NuklearPower/AtomicRobo> ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://v\\.cdn\\.nuklearpower\\.com/comics/[^"]+)"[^>]*[^>]*>'] not found at URL http://www.nuklearpower.com/2008/07/18/free-comic-book-day-2008/.
NuklearPower/AtomicRobo> WARN: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"([^"]+)"[^>]*[^>]*>Previous'] not found at URL http://www.nuklearpower.com/2008/07/18/free-comic-book-day-2008/. Assuming no previous comic strips exist.
I'm guessing the cause is simply that they changed their CDN domain?
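To illustrate why a changed image host produces this error: the sketch below applies the image pattern from the log to two page snippets. The function find_image_urls and both HTML snippets are hypothetical (the second host name is invented), and the lookup is a simplified stand-in rather than dosage's own code:

```python
import re

# Image pattern copied from the error message above (unescaped from its repr).
IMAGE_PATTERN = re.compile(
    r'<\s*[iI][mM][gG]\s+(?:[^>]*\s+)?[sS][rR][cC]\s*=\s*'
    r'"(http://v\.cdn\.nuklearpower\.com/comics/[^"]+)"[^>]*[^>]*>')


def find_image_urls(html, url):
    """Return all image URLs matched by the pattern, or raise ValueError."""
    found = IMAGE_PATTERN.findall(html)
    if not found:
        raise ValueError("Patterns %s not found at URL %s."
                         % ([IMAGE_PATTERN.pattern], url))
    return found


# Hypothetical page snippets; the second host name is invented for illustration.
OLD_HTML = '<img src="http://v.cdn.nuklearpower.com/comics/robo.png" alt="">'
NEW_HTML = '<img src="http://cdn.example.invalid/comics/robo.png" alt="">'
```

Calling find_image_urls on OLD_HTML returns the image URL, while NEW_HTML raises the "Patterns ... not found" ValueError, because the host name is hard-coded in the pattern.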
Currently, the design of Dosage dictates that every comic the user can download is a subclass of Scraper. This makes writing many similar modules quite painful, since we need at least the class definition and some properties on the class. For example, comicfury.py is 4117 lines for 987 modules. In the future, I would like to migrate to a structure where a comic module is represented by an instance of the Scraper class, so one only needs one class for many similar comics. This would probably also help with creating "dynamic" modules, which would also make #27 easier.
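A rough sketch of the instance-based idea, under the assumption that a comic can be described by a few plain attributes. The Scraper class here is a simplified stand-in, and the URL scheme, XPath, and comic names are placeholders, not taken from dosage:

```python
class Scraper(object):
    """One instance describes one comic (hypothetical, simplified)."""

    def __init__(self, name, url, image_search):
        self.name = name
        self.url = url
        self.image_search = image_search


def comicfury_scrapers(comics):
    """Build one Scraper instance per (name, subdomain) pair."""
    return [
        Scraper("ComicFury/" + name,
                "http://%s.example.invalid/" % sub,  # placeholder URL scheme
                '//img[@id="comicimage"]')           # placeholder XPath
        for name, sub in comics
    ]


scrapers = comicfury_scrapers([("ExampleOne", "exampleone"),
                               ("ExampleTwo", "exampletwo")])
```

Each additional comic then costs one tuple instead of one class definition, which is where the line-count saving would come from.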
The code seems to be already there. Is there any reason NOT to add the option not to do everything in parallel?
I installed dosage on one of my machines and when it's run from /usr/local/bin/ it crashes:
<type 'exceptions.TypeError'> indirectStarter() takes exactly 1 argument (2 given)
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/dosagelib/cmd.py", line 322, in main
res = run(options)
File "/usr/local/lib/python2.7/dist-packages/dosagelib/cmd.py", line 238, in run
return director.getComics(options)
File "/usr/local/lib/python2.7/dist-packages/dosagelib/director.py", line 162, in getComics
options.adult, options.multimatch):
File "/usr/local/lib/python2.7/dist-packages/dosagelib/director.py", line 231, in getScrapers
found_scrapers = scraper.find_scrapers(name, multiple_allowed=multiple_allowed)
File "/usr/local/lib/python2.7/dist-packages/dosagelib/scraper.py", line 535, in find_scrapers
for scrapers in get_scrapers():
File "/usr/local/lib/python2.7/dist-packages/dosagelib/scraper.py", line 567, in get_scrapers
_scrapers = sorted([x() for x in plugins], key=lambda p: p.name)
File "/usr/local/lib/python2.7/dist-packages/dosagelib/loader.py", line 52, in get_plugins
for module in modules:
File "/usr/local/lib/python2.7/dist-packages/dosagelib/loader.py", line 40, in get_modules
yield importlib.import_module(name)
File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
__import__(name)
File "/usr/local/lib/python2.7/dist-packages/dosagelib/plugins/wordpress.py", line 39, in <module>
starter=indirectStarter('http://www.alicecomics.com/', '//a[text()="Latest Alice!"]'))
TypeError: indirectStarter() takes exactly 1 argument (2 given)
System info:
dosage 89cfd9d
Python 2.7.9 (default, Mar 1 2015, 12:57:24)
[GCC 4.9.2] on linux2
Local time: 2016-05-19 07:13:20+000
sys.argv ['/usr/local/bin/dosage', '-b', 'foo', 'dilbert']
LANG = 'en_US.UTF-8'
running dosage from ~/git/dosage/ works.
I tried scrapping the repo here and cloning it again but I get the same errors.
[~/dosage] > dosage whomp --all
Whomp> Retrieving all strips
Whomp> Skipping existing file "Comics/Whomp/1482121824-2016-12-19-The-Fray-Of-The-Dodo.jpg".
Whomp> WARN: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"(http\\:\\/\\/www\\.whompcomic\\.com\\/\\d+/\\d+/\\d+/[^"]+)"[^>]*navi-prev[^>]*>'] not found at URL http://www.whompcomic.com/. Assuming no previous comic strips exist.
As stated in #55
ERROR: Patterns ['<\\s*[aA]\\s+[^>]*btnPrev[^>]*\\s+[hH][rR][eE][fF]\\s*=\\s*"(http\\:\\/\\/penny\\-arcade\\.com\\/comic\\/[^"]+)"[^>]*[^>]*>'] not found at URL http://penny-arcade.com/comic/.
I forgot to specify "-o json" for a comic, and tried to generate it by rerunning dosage with the option, but with already downloaded comic files.
Currently an empty 'images' entry is stored in the JSON output when an image is not saved, e.g. because it already exists.
Should this be an option, to always output the scraped image name, even if the file already existed? It would save you from redownloading all image files again.
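A small sketch of what such an option could look like. The function record_image, its parameters, and the data layout are hypothetical and only loosely follow the description above, not dosage's actual JSON handler:

```python
import json


def record_image(pages, page_url, image_name, saved, always=False):
    """Record image_name under page_url in the dict pages.

    saved:  True if the file was downloaded in this run.
    always: if True, record the name even when the file already existed.
    """
    entry = pages.setdefault(page_url, {"images": []})
    if saved or always:
        entry["images"].append(image_name)
    return pages


# The file already existed, so nothing was saved in this run.
current = record_image({}, "http://example.com/1", "strip1.png", saved=False)
proposed = record_image({}, "http://example.com/1", "strip1.png",
                        saved=False, always=True)
print(json.dumps(proposed, indent=2))
```

Here current keeps the empty 'images' list described above, while proposed lists the file name without a new download.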
Hi,
I am running dosage with the following command so that it creates an HTML page per day: /usr/local/bin/dosage -o html -o rss @ --adult
Today I noticed that dosage crashes when I run this command a second time on the same day and the HTML page already exists:
********** Oops, I did it again. *************
You have found an internal error in dosage. Please write a bug report
at http://wummel.github.io/dosage/issues and include at least the information below:
Not disclosing some of the information below due to privacy reasons is ok.
I will try to help you nonetheless, but you have to give me something
I can work with ;) .
<type 'exceptions.ValueError'> output file '/home/simonszu/Comics/html/comics-20160222.html' already exists
Traceback (most recent call last):
File "/usr/local/bin/dosage", line 304, in main
res = run(options)
File "/usr/local/bin/dosage", line 227, in run
return director.getComics(options)
File "/usr/local/lib/python2.7/dist-packages/dosagelib/director.py", line 158, in getComics
events.getHandler().start()
File "/usr/local/lib/python2.7/dist-packages/dosagelib/events.py", line 307, in start
handler.start()
File "/usr/local/lib/python2.7/dist-packages/dosagelib/events.py", line 158, in start
raise ValueError('output file %r already exists' % fn)
ValueError: output file '/home/simonszu/Comics/html/comics-20160222.html' already exists
System info:
dosage 2.15
Python 2.7.9 (default, Mar 1 2015, 12:57:24)
[GCC 4.9.2] on linux2
Local time: 2016-02-22 14:05:43+002
sys.argv ['/usr/local/bin/dosage', '-o', 'html', '-o', 'rss', '@', '--adult']
LC_ALL = 'de_DE.UTF-8'
LC_CTYPE = 'de_DE.UTF-8'
LANG = 'de_DE.UTF-8'
******** dosage internal error, over and out ********
As stated in #55
ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://www\\.johnnywander\\.com/files/comics/[^"]+)"[^>]*[^>]*>'] not found at URL http://www.johnnywander.com/.
Dilbert> ERROR: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"(http://dilbert.com/strip/[0-9-]*)"[^>]*Click to see[^>]*>'] not found at URL http://dilbert.com/.
Traceback (most recent call last):
File "/home/vagrant/dosage/scripts/order-symlinks.py", line 74, in <module>
create_symlinks(d)
File "/home/vagrant/dosage/scripts/order-symlinks.py", line 49, in create_symlinks
latest = work = unseen[0]
TypeError: 'dict_keys' object does not support indexing
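This TypeError is what Python 3 reports when a dict view is indexed directly. A minimal sketch of the usual workaround, assuming unseen comes from a dictionary's keys() call (the script itself is not shown here, so this is a guess at the cause, and first_key is a hypothetical helper):

```python
def first_key(mapping):
    """Return the first key of a mapping on both Python 2 and Python 3."""
    keys = list(mapping.keys())  # dict views are not indexable on Python 3
    return keys[0]
```

Wrapping the view in list() restores the Python 2 behaviour, where keys() already returned a list.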
I just fixed a severe bug in robots.txt handling. This made me aware of a big chunk of comics that were already "blocked" by robots.txt (e.g. all of GoComics). I haven't checked all comics yet, but there might be quite a bit more that will now fail...
This brings me to the question: Am I right in the assumption that Dosage is a robot and has to adhere to robots.txt? (I think, yes.) Should we try to get authors/publishers to make exceptions for Dosage?
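For reference, a sketch of how a per-robot exception in robots.txt would be evaluated with Python 3's standard urllib.robotparser. The robots.txt content and the "Dosage" user-agent string are assumptions for illustration, not what any site or dosage actually uses:

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt with an exception for a "Dosage" user agent.
ROBOTS = """\
User-agent: Dosage
Allow: /

User-agent: *
Disallow: /
"""
```

With this file, allowed(ROBOTS, "Dosage", ...) is true while any other robot is refused, which is the kind of exception a publisher would have to add.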
Hi!
For some days I have been getting strange errors while downloading comics:
Arcamax/BabyBlues> Retrieving all strips
Arcamax/BabyBlues> Saved Comics/Arcamax/BabyBlues/1244528.gif (101.60KB).
Arcamax/BabyBlues> ERROR: Could not save image at http://www.arcamax.com/thefunnies/babyblues/ to 1244528: ValueError('Expecting object: line 1032 column 6 (char 74222)',)
Arcamax/BabyBlues> Stop retrieval because image file already exists
OnTheFastrack> Retrieving all strips
OnTheFastrack> Saved Comics/OnTheFastrack/July-26-2015.gif (141.31KB).
OnTheFastrack> ERROR: Could not save image at http://onthefastrack.com/ to July-26-2015: ValueError('Expecting object: line 17031 column 8 (char 903967)',)
OnTheFastrack> Stop retrieval because image file already exists
The images are downloaded, then the error occurs and the images are not included in the html or rss output.
Can anyone help?
TIA
Mark
Thank you for the dosage fork!
Uninstalled the dosage-2.15 package of Debian testing.
Installed dosage-master
Installed python-setuptools
Now Dilbert and KevinAndKell download correctly and show up in the resulting html file.
Still MotherGooseAndGrimm does not work, however.
Checking out my dosage manually:
$ dosage MotherGooseandGrimm:2015-07-20
Arcamax/MotherGooseAndGrimm> Retrieving 1 strip for index 2015-07-20
Arcamax/MotherGooseAndGrimm> ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[dD][aA][tT][aA][--][zZ][oO][oO][mM][--][iI][mM][aA][gG][eE]\\s*=\\s*"(/newspics/[^"]+)"[^>]*[^>]*>'] not found at URL http://www.arcamax.com/thefunnies/mothergooseandgrimm/2015-07-20.
Not life-threatening but would be nice to have.
Thanks again and greetings
Eike, Paraguay
The xkcd comic is not working for me, and it has been like this since around 19 March 2015. Is someone else experiencing this?
./dosage xkcd --adult
xkcd> Retrieving 1 strip (including adult content)
xkcd> ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://imgs\\.xkcd\\.com/comics/[^"]+)"[^>]*[^>]*>'] not found at URL http://xkcd.com/1510/.
I get the following pattern error when I run dosage-2.15 on xkcd. Other comics (e.g. CalvinAndHobbes) work fine.
$ dosage-2.15/dosage xkcd --adult
xkcd> Retrieving 1 strip (including adult content)
xkcd> ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://imgs\\.xkcd\\.com/comics/[^"]+)"[^>]*[^>]*>'] not found at URL http://xkcd.com/1688/.
As stated in #55
ERROR: Patterns ["<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*'(http\\:\\/\\/www\\.smbc\\-comics\\.com\\/comics/\\d{8}(?:\\w2?|-\\d)?\\.\\w{3})\\s*'[^>]*[^>]*>"] not found at URL http://www.smbc-comics.com/.
Looks like the url has changed...
Now it is: http://thedoghousediaries.com/dhdcomics/2015-04-13.png
Just wanted to share some thoughts.
When I started actively fixing broken comics I quickly realised a lot of them were hosted on Wordpress and grouped them together. Now there are almost 70 comics in there. As with most websites, a large number of webcomics are using CMSs.
For the fun of it, I've also made a webcrawler that searches for new webcomics (as they like to exchange links), but that probably doesn't belong in Dosage. If anybody wants to know more, drop me a line.
Not sure if this is the right place to log this.
I tried installing both the .exe and the script itself. With the .exe installed, running any dosage command gives this error:
File "dosage", line 18, in <module>
File "dosagelib\events.pyo", line 11, in <module>
File "dosagelib\util.pyo", line 17, in <module>
File "requests\__init__.pyo", line 58, in <module>
File "requests\utils.pyo", line 25, in <module>
File "requests\compat.pyo", line 7, in <module>
File "requests\packages\__init__.pyo", line 3, in <module>
File "requests\packages\urllib3\__init__.pyo", line 16, in <module>
File "requests\packages\urllib3\connectionpool.pyo", line 39, in <module>
File "requests\packages\urllib3\request.pyo", line 12, in <module>
File "requests\packages\urllib3\filepost.pyo", line 15, in <module>
File "requests\packages\urllib3\fields.pyo", line 7, in <module>
ImportError: No module named email.utils
When using the script, I have no idea where dosage gets installed, so I can't use it properly; I can't find it in any of the usual directories.
I am basically a Python/GitHub noob, so I apologize if this is not the right place to log this or if you have already fixed these issues. If so, is there a detailed tutorial available on how to install and run it properly?
Thank You
As stated in #55
I think this affects the non-german version of Sandra And Woo as well
ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http\\:\\/\\/www\\.sandraandwoo\\.com\\/woode\\/comics/\\d+-\\d+-\\d+-[^"]+)"[^>]*[^>]*>'] not found at URL http://www.sandraandwoo.com/woode/.
As stated in #55
ERROR: Patterns ['<\\s*[dD][iI][vV]\\s+(?:[^>]*\\s+)?[cC][lL][aA][sS][sS]\\s*=\\s*"box-content"[^>]*[^>]*>\\s*<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"[^"]+"[^>]*[^>]*>\\s*<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://kindofnormal\\.com/img/wumo/\\d+/\\d+/[^/"]+)"[^>]*[^>]*>'] not found at URL http://kindofnormal.com/wumo/.
Hey guys,
I have reinstalled dosage on a fresh host, and I noticed that several webcomic downloads fail because of an outdated regex or something. I think I will just create an issue for every comic, OK?
First: CyanideAndHappiness
ERROR: Patterns ['<\\s*[aA]\\s+[^>]*prev[^>]*\\s+[hH][rR][eE][fF]\\s*=\\s*"(/comics/\\d+/)"[^>]*[^>]*>'] not found at URL http://www.explosm.net/comics/.
As stated in #55
ERROR: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"(http\\:\\/\\/www\\.lfgcomic\\.com\\/page/[-0-9]+/)"[^>]*feature-item-link[^>]*>'] not found at URL http://www.lfgcomic.com/.
It's not really clear why this is failing. The cwd is empty to start with. I've tried both the 2.15 release and the current git master.
localhost Comics [0]$ /usr/src/dosage/dosage --adult --all --basepath . --no-downscale Oglaf
Oglaf> Retrieving all strips (including adult content)
Oglaf> Saved Comics/Oglaf/wax_loquacious.jpg (195.76KB).
Oglaf> Saved Comics/Oglaf/weathercock.jpg (207.85KB).
Oglaf> Saved Comics/Oglaf/upcycling.jpg (221.29KB).
Oglaf> Saved Comics/Oglaf/plenty.jpg (174.41KB).
Oglaf> Saved Comics/Oglaf/bachelorprince.jpg (202.84KB).
Oglaf> Saved Comics/Oglaf/spectrophilia.jpg (185.94KB).
Oglaf> Saved Comics/Oglaf/clumsyfetish.jpg (206.19KB).
Oglaf> Saved Comics/Oglaf/throne_of_heaven_Ni4CAm4.jpg (520.03KB).
Oglaf> Saved Comics/Oglaf/cultistfuck1_MNEdrri.jpg (502.72KB).
Oglaf> Saved Comics/Oglaf/cultistfuck2_QhHIAQD.jpg (514.69KB).
Oglaf> WARN: Already seen previous URL u'http://oglaf.com/could-happen/'
localhost Comics [0]$
Looks like they're no longer matching the pattern, though the main comic URLs appear to be unchanged - just the link to the actual image files has changed.
Example output:
# dosage --baseurl /comics/ -b /srv/www/htdocs/comics/ -o html -n 4 GoComics/CalvinandHobbes
MainThread> WARN: HTML output file '/srv/www/htdocs/comics/html/comics-20170109.html' already exists
MainThread> WARN: the page link of previous run will skip this file
MainThread> WARN: try to generate HTML output only once per day
GoComics/CalvinAndHobbes> ERROR: XPath //ul[@class="feature-nav"]//a[@class="prev"] not found at URL http://www.gocomics.com/calvinandhobbes.
As stated in #55
ERROR: Patterns ['src="(http\\:\\/\\/www\\.fonflatter\\.de\\/\\d+/fred_\\d+-\\d+-\\d+[^"]+)'] not found at URL http://www.fonflatter.de/.
As stated in #55
This is an example for all Arcamax webcomics which fail with similar errors
ERROR: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"(/newspics/[^"]+)"[^>]*zoom[^>]*>'] not found at URL http://www.arcamax.com/thefunnies/babyblues/.
Attempts to pull comics from gocomics fail now (suspect that the recent change in gocomics site broke it):
export PYTHONPATH=$HOME/lib/python; $HOME/bin/dosage -b $HOME/Comics --adult @
GoComics/NonSequitur> ERROR: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"(/[^"]+/\\d+/\\d+/\\d+)"[^>]*prev[^>]*>'] not found at URL http://www.gocomics.com/nonsequitur.
GoComics/9ChickweedLane> ERROR: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"(/[^"]+/\\d+/\\d+/\\d+)"[^>]*prev[^>]*>'] not found at URL http://www.gocomics.com/9chickweedlane.
GoComics/Luann> ERROR: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"(/[^"]+/\\d+/\\d+/\\d+)"[^>]*prev[^>]*>'] not found at URL http://www.gocomics.com/luann.
GoComics/Pickles> ERROR: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"(/[^"]+/\\d+/\\d+/\\d+)"[^>]*prev[^>]*>'] not found at URL http://www.gocomics.com/pickles.
GoComics/ArloandJanis> ERROR: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"(/[^"]+/\\d+/\\d+/\\d+)"[^>]*prev[^>]*>'] not found at URL http://www.gocomics.com/arloandjanis.
GoComics/CalvinandHobbes> ERROR: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"(/[^"]+/\\d+/\\d+/\\d+)"[^>]*prev[^>]*>'] not found at URL http://www.gocomics.com/calvinandhobbes.
GoComics/GetFuzzy> ERROR: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"(/[^"]+/\\d+/\\d+/\\d+)"[^>]*prev[^>]*>'] not found at URL http://www.gocomics.com/getfuzzy.
Arcamax/Zits> Retrieving 1 strip
Arcamax/Zits> ERROR: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"(/newspics/[^"]+)"[^>]*zoom[^>]*>'] not found at URL http://www.arcamax.com/thefunnies/zits/.
Dilbert> ERROR: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"(/\\d+-\\d+-\\d+/)"[^>]*STR_Prev[^>]*>'] not found at URL http://dilbert.com/.
xkcd> Retrieving 1 strip (including adult content)
xkcd> ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://imgs\\.xkcd\\.com/comics/[^"]+)"[^>]*[^>]*>'] not found at URL http://xkcd.com/1786/.
Module throws Error: Spinnerette> ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(comics/[^"]+)"[^>]*comic[^>]*>'] not found at URL http://www.spinnyverse.com/.
The url of the newest strip is "http://www.spinnyverse.com/comics/1433333331-2015-06-03.jpg"
I am using the latest master (commit 8ded28b), and I get this:
********** Oops, I did it again. *************
You have found an internal error in dosage. Please write a bug report
at https://github.com/webcomics/dosage/issues and include at least the information below:
Not disclosing some of the information below due to privacy reasons is ok.
I will try to help you nonetheless, but you have to give me something
I can work with ;) .
<type 'exceptions.TypeError'> indirectStarter() takes exactly 1 argument (2 given)
Traceback (most recent call last):
File "/home/usr/.local/lib/python2.7/site-packages/dosagelib/cmd.py", line 322, in main
res = run(options)
File "/home/usr/.local/lib/python2.7/site-packages/dosagelib/cmd.py", line 238, in run
return director.getComics(options)
File "/home/usr/.local/lib/python2.7/site-packages/dosagelib/director.py", line 162, in getComics
options.adult, options.multimatch):
File "/home/usr/.local/lib/python2.7/site-packages/dosagelib/director.py", line 231, in getScrapers
found_scrapers = scraper.find_scrapers(name, multiple_allowed=multiple_allowed)
File "/home/usr/.local/lib/python2.7/site-packages/dosagelib/scraper.py", line 535, in find_scrapers
for scrapers in get_scrapers():
File "/home/usr/.local/lib/python2.7/site-packages/dosagelib/scraper.py", line 566, in get_scrapers
plugins = list(loader.get_plugins(modules, Scraper))
File "/home/usr/.local/lib/python2.7/site-packages/dosagelib/loader.py", line 52, in get_plugins
for module in modules:
File "/home/usr/.local/lib/python2.7/site-packages/dosagelib/loader.py", line 40, in get_modules
yield importlib.import_module(name)
File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
__import__(name)
File "/home/usr/.local/lib/python2.7/site-packages/dosagelib/plugins/wordpress.py", line 25, in <module>
add(name, 'http://hijinksensue.com/', starter=indirectStarter('http://hijinksensue.com/', starterXPath))
TypeError: indirectStarter() takes exactly 1 argument (2 given)
System info:
dosage 2.15.1.dev368
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Local time: 2016-06-03 08:00:49+001
sys.argv ['/home/usr/.local/bin/dosage', '--adult', '-o', 'html', '--baseurl', 'http://www.kierun.org/Comics/', '--no-downscale', 'Anythingaboutnothing', 'Carciphona', 'CtrlAltDel', 'CtrlAltDel/Sillies', 'CyanideAndHappiness', 'Dilbert', 'DresdenCodak', 'FowlLanguage', 'GirlGenius', 'IAmArg', 'Lackadaisy', 'LoadingArtist', 'MegaTokyo', 'MonsieurLeChien', 'NoNeedForBushido', 'Oglaf', 'PHDComics', 'PennyArcade', 'RedMeat', 'RomanticallyApocalyptic', 'QuestionableContent', 'ScandinaviaAndTheWorld', 'SMBC', 'Sithrah', 'Sinfest', 'StandStillStaySilent', 'SluggyFreelance', 'SomethingPositive', 'Spinnerette', 'TwoGuysAndGuy', 'xkcd', 'ZenPencils']
LANGUAGE = 'en_GB.UTF-8'
LANG = 'en_GB.UTF-8'
******** dosage internal error, over and out ********
In 86b31dc I replaced the static language list with the "correct" dependency to pycountry. The problem is that
I will probably go ahead and revert most of 86b31dc, or just use pycountry at runtime if it is already installed, falling back to a minimal internal list. Other ideas?
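The second option could look roughly like the sketch below. It assumes a recent pycountry release whose languages.get accepts an alpha_2 keyword; FALLBACK_LANGUAGES and language_name are hypothetical names, and the fallback list is an illustrative subset, not the list that was removed in 86b31dc:

```python
# Minimal internal fallback list (illustrative subset only).
FALLBACK_LANGUAGES = {
    "de": "German",
    "en": "English",
    "es": "Spanish",
    "fr": "French",
}

try:
    import pycountry  # optional dependency, used only if already installed
except ImportError:
    pycountry = None


def language_name(code):
    """Return the English name for an ISO 639-1 code, or the code itself."""
    if pycountry is not None:
        try:
            lang = pycountry.languages.get(alpha_2=code)
        except (LookupError, TypeError):
            # Assumption: older pycountry releases may reject this lookup.
            lang = None
        if lang is not None:
            return lang.name
    return FALLBACK_LANGUAGES.get(code, code)
```

This keeps pycountry out of the hard requirements while still using its full list when it happens to be available.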
As stated in #55
ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(/static/antville/ahoipolloi/images/[^"]+)"[^>]*[^>]*>'] not found at URL http://ahoipolloi.blogger.de/.
hey there.
i'm on linux mint 18 with fresh installed python 2.7 and pip.
when running dosage i get these errors:
Traceback (most recent call last):
File "/usr/local/bin/dosage", line 18, in <module>
from dosagelib import events, configuration, singleton, director
File "/usr/local/lib/python2.7/dist-packages/dosagelib/events.py", line 11, in <module>
from . import rss, util, configuration
File "/usr/local/lib/python2.7/dist-packages/dosagelib/util.py", line 17, in <module>
import requests
ImportError: No module named requests
It should be possible to rename comic modules and somehow keep the old name around to redirect users of the old module to the new module...
We could put those into a special "deprecated" module, so it's easy to remove them later.
We might also do this for modules that once were part of Dosage but were removed later (site disappeared, comic was deleted, crawling is now blocked), so we can inform users about the reason why we don't support their favorite module anymore.
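One way to picture the "deprecated" module idea is a simple lookup table. Everything in this sketch is hypothetical: the DEPRECATED registry, the resolve_module helper, and the comic names are invented for illustration:

```python
# Hypothetical registry: old module name -> (new name or None, reason).
DEPRECATED = {
    "OldComicName": ("NewComicName", "module was renamed"),
    "VanishedComic": (None, "site disappeared"),
}


def resolve_module(name, warn=print):
    """Map a possibly deprecated module name to its current name.

    Returns the current name, or None if the module was removed.
    """
    if name not in DEPRECATED:
        return name
    new_name, reason = DEPRECATED[name]
    if new_name is None:
        warn("%s is no longer supported: %s" % (name, reason))
    else:
        warn("%s is deprecated (%s), using %s" % (name, reason, new_name))
    return new_name
```

Keeping all entries in one table makes them easy to drop later, and the reason string gives users the explanation mentioned above.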
As stated in #55
ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(strips/[F,f]rueh[_]?[S,s]trip_\\d+.jpg)"[^>]*[^>]*>'] not found at URL http://www.die-fruehreifen.de/index.php.
Some comics have some description text in our code. It's pretty useless for the general operation of Dosage itself and I would like to remove all descriptions (and all related code). Comments?
As stated in #55
ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://imgs\\.xkcd\\.com/comics/[^"]+)"[^>]*[^>]*>'] not found at URL http://xkcd.com/1742/.