webcomics / dosage

dosage is a comic strip downloader and archiver

Home Page: https://dosage.rocks/

License: MIT License

Python 99.78% Shell 0.22% Batchfile 0.01%

comic-downloader hacktoberfest scraper webcomic

dosage's Issues

Restore py2exe build

We should aim for a single-exe build for Windows, kinda like youtube-dl. There is a new version of py2exe that works with Python 3: https://pypi.python.org/pypi/py2exe/

Installation of new version with pip

When I install dosage with pip:

pip install dosage

it still installs version 2.15.

Could you please update it, so that it will install this (new) package? Thanks!!

Error downloading

running ./dosage -v DumbingOfAge yields

DumbingOfAge> Get strip URL http://www.dumbingofage.com/2016/comic/book-6/02-that-perfect-girl/leverage-2/
DumbingOfAge> ERROR: URL content of http://www.dumbingofage.com/2016/comic/book-6/02-that-perfect-girl/leverage-2/ with 2561571 bytes exceeds 2097152 bytes.
DumbingOfAge>   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 783, in __bootstrap
DumbingOfAge>     self.__bootstrap_inner()
DumbingOfAge>   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
DumbingOfAge>     self.run()
DumbingOfAge>   File "/Users/jschpp/exclude_from_backup/src/dosage/dosagelib/director.py", line 82, in run
DumbingOfAge>     self.getStrips(scraperobj)
DumbingOfAge>   File "/Users/jschpp/exclude_from_backup/src/dosage/dosagelib/director.py", line 96, in getStrips
DumbingOfAge>     self._getStrips(scraperobj)
DumbingOfAge>   File "/Users/jschpp/exclude_from_backup/src/dosage/dosagelib/director.py", line 123, in _getStrips
DumbingOfAge>     out.exception(msg)
DumbingOfAge>   File "/Users/jschpp/exclude_from_backup/src/dosage/dosagelib/output.py", line 65, in exception
DumbingOfAge>     self.writelines(traceback.format_stack(), 1)
DumbingOfAge>   File "/Users/jschpp/exclude_from_backup/src/dosage/dosagelib/scraper.py", line 159, in getStrips
DumbingOfAge>     for strip in self.getStripsFor(url, maxstrips):
DumbingOfAge>   File "/Users/jschpp/exclude_from_backup/src/dosage/dosagelib/scraper.py", line 169, in getStripsFor
DumbingOfAge>     data = self.getPage(url)
DumbingOfAge>   File "/Users/jschpp/exclude_from_backup/src/dosage/dosagelib/scraper.py", line 353, in getPage
DumbingOfAge>     content = getPageContent(url, cls.session)
DumbingOfAge>   File "/Users/jschpp/exclude_from_backup/src/dosage/dosagelib/util.py", line 187, in getPageContent
DumbingOfAge>     page = urlopen(url, session, max_content_bytes=max_content_bytes)
DumbingOfAge>   File "/Users/jschpp/exclude_from_backup/src/dosage/dosagelib/util.py", line 314, in urlopen
DumbingOfAge>     check_content_size(url, req.headers, max_content_bytes)
DumbingOfAge>   File "/Users/jschpp/exclude_from_backup/src/dosage/dosagelib/util.py", line 333, in check_content_size
DumbingOfAge>     raise IOError(msg)
DumbingOfAge> IOError: URL content of http://www.dumbingofage.com/2016/comic/book-6/02-that-perfect-girl/leverage-2/ with 2561571 bytes exceeds 2097152 bytes.

changing util.py:35 from 2MB to 3MB should solve this
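
The numbers in the log are consistent with a 2 MiB cap: 2097152 is 2 × 1024 × 1024, the page is 2561571 bytes, and a 3 MiB cap would be 3145728 bytes. Below is a minimal sketch of such a size check, not dosage's actual code. The function name and message format follow the traceback; the constant name `MAX_CONTENT_BYTES` and the header handling are assumptions.

```python
# Hypothetical constant; 3 MiB = 3145728 bytes.
MAX_CONTENT_BYTES = 3 * 1024 * 1024


def check_content_size(url, headers, max_content_bytes=MAX_CONTENT_BYTES):
    """Raise IOError if the Content-Length header exceeds the cap."""
    # Assumption: headers behaves like a dict with lowercase keys.
    size = int(headers.get("content-length", 0))
    if size > max_content_bytes:
        raise IOError(
            "URL content of %s with %d bytes exceeds %d bytes."
            % (url, size, max_content_bytes)
        )
    return size
```

With the raised cap, the 2561571-byte page from the report would pass, while the old 2 MiB limit reproduces the error text above.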

Arcamax comics seem to be broken

Looks like they might have updated the site and the URLs have changed. Running from a git pull from a few minutes ago.

Example of the error I get:

Arcamax/Bizarro> Retrieving 1 strip
Arcamax/Bizarro> ERROR: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"(/newspics/[^"]+)"[^>]*zoom[^>]*>'] not found at URL http://www.arcamax.com/thefunnies/bizarro/.

LoadingArtist download fails

As stated in #55

LoadingArtist> Retrieving 1 strip
LoadingArtist> WARN: found 5 images instead of 1 at http://www.loadingartist.com/ with patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http\\:\\/\\/www\\.loadingartist\\.com\\/wp-content/uploads/\\d+/\\d+/[^"]+)"[^>]*[^>]*>']
LoadingArtist> WARN: choosing image http://www.loadingartist.com/wp-content/uploads/2016/06/garfield_01_large.jpg
LoadingArtist> Saved Comics/LoadingArtist/garfield_01_large.jpg (184.80KB).

GaiaGerman download fails

As stated in #55

I think this could affect the non-german version of Gaia as well

ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http\\:\\/\\/www\\.sandraandwoo\\.com\\/gaiade\\/comics/\\d+-\\d+-\\d+-[^"]+)"[^>]*[^>]*>'] not found at URL http://www.sandraandwoo.com/gaiade/.

Dilbert download fails

As stated in #55

ERROR: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"(/\\d+-\\d+-\\d+/)"[^>]*STR_Prev[^>]*>'] not found at URL http://dilbert.com/.

Ruthe download fails

As stated in #55

ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(cartoons/strip_\\d+[^"]+)"[^>]*[^>]*>'] not found at URL http://ruthe.de/.

Arcamax strips not downloading

Hi!

I am getting an error when downloading any strips from Arcamax.com

Arcamax/BabyBlues> Retrieving 1 strip
Arcamax/BabyBlues> ERROR: Patterns ['<\s*[aA]\s+(?:[^>]*\s+)?[hH][rR][eE][fF]\s*=\s*"(/newspics/[^"]+)"[^>]*zoom[^>]*>'] not found at URL http://www.arcamax.com/thefunnies/babyblues/.
Arcamax/Crankshaft> Retrieving 1 strip
Arcamax/Crankshaft> ERROR: Patterns ['<\s*[aA]\s+(?:[^>]*\s+)?[hH][rR][eE][fF]\s*=\s*"(/newspics/[^"]+)"[^>]*zoom[^>]*>'] not found at URL http://www.arcamax.com/thefunnies/crankshaft/.
Arcamax/BeetleBailey> Retrieving 1 strip
Arcamax/BeetleBailey> ERROR: Patterns ['<\s*[aA]\s+(?:[^>]*\s+)?[hH][rR][eE][fF]\s*=\s*"(/newspics/[^"]+)"[^>]*zoom[^>]*>'] not found at URL http://www.arcamax.com/thefunnies/beetlebailey/.

Is there an update I'm missing?

Thanks!

Problem with xkcd

I seem to be having issues pulling down the xkcd strips, the last successful comic being on March 27.

I've just run dosage -v xkcd, and have this as the output:

xkcd> Retrieving 1 strip (including adult content)
xkcd> Get strip URL http://xkcd.com/1586/
xkcd> ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://imgs\\.xkcd\\.com/comics/[^"]+)"[^>]*[^>]*>'] not found at URL http://xkcd.com/1586/.
xkcd>   File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
xkcd>     self.__bootstrap_inner()
xkcd>   File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
xkcd>     self.run()
xkcd>   File "/usr/local/lib/python2.7/dist-packages/dosagelib/director.py", line 82, in run
xkcd>     self.getStrips(scraperobj)
xkcd>   File "/usr/local/lib/python2.7/dist-packages/dosagelib/director.py", line 96, in getStrips
xkcd>     self._getStrips(scraperobj)
xkcd>   File "/usr/local/lib/python2.7/dist-packages/dosagelib/director.py", line 111, in _getStrips
xkcd>     for strip in scraperobj.getStrips(numstrips):
xkcd>   File "/usr/local/lib/python2.7/dist-packages/dosagelib/scraper.py", line 155, in getStrips
xkcd>     for strip in self.getStripsFor(url, maxstrips):
xkcd>   File "/usr/local/lib/python2.7/dist-packages/dosagelib/scraper.py", line 174, in getStripsFor
xkcd>     out.exception(msg)
xkcd>   File "/usr/local/lib/python2.7/dist-packages/dosagelib/output.py", line 65, in exception
xkcd>     self.writelines(traceback.format_stack(), 1)
xkcd>   File "/usr/local/lib/python2.7/dist-packages/dosagelib/scraper.py", line 111, in getComicStrip
xkcd>     imageUrls = fetchUrls(url, data, baseUrl, self.imageSearch)
xkcd>   File "/usr/local/lib/python2.7/dist-packages/dosagelib/util.py", line 245, in fetchUrls
xkcd>     raise ValueError("Patterns %s not found at URL %s." % (patterns, url))
xkcd> ValueError: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://imgs\\.xkcd\\.com/comics/[^"]+)"[^>]*[^>]*>'] not found at URL http://xkcd.com/1586/.

Any ideas on how to fix?

SafelyEndangered download fails

As stated in #55

SafelyEndangered> Retrieving 1 strip
SafelyEndangered> WARN: found 3 images instead of 1 at http://www.safelyendangered.com/ with patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://www\\.safelyendangered\\.com/wp-content/uploads/\\d+/\\d+/[^"]+\\.[a-z]+).*"[^>]*[^>]*>']
SafelyEndangered> WARN: choosing image http://www.safelyendangered.com/wp-content/uploads/2016/01/menulogo-1.png
SafelyEndangered> ERROR: Pattern <\s*[iI][mM][gG]\s+[^>]*http://www\.safelyendangered\.com/wp-content/uploads[^>]*\s+[tT][iI][tT][lL][eE]\s*=\s*"([^"]+)"[^>]*[^>]*> not found at URL http://www.safelyendangered.com/.

AtomicRobo doesn't download...

...which is unfortunate, because this is the very first time I try using dosage, and I'm not sure how to fix it. I'm entirely open to doing so (and I'll probably blunder through on my own eventually, somehow), but a quick explanation of this error would help.

➜  comics  dosage -a AtomicRobo
NuklearPower/AtomicRobo> Retrieving all strips
NuklearPower/AtomicRobo> ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://v\\.cdn\\.nuklearpower\\.com/comics/[^"]+)"[^>]*[^>]*>'] not found at URL http://www.nuklearpower.com/atomic-robo/.
NuklearPower/AtomicRobo> ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://v\\.cdn\\.nuklearpower\\.com/comics/[^"]+)"[^>]*[^>]*>'] not found at URL http://www.nuklearpower.com/2010/08/12/flight-of-the-terror-birds/.
NuklearPower/AtomicRobo> ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://v\\.cdn\\.nuklearpower\\.com/comics/[^"]+)"[^>]*[^>]*>'] not found at URL http://www.nuklearpower.com/2010/08/10/the-lizard-man/.
NuklearPower/AtomicRobo> ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://v\\.cdn\\.nuklearpower\\.com/comics/[^"]+)"[^>]*[^>]*>'] not found at URL http://www.nuklearpower.com/2010/08/03/rescue-mission/.
NuklearPower/AtomicRobo> ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://v\\.cdn\\.nuklearpower\\.com/comics/[^"]+)"[^>]*[^>]*>'] not found at URL http://www.nuklearpower.com/2010/07/31/the-getaway/.
NuklearPower/AtomicRobo> ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://v\\.cdn\\.nuklearpower\\.com/comics/[^"]+)"[^>]*[^>]*>'] not found at URL http://www.nuklearpower.com/2010/07/30/mxii/.
NuklearPower/AtomicRobo> ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://v\\.cdn\\.nuklearpower\\.com/comics/[^"]+)"[^>]*[^>]*>'] not found at URL http://www.nuklearpower.com/2009/10/05/the-yonkers-devil/.
NuklearPower/AtomicRobo> ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://v\\.cdn\\.nuklearpower\\.com/comics/[^"]+)"[^>]*[^>]*>'] not found at URL http://www.nuklearpower.com/2009/07/24/free-comic-book-day-2009/.
NuklearPower/AtomicRobo> ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://v\\.cdn\\.nuklearpower\\.com/comics/[^"]+)"[^>]*[^>]*>'] not found at URL http://www.nuklearpower.com/2008/07/18/free-comic-book-day-2008/.
NuklearPower/AtomicRobo> WARN: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"([^"]+)"[^>]*[^>]*>Previous'] not found at URL http://www.nuklearpower.com/2008/07/18/free-comic-book-day-2008/. Assuming no previous comic strips exist.

I'm guessing the cause is simply that they changed their CDN domain?
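
As a quick explanation of the error: "Patterns ... not found" means the module's image regex matched nothing in the fetched HTML. The sketch below reproduces that failure with the pattern from the log. The HTML snippets and the host `cdn.example.com` are made up for illustration, and `fetch_urls` is a simplified, hypothetical stand-in, not dosage's own function.

```python
import re

# Image pattern copied from the AtomicRobo error message above.
IMAGE_SEARCH = re.compile(
    r'<\s*[iI][mM][gG]\s+(?:[^>]*\s+)?[sS][rR][cC]\s*=\s*'
    r'"(http://v\.cdn\.nuklearpower\.com/comics/[^"]+)"[^>]*[^>]*>'
)


def fetch_urls(url, data, pattern):
    """Return every captured URL, or raise like the error in the log."""
    found = pattern.findall(data)
    if not found:
        raise ValueError(
            "Patterns [%r] not found at URL %s." % (pattern.pattern, url)
        )
    return found


# Made-up HTML: the first image uses the old host, the second a different one.
old_html = '<img src="http://v.cdn.nuklearpower.com/comics/robo.jpg" alt="">'
new_html = '<img src="http://cdn.example.com/comics/robo.jpg" alt="">'
```

`fetch_urls(page, old_html, IMAGE_SEARCH)` returns the image URL, while the same call with `new_html` raises the `ValueError`. If the CDN guess is right, updating the host inside the pattern would be the fix.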

Plugin structure

Currently, the design of Dosage dictates that every comic the user can download is a subclass of Scraper. This makes writing many similar modules quite painful, since we need at least the class definition and some properties on the class. For example, comicfury.py is 4117 lines for 987 modules. In the future, I would like to migrate to a structure, where a comic module is represented by an instance of the Scraper class, so one only needs one class for many similar comics. This would probably also help creating "dynamic" modules, which would also make #27 easier.
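
A rough sketch of what the instance-based structure could look like. Everything here is hypothetical: the constructor arguments, the `comicfury` factory, the placeholder host and the XPath strings are illustrative assumptions, not the existing Scraper API.

```python
class Scraper:
    """One generic scraper; each comic is an instance, not a subclass."""

    def __init__(self, name, url, image_search, prev_search):
        self.name = name
        self.url = url
        self.image_search = image_search
        self.prev_search = prev_search


def comicfury(name, sub):
    """Hypothetical factory for comics that share one site layout."""
    return Scraper(
        name="ComicFury/" + name,
        url="http://%s.example.com/" % sub,  # placeholder host
        image_search='//img[@id="comic"]',   # placeholder XPath
        prev_search='//a[@rel="prev"]',      # placeholder XPath
    )


# Each comic becomes one line of data instead of a class definition.
SCRAPERS = [
    comicfury("ComicA", "comica"),
    comicfury("ComicB", "comicb"),
]
```

Since the list is plain data, it could also be built at runtime, which is the "dynamic" module idea mentioned above.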

Fetching comics in parallel

The code seems to be already there. Is there any reason NOT to add an option to do everything in parallel?
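
For discussion, a minimal sketch of such an option using the standard library's thread pool. It does not use dosage's existing threading code; `fetch_comic`, `fetch_all` and the `parallel` parameter are hypothetical names, and the download itself is a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_comic(name):
    """Placeholder for the per-comic download; returns a status string."""
    return "%s> done" % name


def fetch_all(names, parallel=1):
    """Fetch comics with up to `parallel` worker threads, keeping input order."""
    if parallel <= 1:
        return [fetch_comic(n) for n in names]
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        # pool.map yields results in the order of `names`.
        return list(pool.map(fetch_comic, names))
```

One possible reason to keep the worker count capped rather than unlimited is to avoid sending many simultaneous requests to the same host.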

dosage crashes when not run from git checkout

I installed dosage on one of my machines and when it's run from /usr/local/bin/ it crashes:

<type 'exceptions.TypeError'> indirectStarter() takes exactly 1 argument (2 given)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dosagelib/cmd.py", line 322, in main
    res = run(options)
  File "/usr/local/lib/python2.7/dist-packages/dosagelib/cmd.py", line 238, in run
    return director.getComics(options)
  File "/usr/local/lib/python2.7/dist-packages/dosagelib/director.py", line 162, in getComics
    options.adult, options.multimatch):
  File "/usr/local/lib/python2.7/dist-packages/dosagelib/director.py", line 231, in getScrapers
    found_scrapers = scraper.find_scrapers(name, multiple_allowed=multiple_allowed)
  File "/usr/local/lib/python2.7/dist-packages/dosagelib/scraper.py", line 535, in find_scrapers
    for scrapers in get_scrapers():
  File "/usr/local/lib/python2.7/dist-packages/dosagelib/scraper.py", line 567, in get_scrapers
    _scrapers = sorted([x() for x in plugins], key=lambda p: p.name)
  File "/usr/local/lib/python2.7/dist-packages/dosagelib/loader.py", line 52, in get_plugins
    for module in modules:
  File "/usr/local/lib/python2.7/dist-packages/dosagelib/loader.py", line 40, in get_modules
    yield importlib.import_module(name)
  File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/usr/local/lib/python2.7/dist-packages/dosagelib/plugins/wordpress.py", line 39, in <module>
    starter=indirectStarter('http://www.alicecomics.com/', '//a[text()="Latest Alice!"]'))
TypeError: indirectStarter() takes exactly 1 argument (2 given)
System info:
dosage 89cfd9d
Python 2.7.9 (default, Mar  1 2015, 12:57:24) 
[GCC 4.9.2] on linux2
Local time: 2016-05-19 07:13:20+000
sys.argv ['/usr/local/bin/dosage', '-b', 'foo', 'dilbert']
LANG = 'en_US.UTF-8'

running dosage from ~/git/dosage/ works.

I tried scrapping the repo here and cloning it again but I get the same errors.

Whomp download fails

[~/dosage] > dosage whomp --all
Whomp> Retrieving all strips
Whomp> Skipping existing file "Comics/Whomp/1482121824-2016-12-19-The-Fray-Of-The-Dodo.jpg".
Whomp> WARN: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"(http\\:\\/\\/www\\.whompcomic\\.com\\/\\d+/\\d+/\\d+/[^"]+)"[^>]*navi-prev[^>]*>'] not found at URL http://www.whompcomic.com/. Assuming no previous comic strips exist.

PennyArcade download fails

As stated in #55

ERROR: Patterns ['<\\s*[aA]\\s+[^>]*btnPrev[^>]*\\s+[hH][rR][eE][fF]\\s*=\\s*"(http\\:\\/\\/penny\\-arcade\\.com\\/comic\\/[^"]+)"[^>]*[^>]*>'] not found at URL http://penny-arcade.com/comic/.

Image name not saved in JSON when already downloaded

I forgot to specify "-o json" for a comic, and tried to generate it by rerunning dosage with the option, but with already downloaded comic files.

Currently an empty 'images' entry is stored in the JSON output when an image is not saved, e.g. because it already exists.

Should this be an option, to always output the scraped image name, even if the file already existed? It would save you from redownloading all image files again.
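
To make the proposal concrete, a small sketch of the suggested behavior. The function, its parameters and the entry layout are assumptions for illustration; only the 'images' key comes from the description above.

```python
def record_image(entry, filename, already_on_disk, always_record=True):
    """Add `filename` to entry["images"] unless recording is suppressed."""
    # Current behavior corresponds to always_record=False:
    # existing files are skipped and leave 'images' empty.
    if always_record or not already_on_disk:
        entry.setdefault("images", []).append(filename)
    return entry
```

With `always_record=True`, rerunning over an existing archive would still list every scraped image name without downloading anything again.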

dosage crashes when run a second time and the desired html file already exists

Hi,

I am running dosage with the following command so that it creates an HTML page per day: /usr/local/bin/dosage -o html -o rss @ --adult.

Today I noticed that dosage crashes when I run this command a second time on the same day and the HTML page already exists:


********** Oops, I did it again. *************

You have found an internal error in dosage. Please write a bug report
at http://wummel.github.io/dosage/issues and include at least the information below:

Not disclosing some of the information below due to privacy reasons is ok.
I will try to help you nonetheless, but you have to give me something
I can work with ;) .

<type 'exceptions.ValueError'> output file '/home/simonszu/Comics/html/comics-20160222.html' already exists
Traceback (most recent call last):
  File "/usr/local/bin/dosage", line 304, in main
    res = run(options)
  File "/usr/local/bin/dosage", line 227, in run
    return director.getComics(options)
  File "/usr/local/lib/python2.7/dist-packages/dosagelib/director.py", line 158, in getComics
    events.getHandler().start()
  File "/usr/local/lib/python2.7/dist-packages/dosagelib/events.py", line 307, in start
    handler.start()
  File "/usr/local/lib/python2.7/dist-packages/dosagelib/events.py", line 158, in start
    raise ValueError('output file %r already exists' % fn)
ValueError: output file '/home/simonszu/Comics/html/comics-20160222.html' already exists
System info:
dosage 2.15
Python 2.7.9 (default, Mar  1 2015, 12:57:24) 
[GCC 4.9.2] on linux2
Local time: 2016-02-22 14:05:43+002
sys.argv ['/usr/local/bin/dosage', '-o', 'html', '-o', 'rss', '@', '--adult']
LC_ALL = 'de_DE.UTF-8'
LC_CTYPE = 'de_DE.UTF-8'
LANG = 'de_DE.UTF-8'

 ******** dosage internal error, over and out ********
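
One way to avoid the crash would be to warn and carry on when the file is already there, instead of raising. A minimal sketch under that assumption; it is not the code in dosagelib/events.py, and `check_output_file` and its `log` parameter are hypothetical names.

```python
import os


def check_output_file(fn, log=print):
    """Return True if `fn` is free; warn and return False if it exists."""
    if os.path.exists(fn):
        log("WARN: HTML output file %r already exists" % fn)
        return False
    return True
```

The caller could then skip writing the HTML page for that run, or pick another file name, rather than aborting the whole download.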

JohnnyWander download fails

As stated in #55

ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://www\\.johnnywander\\.com/files/comics/[^"]+)"[^>]*[^>]*>'] not found at URL http://www.johnnywander.com/.

Dilbert downloader broken

Dilbert> ERROR: Patterns ['<\s*[aA]\s+(?:[^>]*\s+)?[hH][rR][eE][fF]\s*=\s*"(http://dilbert.com/strip/[0-9-]*)"[^>]*Click to see[^>]*>'] not found at URL http://dilbert.com/.

order-symlinks.py throws error on Python 3

Traceback (most recent call last):
  File "/home/vagrant/dosage/scripts/order-symlinks.py", line 74, in <module>
    create_symlinks(d)
  File "/home/vagrant/dosage/scripts/order-symlinks.py", line 49, in create_symlinks
    latest = work = unseen[0]
TypeError: 'dict_keys' object does not support indexing
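
The error arises because `dict.keys()` returns a view on Python 3, and views cannot be indexed. A minimal sketch of the usual fix, with made-up data; only the names `unseen`, `latest` and `work` come from the traceback, and how the script actually builds `unseen` is an assumption.

```python
comics = {"xkcd": 3, "Dilbert": 1}  # made-up data
unseen = comics.keys()              # a dict_keys view on Python 3

# unseen[0] raises TypeError on Python 3; materialize the view first.
unseen = list(unseen)
latest = work = unseen[0]
```

Wrapping the view in `list()` behaves the same on Python 2, where `keys()` already returns a list.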

robots.txt handling...

I just fixed a severe bug in robots.txt handling. This made me aware of a big chunk of comics that were already "blocked" by robots.txt (e.g. all of GoComics). I haven't checked all comics yet, but there might be quite a bit more that will now fail...

This brings me to the question: Am I right in the assumption that Dosage is a robot and has to adhere to robots.txt? (I think, yes.) Should we try to get authors/publishers to make exceptions for Dosage?
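
To illustrate what such an exception would look like from the crawler's side, here is a small sketch using Python 3's standard `urllib.robotparser`. The robots.txt content, the user agent string "Dosage" and the helper name `allowed` are assumptions for the example, not what dosage or any site actually uses.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt: block every robot, but make an exception for one agent.
ROBOTS = """\
User-agent: *
Disallow: /

User-agent: Dosage
Allow: /
"""


def allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Here `allowed(ROBOTS, "Dosage", url)` is true while any other agent is refused, so a publisher could opt in with a two-line rule.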

Dosage fails with cryptic errors when dosage.json is unreadable

Hi!

For some days I have been getting strange errors while downloading comics:

Arcamax/BabyBlues> Retrieving all strips
Arcamax/BabyBlues> Saved Comics/Arcamax/BabyBlues/1244528.gif (101.60KB).
Arcamax/BabyBlues> ERROR: Could not save image at http://www.arcamax.com/thefunnies/babyblues/ to 1244528: ValueError('Expecting object: line 1032 column 6 (char 74222)',)
Arcamax/BabyBlues> Stop retrieval because image file already exists
OnTheFastrack> Retrieving all strips
OnTheFastrack> Saved Comics/OnTheFastrack/July-26-2015.gif (141.31KB).
OnTheFastrack> ERROR: Could not save image at http://onthefastrack.com/ to July-26-2015: ValueError('Expecting object: line 17031 column 8 (char 903967)',)
OnTheFastrack> Stop retrieval because image file already exists

The images are downloaded, then the error occurs and the images are not included in the html or rss output.

Can anyone help?

TIA
Mark
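
The `ValueError('Expecting object: ...')` text is what Python's `json` module reports for malformed input, which fits a damaged dosage.json as named in the title. A sketch of how the error could name the file instead of surfacing as a failed image save; `load_state` and its `path` argument are hypothetical and this is not dosage's code.

```python
import json


def load_state(text, path="dosage.json"):
    """Parse the JSON state file content, naming the file if it is corrupt."""
    try:
        return json.loads(text)
    except ValueError as exc:  # json decoding errors are ValueError subclasses
        raise ValueError("Cannot read %s: %s" % (path, exc))
```

A message like "Cannot read dosage.json: ..." would point the user at the file to repair or delete.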

changed from wummel/dosage-2.15 to webcomics/dosage; but still MGG issue

Thank you for the dosage fork!
Deinstalled dosage-2.15 package of Debian testing.
Installed dosage-master
Installed python-setuptools
Now Dilbert and KevinAndKell download correctly and show up in the resulting html file.

Still MotherGooseAndGrimm does not work, however.

Checking out my dosage manually:

$ dosage MotherGooseandGrimm:2015-07-20
Arcamax/MotherGooseAndGrimm> Retrieving 1 strip for index 2015-07-20
Arcamax/MotherGooseAndGrimm> ERROR: Patterns ['<\s*[iI][mM][gG]\s+(?:[^>]*\s+)?[dD][aA][tT][aA][--][zZ][oO][oO][mM][--][iI][mM][aA][gG][eE]\s*=\s*"(/newspics/[^"]+)"[^>]*[^>]*>'] not found at URL http://www.arcamax.com/thefunnies/mothergooseandgrimm/2015-07-20.

Not life-threatening but would be nice to have.
Thanks again and greetings
Eike, Paraguay

xkcd comic pattern not found.

The xkcd comic is not working for me, and it has been like this since around 19 March 2015. Is someone else experiencing this?

./dosage xkcd --adult
xkcd> Retrieving 1 strip (including adult content)
xkcd> ERROR: Patterns ['<\s*[iI][mM][gG]\s+(?:[^>]*\s+)?[sS][rR][cC]\s*=\s*"(http://imgs.xkcd.com/comics/[^"]+)"[^>]*[^>]*>'] not found at URL http://xkcd.com/1510/.

Cannot install with `pip install git+https://github.com/webcomics/dosage`

I installed dosage as root with pip install git+https://github.com/webcomics/dosage as suggested in #80, but pip failed. This is pip's log file: pip.log

System:
Debian 8.7
Python 2.7.9
pip 9.0.1

xkcd pattern error

I get the following pattern error when I run dosage-2.15 on xkcd. Other comics (e.g. CalvinAndHobbes) work fine.

$ dosage-2.15/dosage xkcd --adult
xkcd> Retrieving 1 strip (including adult content)
xkcd> ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://imgs\\.xkcd\\.com/comics/[^"]+)"[^>]*[^>]*>'] not found at URL http://xkcd.com/1688/.

SMBC download fails

As stated in #55

ERROR: Patterns ["<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*'(http\\:\\/\\/www\\.smbc\\-comics\\.com\\/comics/\\d{8}(?:\\w2?|-\\d)?\\.\\w{3})\\s*'[^>]*[^>]*>"] not found at URL http://www.smbc-comics.com/.

DogHouseDiaries broken.

Looks like the url has changed...
Now it is: http://thedoghousediaries.com/dhdcomics/2015-04-13.png

Big plans :)

Just wanted to share some thoughts.

When I started actively fixing broken comics I quickly realised a lot of them were hosted on Wordpress and grouped them together. Now there are almost 70 comics in there. As with most websites, a large number of webcomics are using CMSs.

My big plan

1. Identify other common CMSs and add support for them.
   - If anybody knows any, please let me know.
2. If dosage gets passed a URL, it will try to determine if the site is using a CMS and download the comic using that.
   - I've managed to do this for Wordpress already but I need to redo it properly.
   - (optional) Make it notify us so we can add "official" support for that comic. Do you guys think anybody would mind this feature? Should it be opt-out/opt-in?
3. Write scripts to migrate as many existing comics to CMS logic as possible.
   - Should make everything way easier to maintain
4. (optional) Make a simple form for submitting comic support requests (comic name, URL).
   - Google Forms makes this easy and allows anonymous submissions
   - Get TobiX to add it to the website ;)
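
As a starting point for the CMS detection step, here is a rough sketch for Wordpress only. It relies on two common Wordpress traits, the "generator" meta tag and `/wp-content/` paths, both of which a site can remove, so treat it as a heuristic. `detect_cms` is a hypothetical helper, not the existing implementation.

```python
import re

# Matches <meta name="generator" content="..."> in this attribute order only.
GENERATOR = re.compile(
    r'<meta\s+name=["\']generator["\']\s+content=["\']([^"\']+)["\']',
    re.IGNORECASE,
)


def detect_cms(html):
    """Return "wordpress" if the page looks like Wordpress, else None."""
    match = GENERATOR.search(html)
    if match and match.group(1).lower().startswith("wordpress"):
        return "wordpress"
    if "/wp-content/" in html:
        return "wordpress"
    return None
```

Other CMSs could be added as further checks that return their own identifier, which the downloader would then map to a shared scraping recipe.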

For the fun of it, I've also made a webcrawler that searches for new webcomics (as they like to exchange links), but that probably doesn't belong in Dosage. If anybody wants to know more, drop me a line.

version 2.15 not working on windows

Not sure if this is the right place to log this.

I tried installing both the .exe and the script itself. With the .exe installed, running any dosage command gives this error:

File "dosage", line 18, in <module>
File "dosagelib\events.pyo", line 11, in <module>
File "dosagelib\util.pyo", line 17, in <module>
File "requests\__init__.pyo", line 58, in <module>
File "requests\utils.pyo", line 25, in <module>
File "requests\compat.pyo", line 7, in <module>
File "requests\packages\__init__.pyo", line 3, in <module>
File "requests\packages\urllib3\__init__.pyo", line 16, in <module>
File "requests\packages\urllib3\connectionpool.pyo", line 39, in <module>
File "requests\packages\urllib3\request.pyo", line 12, in <module>
File "requests\packages\urllib3\filepost.pyo", line 15, in <module>
File "requests\packages\urllib3\fields.pyo", line 7, in <module>
ImportError: No module named email.utils

When using the script, I have no idea where dosage gets installed, so I can't use it properly; I can't find it in any of the usual directories.

I am basically a python/github noob, so I apologize if this is not the right place to log this or if you have already fixed these issues. If so, is there a detailed tutorial available on how to install and run it properly?

Thank You

SandraAndWooGerman download fails

As stated in #55
I think this affects the non-german version of Sandra And Woo as well

ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http\\:\\/\\/www\\.sandraandwoo\\.com\\/woode\\/comics/\\d+-\\d+-\\d+-[^"]+)"[^>]*[^>]*>'] not found at URL http://www.sandraandwoo.com/woode/.

Wolffmorgenthaler download fails

As stated in #55

ERROR: Patterns ['<\\s*[dD][iI][vV]\\s+(?:[^>]*\\s+)?[cC][lL][aA][sS][sS]\\s*=\\s*"box-content"[^>]*[^>]*>\\s*<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"[^"]+"[^>]*[^>]*>\\s*<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://kindofnormal\\.com/img/wumo/\\d+/\\d+/[^/"]+)"[^>]*[^>]*>'] not found at URL http://kindofnormal.com/wumo/.

CyanideAndHappiness download fails

Hey guys,

I have reinstalled dosage on a fresh host, and I noticed that several webcomic downloads fail because of an outdated regex or something. I think I will just create an issue for every comic, ok?

First: CyanideAndHappiness

 ERROR: Patterns ['<\\s*[aA]\\s+[^>]*prev[^>]*\\s+[hH][rR][eE][fF]\\s*=\\s*"(/comics/\\d+/)"[^>]*[^>]*>'] not found at URL http://www.explosm.net/comics/.

LookingForGroup download fails

As stated in #55

ERROR: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"(http\\:\\/\\/www\\.lfgcomic\\.com\\/page/[-0-9]+/)"[^>]*feature-item-link[^>]*>'] not found at URL http://www.lfgcomic.com/.

Oglaf archiving failure

It's not real clear why this is failing. The cwd is empty to start. I've tried both the 2.15 release and the current git master.

localhost Comics [0]$ /usr/src/dosage/dosage --adult --all --basepath . --no-downscale Oglaf
Oglaf> Retrieving all strips (including adult content)
Oglaf> Saved Comics/Oglaf/wax_loquacious.jpg (195.76KB).
Oglaf> Saved Comics/Oglaf/weathercock.jpg (207.85KB).
Oglaf> Saved Comics/Oglaf/upcycling.jpg (221.29KB).
Oglaf> Saved Comics/Oglaf/plenty.jpg (174.41KB).
Oglaf> Saved Comics/Oglaf/bachelorprince.jpg (202.84KB).
Oglaf> Saved Comics/Oglaf/spectrophilia.jpg (185.94KB).
Oglaf> Saved Comics/Oglaf/clumsyfetish.jpg (206.19KB).
Oglaf> Saved Comics/Oglaf/throne_of_heaven_Ni4CAm4.jpg (520.03KB).
Oglaf> Saved Comics/Oglaf/cultistfuck1_MNEdrri.jpg (502.72KB).
Oglaf> Saved Comics/Oglaf/cultistfuck2_QhHIAQD.jpg (514.69KB).
Oglaf> WARN: Already seen previous URL u'http://oglaf.com/could-happen/'
localhost Comics [0]$

gocomics.com URLs appear to have changed

Looks like they're no longer matching the pattern, though the main comic URLs appear to be unchanged - just the link to the actual image files has changed.

Example output:

# dosage --baseurl /comics/ -b /srv/www/htdocs/comics/ -o html -n 4 GoComics/CalvinandHobbes
MainThread> WARN: HTML output file '/srv/www/htdocs/comics/html/comics-20170109.html' already exists
MainThread> WARN: the page link of previous run will skip this file
MainThread> WARN: try to generate HTML output only once per day
GoComics/CalvinAndHobbes> ERROR: XPath //ul[@class="feature-nav"]//a[@class="prev"] not found at URL http://www.gocomics.com/calvinandhobbes.

FonFlatter download fails

As stated in #55

ERROR: Patterns ['src="(http\\:\\/\\/www\\.fonflatter\\.de\\/\\d+/fred_\\d+-\\d+-\\d+[^"]+)'] not found at URL http://www.fonflatter.de/.

Arcamax download fails

As stated in #55
This is an example for all Arcamax webcomics which fail with similar errors

ERROR: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"(/newspics/[^"]+)"[^>]*zoom[^>]*>'] not found at URL http://www.arcamax.com/thefunnies/babyblues/.

gocomics changes broke dosage

Attempts to pull comics from gocomics fail now (suspect that the recent change in gocomics site broke it):

export PYTHONPATH=$HOME/lib/python; $HOME/bin/dosage -b $HOME/Comics --adult @
GoComics/NonSequitur> ERROR: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"(/[^"]+/\\d+/\\d+/\\d+)"[^>]*prev[^>]*>'] not found at URL http://www.gocomics.com/nonsequitur.
GoComics/9ChickweedLane> ERROR: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"(/[^"]+/\\d+/\\d+/\\d+)"[^>]*prev[^>]*>'] not found at URL http://www.gocomics.com/9chickweedlane.
GoComics/Luann> ERROR: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"(/[^"]+/\\d+/\\d+/\\d+)"[^>]*prev[^>]*>'] not found at URL http://www.gocomics.com/luann.
GoComics/Pickles> ERROR: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"(/[^"]+/\\d+/\\d+/\\d+)"[^>]*prev[^>]*>'] not found at URL http://www.gocomics.com/pickles.
GoComics/ArloandJanis> ERROR: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"(/[^"]+/\\d+/\\d+/\\d+)"[^>]*prev[^>]*>'] not found at URL http://www.gocomics.com/arloandjanis.
GoComics/CalvinandHobbes> ERROR: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"(/[^"]+/\\d+/\\d+/\\d+)"[^>]*prev[^>]*>'] not found at URL http://www.gocomics.com/calvinandhobbes.
GoComics/GetFuzzy> ERROR: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"(/[^"]+/\\d+/\\d+/\\d+)"[^>]*prev[^>]*>'] not found at URL http://www.gocomics.com/getfuzzy.
Arcamax/Zits> Retrieving 1 strip
Arcamax/Zits> ERROR: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"(/newspics/[^"]+)"[^>]*zoom[^>]*>'] not found at URL http://www.arcamax.com/thefunnies/zits/.
Dilbert> ERROR: Patterns ['<\\s*[aA]\\s+(?:[^>]*\\s+)?[hH][rR][eE][fF]\\s*=\\s*"(/\\d+-\\d+-\\d+/)"[^>]*STR_Prev[^>]*>'] not found at URL http://dilbert.com/.
xkcd> Retrieving 1 strip (including adult content)
xkcd> ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://imgs\\.xkcd\\.com/comics/[^"]+)"[^>]*[^>]*>'] not found at URL http://xkcd.com/1786/.

Spinnerette does not load.

Module throws Error: Spinnerette> ERROR: Patterns ['<\s*[iI][mM][gG]\s+(?:[^>]*\s+)?[sS][rR][cC]\s*=\s*"(comics/[^"]+)"[^>]*comic[^>]*>'] not found at URL http://www.spinnyverse.com/.

The url of the newest strip is "http://www.spinnyverse.com/comics/1433333331-2015-06-03.jpg"

TypeError: indirectStarter() takes exactly 1 argument (2 given)

I am using the latest master (commit 8ded28b), and I get this:

********** Oops, I did it again. *************

You have found an internal error in dosage. Please write a bug report
at https://github.com/webcomics/dosage/issues and include at least the information below:

Not disclosing some of the information below due to privacy reasons is ok.
I will try to help you nonetheless, but you have to give me something
I can work with ;) .

<type 'exceptions.TypeError'> indirectStarter() takes exactly 1 argument (2 given)
Traceback (most recent call last):
  File "/home/usr/.local/lib/python2.7/site-packages/dosagelib/cmd.py", line 322, in main
    res = run(options)
  File "/home/usr/.local/lib/python2.7/site-packages/dosagelib/cmd.py", line 238, in run
    return director.getComics(options)
  File "/home/usr/.local/lib/python2.7/site-packages/dosagelib/director.py", line 162, in getComics
    options.adult, options.multimatch):
  File "/home/usr/.local/lib/python2.7/site-packages/dosagelib/director.py", line 231, in getScrapers
    found_scrapers = scraper.find_scrapers(name, multiple_allowed=multiple_allowed)
  File "/home/usr/.local/lib/python2.7/site-packages/dosagelib/scraper.py", line 535, in find_scrapers
    for scrapers in get_scrapers():
  File "/home/usr/.local/lib/python2.7/site-packages/dosagelib/scraper.py", line 566, in get_scrapers
    plugins = list(loader.get_plugins(modules, Scraper))
  File "/home/usr/.local/lib/python2.7/site-packages/dosagelib/loader.py", line 52, in get_plugins
    for module in modules:
  File "/home/usr/.local/lib/python2.7/site-packages/dosagelib/loader.py", line 40, in get_modules
    yield importlib.import_module(name)
  File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/home/usr/.local/lib/python2.7/site-packages/dosagelib/plugins/wordpress.py", line 25, in <module>
    add(name, 'http://hijinksensue.com/', starter=indirectStarter('http://hijinksensue.com/', starterXPath))
TypeError: indirectStarter() takes exactly 1 argument (2 given)
System info:
dosage 2.15.1.dev368
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Local time: 2016-06-03 08:00:49+001
sys.argv ['/home/usr/.local/bin/dosage', '--adult', '-o', 'html', '--baseurl', 'http://www.kierun.org/Comics/', '--no-downscale', 'Anythingaboutnothing', 'Carciphona', 'CtrlAltDel', 'CtrlAltDel/Sillies', 'CyanideAndHappiness', 'Dilbert', 'DresdenCodak', 'FowlLanguage', 'GirlGenius', 'IAmArg', 'Lackadaisy', 'LoadingArtist', 'MegaTokyo', 'MonsieurLeChien', 'NoNeedForBushido', 'Oglaf', 'PHDComics', 'PennyArcade', 'RedMeat', 'RomanticallyApocalyptic', 'QuestionableContent', 'ScandinaviaAndTheWorld', 'SMBC', 'Sithrah', 'Sinfest', 'StandStillStaySilent', 'SluggyFreelance', 'SomethingPositive', 'Spinnerette', 'TwoGuysAndGuy', 'xkcd', 'ZenPencils']
LANGUAGE = 'en_GB.UTF-8'
LANG = 'en_GB.UTF-8'

 ******** dosage internal error, over and out ********

Remove pycountry dependency again...

In 86b31dc I replaced the static language list with the "correct" dependency to pycountry. The problem is that

- pycountry is ~30 MB big, which is quite heavy for "just" language names
- pycountry is not zip-safe, which complicates building a single-file .exe for Windows

I will probably go ahead and revert most of 86b31dc, or just use pycountry at runtime if it is already installed, falling back to a minimal internal list. Other ideas?
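
The "use it if installed, otherwise fall back" idea could look roughly like this. It is a sketch, not the code from 86b31dc: `language_name` and `FALLBACK_LANGUAGES` are hypothetical, the fallback list is an illustrative subset, and the `alpha_2` lookup key is what recent pycountry releases use (older releases named it differently).

```python
try:
    import pycountry  # optional dependency
except ImportError:
    pycountry = None

# Minimal internal fallback list (illustrative subset).
FALLBACK_LANGUAGES = {
    "en": "English",
    "de": "German",
    "fr": "French",
    "es": "Spanish",
}


def language_name(code):
    """Return the language name for a two-letter code, or the code itself."""
    if pycountry is not None:
        try:
            lang = pycountry.languages.get(alpha_2=code)
        except (KeyError, TypeError):
            lang = None  # unknown code or older pycountry API
        if lang is not None:
            return lang.name
    return FALLBACK_LANGUAGES.get(code, code)
```

This keeps pycountry out of the hard requirements, so a single-file Windows build would only need the internal list.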

AhoiPolloi download fails

As stated in #55

ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(/static/antville/ahoipolloi/images/[^"]+)"[^>]*[^>]*>'] not found at URL http://ahoipolloi.blogger.de/.

ImportError: No module named requests

hey there.
i'm on linux mint 18 with fresh installed python 2.7 and pip.
when running dosage i get these errors:

Traceback (most recent call last):
  File "/usr/local/bin/dosage", line 18, in <module>
    from dosagelib import events, configuration, singleton, director
  File "/usr/local/lib/python2.7/dist-packages/dosagelib/events.py", line 11, in <module>
    from . import rss, util, configuration
  File "/usr/local/lib/python2.7/dist-packages/dosagelib/util.py", line 17, in <module>
    import requests
ImportError: No module named requests

Renaming comics

It should be possible to rename comic modules and somehow keep the old name around to redirect users of the old module to the new module...

We could put those into a special "deprecated" module, so it's easy to remove them later.

We might also do this for modules that once were part of Dosage but were removed later (site disappeared, comic was deleted, crawling is now blocked), so we can inform users why we no longer support their favorite module.
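
A minimal sketch of how such a "deprecated" module could work. The registry layout, the comic names and the `resolve` helper are all hypothetical.

```python
# Hypothetical registry: old module name -> (new name or None, reason).
DEPRECATED = {
    "OldComicName": ("NewComicName", "renamed"),
    "GoneComic": (None, "site disappeared"),
}


def resolve(name):
    """Follow a rename; raise with the recorded reason if the comic is gone."""
    if name not in DEPRECATED:
        return name
    new_name, reason = DEPRECATED[name]
    if new_name is None:
        raise LookupError("%s is no longer supported: %s" % (name, reason))
    return new_name
```

Keeping all entries in one place would make it easy to drop old redirects after a few releases.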

DieFruehreifen download fails

As stated in #55

ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(strips/[F,f]rueh[_]?[S,s]trip_\\d+.jpg)"[^>]*[^>]*>'] not found at URL http://www.die-fruehreifen.de/index.php.

Remove descriptions?

Some comics have some description text in our code. It's pretty useless for the general operation of Dosage itself and I would like to remove all descriptions (and all related code). Comments?

xkcd download fails

As stated in #55

 ERROR: Patterns ['<\\s*[iI][mM][gG]\\s+(?:[^>]*\\s+)?[sS][rR][cC]\\s*=\\s*"(http://imgs\\.xkcd\\.com/comics/[^"]+)"[^>]*[^>]*>'] not found at URL http://xkcd.com/1742/.
