
blinkist-scraper's Introduction

blinkist-scraper

A Python script to download book summaries and audio from Blinkist and generate some pretty output files.

Installation / Requirements

Make sure you're in your virtual environment of choice, then run

  • poetry install --no-dev if you have Poetry installed
  • pip install -r requirements.txt otherwise

This script uses ChromeDriver to automate the Google Chrome browser, so Google Chrome needs to be installed for it to work.

The script will automatically try to download and use the appropriate chromedriver distribution for your OS and Chrome version. If this doesn't work, download the right version for you from https://chromedriver.chromium.org/downloads and use the --chromedriver argument to specify its path at runtime.
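The fallback described above can be sketched as a small helper. Note that resolve_chromedriver is a hypothetical name, not part of the script's actual API; it assumes the chromedriver_autoinstaller package from requirements.txt:

```python
def resolve_chromedriver(explicit_path=None):
    """Return an explicit chromedriver path if one was passed (the
    --chromedriver case), otherwise try to download and install a driver
    matching the installed Chrome version."""
    if explicit_path:
        return explicit_path
    # Imported lazily so the fallback only needs the package when used.
    import chromedriver_autoinstaller
    return chromedriver_autoinstaller.install()
```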

Usage

usage: blinkistscraper [-h] [--language {en,de}] [--match-language]
                       [--cooldown COOLDOWN] [--headless] [--audio]
                       [--concat-audio] [--keep-noncat] [--no-scrape]
                       [--book BOOK] [--daily-book] [--books BOOKS]
                       [--book-category BOOK_CATEGORY]
                       [--categories CATEGORIES [CATEGORIES ...]]
                       [--ignore-categories IGNORE_CATEGORIES [IGNORE_CATEGORIES ...]]
                       [--create-html] [--create-epub] [--create-pdf]
                       [--save-cover] [--embed-cover-art] 
                       [--chromedriver CHROMEDRIVER] [--no-ublock] [--no-sandbox] [-v]
                       email password

positional arguments:
  email                 The email to log into your premium Blinkist account
  password              The password to log into your premium Blinkist account

optional arguments:
  -h, --help            show this help message and exit
  --language {en,de}    The language to scrape books in - either 'en' for
                        English or 'de' for German
  --match-language      Skip scraping books if not in the requested language
                        (not all books are available in German)
  --cooldown COOLDOWN   Seconds to wait between scraping books, and
                        downloading audio files. Can't be smaller than 1
  --headless            Start the automated web browser in headless mode.
                        Works only if you already logged in once
  --audio               Download the audio blinks for each book.
  --concat-audio        Concatenate the audio blinks into a single file and
                        tag it. Requires ffmpeg
  --keep-noncat         Keep the individual blink audio files, instead of
                        deleting them (works with '--concat-audio' only)
  --no-scrape           Don't scrape the website, only process existing json
                        files in the dump folder. Do not provide email or
                        password with this option.
  --book BOOK           Scrapes this book only, takes the Blinkist URL for the
                        book (e.g. https://www.blinkist.com/en/books/... or
                        https://www.blinkist.com/en/nc/reader/...)
  --daily-book          Scrapes the free daily book only.
  --books BOOKS         Scrapes the list of books, takes a txt file with the
                        list of Blinkist URL's for the books (e.g.
                        https://www.blinkist.com/en/books/... or
                        https://www.blinkist.com/en/nc/reader/...)
  --book-category BOOK_CATEGORY
                        When scraping a single book, categorize it under this
                        category (works with '--book' and '--daily-book' only)
  --categories CATEGORIES [CATEGORIES ...]
                        Only the categories whose label contains at least one
                        string here will be scraped. Case-insensitive; use
                        spaces to separate categories. (e.g. '--categories
                        entrep market' will only scrape books under
                        'Entrepreneurship' and 'Marketing & Sales')
  --ignore-categories IGNORE_CATEGORIES [IGNORE_CATEGORIES ...]
                        If a category label contains anything in
                        ignored_categories, books under that category will not
                        be scraped. Case-insensitive; use spaces to separate
                        categories. (e.g. '--ignore-categories entrep market'
                        will skip scraping of 'Entrepreneurship' and
                        'Marketing & Sales')
  --create-html         Generate a formatted html document for the book
  --create-epub         Generate a formatted epub document for the book
  --create-pdf          Generate a formatted pdf document for the book.
                        Requires wkhtmltopdf
  --save-cover          Save a copy of the Blink cover artwork in the folder
  --embed-cover-art     Embed the Blink cover artwork into the concatenated
                        audio file (works with '--concat-audio' only)
  --chromedriver CHROMEDRIVER
                        Path to a specific chromedriver executable instead of
                        the built-in one
  --no-ublock           Disable the uBlock Chrome extension. This will
                        completely skip the installation (and setup) of
                        ublock. If you want to use ublock content blocking, then
                        run the script again without this flag.
  --no-sandbox          When running as root (e.g. in Docker), Chrome requires
                        the '--no-sandbox' argument     
  -v, --verbose         Increases logging verbosity

Basic usage

Run python blinkistscraper email password, where email and password are the login details for your premium Blinkist account.

The script uses Selenium with a Chrome driver to scrape the site automatically using the provided credentials. Sometimes during scraping, a captcha block-page appears. When this happens, the script pauses and waits for the user to solve it; after about one minute, it times out. The output files are stored in the books folder, arranged in subfolders by category and then by the book's title and author.
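The pause-and-wait behaviour boils down to a polling loop with a deadline. A minimal sketch (wait_for is an illustrative helper, not the script's actual code):

```python
import time

def wait_for(condition, timeout=60, interval=1.0):
    """Poll `condition` (a zero-argument callable) until it returns a
    truthy value or `timeout` seconds elapse. Returns True on success,
    False on timeout -- mirroring the 'wait for the user to solve the
    captcha, then give up after a minute' behaviour described above."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False
```

In the real script the condition would check the browser state (e.g. whether the captcha page is still displayed).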

Customizing HTML output

The script builds a nice-looking html version of the book by using the 'book.html' and 'chapter.html' files in the 'templates' folder as a base. Every parameter between curly braces in those files (e.g. {title}) is replaced by the appropriate value from the book metadata (dumped in the dump folder upon scraping), following a 1-to-1 naming convention with the json parameters (e.g. {title} is replaced by the title parameter, {who_should_read} by the who_should_read one, and so on).

The special field {__chapters__} is replaced with all the book's chapters. Chapters are created by parsing each chapter object in the book metadata and using the chapter.html template file in the same fashion, replacing tokens with the parameters inside the chapter object.
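The token replacement described above can be sketched roughly like this (render_template is an illustrative helper, not necessarily the script's exact implementation):

```python
import re

def render_template(template, data):
    """Replace every {token} in `template` with str(data[token]),
    leaving unknown tokens untouched so a missing metadata field
    doesn't break generation. \w also matches underscores, so special
    fields like {__chapters__} are covered by the same pattern."""
    def substitute(match):
        key = match.group(1)
        return str(data[key]) if key in data else match.group(0)
    return re.sub(r"\{(\w+)\}", substitute, template)
```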

Generating .pdf

Add the --create-pdf argument to the script to generate a .pdf file from the .html one. This requires the wkhtmltopdf tool to be installed and present in the PATH.
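A minimal sketch of that conversion step, assuming wkhtmltopdf is invoked as a plain command-line call (the helper names are illustrative):

```python
import shutil
import subprocess

def wkhtmltopdf_cmd(html_path, pdf_path):
    """Build the wkhtmltopdf invocation that turns the generated
    .html file into a .pdf."""
    return ["wkhtmltopdf", html_path, pdf_path]

def html_to_pdf(html_path, pdf_path):
    """Run the conversion, failing early with a clear message when
    the tool is missing from the PATH."""
    if shutil.which("wkhtmltopdf") is None:
        raise RuntimeError("wkhtmltopdf not found in PATH")
    subprocess.run(wkhtmltopdf_cmd(html_path, pdf_path), check=True)
```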

Downloading audio

The script also downloads audio blinks when the --audio argument is added. It does this by waiting for a request to Blinkist's audio endpoint in their library api for the first chapter's audio blink, which is sent as soon as the user navigates to a book's reader page; it then re-uses the valid request's headers to build additional requests for the rest of the chapters' audio files. The files are downloaded as .m4a.
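The header-reuse step might look roughly like the following; the exact set of headers Blinkist requires is an assumption here, as is the helper name:

```python
def reusable_audio_headers(captured_headers):
    """Given the headers captured from the browser's first audio
    request, keep only the ones worth replaying on follow-up requests
    (auth/session bits) and drop request-specific ones. The exact set
    of headers Blinkist needs may differ; this is an illustrative
    filter, not the script's actual list."""
    keep = {"authorization", "cookie", "user-agent", "x-requested-with"}
    return {k: v for k, v in captured_headers.items() if k.lower() in keep}
```

The filtered headers would then be passed to something like requests.get(audio_url, headers=...) for each remaining chapter.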

Concatenating audio files

Add the --concat-audio argument to the script to concatenate the individual audio blinks into a single file and tag it with the appropriate book title and author. Doing this deletes the individual blinks and replaces them with a single audio file per book. To also keep the individual blink audio files, use the --keep-noncat argument together with --concat-audio (i.e. --concat-audio --keep-noncat). This requires the ffmpeg tool to be installed and present in the PATH.
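Concatenation like this is typically done with ffmpeg's concat demuxer, which takes a small list file naming the inputs. A sketch of building that list file and command (the script's exact invocation may differ):

```python
def ffmpeg_concat_cmd(blink_files, output_file, list_file="concat.txt"):
    """Build the contents of the concat-demuxer list file and the
    ffmpeg command that joins the blinks without re-encoding
    (`-c copy`). Returns (list_file_contents, command)."""
    listing = "".join(f"file '{path}'\n" for path in blink_files)
    cmd = ["ffmpeg", "-f", "concat", "-safe", "0",
           "-i", list_file, "-c", "copy", output_file]
    return listing, cmd
```

The listing string would be written to `list_file` before running the command with subprocess.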

Processing book dumps with no scraping

During scraping, the script saves each book's metadata in a json file inside the dump folder. These files can be used by the script to re-generate the .html, .epub and .pdf output files without having to scrape the website again. To do so, pass the --no-scrape argument to the script without providing an email or a password.
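The --no-scrape mode essentially reduces to iterating over the dump folder. A minimal sketch (iter_book_dumps is an illustrative helper, not the script's API):

```python
import json
from pathlib import Path

def iter_book_dumps(dump_dir="dump"):
    """Yield the parsed metadata of every book json in the dump
    folder -- all that's needed to regenerate the .html/.epub/.pdf
    output files without touching the website."""
    for path in sorted(Path(dump_dir).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            yield json.load(f)
```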

Scraping with a free account

If you don't have a Blinkist premium account, you can still scrape the free daily book. To do so automatically, pass the --daily-book argument - this behaves like scraping a single book.

Quirks & Known Bugs

  • Some people have had trouble with long generated book file paths (> 260 characters on Windows). Although this should be handled gracefully by the script, if you keep seeing "FileNotFoundError" when trying to create the .html / .m4a files, try turning on long filename support on your system (https://www.itprotoday.com/windows-10/enable-long-file-name-support-windows-10), and make sure you have a recent distribution of ffmpeg if using it (old versions had bugs when dealing with long filenames)
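One way such a limit can be handled in code is to shorten the generated file name before writing. This is an illustrative sketch under an assumed 260-character limit, not the script's exact logic:

```python
import os

def safe_output_path(path, limit=260):
    """If `path` would exceed the classic Windows MAX_PATH limit,
    truncate the file name (keeping the directory and extension) so
    the full path fits. Strategy and limit are illustrative."""
    if len(path) <= limit:
        return path
    root, ext = os.path.splitext(path)
    head, name = os.path.split(root)
    keep = limit - len(head) - len(ext) - 1  # room for the separator
    return os.path.join(head, name[:max(keep, 1)] + ext)
```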

Support

If this tool has proven useful to you, consider buying me a coffee to support development of this and many other projects.

blinkist-scraper's People

Contributors

arminius4, dependabot[bot], dhavalsavalia, firstclasscitizenfcc, johndoe-dev00, jonaschn, leoncvlt, nerrons, r42-leon, rocketinventor


blinkist-scraper's Issues

Some books are not scraped

Hello there, I hope all is well with you.
I have been using this script for a few months and I have to admit that it has significantly helped me in reading Blinkist on my kindle.
I have run into an annoying situation where I am unable to scrape some books.
For instance, I receive this error when the Chromedriver is started and the link to my book is opened.
Would you mind assisting me in completing this task?
best

Missing "uncategorized" category from scraper, thus not retrieving all books

When navigating each category section, a list of books appears, which is used to gather all book urls and download them afterwards. You would expect to get all books from Blinkist if you navigated the "Show all books" section of every existing category, right? Well... that's not the case.

In fact, not all books have a category, which results in this (awesome) library missing around 15% of all books (422 missing out of 2800). This problem is the cause of, for example, this issue.

Manual solution (shortcut to get all books)

I managed to craft a list of all currently existing books in Blinkist, through their own search engine. They use Algolia, and from my previous experience it can be queried using the client's x-algolia-application-id and x-algolia-api-key; then simply call the /indexes/books-production/query endpoint. I then did a bit of clean-up on the JSON response and constructed each book URL. Here's the list that you can easily use with the --books argument.
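The query described follows Algolia's standard REST API shape (POST to https://{APP_ID}-dsn.algolia.net/1/indexes/{index}/query with the application-id and api-key headers). A sketch of building that request; the index name comes from this issue, and the credentials must be lifted from Blinkist's web client as described above:

```python
import json
import urllib.request

ALGOLIA_URL = "https://{app_id}-dsn.algolia.net/1/indexes/books-production/query"

def algolia_book_query(app_id, api_key, page=0):
    """Build a POST request against the books-production index using
    Algolia's public REST API conventions. The caller supplies the
    app id and api key observed in the web client's requests."""
    return urllib.request.Request(
        ALGOLIA_URL.format(app_id=app_id.lower()),
        data=json.dumps({"params": f"page={page}"}).encode(),
        headers={
            "x-algolia-application-id": app_id,
            "x-algolia-api-key": api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```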

how to skip audio file?

Is there a command/argument I can pass to skip the audio scraping, if I'm interested in the ebook only?

Timed out receiving message from renderer: 298.991

I will get this timeout every once in a while. But when using the --book flag to download the book with this error, it seems to work fine.

Is there a way to re-download books without restarting? It seems there isn't a way to do this at the moment; it will scrape everything over again.

Alternatively, there is currently a way to skip scraping and only process existing json files. Would it be possible to scrape all json files first, before downloading the books? Then I could manually remove the json files for books I have already downloaded.
Versions:
chromedriver-autoinstaller 0.2.2
colorama 0.4.4
EbookLib 0.17.1
requests 2.25.1
selenium 3.141.0
selenium-wire 4.3.2
python 3.8.10
Error:
Message: timeout: Timed out receiving message from renderer: 298.991
(Session info: chrome=91.0.4472.124)
(Driver info: chromedriver=91.0.4472.101 (af52a90bf87030dd1523486a1cd3ae25c5d76c9b-refs/branch-heads/4472@{#1462}),platform=Windows NT 10.0.18363 x86_64)
Traceback (most recent call last):
  File "D:\hxh103\blinker2\blinkistscraper\__main__.py", line 412, in <module>
    main()
  File "D:\hxh103\blinker2\blinkistscraper\__main__.py", line 368, in main
    dump_exists = scrape_book(
  File "D:\hxh103\blinker2\blinkistscraper\__main__.py", line 257, in scrape_book
    audio_files = scraper.scrape_book_audio(
  File "D:\hxh103\blinker2\blinkistscraper\scraper.py", line 513, in scrape_book_audio
    driver.get(book_reader_url)
  File "C:\Users\hxh103\Anaconda3\envs\blink2\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 333, in get
    self.execute(Command.GET, {'url': url})
  File "C:\Users\hxh103\Anaconda3\envs\blink2\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "C:\Users\hxh103\Anaconda3\envs\blink2\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message: timeout: Timed out receiving message from renderer: 298.991
  (Session info: chrome=91.0.4472.124)
  (Driver info: chromedriver=91.0.4472.101 (af52a90bf87030dd1523486a1cd3ae25c5d76c9b-refs/branch-heads/4472@{#1462}),platform=Windows NT 10.0.18363 x86_64)

'NoneType' object has no attribute 'split'

>python blinkistscraper @.* ********
[12:43:23] INFO Starting scrape run...
[12:43:23] INFO Initialising chromedriver at C:\Users***\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\chromedriver_autoinstaller\91\chromedriver.exe...

DevTools listening on ws://127.0.0.1:56424/devtools/browser/492b0c18-1e9f-431f-8080-49943064fb4b
[12:43:32] INFO Logged into Blinkist. Loading Library...
[12:43:33] ERROR 'NoneType' object has no attribute 'split'
Traceback (most recent call last):
  File "\blinkistscraper\__main__.py", line 412, in <module>
    main()
  File "\blinkistscraper\__main__.py", line 358, in main
    categories = scraper.get_categories(
  File "**\blinkistscraper\scraper.py", line 322, in get_categories
    "label": " ".join(label.split()).replace("&amp;", "&"), "url": href
AttributeError: 'NoneType' object has no attribute 'split'
[12:43:33] CRITICAL Uncaught Exception. Exiting...

Any idea?

json.decoder.JSONDecodeError

Hi,

I keep getting this error after scraping a few books (3 to 4). Please help.

Traceback (most recent call last):
  File "main.py", line 62, in <module>
    audio_files = scraper.scrape_book_audio(driver, book_json)
  File "C:\Users\Fang Yuan\Downloads\blinkist-scraper-master\scraper.py", line 223, in scrape_book_audio
    audio_url = audio_request.json()['url']
  File "C:\Python38\lib\site-packages\requests\models.py", line 888, in json
    return complexjson.loads(
  File "C:\Python38\lib\json\__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "C:\Python38\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Python38\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Error in proxy2 when scraping

I'm getting constant errors when scraping, even at the start when it's connecting and downloading the book list. The message is always the same, but it happens multiple times, nearly once per second.

Error making request
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/seleniumwire/proxy/proxy2.py", line 91, in proxy_request
    conn.request(self.command, path, req_body, dict(req.headers))
  File "/usr/lib/python3.7/http/client.py", line 1260, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.7/http/client.py", line 1306, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.7/http/client.py", line 1255, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.7/http/client.py", line 1030, in _send_output
    self.send(msg)
  File "/usr/lib/python3.7/http/client.py", line 970, in send
    self.connect()
  File "/usr/local/lib/python3.7/dist-packages/seleniumwire/proxy/proxy2.py", line 368, in connect
    super().connect()
  File "/usr/lib/python3.7/http/client.py", line 1415, in connect
    super().connect()
  File "/usr/lib/python3.7/http/client.py", line 942, in connect
    (self.host,self.port), self.timeout, self.source_address)
  File "/usr/lib/python3.7/socket.py", line 707, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
  File "/usr/lib/python3.7/socket.py", line 748, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

Scraping works as expected, but I can't find what causes this error. It doesn't depend on any of the possible arguments.

SyntaxError: invalid syntax

I get this syntax error while running main.py

  File "main.py", line 59
    print(f"[#] Processed {processed_books} books in {formatted_time}")
                                                                      ^
SyntaxError: invalid syntax

If I try commenting that line I get other syntax errors in other files... I guess there is something wrong with my configuration

FileNotFoundError

Program gets aborted at this particular book with FileNotFoundError

[.] Json dump for book https://www.blinkist.com/en/books/the-five-most-important-questions-you-will-ever-ask-about-your-organization-en already exists, skipping scraping...
[.] Downloading audio file for blink 0...
[.] Downloading audio file for blink 1...
[.] Downloading audio file for blink 2...
[.] Downloading audio file for blink 3...
[.] Downloading audio file for blink 4...
[.] Downloading audio file for blink 5...
[.] Downloading audio file for blink 6...
[.] Combining audio files for the-five-most-important-questions-you-will-ever-ask-about-your-organization-en
[!] ffmpeg output file longer than 260 characters. Trying shorter filename...
[.] Generating .html for the-five-most-important-questions-you-will-ever-ask-about-your-organization-en
Traceback (most recent call last):
  File "main.py", line 69, in <module>
    processed_books = process_book_json(book_json, processed_books)
  File "main.py", line 29, in process_book_json
    generator.generate_book_html(book_json)
  File "C:\Users\Fang Yuan\Downloads\blinkist-scraper-master\generator.py", line 42, in generate_book_html
    with open(html_file, 'w', encoding='utf-8') as outfile:
FileNotFoundError: [Errno 2] No such file or directory: 'books\Entrepreneurship\Peter F Drucker - The Five Most Important Questions You Will Ever Ask About Your Organization\Peter F Drucker - The Five Most Important Questions You Will Ever Ask About Your Organization.html'

Stuck in Cloudflare hCaptcha loop.

Hello and first of all thank you very much for your work!

It looks like this is exactly the code I was looking for, but unfortunately I'm not able to get it running, because I get stuck in an endless Cloudflare hCaptcha loop on https://www.blinkist.com/en/nc/login when trying to execute it for the first time.
The "One more step - Please complete the security check to access - I am human" page appears before entering the login information, and no matter how often I solve it, I always end up at the next captcha (tried it at least 9 times in a row).

My system:

  • Win 10
  • Chrome 87
  • Python 3.8
  • Venv with all requirements.txt modules installed.

I've already tried:

  • Running it on another Win 10 Laptop --> same problem
  • Different commands: python blinkistscraper email password / python main.py email password
  • Downloaded and specified ChromeDriver 87.0.4280.88 as argument
  • Downloaded Chrome 88 Beta and used ChromeDriver 88.0.4324.27
  • pip install --upgrade for all outdated modules
  • Different locations via VPN (Germany, Portugal and US)
  • Different Networks (DSL and Hotspot from Mobile Phone)
  • Ubuntu VM --> also getting stuck with the same problem

Unfortunately I don't have any other ideas at the moment and feel pretty lost/stupid.
Did you encounter this problem before and have an idea how to solve it?
Or are there some logfiles or something I can collect that might help in this case?

Thank you very much in advance!
Peter

cannot execute main.py

Hi. I'm really interested in using this script, but I have serious difficulties following the instructions in the README file.
Basic usage calls for python main.py, but there is no such file.
The only similar file is __main__.py in the blinkistscraper folder, but I guess it is not the right one.
Salvatores-MacBook-Pro:blinkistscraper salva$ python main.py user pass https://www.blinkist.com/en/nc/reader/the-box-en
  File "main.py", line 7
    log = logger.get(f"blinkistscraper")
                                       ^
SyntaxError: invalid syntax
Salvatores-MacBook-Pro:blinkistscraper salva$

Can you pls give more clear instruction on how to execute first command?
Thanks

blinkist-scraper no longer working - attribute error!?

Dear Leonardo,

Great script - I used it to back up some of my favorite books in January as epub and mp4. Worked like a charm.
Now it breaks shortly after initialising the (already updated) chromedriver:

C:\blinkistscraper>python main.py --books.txt --cooldown 5 --concat-audio --keep-noncat --save-cover --embed-cover-art
[13:10:21] INFO Starting scrape run...
[13:10:22] INFO Initialising chromedriver at ~\Python\Python38\lib\site-packages\chromedriver_autoinstaller\91\chromedriver.exe...

DevTools listening on ws://~
[13:10:52] INFO Logged into Blinkist. Loading Library...

Traceback (most recent call last):
  File "main.py", line 340, in <module>
    main()
  File "main.py", line 303, in main
    dump_exists = scrape_book(
  File "main.py", line 217, in scrape_book
    audio_files = scraper.scrape_book_audio(
  File "~\scraper.py", line 504, in scrape_book_audio
    del driver.requests
AttributeError: requests

Does anybody have an idea what the root cause is, or how to fix it?
Thanks in advance!

Thank you first :)


I thought the book Homo Deus (https://www.blinkist.com/en/nc/reader/homo-deus-en) was in my dump & books folders, but I couldn't find it.
The same happens with 2 other books by Yuval Noah Harari: Sapiens and 21 Lessons for the 21st Century.

Would it be possible to be solved? ... Thank you so so much >_<

Homo Deus (https://www.blinkist.com/en/nc/reader/homo-deus-en)
Sapiens (https://www.blinkist.com/en/nc/reader/sapiens-en)
21 Lessons for the 21st Century (https://www.blinkist.com/en/nc/reader/21-lessons-for-the-21st-century-en)

Script Throws Exception for audio download of blinks

Hello there,
I recently discovered this repository and started using it. I'm trying to download the audio blinks, but I keep getting the following exception.
Is there something I'm missing here, or is this for everyone?
Please advise what I can do to prevent this error.

$ python blinkistscraper --language en --match-language --audio myEmail myPassword
[15:41:38] INFO Starting scrape run...
[15:41:39] INFO Initialising chromedriver at C:\Users\Girish\AppData\Local\Programs\Python\Python37\lib\site-packages\chromedriver_autoinstaller\90\chromedriver.exe...
[15:41:48] ERROR Timeout waiting for ublock config overwrite alert
[15:42:03] INFO Logged into Blinkist. Loading Library...
[15:42:07] INFO Scraping categories: Entrepreneurship, Politics, Marketing & Sales, Science, Health & Nutrition, Personal Development, Economics, History, Communication Skills, Corporate Culture, Management & Leadership, Motivation & Inspiration, Money & Investments, Psychology, Productivity, Sex & Relationships, Technology & the Future, Mindfulness & Happiness, Parenting, Society & Culture, Nature & the Environment, Biography & Memoir, Career & Success, Education, Religion & Spirituality, Creativity, Philosophy
[15:42:07] INFO Getting all books for category Entrepreneurship...
[15:42:13] INFO Found 198 books
[15:42:13] ERROR requests
Traceback (most recent call last):
  File "blinkistscraper\__main__.py", line 412, in <module>
    main()
  File "blinkistscraper\__main__.py", line 373, in main
    match_language=match_language,
  File "blinkistscraper\__main__.py", line 258, in scrape_book
    driver, book_json, args.language
  File "blinkistscraper\scraper.py", line 504, in scrape_book_audio
    del driver.requests
AttributeError: requests
[15:42:13] CRITICAL Uncaught Exception. Exiting...


Running Into Chromedriver install error

Chromedriver is installed earlier in the process from the autoinstaller directory, then runs into another "missing" error when options.py is running. Unsure how to resolve.

Help is appreciated. Sorry if this is the wrong place to comment. This is my first time so any feedback is helpful.

$ python main.py --language en --categories market  --create-html --save-cover -v @@@@@ XXXXX
[15:45:32] INFO Starting scrape run...
[15:45:32] INFO Initialising chromedriver at C:\Python38\lib\site-packages\chromedriver_autoinstaller\84\chromedriver.exe...
Traceback (most recent call last):
  File "main.py", line 340, in <module>
    main()
  File "main.py", line 276, in main
    driver = scraper.initialize_driver(
  File "C:\Users\USER\Documents\GitHub\blinkist-scraper\blinkistscraper\scraper.py", line 67, in initialize_driver
    chrome_options.add_extension(
  File "C:\Python38\lib\site-packages\selenium\webdriver\chrome\options.py", line 131, in add_extension
    raise IOError("Path to the extension doesn't exist")
OSError: Path to the extension doesn't exist

Feature Request: Embed cover art into concat audio file

First, thanks for this! This is great.

Wondering though if it'd be possible to embed the album art into the .m4a file, or even (maybe both?) download the linked jpeg file from the generated HTML output into the containing folder?

reCaptcha not taking

See error message. This happens after chromedriver>cloudflare stage during initial login.


Can't get past Captcha

Hi there,
I can't get past the captcha. I tried all the options described in the other reported issues, but none seems to help. I'm choosing all the buses, boats and trains, but the captcha reappears over and over as if it did not work.
Could anyone please look into this please?
Thanks
SF.

Enable retries for audio scraping

Audio downloads seem to be a bit iffy - sometimes the script just fails to capture the audio request to the first chapter, or a request just times out. Those are handled gracefully but it means that no audio for that book is downloaded. Look into adding a --retry-limit flag so that in case of a failed audio download, the script waits a bit and retries as many times as defined by the flag.
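The proposed behaviour could be sketched as a small retry wrapper (illustrative only; neither the helper nor the --retry-limit flag exists yet):

```python
import time

def with_retries(download, attempts=3, cooldown=1.0):
    """Call `download` (a zero-argument callable) up to `attempts`
    times, sleeping `cooldown` seconds between failures -- the
    behaviour a --retry-limit flag would enable. Re-raises the last
    error if every attempt fails."""
    last_error = None
    for i in range(attempts):
        try:
            return download()
        except Exception as exc:  # in practice: timeouts / missed requests
            last_error = exc
            if i < attempts - 1:
                time.sleep(cooldown)
    raise last_error
```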

Not all Blinkist books can be Downloaded

I used this library about a year ago and it worked like a charm, but there is a little problem: it uses Blinkist categories to download the blinks, and I noticed that not all the available books are listed in categories - around half of them are missing.
I suggest that instead of using categories to download all the books, we use the Blinkist sitemap, which you can conveniently find at https://www.blinkist.com/en/sitemap. I don't have the knowledge to solve this issue myself, but I hope somebody can do it.
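The sitemap approach suggested here could be sketched like this (book_urls_from_sitemap is a hypothetical helper, and filtering on "/books/" is an assumption about the sitemap's URL shape):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def book_urls_from_sitemap(xml_text):
    """Extract book URLs from a standard sitemap document -- the
    approach this issue suggests for catching books that appear in
    no category."""
    root = ET.fromstring(xml_text)
    urls = [loc.text for loc in root.iter(f"{SITEMAP_NS}loc")]
    return [u for u in urls if "/books/" in u]
```

The resulting list could then be saved to a text file and fed to the script with the --books argument.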

German downloads

Is there any way to download German books and audios as well?
Could you add another flag to choose the language maybe?

I also ran into JSONDecodeErrors at some point in the program. I don't know when and why this happens exactly.

[.] Scraping book at https://www.blinkist.com/en/books/disrupted-en

Traceback (most recent call last):
  File ".\main.py", line 62, in <module>
    audio_files = scraper.scrape_book_audio(driver, book_json)
  File "[...]\scraper.py", line 223, in scrape_book_audio
    audio_url = audio_request.json()['url']
  File "[...]\Anaconda3\lib\site-packages\requests\models.py", line 889, in json
    self.content.decode(encoding), **kwargs
  File "[...]\Anaconda3\lib\json\__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "[...]\Anaconda3\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "[...]\Anaconda3\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

PS: you should add the Chrome browser to the dependencies, as chromedriver needs it to be installed.

stuck on recaptcha

The recaptcha stage carries on forever - it keeps asking to select images in a loop... does recaptcha detect more than we know?

scrape my library

Would it be possible to scrape my whole library using an argument like "library", and eventually skip what has already been scraped? This way I could set up a cron job, and every time there is a new book in my library it would be downloaded.

infinite captcha: bug or scraper not working any more?

Hi all,

I have just tried the blinkist scraper with my free account. It does not work any more; it just pops up the captcha window. I solve the captchas many times, then the captcha window crashes and I get the following messages:

INFO Please solve captcha to proceed!
ERROR Error. Captcha needs to be solved within 1 minute
ERROR Unable to login into Blinkist
INFO Processed 0 books in 00:01:10

Any hints?

Infinite captcha

Hi! Thanks for amazing repo!
When I run python blinkistscraper email pass, Chrome asks to solve the captcha. After solving the first captcha the logs print INFO Logged into Blinkist. Loading Library... but nothing happens, and Chrome repeatedly asks to pass captchas again and again.
How to solve this issue?

Issue with --concat-audio

For a lot of books the audio blinks download fine, but then when it comes to combining them something must error, and all the audio files get deleted.

string indices must be integers

idk how to describe this, but from one day to the other I get this:

DevTools listening on ws://127.0.0.1:58651/devtools/browser/ef75920d-da4e-4d69-be34-8a83fa0e3391
[13:09:45] INFO Please solve captcha to proceed!
[13:10:13] INFO Logged into Blinkist. Loading Library...
[13:10:13] ERROR string indices must be integers
Traceback (most recent call last):
  File "C:\Users\x8251\Blinkist Scraper\blinkistscraper\__main__.py", line 412, in <module>
    main()
  File "C:\Users\x8251\Blinkist Scraper\blinkistscraper\__main__.py", line 336, in main
    scrape_book(
  File "C:\Users\x8251\Blinkist Scraper\blinkistscraper\__main__.py", line 248, in scrape_book
    book_json, dump_exists = scraper.scrape_book_data(
  File "C:\Users\x8251\Blinkist Scraper\blinkistscraper\scraper.py", line 391, in scrape_book_data
    if os.path.exists(get_book_dump_filename(book_url)) and not force:
  File "C:\Users\x8251\Blinkist Scraper\blinkistscraper\utils.py", line 23, in get_book_dump_filename
    return os.path.join("dump", book_json_or_url["slug"] + ".json")
TypeError: string indices must be integers
[13:10:13] CRITICAL Uncaught Exception. Exiting...
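The traceback points at get_book_dump_filename in utils.py: scraper.py calls it with a plain URL string, but the function indexes its argument like the parsed book JSON (book_json_or_url["slug"]), which is exactly what raises this TypeError. A defensive version, reconstructed only from the names visible in the traceback (not the project's actual fix), could branch on the argument type:

```python
import os

def get_book_dump_filename(book_json_or_url):
    """Accept either the parsed book JSON (a dict) or the book's URL (a str)."""
    if isinstance(book_json_or_url, dict):
        slug = book_json_or_url["slug"]
    else:
        # A reader URL ends in the slug, e.g. .../reader/on-saudi-arabia-en
        slug = book_json_or_url.rstrip("/").rsplit("/", 1)[-1]
    return os.path.join("dump", slug + ".json")
```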

Captcha taking longer than expected

I am experiencing "This is taking longer than expected; please reload the page."

Not sure what's wrong. It happened on the third captcha.

Chromedriver issues on raspberry - unknown error: DevToolsActivePort file doesn't exist

I'm trying to run it on my Raspberry Pi 4, but it seems it does not like the chromedriver.
I had to pass the chromedriver location with the --chromedriver argument, and I also tried the latest build from https://github.com/electron/electron/releases, but with the same results.

Any idea pls?

pi@raspberrypi:~/blinkist-scraper $ python3 blinkistscraper/main.py user pass --book https://www.blinkist.com/en/nc/reader/on-saudi-arabia-en --chromedriver /usr/bin/chromedriver
[18:30:10] INFO Starting scrape run...
[18:30:10] INFO Initialising chromedriver at /usr/bin/chromedriver...
Traceback (most recent call last):
  File "blinkistscraper/main.py", line 340, in <module>
    main()
  File "blinkistscraper/main.py", line 279, in main
    chromedriver_path=args.chromedriver,
  File "/home/pi/blinkist-scraper/blinkistscraper/scraper.py", line 81, in initialize_driver
    options=chrome_options,
  File "/home/pi/.local/lib/python3.7/site-packages/seleniumwire/webdriver/browser.py", line 86, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/selenium/webdriver/chrome/webdriver.py", line 81, in __init__
    desired_capabilities=desired_capabilities)
  File "/usr/local/lib/python3.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
    self.start_session(capabilities, browser_profile)
  File "/usr/local/lib/python3.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/usr/local/lib/python3.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python3.7/dist-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally
  (unknown error: DevToolsActivePort file doesn't exist)
  (The process started from chrome location /usr/bin/chromium-browser is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
  (Driver info: chromedriver=78.0.3904.108 (4b26898a39ee037623a72fcfb77279fce0e7d648-refs/branch-heads/3904@{#889}),platform=Linux 5.4.51-v7l+ armv7l)
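The "DevToolsActivePort file doesn't exist" error usually means Chrome itself crashed at startup rather than chromedriver being wrong. On a Pi (or any root/headless/container setup) the common workaround is to launch Chrome with the sandbox and /dev/shm usage disabled; the script's own --no-sandbox flag covers the first. A sketch of the relevant flags, which are standard Chromium command-line switches:

```python
# Chromium startup flags that commonly fix this crash on a Raspberry Pi:
CHROME_ARGS = [
    "--no-sandbox",             # Chrome refuses to start its sandbox when run as root
    "--disable-dev-shm-usage",  # /dev/shm is small on the Pi; spill to /tmp instead
    "--headless",               # no display attached on a headless Pi
]

# With Selenium these would be applied roughly like:
#   options = webdriver.ChromeOptions()
#   for arg in CHROME_ARGS:
#       options.add_argument(arg)
#   driver = webdriver.Chrome("/usr/bin/chromedriver", options=options)
```

Also make sure the chromedriver build matches the installed Chromium version and the ARM architecture, since a mismatch produces the same "Chrome failed to start" message.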

502 Bad Gateway

When chromedriver opens, it gives the following response; opening any site in it gives the same response (my usual Chrome works fine):

Error response
Error code: 502

Message: Bad Gateway.

Error code explanation: 502 - Invalid responses from another server/proxy.

Cloudflare blocking scraping

Hello, I have just started using this library and everything seems to be correctly set up. I ran python blinkistscraper email password with my credentials, but Cloudflare unfortunately detects (I assume) automated activity and blocks me from navigating to Blinkist.com in the browser instance opened by the script.

Any ideas?

Improve mac / linux compatibility

Right now the script is hardcoded to use chromedriver.exe, which obviously only works on Windows. We should make the chromedriver path a proper argument so people can use the Windows, macOS, or Linux executable as well.
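Beyond making the path an argument, the default executable name can be chosen per platform. A minimal sketch using the standard library; chromedriver_filename is a hypothetical helper name, not existing project code:

```python
import platform

def chromedriver_filename():
    """Return the platform-appropriate chromedriver executable name."""
    # Only Windows uses the .exe suffix; macOS and Linux both use a bare name.
    return "chromedriver.exe" if platform.system() == "Windows" else "chromedriver"
```

This would let the script fall back to a sensible default on any OS when the user doesn't pass an explicit --chromedriver path.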
