dipu-bd / lightnovel-crawler
Generate and download e-books from online sources.
Home Page: https://pypi.org/project/lightnovel-crawler/
License: GNU General Public License v3.0
While using wuxiaworld.co as the source for a search, the bot generated this error:
raise Exception('No results for: %s' % self.user_input)
If I choose "custom range index" or "custom range url", the error message "❗ Error: 'NoneType' object is not iterable" is generated.
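Both failures above come from the same weak spot: a source's search can raise or return None, and the caller then iterates over the result. A minimal defensive wrapper could look like this — `safe_search` and the assumption that crawlers expose a `search_novel(query)` method are illustrative, not the project's confirmed API:

```python
def safe_search(crawler, query):
    """Hypothetical wrapper around a crawler's search method: return an
    empty list instead of raising or returning None, so both the
    'No results for: ...' exception and the
    "'NoneType' object is not iterable" crash are avoided."""
    try:
        results = crawler.search_novel(query)
    except Exception:
        results = None
    return results or []
```

The caller can then always iterate the result, and "no results" becomes an ordinary empty list rather than a crash.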
The readlightnovel.org crawler is duplicating the text. There is a hidden div in the HTML source which does not contain ads, and a visible div. The crawler gets both, and it is not filtering the ads sub-div either.
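One way to fix this is to strip the hidden duplicate and the ad containers before extracting text. This sketch uses BeautifulSoup (a dependency the project already uses for parsing); the CSS selectors are assumptions about the site's markup, not verified against readlightnovel.org:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def clean_chapter(html):
    """Drop the hidden duplicate copy and obvious ad containers before
    extracting the chapter text. Selector patterns are illustrative."""
    soup = BeautifulSoup(html, 'html.parser')
    # The duplicate copy is hidden with an inline style
    for tag in soup.select('[style*="display:none"], [style*="display: none"]'):
        tag.decompose()
    # Ad containers (hypothetical class/id patterns)
    for tag in soup.select('[class*="ads"], [id*="ads"]'):
        tag.decompose()
    return soup.get_text(separator='\n', strip=True)
```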
Installed using pip on Windows. Created the folders manually. See #10 .
.\ebook_crawler.exe webnovel 8093990805004205
Getting CSRF Token from https://www.webnovel.com/book/8093990805004205
CSRF Token = LI1UErwQVWiDGnvCmmpgBamfOpavmDGUCZJqglkP
Getting book name and chapter list...
1646 chapters found
Traceback (most recent call last):
File "c:\program files\python36\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "c:\program files\python36\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\Karl\AppData\Roaming\Python\Python36\Scripts\ebook_crawler.exe\__main__.py", line 9, in <module>
File "C:\Users\Karl\AppData\Roaming\Python\Python36\site-packages\ebook_crawler\__init__.py", line 34, in main
end_chapter=sys.argv[4] if len(sys.argv) > 4 else ''
File "C:\Users\Karl\AppData\Roaming\Python\Python36\site-packages\ebook_crawler\webnovel.py", line 44, in start
novel_to_kindle(self.output_path)
File "C:\Users\Karl\AppData\Roaming\Python\Python36\site-packages\ebook_crawler\binding.py", line 83, in novel_to_kindle
for file_name in sorted(os.listdir(output_path)):
UnboundLocalError: local variable 'output_path' referenced before assignment
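The crash above means `novel_to_kindle` reads `output_path` before anything assigned it. A minimal guard, shown here as a sketch (the real function body is elided; only the validation is the point), would fail fast with a clear message instead:

```python
import os

def novel_to_kindle(output_path=None):
    """Sketch of a guard for the crash above: validate the path up front
    and raise a clear error instead of an UnboundLocalError."""
    if not output_path or not os.path.isdir(output_path):
        raise ValueError(
            'output_path is missing or does not exist: %r' % output_path)
    for file_name in sorted(os.listdir(output_path)):
        pass  # ... bind each file as the real implementation does
```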
Having 4 sources available is really amazing, but more sources would be great. I think wuxiaworld.co and novelplanet are perfect candidates for new sources to be packed into epub.
Make it more intelligent by following this workflow:
Implemented it for:
X represents sites that do not support searching, or where searching could not be implemented.
This has happened multiple times now: it fails to create everything except the HTML and text files.
Please add support for readlightnovel.org
Hi, a beginner here.
Can you please provide instructions for Mac users? Thank you.
Hi Sudipto,
After trying and testing the Telegram bot, I found that after the zip file upload finishes, the session is not closed and the crawler instance is not destroyed. So when I call /start again, I have to call /cancel first before the bot can accept a new job. Is that the right flow?
The other problem I found is that even after calling /cancel, when I call /start again, the volume and chapter numbers are counted cumulatively (past session + this session).
Is the crawler instance shared between sessions?
While trying to create an epub from webnovel, I got the error "invalid literal for int() with base 10":
python3 main.py webnovel 10377938706023605 https://www.webnovel.com/book/10377938706023605/27858104469219628/Last-Wish-System/Yale-Roanmad https://www.webnovel.com/book/10377938706023605/30154170363336851/Last-Wish-System/Crossing-the-Border
Getting CSRF Token from https://www.webnovel.com/book/10377938706023605
CSRF Token = 9eJJFX5txT0r9s3004p1rDY61DZrTfvslGGHmp61
Getting book name and chapter list...
148 chapters found
Traceback (most recent call last):
File "main.py", line 2, in <module>
main()
File "/home/yudi/book/Web Scrapper/ebook_crawler/__init__.py", line 34, in main
end_chapter=sys.argv[4] if len(sys.argv) > 4 else ''
File "/home/yudi/book/Web Scrapper/ebook_crawler/webnovel.py", line 43, in start
self.get_chapter_bodies()
File "/home/yudi/book/Web Scrapper/ebook_crawler/webnovel.py", line 83, in get_chapter_bodies
start = int(self.start_chapter)
ValueError: invalid literal for int() with base 10: 'https://www.webnovel.com/book/10377938706023605/27858104469219628/Last-Wish-System/Yale-Roanmad'
Thanks for helping
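The traceback shows the cause: the start/end arguments may be full chapter URLs, but the code does `int(self.start_chapter)` directly. A sketch that accepts either form — the URL pattern is an assumption based on the webnovel links in this report, and `parse_chapter_arg` is a hypothetical helper:

```python
import re

def parse_chapter_arg(arg):
    """Accept either a plain chapter index or a full webnovel chapter URL
    like https://www.webnovel.com/book/<book_id>/<chapter_id>/..."""
    if arg.isdigit():
        return int(arg)
    match = re.search(r'/book/\d+/(\d+)', arg)
    if match:
        return int(match.group(1))
    raise ValueError(
        'Not a chapter number or a recognizable chapter URL: %s' % arg)
```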
I think adding styling to the generated epub would be interesting, so the generated epub can look prettier.
The new format for WW novels is
http://www.wuxiaworld.com/novel/desolate-era/de-book-x-chapter-x/
with the old format being
http://www.wuxiaworld.com/desolate-era-index/de-book-x-chapter-x/
Can you edit the wuxia code to change that?
When I use the Discord bot, if I ask it to generate a book for a novel with more than 200 chapters, it usually generates an error like this:
[ERROR] (asyncio) Task was destroyed but it is pending! task: <Task pending coro=<Client._run_event() running at /home/yudi/.local/lib/python3.6/site-packages/discord/client.py:307> wait_for=<Future pending cb=[BaseSelectorEventLoop._sock_connect_done(15)(), <TaskWakeupMethWrapper object at 0x7f52f3c029a8>()]>>
and the bot gets destroyed (closed). But I don't think this happens with the lightnovel-crawler bot linked in the readme. Is there something I'm missing when deploying the Discord bot?
The Discord bot works great in 1-on-1 chat, but not in channel chat, because it cannot read a novel URL or search item there. Unlike the Telegram bot, Discord channels have no reply feature. Maybe we need a shorter command format for chat in channels, while retaining the conversational flow for 1-on-1 requests. A shorter command would be similar to the console arguments, for example:
For searching:
!lncrawl search novel_title novel_source → to generate the novel url
For generating a book:
!lncrawl format_book novel_url pack_by_volume all → to generate all chapters in format_book format
etc.
All arguments should be optional. If the arguments are not valid, the current interactive interface will be shown.
This is a good start: https://codeburst.io/building-beautiful-command-line-interfaces-with-python-26c7e1bb54df
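The "all arguments optional, fall back to interactive" behaviour can be sketched with the standard library's argparse. Flag names here are illustrative, not the project's actual interface:

```python
import argparse

def build_parser():
    """Minimal sketch of an all-optional CLI."""
    parser = argparse.ArgumentParser(prog='lncrawl')
    parser.add_argument('-s', '--source', help='profile page url of the novel')
    parser.add_argument('-q', '--query', help='novel name to search for')
    parser.add_argument('--first', type=int, metavar='N',
                        help='download the first N chapters')
    return parser

def run(argv):
    """Fall back to the interactive interface when no usable flags are given."""
    args = build_parser().parse_args(argv)
    if not (args.source or args.query):
        return 'interactive'  # show the current interactive interface
    return 'batch'
```

Invalid or missing flags simply leave `args.source`/`args.query` empty, so the existing interactive flow is shown, matching the behaviour described above.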
I was wondering if we should add a function to translate the text scraped from the novel source into a chosen language; maybe we can use the googletrans API for this. Hopefully this program can surprise readers in many different languages.
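A translation step could be kept backend-agnostic by injecting the translator as a callable — this keeps googletrans (or any replacement) out of the core pipeline. `translate_chapters` is a hypothetical helper, not an existing function of this project:

```python
def translate_chapters(chapters, translate, dest='id'):
    """Run every chapter body through a pluggable translate(text, dest)
    callable. With googletrans that callable could wrap
    Translator().translate(text, dest=dest).text; any backend with the
    same shape would work."""
    return [
        {'title': chapter['title'], 'body': translate(chapter['body'], dest)}
        for chapter in chapters
    ]
```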
I've noticed it isn't supported; I was wondering if you could add it. It has a very similar layout to some of the other supported sites, like BoxNovel, ReadLightNovel, and NovelPlanet.
Created: I’m in Hollywood.epub
Failed to generate mobi for I’m in Hollywood.epub
Traceback (most recent call last):
File "c:\program files (x86)\python36-32\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "c:\program files (x86)\python36-32\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Program Files (x86)\Python36-32\Scripts\lightnovel-crawler.exe\__main__.py", line 9, in <module>
File "c:\program files (x86)\python36-32\lib\site-packages\lightnovel_crawler\__init__.py", line 65, in main
start_app(crawler_list)
File "c:\program files (x86)\python36-32\lib\site-packages\lightnovel_crawler\app\__init__.py", line 34, in start_app
Program().run(crawler())
File "c:\program files (x86)\python36-32\lib\site-packages\lightnovel_crawler\app\program.py", line 38, in run
bind_books(self)
File "c:\program files (x86)\python36-32\lib\site-packages\lightnovel_crawler\app\bind_books.py", line 56, in bind_books
file.write(text)
UnicodeEncodeError: 'cp932' codec can't encode character '\xa0' in position 48: illegal multibyte sequence
Here is an interesting project for this: https://github.com/ChrisKnott/Eel
When I compile chapters from LNMTL, the chapter numbers don't match what they actually are on the site. For instance, I downloaded the last 10 chapters of Martial God Asura and it had them as 3123-3132 when they should be 3620-3629, which is what they are on the site. The chapters themselves match, just not the chapter numbers.
Hi, Sudipto
Recently I needed to add 2 sources to this great project. Both are Indonesian-language machine-translation novel providers: lnindo and idqidian. Both have quite a large collection. Some of my friends asked me to create epubs from those sites. I have already created and tested it. May I add it in a pull request?
Best regards,
Yudi Lee
I'm trying to integrate and automate using the Windows version of the crawler. When I try to batch the executable, it stops after each execution because the "Enter" key must be pressed. I tried the -f and --suppress flags, but execution still requires the "Enter" key press. Adding a flag/parameter for this would help automation.
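The fix is a guard around the final prompt. This sketch assumes a `suppress` flag is threaded through from the CLI (that wiring is hypothetical); it also skips the prompt when there is no interactive terminal, which covers batch files automatically:

```python
import sys

def pause_before_exit(suppress=False):
    """Skip the final 'Press Enter' prompt when a suppress flag is set or
    when stdin is not an interactive terminal, so scripts do not hang."""
    if suppress or not sys.stdin.isatty():
        return False  # did not pause
    input('Press Enter to exit...')
    return True
```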
LNMTL doesn't seem to be working. For example, I tried the last 10 chapters of Martial God Asura and it came back with this:
"? Enter an url or novel name to find: https://lnmtl.com/novel/martial-god-asura
Retrieving novel info...
NOVEL: Martial God Asura
? Which chapters to download? Last 10 chapters
Getting cover image...
Body is empty: https://lnmtl.com/chapter/martial-god-asura-book-8-chapter-3623
Body is empty: https://lnmtl.com/chapter/martial-god-asura-book-8-chapter-3621
Body is empty: https://lnmtl.com/chapter/martial-god-asura-book-8-chapter-3624
Body is empty: https://lnmtl.com/chapter/martial-god-asura-book-8-chapter-3620
Body is empty: https://lnmtl.com/chapter/martial-god-asura-book-8-chapter-3622
Body is empty: https://lnmtl.com/chapter/martial-god-asura-book-8-chapter-3625
Body is empty: https://lnmtl.com/chapter/martial-god-asura-book-8-chapter-3628
Body is empty: https://lnmtl.com/chapter/martial-god-asura-book-8-chapter-3627
Body is empty: https://lnmtl.com/chapter/martial-god-asura-book-8-chapter-3626
Body is empty: https://lnmtl.com/chapter/martial-god-asura-book-8-chapter-3629
Downloading chapters |████████████████████████████████| 10/10
Created: 10 text files
Created: 10 html files
Created: Martial God Asura.epub
Created: Martial God Asura.mobi"
And the files contain no text. I have been able to get other sites to work fine, just not LNMTL.
I have been trying with and without the login option. If it works, can you please give an example of the commands to get it working? Thanks.
I've noticed it before but thought it was a one-off. Sometimes I get this kind of error/message in my books.
This is a sentence in the book. "Hello, I'm an example." said Me.
code from sekindo - Readlightnovel.org In-article - outstream
code from sekindo
/339474670/ReadLightNovel/InStory_1
This is a sentence in the book. "Hello, I'm an example." said Me.
As you can see, I get this weird message and duplicated sentences. There are others throughout at random points as well, which I've listed below.
/339474670/ReadLightNovel/InStory_3
/339474670/ReadLightNovel/InStory_2
/339474670/ReadLightNovel/BottomStory
I don't know what causes it. I was downloading this novel: https://www.readlightnovel.org/mo-tian-ji
If you need any more info, let me know and I will try to provide it.
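A text-level cleanup pass could catch this even when the HTML filtering misses an ad block. This sketch drops lines containing known ad markers (the marker list comes from the samples above and is surely incomplete) and collapses the adjacent duplicate copy the ad block leaves behind; only consecutive duplicates are removed, so legitimately repeated lines elsewhere in the book survive:

```python
def strip_ad_lines(paragraphs, ad_markers=('sekindo', 'InStory', 'BottomStory')):
    """Remove ad-marker lines and the adjacent duplicate paragraph that
    the injected ad block produces."""
    cleaned = []
    for paragraph in paragraphs:
        text = paragraph.strip()
        if not text or any(marker in text for marker in ad_markers):
            continue
        if cleaned and cleaned[-1] == text:
            continue  # adjacent duplicate left by the removed ad block
        cleaned.append(text)
    return cleaned
```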
This is gonna be painful... but necessary.
Many source pages have a synopsis and info about the book. Maybe we can crawl that and add intro pages to the generated book, perhaps also adding a note that the book was generated using this script, etc.
For issue #52, we can add an uploader for Google Drive and share the Google Drive link via message. I have already created a function to do that in my forked repository. Should I open a pull request against master or another branch?
I'm trying to download from this url https://lnmtl.com/novel/forty-millenniums-of-cultivation
This is one of the novels on the site that needs logging in to read.
exact command I'm using
lncrawl --login <username> <password> -s https://lnmtl.com/novel/forty-millenniums-of-cultivation
username and password redacted for obvious reasons.
followed onscreen prompts for output directory and selected option for first 10 chapters
"body is empty" for every chapter; the generated file are indeed empty
lncrawl --version
returns 2.7.6
Also, I tried downloading the above novel by just going through the prompts instead of using the option flags, and I never got prompted to log in despite the novel requiring it.
Processing: _novel\8093990805004205
!! Failed to bind: _novel\8093990805004205
This is when providing the chapter numbers.
python3 main.py webnovel 8093990805004205
Getting CSRF Token from https://www.webnovel.com/book/8093990805004205
CSRF Token = BEwNDFH7yoADt2uvgWH9Y2ZdxMHKanalcugCh9WI
Getting book name and chapter list...
1646 chapters found
Processing: _novel\8093990805004205
Creating: NA\NA_v.epub
Amazon kindlegen(Windows) V2.9 build 1029-0897292
A command line e-book compiler
Copyright Amazon.com and its Affiliates 2014
Info:I9007:option: -c2: Kindle Huffdic compression
Error(opfparser):E20004: the id in the spine does not match any item in the manifest: cover
That is without giving chapter numbers.
Visiting: https://lnmtl.com/chapter/the-amber-sword-book-3-chapter-531-1
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "./__main__.py", line 74, in <module>
main()
File "./__main__.py", line 28, in main
end_url=sys.argv[4] if len(sys.argv) > 4 else ''
File "./EbookCrawler/lnmtl.py", line 57, in start
self.crawl_chapters(browser)
File "./EbookCrawler/lnmtl.py", line 89, in crawl_chapters
self.parse_chapter(browser)
File "./EbookCrawler/lnmtl.py", line 112, in parse_chapter
chapter_no = re.search(r'chapter-\d+$', url).group().strip('chapter-')
AttributeError: 'NoneType' object has no attribute 'group'
There are two chapters labeled # 531 on the website.
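The crash happens because `re.search(r'chapter-\d+$', url)` returns None for split chapters like `chapter-531-1`, and `.group()` is then called on None. A sketch that handles the optional sub-chapter suffix (returning e.g. '531.1' is just one possible convention):

```python
import re

def chapter_no(url):
    r"""Extract the chapter number, handling split chapters like
    '...chapter-531-1' that the original pattern r'chapter-\d+$' misses."""
    match = re.search(r'chapter-(\d+)(?:-(\d+))?$', url)
    if not match:
        return None
    major, minor = match.groups()
    return '%s.%s' % (major, minor) if minor else major
```

Returning None for unmatched URLs also lets the caller skip a page gracefully instead of raising AttributeError.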
The readlightnovel.org crawler is not crawling the title of each chapter. Instead of the title, there is a "dot".
Tested on:
https://www.readlightnovel.org/full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband
Discord won't allow uploading files larger than 8 MB, but most of the compressed files are larger than that. We need another way to send files to Discord.
The provided Windows binary simply exits after entering the login information, showing your GitHub link as a reference.
Checked on a fresh install of Windows with no Python or any other possible dependency installed, just pure Windows.
UPDATE ---
Calling the exe with all the proper parameters instead of using the interactive menu seems to make it work absolutely fine.
While converting a scraper to the new style, I found that in the new-style template the chapter title is scraped from the TOC page, not from the chapter page. In the old template the crawler got it from the chapter page. On some sources, the TOC lists the title only as a number (e.g. BoxNovel and NovelPlanet). I think we need to get the chapter title from the chapter page rather than from the TOC page.
All chapters download fine, but some chapters are missing from the output formats when opting to generate separate files for each volume.
on version 2.7.7
was downloading https://lnmtl.com/novel/forty-millenniums-of-cultivation
opted to generate separate file for each volume
should have 34 volumes
only 33 volumes (for all output formats except json)
with the json format
all the volume_title fields show 1 volume higher than they are supposed to be (except the very last volume)
ex. the json file for the very first chapter:
{"id": 1, "url": "https://lnmtl.com/chapter/forty-millenniums-of-cultivation-chapter-1", "volume": 1, "title": "Chapter #1 - Magical Artifact Graveyard", "volume_title": "Volume 2",...
Laptop@Lenovo ~/novel
$ ebook_crawler webnovel 7931338406001705 1 10 false
Traceback (most recent call last):
File "c:\users\laptop\appdata\local\programs\python\python36\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "c:\users\laptop\appdata\local\programs\python\python36\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\laptop\AppData\Roaming\Python\Python36\Scripts\ebook_crawler.exe\__main__.py", line 9, in <module>
File "C:\Users\laptop\AppData\Roaming\Python\Python36\site-packages\ebook_crawler\__init__.py", line 44, in main
volume=volume,
File "C:\Users\laptop\AppData\Roaming\Python\Python36\site-packages\ebook_crawler\webnovel.py", line 49, in start
novel_to_mobi(self.output_path)
File "C:\Users\laptop\AppData\Roaming\Python\Python36\site-packages\ebook_crawler\binding.py", line 110, in novel_to_mobi
for file_name in sorted(os.listdir(epub_path)):
FileNotFoundError: [WinError 3] The system cannot find the path specified: '7931338406001705\\epub'
Getting CSRF Token from https://www.webnovel.com/book/7931338406001705
CSRF Token = mNpxvSI9mDTA9EixtmCelAFwmC3Ifgv4TzZOQRqM
Getting book name and chapter list...
1159 chapters found
7931338406001705 does not exist
I get this error. I am using Cygwin 64-bit on Windows 10. Installed using "pip install ebook-crawler".
I think there is a problem with readlightnovel.org;
the rest of the sites are okay.
I tried: book_crawler readln full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband https://www.readlightnovel.org/full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband/chapter-1830 https://www.readlightnovel.org/full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband/chapter-1835
And this is what I get from the terminal:
Visiting https://www.readlightnovel.org/full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband
Getting book name and chapter list...
[Full Marks Hidden Marriage: Pick Up a Son, Get a Free Husband] 1842 chapters found
Downloading https://www.readlightnovel.org/full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband/chapter-1830
Downloading https://www.readlightnovel.org/full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband/chapter-1831
Downloading https://www.readlightnovel.org/full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband/chapter-1832
Downloading https://www.readlightnovel.org/full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband/chapter-1833
Downloading https://www.readlightnovel.org/full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband/chapter-1834
Saving Full Marks Hidden Marriage Pick Up a Son, Get a Free Husband/json/19/01836.json
Saving Full Marks Hidden Marriage Pick Up a Son, Get a Free Husband/json/19/01835.json
Downloading https://www.readlightnovel.org/full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband/chapter-1835
Downloading https://www.readlightnovel.org/full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband/chapter-1841
Saving Full Marks Hidden Marriage Pick Up a Son, Get a Free Husband/json/19/01834.json
Saving Full Marks Hidden Marriage Pick Up a Son, Get a Free Husband/json/19/01837.json
Saving Full Marks Hidden Marriage Pick Up a Son, Get a Free Husband/json/19/01833.json
Saving Full Marks Hidden Marriage Pick Up a Son, Get a Free Husband/json/19/01838.json
Saving Full Marks Hidden Marriage Pick Up a Son, Get a Free Husband/json/19/01839.json
complete
Processing: Full Marks Hidden Marriage Pick Up a Son, Get a Free Husband/json/19
Creating: Full Marks Hidden Marriage Pick Up a Son, Get a Free Husband/epub/Full Marks Hidden Marriage: Pick Up a Son, Get a Free Husband_v19.epub
Traceback (most recent call last):
File "/anaconda3/lib/python3.5/site-packages/ebook_crawler/binding.py", line 128, in novel_to_mobi
generator(KINDLEGEN_PATH_MAC)
File "/anaconda3/lib/python3.5/site-packages/ebook_crawler/binding.py", line 123, in
generator = lambda kindlegen: call([kindlegen, epub_file])
File "/anaconda3/lib/python3.5/subprocess.py", line 247, in call
with Popen(*popenargs, **kwargs) as p:
File "/anaconda3/lib/python3.5/subprocess.py", line 676, in init
restore_signals, start_new_session)
File "/anaconda3/lib/python3.5/subprocess.py", line 1289, in _execute_child
raise child_exception_type(errno_num, err_msg)
PermissionError: [Errno 13] Permission denied
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/anaconda3/lib/python3.5/site-packages/ebook_crawler/binding.py", line 136, in novel_to_mobi
generator('kindlegen')
File "/anaconda3/lib/python3.5/site-packages/ebook_crawler/binding.py", line 123, in
generator = lambda kindlegen: call([kindlegen, epub_file])
File "/anaconda3/lib/python3.5/subprocess.py", line 247, in call
with Popen(*popenargs, **kwargs) as p:
File "/anaconda3/lib/python3.5/subprocess.py", line 676, in init
restore_signals, start_new_session)
File "/anaconda3/lib/python3.5/subprocess.py", line 1289, in _execute_child
raise child_exception_type(errno_num, err_msg)
FileNotFoundError: [Errno 2] No such file or directory: 'kindlegen'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/anaconda3/bin/ebook_crawler", line 11, in
sys.exit(main())
File "/anaconda3/lib/python3.5/site-packages/ebook_crawler/init.py", line 51, in main
volume=volume,
File "/anaconda3/lib/python3.5/site-packages/ebook_crawler/readln.py", line 49, in start
novel_to_mobi(self.output_path)
File "/anaconda3/lib/python3.5/site-packages/ebook_crawler/binding.py", line 138, in novel_to_mobi
if err[1].errno == errno.ENOENT:
TypeError: 'FileNotFoundError' object is not subscriptable
In the end, the result I get is only an epub with all the titles but no content.
The same happens when installed using Python or when running from source.
Convert all old-style crawlers to the new simplified style, and add them to ebook_crawler/__init__.py