dipu-bd / lightnovel-crawler
Generate and download e-books from online sources.
Home Page: https://pypi.org/project/lightnovel-crawler/
License: GNU General Public License v3.0
While using wuxiaworld.co as the source for a search, the bot generated this error:
raise Exception('No results for: %s' % self.user_input)
If I choose "custom range index" or "custom range url", the error message "❗ Error: 'NoneType' object is not iterable" is generated.
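Both failures above come from the same weak spot: a source's search can raise or return None, and the caller then iterates over the result. A minimal defensive wrapper could look like this — `safe_search` and the assumption that crawlers expose a `search_novel(query)` method are illustrative, not the project's confirmed API:

```python
def safe_search(crawler, query):
    """Hypothetical wrapper around a crawler's search method: return an
    empty list instead of raising or returning None, so both the
    'No results for: ...' exception and the
    "'NoneType' object is not iterable" crash are avoided."""
    try:
        results = crawler.search_novel(query)
    except Exception:
        results = None
    return results or []
```

The caller can then always iterate the result, and "no results" becomes an ordinary empty list rather than a crash.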
The readlightnovel.org crawler is duplicating the text. There is a hidden div in the HTML source which does not contain ads, and a visible div. The crawler gets both, and it is not filtering the ads sub-div either.
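One way to fix this is to strip the hidden duplicate and the ad containers before extracting text. This sketch uses BeautifulSoup (a dependency the project already uses for parsing); the CSS selectors are assumptions about the site's markup, not verified against readlightnovel.org:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def clean_chapter(html):
    """Drop the hidden duplicate copy and obvious ad containers before
    extracting the chapter text. Selector patterns are illustrative."""
    soup = BeautifulSoup(html, 'html.parser')
    # The duplicate copy is hidden with an inline style
    for tag in soup.select('[style*="display:none"], [style*="display: none"]'):
        tag.decompose()
    # Ad containers (hypothetical class/id patterns)
    for tag in soup.select('[class*="ads"], [id*="ads"]'):
        tag.decompose()
    return soup.get_text(separator='\n', strip=True)
```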
Installed using pip on Windows. Created the folders manually. See #10 .
.\ebook_crawler.exe webnovel 8093990805004205
Getting CSRF Token from https://www.webnovel.com/book/8093990805004205
CSRF Token = LI1UErwQVWiDGnvCmmpgBamfOpavmDGUCZJqglkP
Getting book name and chapter list...
1646 chapters found
Traceback (most recent call last):
File "c:\program files\python36\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "c:\program files\python36\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\Karl\AppData\Roaming\Python\Python36\Scripts\ebook_crawler.exe\__main__.py", line 9, in <module>
File "C:\Users\Karl\AppData\Roaming\Python\Python36\site-packages\ebook_crawler\__init__.py", line 34, in main
end_chapter=sys.argv[4] if len(sys.argv) > 4 else ''
File "C:\Users\Karl\AppData\Roaming\Python\Python36\site-packages\ebook_crawler\webnovel.py", line 44, in start
novel_to_kindle(self.output_path)
File "C:\Users\Karl\AppData\Roaming\Python\Python36\site-packages\ebook_crawler\binding.py", line 83, in novel_to_kindle
for file_name in sorted(os.listdir(output_path)):
UnboundLocalError: local variable 'output_path' referenced before assignment
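The crash above means `novel_to_kindle` reads `output_path` before anything assigned it. A minimal guard, shown here as a sketch (the real function body is elided; only the validation is the point), would fail fast with a clear message instead:

```python
import os

def novel_to_kindle(output_path=None):
    """Sketch of a guard for the crash above: validate the path up front
    and raise a clear error instead of an UnboundLocalError."""
    if not output_path or not os.path.isdir(output_path):
        raise ValueError(
            'output_path is missing or does not exist: %r' % output_path)
    for file_name in sorted(os.listdir(output_path)):
        pass  # ... bind each file as the real implementation does
```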
Having 4 sources available is really amazing, but more sources would be great. I think wuxiaworld.co and novelplanet are perfect candidates for new sources to be packed into epub.
Make it more intelligent by following this workflow:
Implemented it for:
X represents sites that do not support searching, or where searching could not be implemented.
This has happened multiple times now: it fails to create everything except the HTML and text files.
Please add support for readlightnovel.org
Hi, a beginner here.
Can you please provide instructions for Mac users? Thank you.
Hi Sudipto,
After trying and testing the Telegram bot, I found that after the zip file upload finishes, the session is not closed and the crawler instance is not destroyed. So when I call /start again, I have to call /cancel first before the bot can accept a new job. Is that the right flow?
The other problem I found is that even after calling /cancel, when I call /start again, the volume and chapter numbers are counted cumulatively (past session + this session).
Is the crawler instance shared between sessions?
While trying to create an epub from webnovel, I got the error "invalid literal for int() with base 10":
python3 main.py webnovel 10377938706023605 https://www.webnovel.com/book/10377938706023605/27858104469219628/Last-Wish-System/Yale-Roanmad https://www.webnovel.com/book/10377938706023605/30154170363336851/Last-Wish-System/Crossing-the-Border
Getting CSRF Token from https://www.webnovel.com/book/10377938706023605
CSRF Token = 9eJJFX5txT0r9s3004p1rDY61DZrTfvslGGHmp61
Getting book name and chapter list...
148 chapters found
Traceback (most recent call last):
File "main.py", line 2, in <module>
main()
File "/home/yudi/book/Web Scrapper/ebook_crawler/__init__.py", line 34, in main
end_chapter=sys.argv[4] if len(sys.argv) > 4 else ''
File "/home/yudi/book/Web Scrapper/ebook_crawler/webnovel.py", line 43, in start
self.get_chapter_bodies()
File "/home/yudi/book/Web Scrapper/ebook_crawler/webnovel.py", line 83, in get_chapter_bodies
start = int(self.start_chapter)
ValueError: invalid literal for int() with base 10: 'https://www.webnovel.com/book/10377938706023605/27858104469219628/Last-Wish-System/Yale-Roanmad'
Thanks for helping
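The traceback shows the cause: the start/end arguments may be full chapter URLs, but the code does `int(self.start_chapter)` directly. A sketch that accepts either form — the URL pattern is an assumption based on the webnovel links in this report, and `parse_chapter_arg` is a hypothetical helper:

```python
import re

def parse_chapter_arg(arg):
    """Accept either a plain chapter index or a full webnovel chapter URL
    like https://www.webnovel.com/book/<book_id>/<chapter_id>/..."""
    if arg.isdigit():
        return int(arg)
    match = re.search(r'/book/\d+/(\d+)', arg)
    if match:
        return int(match.group(1))
    raise ValueError(
        'Not a chapter number or a recognizable chapter URL: %s' % arg)
```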
I think adding styling to the generated epub would be interesting, so the generated epub can look prettier.
The new format for WW novels is
http://www.wuxiaworld.com/novel/desolate-era/de-book-x-chapter-x/
with the old format being
http://www.wuxiaworld.com/desolate-era-index/de-book-x-chapter-x/
Can you edit the wuxia code to change that?
When I use the Discord bot, if I ask it to generate a book for a novel with more than 200 chapters, it usually generates an error like this:
[ERROR] (asyncio) Task was destroyed but it is pending! task: <Task pending coro=<Client._run_event() running at /home/yudi/.local/lib/python3.6/site-packages/discord/client.py:307> wait_for=<Future pending cb=[BaseSelectorEventLoop._sock_connect_done(15)(), <TaskWakeupMethWrapper object at 0x7f52f3c029a8>()]>>
and the bot gets destroyed (closed). But I don't think this happens with the lightnovel-crawler bot linked in the readme. Is there something I'm missing when deploying the Discord bot?
The Discord bot works great in 1-on-1 chat, but not in channel chat, because it cannot read a novel URL or search item there. Unlike the Telegram bot, Discord channels have no reply feature. Maybe we need a shorter command format for chat in channels, while retaining the conversational flow for 1-on-1 requests. A shorter command would be similar to the console arguments, for example:
For searching:
!lncrawl search novel_title novel_source → to generate the novel url
For generating a book:
!lncrawl format_book novel_url pack_by_volume all → to generate all chapters in format_book format
etc.
All arguments should be optional. If the arguments are not valid, the current interactive interface will be shown.
This is a good start: https://codeburst.io/building-beautiful-command-line-interfaces-with-python-26c7e1bb54df
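The "all arguments optional, fall back to interactive" behaviour can be sketched with the standard library's argparse. Flag names here are illustrative, not the project's actual interface:

```python
import argparse

def build_parser():
    """Minimal sketch of an all-optional CLI."""
    parser = argparse.ArgumentParser(prog='lncrawl')
    parser.add_argument('-s', '--source', help='profile page url of the novel')
    parser.add_argument('-q', '--query', help='novel name to search for')
    parser.add_argument('--first', type=int, metavar='N',
                        help='download the first N chapters')
    return parser

def run(argv):
    """Fall back to the interactive interface when no usable flags are given."""
    args = build_parser().parse_args(argv)
    if not (args.source or args.query):
        return 'interactive'  # show the current interactive interface
    return 'batch'
```

Invalid or missing flags simply leave `args.source`/`args.query` empty, so the existing interactive flow is shown, matching the behaviour described above.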
I was wondering if we should add a function to translate the text scraped from the novel source into a chosen language; maybe we can use the googletrans API for this. Hopefully this program can surprise readers in many different languages.
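A translation step could be kept backend-agnostic by injecting the translator as a callable — this keeps googletrans (or any replacement) out of the core pipeline. `translate_chapters` is a hypothetical helper, not an existing function of this project:

```python
def translate_chapters(chapters, translate, dest='id'):
    """Run every chapter body through a pluggable translate(text, dest)
    callable. With googletrans that callable could wrap
    Translator().translate(text, dest=dest).text; any backend with the
    same shape would work."""
    return [
        {'title': chapter['title'], 'body': translate(chapter['body'], dest)}
        for chapter in chapters
    ]
```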
I've noticed it isn't supported; I was wondering if you could add it. It has a very similar layout to some of the other supported sites, like BoxNovel, ReadLightNovel, and NovelPlanet.
Created: I’m in Hollywood.epub
Failed to generate mobi for I’m in Hollywood.epub
Traceback (most recent call last):
File "c:\program files (x86)\python36-32\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "c:\program files (x86)\python36-32\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Program Files (x86)\Python36-32\Scripts\lightnovel-crawler.exe\__main__.py", line 9, in <module>
File "c:\program files (x86)\python36-32\lib\site-packages\lightnovel_crawler\__init__.py", line 65, in main
start_app(crawler_list)
File "c:\program files (x86)\python36-32\lib\site-packages\lightnovel_crawler\app\__init__.py", line 34, in start_app
Program().run(crawler())
File "c:\program files (x86)\python36-32\lib\site-packages\lightnovel_crawler\app\program.py", line 38, in run
bind_books(self)
File "c:\program files (x86)\python36-32\lib\site-packages\lightnovel_crawler\app\bind_books.py", line 56, in bind_books
file.write(text)
UnicodeEncodeError: 'cp932' codec can't encode character '\xa0' in position 48: illegal multibyte sequence
Here is an interesting project for this: https://github.com/ChrisKnott/Eel
When I compile chapters from LNMTL, the chapter numbers don't match what they actually are on the site. For instance, I downloaded the last 10 chapters of Martial God Asura and it had them as 3123-3132 when they should be 3620-3629, which is what they are on the site. The chapters themselves match, just not the chapter numbers.
Hi, Sudipto
Recently I needed to add 2 sources to this great project. Both are Indonesian-language machine-translation novel providers: lnindo and idqidian. Both have quite a large collection. Some of my friends asked me to create epubs from those sites. I have already created and tested it. May I add it in a pull request?
Best regards,
Yudi Lee
I'm trying to integrate and automate using the Windows version of the crawler. When I try to batch the executable, it stops after each execution because the "Enter" key must be pressed. I tried the -f and --suppress flags, but execution still requires the "Enter" key press. Adding a flag/parameter for this would help automation.
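The fix is a guard around the final prompt. This sketch assumes a `suppress` flag is threaded through from the CLI (that wiring is hypothetical); it also skips the prompt when there is no interactive terminal, which covers batch files automatically:

```python
import sys

def pause_before_exit(suppress=False):
    """Skip the final 'Press Enter' prompt when a suppress flag is set or
    when stdin is not an interactive terminal, so scripts do not hang."""
    if suppress or not sys.stdin.isatty():
        return False  # did not pause
    input('Press Enter to exit...')
    return True
```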
LNMTL doesn't seem to be working. For example, I tried the last 10 chapters of Martial God Asura and it came back with this:
"? Enter an url or novel name to find: https://lnmtl.com/novel/martial-god-asura
Retrieving novel info...
NOVEL: Martial God Asura
? Which chapters to download? Last 10 chapters
Getting cover image...
Body is empty: https://lnmtl.com/chapter/martial-god-asura-book-8-chapter-3623
Body is empty: https://lnmtl.com/chapter/martial-god-asura-book-8-chapter-3621
Body is empty: https://lnmtl.com/chapter/martial-god-asura-book-8-chapter-3624
Body is empty: https://lnmtl.com/chapter/martial-god-asura-book-8-chapter-3620
Body is empty: https://lnmtl.com/chapter/martial-god-asura-book-8-chapter-3622
Body is empty: https://lnmtl.com/chapter/martial-god-asura-book-8-chapter-3625
Body is empty: https://lnmtl.com/chapter/martial-god-asura-book-8-chapter-3628
Body is empty: https://lnmtl.com/chapter/martial-god-asura-book-8-chapter-3627
Body is empty: https://lnmtl.com/chapter/martial-god-asura-book-8-chapter-3626
Body is empty: https://lnmtl.com/chapter/martial-god-asura-book-8-chapter-3629
Downloading chapters |████████████████████████████████| 10/10
Created: 10 text files
Created: 10 html files
Created: Martial God Asura.epub
Created: Martial God Asura.mobi"
And the files contain no text. I have been able to get other sites to work fine, just not LNMTL.
I have been trying with and without the login option. If it works, can you please give an example of the commands to get it working? Thanks.
I've noticed it before but thought it was a one-off. Sometimes I get this kind of error/message in my books.
This is a sentence in the book. "Hello, I'm an example." said Me.
code from sekindo - Readlightnovel.org In-article - outstream
code from sekindo
/339474670/ReadLightNovel/InStory_1
This is a sentence in the book. "Hello, I'm an example." said Me.
As you can see, I get this weird message and duplicated sentences. There are others throughout at random points as well, which I've listed below.
/339474670/ReadLightNovel/InStory_3
/339474670/ReadLightNovel/InStory_2
/339474670/ReadLightNovel/BottomStory
I don't know what causes it. I was downloading this novel: https://www.readlightnovel.org/mo-tian-ji
If you need any more info, let me know and I will try to provide it.
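A text-level cleanup pass could catch this even when the HTML filtering misses an ad block. This sketch drops lines containing known ad markers (the marker list comes from the samples above and is surely incomplete) and collapses the adjacent duplicate copy the ad block leaves behind; only consecutive duplicates are removed, so legitimately repeated lines elsewhere in the book survive:

```python
def strip_ad_lines(paragraphs, ad_markers=('sekindo', 'InStory', 'BottomStory')):
    """Remove ad-marker lines and the adjacent duplicate paragraph that
    the injected ad block produces."""
    cleaned = []
    for paragraph in paragraphs:
        text = paragraph.strip()
        if not text or any(marker in text for marker in ad_markers):
            continue
        if cleaned and cleaned[-1] == text:
            continue  # adjacent duplicate left by the removed ad block
        cleaned.append(text)
    return cleaned
```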
This is gonna be painful... but necessary.
Many source pages have a synopsis and info about the book. Maybe we can crawl that and add intro pages to the generated book, perhaps also adding a note that the book was generated using this script, etc.
For issue #52, we can add an uploader for Google Drive and share the Google Drive link via message. I have already created a function to do that in my forked repository. Should I open a pull request against master or another branch?
I'm trying to download from this url https://lnmtl.com/novel/forty-millenniums-of-cultivation
This is one of the novels on the site that needs logging in to read.
exact command I'm using
lncrawl --login <username> <password> -s https://lnmtl.com/novel/forty-millenniums-of-cultivation
username and password redacted for obvious reasons.
followed onscreen prompts for output directory and selected option for first 10 chapters
"body is empty" for every chapter; the generated file are indeed empty
lncrawl --version
returns 2.7.6
Also, I tried downloading the above novel by just going through the prompts instead of using the option flags, and I never got prompted to log in despite the novel requiring it.
Processing: _novel\8093990805004205
!! Failed to bind: _novel\8093990805004205
This is when providing the chapter numbers.
python3 main.py webnovel 8093990805004205
Getting CSRF Token from https://www.webnovel.com/book/8093990805004205
CSRF Token = BEwNDFH7yoADt2uvgWH9Y2ZdxMHKanalcugCh9WI
Getting book name and chapter list...
1646 chapters found
Processing: _novel\8093990805004205
Creating: NA\NA_v.epub
Amazon kindlegen(Windows) V2.9 build 1029-0897292
A command line e-book compiler
Copyright Amazon.com and its Affiliates 2014
Info:I9007:option: -c2: Kindle Huffdic compression
Error(opfparser):E20004: the id in the spine does not match any item in the manifest: cover
That is without giving chapter numbers.
Visiting: https://lnmtl.com/chapter/the-amber-sword-book-3-chapter-531-1
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "./__main__.py", line 74, in <module>
main()
File "./__main__.py", line 28, in main
end_url=sys.argv[4] if len(sys.argv) > 4 else ''
File "./EbookCrawler/lnmtl.py", line 57, in start
self.crawl_chapters(browser)
File "./EbookCrawler/lnmtl.py", line 89, in crawl_chapters
self.parse_chapter(browser)
File "./EbookCrawler/lnmtl.py", line 112, in parse_chapter
chapter_no = re.search(r'chapter-\d+$', url).group().strip('chapter-')
AttributeError: 'NoneType' object has no attribute 'group'
There are two chapters labeled # 531 on the website.
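The crash happens because `re.search(r'chapter-\d+$', url)` returns None for split chapters like `chapter-531-1`, and `.group()` is then called on None. A sketch that handles the optional sub-chapter suffix (returning e.g. '531.1' is just one possible convention):

```python
import re

def chapter_no(url):
    r"""Extract the chapter number, handling split chapters like
    '...chapter-531-1' that the original pattern r'chapter-\d+$' misses."""
    match = re.search(r'chapter-(\d+)(?:-(\d+))?$', url)
    if not match:
        return None
    major, minor = match.groups()
    return '%s.%s' % (major, minor) if minor else major
```

Returning None for unmatched URLs also lets the caller skip a page gracefully instead of raising AttributeError.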
The readlightnovel.org crawler is not crawling the title of each chapter. Instead of the title, there is a "dot".
Tested on:
https://www.readlightnovel.org/full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband
Discord won't allow uploading files larger than 8 MB, but most of the compressed files are larger than that. We need another way to send files to Discord.
The provided Windows binary simply exits after entering the login information, showing your GitHub link as a reference.
Checked on a fresh install of Windows with no Python or any other possible dependency installed, just pure Windows.
UPDATE ---
Calling the exe with all the proper parameters instead of using the interactive menu seems to make it work absolutely fine.
While converting a scraper to the new style, I found that in the new-style template the chapter title is scraped from the TOC page, not from the chapter page. In the old template the crawler got it from the chapter page. On some sources, the TOC lists the title only as a number (e.g. BoxNovel and NovelPlanet). I think we need to get the chapter title from the chapter page rather than from the TOC page.
All chapters download fine, but some chapters are missing from the output formats when opting to generate separate files for each volume.
on version 2.7.7
was downloading https://lnmtl.com/novel/forty-millenniums-of-cultivation
opted to generate separate file for each volume
should have 34 volumes
only 33 volumes (for all output formats except json)
with the json format
all the volume_title fields show 1 volume higher than they are supposed to be (except the very last volume)
ex. the json file for the very first chapter:
{"id": 1, "url": "https://lnmtl.com/chapter/forty-millenniums-of-cultivation-chapter-1", "volume": 1, "title": "Chapter #1 - Magical Artifact Graveyard", "volume_title": "Volume 2",...
Laptop@Lenovo ~/novel
$ ebook_crawler webnovel 7931338406001705 1 10 false
Traceback (most recent call last):
File "c:\users\laptop\appdata\local\programs\python\python36\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "c:\users\laptop\appdata\local\programs\python\python36\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\laptop\AppData\Roaming\Python\Python36\Scripts\ebook_crawler.exe\__main__.py", line 9, in <module>
File "C:\Users\laptop\AppData\Roaming\Python\Python36\site-packages\ebook_crawler\__init__.py", line 44, in main
volume=volume,
File "C:\Users\laptop\AppData\Roaming\Python\Python36\site-packages\ebook_crawler\webnovel.py", line 49, in start
novel_to_mobi(self.output_path)
File "C:\Users\laptop\AppData\Roaming\Python\Python36\site-packages\ebook_crawler\binding.py", line 110, in novel_to_mobi
for file_name in sorted(os.listdir(epub_path)):
FileNotFoundError: [WinError 3] The system cannot find the path specified: '7931338406001705\\epub'
Getting CSRF Token from https://www.webnovel.com/book/7931338406001705
CSRF Token = mNpxvSI9mDTA9EixtmCelAFwmC3Ifgv4TzZOQRqM
Getting book name and chapter list...
1159 chapters found
7931338406001705 does not exist
I get this error. I am using Cygwin 64-bit on Windows 10. Installed using "pip install ebook-crawler".
I think there is a problem with readlightnovel.org;
the rest of the sites are okay.
I tried: book_crawler readln full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband https://www.readlightnovel.org/full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband/chapter-1830 https://www.readlightnovel.org/full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband/chapter-1835
And this is what I get from the terminal:
Visiting https://www.readlightnovel.org/full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband
Getting book name and chapter list...
[Full Marks Hidden Marriage: Pick Up a Son, Get a Free Husband] 1842 chapters found
Downloading https://www.readlightnovel.org/full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband/chapter-1830
Downloading https://www.readlightnovel.org/full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband/chapter-1831
Downloading https://www.readlightnovel.org/full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband/chapter-1832
Downloading https://www.readlightnovel.org/full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband/chapter-1833
Downloading https://www.readlightnovel.org/full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband/chapter-1834
Saving Full Marks Hidden Marriage Pick Up a Son, Get a Free Husband/json/19/01836.json
Saving Full Marks Hidden Marriage Pick Up a Son, Get a Free Husband/json/19/01835.json
Downloading https://www.readlightnovel.org/full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband/chapter-1835
Downloading https://www.readlightnovel.org/full-marks-hidden-marriage-pick-up-a-son-get-a-free-husband/chapter-1841
Saving Full Marks Hidden Marriage Pick Up a Son, Get a Free Husband/json/19/01834.json
Saving Full Marks Hidden Marriage Pick Up a Son, Get a Free Husband/json/19/01837.json
Saving Full Marks Hidden Marriage Pick Up a Son, Get a Free Husband/json/19/01833.json
Saving Full Marks Hidden Marriage Pick Up a Son, Get a Free Husband/json/19/01838.json
Saving Full Marks Hidden Marriage Pick Up a Son, Get a Free Husband/json/19/01839.json
complete
Processing: Full Marks Hidden Marriage Pick Up a Son, Get a Free Husband/json/19
Creating: Full Marks Hidden Marriage Pick Up a Son, Get a Free Husband/epub/Full Marks Hidden Marriage: Pick Up a Son, Get a Free Husband_v19.epub
Traceback (most recent call last):
File "/anaconda3/lib/python3.5/site-packages/ebook_crawler/binding.py", line 128, in novel_to_mobi
generator(KINDLEGEN_PATH_MAC)
File "/anaconda3/lib/python3.5/site-packages/ebook_crawler/binding.py", line 123, in
generator = lambda kindlegen: call([kindlegen, epub_file])
File "/anaconda3/lib/python3.5/subprocess.py", line 247, in call
with Popen(*popenargs, **kwargs) as p:
File "/anaconda3/lib/python3.5/subprocess.py", line 676, in init
restore_signals, start_new_session)
File "/anaconda3/lib/python3.5/subprocess.py", line 1289, in _execute_child
raise child_exception_type(errno_num, err_msg)
PermissionError: [Errno 13] Permission denied
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/anaconda3/lib/python3.5/site-packages/ebook_crawler/binding.py", line 136, in novel_to_mobi
generator('kindlegen')
File "/anaconda3/lib/python3.5/site-packages/ebook_crawler/binding.py", line 123, in
generator = lambda kindlegen: call([kindlegen, epub_file])
File "/anaconda3/lib/python3.5/subprocess.py", line 247, in call
with Popen(*popenargs, **kwargs) as p:
File "/anaconda3/lib/python3.5/subprocess.py", line 676, in init
restore_signals, start_new_session)
File "/anaconda3/lib/python3.5/subprocess.py", line 1289, in _execute_child
raise child_exception_type(errno_num, err_msg)
FileNotFoundError: [Errno 2] No such file or directory: 'kindlegen'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/anaconda3/bin/ebook_crawler", line 11, in
sys.exit(main())
File "/anaconda3/lib/python3.5/site-packages/ebook_crawler/init.py", line 51, in main
volume=volume,
File "/anaconda3/lib/python3.5/site-packages/ebook_crawler/readln.py", line 49, in start
novel_to_mobi(self.output_path)
File "/anaconda3/lib/python3.5/site-packages/ebook_crawler/binding.py", line 138, in novel_to_mobi
if err[1].errno == errno.ENOENT:
TypeError: 'FileNotFoundError' object is not subscriptable
In the end, the result I get is only an epub with all the titles but no content.
The same happens when installed using Python or when running from source.
Convert all old-style crawlers to the new simplified style, and add them to ebook_crawler/__init__.py