fictiondown's Introduction

FictionDown

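FictionDown is a command-line tool for scraping web novels.

It batch-downloads pirated web novels. The software is intended solely for collecting samples for data analysis; do not use it for any other purpose.

Do not distribute the documents this software produces, and do not use them for anything other than data evaluation.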

Documentation

The "Guide" part of the documentation is now complete; you can read it here.

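Features

  • Uses Qidian as the reference sample and crawls multiple sites concurrently, proofreading the results against it
  • Exports txt for compatibility with most readers
  • Exports epub (still has some problems; some readers cannot open the files)
  • Exports markdown, which pandoc can convert to epub; the output carries epub metadata and preserves book information, volume structure, and author information
  • Built-in simple ad filtering (still incomplete)
  • Written in Golang, so it is easy to install and deploy; optional external dependency: Chromedp
  • Supports resumable crawling: after a forced stop, the next run continues where the previous one ended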
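
To make the resumable-crawl idea concrete, here is a minimal sketch of how a resumed run might decide what is left to fetch. It is not FictionDown's actual implementation: the `Chapter` type, its `Text` field, and the `pending` helper are hypothetical names introduced only for illustration.

```go
package main

import "fmt"

// Chapter is a hypothetical cache entry for this sketch.
// The Text field (downloaded paragraphs) is an assumption.
type Chapter struct {
	Name string
	URL  string
	Text []string
}

// pending returns the indexes of chapters that still have no text,
// i.e. the ones a resumed run would need to fetch.
func pending(chapters []Chapter) []int {
	var todo []int
	for i, c := range chapters {
		if len(c.Text) == 0 {
			todo = append(todo, i)
		}
	}
	return todo
}

func main() {
	cache := []Chapter{
		{Name: "Chapter 1", URL: "https://example.com/1", Text: []string{"already downloaded"}},
		{Name: "Chapter 2", URL: "https://example.com/2"},
		{Name: "Chapter 3", URL: "https://example.com/3"},
	}
	fmt.Println(pending(cache)) // prints [1 2]
}
```

Because the decision depends only on what is already stored, an interrupted run loses at most the chapter that was in flight.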

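Supported sites

  • Licensed: ✅ licensed site ❌ pirate site
  • Split into volumes: ✅ chapters grouped by volume ❌ all chapters placed in a single volume
  • On-site search: ✅ fully supported ❌ not supported ❔ supported by the site but not yet adapted in the software ⚠️ supported by the site but unavailable or under maintenance ⛔ the site offers search, but there is no good way to adapt it (for example, it uses Google for on-site search)
Columns: site name, URL, licensed, split into volumes, on-site search, code file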
起点中文网 www.qidian.com sites\com_qidian\main.go
笔趣阁 www.b520.cc sites\cc_b520\main.go
顶点小说 www.ddyueshu.com sites\com_ddyueshu\main.go
全本小说网 www.qb5.la sites\la_qb5\main.go
新八一中文网 www.81new.net sites\net_new81\main.go
书迷楼 www.shumil.co sites\co_shumil\main.go
完本神站 www.wanben.org site\org_wanben\main.go
38 看书 www.mijiashe.com ⚠️ sites\com_mijiashe\main.go

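Usage notes

  • The page layouts of Qidian and the pirate sites can change at any time, which may break the scraping rules; if that happens, please open an issue
  • Generated EPUB files can be very large; most readers on the market will lag badly or crash outright
  • Some very old books, or books whose authors revise them frequently, are not carried by any pirate site and therefore cannot be crawled. If you find a pirate site that has such a book, please open an issue with the book title, the licensed-site link, and the pirate-site link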

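Workflow

  1. Enter the Qidian link
  2. Fetch the book information and start crawling each chapter; VIP chapters are stored in Example as proofreading samples
  3. Manually set the matching links on 笔趣阁 or other pirate sites in the tmap field
  4. Run again; this time only the VIP part is crawled and proofread against Example
  5. Manually edit the corresponding cache file to delete ads and stray random characters (some of them are keywords that can cause pandoc to run out of memory or produce broken styling)
  6. Run conv -f md to generate markdown
  7. Convert to epub with pandoc: pandoc -o xxxx.epub xxxx.md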
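
The proofreading step compares text from a pirate site against a sample taken from the licensed site. The sketch below shows one simple way such a comparison could be scored. It is an assumption for illustration only: the `overlap` function and its line-based scoring are not taken from FictionDown's source.

```go
package main

import (
	"fmt"
	"strings"
)

// overlap reports the fraction of non-empty sample lines that also occur
// (after trimming whitespace) somewhere in the candidate text.
func overlap(sample, candidate []string) float64 {
	seen := make(map[string]bool, len(candidate))
	for _, line := range candidate {
		seen[strings.TrimSpace(line)] = true
	}
	total, hit := 0, 0
	for _, line := range sample {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		total++
		if seen[line] {
			hit++
		}
	}
	if total == 0 {
		return 0 // nothing to compare against
	}
	return float64(hit) / float64(total)
}

func main() {
	sample := []string{"first line", "second line"}
	candidate := []string{"  first line ", "an injected ad", "second line", "third line"}
	fmt.Println(overlap(sample, candidate)) // prints 1
}
```

A score near 1 suggests the candidate chapter contains the sample text, and extra lines such as injected ads do not lower it. Exact line matching is deliberately crude; a real matcher would probably need to tolerate small wording differences.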

Example

> ./FictionDown --url https://book.qidian.com/info/3249362 d # fetch the licensed book's information

# A `not match volumes` error sometimes occurs; if so, enable Chromedp or PhantomJS
# Use Chromedp
> ./FictionDown --url https://book.qidian.com/info/3249362 -d chromedp d
# Use PhantomJS
> ./FictionDown --url https://book.qidian.com/info/3249362 -d phantomjs d

> vim 一世之尊.FictionDown # add the pirate-site links
> ./FictionDown -i 一世之尊.FictionDown d # fetch the pirated content
# Once crawling finishes, you can export a readable document
> ./FictionDown -i 一世之尊.FictionDown conv -f txt
# There are two ways to convert to epub
# 1. Export markdown, then convert it to epub with pandoc
> ./FictionDown -i 一世之尊.FictionDown conv -f md
> pandoc -o 一世之尊.epub 一世之尊.md
# Some readers need chapter positioning; for those, add --epub-chapter-level=2
> pandoc -o 一世之尊.epub --epub-chapter-level=2 一世之尊.md
# 2. Export epub directly (calls Pandoc)
> ./FictionDown -i 一世之尊.FictionDown conv -f epub
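
For readers who want to script the pandoc step themselves, the following sketch builds the same kind of argument list as the pandoc commands above and only attempts anything further when pandoc is installed. The `pandocArgs` helper and the file names are hypothetical; this is not how FictionDown itself is written.

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
)

// pandocArgs builds the arguments for converting a markdown file to epub.
// A chapterLevel of 0 or less omits --epub-chapter-level.
func pandocArgs(in, out string, chapterLevel int) []string {
	args := []string{}
	if chapterLevel > 0 {
		args = append(args, "--epub-chapter-level", strconv.Itoa(chapterLevel))
	}
	return append(args, "-o", out, in)
}

func main() {
	args := pandocArgs("book.md", "book.epub", 2)
	fmt.Println(args)
	// Only build the command when pandoc is actually on PATH.
	if path, err := exec.LookPath("pandoc"); err == nil {
		cmd := exec.Command(path, args...)
		fmt.Println("would run:", cmd.String())
	}
}
```

Keeping the argument construction in a pure function makes it easy to check without running pandoc.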

You can also download straight from the search results (available when at least one licensed source exists)

> ./FictionDown s -d -k "诡秘之主"

Search on-site, then fill in the results

> ./FictionDown --url https://book.qidian.com/info/3249362 d # fetch the licensed book's information

# A `not match volumes` error sometimes occurs; if so, enable Chromedp or PhantomJS
# Use Chromedp
> ./FictionDown --url https://book.qidian.com/info/3249362 --driver chromedp d
# Use PhantomJS
> ./FictionDown --url https://book.qidian.com/info/3249362 --driver phantomjs d

> ./FictionDown -i 一世之尊.FictionDown s -k 一世之尊 -p # search, then store the results
> ./FictionDown -i 一世之尊.FictionDown d # fetch the pirated content
# Once crawling finishes, you can export a readable document
> ./FictionDown -i 一世之尊.FictionDown conv -f txt
# There are two ways to convert to epub
# 1. Export markdown, then convert it to epub with pandoc
> ./FictionDown -i 一世之尊.FictionDown conv -f md
> pandoc -o 一世之尊.epub 一世之尊.md
# 2. Export epub directly (some readers will report errors)
> ./FictionDown -i 一世之尊.FictionDown conv -f epub

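Not yet implemented

  • Send a Cookie when crawling the licensed site, so purchased chapters can be fetched
  • Support 晋江文学城
  • Support 纵横中文网
  • Support 有毒小说网
  • Support 刺猬猫 (also known as "欢乐书客")
  • Clean up the spaghetti logic in the main package
  • Make the command-line argument style consistent
  • Improve ad filtering
  • Simplify the usage steps
  • Improve log output
  • For special chapters, allow manually specifying a pirate link or skipping them
  • Load matching rules externally so users can add their own licensed or pirate sources
  • Support chapter updates
  • Optimize the chapter-matching process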

Usage

NAME:
   FictionDown - https://github.com/ma6254/FictionDown

USAGE:
    [global options] command [command options] [arguments...]

AUTHOR:
   ma6254 <[email protected]>

COMMANDS:
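     download, d, down  Download into the cache file
     check, c, chk      Check the cache file
     edit, e            Manually edit the cache file
     convert, conv      Convert and export to another format
     pirate, p          Search pirate sites
     search, s          Search pirate sites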
     help, h            Shows a list of commands or help for one command

GLOBAL OPTIONS:
   -u value, --url value     Book URL
   --tu value, --turl value  Resource site URL
   -i value, --input value   Input cache file
   --log value               log file path
   --driver value, -d value  Request method, support: none,phantomjs,chromedp
   --help, -h                show help
   --version, -v             print the version

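Installation and building

The program is a single executable with a command-line interface.

Dependencies are managed with Go modules.

go install github.com/ma6254/FictionDown@latest

To cross-compile executables for linux/arm, linux/amd64, darwin/amd64, and windows/amd64: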

make multiple_build

fictiondown's Issues

Memory overflow when running the program as administrator on Win10

FictionDown.exe -i .\一念永恒-耳根-起点中文网.FictionDown s -k 一念永恒 -p
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x18 pc=0x9acbec]

goroutine 1 [running]:
github.com/ma6254/FictionDown/site.Type1SearchAfter.func1(0xc00007a0c0, 0xc, 0x0, 0x0, 0x0, 0x0, 0x0)
/home/runner/work/FictionDown/FictionDown/site/sites.go:200 +0x24c
github.com/ma6254/FictionDown/site.Search(0xc00007a0c0, 0xc, 0xc000163980, 0xc00007a0c0, 0xc, 0xc0000a57a0, 0x4efadf)
/home/runner/work/FictionDown/FictionDown/site/site.go:238 +0x13c
main.glob..func6(0xc0000b8dc0, 0x100, 0xc0000b8dc0)
/home/runner/work/FictionDown/FictionDown/search.go:33 +0x7f
github.com/urfave/cli.HandleAction(0xa2c980, 0xb279f0, 0xc0000b8dc0, 0xc000163900, 0x0)
/home/runner/go/pkg/mod/github.com/urfave/[email protected]/app.go:490 +0xcf
github.com/urfave/cli.Command.Run(0xaf472e, 0x6, 0x0, 0x0, 0x1118610, 0x1, 0x1, 0xb00290, 0x12, 0x0, ...)
/home/runner/go/pkg/mod/github.com/urfave/[email protected]/command.go:210 +0x99d
github.com/urfave/cli.(*App).Run(0x1121fc0, 0xc0000ae000, 0x7, 0x8, 0x0, 0x0)
/home/runner/go/pkg/mod/github.com/urfave/[email protected]/app.go:255 +0x6b6
main.main()
/home/runner/work/FictionDown/FictionDown/main.go:87 +0x125

Cannot read Qidian chapters; the content is empty

bookurl: https://book.qidian.com/info/1025813823/
bookname: 仙朝纪元
author: 西城冷月
coverurl: https://bookcover.yuewen.com/qdbimg/349573/1025813823/180
description: |-
  旧世之末,余火回光!
  龙蛇起陆的仙道盛景、缱绻多情的绝代佳人,春色绚烂下,是那腐朽的灰败。
  仙人在沉沦中徘徊,旧神在欲望中复苏……
  建仙朝、铸仙鼎,口含天宪,言出法随,叫那天地换个新纪元!
  这是一个幽幽长夜之内,一点星火乍起,煦照九天十地,三界六道……的故事。
tmap: []
volumes:
- name: 作品相关
  isvip: false
  chapters: []
- name: 潜龙勿用
  isvip: false
  chapters: []
- name: 潜龙勿用
  isvip: true
  chapters: []
- name: 见龙在田
  isvip: true
  chapters: []
- name: 终日乾乾
  isvip: true
  chapters: []

runtime error: a runtime error occurs when searching the sites

R:\down\FictionDown_0.1.3_Windows_x86_64.tar>FictionDown s -k '赛博剑仙铁雨'
2021/09/04 00:37:18 搜索站点: 新八一中文网 https://www.81new.net/ 404 404 Not Found
2021/09/04 00:37:19 搜索站点: 结果: 0 笔趣阁1 https://www.biquge5200.cc/
2021/09/04 00:37:20 搜索站点: 结果: 0 起点中文网 https://www.qidian.com/
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x18 pc=0x9acbec]

goroutine 1 [running]:
github.com/ma6254/FictionDown/site.Type1SearchAfter.func1(0xc00002c340, 0x14, 0x0, 0x0, 0x0, 0x0, 0x0)
        /home/runner/work/FictionDown/FictionDown/site/sites.go:200 +0x24c
github.com/ma6254/FictionDown/site.Search(0xc00002c340, 0x14, 0xc00013d9e0, 0xc00002c340, 0x14, 0xc0000897a0, 0x4efadf)
        /home/runner/work/FictionDown/FictionDown/site/site.go:238 +0x13c
main.glob..func6(0xc000092f20, 0x0, 0xc000092f20)
        /home/runner/work/FictionDown/FictionDown/search.go:33 +0x7f
github.com/urfave/cli.HandleAction(0xa2c980, 0xb279f0, 0xc000092f20, 0xc00013d900, 0x0)
        /home/runner/go/pkg/mod/github.com/urfave/[email protected]/app.go:490 +0xcf
github.com/urfave/cli.Command.Run(0xaf472e, 0x6, 0x0, 0x0, 0x1118610, 0x1, 0x1, 0xb00290, 0x12, 0x0, ...)
        /home/runner/go/pkg/mod/github.com/urfave/[email protected]/command.go:210 +0x99d
github.com/urfave/cli.(*App).Run(0x1121fc0, 0xc000054100, 0x4, 0x4, 0x0, 0x0)
        /home/runner/go/pkg/mod/github.com/urfave/[email protected]/app.go:255 +0x6b6
main.main()
        /home/runner/work/FictionDown/FictionDown/main.go:87 +0x125

Software version: v0.1.3
Environment: windows x64, linux x64
Network: overseas IP

Reproduces reliably

Error when exporting epub through pandoc on Windows

Environment

Software version: commit 1c10eae tag: v0.1.3
Pandoc version:

PS C:\Users\mjc\git\FictionDown\release> pandoc -v
pandoc.exe 2.9.2
Compiled with pandoc-types 1.20, texmath 0.12.0.1, skylighting 0.8.3.2
Default user data directory: C:\Users\mjc\AppData\Roaming\pandoc
Copyright (C) 2006-2019 John MacFarlane
Web:  https://pandoc.org
This is free software; see the source for copying conditions.
There is no warranty, not even for merchantability or fitness
for a particular purpose.

Operating system: any Windows version

Steps to reproduce

PS C:\Users\mjc\git\FictionDown\release> .\FictionDown.exe -i .\诡秘之主-爱潜水的乌贼-笔趣阁1.FictionDown conv -f epub
2020/02/19 00:05:36 Loading cache file: .\诡秘之主-爱潜水的乌贼-笔趣阁1.FictionDown
2020/02/19 00:05:36 Start Conversion: Format:"epub" OutPath:"诡秘之主.epub"
2020/02/19 00:05:36 Save Cover Image: "C:\\Users\\mjc\\AppData\\Local\\Temp\\book_cover_126653631.jpg"
2020/02/19 00:05:40 中间文件转换完成: "诡秘之主.epub.md"
2020/02/19 00:05:40 调用Pandoc: "C:\\ProgramData\\chocolatey\\bin\\pandoc.exe" []string{"pandoc", "--epub-chapter-level", "2", "-o", "诡秘之主.epub", "诡秘之主.epub.md"}   
pandoc.exe: C:_cover_126653631.jpg: openBinaryFile: does not exist (No such file or directory)
exit status 1

Or

PS C:\Users\mjc\git\FictionDown\release> pandoc -o a.epub 诡秘之主.md
pandoc.exe: C:_cover_703999991.jpg: openBinaryFile: does not exist (No such file or directory)

MetaData section

title: 诡秘之主
description: |-
  蒸汽与机械的浪潮中,谁能触及非凡?历史和黑暗的迷雾里,又是谁在耳语?我从诡秘中醒来,睁眼看见这个世界:
  枪械,大炮,巨舰,飞空艇,差分机;魔药,占卜,诅咒,倒吊人,封印物……光明依旧照耀,神秘从未远离,这是一段“愚者”的传说。
creator: 爱潜水的乌贼
lang: zh-CN
cover-image: C:\Users\mjc\AppData\Local\Temp\book_cover_703999991.jpg

This is presumably caused by a difference between the YAML implementations of Pandoc and go-yaml.

An issue has been filed with Pandoc: jgm/pandoc#6150
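
One possible workaround, not confirmed anywhere in this report, is to write the cover path with forward slashes before it goes into the metadata, so that no backslashes remain to be misread. The sketch below only shows the path rewrite; the `toSlashPath` helper is a hypothetical name, and whether the pandoc version above accepts the result is an untested assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// toSlashPath rewrites Windows-style backslashes as forward slashes.
// strings.ReplaceAll is used instead of filepath.ToSlash so the result
// is the same on every operating system.
func toSlashPath(p string) string {
	return strings.ReplaceAll(p, `\`, "/")
}

func main() {
	cover := `C:\Users\mjc\AppData\Local\Temp\book_cover_703999991.jpg`
	fmt.Printf("cover-image: %s\n", toSlashPath(cover))
	// prints: cover-image: C:/Users/mjc/AppData/Local/Temp/book_cover_703999991.jpg
}
```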

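Cannot download from Qidian

Hello, this is my first time using FictionDown. I want to use it to download "诡秘之主" and put it on my Kindle for a second read-through.

Since I have never used Go programs before and was afraid of making mistakes, here is how I installed it:
1. Turned on v2rayNG to get around the firewall
2. Downloaded and installed Go (.msi for amd64)
3. go env -w GO111MODULE=on
4. go env -w GOPROXY=https://goproxy.cn,direct
5. go get -v github.com/ma6254/FictionDown@latest

The installation seemed to succeed.

However, whether I try searching or try downloading directly from a site URL, it does not seem to run properly. Did I install it incorrectly?

Platform:
win10 Enterprise LTSC 1809 (OS build 17763.1637)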
go version go1.15.6 windows/amd64

2021.1.9

[Enhanced] The 顶点小说网 domain has changed; the XPath needs no changes

www.booktxt.net now 301-redirects to www.ddxstxt8.com

  • The XPath for the chapter list structure and so on needs no changes
  • In https://github.com/ma6254/FictionDown/blob/35edca3576102a93f6c2a894e9b232155cbf92e5/sites/booktxt_net/main.go, the following
		Match: []string{
			`https://www\.booktxt\.net/\d+_\d+/*`,
			`https://www\.booktxt\.net/\d+_\d+/\d+\.html/*`,
			`http://www\.booktxt\.net/book/goto/id/\d+`,
		},

needs to be changed to the domain that the 301 redirect points to
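
As a rough illustration of the requested change, the sketch below swaps the host in the three patterns for the redirect target and checks a few URLs against them. The rewritten patterns, the sample URLs, and the `matches` helper are assumptions made for this example. How FictionDown actually applies its `Match` list (anchored or not) is not shown in the issue, so the full-match anchoring here is also an assumption.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchPatterns mirrors the three original patterns with the host swapped
// to the redirect target. These are assumptions, not the project's rules.
var matchPatterns = []string{
	`https://www\.ddxstxt8\.com/\d+_\d+/*`,
	`https://www\.ddxstxt8\.com/\d+_\d+/\d+\.html/*`,
	`http://www\.ddxstxt8\.com/book/goto/id/\d+`,
}

// matches reports whether u fully matches any of the patterns.
func matches(u string) bool {
	for _, p := range matchPatterns {
		if regexp.MustCompile(`^(?:` + p + `)$`).MatchString(u) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(matches("https://www.ddxstxt8.com/1_1234/"))          // true
	fmt.Println(matches("https://www.ddxstxt8.com/1_1234/5678.html")) // true
	fmt.Println(matches("https://www.booktxt.net/1_1234/"))           // false
}
```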

chromedp has been updated

After the chromedp update, method names changed, and practically every call into chromedp is now broken

dep ensure failed

$ dep ensure -v
(1/12) Wrote github.com/benbjohnson/phantomjs@master
(2/12) Wrote github.com/gofrs/[email protected]
(3/12) Wrote github.com/bmaupin/[email protected]
(4/12) Wrote golang.org/x/[email protected]
(5/12) Wrote github.com/go-yaml/[email protected]
(6/12) Wrote github.com/mattn/[email protected]
(7/12) Failed to write golang.org/x/net@master
(8/12) Failed to write golang.org/x/sys@master
(9/12) Failed to write gopkg.in/cheggaaa/[email protected]
(10/12) Failed to write github.com/antchfx/[email protected]
(11/12) Failed to write github.com/antchfx/xpath@master
(12/12) Failed to write github.com/urfave/[email protected]
grouped write of manifest, lock and vendor: error while writing out vendor tree: failed to write dep tree: failed to export golang.org/x/net: fatal: failed to unpack tree object 3a22650c66bd7f4fb6d1e8072ffd7b75c8a27898
: exit status 128

$ dep version
dep:
 version     : devel
 build date  : 
 git hash    : 
 go version  : go1.9.4
 go compiler : gc
 platform    : linux/amd64
 features    : ImportDuringSolve=false
$ go version
go version go1.11 linux/amd64

Custom book sources

One of the unimplemented features is custom book sources, and it would be useful to me right now.

example failed

I downloaded the release and installed phantomjs, then ran the example:

$ ./FictionDown --url https://book.qidian.com/info/3249362 d 
2019/03/11 16:11:02 Init PhantomJS
2019/03/11 16:11:03 URL: "https://book.qidian.com/info/3249362"
2019/03/11 16:11:03 Close PhantomJS
2019/03/11 16:11:03 failed
$ phantomjs --version
1.9.8

Building executables for multiple platforms

Flag --rm-dist has been deprecated, please use --clean instead
• starting release...
⨯ release failed after 0s error=yaml: unmarshal errors:
line 41: field replacements not found in type config.Archive

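Thank you for your project; one small suggestion

q(≧▽≦q) Thank you for your project; it solved a real pain point for me.
One small suggestion, though: could you sign the files when publishing a release?
(Reasons:

  1. To protect your own rights. Plenty of unscrupulous people in China take projects from GitHub, wrap them in a thin shell, and sell them for a fee......
  2. Have you ever used kms pico? Its author published it on an overseas forum (blocked by the firewall), and many people then set up copycat sites with cryptomining ⛏ malware hidden in the files......I hope you can sign the releases......ideally with an MD5 checksum as well~)
    Finally, thank you once again, orz!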

Add a rule

www.zhuishubang.com

Code

package site

import (
	"fmt"
	"io"
	"net/url"
	"strings"

	"github.com/antchfx/htmlquery"
	"github.com/ma6254/FictionDown/store"
	"golang.org/x/text/encoding/simplifiedchinese"
	"golang.org/x/text/transform"
)

type wwwZhuishubangCom struct {
}

func (b *wwwZhuishubangCom) BookInfo(body io.Reader) (s *store.Store, err error) {
	body = transform.NewReader(body, simplifiedchinese.GBK.NewDecoder())
	doc, err := htmlquery.Parse(body)
	if err != nil {
		return
	}

	s = &store.Store{}

	// Book Name
	node_title := htmlquery.Find(doc, `//div[@class="bookPhr"]/h2`)
	if len(node_title) == 0 {
		err = fmt.Errorf("No matching title")
		return
	}
	s.BookName = htmlquery.InnerText(node_title[0])

	// Description
	node_desc := htmlquery.Find(doc, `//*[@class="introCon"]/p`)
	if len(node_desc) == 0 {
		err = fmt.Errorf("No matching desc")
		return
	}
	s.Description = strings.Replace(
		htmlquery.OutputHTML(node_desc[0], false),
		"<br/>", "\n",
		-1)

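	// Author
	author := htmlquery.Find(doc, `//div[@class="bookPhr"]/dl/dd`)
	if len(author) == 0 {
		// Guard against indexing an empty result, which would panic.
		err = fmt.Errorf("No matching author")
		return
	}
	s.Author = htmlquery.OutputHTML(author[0], false)

	// Contents
	node_content := htmlquery.Find(doc, `//div[@class="chapterCon"]/ul/li/a`)
	// Check the chapter list itself here, not node_desc.
	if len(node_content) == 0 {
		err = fmt.Errorf("No matching contents")
		return
	}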

	var vol = store.Volume{
		Name:     "正文",
		Chapters: make([]store.Chapter, 0),
	}

	// for _, v := range node_content {
	// Walk the chapter links in reverse order.
	for idx := len(node_content) - 1; idx >= 0; idx-- {
		v := node_content[idx]
		// fmt.Printf("href: %v\n", chapter_u)
		chapterURL, err := url.Parse(htmlquery.SelectAttr(v, "href"))
		if err != nil {
			return nil, err
		}

		vol.Chapters = append(vol.Chapters, store.Chapter{
			Name: strings.TrimSpace(htmlquery.InnerText(v)),
			URL:  chapterURL.String(),
		})
	}
	s.Volumes = append(s.Volumes, vol)

	s.CoverURL = htmlquery.SelectAttr(htmlquery.FindOne(doc, `//*[@class="bookImg"]/img`), "src")

	return
}

func (b *wwwZhuishubangCom) Chapter(body io.Reader) ([]string, error) {
	body = transform.NewReader(body, simplifiedchinese.GBK.NewDecoder())
	doc, err := htmlquery.Parse(body)
	if err != nil {
		return nil, err
	}

	M := []string{}
	//list
	// nodeContent := htmlquery.Find(doc, `//div[@id="content"]/text()`)
	nodeContent := htmlquery.Find(doc, `//div[@class="articleCon"]/p/text()`)
	if len(nodeContent) == 0 {
		err = fmt.Errorf("No matching content")
		return nil, err
	}
	for _, v := range nodeContent {
		t := htmlquery.InnerText(v)
		t = strings.TrimSpace(t)

		switch t {
		case
			"本↘书↘首↘发↘追↘书↘帮↘http://m.zhuishubang.com/",
			"":
			continue
		}

		M = append(M, t)
	}

	return M, nil
}
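
The `switch` in `Chapter` above hard-codes a single ad line. As a sketch of how the same filtering could be driven by a table of known ad strings, consider the following; the `dropLines` helper and the map-based rule set are hypothetical and not part of the submitted rule.

```go
package main

import (
	"fmt"
	"strings"
)

// dropLines trims each line, then removes empty lines and any line that
// exactly equals one of the known ad strings.
func dropLines(lines []string, ads map[string]bool) []string {
	kept := make([]string, 0, len(lines))
	for _, l := range lines {
		l = strings.TrimSpace(l)
		if l == "" || ads[l] {
			continue
		}
		kept = append(kept, l)
	}
	return kept
}

func main() {
	ads := map[string]bool{
		"本↘书↘首↘发↘追↘书↘帮↘http://m.zhuishubang.com/": true,
	}
	in := []string{
		" first paragraph ",
		"",
		"本↘书↘首↘发↘追↘书↘帮↘http://m.zhuishubang.com/",
		"second paragraph",
	}
	fmt.Println(len(dropLines(in, ads))) // prints 2
}
```

Exact matching keeps the filter predictable, at the cost of missing ads that vary slightly from page to page.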
