howie6879 / liuli
Build a multi-source, clean and personalized reading environment in one stop.
Home Page: https://liuli.io
License: Apache License 2.0
Taking my blog as an example, the following content can be automatically recognized as RSS feed addresses for subscription:
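RSS feed autodiscovery is conventionally done by scanning the page `<head>` for `<link rel="alternate">` tags with an RSS/Atom MIME type. A minimal stdlib sketch (the page markup and feed URL here are made up for illustration, not taken from my blog):

```python
from html.parser import HTMLParser

class FeedLinkParser(HTMLParser):
    """Collect href values of <link> tags that advertise an RSS/Atom feed."""
    FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if (tag == "link" and a.get("rel") == "alternate"
                and a.get("type") in self.FEED_TYPES and a.get("href")):
            self.feeds.append(a["href"])

# Hypothetical blog page advertising its feed in the standard way.
page = ('<html><head><link rel="alternate" '
        'type="application/rss+xml" href="https://example.com/atom.xml">'
        '</head></html>')
parser = FeedLinkParser()
parser.feed(page)
print(parser.feeds)  # ['https://example.com/atom.xml']
```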
Hi! The current version has the following problem:
If an official account publishes articles with the same title on different days but with different content, backing them up to GitHub causes conflicts.
Suggestion: append a date/time to the HTML filename to distinguish them, so articles are neither overwritten nor skipped because of duplicate titles.
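The suggested fix can be sketched as a small helper (the function name is hypothetical, not the project's API):

```python
from datetime import datetime

def backup_filename(title: str, published: datetime) -> str:
    # Append the publish date and time so two articles that share a title
    # but were published on different days no longer overwrite each other.
    return f"{title}_{published.strftime('%Y%m%d_%H%M')}.html"

print(backup_filename("同名文章标题", datetime(2022, 5, 9, 10, 55)))
# 同名文章标题_20220509_1055.html
```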
from src.classifier import model_predict_factory
model_resp = model_predict_factory(
model_name="charcnn", model_path="", input_dict={"text": doc_name}
)
Some sites place cross-origin/anti-hotlinking restrictions on their resources (e.g. WeChat images). Solutions currently under consideration:
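One workaround is to rewrite image references so they go through a caching image proxy, which is the same string substitution the `str_replace` step in the default.json below performs. A minimal sketch (the image filename is made up for illustration):

```python
def rewrite_img_src(html: str, proxy: str = "https://images.weserv.nl/?url=") -> str:
    # WeChat image servers reject requests carrying a foreign Referer, so
    # point the src at an image proxy instead of the original host.
    return html.replace('data-src="', f'src="{proxy}')

html = '<img data-src="https://mmbiz.qpic.cn/mmbiz_jpg/example.jpg">'
print(rewrite_img_src(html))
# <img src="https://images.weserv.nl/?url=https://mmbiz.qpic.cn/mmbiz_jpg/example.jpg">
```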
Since 2c depends on MongoDB, docker-compose is used for one-click installation to make image-based deployment easier.
For example, support table-of-contents recognition for the novel content type.
Some of the official accounts I follow are not verified and cannot be found via Sogou search, so no link can be generated. For example: 丁香园内分泌时间 (and other accounts whose names start with 丁香园).
Referenced: https://mp.weixin.qq.com/s/rxoq97YodwtAdTqKntuwMA
I just spun up the demo and tried crawling WeChat official-account content, but the logs show the run failed.
Loading .env environment variables...
[2022:05:09 10:55:45] INFO Liuli Schedule(v0.2.4) task(default@liuli_team) started successfully :)
[2022:05:09 10:55:45] INFO Liuli Task(default@liuli_team) schedule time:
00:10
12:10
21:10
[2022:05:09 10:55:45] ERROR Liuli 执行失败!'doc_source'
The liuli schedule image version used in the docker-compose file from the article does not include Playwright, while the default.json provided in the article is described as crawling WeChat content with Playwright. I tried switching to the Playwright-enabled image version, but it also failed.
Remove textrank4zh and use jieba directly for word segmentation.
The test script is as follows:
from src.collector.wechat_feddd.start import WeiXinSpider
WeiXinSpider.request_config = {"RETRIES": 3, "DELAY": 5, "TIMEOUT": 20}
WeiXinSpider.start_urls = ['https://mp.weixin.qq.com/s/OrCRVCZ8cGOLRf5p5avHOg']
WeiXinSpider.start()
Cause of the error: during data cleaning, the expected timestamp format is 2022-03-21 20:59, but the data actually scraped is 2022-03-22 20:37:12, which makes the clean_doc_ts function raise an error (see the screenshot below).
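A tolerant parser that accepts both observed layouts could look like this (a sketch; the function name is hypothetical, not the project's actual clean_doc_ts implementation):

```python
from datetime import datetime

# Both timestamp layouts seen in scraped WeChat articles.
KNOWN_FORMATS = ("%Y-%m-%d %H:%M", "%Y-%m-%d %H:%M:%S")

def parse_doc_ts(raw: str) -> datetime:
    # Try each known layout instead of assuming a single one.
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp: {raw!r}")

print(parse_doc_ts("2022-03-21 20:59"))     # 2022-03-21 20:59:00
print(parse_doc_ts("2022-03-22 20:37:12"))  # 2022-03-22 20:37:12
```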
Articles are persisted to MongoDB, and an API is provided on top of this persisted data for user access.
The following approaches are under consideration:
Playwright installed via Pipenv is not initialized, and no script seems to handle this step automatically.
You have to drop into a shell and run `playwright install` once before it will work.
PS: I installed from source.
A snippet of the generated RSS is shown below. The updated date here is the time at which liuli touched the entry during one of its periodic runs; even for a very old RSS entry, its updated field gets bumped to the current time.
<entry>
<id>liuli_wechat - 谷歌开发者 - 社区说|TensorFlow 在工业视觉中的落地</id>
<title>社区说|TensorFlow 在工业视觉中的落地 </title>
<updated>2022-05-28T13:17:35.903720+00:00</updated>
<author>
<name>liuli_wechat - GDG</name>
</author>
<content/>
<link href="https://ddns.ysmox.com:8766/backup/liuli_wechat/谷歌开发者/%E7%A4%BE%E5%8C%BA%E8%AF%B4%EF%BD%9CTensorFlow%20%E5%9C%A8%E5%B7%A5%E4%B8%9A%E8%A7%86%E8%A7%89%E4%B8%AD%E7%9A%84%E8%90%BD%E5%9C%B0" rel="alternate"/>
<published>2022-05-25T17:30:46+08:00</published>
</entry>
This causes problems: some RSS readers (such as Tiny Tiny RSS) sort the timeline by updated rather than published, so there is no effective way to tell which entries in the feed were generated recently and which were generated long ago.
So I hope the updated time can be kept stable (e.g. record the current time when the entry is first stored in MongoDB and leave it unchanged on periodic refreshes), or made consistent with published.
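The requested behavior can be sketched as a merge step (a sketch only; the function name and entry shape are hypothetical, not liuli's actual storage code):

```python
from datetime import datetime, timezone

def merge_entry(existing, incoming):
    # Stamp 'updated' only when the entry is first seen; on periodic
    # refreshes, carry the original value over so feed readers that sort
    # by 'updated' keep old entries in their original place.
    if existing is None:
        incoming["updated"] = datetime.now(timezone.utc).isoformat()
    else:
        incoming["updated"] = existing["updated"]
    return incoming

first = merge_entry(None, {"id": "doc-1"})
refreshed = merge_entry(first, {"id": "doc-1", "content": "new html"})
assert refreshed["updated"] == first["updated"]
```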
Finally, I hope I have expressed my problem and request clearly. Thanks!
Installed following the instructions at https://mp.weixin.qq.com/s/rxoq97YodwtAdTqKntuwMA.
The actual files and code are as follows:
Content of the pro.env file:
PYTHONPATH=${PYTHONPATH}:${PWD}
LL_M_USER="liuli"
LL_M_PASS="liuli"
LL_M_HOST="liuli_mongodb"
LL_M_PORT="27017"
LL_M_DB="admin"
LL_M_OP_DB="liuli"
LL_FLASK_DEBUG=0
LL_HOST="0.0.0.0"
LL_HTTP_PORT=8765
LL_WORKERS=1
# None of the settings above need changing; only the ones below require per-user configuration
# Fill in your actual IP
LL_DOMAIN="http://172.17.0.1:8765"
# Fill in your WeCom distribution settings
LL_WECOM_ID="custom"
LL_WECOM_AGENT_ID="custom"
LL_WECOM_SECRET="custom"
The content of default.json is as follows:
{
"name": "default",
"author": "liuli_team",
"collector": {
"wechat_sougou": {
"wechat_list": [
"老胡的储物柜"
],
"delta_time": 5,
"spider_type": "playwright"
}
},
"processor": {
"before_collect": [],
"after_collect": [{
"func": "ad_marker",
"cos_value": 0.6
}, {
"func": "to_rss",
"link_source": "github"
}]
},
"sender": {
"sender_list": ["wecom"],
"query_days": 7,
"delta_time": 3
},
"backup": {
"backup_list": ["mongodb"],
"query_days": 7,
"delta_time": 3,
"init_config": {},
"after_get_content": [{
"func": "str_replace",
"before_str": "data-src=\"",
"after_str": "src=\"https://images.weserv.nl/?url="
}]
},
"schedule": {
"period_list": [
"00:10",
"12:10",
"21:10"
]
}
}
Content of the docker-compose.yml file:
version: "3"
services:
liuli_api:
image: liuliio/api:v0.1.3
restart: always
container_name: liuli_api
ports:
- "8765:8765"
volumes:
- ./pro.env:/data/code/pro.env
depends_on:
- liuli_mongodb
networks:
- liuli-network
liuli_schedule:
image: liuliio/schedule:v0.2.4
restart: always
container_name: liuli_schedule
volumes:
- ./pro.env:/data/code/pro.env
- ./liuli_config:/data/code/liuli_config
depends_on:
- liuli_mongodb
networks:
- liuli-network
liuli_mongodb:
image: mongo:3.6
restart: always
container_name: liuli_mongodb
environment:
- MONGO_INITDB_ROOT_USERNAME=liuli
- MONGO_INITDB_ROOT_PASSWORD=liuli
ports:
- "27027:27017"
volumes:
- ./mongodb_data:/data/db
command: mongod
networks:
- liuli-network
networks:
liuli-network:
driver: bridge
The error output is as follows:
liuli_schedule | Loading .env environment variables...
liuli_schedule | Start schedule(pro) serve: PIPENV_DOTENV_LOCATION=./pro.env pipenv run python src/liuli_schedule.py
liuli_schedule | Loading .env environment variables...
(the three lines above repeat several more times)
liuli_schedule exited with code 0
I suspect it is a Python path problem. My Python path is:
which python3 # /usr/bin/python3
My VPS does not have the ${PYTHONPATH} environment variable:
echo ${PYTHONPATH} # NULL
Could you tell me how to fix this?
Some categories of articles get updated repeatedly, but the current backup mechanism never updates a document after insertion; this should be abstracted into a parameter the user controls.
The plan is to support delivering articles to the following targets:
Requests for additional distribution targets are welcome in the comments.
To improve the model's recognition accuracy, I hope everyone can contribute some ad samples. See the sample file .files/datasets/ads.csv; the format I've defined is as follows:
title | url | is_process |
---|---|---|
ad article title | ad article link | 0 |
Field note: is_process can simply be set to 0. Here is an example:
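A sample row in the title | url | is_process layout could be produced like this (the title and URL below are made up purely for illustration, not a real ad sample):

```python
import csv
import io

# Hypothetical sample row for .files/datasets/ads.csv.
row = ["某广告文章标题", "https://mp.weixin.qq.com/s/xxxx", "0"]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["title", "url", "is_process"])  # header matching the table above
writer.writerow(row)
print(buf.getvalue())
```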
Ads are usually published across multiple official accounts, so when submitting please check whether the record already exists. I really hope everyone can pitch in together; come on, open a PR and contribute!
Currently, ad feedback requires users to collect and submit samples themselves, which is a somewhat high-friction process; a more convenient way to submit ads is under consideration.
About backing up official-account articles: each article's HTML is fairly large, and the point of persisting it probably corresponds to a need for online reading. Moreover, persisting to a database and then serving it is still a burden for users who only have a small server.
The current idea is to build a storage-and-browsing module for official-account articles on top of GitHub Pages: when the collector fetches the raw HTML, it pushes it straight to one of the user's GitHub repositories, e.g. github.com/howie6879/2c_wechat_html, and then Pages serves the content for browsing.
What does everyone think? Any other ideas?
# Initialize
2c init
# Start
2c start
# Stop
2c stop
[2022:05:27 08:11:47] INFO Request <GET: https://weixin.sogou.com/weixin?type=1&query=丁爸20%情报分析师的工具箱&ie=utf8&s_from=input&_sug_=n&_sug_type_=>
liuli_schedule | [2022:05:27 08:11:48] ERROR SGWechatSpider <Item: Failed to get target_item's value from html.>
liuli_schedule | Traceback (most recent call last):
liuli_schedule | File "/root/.local/share/virtualenvs/code-nY5aaahP/lib/python3.9/site-packages/ruia/spider.py", line 197, in _process_async_callback
liuli_schedule | async for callback_result in callback_results:
liuli_schedule | File "/data/code/src/collector/wechat/sg_ruia_start.py", line 58, in parse
liuli_schedule | async for item in SGWechatItem.get_items(html=html):
liuli_schedule | File "/root/.local/share/virtualenvs/code-nY5aaahP/lib/python3.9/site-packages/ruia/item.py", line 127, in get_items
liuli_schedule | raise ValueError(value_error_info)
liuli_schedule | ValueError: <Item: Failed to get target_item's value from html.>
For example, the link https://author.baidu.com/home?from=bjh_article&app_id=1669728810290752 — ideally its sub-sections (articles, videos) could be handled separately.
Trigger the image build job on tag push @LeslieLeung
The run log is below. What is going on here?
[2022:02:18 10:51:54] INFO Liuli Schedule(v0.2.1) task(default@liuli_team) started successfully :)
[2022:02:18 10:51:54] INFO Liuli Task(default@liuli_team) schedule time:
00:10
12:10
21:10
[2022:02:18 10:51:54] ERROR Liuli 执行失败!'doc_source'
As the title says... personally, using this kind of lightweight database saves the hassle of maintenance...
The sources being considered for support are:
Provide an official documentation site for the Liuli project.
Compress raw HTML before storage and decompress it after retrieval.
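A minimal sketch of this with the stdlib (the helper names are hypothetical; repetitive markup like WeChat article HTML compresses very well):

```python
import zlib

def compress_html(raw_html: str) -> bytes:
    # Deflate before persisting; level 9 trades CPU for the smallest output.
    return zlib.compress(raw_html.encode("utf-8"), 9)

def decompress_html(blob: bytes) -> str:
    # Inflate after retrieval, restoring the original page exactly.
    return zlib.decompress(blob).decode("utf-8")

page = "<html><body>" + "<p>正文段落</p>" * 500 + "</body></html>"
blob = compress_html(page)
assert decompress_html(blob) == page
print(f"{len(page.encode('utf-8'))} bytes -> {len(blob)} bytes")
```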
Origin of the project name, provided by group member @Sngxpro:
Codename: 琉璃 (Liuli)
English: RuriElysion
or: RuriWorld
Slogan: 琉璃开净界,薜荔启禅关 ("Liuli opens the pure realm; climbing figs unlock the gate of Zen") — from Mei Yaochen, 《缑山子晋祠 会善寺》
Meaning: to build a pure land like the Eastern Lapis Lazuli Pure World. The Medicine Buddha Sutra (《药师经》) says: 「然彼佛土,一向清净,无有女人,亦无恶趣,及苦音声。」 ("That buddha-land is utterly pure, without evil destinies or sounds of suffering.")
The following content needs to be changed:
@123seven is working on this.
If you manually add the pro.env file as the tutorial says, docker fails to start; but if you don't add the file, starting docker auto-creates a pro.env directory, and docker then loops over the following log output:
Loading .env environment variables...
Start schedule(pro) serve: PIPENV_DOTENV_LOCATION=./pro.env pipenv run python src/liuli_schedule.py
Warning: file PIPENV_DOTENV_LOCATION=./pro.env does not exist!!
Not loading environment variables.
Process Process-1:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/data/code/src/liuli_schedule.py", line 84, in run_liuli_schedule
ll_config = json.load(load_f)
File "/usr/local/lib/python3.9/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/usr/local/lib/python3.9/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/usr/local/lib/python3.9/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/local/lib/python3.9/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
Add Raspberry Pi deployment support and slim down the image.
Liuli needs a scheduling framework to manage the task workflow of the collector, processor, sender, backup, and so on. Research results are as follows:
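Whatever framework is chosen, the semantics of the "period_list" in default.json (fire at fixed HH:MM times each day) can be sketched with the stdlib (the function name is hypothetical, not the project's scheduler):

```python
from datetime import datetime, timedelta

def next_run(period_list, now):
    # Return the next wall-clock datetime at which a task from a
    # "period_list" of HH:MM strings should fire.
    times = sorted(datetime.strptime(t, "%H:%M").time() for t in period_list)
    for t in times:
        candidate = datetime.combine(now.date(), t)
        if candidate > now:
            return candidate
    # Every slot today has passed: fire at the first slot tomorrow.
    return datetime.combine(now.date() + timedelta(days=1), times[0])

periods = ["00:10", "12:10", "21:10"]
print(next_run(periods, datetime(2022, 5, 9, 10, 55)))  # 2022-05-09 12:10:00
print(next_run(periods, datetime(2022, 5, 9, 22, 0)))   # 2022-05-10 00:10:00
```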
from src.classifier import model_predict_factory
model_resp = model_predict_factory(
model_name="cos", model_path="", input_dict={"text": doc_name, "cos_value": 0.5}
)
The current image uses docker-playwright-python as its base image; the downside is that it is fairly large, which hurts user experience.
@zyd16888 will slim it down using mcr.microsoft.com/playwright:focal.