
liuli's Issues

Classifier: support the CharCNN classification model

from src.classifier import model_predict_factory

model_resp = model_predict_factory(
    model_name="charcnn", model_path="", input_dict={"text": doc_name}
)
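For context, a factory like this typically dispatches on model_name through a registry. A minimal sketch of the pattern (the registry and predictor names here are illustrative, not liuli's actual implementation):

```python
from typing import Callable, Dict

# Hypothetical registry mapping model names to predictor callables.
MODEL_REGISTRY: Dict[str, Callable[..., dict]] = {}

def register(name: str):
    """Register a predictor function under a model name."""
    def wrapper(func):
        MODEL_REGISTRY[name] = func
        return func
    return wrapper

@register("charcnn")
def charcnn_predict(model_path: str, input_dict: dict) -> dict:
    # A real implementation would load CharCNN weights from model_path
    # and run inference on input_dict["text"]; this stub just echoes shape.
    return {"model_name": "charcnn", "result": 0, "probability": 0.0}

def model_predict_factory(model_name: str, model_path: str, input_dict: dict) -> dict:
    """Look up the predictor registered for model_name and invoke it."""
    try:
        predictor = MODEL_REGISTRY[model_name]
    except KeyError:
        raise ValueError(f"Unknown model: {model_name!r}")
    return predictor(model_path=model_path, input_dict=input_dict)
```

Registering models this way keeps the call site identical for charcnn, cos, or any future model.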

docker-compose support

Because 2c depends on MongoDB, docker-compose is used for one-click installation to make image-based deployment easier.

Sogou cannot find unverified official accounts

Some of the official accounts I follow are unverified, so Sogou search cannot find them and no link can be generated. For example: 丁香园内分泌时间 (and some other accounts whose names start with 丁香园).

The demo for crawling WeChat official accounts fails

Following https://mp.weixin.qq.com/s/rxoq97YodwtAdTqKntuwMA,
I just started the demo and tried crawling some WeChat official account content, but the log shows the run failed.

Loading .env environment variables...
[2022:05:09 10:55:45] INFO  Liuli Schedule(v0.2.4) task(default@liuli_team) started successfully :)
[2022:05:09 10:55:45] INFO  Liuli Task(default@liuli_team) schedule time:
 00:10
 12:10
 21:10
[2022:05:09 10:55:45] ERROR Liuli 执行失败!'doc_source'

The liuli schedule image version used in the docker-compose file from the article does not include playwright, while the default.json provided in the article says WeChat content is crawled with playwright. I tried switching to the playwright-enabled image version, and it also fails.

Multiple users: some recipients dislike the official accounts I follow

I set this system up a few days ago and got a colleague to follow the push bot, but he complained that I follow too many official accounts and he gets a pile of pushes he doesn't care about every day!
I hope the sender can support per-user distribution, so different people receive different content. It's probably hard, but I'm sure it won't stump you!

Timestamp format cleaning fails when crawling official account articles

The test script:

from src.collector.wechat_feddd.start import WeiXinSpider
WeiXinSpider.request_config = {"RETRIES": 3, "DELAY": 5, "TIMEOUT": 20}
WeiXinSpider.start_urls = ['https://mp.weixin.qq.com/s/OrCRVCZ8cGOLRf5p5avHOg']
WeiXinSpider.start()

Cause:
During data cleaning the expected format is 2022-03-21 20:59, but the scraped data is actually 2022-03-22 20:37:12, so the clean_doc_ts function raises an error.
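A tolerant cleaner that accepts both formats from this report could look like the following (clean_doc_ts is reused as a name from the issue; the implementation itself is an illustrative sketch, not liuli's source):

```python
from datetime import datetime

# Try several known WeChat timestamp formats instead of assuming one.
KNOWN_FORMATS = ("%Y-%m-%d %H:%M", "%Y-%m-%d %H:%M:%S")

def clean_doc_ts(raw: str) -> float:
    """Parse a scraped article timestamp, returning a Unix timestamp."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).timestamp()
        except ValueError:
            continue  # wrong format, try the next one
    raise ValueError(f"Unrecognized timestamp format: {raw!r}")
```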

Feature request: stop changing the updated field in the generated RSS

A snippet of the generated RSS is shown below. The updated date here is the time at which liuli last refreshed the entry during its periodic runs, so even a very old RSS entry gets its updated bumped to the current time.

<entry>
    <id>liuli_wechat - 谷歌开发者 - 社区说|TensorFlow 在工业视觉中的落地</id>
    <title>社区说|TensorFlow 在工业视觉中的落地 </title>
    <updated>2022-05-28T13:17:35.903720+00:00</updated>
    <author>
        <name>liuli_wechat - GDG</name>
    </author>
    <content/>
    <link href="https://ddns.ysmox.com:8766/backup/liuli_wechat/谷歌开发者/%E7%A4%BE%E5%8C%BA%E8%AF%B4%EF%BD%9CTensorFlow%20%E5%9C%A8%E5%B7%A5%E4%B8%9A%E8%A7%86%E8%A7%89%E4%B8%AD%E7%9A%84%E8%90%BD%E5%9C%B0" rel="alternate"/>
    <published>2022-05-25T17:30:46+08:00</published>
</entry>

This causes problems: some RSS readers (e.g. Tiny Tiny RSS) sort the timeline by updated rather than published, so it becomes impossible to tell which entries were generated recently and which are old.

So I hope updated can be kept stable (e.g. record the current time when the entry is first stored in mongodb, and leave it unchanged on periodic updates), or simply kept identical to published.
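The requested rule can be sketched as follows (the dict shape and function name are illustrative, not liuli's actual schema): set updated once at first insert and never overwrite it on regeneration.

```python
# Sketch: set `updated` when the document is first stored, and reuse it
# (falling back to `published`) on every later feed regeneration, so old
# entries are never bumped to the top of the timeline.
def build_entry(doc: dict) -> dict:
    published = doc["published"]                # recorded at first insert
    updated = doc.get("updated") or published   # never bumped on refresh
    return {"id": doc["id"], "published": published, "updated": updated}
```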

Finally, I hope I've stated my problem and request clearly. Thanks!

liuli_schedule exited with code 0

I installed following the guide at https://mp.weixin.qq.com/s/rxoq97YodwtAdTqKntuwMA.

The actual files and code are as follows:

Contents of pro.env:

PYTHONPATH=${PYTHONPATH}:${PWD}
LL_M_USER="liuli"
LL_M_PASS="liuli"
LL_M_HOST="liuli_mongodb"
LL_M_PORT="27017"
LL_M_DB="admin"
LL_M_OP_DB="liuli"
LL_FLASK_DEBUG=0
LL_HOST="0.0.0.0"
LL_HTTP_PORT=8765
LL_WORKERS=1
# None of the settings above need changing; only the ones below are per-user
# Fill in your actual IP
LL_DOMAIN="http://172.17.0.1:8765"
# Fill in your WeCom (WeChat Work) distribution settings
LL_WECOM_ID="自定义"
LL_WECOM_AGENT_ID="自定义"
LL_WECOM_SECRET="自定义"

Contents of default.json:

{
    "name": "default",
    "author": "liuli_team",
    "collector": {
        "wechat_sougou": {
            "wechat_list": [
                "老胡的储物柜"
            ],
            "delta_time": 5,
            "spider_type": "playwright"
        }
    },
    "processor": {
        "before_collect": [],
        "after_collect": [{
            "func": "ad_marker",
            "cos_value": 0.6
        }, {
            "func": "to_rss",
            "link_source": "github"
        }]
    },
    "sender": {
        "sender_list": ["wecom"],
        "query_days": 7,
        "delta_time": 3
    },
    "backup": {
        "backup_list": ["mongodb"],
        "query_days": 7,
        "delta_time": 3,
        "init_config": {},
        "after_get_content": [{
            "func": "str_replace",
            "before_str": "data-src=\"",
            "after_str": "src=\"https://images.weserv.nl/?url="
        }]
    },
    "schedule": {
        "period_list": [
            "00:10",
            "12:10",
            "21:10"
        ]
    }
}

Contents of docker-compose.yml:

version: "3"
services:
  liuli_api:
    image: liuliio/api:v0.1.3
    restart: always
    container_name: liuli_api
    ports:
      - "8765:8765"
    volumes:
      - ./pro.env:/data/code/pro.env
    depends_on:
      - liuli_mongodb
    networks:
      - liuli-network
  liuli_schedule:
    image: liuliio/schedule:v0.2.4
    restart: always
    container_name: liuli_schedule
    volumes:
      - ./pro.env:/data/code/pro.env
      - ./liuli_config:/data/code/liuli_config
    depends_on:
      - liuli_mongodb
    networks:
      - liuli-network
  liuli_mongodb:
    image: mongo:3.6
    restart: always
    container_name: liuli_mongodb
    environment:
      - MONGO_INITDB_ROOT_USERNAME=liuli
      - MONGO_INITDB_ROOT_PASSWORD=liuli
    ports:
      - "27027:27017"
    volumes:
      - ./mongodb_data:/data/db
    command: mongod
    networks:
      - liuli-network

networks:
  liuli-network:
    driver: bridge

The error output:

liuli_schedule  | Loading .env environment variables...
liuli_schedule  | Start schedule(pro) serve: PIPENV_DOTENV_LOCATION=./pro.env pipenv run python src/liuli_schedule.py
liuli_schedule  | Loading .env environment variables...
liuli_schedule  | (the lines above repeat in a loop as the container restarts)
liuli_schedule exited with code 0

I suspect it's a Python path problem. My Python path:

which python3 # /usr/bin/python3

My VPS does not have a ${PYTHONPATH} environment variable:

echo ${PYTHONPATH} # NULL
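For reference, an unset PYTHONPATH is usually harmless here: in POSIX shells ${PYTHONPATH} simply expands to an empty string, so the pro.env line still appends ${PWD}. A quick check:

```shell
# With PYTHONPATH unset, "${PYTHONPATH}:${PWD}" expands to ":<pwd>";
# the leading empty component is harmless for Python's import path.
unset PYTHONPATH
export PYTHONPATH="${PYTHONPATH}:${PWD}"
echo "$PYTHONPATH"
```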

Could someone tell me how to fix this?

Force update for the backup module

Some categories of articles are updated repeatedly, but the current backup mechanism never updates a document after the initial insert. A parameter should be exposed so the user can control this.

[Help Wanted!]更多的广告样本

To improve the model's recognition accuracy, I hope everyone can contribute some ad samples. See the sample file .files/datasets/ads.csv; the format I've defined is:

title              url              is_process
ad article title   ad article link  0

Field descriptions:

  • title: the article title
  • url: the article link; for WeChat articles, please first check that the link has not expired
  • is_process: whether the sample has been processed; just fill in 0 by default

An example:

[screenshot: 2c_ads_csv_demo]

The same ad is usually published across multiple official accounts, so when adding a record, please check whether it already exists. I really, really hope everyone can pitch in. Come on, open a PR and contribute!
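For contributors, appending a row in the expected shape can be done with the standard csv module (the file path comes from the issue above; the row values here are placeholders):

```python
import csv

def append_ad_sample(path: str, title: str, url: str) -> None:
    """Append one ad sample to the dataset; columns are title,url,is_process."""
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([title, url, 0])  # is_process defaults to 0
```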

Ad feedback collection mechanism

Currently users must collect and submit ad feedback themselves, which has some friction; consider providing a more convenient way to submit ads.

Persist articles to GitHub Pages

Regarding official account article backup: each article's HTML is fairly large, and persistence usually implies a need for online reading. Persisting to a database and then serving it is still a burden for users who only have a small server.

The current idea is to build an article storage and browsing module on top of GitHub Pages: when the collector fetches the raw HTML, push it directly to a repository under the user's account, e.g. github.com/howie6879/2c_wechat_html, and then serve the pages for browsing via GitHub Pages.

What does everyone think? Any other ideas?

Collection always fails for official account names containing spaces

[2022:05:27 08:11:47] INFO Request <GET: https://weixin.sogou.com/weixin?type=1&query=丁爸20%情报分析师的工具箱&ie=utf8&s_from=input&_sug_=n&_sug_type_=>
liuli_schedule | [2022:05:27 08:11:48] ERROR SGWechatSpider <Item: Failed to get target_item's value from html.>
liuli_schedule | Traceback (most recent call last):
liuli_schedule | File "/root/.local/share/virtualenvs/code-nY5aaahP/lib/python3.9/site-packages/ruia/spider.py", line 197, in _process_async_callback
liuli_schedule | async for callback_result in callback_results:
liuli_schedule | File "/data/code/src/collector/wechat/sg_ruia_start.py", line 58, in parse
liuli_schedule | async for item in SGWechatItem.get_items(html=html):
liuli_schedule | File "/root/.local/share/virtualenvs/code-nY5aaahP/lib/python3.9/site-packages/ruia/item.py", line 127, in get_items
liuli_schedule | raise ValueError(value_error_info)
liuli_schedule | ValueError: <Item: Failed to get target_item's value from html.>
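The query string in the log above contains 丁爸20%情报分析师的工具箱, which suggests the space in the account name was not percent-encoded before being interpolated into the URL (a space should become %20). A hedged sketch of pre-encoding the name with the standard library; the URL shape is copied from the log, and this is not liuli's actual code:

```python
from urllib.parse import quote

def build_sogou_url(wechat_name: str) -> str:
    # Percent-encode the account name so spaces and non-ASCII characters
    # survive the query string intact.
    query = quote(wechat_name, safe="")
    return (
        "https://weixin.sogou.com/weixin"
        f"?type=1&query={query}&ie=utf8&s_from=input&_sug_=n&_sug_type_="
    )
```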

The Liuli project needs a logo

Origin of the project name, provided by group member @Sngxpro:

Codename: 琉璃 (Liuli)

English: RuriElysion or RuriWorld

Slogan: 琉璃开净界,薜荔启禅关 (from 梅尧臣, 《缑山子晋祠 会善寺》)

Meaning: to build a pure land like the Eastern Lapis Lazuli Pure World. As the Medicine Buddha Sutra (《药师经》) says: 「然彼佛土,一向清净,无有女人,亦无恶趣,及苦音声。」

Rename 2C to Liuli

Codename: 琉璃 (Liuli)

English: RuriElysion or RuriWorld

Slogan: 琉璃开净界,薜荔启禅关 (from 梅尧臣, 《缑山子晋祠 会善寺》)

Meaning: to build a pure land like the Eastern Lapis Lazuli Pure World. As the Medicine Buddha Sutra (《药师经》) says: 「然彼佛土,一向清净,无有女人,亦无恶趣,及苦音声。」

The following need to be changed:

  • Code referencing 2c
  • Docker images
  • Documentation

Feature request: include the original article link in the RSS feed


My current plan is to write a script that fetches the full text via a full-text extraction API and mails it to my Gmail in a custom format, so that besides newsletters, RSS feeds and WeChat official accounts can all be read directly in Spark...

However, the paid full-text API I found is demanding: the link format in the RSS doesn't work, and even after converting it with decodeURIComponent the format is still incorrect.

If the RSS entries carried the original page link, full text could be fetched from the original URL without errors!

I hope the author can support this. Thanks :)

Version 0.2.4: schedule fails to start when following the tutorial

If I manually create the pro.env file as the tutorial describes, docker won't start; if I don't create the file, starting docker auto-creates a pro.env directory instead, and docker then loops the following output:
Loading .env environment variables...
Start schedule(pro) serve: PIPENV_DOTENV_LOCATION=./pro.env pipenv run python src/liuli_schedule.py
Warning: file PIPENV_DOTENV_LOCATION=./pro.env does not exist!!
Not loading environment variables.
Process Process-1:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/data/code/src/liuli_schedule.py", line 84, in run_liuli_schedule
ll_config = json.load(load_f)
File "/usr/local/lib/python3.9/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/usr/local/lib/python3.9/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/usr/local/lib/python3.9/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/local/lib/python3.9/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
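This matches standard Docker behavior: when the source of a bind mount does not exist, Docker creates it as a directory, and the schedule later fails to json.load the missing config. Creating the files before docker-compose up avoids this (paths follow the tutorial; the '{}' placeholder config is illustrative):

```shell
# Docker creates a *directory* for a missing bind-mount source.
# Make sure pro.env and liuli_config/default.json exist as files first.
rm -rf pro.env                 # remove the auto-created directory if present
touch pro.env                  # then create the real file
mkdir -p liuli_config
[ -f liuli_config/default.json ] || echo '{}' > liuli_config/default.json
```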

Scheduling framework

Liuli needs a scheduling framework to manage the task workflow of the collector, processor, sender, backup module, etc. Findings from the research:

Classifier: support a cosine-similarity model

from src.classifier import model_predict_factory

model_resp = model_predict_factory(
    model_name="cos", model_path="", input_dict={"text": doc_name, "cos_value": 0.5}
)
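For context, a cosine model like this presumably scores a document title against known ad samples and flags it when the similarity exceeds cos_value. A minimal character-count sketch (the feature choice and function names are assumptions, not liuli's actual implementation):

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between two strings using character-count vectors."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[ch] * vb[ch] for ch in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def is_ad(title: str, ad_titles: list, cos_value: float = 0.5) -> bool:
    """Flag a title as an ad if it is close enough to any known ad sample."""
    return any(cosine_similarity(title, t) >= cos_value for t in ad_titles)
```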
