allnewsspider's Introduction

env

Python 3.6.6 (64-bit) is required, on Windows, Linux, or macOS.

To crawl The New York Times, use urllib3 version 1.25.11; otherwise the proxy may fail.
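
The requirements above can be checked before running a spider. The following is a minimal sketch, not part of the project: the function name `check_environment` and the message wording are made up for illustration, and only the version numbers come from this README. The urllib3 pin matters only for the New York Times spider.

```python
import struct
import sys

REQUIRED_PYTHON = (3, 6, 6)    # from this README
REQUIRED_URLLIB3 = "1.25.11"   # from this README; only needed for the New York Times spider


def check_environment(python_version, pointer_bits, urllib3_version):
    """Return a list of human-readable mismatches (empty if everything matches)."""
    problems = []
    found = tuple(python_version[:3])
    if found != REQUIRED_PYTHON:
        problems.append("Python %d.%d.%d found, 3.6.6 expected" % found)
    if pointer_bits != 64:
        problems.append("%d-bit interpreter found, 64-bit expected" % pointer_bits)
    if urllib3_version != REQUIRED_URLLIB3:
        problems.append("urllib3 %s found, %s expected" % (urllib3_version, REQUIRED_URLLIB3))
    return problems


if __name__ == "__main__":
    try:
        import urllib3
        installed = urllib3.__version__
    except ImportError:
        installed = "not installed"
    bits = struct.calcsize("P") * 8  # pointer size in bits: 64 on a 64-bit build
    for line in check_environment(sys.version_info, bits, installed) or ["environment matches the README"]:
        print(line)
```

If the urllib3 line reports a mismatch, `pip install urllib3==1.25.11` installs the pinned version.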

intro

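Sina News, Tencent News, Sohu News, and The Paper (澎湃新闻).

The short-term goal is to crawl news from every news portal site. Each portal's spider works out of the box and automatically saves its results to csv/excel files in the same directory. Commercial use of the collected data is prohibited.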
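
To illustrate what "saving to a csv file in the same directory" can look like, here is a minimal sketch using only the standard library. It is not the spiders' actual code: the column names in `FIELDS` and the function `save_news_rows` are hypothetical, since the README does not document the real output columns.

```python
import csv

# Hypothetical column names; the spiders' real output columns are not documented here.
FIELDS = ["title", "url", "publish_time", "content"]


def save_news_rows(rows, path):
    """Write a list of dicts to a CSV file, one news item per row."""
    # utf-8-sig writes a BOM so that Excel detects UTF-8 and displays Chinese text correctly.
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        # Keys not listed in FIELDS are dropped; missing keys become empty cells.
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Calling `save_news_rows(items, "news.csv")` with a relative path writes the file into the current working directory, which matches the README's description when the script is run from its own folder.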

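The long-term goal is to build an information-feed aggregation platform, or to do higher-level processing such as public-opinion analysis and geographic visualization of news.

A website that integrates these spiders is live. Try it at: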

https://xt98.tech:9494 http://buyixiao.xyz

http://8.142.38.214

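The project is maintained long-term and stars are welcome. For more information about the project, follow the author's WeChat official account 【月小水长】.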

how to use

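The code in each folder is the news spider for the corresponding platform.

Run py files directly.

pyd files need a small runner script. Assuming the file is pengpai_news_spider.pyd:

  1. Download the pyd file, create a new project, and put the pyd file in it.

  2. Create runner.py in the project root and write the following code into it. Running it starts the crawl.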

    import pengpai_news_spider
    pengpai_news_spider.main()
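
If the import fails with `ModuleNotFoundError`, it can help to check which compiled-module file names the interpreter accepts. As a general CPython fact, `.pyd` is the extension-module suffix on Windows, and the import system only considers files named as the module name plus one of `importlib.machinery.EXTENSION_SUFFIXES`. The sketch below lists those names and looks for a match; the helper names are hypothetical, and whether a suffix mismatch is the cause of any particular failure is an assumption, since the README does not say which platforms the pyd files were built for.

```python
import os
from importlib.machinery import EXTENSION_SUFFIXES


def importable_filenames(module_name, suffixes=EXTENSION_SUFFIXES):
    """File names this interpreter would accept as a compiled build of module_name."""
    return [module_name + suffix for suffix in suffixes]


def find_compiled_module(module_name, directory=".", suffixes=EXTENSION_SUFFIXES):
    """Return the first acceptable file name found in directory, or None."""
    present = set(os.listdir(directory))
    for name in importable_filenames(module_name, suffixes):
        if name in present:
            return name
    return None


if __name__ == "__main__":
    print("accepted suffixes:", EXTENSION_SUFFIXES)
    print("match:", find_compiled_module("pengpai_news_spider"))
```

If `.pyd` does not appear among the accepted suffixes, the interpreter will not pick up `pengpai_news_spider.pyd` no matter where the file is placed on `sys.path`.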

todo

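1. Baidu News spider: done, released

2. The Paper (澎湃新闻) spider: done, released

3. Tencent News spider: done, released

4. Sina News spider: done, released

5. The New York Times spider: done, released

6. The Times spider: done, released

7. BBC News spider: done, released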

allnewsspider's People

Contributors

awxiaoxian2020, inspurer

allnewsspider's Issues

Cannot find the library file, need help

Platform: VS Code, Linux

Program:
import sys,os
sys.path.append("/othercode/lizi/1/AllNewsSpider/pengpai")
print(sys.path)
print(os.path.realpath('.'))
import pengpai_news_spider
pengpai_news_spider.main()
Console:
(mypy366) coder@codercom-code-server1:/othercode/lizi/1/AllNewsSpider/pengpai$ /home/coder/micromamba/envs/mypy366/bin/python /othercode/lizi/1/AllNewsSpider/pengpai/runner.py
['/othercode/lizi/1/AllNewsSpider/pengpai', '/home/coder/micromamba/envs/mypy366/lib/python36.zip', '/home/coder/micromamba/envs/mypy366/lib/python3.6', '/home/coder/micromamba/envs/mypy366/lib/python3.6/lib-dynload', '/home/coder/micromamba/envs/mypy366/lib/python3.6/site-packages', '/othercode/lizi/1/AllNewsSpider/pengpai']
/othercode/lizi/1/AllNewsSpider/pengpai
Traceback (most recent call last):
  File "/othercode/lizi/1/AllNewsSpider/pengpai/runner.py", line 13, in <module>
    import pengpai_news_spider
ModuleNotFoundError: No module named 'pengpai_news_spider'

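pyd module not found

Hello, and thank you for building these spiders. When using the thetime spider, I put thetime_news_spider.pyd and runner.py in the same directory, but it still reports No module named 'thetime_news_spider'. Any help would be appreciated!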

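Two questions

1. Can spiders such as the one for The Paper (澎湃新闻) be set up to search by keyword?

2. Can the Baidu News spider fetch the full text of articles?

Thanks!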

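New York Times crawl fails

Hello, when I tried to crawl The New York Times, the program ran briefly and then terminated on its own. I tried different keywords and different start and end dates, and got the same error every time (see the attached screenshot).

Could you take a look? Thanks.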

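Could you consider sharing the source code?

@inspurer Hello, and thank you very much for sharing this project.
I noticed that the sina and tencent spiders only cover four categories: technology, entertainment, military, and finance. Is there a way to crawl all categories, such as sports, autos, and education? Also, besides the .pyd files, could you share the source code? Thanks.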
