liuroy / zhihu_spider
Zhihu crawler
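Not sure whether this is really the author...
Use celery to download user avatars asynchronously
Impressive
Preventing the crawler from being blocked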
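One common way to lower the risk of being blocked is to slow the crawler down. The fragment below is a minimal sketch, assuming the project uses a standard Scrapy `settings.py`. The setting names are Scrapy's own, but every value is an illustrative assumption and none of them is taken from this repository.

```python
# settings.py (sketch; all values are illustrative assumptions)

# Wait between consecutive requests to the same site, in seconds.
DOWNLOAD_DELAY = 2.0
# Vary the wait between 0.5x and 1.5x of DOWNLOAD_DELAY so requests look less regular.
RANDOMIZE_DOWNLOAD_DELAY = True
# Keep only a few requests in flight per domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 2
# Let Scrapy adapt the delay to the latency it observes from the server.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
```

Throttling alone may not be enough if the site also checks logins or request headers, so treat these values as a starting point to tune.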
If two users follow each other, the crawler can end up in an infinite loop.
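The usual remedy for this kind of cycle is to remember which users have already been scheduled and skip them. The sketch below is not the project's code: `crawl`, `get_following`, and the `graph` dictionary are hypothetical names used only to illustrate the idea on an in-memory follow graph.

```python
from collections import deque

def crawl(start, get_following):
    """Breadth-first walk over the follow graph, visiting each user once."""
    seen = {start}          # users already queued or visited
    queue = deque([start])
    order = []
    while queue:
        user = queue.popleft()
        order.append(user)
        for other in get_following(user):
            if other not in seen:   # skip users that were already scheduled
                seen.add(other)
                queue.append(other)
    return order

# Hypothetical follow graph: alice and bob follow each other.
graph = {"alice": ["bob"], "bob": ["alice", "carol"], "carol": []}
visited = crawl("alice", lambda u: graph.get(u, []))
```

Scrapy's scheduler also filters duplicate requests by default unless a request sets `dont_filter=True`, so whether the loop actually occurs depends on how the spider issues its requests.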
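Could you give a demonstration?
First, look at the author's instructions:
The crawler depends on mongo and rabbitmq, so both services must be running and properly configured. To speed up downloads, image downloading runs as an asynchronous task, so the async worker has to be started when launching the crawler process. To start it, enter the zhihu_spider/zhihu directory and run the following command:
celery -A zhihu.tools.async worker --loglevel=info
After entering zhihu_spider, run docker-compose up. Once inside the container, the procedure is the same as running locally: start mongo, rabbitmq, the async task, and the crawler process, in that order. The image used by docker can be found in my other project, spider-docker.
This is all prose, with no instructions on which commands to run and no explanation of the commands. For a beginner who only half understands scrapy and has never used mongo or rabbitmq, there is simply no way to get started. How do I start it? Which piece of code gets started? After it starts, where do I see the results, and how do I exit? None of this is explained. Please don't look down on beginners; we just started learning later. I really can't compliment this project's documentation.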
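Since the quoted instructions require mongo and rabbitmq to be running before anything else, a quick reachability check can help a newcomer confirm the first two steps. This is a minimal sketch, not part of the project: `port_open` and `check_services` are hypothetical helpers, and the ports are the usual defaults (27017 for MongoDB, 5672 for RabbitMQ's AMQP listener), which is an assumption about the local setup.

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Assumed defaults: MongoDB listens on 27017, RabbitMQ (AMQP) on 5672.
SERVICES = {"mongo": 27017, "rabbitmq": 5672}

def check_services(host: str = "127.0.0.1") -> dict:
    """Map each service name to whether its port accepts connections."""
    return {name: port_open(host, port) for name, port in SERVICES.items()}

if __name__ == "__main__":
    for name, ok in check_services().items():
        print(f"{name}: {'reachable' if ok else 'NOT reachable'}")
```

Once both services report as reachable, the celery command quoted above starts the async worker. The quoted text does not give the command that launches the crawler process itself, so that step is left out here.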
us.codecraft.webmagic.selector.RegexSelector#line73
public RegexResult selectGroup(String text) {
    Matcher matcher = regex.matcher(text);
    if (matcher.find()) {
        // Why groupCount() + 1 here? This looks odd: it returns the last character of the successfully matched string.
        String[] groups = new String[matcher.groupCount() + 1];
        for (int i = 0; i < groups.length; i++) {
            groups[i] = matcher.group(i);
        }
        return new RegexResult(groups);
    }
    return RegexResult.EMPTY_RESULT;
}
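On the `groupCount() + 1` question: in Java, `Matcher.groupCount()` counts only the capturing groups and does not include group 0, which is the whole match, so an array meant to hold group 0 through the last group needs one extra slot. The sketch below shows the same convention in Python, where `Pattern.groups` plays the role of `groupCount()`. The function `select_groups` is a hypothetical analogue, not webmagic's API. The second call illustrates one possible cause of the "last character" observation: a repeated capturing group keeps only its final iteration.

```python
import re

def select_groups(pattern: str, text: str) -> list:
    """Return [whole match, group 1, group 2, ...] or [] if nothing matches."""
    regex = re.compile(pattern)
    match = regex.search(text)
    if match is None:
        return []
    # regex.groups counts only the capturing groups; group 0 is the extra slot.
    return [match.group(i) for i in range(regex.groups + 1)]

dates = select_groups(r"(\d+)-(\d+)", "id 12-34")   # two capturing groups
repeated = select_groups(r"(\w)+", "abc")           # one repeated group
```

Here `dates` holds three entries (the whole match plus two groups), and `repeated` holds the whole match followed by the last character that the repeated group captured.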
Zhihu has been redesigned. Handling the _xsrf verification is no longer enough; the process has become more complicated.
As the title says.
https://repo.mongodb.com/yum/redhat/7/mongodb-enterprise/stable/x86_64/repodata/repomd.xml: [Errno 14] HTTPS Error 404 - Not Found
Trying other mirror.
To address this issue please refer to the below knowledge base article
https://access.redhat.com/articles/1320623
If above article doesn't help to resolve this issue please create a bug on https://bugs.centos.org/
One of the configured repositories failed (MongoDB Enterprise Repository),
and yum doesn't have enough cached data to continue. At this point the only
safe thing yum can do is fail. There are a few ways to work "fix" this:
1. Contact the upstream for the repository and get them to fix the problem.
2. Reconfigure the baseurl/etc. for the repository, to point to a working
upstream. This is most often useful if you are using a newer
distribution release than is supported by the repository (and the
packages for the previous distribution release still work).
3. Run the command with the repository temporarily disabled
yum --disablerepo=mongodb-enterprise ...
4. Disable the repository permanently, so yum won't use it by default. Yum
will then just ignore the repository until you permanently enable it
again or use --enablerepo for temporary usage:
yum-config-manager --disable mongodb-enterprise
or
subscription-manager repos --disable=mongodb-enterprise
5. Configure the failing repository to be skipped, if it is unavailable.
Note that yum will try to contact the repo. when it runs most commands,
so will have to try and fail each time (and thus. yum will be be much
slower). If it is a very temporary problem though, this is often a nice
compromise:
yum-config-manager --save --setopt=mongodb-enterprise.skip_if_unavailable=true
failure: repodata/repomd.xml from mongodb-enterprise: [Errno 256] No more mirrors to try.
https://repo.mongodb.com/yum/redhat/7/mongodb-enterprise/stable/x86_64/repodata/repomd.xml: [Errno 14] HTTPS Error 404 - Not Found
As the title says.