Comments (15)
It looks like there are no users left to crawl. Did you crawl the friends pages?
from cola.
I modified the program so that it does not crawl the friends pages, including follows and followers.
from cola.
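cola crawls Weibo like this: first the initial users (starts) are pushed onto a queue; while a user is being crawled, his friends are pushed onto the same queue; the next user to crawl is taken from the queue. So if the friends pages are not crawled, only the initial users can be crawled.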
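A minimal sketch of the queue mechanism described above, not cola's actual code. The `crawl` function, the `crawl_friends` switch, the `seen` set used to avoid revisiting users, and the friend graph are all made up for illustration.

```python
from collections import deque

def crawl(starts, get_friends, crawl_friends=True):
    """Return the order in which users are crawled, seeded with `starts`."""
    queue = deque(starts)
    seen = set(starts)  # assumption: each user is queued at most once
    order = []
    while queue:
        user = queue.popleft()
        order.append(user)  # stands in for fetching this user's weibos
        if not crawl_friends:
            continue  # friends pages skipped, so nothing new is queued
        for friend in get_friends(user):
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
    return order

# Hypothetical friend graph, for illustration only.
FRIENDS = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}

def get_friends(user):
    return FRIENDS.get(user, [])

print(crawl(["a"], get_friends))                       # ['a', 'b', 'c', 'd']
print(crawl(["a"], get_friends, crawl_friends=False))  # ['a']
```

With friends crawling turned off, the queue is never refilled, so the crawl stops after the starts.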
from cola.
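The system is very stable. It crawled the weibos of several dozen users. Thank you very much.
However, I don't know how to make cola keep crawling the new weibos these users post later. Right now, once cola has finished crawling, it seems to print "no budget left to process" over and over, the same as what the original poster ran into. It does not crawl newly added weibos. Even if I stop cola and run it again, it is still the same.
develop branch, standalone mode. Basically all default settings.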
from cola.
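@windch: Successfully crawling a few dozen users doesn't really show that the system is stable, does it?
Also, cola's logic is to crawl all the weibos a user has posted as of the current time. Each user is treated as a bundle, and once that user has been crawled, the bundle is finished. If you need to crawl newly posted weibos, you should write some logic for that yourself.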
from cola.
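@windch The develop branch should support incremental crawling; setting inc to yes in the config enables it.
The way it works is that a bundle is pushed to the incremental crawl queue after it has been crawled, and this queue is given a certain time slice in which to run.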
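A rough sketch of that idea, not cola's real scheduler. Here a slice is modelled as a number of bundles rather than wall-clock time, and the queue names, slice sizes, and the `crawl_bundle` callback are hypothetical.

```python
from collections import deque

def run_round(queues, slices, crawl_bundle):
    """Visit each queue once, taking at most slices[name] bundles from it."""
    for name in list(queues):
        for _ in range(slices[name]):
            if not queues[name]:
                break
            bundle = queues[name].popleft()
            crawl_bundle(bundle)
            # A finished bundle goes to the incremental queue to be revisited.
            queues["inc"].append(bundle)

# Hypothetical setup: one priority queue plus the incremental queue.
queues = {"0": deque(["u1", "u2"]), "inc": deque()}
slices = {"0": 2, "inc": 1}
crawled = []
for _ in range(2):
    run_round(queues, slices, crawled.append)
print(crawled)  # ['u1', 'u2', 'u1', 'u2']
```

In this model, once the priority queue is drained, every later round still revisits bundles through the inc queue.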
from cola.
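Small signs can tell you a lot. That's why I say cola is very stable :)
It is the develop branch, and the default config has inc set to yes. I ran it once more, and it still does not seem to crawl the latest weibos.
There are 65 start uids.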
$ python init.py
/opt/cola/cola/core/opener.py:108: UserWarning: gzip transfer encoding is experimental!
self.browser.set_handle_gzip(True)
/opt/cola/cola/core/opener.py:108: UserWarning: gzip transfer encoding is experimental!
self.browser.set_handle_gzip(True)
start to process priority: 0
no budget left to process
start to process priority: 0
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: 1
no budget left to process
no budget left to process
start to process priority: 1
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: 2
no budget left to process
no budget left to process
start to process priority: 2
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: inc
no budget left to process
no budget left to process
start to process priority: inc
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: 0
no budget left to process
no budget left to process
start to process priority: 0
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
...
...
^CCatch interrupt signal, start to stop
Counters during running:
{'error_urls': 20,
'finishes': 65,
'pages': 7321,
'secs': 15064.990124702454}
Processing shutting down
Shutdown finished
Job id:8ZcGfAqHmzc finished, spend 175.00 seconds for running
from cola.
OK.
This may be a bug. First check whether there are any files under /tmp/cola/worker/<job_id>/mq/inc and whether those files have content.
from cola.
There is no cola directory under /tmp. I am running python init.py in standalone mode.
$ ll /tmp/cola/worker//mq/inc
ls: cannot access /tmp/cola/worker//mq/inc: No such file or directory
$ ll /tmp/cola
ls: cannot access /tmp/cola: No such file or directory
from cola.
You have to go into /tmp/cola first. Under worker there is a directory named after the job id, which is generated from the job name.
from cola.
Found it. Under /tmp/user/1000/cola/worker/8ZcGfAqHmzc/mq/inc there is a file named 9223372036854775807, with a size of 4194304.
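One possible way to locate these directories without knowing the job id in advance. This is a sketch under an assumption: that cola keeps its working directory below the platform temp directory (which would explain a path like /tmp/user/1000 when TMPDIR is set per user). The helper name `find_inc_dirs` is made up.

```python
import glob
import os
import tempfile

def find_inc_dirs(root=None):
    """Return the mq/inc directories of all jobs below <root>/cola/worker."""
    # Assumption: cola's working directory sits under the temp directory.
    root = root or tempfile.gettempdir()
    pattern = os.path.join(root, "cola", "worker", "*", "mq", "inc")
    return sorted(p for p in glob.glob(pattern) if os.path.isdir(p))

for path in find_inc_dirs():
    print(path, os.listdir(path))
```

If nothing is printed, passing the directory explicitly, for example `find_inc_dirs("/tmp/user/1000")`, is worth a try.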
from cola.
OK, then run head on this file and see whether it contains any data.
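As an alternative to head, a small sketch that counts how many bytes of a file are not zero. The reason for counting is a guess: a size of exactly 4194304 (4 * 1024 * 1024) suggests the file may be preallocated and zero-padded, in which case the size alone says nothing about whether it holds data. The `summarize` helper is made up, and the demo runs on a temporary file rather than the real queue file.

```python
import os
import tempfile

def summarize(path, chunk_size=1 << 20):
    """Return (total_bytes, nonzero_bytes) for the file at `path`."""
    total = nonzero = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
            nonzero += len(chunk) - chunk.count(b"\x00")
    return total, nonzero

# Demo on a throwaway file: 4 data bytes followed by 12 zero bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"data" + b"\x00" * 12)
print(summarize(tmp.name))  # (16, 4)
os.remove(tmp.name)
```

A nonzero count of 0 would mean the file is empty padding; anything else means something was written to it.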
from cola.
head 9223372036854775807
�^pccopy_reg
_reconstructor
p1
(cweibo.bundle
WeiboUserBundle
p2
c__builtin__
object
p3
NtRp4
tail 9223372036854775807
ag12
ag43
ag647
ag54
asbsbsg1253
g1255
sS'last_error_page_times'
p1259
I0
sb.
from cola.
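Then there is data in it, so when the run gets to inc it should be able to fetch that data. Please open a new issue for this and describe the problem, and I will fix it soon.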
from cola.
Thanks a lot!
from cola.