Comments (15)

qinxuye avatar qinxuye commented on August 16, 2024

It looks like there are no users left to crawl. Did you crawl the friends pages?

from cola.

hitalex avatar hitalex commented on August 16, 2024

I modified the program so that it does not crawl the friends pages, including both following and followers.

qinxuye avatar qinxuye commented on August 16, 2024

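cola's mechanism for crawling Weibo works like this: first, the initial users (starts) are pushed onto a queue; for each user being crawled, that user's friends are pushed onto the same queue; the next user to crawl is then taken from the queue. So if the friends pages are not crawled, only the initial users can be crawled.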
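
A minimal sketch of the queue mechanism described above, not cola's actual code. The names `crawl_order`, `get_friends`, `crawl_friends`, and the `FRIENDS` graph are hypothetical, and the `seen` set used for deduplication is an assumption. It shows why disabling friend crawling leaves only the start users in the queue.

```python
from collections import deque


def crawl_order(starts, get_friends, crawl_friends=True):
    """Return the order in which users would be crawled (breadth-first)."""
    queue = deque()
    seen = set()  # assumption: each user is queued at most once
    for uid in starts:
        if uid not in seen:
            seen.add(uid)
            queue.append(uid)

    order = []
    while queue:
        uid = queue.popleft()  # next user comes from the queue
        order.append(uid)
        if not crawl_friends:
            continue  # friends pages skipped: nothing new is queued
        for friend in get_friends(uid):
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
    return order


# Hypothetical friend graph, for illustration only.
FRIENDS = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": []}


def lookup(uid):
    return FRIENDS.get(uid, [])


with_friends = crawl_order(["a"], lookup)
without_friends = crawl_order(["a"], lookup, crawl_friends=False)
print(with_friends)     # ['a', 'b', 'c', 'd']
print(without_friends)  # ['a']
```

With friend crawling turned off, the queue drains as soon as the start users are done, which matches the "no users left to crawl" symptom.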

 avatar commented on August 16, 2024

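The system is very stable. It crawled the Weibo posts of several dozen users. Thank you very much.

However, I can't figure out how to make cola keep crawling the new posts these users add later. Right now, once cola has finished crawling, it seems to print "no budget left to process" over and over, the same as what the original poster ran into. It does not crawl newly added posts. Even if I shut cola down and run it again, the result is the same.

develop branch, standalone mode. Almost everything is at the default settings.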

hitalex avatar hitalex commented on August 16, 2024

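@windch: Successfully crawling a few dozen users hardly shows that the system is stable, does it?

Also, cola's logic is to crawl all the posts a user has published as of the current time. Each user is treated as a bundle, and once that user has been crawled, the bundle is complete. If you need to crawl newly published posts, you would have to write some logic yourself.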

qinxuye avatar qinxuye commented on August 16, 2024

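@windch The develop branch should support incremental crawling; it is enabled when inc is set to yes in the config.
The way it works is that once a bundle has finished crawling, it is pushed onto the incremental crawl queue, and that queue is allotted a certain time slice in which to run.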
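
A rough sketch of that idea, under stated assumptions and not cola's real scheduler: the queue names follow the priorities that show up in the log in this thread (0, 1, 2, inc), the function `run_rounds` and the `slices` table are hypothetical, and a fixed number of steps stands in for a real time slice so the example stays deterministic.

```python
from collections import deque

# Order of queues as it appears in the log in this thread.
PRIORITIES = [0, 1, 2, "inc"]


def run_rounds(queues, slices, rounds, crawl):
    """Cycle through the queues; each gets `slices[name]` steps per round."""
    log = []
    for _ in range(rounds):
        for name in PRIORITIES:
            for _ in range(slices[name]):
                if not queues[name]:
                    break  # nothing left in this queue for this slice
                bundle = queues[name].popleft()
                crawl(bundle)
                log.append((name, bundle))
                # A finished bundle is pushed onto the incremental queue,
                # so it is crawled again when "inc" gets its slice.
                queues["inc"].append(bundle)
    return log


queues = {0: deque(["u1", "u2"]), 1: deque(), 2: deque(), "inc": deque()}
slices = {0: 2, 1: 1, 2: 1, "inc": 1}  # hypothetical step budgets
crawled = []
log = run_rounds(queues, slices, rounds=2, crawl=crawled.append)
print(log)
# [(0, 'u1'), (0, 'u2'), ('inc', 'u1'), ('inc', 'u2')]
```

In this model the start users are revisited through the inc queue after their first crawl, which is the behaviour the incremental mode is described as providing.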

 avatar commented on August 16, 2024

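A small sample tells the bigger story, which is why I say cola is very stable :)

It is the develop branch, and inc is yes in the default config. I ran it once more, and it still does not seem to crawl the latest posts.
There are 65 start uids.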

$ python init.py
/opt/cola/cola/core/opener.py:108: UserWarning: gzip transfer encoding is experimental!
self.browser.set_handle_gzip(True)
/opt/cola/cola/core/opener.py:108: UserWarning: gzip transfer encoding is experimental!
self.browser.set_handle_gzip(True)
start to process priority: 0
no budget left to process
start to process priority: 0
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: 1
no budget left to process
no budget left to process
start to process priority: 1
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: 2
no budget left to process
no budget left to process
start to process priority: 2
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: inc
no budget left to process
no budget left to process
start to process priority: inc
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: 0
no budget left to process
no budget left to process
start to process priority: 0
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
...
...
^CCatch interrupt signal, start to stop
Counters during running:
{'error_urls': 20,
'finishes': 65,
'pages': 7321,
'secs': 15064.990124702454}
Processing shutting down
Shutdown finished
Job id:8ZcGfAqHmzc finished, spend 175.00 seconds for running

qinxuye avatar qinxuye commented on August 16, 2024

ok

This may be a bug. First check whether there are any files under /tmp/cola/worker/<job_id>/mq/inc and whether those files have content.

 avatar commented on August 16, 2024

There is no cola directory under /tmp. I am running in standalone mode with python init.py.

$ ll /tmp/cola/worker//mq/inc
ls: cannot access /tmp/cola/worker//mq/inc: No such file or directory
$ ll /tmp/cola
ls: cannot access /tmp/cola: No such file or directory

qinxuye avatar qinxuye commented on August 16, 2024

You need to go into /tmp/cola first. Under worker there is a directory named after the job id, which is generated from the job name.

 avatar commented on August 16, 2024

Found it. Under /tmp/user/1000/cola/worker/8ZcGfAqHmzc/mq/inc there is a file named 9223372036854775807, with a size of 4194304 bytes.
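
The path /tmp/user/1000 suggests a per-user temporary directory rather than plain /tmp. A small sketch for locating the inc queue directories, assuming (this is not confirmed in the thread) that the working directory sits under whatever Python's `tempfile.gettempdir()` returns, which honours the `TMPDIR` environment variable. The helper names `find_inc_dirs` and `list_files_with_sizes` are hypothetical.

```python
import os
import tempfile


def find_inc_dirs(base=None):
    """Return existing <base>/cola/worker/<job_id>/mq/inc directories."""
    # Assumption: the working directory lives under the process temp dir.
    base = base or tempfile.gettempdir()
    worker_root = os.path.join(base, "cola", "worker")
    if not os.path.isdir(worker_root):
        return []
    found = []
    for job_id in sorted(os.listdir(worker_root)):
        inc_dir = os.path.join(worker_root, job_id, "mq", "inc")
        if os.path.isdir(inc_dir):
            found.append(inc_dir)
    return found


def list_files_with_sizes(directory):
    """Return (file name, size in bytes) pairs for a directory."""
    return [
        (name, os.path.getsize(os.path.join(directory, name)))
        for name in sorted(os.listdir(directory))
    ]


if __name__ == "__main__":
    for inc_dir in find_inc_dirs():
        print(inc_dir, list_files_with_sizes(inc_dir))
```

Run as the same user that started the crawler, so that both processes resolve the same temporary directory.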

qinxuye avatar qinxuye commented on August 16, 2024

ok, then run head on that file and see whether it contains any data.

 avatar commented on August 16, 2024

head 9223372036854775807

�^pccopy_reg
_reconstructor
p1
(cweibo.bundle
WeiboUserBundle
p2
c__builtin__
object
p3
NtRp4

tail 9223372036854775807

ag12
ag43
ag647
ag54
asbsbsg1253
g1255
sS'last_error_page_times'
p1259
I0
sb.

qinxuye avatar qinxuye commented on August 16, 2024

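Then there should be data, so when it gets to inc it should be able to fetch it. For this problem, please open a new issue and describe it there; I will fix it soon.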
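
As an alternative to eyeballing head and tail output, here is a hedged sketch for checking whether such a file holds real data. The reported size, 4194304 bytes, is exactly 4 MiB, which may mean the file is preallocated (an assumption), so size alone would not show whether anything was written. The helper `count_nonzero_bytes` is hypothetical and simply counts bytes that are not zero padding.

```python
import os
import tempfile


def count_nonzero_bytes(path, chunk_size=65536):
    """Count bytes in the file that are not NUL (0x00)."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk) - chunk.count(b"\x00")
    return total


# Demo on a throwaway file: zero padding around three real bytes.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo_queue_file")
with open(demo_path, "wb") as f:
    f.write(b"\x00" * 100 + b"abc" + b"\x00" * 10)

print(count_nonzero_bytes(demo_path))  # prints 3
```

A result of 0 would point to an empty queue file; a large count, as the pickled `WeiboUserBundle` text above suggests, means bundles were stored.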

 avatar commented on August 16, 2024

Thanks a lot!
