qianlitp / crawlergo
A powerful browser crawler for web vulnerability scanners
License: GNU General Public License v3.0
Environment: Windows 10, crawlergo 0.1.2, Chrome
A single test run worked well, so I prepared to crawl in batches.
After running a single process for a day, the machine froze; CPU was exhausted and a large pile of chrome processes was left in the background.
It looks like some chrome processes are not closed properly after a crawlergo task finishes.
I tried several sites and all of them reported timeout; the network itself is fine.
2052 ◯ ./crawlergo -c /opt/bugbounty/chrome-linux/chrome -t 20 http://testphp.vulnweb.com/
Crawling GET https://testphp.vulnweb.com/
Crawling GET http://testphp.vulnweb.com/
ERRO[0005] navigate timeout context deadline exceeded
ERRO[0005] http://testphp.vulnweb.com/
--[Mission Complete]--
GET http://testphp.vulnweb.com/ HTTP/1.1
Spider-Name: crawlergo-0KeeTeam
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.0 Safari/537.36
GET https://testphp.vulnweb.com/ HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.0 Safari/537.36
Spider-Name: crawlergo-0KeeTeam
Is it possible to add a parameter-filling feature to crawlergo? Something like the hidden-parameter discovery in:
https://github.com/s0md3v/Arjun
From my testing so far, if I set postdata to 'username=admin&password=password', it is only tried once, other parameters that appear on the same page are ignored, and later username/password fields fall back to the default KeeTeam values. Could it support this: once username=admin is set, every place username appears uses admin instead of KeeTeam, and likewise for password?
As in the title: I am not familiar with the best configuration for these new options. With the goal of crawling as many endpoints as possible, what is a reasonable parameter configuration? Thanks.
Target site: https://www.che168.com/
It has been crawling for two days without finishing, so I hope the author can help figure out the cause.
Because crawlergo is chained into a program I wrote, the endless crawl keeps that program from ever finishing.
Going forward, how should I cap the maximum crawl time or depth?
Some of the crawled URLs:
http://www.che168.com/suihua/suva0-suva-suvb-suvc-suvd/0_8/a0_0msdgscncgpi1ltocsp1exa16/#pvareaid=100943
https://www.che168.com/china/baoma/baoma5xi/0_5/a3_8msdgscncgpi1ltocspexx0a1/#pvareaid=108403%23seriesZong
http://www.che168.com/jiangsu/suva0-suva-suvb-suvc-suvd/0_8/a0_0msdgscncgpi1ltocsp1exa16/
http://www.che168.com/nanjing/suva0-suva-suvb-suvc-suvd/0_8/a0_0msdgscncgpi1ltocsp1exa16/#pvareaid=100943
https://www.che168.com/china/aodi/aodia6l/0_5/a3_8msdgscncgpi1ltocspexx0a1/#pvareaid=108403%23seriesZong
http://www.che168.com/xuzhou/suva0-suva-suvb-suvc-suvd/0_8/a0_0msdgscncgpi1ltocsp1exa16/#pvareaid=100943
http://www.che168.com/wuxi/suva0-suva-suvb-suvc-suvd/0_8/a0_0msdgscncgpi1ltocsp1exa16/#pvareaid=100943
https://www.che168.com/china/baoma/baoma3xi/0_5/a3_8msdgscncgpi1ltocspexx0a1/#pvareaid=108403%23seriesZong
http://www.che168.com/changzhou/suva0-suva-suvb-suvc-suvd/0_8/a0_0msdgscncgpi1ltocsp1exa16/#pvareaid=100943
http://www.che168.com/suzhou/suva0-suva-suvb-suvc-suvd/0_8/a0_0msdgscncgpi1ltocsp1exa16/#pvareaid=1009
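Until crawlergo itself exposes a time or depth cap, the wrapper program that drives it can enforce a hard time budget on the child process. A minimal sketch, assuming the wrapper is Python; the crawlergo path, chrome path, and target are placeholders:

```python
import subprocess

def run_crawler(cmd, timeout_seconds):
    """Run a crawler command, killing it if it exceeds the time budget."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    try:
        out, _ = proc.communicate(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()                    # hard-stop a runaway crawl
        out, _ = proc.communicate()    # collect whatever was produced so far
    return out.decode(errors="replace")

# Hypothetical invocation (paths are placeholders):
# result = run_crawler(["./crawlergo", "-c", "/path/to/chrome",
#                       "-t", "10", "https://www.che168.com/"],
#                      timeout_seconds=3600)
```

Note that killing the crawlergo process may still leave chrome children behind (as reported above), so the wrapper may also need to clean those up separately.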
How can crawler requests be routed through a proxy pool, to avoid getting the IP banned?
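One pattern is to pick a proxy from a pool per invocation and pass it to the crawler. Later crawlergo releases document a --request-proxy flag for this; that flag name is an assumption here, so check your version's --help output. A sketch with placeholder proxy addresses:

```python
import random

# Placeholder proxy pool; substitute real proxy endpoints.
PROXY_POOL = [
    "socks5://127.0.0.1:1080",
    "socks5://127.0.0.1:1081",
]

def build_cmd(target, chrome_path="/path/to/chrome"):
    """Build a crawlergo command line with a randomly chosen proxy.

    --request-proxy is documented in later crawlergo releases; verify it
    exists in the version you run before relying on this.
    """
    proxy = random.choice(PROXY_POOL)
    return ["./crawlergo", "-c", chrome_path,
            "--request-proxy", proxy, target]
```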
The detailed error output:
$ ./crawlergo -c /Applications/Chrome.app -t 20 https://www.baidu.com
INFO[0000] Init crawler task, host: www.baidu.com, max tab count: 20, max crawl count: 200.
INFO[0000] filter mode: smart
INFO[0000] Start crawling.
INFO[0000] filter repeat, target count: 2
INFO[0000] Crawling GET http://www.baidu.com/
INFO[0000] Crawling GET https://www.baidu.com/
WARN[0000] navigate timeout fork/exec /Applications/Chrome.app: permission deniedhttp://www.baidu.com/
WARN[0000] navigate timeout fork/exec /Applications/Chrome.app: permission deniedhttps://www.baidu.com/
INFO[0000] closing browser.
crawlergo and Chrome.app both have execute permission; Chrome.app was renamed from "Google Chrome.app".
macOS 10.14.5, Chrome 80.0.3987.87 (official build, 64-bit), Python 3.7.6.
(Note: the -c flag must point at the executable inside the bundle, e.g. /Applications/Google Chrome.app/Contents/MacOS/Google Chrome; passing the .app directory itself is likely what produces the fork/exec permission-denied error, since a .app is a directory, not a binary.)
CentOS Linux release 7.6.1810 (Core)
[root@VM_0_17_centos data]# ./crawlergo -c /root/.local/share/pyppeteer/local-chromium/575458/chrome-linux/chrome -t 10 http://testphp.vulnweb.com
Crawling GET https://testphp.vulnweb.com/
Crawling GET http://testphp.vulnweb.com/
ERRO[0000] navigate timeout 'Fetch.enable' wasn't found (-32601)
ERRO[0000] https://testphp.vulnweb.com/
ERRO[0000] navigate timeout 'Fetch.enable' wasn't found (-32601)
ERRO[0000] http://testphp.vulnweb.com/
--[Mission Complete]--
GET http://testphp.vulnweb.com/ HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.0 Safari/537.36
Spider-Name: crawlergo-0KeeTeam
GET https://testphp.vulnweb.com/ HTTP/1.1
Spider-Name: crawlergo-0KeeTeam
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.0 Safari/537.36
Partial log:
alling _exit(1). Core file will not be generated.
http:///components
WARN[0006] navigate timeout context deadline exceededhttp://A
WARN[0006] navigate timeout context deadline exceededhttp://A
INFO[0009] Crawling GET http://A/api.php
INFO[0009] Crawling GET http://A/uc_client/
WARN[0021] navigate timeout unable to execute *log.EnableParams: context deadline exceededhttp://A/connect.php
WARN[0021] navigate timeout unable to execute *log.EnableParams: context deadline exceededhttp://A/*?mod=misc*
INFO[0021] closing browser.
> This run hung with no output; pressing Enter did nothing, and it stayed stuck for 24 hours.
This has now happened 3 times and I cannot pin down the cause, because the same Python code sometimes runs without any problem and sometimes hangs.
Some baseline crawler requirements, mainly with the problems of crawling large sites in mind:
a memory cap; rate limiting; a cap on the number of chrome processes; a CPU cap.
Regarding the second point, my current approach is to take the paths fuzzed out by dirsearch, filter them into a list, join the list into one string, and then invoke crawlergo via subprocess; the drawbacks of this are obvious.
I wonder whether the author has plans in this direction. Thanks!
My current method is concatenation. For example, for http://www.A.com with two known paths /path_a and /path_b,
the command is: crawlergo -c chrome http://www.A.com/ http://www.A.com/path_a http://www.A.com/path_b
There are two problems:
Of course, it would be best if a future option supported multiple paths as entry points.
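The concatenation described above can be scripted so every known path is appended to a single invocation instead of spawning one process per target. A minimal sketch; the chrome path is a placeholder:

```python
def build_targets(base, paths):
    """Combine a base URL with known paths into one crawlergo invocation."""
    base = base.rstrip("/")
    targets = [base + "/"] + [base + p for p in paths]
    return ["./crawlergo", "-c", "/path/to/chrome"] + targets
```

Usage: build_targets("http://www.A.com", ["/path_a", "/path_b"]) produces the same argument list as the manual command above, with one crawlergo process covering all entries.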
Originally posted by @djerrystyle in #31 (comment)
Environment:
Darwin ZBMAC-C02VQ02-5.local 17.2.0 Darwin Kernel Version 17.2.0: Fri Sep 29 18:27:05 PDT 2017; root:xnu-4570.20.62~3/RELEASE_X86_64 x86_64
Command:
./crawlergo -c /Applications/Chromium.app/Contents/MacOS/Chromium -f smart -o json -t 5 http://www.baidu.com
Error:
panic: sync: WaitGroup is reused before previous Wait has returned
goroutine 93421 [running]:
sync.(*WaitGroup).Wait(0xc0093f59a0)
C:/Go/src/sync/waitgroup.go:132 +0xae
ioscan-ng/src/tasks/crawlergo/engine.(*Tab).Start.func3(0xc0093f5800)
D:/go_projects/ioscan-ng/src/tasks/crawlergo/engine/tab.go:229 +0x34
created by ioscan-ng/src/tasks/crawlergo/engine.(*Tab).Start
D:/go_projects/ioscan-ng/src/tasks/crawlergo/engine/tab.go:227 +0x4f1
Beyond that, with -o json no results were output, and every request timed out.
A suggestion for --push-to-proxy:
if the value is a filename whose contents are
http://56.67.8.0:9900
socks5://35.88.324.9:8080
the goal is to push results to multiple passive proxies at the same time.
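The proposed file format (one proxy URL per line) is easy to parse on the tool side; a sketch of the suggested behavior, not of anything crawlergo currently implements:

```python
def load_proxies(path):
    """Parse a proxy-list file: one proxy URL per line, blank lines ignored.

    Each entry would receive a copy of the crawl results, so pushing to
    multiple passive proxies becomes a loop over this list.
    """
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]
```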
crawlergo -c /pachong/chrome -t 20 http://testphp.vulnweb.com/
crawlergo -c \pachong\chrome -t 20 http://testphp.vulnweb.com/
Both of these fail on Windows.
./crawlergo_linux -c chrome-linux/chrome -output-mode json http://A.B.com:80/
After execution, it instantly returns:
--[Mission Complete]--
{"req_list":null,"all_domain_list":[xxxxx],"all_req_list":[xxxxx]}
But:
./crawlergo_linux -c chrome-linux/chrome -output-mode json http://A.B.com/
Crawling GET http://A.B.com/
DEBU[0000]
DEBU[0006] context deadline exceeded
--[Mission Complete]--
{"req_list":[xxxxx],"all_domain_list":[xxxxx],"sub_domain_list":[xxxxx]}
It would be nice to support proxy configuration; that would make testing across different network environments convenient. It can be achieved with proxychains and the like, but native support would be much handier :)
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x16b9e8f]
goroutine 975 [running]:
ioscan-ng/src/tasks/crawlergo/engine.(*Tab).InterceptRequest(0xc00062c1c0, 0xc0005e5d80)
D:/go_projects/ioscan-ng/src/tasks/crawlergo/engine/intercept_request.go:42 +0x25f
created by ioscan-ng/src/tasks/crawlergo/engine.NewTab.func1
D:/go_projects/ioscan-ng/src/tasks/crawlergo/engine/tab.go:90 +0x2e8
After running continuously on Windows for 4-5 days, 40 GB+ of chrome temp files had accumulated; several of the filenames look like CrashpadMetrics-active.pma.
The generated temp files probably need to be cleaned up.
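Until the tool cleans up after itself, a periodic sweep of the temp directory is one workaround. A cautious sketch that only lists candidates; the glob patterns are guesses at what a headless-Chrome run leaves behind (the report above mentions CrashpadMetrics-active.pma inside such directories), so inspect your own temp dir and adjust before deleting anything:

```python
import glob
import os
import tempfile
import time

def find_stale_chrome_temp(max_age_hours=24, patterns=("crawlergo*", "chromium*")):
    """List temp-directory entries older than the cutoff.

    Returns paths only; pass the verified result to shutil.rmtree or
    os.remove yourself once you have confirmed the patterns are right.
    """
    cutoff = time.time() - max_age_hours * 3600
    stale = []
    for pattern in patterns:
        for path in glob.glob(os.path.join(tempfile.gettempdir(), pattern)):
            if os.path.getmtime(path) < cutoff:
                stale.append(path)
    return stale
```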
For sites like tuchong that give every user their own subdomain, e.g. https://shenan.tuchong.com/work and https://seatory.tuchong.com/posts:
if there are a lot of pages to crawl, i.e. many targets, spawning a child process per target with ./crawlergo target
is expensive. Could targets be added to a crawl list in one go?
navigate timeout context deadline exceeded
I wanted to run a local crawl test against DedeCMS and immediately hit this error. Am I doing something wrong?
import subprocess
import simplejson

cmd = ["E:/exploit/spider/crawlergo/crawlergo", "-c", "E:/exploit/spider/crawlergo/chrome-win/chrome.exe", "-t", "5", "-f", "smart", "-m", "1", "--output-mode", "json", "https://www.baidu.com"]
rsp = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
output, error = rsp.communicate()  # missing from the original snippet: output was never assigned
result = simplejson.loads(output.decode().split("--[Mission Complete]--")[1])
result["all_req_list"][9]['url']
returned 'https://,Wn=/'. How does a link like this get triggered?
Could a crawl-entry option (a URL list) be added, like AWVS has? Some pages the crawler cannot reach on its own.
The core is the crawler; hoping it gets open-sourced... 😂
crawlergo hangs while crawling... The environment is Windows 10; CPU and memory usage are modest, and the temp directory stays cleared, so it is not a disk-space issue either...
The 0.12 release download is only 5.1 MB, so I wonder whether it was trimmed too aggressively. It exits immediately after launching:
➜ crawlergo mv ~/Downloads/crawlergo ./
➜ crawlergo chmod +x crawlergo
➜ crawlergo ./crawlergo
[1] 9838 killed ./crawlergo
➜ crawlergo ./crawlergo -h
[1] 9845 killed ./crawlergo -h
➜ crawlergo ./crawlergo
[1] 9852 killed ./crawlergo
➜ crawlergo
How are the parameters of a POST form captured?
The auto-filled 0kee value does not always match the input's type, so some forms cannot be triggered and the target page is never reached.
In the video fate0 posted, testing http://testphp.vulnweb.com/AJAX/index.php captured many links, but crawlergo only captures 4. Could requests issued by the JS behind a page's a tags also be captured?
--ignore-url-keywords setup,login sets two keywords, but nothing is ignored; if only login is set, it works.
time="2020-02-21T21:57:58+08:00" level=info msg="Crawling GET http://10.154.159.162/dvwa/login.php"
time="2020-02-21T21:57:58+08:00" level=info msg="Crawling GET http://10.154.159.162/dvwa/"
root@ubuntu:~/Desktop/crawlergo# ./crawlergo -c /Desktop/crawlergo/chrome-linux/chrome -t 20 http://testphp.vulnweb.com/
INFO[0000] Init crawler task, host: testphp.vulnweb.com, max tab count: 20, max crawl count: 200.
INFO[0000] filter mode: smart
INFO[0000] Start crawling.
INFO[0000] filter repeat, target count: 2
INFO[0000] Crawling GET https://testphp.vulnweb.com/
WARN[0000] navigate timeout fork/exec /Desktop/crawlergo/chrome-linux/chrome: no such file or directoryhttps://testphp.vulnweb.com/
INFO[0000] Crawling GET http://testphp.vulnweb.com/
WARN[0000] navigate timeout fork/exec /Desktop/crawlergo/chrome-linux/chrome: no such file or directoryhttp://testphp.vulnweb.com/
INFO[0000] closing browser.
The run is shown above, and crawlergo already has execute (+x) permission. What is going on here? (The warning itself points at the likely cause: fork/exec of /Desktop/crawlergo/chrome-linux/chrome fails with "no such file or directory", so the -c chrome path is probably wrong, perhaps missing its leading home-directory component.)
What causes ./crawlergo: cannot execute binary file: Exec format error?
Can this flag not be used to add a cookie?
--custom-headers: custom HTTP headers, passed as JSON-serialized data; this is a global setting applied to all requests.
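Since --custom-headers takes a JSON-serialized header map that is applied to every request, a Cookie header can be injected through it even without a dedicated cookie flag. A sketch; the chrome path, target, and session value are placeholders:

```python
import json

# Build the value for --custom-headers: a JSON object mapping header
# names to values, applied globally by crawlergo to every request.
headers = {
    "User-Agent": "Mozilla/5.0 ...",
    "Cookie": "PHPSESSID=abcd1234; security=low",  # placeholder session cookie
}
cmd = ["./crawlergo", "-c", "/path/to/chrome",
       "--custom-headers", json.dumps(headers),
       "http://testphp.vulnweb.com/"]
```

As the next report notes, this only helps when the credential lives in a header; tokens kept in LocalStorage or in a request body are not covered.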
Thanks to the author for sharing such a useful crawler.
While using it I found crawling pages that require authentication somewhat underpowered. The header customization only covers scenarios where a Cookie carries the credential; in many SPA scenarios the token lives in the browser's LocalStorage, or is attached as a fixed field in the submitted body. It would be great if both of those could be customized as well. I hope these features can make it into a later update, since most pages nowadays require authentication, and crawling only unauthenticated pages yields limited information.
Thanks again for sharing 🙏 🙏 🙏 🙏 !
$ crawlergo.exe -c .\GoogleChromePortable64\GoogleChromePortable.exe http://www.baidu.com
Crawling GET https://www.baidu.com/
Crawling GET http://www.baidu.com/
time="2019-12-31T10:56:43+08:00" level=error msg="navigate timeout chrome failed to start:\n"
time="2019-12-31T10:56:43+08:00" level=error msg="https://www.baidu.com/"
time="2019-12-31T10:56:43+08:00" level=debug msg="all navigation tasks done."
time="2019-12-31T10:56:43+08:00" level=error msg="navigate timeout chrome failed to start:\n"
time="2019-12-31T10:56:43+08:00" level=error msg="http://www.baidu.com/"
time="2019-12-31T10:56:43+08:00" level=debug msg="get comment nodes err"
time="2019-12-31T10:56:43+08:00" level=debug msg="all navigation tasks done."
time="2019-12-31T10:56:43+08:00" level=debug msg="invalid target"
time="2019-12-31T10:56:43+08:00" level=debug msg="get comment nodes err"
time="2019-12-31T10:56:43+08:00" level=debug msg="invalid target"
--[Mission Complete]--
GET http://www.baidu.com/ HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.0 Safari/537.36
Spider-Name: crawlergo-0KeeTeam
GET https://www.baidu.com/ HTTP/1.1
Spider-Name: crawlergo-0KeeTeam
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.0 Safari/537.36
The default headers include "Spider-Name": "crawlergo-0KeeTeam", and this fingerprint is easily blocked by filtering rules.
Configuring a proxy through the googlemini browser has no effect, and launching with a proxy directly does not work either (chromium --proxy-server="socks5://127.0.0.1:1080").