atomicptr / crab
🦀 A versatile tool to crawl dozens of URLs from a given source, like a sitemap or a URL list.
License: MIT License
A few months ago the snap part of the pipeline broke, and I haven't been able to get it working again since, so I decided to remove snap support for now (I might still push updates manually, we'll see...).
Since I use neither Ubuntu nor Snap, my interest in this is fairly low anyway. If someone wants to help fix this, I'm open to it.
When running
$ crab crawl:sitemap https://www.domain.com/sitemap.xml --header user-agent=some-whitelisted-user-agent
the request for the sitemap doesn't use the user-agent header.
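To illustrate the expected behaviour, here is a minimal Python sketch, not crab's actual implementation. It assumes headers arrive as key=value strings, as in the command above; parse_header_flags and build_request are hypothetical helper names. The point is only that the sitemap request and the page requests are built from the same header set.

```python
from urllib.request import Request


def parse_header_flags(flags):
    """Turn ["key=value", ...] strings into a header dict (hypothetical helper)."""
    headers = {}
    for flag in flags:
        name, sep, value = flag.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {flag!r}")
        headers[name.strip()] = value.strip()
    return headers


def build_request(url, headers):
    """Build every request, sitemap or page, with the same custom headers."""
    return Request(url, headers=headers)


headers = parse_header_flags(["user-agent=some-whitelisted-user-agent"])

# The sitemap fetch and the page fetches share one code path for headers,
# so neither can silently fall back to a default user agent.
sitemap_request = build_request("https://www.domain.com/sitemap.xml", headers)
page_request = build_request("https://www.domain.com/page-1", headers)

print(sitemap_request.get_header("User-agent"))
```

No network request is made here; the sketch only constructs the request objects so the headers can be inspected.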
$ crab crawl:sitemap https://domain.com/sitemap.xml --output output.log
# Nothing is printed here...
It would probably make sense to have a separate option for persisting the logs as a valid JSON file...
$ crab crawl:sitemap https://domain.com/sitemap.xml --output-json output.json
$ cat output.json
[
{"status": 200, "url": "https://domain.com/page-1", ...},
...
]
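As a rough sketch of the idea, the snippet below collects log entries into a single valid JSON array. It assumes the existing log holds one JSON object per line with "status" and "url" fields, which may not match crab's real log format; log_lines_to_json_array is a hypothetical name.

```python
import json
from typing import Iterable


def log_lines_to_json_array(lines: Iterable[str]) -> str:
    """Collect line-delimited JSON log entries into one valid JSON array."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between entries
        entries.append(json.loads(line))
    return json.dumps(entries, indent=2)


# Assumed sample entries, one JSON object per line.
sample_log = [
    '{"status": 200, "url": "https://domain.com/page-1"}',
    '{"status": 404, "url": "https://domain.com/page-2"}',
]

print(log_lines_to_json_array(sample_log))
```

Buffering all entries before writing keeps the file valid JSON even if the crawl is large, at the cost of holding the results in memory until the end.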
It should be possible to filter crawl logs. Depending on what I'm doing, I might just want to look for non-200 pages.
An idea of how this could look:
$ crab crawl:sitemap https://domain.com/sitemap.xml --filter-status=!200
# Only lists results that are not 200...
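One possible way to interpret such a filter value, sketched in Python under the assumption that only a single status code with an optional leading "!" is supported. The parse_status_filter name and the result dictionaries are hypothetical, not part of crab.

```python
from typing import Callable


def parse_status_filter(expr: str) -> Callable[[int], bool]:
    """Turn a filter like "200" or "!200" into a predicate on status codes."""
    negate = expr.startswith("!")
    wanted = int(expr[1:] if negate else expr)
    if negate:
        return lambda status: status != wanted
    return lambda status: status == wanted


# Assumed crawl results for illustration.
results = [
    {"status": 200, "url": "https://domain.com/page-1"},
    {"status": 404, "url": "https://domain.com/page-2"},
    {"status": 500, "url": "https://domain.com/page-3"},
]

keep = parse_status_filter("!200")
filtered = [r for r in results if keep(r["status"])]

for r in filtered:
    print(r["status"], r["url"])
```

Note that many shells treat an unquoted "!" specially (history expansion), so a real flag of this shape would likely need quoting on the command line.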
Preferably as some sort of integration with the vegeta load testing tool.