Snapcrawl - crawl a website and take screenshots


Snapcrawl is a command line utility for crawling a website and saving screenshots.

Features

  • Crawls a website to any given depth and saves screenshots
  • Can capture the full length of the page
  • Can use a specific resolution for screenshots
  • Skips capturing if the screenshot was already saved recently
  • Uses local caching to avoid expensive crawl operations if not needed
  • Reports broken links

Install

Using Docker

You can run Snapcrawl using this Docker image, which contains all the necessary prerequisites:

$ alias snapcrawl='docker run --rm -it --network host --volume "$PWD:/app" dannyben/snapcrawl'

For more information on the Docker image, refer to the docker-snapcrawl repository.

Using Ruby

$ gem install snapcrawl

Note that Snapcrawl requires PhantomJS and ImageMagick to be installed.

Usage

Snapcrawl can be configured either through a YAML configuration file or by specifying options on the command line.

$ snapcrawl
Usage:
  snapcrawl URL [--config FILE] [SETTINGS...]
  snapcrawl -h | --help
  snapcrawl -v | --version

The default configuration filename is snapcrawl.yml.

Using the --config flag creates a template configuration file if the file is not present:

$ snapcrawl example.com --config snapcrawl

Specifying options on the command line

All configuration options can be specified on the command line as key=value pairs:

$ snapcrawl example.com log_level=0 depth=2 width=1024
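
To make the key=value convention concrete, here is a minimal Ruby sketch of how such pairs could be parsed and layered over default values. It is an illustration only, not Snapcrawl's actual implementation: the `DEFAULTS` hash (a subset of the documented defaults) and the `parse_settings` helper are hypothetical names, and parsing each value as YAML is an assumption made so that numbers become integers.

```ruby
require 'yaml'

# A subset of the documented defaults (hypothetical constant name).
DEFAULTS = {
  'log_level' => 1,
  'depth'     => 1,
  'width'     => 1280,
  'height'    => 0
}.freeze

# Hypothetical helper: turn ["depth=2", "width=1024"] into a settings hash.
# Each value is parsed as YAML (an assumption), so "2" becomes the integer 2.
def parse_settings(args)
  args.each_with_object({}) do |arg, result|
    key, value = arg.split('=', 2)
    next if value.nil? # skip arguments that are not key=value pairs

    result[key] = YAML.safe_load(value)
  end
end

# Command line values take precedence over the defaults.
settings = DEFAULTS.merge(parse_settings(%w[log_level=0 depth=2 width=1024]))

puts settings['depth']  # 2, overridden on the command line
puts settings['height'] # 0, left at its default
```

In this sketch, any option not given on the command line keeps its default value.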

Sample configuration file

# All values below are the default values

# log level (0-4) 0=DEBUG 1=INFO 2=WARN 3=ERROR 4=FATAL
log_level: 1

# log_color (yes, no, auto)
# yes  = always show log color
# no   = never use colors
# auto = only use colors when running in an interactive terminal
log_color: auto

# number of levels to crawl, 0 means capture only the root URL
depth: 1

# screenshot width in pixels
width: 1280

# screenshot height in pixels, 0 means the entire height
height: 0

# number of seconds to consider the page cache and its screenshot fresh
cache_life: 86400

# where to store the HTML page cache
cache_dir: cache

# where to store screenshots
snaps_dir: snaps

# screenshot filename template, where '%{url}' will be replaced with a 
# slug version of the URL (no need to include the .png extension)
name_template: '%{url}'

# urls not matching this regular expression will be ignored
url_whitelist: 

# urls matching this regular expression will be ignored
url_blacklist: 

# take a screenshot of this CSS selector only
css_selector: 
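
The name_template, url_whitelist and url_blacklist settings above can be pictured with a short Ruby sketch. This is a hedged illustration of the documented behavior, not Snapcrawl's own code: `slugify`, `screenshot_name` and `crawlable?` are hypothetical helpers, and the slug rule (lowercase, non-alphanumeric runs replaced by dashes) is an assumption. The `%{url}` placeholder is standard Ruby format-string syntax, which `format` expands from a hash.

```ruby
# Hypothetical helpers; Snapcrawl's real implementation may differ.

# Turn a URL into a filesystem-friendly slug (assumed slug rule).
def slugify(url)
  url.downcase.gsub(/[^a-z0-9]+/, '-').gsub(/\A-+|-+\z/, '')
end

# Expand a name_template value; '%{url}' is replaced by the slug,
# and the .png extension is appended automatically.
def screenshot_name(template, url)
  format(template, url: slugify(url)) + '.png'
end

# Apply url_whitelist / url_blacklist, each a regex source string or nil.
# A URL is crawled only if it matches the whitelist (when one is set)
# and does not match the blacklist (when one is set).
def crawlable?(url, whitelist: nil, blacklist: nil)
  return false if whitelist && url !~ Regexp.new(whitelist)
  return false if blacklist && url =~ Regexp.new(blacklist)

  true
end

puts screenshot_name('%{url}', 'https://example.com/docs/intro')
# prints "https-example-com-docs-intro.png"

puts crawlable?('https://example.com/blog/post', blacklist: '/blog/')
# prints "false"
```

Leaving both filters empty, as in the sample file, means every discovered URL is eligible for capture in this sketch.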

Contributing / Support

If you experience any issues, have a question or a suggestion, or wish to contribute, feel free to open an issue.

