Giter Club home page Giter Club logo

sentry-4's Introduction

Sentry

GitHub Slack License Codecov CI

Sentry is a parallelized web crawler written in Go that writes urls, links, & response headers to a Postgres database, then stores the response itself on amazon S3. It keeps a list of “sources”, which use simple string comparison to keep it from wandering outside of designated domains or url paths.

The big difference from other crawlers is a tunable “stale duration”, which will tell the crawler to capture an updated snapshot of the page if the time since the last GET request is older than the stale duration. This gives it a continual “watching” property.

Sentry holds a separate stream of scraping for any url that looks like a dataset. So when it encounters urls that look like https://foo.com/file.csv, it assumes that file ending may be a static asset, and places that url on a separate thread for archiving.

License & Copyright

Copyright (C) 2017 Data Together
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3.0.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Getting Involved

We would love involvement from more people! If you notice any errors or would like to submit changes, please see our Contributing Guidelines.

We use GitHub issues for tracking bugs and feature requests and Pull Requests (PRs) for submitting changes

Installation

Docker installation

docker compose up

Manual installation

  1. Install Go language
  2. Download and build repository
    export GOPATH=$(go env GOPATH)
    mkdir -pv $GOPATH
    cd $GOPATH
    
    git clone https://github.com/archivers-space/sentry
    cd sentry
    go install
  3. Configure Postgres server and then set connection URL
    export POSTGRES_DB_URL=postgres://[USERNAME_HERE]:[PASSWORD_HERE]@localhost:[PORT]/[DB_NAME]
    
  4. Run sentry
    $GOPATH/bin/sentry
  5. Configure S3 buckets [TODO]
  • on production
  • on development (how do you work with them in development env?)

Development

Coming soon!

Related Projects

In parallel to building this tool, we have engaged in efforts to map the landscape of similar projects:

👀 See: Comparison of web archiving software

sentry-4's People

Contributors

b5 avatar dcwalk avatar patcon avatar titaniumbones avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.