rts's Introduction

Description

RTS (Realtime scrapper) is a tool that scrapes pastie sites, GitHub, Reddit, and similar sources in real time to identify occurrences of configured search terms. When a match is found, an email is triggered. This allows a company to react to leaked code, tweeted hacks, and similar events, and to harden itself against an attack before it goes viral.

In a malicious user's hands, the same tool can be used offensively to get updates on the latest hacks, code leaks, and so on.

The following sites are monitored:

  • Non-Pastie Sites
    • Twitter
    • Reddit
    • Github
  • Pastie Sites
    • Pastebin.com
    • Codepad.org
    • Dumpz.org
    • Snipplr.com
    • Paste.org.ru
    • Gist.github.com
    • Pastebin.ca
    • Kpaste.net
    • Slexy.org
    • Ideone.com
    • Pastebin.fr
For architecture information and details of how this tool works, refer to the documentation folder in this repository.

Configuration

Before using this tool, it is necessary to understand the properties files present in the scrapper_config directory.

  • consumer.properties: Holds the configuration needed by the Kafka consumer (refer to the Apache Kafka guide for more information). The values present here are default options and do not require any changes.
  • producer.properties: Holds the configuration needed by the Kafka producer (refer to the Apache Kafka guide for more information). The values present here are default options and do not require any changes.
  • email.properties: Holds the configuration needed to send email.
  • scanner-configuration.properties: This is the core configuration file. Update the configuration here to enable search on Twitter and GitHub (to get tokens and keys, refer to the respective sites). For pastie sites and Reddit, no configuration changes are needed.
  • Note: In all cases, make sure to change "searchterms" to the values you want to search for. If there are multiple search terms, separate them with commas, as in the example data provided in the config file.
    Understanding more about the scanner-configuration.properties file:
        For any pastie site, the configuration is as below. Note: leave the pastie site configuration as is and change only the search terms as required by the organization. That is sufficient.
        • scrapper.(pastie name).profile=(Pastie profile name)
        • scrapper.(pastie name).homeurl=(URL from which pastie ids are extracted)
        • scrapper.(pastie name).regex=(Regex to fetch pastie ids)
        • scrapper.(pastie name).downloadurl=(URL to get information about each pastie)
        • scrapper.(pastie name).searchterms=(Terms to be searched, separated by commas)
        • scrapper.(pastie name).timetosleep=(Time for which pastie thread will sleep before fetching pastie ids again)
        For GitHub search, the configuration is as below:
        • scrapper.github.profile=Github
        • scrapper.github.baseurl=https://api.github.com/search/code?q={searchTerm}&sort=indexed&order=asc
        • scrapper.github.access_token=(Get your own github access token)
        • scrapper.github.searchterms=(Terms to be searched, separated by commas)
        • scrapper.github.timetosleep=(Time for which the GitHub thread should sleep before searching again)
        For Reddit search, the configuration is as below:
        • scrapper.reddit.profile=Reddit
        • scrapper.reddit.baseurl=https://www.reddit.com/search.json?q={searchterm}
        • scrapper.reddit.searchterms=(Terms to be searched, separated by commas)
        • scrapper.reddit.timetosleep=(Time for which the Reddit thread should sleep before searching again)
        For Twitter search, the configuration is as below:
        • scrapper.twitter.apikey=test
        • scrapper.twitter.profile=Twitter
        • scrapper.twitter.searchterms=(Terms to be searched, separated by commas)
        • scrapper.twitter.consumerKey=(Get your own consumer key)
        • scrapper.twitter.consumerSecret=(Get your own consumerSecret)
        • scrapper.twitter.accessToken=(Get your own accessToken)
        • scrapper.twitter.accessTokenSecret=(Get your own accessTokenSecret)
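
As an illustration of the "searchterms" and "baseurl" keys described above, here is a minimal Java sketch of how comma-separated search terms could be read from a properties file and substituted into a search URL. It is not taken from the RTS source: the class name, the method names, the sample values, and the case-insensitive substring match are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Properties;

// Hypothetical helper, not part of RTS.
class SearchTermConfigSketch {

    // Reads "scrapper.<source>.searchterms" and splits it on commas,
    // trimming whitespace and skipping empty entries.
    static List<String> searchTerms(Properties config, String source) {
        String raw = config.getProperty("scrapper." + source + ".searchterms", "");
        List<String> terms = new ArrayList<>();
        for (String part : raw.split(",")) {
            String term = part.trim();
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    // Replaces the placeholder in a baseurl (for example "{searchterm}")
    // with the URL-encoded search term.
    static String searchUrl(String baseUrl, String placeholder, String term) {
        return baseUrl.replace(placeholder, URLEncoder.encode(term, StandardCharsets.UTF_8));
    }

    // Assumed matching rule: case-insensitive substring search.
    static boolean matchesAny(String text, List<String> terms) {
        String haystack = text.toLowerCase(Locale.ROOT);
        for (String term : terms) {
            if (haystack.contains(term.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // Inline stand-in for scanner-configuration.properties; the search terms are made up.
        String sample = "scrapper.reddit.profile=Reddit\n"
                + "scrapper.reddit.baseurl=https://www.reddit.com/search.json?q={searchterm}\n"
                + "scrapper.reddit.searchterms=examplecorp, internal api key\n";
        Properties config = new Properties();
        config.load(new StringReader(sample));

        String baseUrl = config.getProperty("scrapper.reddit.baseurl");
        for (String term : searchTerms(config, "reddit")) {
            System.out.println(searchUrl(baseUrl, "{searchterm}", term));
        }
        System.out.println(matchesAny("Leaked ExampleCorp credentials", searchTerms(config, "reddit")));
    }
}
```

Trimming each term means that "a, b" and "a,b" are treated the same, which is convenient when the list is edited by hand.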

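The pastie keys (homeurl, regex, downloadurl, searchterms) suggest a simple polling cycle: fetch the listing page, extract pastie ids, download each pastie, and check it against the search terms. The Java sketch below walks through one such cycle under stated assumptions; it is not the RTS implementation. It assumes the regex exposes the id as capture group 1 and that the id is appended to downloadurl, and it replaces real HTTP calls with an in-memory map so it runs offline. All names and URLs in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of one pastie polling cycle, not part of RTS.
class PastiePollSketch {

    // Collects capture group 1 of every regex match, keeping first-seen order
    // and dropping duplicates.
    static List<String> extractIds(String listingPage, String idRegex) {
        Set<String> ids = new LinkedHashSet<>();
        Matcher matcher = Pattern.compile(idRegex).matcher(listingPage);
        while (matcher.find()) {
            ids.add(matcher.group(1));
        }
        return new ArrayList<>(ids);
    }

    // One cycle: list ids from homeUrl, fetch downloadUrl + id for each,
    // and return the ids whose content contains any search term.
    // "fetch" stands in for an HTTP GET that returns the response body.
    static List<String> pollOnce(Function<String, String> fetch, String homeUrl, String idRegex,
                                 String downloadUrl, List<String> searchTerms) {
        List<String> hits = new ArrayList<>();
        for (String id : extractIds(fetch.apply(homeUrl), idRegex)) {
            String body = fetch.apply(downloadUrl + id).toLowerCase(Locale.ROOT);
            for (String term : searchTerms) {
                if (body.contains(term.toLowerCase(Locale.ROOT))) {
                    hits.add(id);
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Fake pages keyed by URL; paste.example is a placeholder host.
        Map<String, String> pages = new HashMap<>();
        pages.put("http://paste.example/archive",
                "<a href=\"/abc123\">first</a> <a href=\"/xyz789\">second</a>");
        pages.put("http://paste.example/raw/abc123", "nothing interesting");
        pages.put("http://paste.example/raw/xyz789", "db password for ExampleCorp");

        List<String> hits = pollOnce(
                url -> pages.getOrDefault(url, ""),
                "http://paste.example/archive",
                "href=\"/([A-Za-z0-9]{6})\"",
                "http://paste.example/raw/",
                Arrays.asList("examplecorp"));
        System.out.println(hits);
    }
}
```

In a long-running tool, a cycle like this would be repeated after sleeping for the configured timetosleep, and each hit would be passed on for notification.
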
How to use the tool

  • Install the JDK
  • Install Maven (mvn) and add it to the path
  • Start the ZooKeeper and Kafka servers (refer to https://kafka.apache.org/documentation/#quickstart for more information)
    • Commands needed to start Kafka on Windows:
      • zookeeper-server-start.bat ../../config/zookeeper.properties
      • kafka-server-start.bat ../../config/server.properties
      • kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic "Kafka Topic name"
    • Commands needed to start Kafka on Linux:
      • zookeeper-server-start.sh ../config/zookeeper.properties
      • kafka-server-start.sh ../config/server.properties
      • kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic "Kafka Topic name"
  • Use the Kafka topic created in the previous step
  • Navigate to the "rts" folder and run the command "mvn clean install -DskipTests". This will build the code.
  • Navigate to scraptool/target
  • Run the command "java -jar scraptool-1.0-SNAPSHOT-standalone.jar -t "Kafka Topic name" -c "complete path of config directory""

Authors:

  • Naveen Rudrappa
