pastebin-scraper

This is a multithreaded scraping script for Pastebin. It scrapes the main site for new pastes, downloads their raw content, and processes them according to a user-defined output format.

WHY?

Fun.

Installation

The usual dance.

pip install -r requirements.txt

Define all required settings in settings.ini. If you choose a database output, make sure the respective connector is installed. At the moment, MySQL (via pymysql) and SQLite (via the sqlite3 module built into Python 3) are supported.

Also note that the file output creates a subdirectory named output and dumps every paste into it as a separate file.
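To illustrate the file output described above, here is a minimal sketch of writing each paste to its own file in a subdirectory. The function name write_paste, the .txt extension and the file name sanitizing are assumptions made for this example, not the script's actual code.

```python
from pathlib import Path


def write_paste(paste_id: str, content: str, out_dir: str = "output",
                encoding: str = "utf-8") -> Path:
    """Write one paste to its own file and return the file path."""
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)  # create the subdirectory on first use
    # Hypothetical policy: keep only characters that are safe in a file name.
    safe_name = "".join(c for c in paste_id if c.isalnum() or c in "-_") or "unnamed"
    path = directory / f"{safe_name}.txt"  # the .txt extension is an assumption
    path.write_text(content, encoding=encoding)
    return path
```

Calling write_paste("abc123", "some text") would then create output/abc123.txt relative to the current working directory.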

Settings

ini is a highly underrated file format. Here are some definitions of what the settings parameters actually do.

GENERAL

  • PasteLimit: Stop after having scraped n pastes. Set to 0 for indefinite scraping
  • PBLink: URL to Pastebin or another equivalent site
  • DownloadWorkers: Number of workers that download the raw paste content and further process it
  • NewPasteCheckInterval: Time to wait before checking the main site for new pastes again
  • IPBlockedWaitTime: Time to wait until checking the main site again after the scraper's IP has been blocked
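As a rough sketch of how these settings could be read and used, the example below parses a GENERAL section with the standard configparser module and feeds paste IDs to a pool of worker threads. The section name, all example values, the time units and the helper names load_general and run_workers are assumptions for illustration only. The download callable stands in for whatever actually fetches a paste.

```python
import configparser
import queue
import threading

# Hypothetical example values; the real settings.ini may differ,
# and the README does not state the unit of the two wait times.
EXAMPLE_INI = """
[GENERAL]
PasteLimit = 3
PBLink = https://pastebin.com
DownloadWorkers = 2
NewPasteCheckInterval = 60
IPBlockedWaitTime = 3600
"""


def load_general(ini_text: str) -> dict:
    """Parse the GENERAL section into typed values."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    general = parser["GENERAL"]
    return {
        "paste_limit": general.getint("PasteLimit"),
        "pb_link": general.get("PBLink"),
        "download_workers": general.getint("DownloadWorkers"),
        "check_interval": general.getint("NewPasteCheckInterval"),
        "blocked_wait": general.getint("IPBlockedWaitTime"),
    }


def run_workers(paste_ids, worker_count, download):
    """Process paste IDs with worker_count threads and return (id, data) pairs."""
    tasks = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            paste_id = tasks.get()
            if paste_id is None:  # sentinel: no more work for this worker
                break
            data = download(paste_id)
            with lock:  # protect the shared result list
                results.append((paste_id, data))

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for paste_id in paste_ids:
        tasks.put(paste_id)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return results
```

The order of the returned pairs depends on thread scheduling, so callers should not rely on it.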

LOGGING

  • RotationLog: Location of log file that contains debug output
  • MaxRotationSize: Size in bytes before another log file is created
  • RotationBackupCount: Maximum number of log files to keep
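These three settings line up with the parameters of RotatingFileHandler from Python's standard logging package. Whether the script uses that handler is an assumption; the sketch below only shows how such a mapping could look. The logger name and the log format are made up for the example.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_logger(log_path: str, max_bytes: int, backup_count: int) -> logging.Logger:
    """Create a debug logger that rotates its file once it exceeds max_bytes."""
    logger = logging.getLogger("pastebin-scraper-example")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    handler = RotatingFileHandler(
        log_path,                  # would come from RotationLog
        maxBytes=max_bytes,        # would come from MaxRotationSize
        backupCount=backup_count,  # would come from RotationBackupCount
        delay=True,                # open the file only when the first record is written
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

With the standard handler, backupCount counts the rotated files (scraper.log.1, scraper.log.2, ...) kept in addition to the active log file.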

STDOUT / FILE

  • Enable: Enable formatted stdout output of paste data
  • ContentDisplayLimit: Maximum number of characters to show before content is cut off (0 to display all)
  • ShowName: Display the paste name
  • ShowLang: Display the paste language
  • ShowLink: Display the complete paste link
  • ShowData: Display the raw paste content
  • DataEncoding: Encoding of the raw paste data
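The display switches above could be applied roughly as follows. This is a sketch under assumptions: the function name format_paste, the field labels and the "..." truncation marker are invented here, and the script's real output layout may look different.

```python
def format_paste(name: str, lang: str, link: str, raw: bytes, *,
                 show_name: bool = True, show_lang: bool = True,
                 show_link: bool = True, show_data: bool = True,
                 content_display_limit: int = 0,
                 data_encoding: str = "utf-8") -> str:
    """Build the text shown for one paste according to the display switches."""
    lines = []
    if show_name:
        lines.append(f"Name: {name}")
    if show_lang:
        lines.append(f"Lang: {lang}")
    if show_link:
        lines.append(f"Link: {link}")
    if show_data:
        # Decode the raw bytes; undecodable bytes are replaced instead of raising.
        text = raw.decode(data_encoding, errors="replace")
        # A limit of 0 means "display everything".
        if content_display_limit > 0 and len(text) > content_display_limit:
            text = text[:content_display_limit] + "..."  # marker is an assumption
        lines.append(text)
    return "\n".join(lines)
```

Keeping the switches as keyword-only arguments makes it hard to mix them up at the call site, since they are all booleans.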

MYSQL

  • Enable: Enable MySQL output
  • TableName: Main table name to insert data into
  • Host: MySQL server host
  • Port: MySQL server port
  • Username: MySQL server user
  • Password: User password
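A minimal sketch of turning these settings into pymysql connection arguments is shown below. The section name MYSQL, the example values and both helper names are assumptions. The list above names no database or schema setting, so the sketch leaves that out; a real connection would likely need one.

```python
import configparser

# Hypothetical example values; replace them with your own settings.
MYSQL_EXAMPLE_INI = """
[MYSQL]
Enable = yes
TableName = pastes
Host = localhost
Port = 3306
Username = scraper
Password = change-me
"""


def mysql_connect_kwargs(ini_text: str):
    """Return keyword arguments for pymysql.connect, or None if MySQL is disabled."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = parser["MYSQL"]
    if not section.getboolean("Enable"):
        return None
    return {
        "host": section.get("Host"),
        "port": section.getint("Port"),  # pymysql expects an integer port
        "user": section.get("Username"),
        "password": section.get("Password"),
    }


def connect(ini_text: str):
    """Open a connection; requires the pymysql package and a reachable server."""
    kwargs = mysql_connect_kwargs(ini_text)
    if kwargs is None:
        raise ValueError("MySQL output is disabled in the settings")
    import pymysql  # imported lazily so this module loads without the driver
    return pymysql.connect(**kwargs)
```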

SQLITE

  • Enable: Enable SQLite output
  • Filename: Filename the db should be saved as (usually ends with .db)
  • TableName: Main table name to insert data into
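For the SQLite output, something along these lines would work with the standard sqlite3 module. The column layout and the helper names open_paste_db and insert_paste are hypothetical, since the README does not describe the actual table schema.

```python
import re
import sqlite3


def open_paste_db(filename: str, table_name: str) -> sqlite3.Connection:
    """Open (or create) the database file and make sure the table exists."""
    # Table names cannot be bound as SQL parameters, so validate before interpolating.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table_name):
        raise ValueError(f"unsafe table name: {table_name!r}")
    conn = sqlite3.connect(filename)
    # Hypothetical schema for illustration only.
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table_name} ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "name TEXT, lang TEXT, link TEXT, data TEXT)"
    )
    return conn


def insert_paste(conn: sqlite3.Connection, table_name: str,
                 name: str, lang: str, link: str, data: str) -> None:
    """Insert one paste; table_name must be the name validated by open_paste_db."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            f"INSERT INTO {table_name} (name, lang, link, data) VALUES (?, ?, ?, ?)",
            (name, lang, link, data),
        )
```

Passing ":memory:" as the filename gives a throwaway in-memory database, which is handy for trying the sketch without creating a .db file.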

If you use this thing for some cool data analysis or even research, let me know if I can help!

Inspiration for this scraper was taken from here.
