RSS Pipe

An RSS feed importer built on Apache Flume that imports RSS feed items, filtered by date, into HBase. Tested on CDH 4.6.

Adding new sources

For now, feed sources are read from a local MongoDB instance with hardcoded connection parameters. I'm working on making this configurable.

  1. Create a MongoDB database named MediapipeDB.
  2. Create the collections FeedSource and ExtractionStatus.
  3. For each feed, save one document to each collection from the mongo shell, for example:

db.FeedSource.save({"country" : "India" , "publisherName" : "The Hindu" , "state" : "All India" , "url" : "http://www.thehindu.com/?service=rss"});
db.ExtractionStatus.save({"publisherName" : "The Hindu", "lastExtractedTs" : <DATETIME IN EEE MMM dd HH:mm:ss zzz yyyy> });
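The lastExtractedTs placeholder uses the Java date pattern EEE MMM dd HH:mm:ss zzz yyyy, which is the layout java.util.Date.toString() produces. Below is a minimal Python sketch for generating a value in that layout. It assumes the field is stored as a plain string and that UTC is an acceptable zone; neither is confirmed by this project. The helper name to_extraction_ts is hypothetical.

```python
from datetime import datetime, timezone

# English abbreviations are hardcoded so the output does not depend on the locale.
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def to_extraction_ts(dt):
    """Render a timezone-aware datetime as 'EEE MMM dd HH:mm:ss zzz yyyy' in UTC.

    Hypothetical helper: assumes lastExtractedTs is stored as a string.
    """
    if dt.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    dt = dt.astimezone(timezone.utc)  # normalise so the zone is always 'UTC'
    return "{} {} {:02d} {:02d}:{:02d}:{:02d} UTC {}".format(
        DAYS[dt.weekday()],      # weekday(): Monday == 0
        MONTHS[dt.month - 1],
        dt.day, dt.hour, dt.minute, dt.second,
        dt.year,
    )
```

For example, to_extraction_ts(datetime(2014, 1, 6, 10, 15, 30, tzinfo=timezone.utc)) returns "Mon Jan 06 10:15:30 UTC 2014". Setting the value to a date in the past should cause older items to be picked up on the next run, if the date filter works as described above.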

Flume Configuration

The Flume configuration is in conf/flume.conf.

RSSPipe.sinks.HBASE.serializer.columns holds the mapping between the components of an RSS feed and HBase column names. Omit a mapping entry if you do not want that component stored in the HTable. To rename a column, change the column name in its mapping entry. Each entry has the form <rss_feed_component>:<hbase_column_name>.
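To illustrate how such a mapping can be read, here is a minimal Python sketch that splits a columns value into component/column pairs. It assumes entries are separated by commas, which this README does not state; check conf/flume.conf for the actual separator. The function name parse_column_mapping and the column names in the usage note are hypothetical.

```python
def parse_column_mapping(value):
    """Parse 'component:column' entries into a dict.

    Assumption: entries are comma-separated. Only the
    'rss_feed_component:hbase_column_name' entry form comes from the README.
    """
    mapping = {}
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate trailing commas and blank entries
        component, sep, column = entry.partition(":")
        component, column = component.strip(), column.strip()
        if not sep or not component or not column:
            raise ValueError("malformed mapping entry: %r" % entry)
        mapping[component] = column
    return mapping
```

With the hypothetical value "feedItemTitle:title, feedItemLink:link", the function returns {"feedItemTitle": "title", "feedItemLink": "link"}. Components left out of the value never appear in the result, which mirrors leaving out an entry to skip storing that component.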

The following fields are extracted from each feed item:

timestampOfStorage : Time when the feed was extracted by flume
feedTitle : Title of the parent RSS feed
feedLink : Link of the parent RSS feed
feedDesc : Description of the parent RSS feed
feedLanguage : Language of the parent RSS feed
feedCopyRight : Copyright of the parent RSS feed
feedPubDate : Published date of the parent RSS feed
feedItemTitle : Title of the feed item
feedItemDescription : Description of the feed item
feedItemLink : Link to the feed item
feedItemAuthor : Author of the feed item
feedItemGuid : GUID of the feed item
feedItemPubDate : Published date of the feed item
fullText : Full HTML of the feed item
decodedPubTime : A decoded representation of the published date of the feed item
bestGuessRelevantText : Text held by the div element with the highest text density, extracted using the Readability algorithm. This is intended to be the most relevant piece of text on the page.

Contributors

dgkris
