feed-to-sqlite's Introduction

feed-to-sqlite

Download an RSS or Atom feed and save it to a SQLite database. This is meant to work well with datasette.

Installation

pip install feed-to-sqlite

CLI Usage

Let's grab the Atom feeds for items I've shared on NewsBlur and my Instapaper favorites, and save each to its own table.

feed-to-sqlite feeds.db http://chrisamico.newsblur.com/social/rss/35501/chrisamico https://www.instapaper.com/starred/rss/13475/qUh7yaOUGOSQeANThMyxXdYnho

This will use a SQLite database called feeds.db, creating it if necessary. By default, each feed gets its own table, named based on a slugified version of the feed's title.

To load all items from multiple feeds into a common (or pre-existing) table, pass a --table argument:

feed-to-sqlite feeds.db --table links <url> <url>

That will put all items in a table called links.

Each feed also creates an entry in a feeds table containing top-level metadata for each feed. Each item will have a foreign key to the originating feed. This is especially useful if combining feeds into a shared table.
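To see how the tables fit together, you can query the database directly. Below is a minimal sketch using Python's built-in sqlite3 module, assuming items were loaded into a links table; the column names used in the join (feeds.id, feeds.title, links.feed, links.title, links.link) are assumptions for illustration, so check your own schema first.

import sqlite3

conn = sqlite3.connect("feeds.db")

# Join items back to their feed metadata via the foreign key.
# Column names here are assumptions; inspect your schema if they differ.
query = """
SELECT feeds.title AS feed, links.title, links.link
FROM links
JOIN feeds ON links.feed = feeds.id
"""
for feed, title, link in conn.execute(query):
    print(feed, title, link)

conn.close()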

Python API

One function, ingest_feed, does most of the work here. The following will create a database called feeds.db and download my NewsBlur shared items into a new table called links.

from feed_to_sqlite import ingest_feed

url = "http://chrisamico.newsblur.com/social/rss/35501/chrisamico"

ingest_feed("feeds.db", url=url, table_name="links")

Transforming data on ingest

When working in Python directly, it's possible to pass in a function to transform rows before they're saved to the database.

The normalize argument to ingest_feed is a function that will be called on each feed item, useful for fixing links or doing additional work.

Its signature is normalize(table, entry, feed_details, client):

  • table is a SQLite table (from sqlite-utils)
  • entry is one feed item, as a dictionary
  • feed_details is a dictionary of top-level feed information
  • client is an instance of httpx.Client, which can be used for outgoing HTTP requests during normalization

That function should return a dictionary representing the row to be saved. Returning a falsey value for a given row will cause that row to be skipped.
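As a sketch of how this hook might be used (the entry keys link and title are assumptions that depend on the feed being ingested):

from feed_to_sqlite import ingest_feed

def normalize(table, entry, feed_details, client):
    # Skip entries that have no link; returning a falsey value
    # drops the row entirely.
    if not entry.get("link"):
        return None

    # Tidy up a field before saving. The "title" key is an
    # assumption about this particular feed's items.
    entry["title"] = (entry.get("title") or "").strip()
    return entry

url = "http://chrisamico.newsblur.com/social/rss/35501/chrisamico"
ingest_feed("feeds.db", url=url, table_name="links", normalize=normalize)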

Development

Tests use pytest. Run pytest tests/ to run the test suite.

feed-to-sqlite's People

Contributors

eyeseast, simonw

feed-to-sqlite's Issues

Poll command to update existing feeds

Grabbing something @simonw said on Slack:

I like the idea of a feeds table - for a couple of reasons beyond just having something to foreign key against
You could stash information about when each feed was last polled - maybe even store an ETag for more efficient polling against sites that support that
And... that way you could have a feeds-to-sqlite poll feeds.db command which you run on cron which then runs a new fetch for every feed in that table
Which you could accompany with a feeds-to-sqlite feeds.db https://.../feed.xml which adds a new URL to that table - and now you've implemented a full feed reader :)

So maybe write this poll command. It's a good idea.
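Nothing like this exists yet, but a rough sketch of the idea, assuming the feeds table stores each feed's URL in a url column (an assumption) and reusing ingest_feed:

from sqlite_utils import Database
from feed_to_sqlite import ingest_feed

def poll(db_path="feeds.db"):
    # Hypothetical poll loop: re-fetch every feed already recorded
    # in the feeds table. The "url" column name is an assumption.
    db = Database(db_path)
    for feed in db["feeds"].rows:
        ingest_feed(db_path, url=feed["url"])

poll()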

More test data

I'm testing with my NewsBlur and Instapaper feeds. Those are weird. Let's get more weird (and less weird) examples.

Fetch feeds in parallel

Is there an easy way to do this within the standard library? Don't want to add a giant dependency if I can avoid it. Haven't done much with asyncio.
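One standard-library option is concurrent.futures, fetching each feed in its own thread. A sketch only, not something the package does today:

from concurrent.futures import ThreadPoolExecutor
from feed_to_sqlite import ingest_feed

urls = [
    "http://chrisamico.newsblur.com/social/rss/35501/chrisamico",
    "https://www.instapaper.com/starred/rss/13475/qUh7yaOUGOSQeANThMyxXdYnho",
]

# Network-bound work suits threads, but concurrent writes to one
# SQLite file can hit locking; a real version might fetch in
# parallel and write serially.
with ThreadPoolExecutor(max_workers=4) as pool:
    for url in urls:
        pool.submit(ingest_feed, "feeds.db", url=url)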

Python 2.x support

Any chance this would work on Python 2.7? (Please don't freak out - I need it to run some old software.)

Alter tables on update?

Should probably pass alter=True in upsert statements so any changes in the feed or entry schema flow down to the actual data.
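For reference, this is what alter=True does in sqlite-utils: the upsert adds any missing columns instead of raising. The table, pk, and column names below are illustrative, not necessarily what feed-to-sqlite uses internally.

from sqlite_utils import Database

db = Database("feeds.db")

# Without alter=True, an unknown column like "new_field" would raise;
# with it, sqlite-utils adds the column automatically.
db["links"].upsert(
    {"id": "abc123", "title": "Example", "new_field": "only in some feeds"},
    pk="id",
    alter=True,
)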
