rss-lambda

Monitor your favourite blogs through RSS and get a notification whenever a new blogpost is published. New blogposts are stored in DynamoDB and (optionally) sent to your e-mail address using SES. The Step Functions state machine that retrieves the blogs runs every 15 minutes by default. Running the solution should cost less than $3 per month; the cost is mostly driven by the polling frequency of the state machine.
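
As a rough illustration of why polling frequency dominates the cost, the sketch below counts how many scheduled executions different intervals produce in a 30-day month. It only counts runs and does not estimate a price; the function name `runs_per_month` is made up for this example.

```python
MINUTES_PER_DAY = 24 * 60


def runs_per_month(interval_minutes: int, days: int = 30) -> int:
    """Return the number of scheduled executions in a month of `days` days."""
    return (MINUTES_PER_DAY // interval_minutes) * days


if __name__ == "__main__":
    # Compare the default 15 minute interval with a faster and a slower schedule.
    for interval in (5, 15, 60):
        print(f"every {interval} min: {runs_per_month(interval)} runs per 30 days")
```

With the default 15 minute interval this works out to 2880 executions per 30 days, and every execution fans out to one feed check per configured feed.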

You can extend the blog scraper by adding your own RSS feeds to monitor. By default, various AWS-related feeds are included, but you can add any of your own feeds in the lambda-dynamo/feeds.txt file. In the DynamoDB table that is deployed, you can find various details about the blogposts as well as the text and HTML versions of the content. This can be helpful if you are building your own feed scraper or notification service. You can also use the included AppSync endpoint to read data from the table using GraphQL.
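
To give a feel for the kind of details a feed entry carries, here is a minimal sketch that reads the items of an RSS 2.0 document using only the Python standard library. It is not the parser used in this repo; the sample feed, the `parse_items` helper and the fallback from a missing `guid` to the `link` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; real feeds carry more elements per item.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>https://example.com/first</link>
      <guid>post-1</guid>
      <description>Short summary of the first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/second</link>
      <description>This item has no guid element.</description>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return one dict per <item> with title, link, description and guid."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        link = item.findtext("link", default="")
        items.append({
            "title": item.findtext("title", default=""),
            "link": link,
            "description": item.findtext("description", default=""),
            # Assumption: fall back to the link when the feed has no guid.
            "guid": item.findtext("guid") or link,
        })
    return items


if __name__ == "__main__":
    for entry in parse_items(SAMPLE_RSS):
        print(entry["guid"], "-", entry["title"])
```

Feeds that omit or rename these elements need their own fallbacks, which is why a custom feed may require small parser changes.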

Optionally, a JSON output for every blog category can be uploaded as a public S3 object. These files can be included in a single page app, such as the one at https://marek.rocks. In the future, the output will be compressed using 'brotli' or something similar to save on S3 storage and bandwidth costs.
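
The idea of one compressed JSON file per category can be sketched as follows. This is an illustration only: it uses `gzip` from the standard library as a stand-in for brotli, it does not upload anything to S3, and the `build_category_files` helper and the `<category>.json.gz` file names are hypothetical.

```python
import gzip
import json
from collections import defaultdict


def build_category_files(posts):
    """Group posts by category and return {file name: gzip-compressed JSON bytes}."""
    by_category = defaultdict(list)
    for post in posts:
        by_category[post["category"]].append(post)
    return {
        f"{category}.json.gz": gzip.compress(json.dumps(items).encode("utf-8"))
        for category, items in by_category.items()
    }


if __name__ == "__main__":
    # Made-up posts; the real items hold more fields.
    sample = [
        {"title": "A", "category": "compute"},
        {"title": "B", "category": "storage"},
        {"title": "C", "category": "compute"},
    ]
    for name, blob in build_category_files(sample).items():
        print(name, len(blob), "bytes")
```

A single page app would then fetch only the file for the category it displays and decompress it in the browser or rely on the content encoding set on the object.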

The feed retrieval feature uses a "readability" library which works similarly to the "Reader View" function of the Apple Safari browser. This makes it convenient to read the full text of a blogpost in your email client or on mobile. All of the links, images and text markup are preserved.
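
Next to the cleaned HTML, a plain text version of each post is stored as well. The sketch below shows one simple way to derive plain text from HTML with the standard library. It is not the "readability" library and it does not try to find the main article body; the `TextExtractor` class and `html_to_text` helper are names invented for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text and ignore the contents of script and style tags."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped tag.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML fragment as one space-separated string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


if __name__ == "__main__":
    print(html_to_text("<p>Hello <b>world</b></p><script>var x = 1;</script>"))
```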

Finally, a public AppSync endpoint can be deployed which retrieves the blogposts from DynamoDB. You can include the endpoint in a single page app to query blogpost content in real time from a (web) application.

The following fields are stored in DynamoDB per blog article. The large HTML and text outputs are omitted from the screenshot:

[Screenshot: fields stored in DynamoDB per blog article]

The following state machine is created to retrieve blog posts:

[Diagram: Step Functions state machine]

Installation

  • Make sure the AWS SAM CLI and Docker are installed and configured on your local machine.
  • If you want, you can edit the RSS feeds in 'lambda/feeds.txt'. These contain various AWS blogs I read by default.
  • Run 'make init' to deploy the stack for the first time. Once the 'samconfig.toml' file is present, you can use 'make deploy'.
  • If you choose to enable email notifications using SES, make sure that the SES sender and email address are preconfigured in your account. There is unfortunately no simple way to provision this using SAM.

You can now run the Step Function to trigger the blog refresh. The URL to find the Step Function is given as an output value of the CloudFormation stack.

Roadmap

  • Switch to Step Functions Express to save on costs. The Express option can be used today, but is more difficult to debug in case of Lambda failures.
  • Add AppSync endpoint for retrieval of blog posts through Amplify.
  • Decompose the "monolith" Lambda function into smaller functions. This will allow for easier retries and debugging of blogpost retrieval.
  • Implement Step Function for better coordination of individual functionality.
  • Add Lambda Extension to monitor network and CPU usage of the RSS function.
  • Optimize Lambda memory and timeout settings to lower cost.
  • Add "smart" text extraction of the full blogpost, so that the full content of a post can be stored in DynamoDB or sent through e-mail.
  • Add generation of JSON files with blogposts to S3 for easier inclusion in a single page app (as seen on https://marek.rocks ).
  • Add support for retrieval of non AWS blogposts using RSS.
  • Add DynamoDB Global Secondary Indexes for (partial) data retrieval based on GUID, timestamp and blog categories.

About the repo contents

The following list briefly describes the files and folders in the repo:

  • Run make init to deploy the stack to AWS. It downloads all of the Lambda dependencies, packages them, uploads them to S3 and deploys a CloudFormation stack using SAM. After the initial run, you can use make deploy for incremental changes to your SAM stack.
  • The template.yaml file is the SAM CloudFormation stack for the deployment. You do not need to edit this file directly.
  • The lambda-crawl folder contains the Lambda function that discovers the RSS feeds, checks whether files are present on S3 and determines how many days of data need to be retrieved. It is triggered at the start of the Step Function.
  • The lambda-getfeed folder contains the source code of the function that checks every feed individually. It is triggered in the map state of the Step Function.
  • The statemachine folder contains the definition of the Step Function in JSON.
  • The lambda-layer folder contains the requirements.txt file for the Lambda layer of the blog retrieval function.
  • The graphql folder contains the GraphQL schema and VTL resolvers for AppSync.
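
The division of work between the crawl step and the per-feed map step can be sketched locally as below. This is a simplified simulation under stated assumptions, not the code in lambda-crawl or lambda-getfeed: the function names, the shape of the work items and the in-memory `fetched_posts` and `known_guids` inputs are all hypothetical, and nothing here talks to AWS.

```python
def crawl(feeds):
    """Stand-in for the crawl step: emit one work item per configured feed."""
    return [{"blogsource": name, "url": url} for name, url in feeds.items()]


def get_feed(work_item, fetched_posts, known_guids):
    """Stand-in for the per-feed step: keep only posts whose GUID is not stored yet."""
    posts = fetched_posts.get(work_item["blogsource"], [])
    return [post for post in posts if post["guid"] not in known_guids]


def run_state_machine(feeds, fetched_posts, known_guids):
    """Run the crawl step once, then the per-feed step for every work item."""
    new_posts = []
    for work_item in crawl(feeds):  # plays the role of the map state
        new_posts.extend(get_feed(work_item, fetched_posts, known_guids))
    return new_posts


if __name__ == "__main__":
    # Made-up feeds and posts for illustration.
    feeds = {"blog-a": "https://example.com/a.xml", "blog-b": "https://example.com/b.xml"}
    fetched = {
        "blog-a": [{"guid": "a-1", "title": "Old post"}, {"guid": "a-2", "title": "New post"}],
        "blog-b": [{"guid": "b-1", "title": "Another new post"}],
    }
    for post in run_state_machine(feeds, fetched, known_guids={"a-1"}):
        print(post["guid"], post["title"])
```

Because every feed is handled as its own work item, a failure in one feed can be retried without repeating the others.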

License

MIT-0, please see the 'LICENSE' file for more info.

Contact

In case of questions or bugs, please raise an issue or reach out to @marekq!


rss-lambda's Issues

[ERROR] AlgoliaUnreachableHostException: Unreachable hosts

I tried running this and keep getting this / similar error notes. Thoughts?

Traceback (most recent call last):
  File "/opt/python/aws_lambda_powertools/logging/logger.py", line 347, in decorate
    return lambda_handler(event, context)
  File "/opt/python/aws_lambda_powertools/tracing/tracer.py", line 314, in decorate
    response = lambda_handler(event, context, **kwargs)
  File "/var/task/getfeed.py", line 493, in handler
    blogupdate, newblogs = get_feed(url, blogsource, guids, table, event)
  File "/opt/python/aws_lambda_powertools/tracing/tracer.py", line 631, in decorate
    response = method(*args, **kwargs)
  File "/var/task/getfeed.py", line 251, in get_feed
    put_dynamo(timest_post, title, cleantxt, rawhtml, description, link, blogsource, author, guid, tags, category, datestr_post, table, event)
  File "/opt/python/aws_lambda_powertools/tracing/tracer.py", line 631, in decorate
    response = method(*args, **kwargs)
  File "/var/task/getfeed.py", line 85, in put_dynamo
    index.save_objects([smallitem])
  File "/opt/python/algoliasearch/search_index.py", line 72, in save_objects
    response = self._chunk("updateObject", objects, request_options)
  File "/opt/python/algoliasearch/search_index.py", line 527, in _chunk
    raw_responses.append(self._raw_batch(requests, request_options))
  File "/opt/python/algoliasearch/search_index.py", line 534, in _raw_batch
    return self._transporter.write(
  File "/opt/python/algoliasearch/http/transporter.py", line 35, in write
    return self.request(verb, hosts, path, data, request_options, timeout)
  File "/opt/python/algoliasearch/http/transporter.py", line 72, in request
    return self.retry(hosts, request, relative_url)
  File "/opt/python/algoliasearch/http/transporter.py", line 94, in retry
    raise

Difficulty adjusting feeds list

Hi @marekq , really cool stack and thanks for the efforts on it.

Everything works when I use feeds from the AWS blog; however, if I try other RSS feeds, the Dynamo table does not fill up.

I don't see anything in the lambda.py that would prevent other feeds being parsed, especially since these feeds also have "title," "link," and "description." And it seems rss-vacuum.py is just to clean up the table, no fetching the feed and populating the table here.

Any suggestions on the best way to tweak the parser to use feeds from other sites? Bit of a python noob...

Thanks

Phil

Adapting this to build a Twitter bot

Hello @marekq, thanks for building this.
I want to use this to build a Twitter bot. I don't need the SES email feature, Algolia, etc. Can you tell me how I can adapt this to my use case? I mean, which parts should I care about?
If I want to build a Twitter bot that alerts me whenever there is a new post, do I need DynamoDB, or is simple caching enough?
I'd appreciate any tips. Thank you.

Looking for advice on non-standard feeds tags...

I have dived into the code a little and am looking at modifying it for non-blog RSS feeds. The industry I am focused on doesn't have a standard for feed tags. Is there an easier way to manage such diverse tag "sets", or does it take a lot of manual modification to the code depending on each feed's unique qualities?

Thank you in advance.

Software License?

Howdy.

I'm about to do something similar / monstrous--I'd like to start with your version as a baseline, but there's no license assigned to the repository. Any chance you could slap a license into place for this so I can use it without getting my wrist slapped?
