
subreddit-archiver

⚠ This project is archived.

This utility lets you save subreddits as SQLite databases. Partial and complete archives can later be updated with posts submitted since the archival, giving you a personal copy of a subreddit for safekeeping, analysis, etc.

This tool makes use of pushshift.io, a volunteer-run service. If you find this tool useful, consider donating to pushshift.io, which makes it possible.

Installation

Clone this repository

$ git clone https://github.com/arshadrr/subreddit-archiver

cd into the directory

$ cd subreddit-archiver

Install the package. Requires Python 3.7 or newer. It's recommended you install using pipx (though pip also works).

$ pipx install .

Usage

Once installed, the utility can be invoked as subreddit-archiver in your terminal. It comes with two commands: subreddit-archiver archive, which saves posts from the present backwards (all the way to the subreddit's oldest post if allowed to run long enough), and subreddit-archiver update, which saves posts newer than the most recent post in an existing archive.

The utility makes use of the Reddit API and takes API credentials through a configuration file, the path to which you pass as an argument (--credentials). Instructions on how to acquire these, and the format the configuration file should take, are described in the section Credentials.

The output of this utility is a SQLite database. For information on the structure and how to make use of what this program produces, see Schema.

$ subreddit-archiver
usage: subreddit-archiver [-h] [--version] {archive,update} ...

subreddit-archiver archive

$ subreddit-archiver archive -h
usage: subreddit-archiver archive [-h] [--batch-size BATCH_SIZE] --subreddit SUBREDDIT --file FILE --credentials CREDENTIALS

Archive a subreddit.

optional arguments:
  -h, --help            show this help message and exit
  --batch-size BATCH_SIZE
                        Number of posts to fetch from the Reddit API with each request. Defaults to 100.

required arguments:
  --subreddit SUBREDDIT
                        Name of subreddit to save. Optional if resuming archival.
  --file FILE           Location and name of the output SQLite database, e.g. ~/archives/mysubreddit.sqlite
  --credentials CREDENTIALS
                        File containing credentials to access Reddit API.

Saves posts and comments from a subreddit to a SQLite database. Archival can be
stopped (e.g. Ctrl-C) while in progress, safely and without data loss. To resume
archival, just point to the same output file. The output file keeps track of the
progress of archival.

Suppose you'd like to archive the subreddit /r/learnart to the file learnart.sqlite with API credentials stored in credentials.config:

$ subreddit-archiver archive --subreddit learnart --file learnart.sqlite --credentials credentials.config
Subreddit created on Tue Nov  9 05:18:57 2010
Saved 3 posts more. Covered 0.0% of subreddit lifespan

Archival can be stopped (e.g. using Ctrl-C). You can then resume by re-running the command with the same output file.

subreddit-archiver update

$ subreddit-archiver update -h
usage: subreddit-archiver update [-h] [--batch-size BATCH_SIZE] --file FILE --credentials CREDENTIALS

Update an existing archive of a subreddit.

optional arguments:
  -h, --help            show this help message and exit
  --batch-size BATCH_SIZE
                        Number of posts to fetch from the Reddit API with each request. Defaults to 100.

required arguments:
  --file FILE           Path to existing archive that should be updated.
  --credentials CREDENTIALS
                        File containing credentials to access Reddit API.

Archives created by this utility keep track of the newest post within them. This command saves posts (and the comments under them) newer than the newest post in the archive. Changes to older posts and comments will not be
picked up, nor will comments made after a post was saved to the archive.
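This newest-post bookkeeping is easy to inspect yourself. A minimal Python sketch, assuming only that the posts table stores Unix timestamps in a created_utc column (as the Schema section and the queries later in this document show); the archive file and post below are made up for illustration:

```python
import os
import sqlite3
import tempfile
import time

def newest_post_time(db_path):
    """Return the created_utc of the newest archived post (None if the table is empty)."""
    with sqlite3.connect(db_path) as conn:
        (newest,) = conn.execute("SELECT MAX(created_utc) FROM posts").fetchone()
    return newest

# Build a throwaway archive with a single post to demonstrate against.
db_path = os.path.join(tempfile.mkdtemp(), "example-archive.sqlite")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE posts (id TEXT PRIMARY KEY, title TEXT, created_utc REAL)")
conn.execute("INSERT INTO posts VALUES ('riba4x', 'Announce Engula 0.2!', 1615231416.0)")
conn.commit()
conn.close()

# time.ctime renders the timestamp in the local timezone, much like the
# "Newest post in archive is from ..." line the update command prints.
print(time.ctime(newest_post_time(db_path)))
```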

Suppose you used this tool to create an archive of the /r/learnart subreddit. A few days pass and new posts have been submitted which aren't part of your archive. To fetch posts submitted since the most recent post in your archive, use this command.

$ subreddit-archiver update --file learnart.sqlite --credentials credentials.config
Newest post in archive is from Mon Mar  8 19:34:26 2021
Saved 1 up to Mon Mar  8 19:53:36 2021
Completed updating

Credentials

To use the Reddit API, this application will need to be provided with API credentials from Reddit. Follow these instructions to get these:

  1. Visit https://www.reddit.com/prefs/apps and click the button 'are you a developer? create an app...' towards the end of the page.
  2. Choose the 'script' option in the list of radio buttons. Give the app a name and a redirect-url (anything will do, the values you enter don't really matter). Create the app.
  3. Copy the text that follows the label secret and keep hold of it. This is your client_secret.
  4. Beneath the text 'personal use script', you'll find a random string of letters. Copy this too, it is your client_id.
  5. Create a text file and paste the client_id and client_secret in the format below:
[DEFAULT]
client_id=<insert your client id here>
client_secret=<insert your client secret here>

This will be your credentials file. When using this utility, pass the location of this file to the --credentials option.
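The credentials file is standard INI syntax, so if you want to reuse it in your own scripts it can be read with Python's built-in configparser. A minimal sketch (the placeholder values are illustrative; this is not the utility's own loading code):

```python
import configparser

# Parse the same [DEFAULT] section layout shown above.
config = configparser.ConfigParser()
config.read_string(
    "[DEFAULT]\n"
    "client_id=my-client-id\n"
    "client_secret=my-client-secret\n"
)

print(config["DEFAULT"]["client_id"])      # my-client-id
print(config["DEFAULT"]["client_secret"])  # my-client-secret
```

To read from a file on disk instead, use config.read(path_to_credentials_file).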

Schema

You've saved a bunch of posts. Now what? This section describes the layout of the SQLite database this utility produces so that you can put it to use. You'll need either the SQLite command-line shell or SQLite bindings for your programming language, such as the sqlite3 module for Python.

The output database will contain three tables, archive_metadata, comments and posts:

$ sqlite3 learnart.sqlite
SQLite version 3.30.0 2019-10-04 15:03:17
Enter ".help" for usage hints.
sqlite> .tables
archive_metadata  comments          posts

  • archive_metadata: stores metadata about the archive. The specific fields stored are listed in the file states.py in the class DB.
  • posts: stores posts, one row per post. schema.sql contains inline comments explaining the columns that make up this table and how to use them.
  • comments: stores comments, one row per comment. As with the posts table, refer to schema.sql for more.
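As a quick illustration of querying the posts table from Python, here is a self-contained sketch using an in-memory database with made-up rows; the id, title, and created_utc columns match the queries shown elsewhere in this document, but schema.sql remains the authoritative reference:

```python
import sqlite3

# Stand-in for an archive file: an in-memory database with two fake posts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id TEXT PRIMARY KEY, title TEXT, created_utc REAL)")
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [
        ("abc123", "First post", 1289279937.0),
        ("def456", "Second post", 1289366337.0),
    ],
)

# Newest post first, as in the sqlite3 shell queries shown in this document.
rows = conn.execute(
    "SELECT id, title FROM posts ORDER BY created_utc DESC LIMIT 1"
).fetchall()
print(rows)  # [('def456', 'Second post')]
```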

subreddit-archiver's People

Contributors

arshadrr

Forkers

aarasmith

subreddit-archiver's Issues

Some posts are not downloaded

Right after completing an update for /r/rust, here's how the https://www.reddit.com/r/rust/new/ page looked in my browser:

(screenshot: Screenshot_20211217_110938)

The post Announce Engula 0.2! (id=riba4x) is indeed part of the archive:

sqlite3 rust.sqlite "SELECT id, title FROM posts ORDER BY created_utc DESC LIMIT 1;"
riba4x|Announce Engula 0.2!

but the previous one ([MEDIA] Basic rasterizer using compute shaders and wgpu, id=ri8vvn) is not:

sqlite3 rust.sqlite "SELECT id, title FROM posts WHERE id='ri8vvn'"

That's... not good at all :-(
I expect Subreddit-archiver to archive all posts. That's what I need.
At the very least, whenever some posts couldn't be downloaded for any reason, it should be reported.

Allow to archive a subreddit past a certain date

Heya,

let me start by expressing my gratitude for your efforts on this nice project!
I'm actually surprised I couldn't find any mention of it on Reddit... how about posting it to e.g. r/DataHoarder?

One of my daily chores is to follow a number of subreddits, say r/rust for instance, and I'm hoping this project will help me to do so more comfortably.

Taking r/rust as an example, this subreddit was created in 2010, while I'm only interested in posts newer than Rust 1.0 (in May 2015).

It would be great if the archive command allowed specifying a starting date, which would help in my situation.

Cheers!

update with new pushshift

Hi, I realized my archive hadn't been updating since the pushshift API changed, so I pulled your changes and went to update my archive. I immediately started getting rate limited with 429s. I realized what was happening: with the new changes, if you request with sort=desc and since=<newest_created_utc>, it returns the newest IDs created after the since parameter, and it returns them newest to oldest. So with a batch size of 100, it was returning the newest 100 posts since December and then setting newest_created_utc to the newest post on the subreddit. However, the since parameter is inclusive, so it then repeatedly fetched the single newest post and sent it to get_from_pushshift() over and over very rapidly, resulting in 429s.

So I changed the sort parameter to 'asc' when make_pushshift_url(after=True), but the data returned is then oldest to newest, so in the update_posts() function I needed to change the index from newest_post_utc = get_created_utc(posts[0]) to a negative index, newest_post_utc = get_created_utc(posts[-1]), to avoid incrementing by only one post at a time. Not sure if it will exit cleanly as it's still updating, but it's properly updating my archive now.
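The gist of the fix above is how the pagination cursor is advanced: with ascending sort the newest post sits at the end of each batch, so the cursor must come from posts[-1]. A self-contained sketch with dummy post dicts (illustrative only, not the project's actual code):

```python
def advance_cursor(batch, ascending=True):
    """Return the created_utc to use for the next `since` cursor.

    With ascending sort the newest post in the batch is last; taking
    batch[0] would re-fetch the same window forever, while batch[-1]
    moves the cursor forward.
    """
    newest = batch[-1] if ascending else batch[0]
    return newest["created_utc"]

batch = [{"created_utc": t} for t in (100, 200, 300)]  # oldest -> newest
assert advance_cursor(batch) == 300

# Because the `since` parameter is inclusive, nudge the cursor past the
# boundary post before requesting the next page to avoid refetching it.
next_since = advance_cursor(batch) + 1
print(next_since)  # 301
```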

Other than that, love the tool and thanks for maintaining it!

Completion rate not 100% after process reported as completed

Here's the output I got after the process completed:

subreddit-archiver archive --subreddit rust --file rust.sqlite --credentials credentials.txt 
Subreddit created on Thu Dec  2 22:27:18 2010
Saved 51630 posts more. Covered 90.6% of subreddit lifespan

Completed archiving

It doesn't make sense... what's going on here?

Issue with flatten_commentforest

Hi, I'm running the subreddit archiver on /r/combatfootage and getting an error after successfully archiving 26,299 posts.

command used: subreddit-archiver archive --subreddit combatfootage --file combatfootage.sqlite --credentials credentials.config

least_recent_saved_post_utc: 1559489863.0
subreddit_created_utc: 1347238229.0

traceback:
Traceback (most recent call last):
  File "C:\Users\XXXX\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\XXXX\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\XXXX\AppData\Local\Programs\Python\Python310\Scripts\subreddit-archiver.exe\__main__.py", line 7, in <module>
  File "C:\Users\XXXX\AppData\Local\Programs\Python\Python310\lib\site-packages\subreddit_archiver\main.py", line 61, in main
    archive(args.subreddit, args.file, args.batch_size, args.credentials)
  File "C:\Users\XXXX\AppData\Local\Programs\Python\Python310\lib\site-packages\subreddit_archiver\utils.py", line 4, in clean
    func(*args, **kwargs)
  File "C:\Users\XXXX\AppData\Local\Programs\Python\Python310\lib\site-packages\subreddit_archiver\main.py", line 32, in archive
    get_posts.archive_posts(reddit, db_connection, batch_size)
  File "C:\Users\XXXX\AppData\Local\Programs\Python\Python310\lib\site-packages\subreddit_archiver\get_posts.py", line 115, in archive_posts
    process_post_batch(posts, db_connection)
  File "C:\Users\XXXX\AppData\Local\Programs\Python\Python310\lib\site-packages\subreddit_archiver\get_posts.py", line 81, in process_post_batch
    serializer.flatten_commentforest(post.comments, comments)
  File "C:\Users\XXXX\AppData\Local\Programs\Python\Python310\lib\site-packages\subreddit_archiver\serializer.py", line 163, in flatten_commentforest
    flatten_commentforest(comment.replies, outlist)
  File "C:\Users\XXXX\AppData\Local\Programs\Python\Python310\lib\site-packages\subreddit_archiver\serializer.py", line 163, in flatten_commentforest
    flatten_commentforest(comment.replies, outlist)
  File "C:\Users\XXXX\AppData\Local\Programs\Python\Python310\lib\site-packages\subreddit_archiver\serializer.py", line 163, in flatten_commentforest
    flatten_commentforest(comment.replies, outlist)
  [Previous line repeated 8 more times]
AttributeError: 'MoreComments' object has no attribute 'replies'
Subreddit created on Sun Sep 9 20:50:29 2012
Preparing to archive or continue archival
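For context, the recursion above assumes every node in the comment tree has a .replies attribute, but PRAW comment forests can also contain MoreComments placeholder objects, which don't. One possible fix, sketched here with stand-in classes so it runs without PRAW (and without claiming to be the project's actual code), is to skip nodes that lack replies:

```python
class Comment:
    """Stand-in for a PRAW comment: has a body and nested replies."""
    def __init__(self, body, replies=()):
        self.body = body
        self.replies = list(replies)

class MoreComments:
    """Stand-in for praw.models.MoreComments: has no .replies attribute."""

def flatten_commentforest(forest, outlist):
    """Depth-first flatten of a comment tree, skipping MoreComments placeholders."""
    for node in forest:
        if not hasattr(node, "replies"):  # MoreComments placeholder: skip it
            continue
        outlist.append(node)
        flatten_commentforest(node.replies, outlist)

forest = [Comment("a", [Comment("b"), MoreComments()]), MoreComments()]
out = []
flatten_commentforest(forest, out)
print([c.body for c in out])  # ['a', 'b']
```

An alternative with real PRAW objects would be to call replace_more() on the comment forest before flattening, which resolves or removes the placeholders up front.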

Thanks!

Is this a pushshift issue?

File "/home/ubuntu/.local/lib/python3.7/site-packages/prawcore/sessions.py", line 266, in _request_with_retries
  raise self.STATUS_EXCEPTIONS[response.status_code](response)
prawcore.exceptions.ServerError: received 500 HTTP response

Media Content

Hi there,

First of all thank you so much for developing what might be the most polished and scalable reddit archiver I've ever seen! I just had more of a question than a problem, figured this is the best way to ask in case anyone else has the same query.

How does your tool handle media content (e.g. images, videos, etc)? Are there links recorded?

It would be fantastic to be able to download and store the media as well in a future release (just off the top of my head you could use folders with the post ID and a generated UUID for each piece of media, then reference that in the DB).

There are some subreddits that are "at risk" in need of archiving, at the moment we're just focusing on the text but if we could get media as well that would be amazing.

Thanks again for this fantastic tool
