
subreddit-archiver's Issues

Some posts are not downloaded

Right after completing an update for /r/rust, here's how the https://www.reddit.com/r/rust/new/ page looked in my browser:

[Screenshot omitted: Screenshot_20211217_110938]

The post Announce Engula 0.2! (id=riba4x) is indeed part of the archive:

sqlite3 rust.sqlite "SELECT id, title FROM posts ORDER BY created_utc DESC LIMIT 1;"
riba4x|Announce Engula 0.2!

but the previous one ([MEDIA] Basic rasterizer using compute shaders and wgpu, id=ri8vvn) is not:

sqlite3 rust.sqlite "SELECT id, title FROM posts WHERE id='ri8vvn'"

That's... not good at all :-(
I expect subreddit-archiver to archive all posts; that's what I need.
At the very least, when some posts can't be downloaded for any reason, this should be reported.
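As a workaround for spotting gaps like this, the check done by hand above can be scripted. Here is a minimal sketch (not part of subreddit-archiver) that assumes only what the queries above show: a `posts` table with an `id` column.

```python
import sqlite3

# Given the post IDs shown on the subreddit's /new page, report any that
# are missing from the archive. Assumes a `posts` table with an `id` column,
# as seen in the sqlite3 queries above.

def find_missing(conn, listing_ids):
    """Return the listing IDs that are absent from the posts table."""
    archived = {row[0] for row in conn.execute("SELECT id FROM posts")}
    return [pid for pid in listing_ids if pid not in archived]

# Demo against an in-memory stand-in for rust.sqlite:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id TEXT PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO posts VALUES ('riba4x', 'Announce Engula 0.2!')")

print(find_missing(conn, ["riba4x", "ri8vvn"]))  # ['ri8vvn'] was not archived
```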

Allow to archive a subreddit past a certain date

Heya,

let me start by expressing my gratitude for your work on this nice project!
I'm actually surprised I couldn't find any mention of it on Reddit... how about posting it to e.g. r/DataHoarder?

One of my daily chores is to follow a number of subreddits, say r/rust for instance, and I'm hoping this project will help me to do so more comfortably.

Taking r/rust as an example, this subreddit was created in 2010, while I'm only interested in posts newer than Rust 1.0 (May 2015).

It would be great if the archive command allowed specifying a starting date; that would help in my situation.
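A hypothetical `--after` option could boil down to a cutoff filter like the sketch below. Neither the flag nor `filter_posts` exists in subreddit-archiver today; the post dicts only mimic the `created_utc` field the tool already tracks.

```python
from datetime import datetime, timezone

# Hypothetical sketch of what an `--after` option could do: convert a date
# to an epoch cutoff and drop posts created before it. Not actual
# subreddit-archiver code.

def filter_posts(posts, after):
    """Keep only posts whose created_utc is at or past the cutoff date."""
    cutoff = datetime.fromisoformat(after).replace(tzinfo=timezone.utc).timestamp()
    return [p for p in posts if p["created_utc"] >= cutoff]

posts = [
    {"id": "old1", "created_utc": 1262304000.0},  # 2010, before Rust 1.0
    {"id": "new1", "created_utc": 1609459200.0},  # 2021
]
print(filter_posts(posts, "2015-05-15"))  # only the 2021 post survives
```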

Cheers!

Issue with flatten_commentforest

Hi, I'm running the subreddit archiver on /r/combatfootage and getting an error after successfully archiving 26,299 posts.

command used: subreddit-archiver archive --subreddit combatfootage --file combatfootage.sqlite --credentials credentials.config

least_recent_saved_post_utc: 1559489863.0
subreddit_created_utc: 1347238229.0

traceback:
Traceback (most recent call last):
File "C:\Users\XXXX\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\XXXX\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "C:\Users\XXXX\AppData\Local\Programs\Python\Python310\Scripts\subreddit-archiver.exe\__main__.py", line 7, in <module>
File "C:\Users\XXXX\AppData\Local\Programs\Python\Python310\lib\site-packages\subreddit_archiver\main.py", line 61, in main
archive(args.subreddit, args.file, args.batch_size, args.credentials)
File "C:\Users\XXXX\AppData\Local\Programs\Python\Python310\lib\site-packages\subreddit_archiver\utils.py", line 4, in clean
func(*args, **kwargs)
File "C:\Users\XXXX\AppData\Local\Programs\Python\Python310\lib\site-packages\subreddit_archiver\main.py", line 32, in archive
get_posts.archive_posts(reddit, db_connection, batch_size)
File "C:\Users\XXXX\AppData\Local\Programs\Python\Python310\lib\site-packages\subreddit_archiver\get_posts.py", line 115, in archive_posts
process_post_batch(posts, db_connection)
File "C:\Users\XXXX\AppData\Local\Programs\Python\Python310\lib\site-packages\subreddit_archiver\get_posts.py", line 81, in process_post_batch
serializer.flatten_commentforest(post.comments, comments)
File "C:\Users\XXXX\AppData\Local\Programs\Python\Python310\lib\site-packages\subreddit_archiver\serializer.py", line 163, in flatten_commentforest
flatten_commentforest(comment.replies, outlist)
File "C:\Users\XXXX\AppData\Local\Programs\Python\Python310\lib\site-packages\subreddit_archiver\serializer.py", line 163, in flatten_commentforest
flatten_commentforest(comment.replies, outlist)
File "C:\Users\XXXX\AppData\Local\Programs\Python\Python310\lib\site-packages\subreddit_archiver\serializer.py", line 163, in flatten_commentforest
flatten_commentforest(comment.replies, outlist)
[Previous line repeated 8 more times]
AttributeError: 'MoreComments' object has no attribute 'replies'
Subreddit created on Sun Sep 9 20:50:29 2012
Preparing to archive or continue archival
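The AttributeError at the bottom of the traceback points at PRAW's MoreComments placeholders: a CommentForest can contain them, and they have no `replies` attribute. One fix is to expand them before flattening with `post.comments.replace_more(limit=None)`; another is to guard the recursion, as in this sketch (stand-in classes, not the project's actual code):

```python
# Sketch of a guard for flatten_commentforest. The classes below are
# stand-ins for praw.models.Comment and praw.models.MoreComments so the
# example runs offline; the guard itself is the point.

class Comment:                      # stand-in for praw.models.Comment
    def __init__(self, body, replies=()):
        self.body, self.replies = body, list(replies)

class MoreComments:                 # stand-in for praw.models.MoreComments
    pass                            # note: no `replies` attribute

def flatten_commentforest(forest, outlist):
    for comment in forest:
        if isinstance(comment, MoreComments):
            continue                # or expand it instead of skipping
        outlist.append(comment)
        flatten_commentforest(comment.replies, outlist)

forest = [Comment("a", [Comment("b"), MoreComments()]), Comment("c")]
out = []
flatten_commentforest(forest, out)
print([c.body for c in out])  # ['a', 'b', 'c'] -- no AttributeError
```

Skipping the placeholder loses the collapsed comments; calling `replace_more(limit=None)` up front keeps them at the cost of extra API requests.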

Thanks!

Is this a pushshift issue?

File "/home/ubuntu/.local/lib/python3.7/site-packages/prawcore/sessions.py", line 266, in _request_with_retries
raise self.STATUS_EXCEPTIONS[response.status_code](response)
prawcore.exceptions.ServerError: received 500 HTTP response
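For what it's worth, the 500 here is raised by prawcore, i.e. it comes from Reddit's own API rather than pushshift, and such errors are usually transient. A generic retry-with-backoff wrapper (hypothetical, not part of subreddit-archiver) is one way to ride them out:

```python
import time

# Retry a callable with exponential backoff. In practice the except clause
# would catch prawcore.exceptions.ServerError; a broad Exception is used
# here only to keep the demo self-contained.

def with_retries(func, attempts=4, base_delay=1.0):
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise               # give up after the last attempt
            time.sleep(base_delay * 2 ** attempt)

# Demo: a call that fails twice with a fake 500, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("received 500 HTTP response")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # ok
```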

Media Content

Hi there,

First of all, thank you so much for developing what might be the most polished and scalable Reddit archiver I've ever seen! I have more of a question than a problem; I figured this is the best place to ask in case anyone else has the same query.

How does your tool handle media content (e.g. images, videos, etc.)? Are the links recorded?

It would be fantastic to be able to download and store the media as well in a future release (just off the top of my head you could use folders with the post ID and a generated UUID for each piece of media, then reference that in the DB).

There are some subreddits "at risk" and in need of archiving; at the moment we're just focusing on the text, but if we could get media as well, that would be amazing.
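The layout suggested above (a folder per post ID, a generated UUID per media file, the path referenced in the DB) could look roughly like this sketch; none of it exists in the tool today, and the names are made up for illustration:

```python
import uuid
from pathlib import Path

# Hypothetical media layout: media/<post_id>/<uuid><extension>.
# The returned path would then be stored in the database next to the post.

def media_path(root, post_id, extension):
    name = f"{uuid.uuid4()}{extension}"
    path = Path(root) / post_id / name
    path.parent.mkdir(parents=True, exist_ok=True)
    return path

p = media_path("media", "riba4x", ".jpg")
print(p)  # e.g. media/riba4x/3f2b....jpg
```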

Thanks again for this fantastic tool

update with new pushshift

Hi, I realized my archive hadn't been updating since the pushshift API changed, so I pulled your changes and went to update my archive. I immediately started getting rate-limited with 429s. What was happening is that with the new changes, if you request with sort=desc and since=<newest_created_utc>, the API returns the newest IDs created after the since parameter, and it returns them newest to oldest. So with a batch size of 100, it was returning the newest 100 posts since December and then setting newest_created_utc to the newest post on the subreddit. However, the since parameter is inclusive, so it then repeatedly fetched the single newest post and sent it to get_from_pushshift() over and over, very rapidly, resulting in 429s.

So I changed the sort parameter to 'asc' when make_pushshift_url(after=True) is called. The data is then returned oldest to newest, so in the update_posts() function I also had to change newest_post_utc = get_created_utc(posts[0]) to the negative index newest_post_utc = get_created_utc(posts[-1]), to avoid advancing by only one post at a time. I'm not sure whether it will exit cleanly as it's still updating, but it's properly updating my archive now.
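The change described above could be sketched like this (function and parameter names follow the issue text; the tool's real signatures may differ):

```python
# Sketch of the fix described above. With the new pushshift API, `since` is
# inclusive, so paging forward needs sort=asc, and the cursor must advance
# to the *last* (newest) post of each ascending batch.

def make_pushshift_url(subreddit, since, after=True, size=100):
    sort = "asc" if after else "desc"   # was "desc" in both cases
    return ("https://api.pushshift.io/reddit/search/submission"
            f"?subreddit={subreddit}&since={int(since)}&sort={sort}&size={size}")

def advance_cursor(posts):
    # With sort=asc, posts arrive oldest to newest, so the newest is last.
    return posts[-1]["created_utc"]      # was posts[0] when sort was desc

batch = [{"created_utc": 100.0}, {"created_utc": 200.0}, {"created_utc": 300.0}]
print(advance_cursor(batch))  # 300.0 -- the cursor actually moves forward
```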

Other than that, love the tool and thanks for maintaining it!

Completion rate not 100% after process reported as completed

Here's the output I got after the process completed:

subreddit-archiver archive --subreddit rust --file rust.sqlite --credentials credentials.txt 
Subreddit created on Thu Dec  2 22:27:18 2010
Saved 51630 posts more. Covered 90.6% of subreddit lifespan

Completed archiving

It doesn't make sense... what's going on here?
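One guess (not the tool's actual code) is that the percentage is derived from timestamps rather than a post count, so it can sit below 100% even when archiving finishes, e.g. if the data source simply has no posts for the earliest stretch of the subreddit's life:

```python
# Hypothetical reconstruction of a lifespan-coverage figure: the fraction
# of the subreddit's lifetime spanned by the saved posts. A guess for
# discussion, not subreddit-archiver's actual formula.

def lifespan_coverage(oldest_saved_utc, newest_saved_utc, subreddit_created_utc):
    lifespan = newest_saved_utc - subreddit_created_utc
    covered = newest_saved_utc - oldest_saved_utc
    return 100.0 * covered / lifespan

# Toy numbers: subreddit created at t=0, saved posts span t=94..1000,
# so the first ~9% of the lifespan has no saved posts.
print(round(lifespan_coverage(94.0, 1000.0, 0.0), 1))  # 90.6
```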
