Comments (5)
Yes, this functionality is currently broken - likely both in this version and in the TypeScript rewrite. Reddit comments and submissions tend to be edited or removed over a long enough window. Coupled with the fact that the official Reddit API is (or was) extremely slow for individual lookups, this is why PushShift was implemented as the sole solution for single targets.
People using this functionality generally do so because they have more saved posts than the official API will return (capped at 1000), so a typical CSV download will have many thousands of posts to scan. Frankly, the Reddit API is unsuitable for this task. Due to the harsh rate limiting of their API, and also because of its generally slow response times, processing a CSV directly through official means would take a significant amount of time. Even ignoring the API response times and skipping the actual download calls, which use additional API queries in some cases, the optimistic run time just to retrieve 1000 individual posts within API limits is 30+ minutes. This also ignores any old deleted or edited posts, whose data would be completely unrecoverable. In the rewritten TS version, PushShift functionality was mandatory in order to reliably build relationships between saved comments and their parents, in the event that the parent submission had been removed from the live site.
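To put rough numbers on the estimate above, here is a back-of-the-envelope sketch. The function name and the budget of 30 requests per minute are assumptions picked for illustration (the exact limit is not stated in this thread); the only point is that one lookup per post scales linearly with the size of the CSV.

```python
def min_runtime_minutes(n_posts: int, requests_per_minute: float) -> float:
    """Lower bound on run time when every post costs one API request.

    Ignores response latency and any extra download calls, so the real
    run time can only be longer than this.
    """
    if requests_per_minute <= 0:
        raise ValueError("requests_per_minute must be positive")
    return n_posts / requests_per_minute


# Assumed budget: 30 requests per minute (hypothetical, not from the thread).
ASSUMED_BUDGET = 30

for n_posts in (1000, 5000, 20000):
    minutes = min_runtime_minutes(n_posts, ASSUMED_BUDGET)
    print(f"{n_posts} posts: at least {minutes:.0f} minutes")
```

Under that assumed budget, 1000 posts already take more than half an hour, and a CSV with many thousands of posts quickly runs into hours before a single file is downloaded.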
This probably isn't the best place to discuss, but I may as well dump it here on the most recent issue caused by Reddit actions:
Bluntly, it's changes like this that have driven me away from supporting these Reddit-backed applications. They've been teasing for years now that they intend to restrict their API further and further, and it makes investing my energy into these projects seem like a tremendous waste. I've been directing my focus lately towards a more convenient, site-agnostic method of preserving media, which I'd rather push forward with than keep supporting a site that doesn't support its users in return.
Suffice it to say that I'm unlikely to expend much effort towards bringing these features back in the short term. I have very limited time to work on my passion projects these days, and I would prefer not to waste that time stepping into adversarial relationships with social media site developers.
If they get things sorted out with PushShift, then everything should start working again and I'll be more encouraged to move forward with completing the rewrite, which also heavily utilizes PS. If not... well, it will likely be impossible to reimplement the lost functionality to the level people expect from the application. The code to add a bandage fix exists - scattered around - within RMD already, and I'm very open to accepting Pull Requests, but I probably won't be the one implementing it. The fix would only get RMD limping along, and honestly that seems likely to only raise more complaints and issues. At this point I'll be keeping an eye out for future Reddit API developments, and should anything come up, I'll be happy to revisit this.
from redditdownloader.
Good news: it seems they sorted things out with PushShift, and it is coming back in the following month. The bad news is that:
"use of Pushshift will be limited to moderation use cases only."
"Though access to Pushshift data for research purposes is not available at this time, we are keen to explore possibilities that might allow us to provide researchers with access to datasets essential for their valuable social media research. We understand the significance of empowering the academic community, and we are dedicated to working with Reddit to develop frameworks that responsibly balance data access, data security, and user privacy."
from redditdownloader.
PushShift is currently broken, due to API restrictions that Reddit staff are implementing. As a result, I will be unable to support any further PushShift development until (and if) they work something out with Reddit.
from redditdownloader.
So the CSV download is effectively dead now? The --full_csv flag seems to imply it will bypass the need for PushShift but it fails in the same way. Is there an easy bypass?
from redditdownloader.
The psaw library seems to be pretty heavily integrated into the project. Not even direct URL download works without it.
Though that might be easier to hack: as far as I can tell, in that case Pushshift is only needed to get the metadata from a Reddit post to create an instance of the processing.wrappers.redditelement.RedditElement class. The unfortunate lack of type annotations in the project doesn't make it easy, though.
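As a rough illustration of where such a hack could start, the sketch below pulls the submission ID (and comment ID, if present) out of a direct Reddit link, which is the minimum needed to look the post up somewhere other than Pushshift. Everything here is an assumption for illustration: PostRef and parse_permalink are hypothetical names that do not exist in RMD, and the pattern assumes the usual /comments/<submission id>/<slug>/<comment id> permalink layout. How the result would be turned into a RedditElement is not shown, since that class's constructor is not described in this thread.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed permalink layout: .../comments/<submission id>/<slug>/<comment id>
_PERMALINK = re.compile(
    r"/comments/([a-z0-9]+)(?:/[^/]*/([a-z0-9]+))?", re.IGNORECASE
)


@dataclass
class PostRef:
    """Hypothetical container for the IDs found in a direct link."""
    submission_id: str
    comment_id: Optional[str] = None


def parse_permalink(url: str) -> Optional[PostRef]:
    """Return the IDs embedded in a Reddit permalink, or None if absent."""
    match = _PERMALINK.search(url)
    if match is None:
        return None
    return PostRef(submission_id=match.group(1), comment_id=match.group(2))


print(parse_permalink("https://www.reddit.com/r/pics/comments/abc123/some_title/"))
print(parse_permalink("https://www.reddit.com/r/pics/comments/abc123/some_title/def456/"))
```

With the ID in hand, the metadata could in principle come from the official API instead (for example through PRAW's reddit.submission(id=...)), at the cost of the rate limits discussed earlier in the thread and with no access to removed content.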
PS: @vincenzogianfelice I am appalled by the entitlement displayed in your comment. This software is provided entirely free of charge; the least you could do is be nice to the developer.
from redditdownloader.