Comments (12)
Pushshift data isn't normalized and sometimes has inconsistencies. Working on a fix.
from redarc.
This commit should fix it. Let me know if that works for you.
Perfect, thanks. Seems to be working, going to take a while to get everything imported mind you 🤣
Spoke too soon, same file failed differently toward the end...
Identifier inserted succesfully x9zhzi
====================
ERROR: A string literal cannot contain NUL (0x00) characters.
(also spelling typo in your code 😝 )
Working on a fix. I'm also considering removing the SQL rollback; it's not ideal to have to reinsert every submission/comment after an error, especially for very large dump files.
> it's not ideal to have to reinsert every submission/comment after an error especially for very large dump files
True. Once I'm able to get everything into the DB, it's going to take some tuning for it to be lean, but I've already provisioned all-NVMe storage and a terabyte of RAM for it, so we should be able to host a static historical version of what PS had available on its API before people started wanting things removed and Reddit nuked it!
There are going to be more edge cases in this data, I'm sure, so I'm very glad you're able to put in some time wrangling it. Get on Discord if you want to knock heads on this. Realizing who you are, I've been a fan of your previous works 👌
I completely revamped the scripts. It now tries to insert everything first and then logs the problematic rows.
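For anyone following along, here is a minimal sketch of the "insert everything, log the problematic rows" idea. It is not the actual redarc script: it uses Python's built-in sqlite3 as a stand-in database so it runs anywhere, and the `submissions` table, the `insert_rows` helper, and the row layout are hypothetical names made up for illustration.

```python
import io
import sqlite3


def insert_rows(conn, rows, log):
    """Insert each row on its own; write failures to `log` and keep going.

    Hypothetical helper: `conn` is a sqlite3 connection (stand-in for the
    real database) and `log` is any writable text file object.
    """
    inserted, failed = 0, 0
    for row in rows:
        try:
            conn.execute(
                "INSERT INTO submissions (id, title) VALUES (?, ?)",
                (row.get("id"), row.get("title")),
            )
            conn.commit()  # commit per row so one bad row cannot undo earlier ones
            inserted += 1
        except sqlite3.Error as exc:
            conn.rollback()  # discard only the failed statement
            log.write(f"{row.get('id')}\t{exc}\n")
            failed += 1
    return inserted, failed


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE submissions (id TEXT PRIMARY KEY, title TEXT NOT NULL)")
    log = io.StringIO()
    rows = [
        {"id": "a1", "title": "first"},
        {"id": "a1", "title": "duplicate id"},  # violates PRIMARY KEY
        {"id": "b2", "title": None},            # violates NOT NULL
        {"id": "c3", "title": "third"},
    ]
    print(insert_rows(conn, rows, log))
    print(log.getvalue())
```

Committing per row is the simplest way to show the idea; a real ingest would more likely commit in batches and only fall back to row-by-row handling when a batch fails.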
I also added .replace("\u0000", "") to handle null characters in user-submitted fields like selftext and title.
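A small sketch of how that cleanup might be applied to a record before insertion. The `strip_nul` helper and its default field list are assumptions for illustration, not the code from the commit; only the `.replace("\u0000", "")` call comes from the comment above.

```python
def strip_nul(record, fields=("title", "selftext")):
    """Return a copy of `record` with NUL characters removed from string fields.

    Hypothetical helper: non-string values (None, numbers) are left untouched.
    """
    cleaned = dict(record)
    for field in fields:
        value = cleaned.get(field)
        if isinstance(value, str):
            cleaned[field] = value.replace("\u0000", "")
    return cleaned


if __name__ == "__main__":
    raw = {"id": "a1", "title": "hello\u0000world", "selftext": None}
    print(strip_nul(raw))
```

Returning a copy keeps the original record intact, which is handy if the raw row still needs to be written to the error log.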
real 383m8.352s
Worked that time, but now let's work on speeding up ingest if possible 😋
It's the print statements that are making it slow. Btw, did you check the log file for any warnings/errors?
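If per-row printing really is the bottleneck, one common workaround is to report progress only every N rows. A rough sketch, where the `report_progress` helper and the interval of 10,000 are assumptions rather than anything in the redarc scripts:

```python
import sys


def report_progress(count, every=10_000, stream=sys.stderr):
    """Print a progress line only when `count` is a multiple of `every`.

    Hypothetical helper; returns True when a line was printed.
    """
    if count % every == 0:
        print(f"{count} rows processed", file=stream)
        return True
    return False


if __name__ == "__main__":
    for n in range(1, 25_001):
        report_progress(n)
```

How much time this saves depends on where the output goes, so it is worth timing a small dump both ways before relying on it.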