Comments (12)

Yakabuff commented on June 3, 2024

Pushshift data isn't normalized, and there are sometimes inconsistencies. Working on a fix.

from redarc.

Yakabuff commented on June 3, 2024

5da45f2

This commit should fix it. Let me know if that works for you.

Yakabuff commented on June 3, 2024

@ohhdemgirls

ohhdemgirls commented on June 3, 2024

Perfect, thanks. Seems to be working. It's going to take a while to get everything imported, mind you 🤣

ohhdemgirls commented on June 3, 2024

Spoke too soon, same file failed differently toward the end...

Identifier inserted succesfully x9zhzi
====================
ERROR: A string literal cannot contain NUL (0x00) characters.

(also spelling typo in your code 😝 )

Yakabuff commented on June 3, 2024

Working on a fix. I'm also considering removing the SQL rollback; it's not ideal to have to reinsert every submission/comment after an error, especially for very large dump files.

ohhdemgirls commented on June 3, 2024

> it's not ideal to have to reinsert every submission/comment after an error especially for very large dump files

True. Once I'm able to get everything into the DB, it's going to take some tuning for it to be lean, but I've already provisioned all-NVMe storage and a terabyte of RAM for it. We should be able to host a static historical version of what PS had available on its API before people started wanting things removed and Reddit nuked it!

There are going to be more edge cases in this data, I'm sure, so I'm very glad you're able to put in some time wrangling it. Get on Discord if you want to knock heads on this. Realizing who you are, I've been a fan of your previous works 👌

Yakabuff commented on June 3, 2024

I completely revamped the scripts. It now tries to insert everything first and then logs the problematic rows.
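
For readers following along, the "insert everything first, then log the problematic rows" approach might look roughly like the sketch below. It is not the actual redarc script: it uses the standard-library `sqlite3` module as a stand-in for the real database, and the `submissions` table, the `ingest_lines` function, and the sample rows are all hypothetical. The point is only that each row gets its own transaction, so one bad row no longer forces a rollback of everything inserted before it.

```python
import json
import logging
import sqlite3

logger = logging.getLogger("ingest_sketch")


def ingest_lines(conn, lines):
    """Insert each JSON line independently; log failures instead of aborting.

    Returns a (inserted, failed) tuple of counts.
    """
    inserted = failed = 0
    for line_no, line in enumerate(lines, start=1):
        try:
            row = json.loads(line)
            # One transaction per row: a failure rolls back only this row,
            # not the rows that were already committed.
            with conn:
                conn.execute(
                    "INSERT INTO submissions (id, title) VALUES (?, ?)",
                    (row["id"], row["title"]),
                )
            inserted += 1
        except (json.JSONDecodeError, KeyError, sqlite3.Error) as exc:
            failed += 1
            logger.warning("line %d skipped: %r", line_no, exc)
    return inserted, failed


# Hypothetical schema and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE submissions (id TEXT PRIMARY KEY, title TEXT NOT NULL)")
sample = [
    '{"id": "a1", "title": "first"}',
    '{"id": "a2"}',                    # missing title, raises KeyError
    '{"id": "a1", "title": "dupe"}',   # duplicate key, raises IntegrityError
    '{"id": "a3", "title": "third"}',
]
print(ingest_lines(conn, sample))
```

Committing per row trades some throughput for resilience; committing in batches and retrying a failed batch row by row would be a possible middle ground.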

Yakabuff commented on June 3, 2024

I also added .replace("\u0000", "") to handle null characters in user-submitted fields like selftext and title.
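
As an illustration of that fix, here is a minimal sketch of stripping NUL characters before insertion. The helper names `strip_nul` and `clean_row`, and the sample row, are hypothetical and not taken from the redarc code; only the `.replace("\u0000", "")` call itself comes from the comment above.

```python
def strip_nul(value):
    """Remove NUL (U+0000) characters from strings; pass other values through."""
    if isinstance(value, str):
        return value.replace("\u0000", "")
    return value


def clean_row(row, fields=("title", "selftext")):
    """Return a copy of row with NULs stripped from the given text fields."""
    cleaned = dict(row)
    for name in fields:
        if name in cleaned:
            cleaned[name] = strip_nul(cleaned[name])
    return cleaned


# Hypothetical row; selftext is None to show non-string values are left alone.
row = {"id": "abc123", "title": "bad\u0000title", "selftext": None}
print(clean_row(row))
```

Working on a copy keeps the original parsed row intact, which can help when the problematic input also needs to be logged as it arrived.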

Yakabuff commented on June 3, 2024

@ohhdemgirls

ohhdemgirls commented on June 3, 2024

real 383m8.352s

Worked that time, but now let's work on speeding up ingest if possible 😋

Yakabuff commented on June 3, 2024

It's the print statements that are making it slow. Btw, did you check the log file for any warnings/errors?
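
If per-row output does turn out to be the bottleneck, one common option is to report progress only every N rows. The sketch below assumes that and nothing more about the real scripts: `process_rows`, `handle`, and `report_every` are hypothetical names, and no timing improvement is claimed here.

```python
import logging

logger = logging.getLogger("ingest_progress")


def process_rows(rows, handle, report_every=10_000):
    """Call handle(row) for each row, logging progress every report_every rows.

    Returns the total number of rows processed.
    """
    count = 0
    for count, row in enumerate(rows, start=1):
        handle(row)
        # Emit one progress line per batch instead of one line per row.
        if count % report_every == 0:
            logger.info("processed %d rows", count)
    logger.info("done: %d rows", count)
    return count


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Stand-in workload: the handler does nothing.
    process_rows(range(25_000), lambda row: None)
```

Routing progress through `logging` also makes it possible to silence it entirely by raising the log level, without touching the ingest loop.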
