Giter Club home page Giter Club logo

million-customers's Introduction

Processing 1Million+ rows of csv via Node.js

This was an interesting problem to come across. Filtering 1M+ rows from a csv to sql seems like an architechtural flaw to begin with. That being said, provided csv file had 89 occurences of misplaced column. Probably a large dataset was copy-pasted repeatedly and forgot to put in a newline before pasting.

Problem Approach

I can only think of two ways to solve this problem.

  • Stream it! => use streams to pipe data in chunks, process it, store it.
  • All SQL => use LOAD DATA INFILE from MySQL to directly import it and postprocess into separate tables

I went with the first approach as second would introduce some major vulnerability if done through an webserver. A bash script inside myqsl server may work just fine.

I could have not used a webserver, but for a complete app included this. avoided using multer to upload as instruction sets did not mention anything explicitly about it.

The following are the execution time for the process

  • 87 seconds - no filter, insert all in one table
  • 93 seconds - with filter
  • 94 seconds - with filter and export to csv

Project Structure

This is a fastify application. It is known to be extremely performant and scalabale, beats express anyday.

  • app -> contains all encapsulated features
  • handlers => app logic only
  • services => business logic only
  • schemas => request validation & response serialization

Installation Steps

Local

Requirements:

  • MySQL >= 8
  • Node.js >= 18
  • NPM >= 9

create .env from .env.example file and replace with proper database values. Host should be 0.0.0.0 if you are using Docker. Then install the app.

npm install

once you have proper values for DB in env, run

npm run db:migrate

this will process migration and create the necessary tables. incase of testing docker, need to already have migrations in your db as we are using a distroless image.

Docker :

this is a production image build using distroless Node by Google, just barebones Kernel hotwired with Nodejs. Make sure your MySQL instance is on the same Virtual Network of your application instance. Otherwise you will not be able to run this app. I have included Kubernetes manifests file, so feel free to use that as well.

first build your image, give it a name and tag. Do Not Forget to keep the csv you want to process within data folder of app. Otherwise, It's not going to work.

docker build . -t millioncsv:v1

then use the built image to create a compute instance, it will keep running in the background

docker run -d -p 3000:3000 --name appinstance millioncsv:v1

Execution Steps

Please ensure you have the csv you want to process in the data folder on the root of your application. Once the installation is done, run npm run dev and your webserver will start. If you are using docker, just ensure you have your csv in data folder.

There is only one endpoint that takes the filename to look for in data folder and processes it. Here is the API Doc for it:

filename = file to look for in data folder

HTTP/1.1

HOST: localhost:3000

POST /v1/customer/parse?filename=<filename> 

i.e : localhost:3000/v1/customer/parse?filename=<filename>

request returns execution time in ms - miliseconds.

This will trigger the process. given csv will be parsed, processed, inserted to sql and simultaneously filtered data written to data/output

Note

if you get errors related to columns breaking, use a proper csv file. Or use VSCode and regex to filter it. the csv values are in data/output folder

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.