DuckDB vs Polars

An unofficial benchmark of the performance difference between DuckDB and Polars.

Data

The 2021 Yellow Taxi Trip dataset, which contains 30M rows and 18 columns. It is about 3GB on disk.

Method

The benchmark uses the following operations:

  • Reading a csv file
  • Simple aggregations (sum, mean, min, max)
  • Groupby aggregations
  • Window functions
  • Joins
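
As a rough illustration of what one query per operation could look like on the DuckDB side, here is a minimal sketch. The SQL text, the column names (`VendorID`, `total_amount`), and the helper names `build_queries` and `run_all` are assumptions made for this example; they are not the repository's actual benchmark queries.

```python
# Sketch only: the SQL and the column names (VendorID, total_amount) are
# assumptions for illustration, not the repository's actual benchmark queries.
CSV_PATH = "data/2021_Yellow_Taxi_Trip_Data.csv"

QUERIES = {
    # Reading a csv file
    "read_csv": "SELECT * FROM read_csv_auto('{path}')",
    # Simple aggregations (sum, mean, min, max)
    "simple_agg": (
        "SELECT sum(total_amount), avg(total_amount), "
        "min(total_amount), max(total_amount) "
        "FROM read_csv_auto('{path}')"
    ),
    # Groupby aggregations
    "groupby_agg": (
        "SELECT VendorID, sum(total_amount), avg(total_amount) "
        "FROM read_csv_auto('{path}') GROUP BY VendorID"
    ),
    # Window functions
    "window": (
        "SELECT avg(total_amount) OVER (PARTITION BY VendorID) "
        "FROM read_csv_auto('{path}')"
    ),
    # Joins (the raw rows joined to a per-vendor aggregate)
    "join": (
        "SELECT t.VendorID, t.total_amount, a.avg_amount "
        "FROM read_csv_auto('{path}') AS t "
        "JOIN (SELECT VendorID, avg(total_amount) AS avg_amount "
        "FROM read_csv_auto('{path}') GROUP BY VendorID) AS a "
        "ON t.VendorID = a.VendorID"
    ),
}


def build_queries(path=CSV_PATH):
    """Fill the csv path into every query template."""
    return {name: sql.format(path=path) for name, sql in QUERIES.items()}


def run_all(path=CSV_PATH):
    """Run every query with DuckDB and materialize each result as Arrow."""
    import duckdb  # third-party; imported lazily so the templates load without it

    return {name: duckdb.sql(sql).arrow() for name, sql in build_queries(path).items()}
```

A Polars counterpart would express the same five operations with its lazy API and finish each one with `.collect()`.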

Result

I ran the benchmark on a 2021 Apple MacBook Pro with an M1 Max chip (10-core CPU), 64GB RAM, and a 1TB SSD.

[Figure: benchmark output]

How to Run This Benchmark on Your Own

  1. Download the csv file at: 2021 Yellow Taxi Trip.
  2. Create a data folder at the top level of the repo and place the csv file in it. The path to the file should be: data/2021_Yellow_Taxi_Trip_Data.csv. If you name it differently, you'll need to adjust the file path in the Python script(s).
  3. Create and activate a virtual environment.

    python -m venv env
    source env/bin/activate

  4. Install dependencies.

    pip install -r requirements.txt

     Or

    pip install duckdb polars pyarrow pytest

  5. Run the benchmark.

    python duckdb_vs_polars

  6. Optional: Run the following command in a terminal to run the unit tests.

    pytest
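
Before running the benchmark, it can help to confirm that the csv file sits where the scripts expect it. The following is a small sketch of such a check; the helper name `check_dataset` is hypothetical and is not part of the repository.

```python
from pathlib import Path

# Location described in the steps above, relative to the repo root.
EXPECTED_CSV = Path("data") / "2021_Yellow_Taxi_Trip_Data.csv"


def check_dataset(repo_root="."):
    """Return the csv path if it exists, otherwise raise FileNotFoundError."""
    csv_path = Path(repo_root) / EXPECTED_CSV
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"Expected the dataset at {csv_path}. Download it and place it there, "
            "or adjust the file path in the Python script(s)."
        )
    return csv_path
```

Calling `check_dataset()` from the top level of the repo either returns the path or fails with a message that points back to the setup steps.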

Notes/Limitations

  • All the queries used for the benchmark were written by Yuki (repo owner). If you think they can be improved or want to add other queries to the benchmark, please feel free to write your own or make a pull request.
  • Benchmarking DuckDB queries is tricky. Result-collecting methods such as .arrow(), .pl(), .df(), and .fetchall() ensure that the full query gets executed, but they also dilute the benchmark because work outside the core query engine (converting the result into another format) is mixed into the measured time.
    • .arrow() is used to materialize the query results for the benchmark. It was the fastest of .arrow(), .pl(), .df(), and .fetchall() (listed here from fastest to slowest for the benchmark queries).
    • One could argue for using .execute() instead, but it might not reflect the full execution time because the final pipeline won't be executed until a result-collecting method is called. Refer to the discussion on the DuckDB Discord on this topic.
    • Polars has the .collect() method that materializes a full dataframe.
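
To make the point about materialization concrete, here is a minimal timing sketch. It times a callable that must return a fully materialized result, so the measured time covers the whole query. The helper name `time_query`, the use of the median, and the repeat count are assumptions for this example, not the repository's actual harness.

```python
import time
from statistics import median


def time_query(run, repeats=3):
    """Return the median wall-clock seconds over `repeats` calls of `run`.

    `run` should return a materialized result (for example an Arrow table or
    a Polars DataFrame) so that the full query is executed inside the timer.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        result = run()  # the query must finish before the clock is read again
        timings.append(time.perf_counter() - start)
        del result  # drop the result so it does not accumulate across repeats
    return median(timings)


# Hypothetical usage (requires duckdb and polars; not executed here):
#   import duckdb
#   import polars as pl
#   time_query(lambda: duckdb.sql(sql).arrow())
#   time_query(lambda: pl.scan_csv(path).select(pl.col("total_amount").sum()).collect())
```

Passing a lambda keeps the timer independent of either library: the DuckDB call ends in `.arrow()` and the Polars call ends in `.collect()`, so both sides are measured up to the point where the result exists in memory.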

Future Plans for This Benchmark

Although I don't have solid plans for this repo, I plan to rerun this benchmark periodically, since both tools improve and release updates quickly. I may also add more queries to the benchmark down the road.

Contributors

stuffbyyuki
