Giter Club home page Giter Club logo

algo_data.table's Introduction

algo_data.table

A place to work on documenting recent algorithmic improvements to data.table.

Motivated by a tweet by Marianna Foos referring to a closed Stack Overflow question. Matt Dowle then issued an RFH (request-for-help).

Tips on where to start: https://github.com/Rdatatable/data.table/wiki/Presentations

Brainstorming / no bad ideas / no ranking yet

truelength-clobber

  • find the commit in base R which initialized truelength to zero. That was instigated by Matt asking Simon Urbanek off list to make that change. Once that groundwork was in place, data.table could then start to rely on it given a dependency on that R release. (R's own misuse of truelength assumed to be positive sign only.)
  • counting in truelength with sign bit on
  • clobber now promoted to R but not on by default yet (another proposal needed to be accepted first).
  • proposals for improvement to reduce complexity/risk
  • benchmark is easy and compelling > 10x (which is why it made it to base R)

parallel ordering, finding the groups, high cardinality focus

factor vs character from parallelism's perspective

overcoming OpenMP's ordered clause non-parallel after ordered section in the GNU implementation

overcoming no #pragma omp cancel in older versions of OpenMP (to support them)

Literature review

The short-circuits from Terdiman and Herf, and how data.table applied them.

Benchmarks

  • 1st and 2nd run times on big data (0.5GB, 5GB and 50GB), like https://h2oai.github.io/db-benchmark/.
  • Call-overhead for iterating many small queries could be an aspect to write about too.

... keep adding

algo_data.table's People

Contributors

asantucci avatar mattdowle avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

algo_data.table's Issues

Change repo name please

How about algo_data.table instead of parallel_data.table ?
e.g. the truelength-clobber isn't related to parallelism.

Structure of the document

I think it's too early to start the fork / PR workflow, so i'll take down and share a couple of notes and suggestion here. Hopefully it helps the discussion.

If looked at the original SO question and the discussion why it was closed in particular. Also based on what I could find about the original OP (Marianna) and what I could find about her background (calls herself a carpenter, not an eginneer), I guess this whole thing here is a chance to make data.table even more inclusive and accessible to a broader group of people.

So...

  • how about a bookdown project /w working title: Why is data.table so !#$@!! fast?
    (that doesn't mean the hard core stuff can go to an article)
  • which are the things data.table is fast at?
    I remember a discussion on twitter about how data.table got overlooked in articles on fast i/o despite the fact that fwrite / fread is the best out there. On the other hand it's obviously good at those typical aggregation / group by things. So maybe one approach to to structure this would be to break the algos down by their usage. Referencing across section is always
    possible of course....

groupby optimization

explained once by @mattdowle, so providing here

Q: does GForce allocate mem for biggest group, then copy there values of a group, to aggregate, so it can benefit from being contiguous in memory and will be more cache efficient? if so, do we check if groups aren't sorted already? so we can avoid doing allocation and copy?

gforce (gsum) assigns to many group results at once; it doesn't gather the groups together. You're describing non-gforce (dogroup.c) which copies to the largest group. See the branch in dogroups.c which knows whether groups are already grouped: it swithes to a memcpy. The memcpy is very fast (contiguous, pre-fetch) so it's pretty good already. We must copy because R's DATAPTR is not a pointer we can repoint, it's an offset from SEXP.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.