asantucci / algo_data.table Goto Github PK

A place to work on documenting recent algorithmic improvements to data.table.

R 100.00%

algo_data.table's Introduction

algo_data.table

A place to work on documenting recent algorithmic improvements to data.table.

Motivated by a tweet by Marianna Foos referring to a closed Stack Overflow question. Matt Dowle then issued an RFH (request-for-help).

Tips on where to start: https://github.com/Rdatatable/data.table/wiki/Presentations

Brainstorming / no bad ideas / no ranking yet

truelength-clobber

find the commit in base R which initialized truelength to zero. That was instigated by Matt asking Simon Urbanek off list to make that change. Once that groundwork was in place, data.table could then start to rely on it given a dependency on that R release. (R's own misuse of truelength assumed to be positive sign only.)
counting in truelength with sign bit on
clobber now promoted to R but not on by default yet (another proposal needed to be accepted first).
proposals for improvement to reduce complexity/risk
benchmark is easy and compelling > 10x (which is why it made it to base R)

parallel ordering, finding the groups, high cardinality focus

factor vs character from parallelism's perspective

overcoming OpenMP's ordered clause non-parallel after ordered section in the GNU implementation

overcoming no #pragma omp cancel in older versions of OpenMP (to support them)

Literature review

The short-circuits from Terdiman and Herf, and how data.table applied them.

Benchmarks

1st and 2nd run times on big data (0.5GB, 5GB and 50GB), like https://h2oai.github.io/db-benchmark/.
Call-overhead for iterating many small queries could be an aspect to write about too.

... keep adding

algo_data.table's People

Contributors

Stargazers

Watchers

algo_data.table's Issues

Change repo name please

How about algo_data.table instead of parallel_data.table ?
e.g. the truelength-clobber isn't related to parallelism.

Structure of the document

I think it's too early to start the fork / PR workflow, so i'll take down and share a couple of notes and suggestion here. Hopefully it helps the discussion.

If looked at the original SO question and the discussion why it was closed in particular. Also based on what I could find about the original OP (Marianna) and what I could find about her background (calls herself a carpenter, not an eginneer), I guess this whole thing here is a chance to make data.table even more inclusive and accessible to a broader group of people.

So...

how about a bookdown project /w working title: Why is data.table so !#$@!! fast?
(that doesn't mean the hard core stuff can go to an article)
which are the things data.table is fast at?
I remember a discussion on twitter about how data.table got overlooked in articles on fast i/o despite the fact that fwrite / fread is the best out there. On the other hand it's obviously good at those typical aggregation / group by things. So maybe one approach to to structure this would be to break the algos down by their usage. Referencing across section is always
possible of course....

groupby optimization

explained once by @mattdowle, so providing here

Q: does GForce allocate mem for biggest group, then copy there values of a group, to aggregate, so it can benefit from being contiguous in memory and will be more cache efficient? if so, do we check if groups aren't sorted already? so we can avoid doing allocation and copy?

gforce (gsum) assigns to many group results at once; it doesn't gather the groups together. You're describing non-gforce (dogroup.c) which copies to the largest group. See the branch in dogroups.c which knows whether groups are already grouped: it swithes to a memcpy. The memcpy is very fast (contiguous, pre-fetch) so it's pretty good already. We must copy because R's DATAPTR is not a pointer we can repoint, it's an offset from SEXP.

Understand Base-R Sort (implemented by data.table)

And benchmark it!

No small data (all benchmarks with at least one gig of data, e.g. sampling integers, test using 1/2 # cores)
dbbenchmark (time first and second run only; path dependence means that the latter runs get benefit of having already loaded data in cache)
https://github.com/wch/r-source/blob/trunk/src/main/radixsort.c
See sort and sort.list.
Use rprof for profiling.

Link to .tex file consistent with JSM formatting expectations

Will use this for now:
amstat tex template