A place to work on documenting recent algorithmic improvements to data.table.
Motivated by a tweet by Marianna Foos referring to a closed Stack Overflow question. Matt Dowle then issued an RFH (request-for-help).
Tips on where to start: https://github.com/Rdatatable/data.table/wiki/Presentations
- find the commit in base R that initialized truelength to zero. That change was instigated by Matt asking Simon Urbanek off-list to make it. Once that groundwork was in place, data.table could start to rely on it, given a dependency on that R release. (R's own misuse of truelength is assumed to involve positive values only.)
- counting in truelength with the sign bit on (i.e. using negative values)
- clobber now promoted to R but not on by default yet (another proposal needed to be accepted first).
- proposals for improvement to reduce complexity/risk
- the benchmark is easy and compelling: a speedup of more than 10x (which is why it made it to base R)
- 1st and 2nd run times on big data (0.5GB, 5GB and 50GB), like https://h2oai.github.io/db-benchmark/.
- Call-overhead for iterating many small queries could be an aspect to write about too.
... keep adding
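
As a starting point for the two truelength bullets above, here is a minimal sketch in C of what "counting with the sign bit on" can look like. It is a toy model, not data.table's or R's actual source: `ToyString`, `count_in_truelength` and `reset_truelength` are hypothetical names, and the struct only stands in for a cached string object that carries a spare `truelength` field. The sketch assumes that identical strings share one object, that the field starts at zero, and that any other user of the field only ever stores positive values, so a negative value can safely mean "our running count".

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a cached string object; NOT R's real CHARSXP layout. */
typedef struct {
    const char *text;
    long truelength;  /* assumed zero-initialised; positive values belong to someone else */
} ToyString;

/* Count occurrences of each distinct string object in x[0..n-1].
   Distinct objects are written to uniq in order of first appearance and
   their number is returned. Afterwards -uniq[i]->truelength is the count. */
size_t count_in_truelength(ToyString **x, size_t n, ToyString **uniq) {
    size_t n_uniq = 0;
    for (size_t i = 0; i < n; i++) {
        ToyString *s = x[i];
        if (s->truelength < 0) {  /* seen before: the count is held as a negative number */
            s->truelength--;
            continue;
        }
        /* Simplifying assumption: nobody else is using the field right now.
           Real code would have to save and later restore a positive value. */
        assert(s->truelength == 0);
        s->truelength = -1;       /* first sighting */
        uniq[n_uniq++] = s;
    }
    return n_uniq;
}

/* Put the field back to zero so the next caller can rely on it again. */
void reset_truelength(ToyString **uniq, size_t n_uniq) {
    for (size_t i = 0; i < n_uniq; i++) {
        uniq[i]->truelength = 0;
    }
}
```

The point of the design is that counting needs no hash table: one pass over the input, one integer update per element, and one cheap reset pass over the distinct objects. It only works if the zero-initialisation guarantee described in the first bullet holds.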