fstpackage / fsttable

An interface to fast on-disk data tables stored with the fst format

License: GNU Affero General Public License v3.0
In other words, if two fst tables represent the same subset of the same file, their proxy table state should be identical. (Due to internal self-references, the entire fsttable objects may not be identical.)
An example where this is currently violated is a selection, using i, where all rows are returned in their original order. I'm going to submit a patch to address that, but I think this principle is something to keep in mind as the interface is further extended and optimized.
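The invariant can be captured in a toy model. In this sketch, canonical_state is a hypothetical helper (not part of fsttable) that normalizes a row selection: selecting all rows in their original order should collapse to the same state as applying no filter at all.

```r
# Toy model of proxy-state normalization (canonical_state is hypothetical,
# not fsttable internals): a selection of all rows in original order is
# normalized to the "no filter" state.
canonical_state <- function(n_rows, row_selection = NULL) {
  if (is.null(row_selection) || identical(row_selection, seq_len(n_rows))) {
    return(list(rows = NULL))  # full table, original order: store no filter
  }
  list(rows = row_selection)   # genuine subset: store the selection
}

# same subset reached via two access paths: identical state
identical(canonical_state(10), canonical_state(10, 1:10))
```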
Hi @phillc73, you posted a question on the differences between disk.frame and fsttable. As you say, disk.frame is a very nice package for working with large datasets split into chunks of data (stored as fst files). With fsttable, instead of working with chunks, the idea is to split all operations per column, so only the columns required for a certain operation are in memory. In addition, I would like to use fsttable as a driver to develop grouping and ordering capabilities in fst. For grouping and ordering, it is necessary to be able to read datasets from disk in arbitrary row order. If that's possible, there is no need to rewrite data to rearrange a dataset vertically.

(Obviously, these features will be beneficial to disk.frame as well.)

So, while disk.frame limits RAM usage by splitting operations into vertical chunks, fsttable limits RAM usage by processing columns separately. For grouped operations, fsttable can use subsets of column data to further limit RAM usage. Both packages can store the results on disk again as new on-disk tables.

Hope that answers your question, thanks!
Now, the class name fst_table is used in the fst package, but that should be renamed to fst_frame to better reflect its purpose. The class name fsttable could then be used for the reference object in the fsttable package.
For fst to be able to use parallel methods (see #2) during loading of data from a file, they need to have a C++ implementation according to a predefined interface. That interface should be defined in fst. The question is whether it would be best to put these methods in a separate package (e.g. fstmethods).
The advantage would be that a separate package would separate concerns and could concentrate fully on developing new parallel methods. Such a package could experiment with using SIMD operations, for example (which I believe hasn't been done in a CRAN package yet). Parallel operations on vectors are perfect for SIMD.
Also, multiple packages could exist with specific functionality. Linear regression can be done in parallel, and the methods for that could exist in another user-defined package or be defined on the fly. With this setup, fsttable would be very modular and could be easily extended (without growing huge).
Suite of tests consisting of iterated simple row selections and permutations to make sure that the proxy_table updates correctly.
Along these lines:
dt <- data.table(x = 1:10)
path <- tempfile(fileext = ".fst")
fst::write_fst(dt, path)
ft <- fsttable::fst_table(path)
ft[2:6][3:2][1] # expect 4
(Sub-issue of #4)
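Such a suite needs an oracle for the expected result of chained selections. A plain index vector works as a reference model (apply_selections is a hypothetical helper for illustration, not fsttable code):

```r
# Reference model: iterated row selections compose like index chaining,
# so the expected outcome of ft[2:6][3:2][1] can be computed on indices.
apply_selections <- function(n, selections) {
  idx <- seq_len(n)
  for (s in selections) idx <- idx[s]  # each selection refines the view
  idx
}

apply_selections(10, list(2:6, 3:2, 1))  # expect 4, matching the example
```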
To keep a very low memory footprint, fsttable could use a range of operators and methods that are parallel implementations of their counterparts. For example, p_mult is the parallel implementation of *. When the user specifies these parallel methods in a call to a fsttable's interface, processing is done on multiple threads and during loading of the data.
This will increase the speed significantly, and for interactive use, only small amounts of data need to be loaded from file. For example, because fsttable understands the method p_mult, it knows that it only has to read the first few lines of a fst file to display the results for:

print(ft[, .(X, Y = p_mult(A, B))])
So printing this result interactively requires almost no memory. Also, when the full column Y has to be calculated, multiple threads can perform the operation.
For custom methods, this can't be done, because it's unknown whether they calculate per element or need the whole vector, so fsttable always needs to read all columns fully before calling the method. But aggregate methods built from parallel methods could still be used for parallel calculations.
The table object should detect when the file it points to is deleted. At a minimum, this should happen when a query arrives that requires reading from disk, so that the inevitable error is caught. This check could also happen whenever the fsttable is acted upon (without a disk read), and a warning could be thrown. I'm not sure if that is necessary.
Also, the fst file may be modified by other means, unknown to an existing fsttable object. There are a variety of ways to check whether the fsttable is up to date. A checksum is overkill. We could use the Date Modified provided by the OS, but I think that could be unreliable. Another option is to build a timestamp into the fst metadata. That timestamp could then be read directly, avoiding reliance on the OS. Interested to hear alternative approaches to this.
It would be nice to put the abstract data.table interface in its own separate code base.

That abstract interface requires some basic operations to do its work:

- ncol(), nrow(), dim(), etc.
- column_data(), filter(), slice(), group(), merge(), etc.

The fstproxy object would be an implementation of all of these required methods. When the interface can be made abstract enough, we could separate it and put it in its own package (e.g. data.table.interface) to be able to use it for other backends (SQL databases, for example).
A table proxy has a specific state. That state reflects the current table that it represents. But no actual data is read until necessary, so the state is a collection of row- and column transformations that provide a view on the underlying physical fst file.
For example, when the user does a row selection:

ft2 <- ft[Year > 2000]

a new fsttable object is created (ft2). That object contains a filter that reflects the result of Year > 2000 in binary format; no other data from the table is stored. The filter is computed in chunks, so it has no significant memory requirements beyond that of the filter itself (a number of bytes equal to the length of the vector divided by 8). The filter could be stored on disk.
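The chunked evaluation could be sketched in base R as follows (chunked_mask is a hypothetical helper; a real implementation would set bits in C++ while streaming the column from the fst file):

```r
# Evaluate a row filter chunk by chunk, then pack the result to one bit
# per row (via packBits), so peak memory stays bounded by the chunk size.
chunked_mask <- function(values, predicate, chunk_size = 4L) {
  n <- length(values)
  mask <- logical(n)
  for (start in seq(1L, n, by = chunk_size)) {
    end <- min(start + chunk_size - 1L, n)
    mask[start:end] <- predicate(values[start:end])
  }
  pad <- (8L - n %% 8L) %% 8L  # packBits needs a multiple of 8 bits
  packBits(c(mask, rep(FALSE, pad)), type = "raw")
}

year <- c(1998, 2001, 2003, 1999, 2005)
m <- chunked_mask(year, function(x) x > 2000)
length(m)  # 5 rows packed into a single byte
```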
When the user does:

ft3 <- ft2[, .(ColA, ColB = ColD * 100)]

again a new fsttable is created (ft3). That object holds the original filter and a column selection (ColA). It also holds a virtual column definition (ColB); no actual data is computed at this point. Because fsttable knows that * operates per element, it is enough to hold the expression, the primitive value (100) and the column reference (ColD). With that information, the column can be computed at a later time.
The proxy state thus holds, among other things:

- columns: the selected columns in the fsttable
- virtual columns: for constructed columns that use single-element operators, these are expressions that define the new column (no actual data stored)
To assure correct data.table-like behavior, the interface needs to be tested thoroughly. A good test suite would make the development of new features in fsttable a lot easier. Perhaps we could look at the data.table package and use a small subset of its tests.
But no actual data is calculated until required for displaying it on the screen or loading it into memory.
The general interface dt[i, j, by] should be implemented so it can be used to operate directly on fst files:

- i: subsetting without the use of aggregate functions could be done from fstlib (using parallel C++ code).
- j: the expression in j should be parsed. For aggregate (or unknown custom) functions, whole columns need to be read. For known operators, a subset can be read; just a sample is enough to print a result (in the same format data.table uses, with just the top and bottom lines).
- by: for grouping we can create a map of group ids that can be stored in the fsttable object. Any subsequent operation can be performed using the group map as a filter.
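The group map idea can be illustrated with a small base-R sketch (rows_in_group is a hypothetical helper): one integer id per row, reusable as a row filter for any subsequent grouped operation.

```r
# Build a group map (one integer id per row) and use it as a row filter.
g <- c("a", "b", "a", "c", "b")
group_map <- match(g, unique(g))       # 1 2 1 3 2

rows_in_group <- function(map, id) which(map == id)
rows_in_group(group_map, 1L)           # rows 1 and 3 belong to group "a"
```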
Delaying calculations can get complicated quite quickly. The best way to go for fsttable is probably to first implement the data.table interface using complete files, and then to delay calculations step by step to make a fsttable more interactive and much faster.
The fst package needs to create new methods for working with filters, groups and calculations. That could be done as private functions as a start. These methods should be used by the fsttable package as well as the fstplyr package.
Continuous integration
The fstproxy object contains the following information: an unsigned int or unsigned long long vector (depending on the size of the result) with the positions in the result vector of each selected element in the mask. This vector can be large: a 1 billion row selection will have a slice map of 4 GB (unsigned int). A 5 billion row selection will have a slice map of 40 GB (unsigned long long). So these maps are potentially too large for the user's memory.
A slice map that is too large for memory can be stored on disk in a single-column fst file with the ordering. When needed, the file can be read per chunk, not requiring extra memory.
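A base-R sketch of such per-chunk reads from an on-disk index map, using writeBin/readBin as a stand-in for a single-column fst file (read_chunk is a hypothetical helper):

```r
# Persist a slice map to disk and read back only the elements needed, so
# the full index vector never has to be held in memory at once.
path <- tempfile(fileext = ".idx")
slice_map <- c(10L, 7L, 3L, 42L, 15L, 2L)
writeBin(slice_map, path, size = 4L)   # 4 bytes per index

read_chunk <- function(path, from, n) {
  con <- file(path, "rb")
  on.exit(close(con))
  seek(con, (from - 1L) * 4L)          # jump straight to the chunk
  readBin(con, integer(), n = n, size = 4L)
}

read_chunk(path, from = 3L, n = 2L)    # elements 3 and 4: 3, 42
```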
If we decide to implement this, the index map file could be saved in a temporary directory or perhaps alongside the original fst file (though that might not be possible due to access restrictions).
Currently the datatableinterface inherits from data.table, and stores its table_proxy object in a data.table cell. This allows the table_proxy object to be updated in place. I believe this functionality should be moved into the table_proxy object because (1) it is not specific to the data.table interface and (2) any function intended to update the state of a table_proxy needs to take the entire datatableinterface object as an argument, which seems a bit unwieldy.
The new arrangement would allow:
table_proxy_update_state <- function(tbl_proxy, new_state) {
  tbl_proxy[, remote_table_state := new_state]
}
Or, assuming that the remote_table_state field is the only part of the table_proxy that ever needs to be updated, that field itself could inherit from data.table, which would allow functions along these lines:
table_proxy_select_cols <- function(tbl_proxy, selection) {
  tbl_proxy$remote_table_state[, colnames := colnames[selection]]
  tbl_proxy$remote_table_state[, ncol := length(selection)]
  tbl_proxy$remote_table_state[, coltypes := coltypes[selection]]
}
If this makes sense, I can have a go at implementing it.
I would like to rbind two fsttable objects, or a single fsttable with a data.frame. What would be the preferred method?
library(fsttable)
library(data.table)
ft1 <- fst_table('1.fst')
rbindlist(list(ft1, ft1[1:10]))
.table_proxy X Y
1: <tableproxy[2]> 0 0
2: <tableproxy[2]> 0 0
For creating a new column/updating, I tried
ft1[1:4, .(X)] * 4
X
1: 4
2: 8
3: 12
4: 16
If I update based on data.table methods, it results in an error:
new <- (ft1[1:4, .(X)] * 4)[[1]]
ft1[1:4, new := new]
Error in parse_j(j, tbl_proxy$remotetablestate$colnames, parent.frame()) :
j must be a list
Is there a preferred method for modifying/updating columns? I did read some previous issues here and here. I just wonder if there are any updates for that.
Thanks
PS: My objective is to update an already loaded fsttable object without converting to data.frame/data.table, add new rows, and write it back as a .fst file (after doing some join operations).
We will need a method csv_to_fst in the fst package (planned). By using that method under the hood, we can define a fsttable object from a csv file:
ft <- fst_table_from_csv("somebigfile.csv", columns = c("A", "B", "C"))
print(ft[, p_sum(A)])
This would calculate the sum of column A in a csv file that is possibly too large to read with currently existing methods. With this code snippet, the csv file would be streamed (in blocks) to a fst file first, and then a fsttable reference is returned.

(And a new file somebigfile.fst would also be created simultaneously.)
But it will not be evaluated like in data.table. That's because the datatableinterface object doesn't really contain any data, just references to on-disk data. So we can't actually evaluate the j expression within the list that contains it. Some example code:
parser <- function(jsub, parent_frame, table_columns) {
  result <- NULL
  if (is.call(jsub)) {
    print(paste0("method '", as.character(jsub[[1]]), "' is called:"))
    result <- vector("list", length(jsub))
    result[[1]] <- as.character(jsub[[1]])
    # some arguments
    if (length(jsub) > 1) {
      for (pos in 2:length(jsub)) {
        if (is.call(jsub[[pos]])) {
          # call registration here
          result[[pos]] <- parser(jsub[[pos]], parent_frame, table_columns)
        } else {
          is_symbol <- typeof(jsub[[pos]]) == "symbol"
          name <- as.character(jsub[[pos]])
          print(paste0("argument '", jsub[[pos]], "', exists in parent frame: ",
            exists(name, where = parent_frame), " is symbol: ", is_symbol,
            if (is_symbol) paste(" is column:", name %in% table_columns) else ""))
          result[[pos]] <- as.character(jsub[[pos]])
        }
      }
    }
    print(paste0("end method '", as.character(jsub[[1]]), "'"))
  }
  result
}

parse <- function(j, table_columns) {
  parser(substitute(j), parent.frame(), table_columns)
}

# call the parser with (simulated) known columns Z and E and a j-expression
parse(.(A = 5, B = 3 * C(7), C = f(r * E), D = g(2 * Q)), c("Z", "E"))
#> [1] "method '.' is called:"
#> [1] "argument '5', exists in parent frame: FALSE is symbol: FALSE"
#> [1] "method '*' is called:"
#> [1] "argument '3', exists in parent frame: FALSE is symbol: FALSE"
#> [1] "method 'C' is called:"
#> [1] "argument '7', exists in parent frame: FALSE is symbol: FALSE"
#> [1] "end method 'C'"
#> [1] "end method '*'"
#> [1] "method 'f' is called:"
#> [1] "method '*' is called:"
#> [1] "argument 'r', exists in parent frame: FALSE is symbol: TRUE is column: FALSE"
#> [1] "argument 'E', exists in parent frame: FALSE is symbol: TRUE is column: TRUE"
#> [1] "end method '*'"
#> [1] "end method 'f'"
#> [1] "method 'g' is called:"
#> [1] "method '*' is called:"
#> [1] "argument '2', exists in parent frame: FALSE is symbol: FALSE"
#> [1] "argument 'Q', exists in parent frame: FALSE is symbol: TRUE is column: FALSE"
#> [1] "end method '*'"
#> [1] "end method 'g'"
#> [1] "end method '.'"
So with a similar method, we can determine all the calls that are made in the j argument and decide whether we need to load a column.
To be able to process code like:
# identify fsttable with fst file
ft <- fsttable::fst_table("1.fst")
# add simulated column 'N'
ft[, N := E * 5 + B]
In this example, column N is added as a simulated (or perhaps better: virtual?) column to the table. That means that no data is generated yet; the new column is kept as a tree structure of known methods (* and +) and data (5).
To store that information, ft needs to be updated. To do that, a fsttable needs to update itself, which requires an internal self-reference (like a data.table object). Perhaps the relevant data in a fsttable can be encapsulated in a single-cell list-type data.table to start with (that list element can be updated in memory). Equivalent code:
# some object
obj <- list(Param1 = TRUE, Param2 = 1)
# store in a single cell data.table
x <- data.table::data.table(Data = list(obj))
# that creates a data.table column of type 'list'
typeof(x$Data)
#> [1] "list"
# example method for updating list element
update_obj <- function(x) {
obj_current <- x[1, Data[[1]]] # get obj
obj_current[["Param3"]] <- "value" # update obj
x[1, Data := list(list(obj_current))] # rewrite to x
}
# update in-place
update_obj(x)
# the element in x now points to the updated obj
x$Data[[1]]
#> Param1 Param2 Param3
#> 1: TRUE 1 value
In addition to the standard integer (or long long) vector of indices, a row selection could also be stored as a range. A range can be stored using very little data, while a vector of indices is very expensive.
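A sketch of that optimization (as_selection is a hypothetical helper): detect contiguous selections and store them as a constant-size (start, end) pair instead of a full index vector.

```r
# Store a contiguous selection as a range, anything else as raw indices.
as_selection <- function(idx) {
  if (length(idx) > 0L && all(diff(idx) == 1L)) {
    list(type = "range", start = idx[1L], end = idx[length(idx)])
  } else {
    list(type = "indices", idx = idx)
  }
}

as_selection(5:9)$type        # "range": two numbers instead of five
as_selection(c(2L, 9L))$type  # "indices": must keep the full vector
```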
As shortly discussed here, an rquery interface would be a nice third candidate as a wrapper around the table_proxy class.
In RStudio, list and data.frame objects have code completion: when you access an element, you get a nice popup showing the list element names. However, when you select a name from the menu, we get brackets around the column name, as you would expect for a list or data.frame.
If we want to get auto-completion on a fsttable object the data.table way, that object should have a data.table class and a fsttable class name:
# create a print generic
print.myclass <- function(dt) {
  print("This is an override")
}
# define a data.table object (26 rows, so ColBB recycles cleanly)
dt <- data.table::data.table(ColA = 1:26, ColBB = LETTERS)
# add a myclass class to the list of class names (at position 1)
class(dt) <- c("myclass", class(dt))
# check that the custom print method is called
print(dt)
#> [1] "This is an override"
The auto-completion will now work as intended.
It would be nice to use the data.table class name in the fsttable object. That would require that overrides exist in the fsttable package for every possible generic that is defined in the data.table package. Otherwise, the data.table generic would be called for that particular method...
remotes::install_github("fstpackage/fsttable")
*** arch - i386
Warning: S3 methods '$.fst_table', '[.fst_table', '[[.fst_table', 'as.data.frame.fst_table', 'as.list.fst_table', 'dim.fst_table', 'dimnames.fst_table', 'names.fst_table', 'print.fst_table', 'row.names.fst_table', 'str.fst_table'
were declared in NAMESPACE but not found
Error: package or namespace load failed for 'fsttable' in namespaceExport(ns, exports):
undefined exports: fst_table
Error: loading failed
Execution halted
Auto-completion of column names is simulated by adding data.table as a class to the datatableproxy object (see this code). When the datatableproxy object is modified in place (with the := operator), auto-completion will only follow when the object itself is also modified, requiring a self-reference.
library(fsttable)
nr_of_rows <- 5
x <- data.table::data.table(X = 1:nr_of_rows, Y = LETTERS[1 + (1:nr_of_rows) %% 26])
fst::write_fst(x, "1.fst")
ft <- fst_table("1.fst")
ft
ft[X==1] # this doesn't work
really like this package thanks!
Some operations might require temporary files. For example, a large slice map or a newly created column can be stored in an additional fst file.

When these potentially large temp files are stored in a temp directory, that might cause a problem for the user: they might not be cleaned very often, and large fst files will quickly fill up the user's disk.

Perhaps it would be nicer to organize the fsttables in a fststore (better name required :-)). The temp files could then be kept organized and cleaned more easily.
For example, we can have a fst_remote package that implements the fst format as a remote table. That structure could be easily modified to have the fst_remote package running on a different computer than the table proxy (which would be running on the client side).

In effect, that would create a remote database that serves data from fst files (accessible with a data.table interface). Therefore, it might be good design to keep the in-memory structures in the table proxy class as small as possible. For example, when a row mask is calculated from an i expression, that structure could be saved to disk instead of returned to memory. Subsequent operations that need the mask would need to read it from file before it can be used. This avoids sending data back and forth between the table proxy and the remote table implementation until absolutely necessary (which helps for speed when working over a network connection, for example). It would also be in line with the fsttable philosophy of keeping memory requirements as small as possible.

That filter could be very small if implemented as a binary vector (C++) with a mask bit set for each selected row. A 1 billion record table will have a 125 MB mask (8 bits per byte and each row is 1 bit, excluding NAs that is).
These two interfaces to the underlying fst file should use the same mechanism for keeping the minimum amount of data in memory. Therefore, the fst package should define methods to accommodate both, probably first as non-visible package methods.
After some tweaking, the design is now as follows (with fst as the storage backend). To connect everything together, the method fsttable (defined here) calls the table_proxy() constructor with a specific implementation of a remote_table. In this case that is our remote_table_fst implementation, defined here.

With this design, the data.table interface is completely separated from the remote_table implementation. Other remote_table implementations (other backends) could easily be added when they implement the generic functions. The data_table_interface class and generic functions can be refactored into a separate package. That package could be used by other backend packages to define a data.table interface for that specific backend, which they could do by implementing a custom version of the remote_table generics.
(Only relevant when we can do more than just printing the whole table :-))

By providing a helper C method, we can test for a range without creating intermediate vectors. A first (non-optimized) try:
Rcpp::cppFunction('
SEXP is_range(SEXP int_vec) {
  int vec_length = LENGTH(int_vec);
  if (vec_length == 0) return Rf_ScalarLogical(TRUE);  // avoid reading past an empty vector
  int* int_values = INTEGER(int_vec);
  int counter = int_values[0];
  for (int i = 1; i < vec_length; i++) {
    counter++;
    if (int_values[i] != counter) return Rf_ScalarLogical(FALSE);
  }
  return Rf_ScalarLogical(TRUE);
}
')
# a range
is_range(1:1e6)
#> [1] TRUE
# not a range
is_range(c(1:1e6, 2L))
#> [1] FALSE
# performance
int_vec <- 1:1e7
timings <- microbenchmark::microbenchmark(
is_range(int_vec)
)
# speed in billions of integers per second
length(int_vec) / median(timings$time)
#> [1] 1.746079
That's fast enough for practical purposes, I think. Perhaps the method above can be improved by chunking the vector and determining the sum of squares of the vector minus the expected vector (that would minimize the number of if statements and may improve CPU throughput).
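A base-R sketch of that chunked idea (is_range_chunked is a hypothetical helper; a tuned version would live in C++): per chunk, sum the squared differences against the expected arithmetic sequence, leaving one branch per chunk instead of one per element.

```r
# Range check with one comparison per chunk: a chunk matches the expected
# sequence iff the sum of squared differences is exactly zero.
is_range_chunked <- function(v, chunk_size = 1024L) {
  n <- length(v)
  if (n == 0L) return(TRUE)
  first <- as.numeric(v[1L])
  for (start in seq(1L, n, by = chunk_size)) {
    end <- min(start + chunk_size - 1L, n)
    expected <- first + (start:end) - 1
    if (sum((as.numeric(v[start:end]) - expected)^2) != 0) return(FALSE)
  }
  TRUE
}

is_range_chunked(1:100000)      # TRUE
is_range_chunked(c(1:10, 12L))  # FALSE
```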
A slice map would be implemented as an unsigned int or unsigned long long vector (depending on the size of the table). The length would be equal to the number of selected elements in the mask. This vector can be large: a 1 billion row selection will have a slice map of 4 GB (unsigned int). A 5 billion row selection will have a slice map of 40 GB (unsigned long long). So these maps are potentially too large for the user's memory and might require on-disk storage.
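These sizes, and the 125 MB mask mentioned earlier, follow directly from the per-row costs (a quick arithmetic check):

```r
# One bit per row for a mask; 4 or 8 bytes per row for a slice map.
mask_mb   <- (1e9 / 8) / 1e6  # MB for a 1 billion row bit mask
map_gb_32 <- (1e9 * 4) / 1e9  # GB slice map with unsigned int
map_gb_64 <- (5e9 * 8) / 1e9  # GB slice map with unsigned long long
c(mask_mb, map_gb_32, map_gb_64)  # 125 MB, 4 GB, 40 GB
```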
Ordering is expensive. But for printing purposes, where we need only 10 rows, at most 10 complete (16 kB) blocks have to be read. That's no problem, because that should be sub-millisecond reads for each column (so there is no need to read the whole column for tables with a stored slice map).
Seems like a great package for handling large datasets.
If a back-end can assume the row filter is ordered, it doesn't need to order the vector before starting a sequential read.
Working with parallel methods is not straightforward. Perhaps it's best to start with documentation early on and enforce updating it with every new parallel method added.
The fsttable package has two main tasks that are best separated into two distinct components:

- Providing the data.table interface. This requires non-standard evaluation, where we compute on the R language itself. The code for this will be specific to the data.table interface.
- Interfacing with the fst file. This component should not use any language features or methods specific to the data.table package. And it should be made abstract enough that it can be re-used for the fstplyr package (or any other future interface, e.g. SQL).

Perhaps it's best to separate these components from the start. The second component will also make clear what the best API will be for the fst package to expose. In the future this component could go into a separate package that can act as a dependency for other interfaces (package fstref or fstproxy).
R 3.5.0 brings (an experimental implementation of) the ALTREP framework. It would be very useful to detect and use compact expressions like 1:N. For example, when a range is specified in i, detecting an ALTREP vector means that we don't have to run a loop to detect ranges anymore (#35). (See the ALTREP draft vignette for details.)
Also, ALTREP provides a mechanism for delayed loading of the actual vector data until it's required. Perhaps that can be used to return vectors in compact format:
# returns a compact object without actual data
x <- ft[10:20, X]
# regular methods can use the object
max(x)