fstpackage / fsttable

An interface to fast on-disk data tables stored with the fst format

License: GNU Affero General Public License v3.0
In other words, if two fst tables represent the same subset of the same file, their proxy table state should be identical. (Due to internal self-references, the entire fsttable objects may not be identical.)
An example where this is currently violated is a selection, using i, where all rows are returned in their original order. I'm going to submit a patch to address that, but I think this principle is something to keep in mind as the interface is further extended and optimized.
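The invariant can be captured in a toy model. In this sketch, canonical_state is a hypothetical helper (not part of fsttable) that normalizes a row selection: selecting all rows in their original order should collapse to the same state as applying no filter at all.

```r
# Toy model of proxy-state normalization (canonical_state is hypothetical,
# not fsttable internals): a selection of all rows in original order is
# normalized to the "no filter" state.
canonical_state <- function(n_rows, row_selection = NULL) {
  if (is.null(row_selection) || identical(row_selection, seq_len(n_rows))) {
    return(list(rows = NULL))  # full table, original order: store no filter
  }
  list(rows = row_selection)   # genuine subset: store the selection
}

# same subset reached via two access paths: identical state
identical(canonical_state(10), canonical_state(10, 1:10))
```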
Hi @phillc73, you posted a question on the differences between disk.frame and fsttable. As you say, disk.frame is a very nice package for working with large datasets split into chunks of data (stored as fst files). With fsttable, instead of working with chunks, the idea is to split all operations per column, so only the columns required for a certain operation are in memory. In addition, I would like to use fsttable as a driver to develop grouping and ordering capabilities in fst. For grouping and ordering, it is necessary to be able to read datasets from disk in arbitrary row order. If that's possible, there is no need to rewrite data to rearrange a dataset vertically.

(Obviously, these features will be beneficial to disk.frame as well.)

So, while disk.frame limits RAM usage by splitting operations into vertical chunks, fsttable limits RAM usage by processing columns separately. For grouped operations, fsttable can use subsets of column data to further limit RAM usage. Both packages can store the results on disk again as new on-disk tables.

Hope that answers your question, thanks!
Now, the class name fst_table is used in the fst package, but that should be renamed to fst_frame to better reflect its purpose. The class name fsttable could then be used for the reference object in the fsttable package.
For fst to be able to use parallel methods (see #2) during loading of data from a file, they need to have a C++ implementation according to a predefined interface. That interface should be defined in fst. The question is whether it would be best to put these methods in a separate package (e.g. fstmethods).
The advantage would be that a separate package would separate concerns and could concentrate fully on developing new parallel methods. Such a package could experiment with using SIMD operations, for example (which I believe hasn't been done in a CRAN package yet). Parallel operations on vectors are perfect for SIMD.
Also, multiple packages could exist with specific functionality. Linear regression can be done in parallel, and the methods for that could exist in another user-defined package or be defined on the fly. With this setup, fsttable would be very modular and could be easily extended (without growing huge).
Suite of tests consisting of iterated simple row selections and permutations to make sure that the proxy_table updates correctly.
Along these lines:
dt <- data.table(x = 1:10)
path <- tempfile(fileext = ".fst")
fst::write_fst(dt, path)
ft <- fsttable::fst_table(path)
ft[2:6][3:2][1] # expect 4
(Sub-issue of #4)
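Such a suite needs an oracle for the expected result of chained selections. A plain index vector works as a reference model (apply_selections is a hypothetical helper for illustration, not fsttable code):

```r
# Reference model: iterated row selections compose like index chaining,
# so the expected outcome of ft[2:6][3:2][1] can be computed on indices.
apply_selections <- function(n, selections) {
  idx <- seq_len(n)
  for (s in selections) idx <- idx[s]  # each selection refines the view
  idx
}

apply_selections(10, list(2:6, 3:2, 1))  # expect 4, matching the example
```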
To keep a very low memory footprint, fsttable could use a range of operators and methods that are parallel implementations of their counterparts. For example, p_mult is the parallel implementation of *. When the user specifies these parallel methods in a call to a fsttable's interface, processing is done on multiple threads and during loading of the data.
This will increase the speed significantly, and for interactive use, only small amounts of data need to be loaded from file. For example, because fsttable understands the method p_mult, it knows that it only has to read the first few lines of a fst file to display the results for:

print(ft[, .(X, Y = p_mult(A, B))])
So printing this result interactively requires almost no memory. Also, when the full column Y has to be calculated, multiple threads can perform the operation.
For custom methods, this can't be done, because it's unknown whether they calculate per element or need the whole vector, so fsttable always needs to read all columns fully before calling the method. But aggregate methods built from parallel methods could still be used for parallel calculations.
The table object should detect when the file it points to is deleted. At a minimum, this should happen when a query arrives that requires reading from disk, so that the inevitable error is caught. This check could also happen whenever the fsttable is acted upon (without a disk read), and a warning could be thrown. I'm not sure if that is necessary.
Also, the fst file may be modified by other means, unknown to an existing fsttable object. There are a variety of ways to check whether the fsttable is up to date. A checksum is overkill. We could use the Date Modified provided by the OS, but I think that could be unreliable. Another option is to build a timestamp into the fst metadata. That timestamp could then be read directly, avoiding reliance on the OS. Interested to hear alternative approaches to this.
It would be nice to put the abstract data.table interface in its own separate code base.

That abstract interface requires some basic operations to do its work:

- ncol(), nrow(), dim(), etc.
- column_data(), filter(), slice(), group(), merge(), etc.

The fstproxy object would be an implementation of all of these required methods. When the interface can be made abstract enough, we could separate it and put it in its own package (e.g. data.table.interface) to be able to use it for other backends (SQL databases, for example).
A table proxy has a specific state. That state reflects the current table that it represents. But no actual data is read until necessary, so the state is a collection of row- and column transformations that provide a view on the underlying physical fst file.
For example, when the user does a row selection:

ft2 <- ft[Year > 2000]

a new fsttable object is created (ft2). That object contains a filter that reflects the result of Year > 2000 in binary format; no other data from the table is stored. The filter is computed in chunks, so it has no significant memory requirements beyond that of the filter itself (a number of bytes equal to the length of the vector divided by 8). The filter could be stored on disk.
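The chunked evaluation could be sketched in base R as follows (chunked_mask is a hypothetical helper; a real implementation would set bits in C++ while streaming the column from the fst file):

```r
# Evaluate a row filter chunk by chunk, then pack the result to one bit
# per row (via packBits), so peak memory stays bounded by the chunk size.
chunked_mask <- function(values, predicate, chunk_size = 4L) {
  n <- length(values)
  mask <- logical(n)
  for (start in seq(1L, n, by = chunk_size)) {
    end <- min(start + chunk_size - 1L, n)
    mask[start:end] <- predicate(values[start:end])
  }
  pad <- (8L - n %% 8L) %% 8L  # packBits needs a multiple of 8 bits
  packBits(c(mask, rep(FALSE, pad)), type = "raw")
}

year <- c(1998, 2001, 2003, 1999, 2005)
m <- chunked_mask(year, function(x) x > 2000)
length(m)  # 5 rows packed into a single byte
```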
When the user does:

ft3 <- ft2[, .(ColA, ColB = ColD * 100)]

again a new fsttable is created (ft3). That object holds the original filter and a column selection (ColA). It also holds a virtual column definition (ColB); no actual data is computed at this point. Because fsttable knows that * operates per element, it is enough to hold the expression, the primitive value (100) and the column reference (ColD). With that information, the column can be computed at a later time.
The proxy state thus holds, among other things:

- columns: the selected columns in the fsttable
- virtual columns: for constructed columns that use single-element operators, these are expressions that define the new column (no actual data stored)
To assure correct data.table-like behavior, the interface needs to be tested thoroughly. A good test suite would make the development of new features in fsttable a lot easier. Perhaps we could look at the data.table package and use a small subset of its tests.
But no actual data is calculated until required for displaying it on the screen or loading it into memory.
The general interface dt[i, j, by] should be implemented so it can be used to operate directly on fst files:

- i: subsetting without the use of aggregate functions could be done from fstlib (using parallel C++ code).
- j: the expression in j should be parsed. For aggregate (or unknown custom) functions, whole columns need to be read. For known operators, a subset can be read; just a sample is enough to print a result (in the same format data.table uses, with just the top and bottom lines).
- by: for grouping we can create a map of group ids that can be stored in the fsttable object. Any subsequent operation can be performed using the group map as a filter.
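The group map idea can be illustrated with a small base-R sketch (rows_in_group is a hypothetical helper): one integer id per row, reusable as a row filter for any subsequent grouped operation.

```r
# Build a group map (one integer id per row) and use it as a row filter.
g <- c("a", "b", "a", "c", "b")
group_map <- match(g, unique(g))       # 1 2 1 3 2

rows_in_group <- function(map, id) which(map == id)
rows_in_group(group_map, 1L)           # rows 1 and 3 belong to group "a"
```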
Delaying calculations can get complicated quite quickly. The best way to go for fsttable is probably to first implement the data.table interface using complete files, and then to delay calculations step by step to make a fsttable more interactive and much faster.
The fst package needs to create new methods for working with filters, groups and calculations. That could be done as private functions as a start. These methods should be used by the fsttable package as well as the fstplyr package.
Continuous integration
The fstproxy object contains the following information: an unsigned int or unsigned long long vector (depending on the size of the result) with the positions in the result vector of each selected element in the mask. This vector can be large: a 1 billion row selection will have a slice map of 4 GB (unsigned int). A 5 billion row selection will have a slice map of 40 GB (unsigned long long). So these maps are potentially too large for the user's memory.
A slice map that is too large for memory can be stored on disk in a single-column fst file with the ordering. When needed, the file can be read per chunk, not requiring extra memory.
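A base-R sketch of such per-chunk reads from an on-disk index map, using writeBin/readBin as a stand-in for a single-column fst file (read_chunk is a hypothetical helper):

```r
# Persist a slice map to disk and read back only the elements needed, so
# the full index vector never has to be held in memory at once.
path <- tempfile(fileext = ".idx")
slice_map <- c(10L, 7L, 3L, 42L, 15L, 2L)
writeBin(slice_map, path, size = 4L)   # 4 bytes per index

read_chunk <- function(path, from, n) {
  con <- file(path, "rb")
  on.exit(close(con))
  seek(con, (from - 1L) * 4L)          # jump straight to the chunk
  readBin(con, integer(), n = n, size = 4L)
}

read_chunk(path, from = 3L, n = 2L)    # elements 3 and 4: 3, 42
```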
If we decide to implement this, the index map file could be saved in a temporary directory or perhaps alongside the original fst file (though that might not be possible due to access restrictions).
Currently the datatableinterface inherits from data.table, and stores its table_proxy object in a data.table cell. This allows the table_proxy object to be updated in place. I believe this functionality should be moved into the table_proxy object because (1) it is not specific to the data.table interface and (2) any function intended to update the state of a table_proxy needs to take the entire datatableinterface object as an argument, which seems a bit unwieldy.
The new arrangement would allow:
table_proxy_update_state <- function(tbl_proxy, new_state) {
  tbl_proxy[, remote_table_state := new_state]
}
Or, assuming that the remote_table_state field is the only part of the table_proxy that ever needs to be updated, that field itself could inherit from data.table, which would allow functions along these lines:
table_proxy_select_cols <- function(tbl_proxy, selection) {
  tbl_proxy$remote_table_state[, colnames := colnames[selection]]
  tbl_proxy$remote_table_state[, ncol := length(selection)]
  tbl_proxy$remote_table_state[, coltypes := coltypes[selection]]
}
If this makes sense, I can have a go at implementing it.
I would like to rbind two fsttable objects, or a single fsttable with a data.frame. What would be the preferred method?
library(fsttable)
library(data.table)
ft1 <- fst_table('1.fst')
rbindlist(list(ft1, ft1[1:10]))
.table_proxy X Y
1: <tableproxy[2]> 0 0
2: <tableproxy[2]> 0 0
For creating a new column/updating, I tried
ft1[1:4, .(X)] * 4
X
1: 4
2: 8
3: 12
4: 16
If I update based on data.table methods, it results in an error:
new <- (ft1[1:4, .(X)] * 4)[[1]]
ft1[1:4, new := new]
Error in parse_j(j, tbl_proxy$remotetablestate$colnames, parent.frame()) :
j must be a list
Is there a preferred method for modifying/updating columns? I did read some previous issues here and here. I just wonder if there are any updates for that.
Thanks
PS: My objective is to update an already loaded fsttable object without converting to data.frame/data.table, add new rows, and write it back as a .fst file (after doing some join operations).
We will need a method csv_to_fst in the fst package (planned). By using that method under the hood, we can define a fsttable object from a csv file:
ft <- fst_table_from_csv("somebigfile.csv", columns = c("A", "B", "C"))
print(ft[, p_sum(A)])
This would calculate the sum of column A in a csv file that is possibly too large to read with currently existing methods. With this code snippet, the csv file would be streamed (in blocks) to a fst file first, and then a fsttable reference is returned.

(And a new file somebigfile.fst would also be created simultaneously.)
But it will not be evaluated like in data.table. That's because the datatableinterface object doesn't really contain any data, just references to on-disk data. So we can't actually evaluate the j expression within the list that contains it. Some example code:
parser <- function(jsub, parent_frame, table_columns) {
  result <- NULL
  if (is.call(jsub)) {
    print(paste0("method '", as.character(jsub[[1]]), "' is called:"))
    result <- vector("list", length(jsub))
    result[[1]] <- as.character(jsub[[1]])
    # some arguments
    if (length(jsub) > 1) {
      for (pos in 2:length(jsub)) {
        if (is.call(jsub[[pos]])) {
          # call registration here
          result[[pos]] <- parser(jsub[[pos]], parent_frame, table_columns)
        } else {
          is_symbol <- typeof(jsub[[pos]]) == "symbol"
          name <- as.character(jsub[[pos]])
          print(paste0("argument '", jsub[[pos]], "', exists in parent frame: ",
            exists(name, where = parent_frame), " is symbol: ", is_symbol,
            if (is_symbol) paste(" is column:", name %in% table_columns) else ""))
          result[[pos]] <- as.character(jsub[[pos]])
        }
      }
    }
    print(paste0("end method '", as.character(jsub[[1]]), "'"))
  }
  result
}

parse <- function(j, table_columns) {
  parser(substitute(j), parent.frame(), table_columns)
}

# call the parser with (simulated) known columns Z and E and a j-expression
parse(.(A = 5, B = 3 * C(7), C = f(r * E), D = g(2 * Q)), c("Z", "E"))
#> [1] "method '.' is called:"
#> [1] "argument '5', exists in parent frame: FALSE is symbol: FALSE"
#> [1] "method '*' is called:"
#> [1] "argument '3', exists in parent frame: FALSE is symbol: FALSE"
#> [1] "method 'C' is called:"
#> [1] "argument '7', exists in parent frame: FALSE is symbol: FALSE"
#> [1] "end method 'C'"
#> [1] "end method '*'"
#> [1] "method 'f' is called:"
#> [1] "method '*' is called:"
#> [1] "argument 'r', exists in parent frame: FALSE is symbol: TRUE is column: FALSE"
#> [1] "argument 'E', exists in parent frame: FALSE is symbol: TRUE is column: TRUE"
#> [1] "end method '*'"
#> [1] "end method 'f'"
#> [1] "method 'g' is called:"
#> [1] "method '*' is called:"
#> [1] "argument '2', exists in parent frame: FALSE is symbol: FALSE"
#> [1] "argument 'Q', exists in parent frame: FALSE is symbol: TRUE is column: FALSE"
#> [1] "end method '*'"
#> [1] "end method 'g'"
#> [1] "end method '.'"
So with a similar method, we can determine all the calls that are made in the j argument and decide whether we need to load a column.
To be able to process code like:
# identify fsttable with fst file
ft <- fsttable::fst_table("1.fst")
# add simulated column 'N'
ft[, N := E * 5 + B]
In this example, column N is added as a simulated (or perhaps better: virtual?) column to the table. That means that no data is generated yet; the new column is kept as a tree structure of known methods (* and +) and data (5).
To store that information, ft needs to be updated. To do that, a fsttable needs to update itself, which requires an internal self-reference (like a data.table object). Perhaps the relevant data in a fsttable can be encapsulated in a single-cell list-type data.table to start with (that list element can be updated in memory). Equivalent code:
# some object
obj <- list(Param1 = TRUE, Param2 = 1)
# store in a single cell data.table
x <- data.table::data.table(Data = list(obj))
# that creates a data.table column of type 'list'
typeof(x$Data)
#> [1] "list"
# example method for updating list element
update_obj <- function(x) {
obj_current <- x[1, Data[[1]]] # get obj
obj_current[["Param3"]] <- "value" # update obj
x[1, Data := list(list(obj_current))] # rewrite to x
}
# update in-place
update_obj(x)
# the element in x now points to the updated obj
x$Data[[1]]
#> Param1 Param2 Param3
#> 1: TRUE 1 value
In addition to the standard integer (or long long) vector of indices, a row selection could also be stored as a range. A range can be stored using very little data, while a vector of indices is very expensive.
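A sketch of that optimization (as_selection is a hypothetical helper): detect contiguous selections and store them as a constant-size (start, end) pair instead of a full index vector.

```r
# Store a contiguous selection as a range, anything else as raw indices.
as_selection <- function(idx) {
  if (length(idx) > 0L && all(diff(idx) == 1L)) {
    list(type = "range", start = idx[1L], end = idx[length(idx)])
  } else {
    list(type = "indices", idx = idx)
  }
}

as_selection(5:9)$type        # "range": two numbers instead of five
as_selection(c(2L, 9L))$type  # "indices": must keep the full vector
```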
As shortly discussed here, an rquery interface would be a nice third candidate as a wrapper around the table_proxy class.
In RStudio, list and data.frame objects have code completion: when you access an element, you get a nice popup showing the list element names. However, when you select a name from the menu, we get brackets around the column name, as you would expect for a list or data.frame.
If we want to get auto-completion on a fsttable object the data.table way, that object should have a data.table class and a fsttable class name:
# create a print generic
print.myclass <- function(dt) {
  print("This is an override")
}
# define a data.table object (26 rows, so ColBB recycles cleanly)
dt <- data.table::data.table(ColA = 1:26, ColBB = LETTERS)
# add a myclass class to the list of class names (at position 1)
class(dt) <- c("myclass", class(dt))
# check that the custom print method is called
print(dt)
#> [1] "This is an override"
The auto-completion will now work as intended.
It would be nice to use the data.table class name in the fsttable object. That would require that overrides exist in the fsttable package for every possible generic that is defined in the data.table package. Otherwise, the data.table generic would be called for that particular method...
remotes::install_github("fstpackage/fsttable")
*** arch - i386
Warning: S3 methods '$.fst_table', '[.fst_table', '[[.fst_table', 'as.data.frame.fst_table', 'as.list.fst_table', 'dim.fst_table', 'dimnames.fst_table', 'names.fst_table', 'print.fst_table', 'row.names.fst_table', 'str.fst_table'
were declared in NAMESPACE but not found
Error: package or namespace load failed for 'fsttable' in namespaceExport(ns, exports):
undefined exports: fst_table
Error: loading failed
Execution halted
Auto-completion of column names is simulated by adding data.table as a class to the datatableproxy object (see this code). When the datatableproxy object is modified in place (with the := operator), auto-completion will only follow when the object itself is also modified, requiring a self-reference.
library(fsttable)
nr_of_rows <- 5
x <- data.table::data.table(X = 1:nr_of_rows, Y = LETTERS[1 + (1:nr_of_rows) %% 26])
fst::write_fst(x, "1.fst")
ft <- fst_table("1.fst")
ft
ft[X==1] # this doesn't work
really like this package thanks!
Some operations might require temporary files. For example, a large slice map or a newly created column can be stored in an additional fst file.

When these potentially large temp files are stored in a temp directory, that might cause a problem for the user: they might not be cleaned very often, and large fst files will quickly fill up the user's disk.

Perhaps it would be nicer to organize the fsttables in a fststore (better name required :-)). The temp files could then be kept organized and cleaned more easily.
For example, we can have a fst_remote package that implements the fst format as a remote table. That structure could be easily modified to have the fst_remote package running on a different computer than the table proxy (which would be running on the client side).

In effect, that would create a remote database that serves data from fst files (accessible with a data.table interface). Therefore, it might be good design to keep the in-memory structures in the table proxy class as small as possible. For example, when a row mask is calculated from an i expression, that structure could be saved to disk instead of returned to memory. Subsequent operations that need the mask would need to read it from file before it can be used. This avoids sending data back and forth between the table proxy and the remote table implementation until absolutely necessary (which helps for speed when working over a network connection, for example). It would also be in line with the fsttable philosophy of keeping memory requirements as small as possible.

That filter could be very small if implemented as a binary vector (C++) with a mask bit set for each selected row. A 1 billion record table will have a 125 MB mask (8 bits per byte and each row is 1 bit, excluding NAs that is).
These two interfaces to the underlying fst file should use the same mechanism for keeping the minimum amount of data in memory. Therefore, the fst package should define methods to accommodate both, probably first as non-visible package methods.
After some tweaking, the design is now as follows (with fst as the storage backend). To connect everything together, the method fsttable (defined here) calls the table_proxy() constructor with a specific implementation of a remote_table. In this case that is our remote_table_fst implementation, defined here.

With this design, the data.table interface is completely separated from the remote_table implementation. Other remote_table implementations (other backends) could easily be added when they implement the generic functions. The data_table_interface class and generic functions can be refactored into a separate package. That package could be used by other backend packages to define a data.table interface for that specific backend, which they could do by implementing a custom version of the remote_table generics.
(Only relevant when we can do more than just printing the whole table :-))

By providing a helper C method, we can test for a range without creating intermediate vectors. A first (non-optimized) try:
Rcpp::cppFunction('
SEXP is_range(SEXP int_vec) {
  int vec_length = LENGTH(int_vec);
  if (vec_length == 0) return Rf_ScalarLogical(TRUE);  // avoid reading past an empty vector
  int* int_values = INTEGER(int_vec);
  int counter = int_values[0];
  for (int i = 1; i < vec_length; i++) {
    counter++;
    if (int_values[i] != counter) return Rf_ScalarLogical(FALSE);
  }
  return Rf_ScalarLogical(TRUE);
}
')
# a range
is_range(1:1e6)
#> [1] TRUE
# not a range
is_range(c(1:1e6, 2L))
#> [1] FALSE
# performance
int_vec <- 1:1e7
timings <- microbenchmark::microbenchmark(
is_range(int_vec)
)
# speed in billions of integers per second
length(int_vec) / median(timings$time)
#> [1] 1.746079
That's fast enough for practical purposes, I think. Perhaps the method above can be improved by chunking the vector and determining the sum of squares of the vector minus the expected vector (that would minimize the number of if statements and may improve CPU throughput).
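A base-R sketch of that chunked idea (is_range_chunked is a hypothetical helper; a tuned version would live in C++): per chunk, sum the squared differences against the expected arithmetic sequence, leaving one branch per chunk instead of one per element.

```r
# Range check with one comparison per chunk: a chunk matches the expected
# sequence iff the sum of squared differences is exactly zero.
is_range_chunked <- function(v, chunk_size = 1024L) {
  n <- length(v)
  if (n == 0L) return(TRUE)
  first <- as.numeric(v[1L])
  for (start in seq(1L, n, by = chunk_size)) {
    end <- min(start + chunk_size - 1L, n)
    expected <- first + (start:end) - 1
    if (sum((as.numeric(v[start:end]) - expected)^2) != 0) return(FALSE)
  }
  TRUE
}

is_range_chunked(1:100000)      # TRUE
is_range_chunked(c(1:10, 12L))  # FALSE
```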
A slice map would be implemented as an unsigned int or unsigned long long vector (depending on the size of the table). The length would be equal to the number of selected elements in the mask. This vector can be large: a 1 billion row selection will have a slice map of 4 GB (unsigned int). A 5 billion row selection will have a slice map of 40 GB (unsigned long long). So these maps are potentially too large for the user's memory and might require on-disk storage.
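These sizes, and the 125 MB mask mentioned earlier, follow directly from the per-row costs (a quick arithmetic check):

```r
# One bit per row for a mask; 4 or 8 bytes per row for a slice map.
mask_mb   <- (1e9 / 8) / 1e6  # MB for a 1 billion row bit mask
map_gb_32 <- (1e9 * 4) / 1e9  # GB slice map with unsigned int
map_gb_64 <- (5e9 * 8) / 1e9  # GB slice map with unsigned long long
c(mask_mb, map_gb_32, map_gb_64)  # 125 MB, 4 GB, 40 GB
```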
Ordering is expensive. But for printing purposes, where we need only 10 rows, at most 10 complete (16 kB) blocks have to be read. That's no problem, because that should be sub-millisecond reads for each column (so there is no need to read the whole column for tables with a stored slice map).
Seems like a great package for handling large datasets.
If a back-end can assume the row filter is ordered, it doesn't need to order the vector before starting a sequential read.
Working with parallel methods is not straightforward. Perhaps it's best to start with documentation early on and enforce updating it with every new parallel method added.
The fsttable package has two main tasks that are best separated into two distinct components:

- Providing the data.table interface. This requires non-standard evaluation, where we compute on the R language itself. The code for this will be specific to the data.table interface.
- Interfacing with the fst file. This component should not use any language features or methods specific to the data.table package. And it should be made abstract enough that it can be re-used for the fstplyr package (or any other future interface, e.g. SQL).

Perhaps it's best to separate these components from the start. The second component will also make clear what the best API will be for the fst package to expose. In the future this component could go into a separate package that can act as a dependency for other interfaces (package fstref or fstproxy).
R 3.5.0 brings (an experimental implementation of) the ALTREP framework. It would be very useful to detect and use compact expressions like 1:N. For example, when a range is specified in i, detecting an ALTREP vector means that we don't have to run a loop to detect ranges anymore (#35). (See the ALTREP draft vignette for details.)
Also, ALTREP provides a mechanism for delayed loading of the actual vector data until it's required. Perhaps that can be used to return vectors in compact format:
# returns a compact object without actual data
x <- ft[10:20, X]
# regular methods can use the object
max(x)