
nemo's People

Contributors

aannleax, ellmau, eltociear, matzemathics, mmarx, simonmeusel


nemo's Issues

Get rid of tagged enum for datatypes

Currently, we use tagged enums, like RangedColumnScanT, to abstract over the data type used internally in the object. However, using these objects is often cumbersome, since they have to be "unpacked" into the corresponding enum object through match statements like

match target_schema.get_type(layer_index) {
    DataTypeName::U64 => {
        for scan in &trie_scans {
            if let RangedColumnScanT::U64(scan_enum) =
                scan.get_scan(layer_index).unwrap()
            {
                column_scans.push(scan_enum);
            } else {
                panic!("type should match here")
            }
        }
        ...
    },
    DataTypeName::Float => {...},
    DataTypeName::Double => {...},
}

This leads to code repetition and generally makes things harder to read/understand.

For the purpose of reordering a table, it was discussed to store floating point numbers as unsigned integers, giving a two-dimensional array of a single type. Since this would invalidate the ordering with respect to the floating point value anyway, we might want to give up on storing data in Column<float>, and instead store everything in Column<u64>. This would allow us to replace the tagged enums with a "tagged struct", i.e.

struct ColumnScanT {
    datatype: DataTypeName, // `type` is a reserved word in Rust
    scan: ColumnScanEnum<u64>,
}

Here, no unpacking is necessary, so the different arms of the match statement can be unified.

For operations which require the actual type (like addition, certain aggregates, ...) one would have to cast them back. But this should not incur any runtime costs. You do, however, lose the ability to quickly calculate conditions like "greater than 1.0f", since the data would not be sorted according to their floating point value. Another thing to consider is whether differently sized integer types should be used (probably u32 and u64). If yes, then tagged enums might still be needed, leaving the code as complex as before (but perhaps still reducing the number of types to consider). Otherwise one may waste storage space. A third option would be to decide this at compile time, instead of for each individual table at run time.
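To illustrate the cast-back step, here is a minimal sketch (store and load are made-up names, not existing API): a float can be kept in a u64 column via its bit pattern and recovered exactly, but the integer ordering of the bit patterns disagrees with the numeric ordering.

fn store(value: f64) -> u64 {
    value.to_bits()
}

fn load(bits: u64) -> f64 {
    f64::from_bits(bits)
}

fn main() {
    // The round trip is exact and essentially free at runtime.
    assert_eq!(load(store(1.5)), 1.5);
    // But u64 ordering of bit patterns disagrees with f64 ordering
    // (negative floats have the sign bit set), so conditions like
    // "greater than 1.0" cannot be evaluated on the raw u64 data.
    assert!(store(-1.0) > store(1.0));
}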

Build proper (basic) CLI

Technical Debt from #62.

We have a hacky binary for testing with hardcoded input files. We should at least give it a proper CLI where the input file can be specified via an argument or where the input is simply read from STDIN. Further arguments might be of use (later).

Full Datalog Support and Existential Rules

The following features are still missing:

  • Constants
    • Body
    • Head
  • Reusing a variable in one atom (body and head)
    • Body
    • Head
  • Evaluating multiple heads at once
  • Restricted Chase
  • Skolem Chase

Support incremental parsing

Currently, the parser requires one large String containing everything to parse. Switch to nom's incremental parsing, which should mostly be a matter of changing a few imports and possibly handling some Incomplete errors, and add a wrapper that can chunk large inputs.
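A minimal sketch of the streaming behaviour, assuming nom 7's streaming combinators (prefix_keyword is a made-up example parser, not Stage2 code):

use nom::{bytes::streaming::tag, IResult};

// A streaming parser for a fixed keyword.
fn prefix_keyword(input: &str) -> IResult<&str, &str> {
    tag("@prefix")(input)
}

fn main() {
    // A complete match succeeds as usual.
    assert!(prefix_keyword("@prefix ex: <http://example.org/>").is_ok());
    // In streaming mode, truncated input is not a hard error but a
    // request for more data, which a chunking wrapper can satisfy.
    assert!(matches!(prefix_keyword("@pre"), Err(nom::Err::Incomplete(_))));
}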

Depends on #58 being merged.

General Permutator for sorting index-based data

What functionality is wanted?

Data in index-based structures shall be sorted. As a logical step before physically sorting all the data, a Permutator is favourable.

Needed functionality

  • Sorting of index-based data
  • Multi-structure sorting
  • Sorting on slices/intervals
  • Sorting on Columns

Related information

A permutator is mentioned as one method in Issue #7

Support incremental RLE

A further Column implementation that stores data using incremental run-length encoding should be created. The data structure should store a list of records that each contain (1) a start data value v (of the Column's type T), (2) a length l, and (3) an increment i. Such a record encodes a sequence of l values, starting with v, with each further value being i larger (i can be 0 for sequences of repeating values). This implementation is only needed for integer types (currently only u64 and usize, but further types might be added later, e.g., to reduce memory use).
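A minimal sketch of such a record and how it decodes (field names are assumptions; u64 only):

// Sketch of one incremental-RLE record.
struct RleBlock {
    start: u64,     // (1) start value v
    length: usize,  // (2) number of encoded values l
    increment: u64, // (3) increment i between consecutive values
}

impl RleBlock {
    // Decode the record into the sequence v, v + i, v + 2i, ...
    fn decode(&self) -> impl Iterator<Item = u64> {
        let (v, i) = (self.start, self.increment);
        (0..self.length).map(move |k| v + i * k as u64)
    }
}

fn main() {
    let block = RleBlock { start: 7, length: 3, increment: 2 };
    assert_eq!(block.decode().collect::<Vec<_>>(), vec![7, 9, 11]);
}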

The AdaptiveColumnBuilder should be extended to use this encoding of columns whenever there is a significant memory reduction in doing so. This is determined heuristically, by starting to build RLE columns and tracking their size; if the expected size of the plain encoding is less than some factor [2? 1.5?] times the size of the RLE encoding, then the data is transcoded to a plain vector and processing is continued with a vector. There might be other ways to guess what is best.

Optimization of Datalog Rule Execution

What functionality is wanted?

The semi-naive evaluation can be improved in several dimensions, improving its speed and memory consumption.

Needed functionality

There are a few possibilities for optimizing the computation. Some of them may interact with each other. This list might be incomplete.

  • Estimate computational and storage cost of an operation: This is probably a prerequisite to some of the optimizations listed below
  • Find a good variable order: Try to avoid reorderings of the tables as much as possible. This can be done at two levels - locally for a single rule and globally for the whole program.
  • Caching: Some results (like the join of two particular tables) computed during the evaluation of one rule might be helpful for another and might therefore be kept in memory
  • Find a good rule execution order: Can lead to bigger subtables, speeding up future operations
  • Drop tables from memory: Free up memory if a table is no longer needed. Here one could also consider to sometimes continue the evaluation with a non-materialized table
  • Combine multiple subtables into one: This could speed up future joins

Implement handling of prefixes in the rule parser

The prefix handling code in the rule parser is currently incomplete. While prefixes are taken into account on parsing, there is, e.g., currently no way to obtain the original IRIs from a PrefixedName.

Depends on #58 being merged.

Adaptive Column Builder: More Flexible Heuristic Configuration

The adaptive column builder starts building an RleColumn and uses a heuristic to decide whether to keep it or transform it into a VectorColumn at some point.

Currently the adaptive column builder uses two constants for this heuristic: COLUMN_IMPL_DECISION_THRESHOLD to determine after how many RleElements a decision is made, and TARGET_MIN_LENGTH_FOR_RLE_ELEMENTS to require a minimum average length of the RleElements for picking an RleColumn over a VectorColumn.

Instead of using constants, we should allow these values to be passed when creating the AdaptiveColumnBuilder. We should provide Default implementations for both arguments with the current constant values. Furthermore, COLUMN_IMPL_DECISION_THRESHOLD should support a special value for making the decision only when finalize is called.
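A sketch of what the configurable interface could look like (names and default values are placeholders, not the current constants):

pub struct AdaptiveBuilderConfig {
    // After how many RleElements to decide; None means "decide only
    // when finalize is called" (the requested special value).
    pub decision_threshold: Option<usize>,
    // Minimum average RleElement length required to keep the RleColumn.
    pub min_avg_rle_element_length: usize,
}

impl Default for AdaptiveBuilderConfig {
    fn default() -> Self {
        Self {
            decision_threshold: Some(1000),
            min_avg_rle_element_length: 3,
        }
    }
}

Modelling the special value as an Option avoids introducing a magic sentinel constant.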

Operator Trait

Define two traits specifying the behaviour of operations:

  • Operator Trait
  • Set-Operator Trait (inherits the Operator trait)

Make `RuleParser` a Builder

RuleParser should have a finalize method that moves out the parsed Data, Program, and Dictionary.

Depends on #58 being merged.

RLE Column - Seek panics with interval bounds

After calling narrow on the RLE column, the seek operation should be restricted to the given interval. Right now this does not work as expected, since the search might end up in an RLE block with a negative increment, which is not supported.

Setting Datatypes for Reading CSV

Technical Debt from #62.

When reading a CSV file, we pass an optional datatype for each column (U64,Float,Double). If none of these datatypes is specified, we treat the contents of the column as strings and internally represent them by u64 (with a dictionary).

Is this fallback solution what we want or should we rather introduce String as a datatype? It currently feels like a little bit of a dirty hack to me. Let's discuss this!

RLE Column returns Infinity for large floating point numbers

The quickcheck test uncovered that in certain cases, the RLE column returns Inf or -Inf in the get function instead of a large floating point number.
One example for this is:

let col = RleColumn::new(vec![
    Float::new(-3.4028235e38).unwrap(),
    Float::new(0.0).unwrap(),
    Float::new(3.4028235e38).unwrap(),
]);
assert_eq!(col.get(2), Float::new(3.4028235e38).unwrap()); // fails since col.get(2) returns Infinity

I think the cause of this problem is that the column is stored as RleColumn { values: [Float(-3.4028235e38)], end_indices: [3], increments: [Increment(Float(3.4028235e38))] }, and get(2) just multiplies the increment by 2, which probably already ends up being Inf. Adding Inf to the start value then results in Inf as well.

In general, for signed datatypes, the issue is that increments like MAX - MIN overflow the value space of the underlying datatype.
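The overflow can be reproduced with plain f32 arithmetic:

fn main() {
    let min = f32::MIN; // -3.4028235e38
    let max = f32::MAX; //  3.4028235e38
    // The increment MAX - MIN exceeds the largest finite f32:
    assert!((max - min).is_infinite());
    // Likewise, multiplying the increment by an offset overflows:
    assert!((max * 2.0).is_infinite());
}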

Unify Trie Display and CSV Debug Output

Technical Debt from #62.

We use pretty much the same logic to implement the Display trait of a trie and to dump it to a CSV file. We should unify this a bit more and extract common parts.

AdaptiveColumnBuilder panics on empty data

Reproduce issue

minimal example:

let mut builder: AdaptiveColumnBuilder<u32> = AdaptiveColumnBuilder::new();
let column = builder.finalize();  // panic (division by 0)

Expected behaviour

An empty column should be returned by the builder

Discussion

I'm not sure whether this should be checked at the ColumnBuilder level or the RleColumn level.

Create Trie from List of Columns or Rows

Tries (and anything that implements the Table trait) should be constructable from a list of columns or a list of rows, which essentially represent a table.

For example, consider the following table:

1 3 8
1 2 7
2 3 9
1 2 8
2 6 9

This may be given as input row-wise: [[1, 3, 8], [1, 2, 7], ...]
or column-wise: [[1, 1, 2, 1, 2], [3, 2, 3, 2, 6], ...]

The resulting Trie would look like this:

1 2 7
    8
  3 8
2 3 9
  6 9
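A minimal sketch of the row-wise construction idea (plain Vecs instead of the actual Column types): sort the rows lexicographically, then for each row emit only the suffix where it diverges from its predecessor.

fn trie_layers(mut rows: Vec<Vec<u64>>) -> Vec<Vec<u64>> {
    rows.sort();
    let arity = rows[0].len();
    let mut layers = vec![Vec::new(); arity];
    let mut prev: Option<&Vec<u64>> = None;
    for row in &rows {
        // First position where this row differs from the previous one.
        let diverge = prev.map_or(0, |p| {
            row.iter().zip(p).take_while(|(a, b)| a == b).count()
        });
        for layer in diverge..arity {
            layers[layer].push(row[layer]);
        }
        prev = Some(row);
    }
    layers
}

fn main() {
    let rows = vec![
        vec![1, 3, 8], vec![1, 2, 7], vec![2, 3, 9],
        vec![1, 2, 8], vec![2, 6, 9],
    ];
    // Matches the trie drawn above, layer by layer.
    assert_eq!(
        trie_layers(rows),
        vec![vec![1, 2], vec![2, 3, 3, 6], vec![7, 8, 8, 9, 9]]
    );
}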

Optimizations Database Operations: Join

Suppose you want to evaluate the rule

   a(x, y), b(y, z), c(z) -> d(x, y)

Currently this would compute the temporary table tmp(x, y, z) and then project this down to d(x, y).

This is inefficient because the result does not depend on every valid z-value. Hence it would suffice to know that one such value exists and then move on to the next x, y pair.

To illustrate, consider a database with a(1, 2), b(2, 5), b(2, 6), c(5), c(6). We'd obtain tmp(1, 2, 5) and tmp(1, 2, 6), even though we only need to save d(1, 2).
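A sketch of the intended short-circuiting, on flat fact lists rather than tries (any() stops at the first witness for z instead of enumerating all of them):

use std::collections::HashSet;

fn main() {
    let a = vec![(1, 2)];
    let b = vec![(2, 5), (2, 6)];
    let c: HashSet<u32> = [5, 6].into_iter().collect();

    // d(x, y) holds if some z exists with b(y, z) and c(z).
    let d: Vec<(u32, u32)> = a
        .iter()
        .filter(|&&(_, y)| b.iter().any(|&(y2, z)| y2 == y && c.contains(&z)))
        .copied()
        .collect();

    // Only d(1, 2) is produced; no tmp(1, 2, 5) or tmp(1, 2, 6).
    assert_eq!(d, vec![(1, 2)]);
}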

RLE Scan - Performance Improvements

Before the recent restructuring, the RLE Scan was not used in the reasoning process, but now it is.
Apparently, the RLE Scan is much slower than the generic column scan that was used before.

We should at least get the performance to the original level.

Support iterating an interval in a column

The GenericColumnIterator currently iterates from the start to the end of a column. It should support the specification of different bounds and merely initialise these bounds to 0 and len() as the default choice.
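A minimal sketch of the bounded iteration, over a plain slice instead of the actual column types (names assumed):

// Iterate a column slice over [pos, upper) only.
struct IntervalIterator<'a, T> {
    data: &'a [T],
    pos: usize,
    upper: usize,
}

impl<'a, T: Copy> Iterator for IntervalIterator<'a, T> {
    type Item = T;

    fn next(&mut self) -> Option<T> {
        if self.pos < self.upper {
            self.pos += 1;
            Some(self.data[self.pos - 1])
        } else {
            None
        }
    }
}

fn main() {
    let column = [10, 20, 30, 40, 50];
    // Default bounds would be 0 and column.len(); here we narrow them.
    let iter = IntervalIterator { data: &column, pos: 1, upper: 4 };
    assert_eq!(iter.collect::<Vec<_>>(), vec![20, 30, 40]);
}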

Implement Display Trait for Tries

Tries should implement the Display trait for easier debugging.
Consider a Trie that has [1,2] in the upper level with [2, 3] and [3, 6] as their children, respectively. In turn, the lowest level consists of [7,8], [8], [9], [9].
The output may look as follows:

1 2 7
    8
  3 8
2 3 9
  6 9

Trie and Trie iterators

We need a data structure to represent a Trie. It should store the underlying data as a vector of IntervalColumn. Furthermore, some interface for Trie iterators together with a basic implementation is required. A trie iterator should allow you to vertically traverse a Trie and should be able to return a ColumnScan object which can be used for the OrderedMergeJoin.

Adaptive Column Builder does not work with any supported datatype

Description

Due to the traits defined for the generic data type T, one cannot instantiate an AdaptiveColumnBuilder<u64>, AdaptiveColumnBuilder<Float>, or AdaptiveColumnBuilder<Double>.

Expected behaviour

It should be possible to instantiate the ColumnBuilder with any of the defined "basic" datatypes.

Sorting CSV-imported data structures

After importing a CSV file, the data sometimes needs to be sorted before it is given to column builders.

To preserve the relationships within the data, which is already represented in column-like vectors, one needs to implement a Permutator that sorts all of the vectors in the same order.

In contrast to the usual implementation from the crates.io Permutator, ordering spanning multiple slices shall be supported too (i.e. sorting by the value of additional columns if the values in one column are identical).
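A sketch of the multi-column tie-breaking sort over indices (plain Vecs; not the actual Permutator interface):

use std::cmp::Ordering;

// Sort indices by the first column, breaking ties with later columns.
fn sort_indices(columns: &[Vec<u64>]) -> Vec<usize> {
    let mut indices: Vec<usize> = (0..columns[0].len()).collect();
    indices.sort_by(|&a, &b| {
        columns
            .iter()
            .map(|col| col[a].cmp(&col[b]))
            .find(|ord| *ord != Ordering::Equal)
            .unwrap_or(Ordering::Equal)
    });
    indices
}

fn main() {
    let first = vec![2, 1, 1];
    let second = vec![9, 8, 7];
    // Rows (1, 7) and (1, 8) tie on the first column; the second
    // column decides their relative order.
    assert_eq!(sort_indices(&[first, second]), vec![2, 1, 0]);
}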

Implement Semi-Naive Datalog Evaluation

What functionality is wanted?

Stage2 should be able to perform Datalog reasoning using the semi-naive evaluation technique.

Needed functionality

  • A single application of a rule. This includes
    • Combining the basic database operations Join and Union to evaluate the body of the rule
    • Using Difference as a way to eliminate duplicate entries
    • Materializing the above (combined) iterator to obtain a temporary table
    • Using Project to reorder/delete columns, materializing every needed head atom
  • Evaluating a set of rules until a fixpoint is reached

Related information

  • Here we focus on correctness, not optimization. In particular:
    • Every result is materialized. A table is never dynamically deleted
    • It may be assumed that a variable order is given for each rule. Every table should be materialized in every column order that is required. For testing purposes, a simple algorithm which produces some variable order should be implemented.
    • Any fair rule execution order suffices

Delegation of Trait Implementations

For our custom types Float and Double, we have to delegate quite a lot of trait implementations to the underlying f32 and f64 types, respectively. I played around with delegate and ambassador a bit but I did not achieve satisfying results here.

This is not a pressing issue at all but it would be nice to have if we can find a nice way to shorten those delegations.
Feel free to experiment here :)

Optimization Database operations: Project

The current project and reorder operation is very inefficient and accounts for the majority of the runtime in the benchmarks performed so far.

Basically, this operation has to be reimplemented from the ground up, as the current approach did not turn out to be very good.

Full Trie Iterators

As of now, all TrieScanEnum objects resulting from database operations only represent a partial trie, i.e. a trie where there may exist paths not of maximal depth. It would be nice to have trie iterators which would look ahead to determine if the current value is actually "real".

Separate physical and logical layer into different crates

To have a clearer separation of the logical and physical layer we want to put the physical layer into its own crate which is then used by the logical layer.

The only problem that needs solving here is code that is currently used by both layers. This is:

  • meta/timing.rs
  • meta/logging.rs

The logging dependency could be solved by pulling the log statements out of this global file to the places where they are used. The original reason for having them in a separate file was to reduce clutter in the functions which do the work. The redesign has to keep this in mind. One way of doing this could be to use Display implementations to avoid verbose setup code.

This should wait until #70 is done, since we currently have other dependencies relating to the type system.

Dependency Graph

What functionality is wanted?

Calculate a dependency graph from a given rule set.
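A minimal sketch of the predicate-level construction (the Rule type here is hypothetical, for illustration only):

use std::collections::{HashMap, HashSet};

struct Rule {
    head_predicates: Vec<String>,
    body_predicates: Vec<String>,
}

// Edge p -> q means facts over p can contribute to deriving facts over q.
fn dependency_graph(rules: &[Rule]) -> HashMap<String, HashSet<String>> {
    let mut graph: HashMap<String, HashSet<String>> = HashMap::new();
    for rule in rules {
        for body in &rule.body_predicates {
            for head in &rule.head_predicates {
                graph.entry(body.clone()).or_default().insert(head.clone());
            }
        }
    }
    graph
}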

Related information

Performance Improvements for RleColumn

@aannleax pointed out some straightforward ways to improve the performance of the RleColumn.
The main opportunities lie in the get method of the RleColumn as well as in the seek method of the RleColumnScan.
Both methods are implemented rather naively at the moment.

Namely, the following improvements should be done:

  • Instead of storing a vector of structs in the RleColumn, the RleColumn can store 3 vectors which in comparison to the RleElement hold:

    1. The start value of the RleElement,
    2. the end index of the RleElement plus one in context of the RleColumn (in other words, the accumulated length of all elements before and including the current one)
    3. the increment of the RleElement.

    For example, the list [1, 2, 3, 7, 9, 11] can be encoded with three vectors as follows:

    1. [1, 7],
    2. [3, 6],
    3. [1, 2].
  • The get method should perform a binary search on the end indices. (This is straightforward after the above adjustment.)

  • The seek method should perform a binary search on the values. This requires the assumption that the data encoded in the RleColumn is sorted, which is probably a valid assumption. Note that this adjustment could be done without the other two.

The assumption of sorted data for the last task may be subject to discussion and may deserve its own issue.
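A sketch of the binary-search get over the three-vector layout described above, using the example encoding (field names assumed; partition_point performs the binary search on the end indices):

struct RleColumn {
    values: Vec<u64>,        // start value of each element
    end_indices: Vec<usize>, // accumulated lengths (end index plus one)
    increments: Vec<u64>,    // increment of each element
}

impl RleColumn {
    fn get(&self, index: usize) -> u64 {
        // First element whose end index is past `index`.
        let block = self.end_indices.partition_point(|&end| end <= index);
        let block_start = if block == 0 { 0 } else { self.end_indices[block - 1] };
        let offset = (index - block_start) as u64;
        self.values[block] + self.increments[block] * offset
    }
}

fn main() {
    // Encodes [1, 2, 3, 7, 9, 11] as in the example above.
    let col = RleColumn {
        values: vec![1, 7],
        end_indices: vec![3, 6],
        increments: vec![1, 2],
    };
    let decoded: Vec<u64> = (0..6).map(|i| col.get(i)).collect();
    assert_eq!(decoded, vec![1, 2, 3, 7, 9, 11]);
}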

Mapping Terms from Model to Internal Datatypes

Technical Debt from #62.

There is no one-to-one correspondence between the terms in model (obtained from parsing) and the internal datatypes (U64,Float,Double). There is currently only an incomplete mapping from terms to those datatypes.

What should we do here? I think this requires some more conceptual discussion.

Untangle physical and logical layers

This is the meta-issue for the untanglement milestone, it holds the results of the discussion on 2022-11-23.

The physical layer should be self-sustaining (i.e., it could potentially be a stand-alone crate):

  • there should be a “Database Instance” on the physical layer that holds all relevant information (i.e., dictionaries for predicate names and for values, all tables, etc.); it should be able to answer questions about size and number of tables
  • tables are named (by Strings), where names are always set by the logical layer
  • all access to tables is exposed via table names
  • Iterators can be constructed from execution plans (plans need to be validated before execution, errors need to be propagated up)
  • the physical layer should not need knowledge about execution steps and delta tables
  • the schema of derived tables can be inferred from its constituents, for new tables, it is passed down from the logical layer
  • we want to support the following (distinct) primitive data types: u32, u64, f32, f64, “u32 with dictionary”, and “u64 with dictionary”
  • primitive data types can be joined iff they are (i) the exact same types (using, where applicable, the same dictionary), (ii) they are u32 and u64, or (iii) they are “u32 with dictionary” and “u64 with dictionary” and they have the same dictionary
  • association between table and dictionary is tracked by the database instance

The logical layer has:

  • the data model (which may need to be separated into a general and a rule-specific part)
  • a knowledge base (an instance of the logical data model)
  • the parser, which takes a (possibly empty) database instance, a (possibly empty) knowledge base and a program, and produces a database instance and a data model instance
  • a materialisation manager, which holds the database instance (and likely the data model instance) and, e.g., knows which delta tables are part of the materialisation of which tables
  • the materialisation manager directs the construction of Iterators, i.e., creates execution plans

[Whiteboard photo from the 2022-11-23 discussion]

Prefixed String Dictionary: broken PartialEq implementation

Encountered the following error while running cargo +nightly test:

---- physical::dictionary::prefixed_string_dictionary::test::iri stdout ----
thread 'physical::dictionary::prefixed_string_dictionary::test::iri' panicked at 'assertion failed: `(left == right)`
  left: `0`,
 right: `4`', src/physical/dictionary/prefixed_string_dictionary.rs:440:13


failures:
    physical::dictionary::prefixed_string_dictionary::test::iri

usize vs u32 vs u64 for the Dictionary

There has been a conversation about the choice of datatype for the Dictionary indices. It should be discussed in due time.

Apart from this discussion on usize/u64 (which we should probably decide on in the group) and the small changes above, this looks good, though.

Thank you for the thoughtful review, did all the changes except the usize vs u32/u64 issue

I see your concern. It is possible to refactor the Dictionary trait to be either u64 or u32.
My thoughts have been to utilize the native number representation of the machine.

usize is not about being the native representation, its sole point is to cover the whole range of allowable pointer values. Even i386 has 64-bit-registers, so u64 could also be considered “native” there.

It is still something to consider when using indices, as indices are always usize. I wanted to avoid unnecessary casting back and forth when I designed the Dictionary trait.

Originally posted by @ellmau in #68 (comment)

Refactor AdaptiveColumnBuilder to support further column types

The AdaptiveColumnBuilder is currently implemented to decide the use of a RleColumn vs a VectorColumn based on the data that the AdaptiveColumnBuilder receives.
The current solution may not be suitable when further column types are to be supported.

This issue provides a place for discussion on refactoring possibilities. Some discussion also already happened in #8.
Once a decision is made, the implementation can happen in the scope of this issue.
Maybe it is worthwhile to postpone the discussion until another column type is actually to be supported by the AdaptiveColumnBuilder. In this sense, the issue acts as a mere reminder.

Planning Errors

What needs to be answered

We need to discuss the planning and handling of Errors.

More detailed information

On a global library scale, we now have crate::error::Error.
Introducing an error enum for every module/column type feels like a bit too much.
This came up during the PR and got resolved there, but in general one needs to think about errors and how to handle them.

The initial issue started with a discussion about error handling in one of the columns (RLE); therefore the following concepts use this module/structure.

options/suggestions:

  • have at most an error enum for the columns (e.g. crate::physical::columns::ColumnError).
  • have an error module on the respective level (e.g. crate::physical::columns::error::ColumnError) - this might be a bit too much boilerplate though
  • have all error-enums in crate::error (e.g. crate::error::ColumnError)
  • have only one enum for all physical issues (e.g. crate::physical::error::PhysicalError)
  • have an error-module for physical with various errors (e.g. crate::physical::error::ColumnError)
  • have only one Error enum (crate::error::Error)

Remarks on the options

In addition, in every case a corresponding variant in crate::error::Error should be in place too.

I think we are at a point where this decision needs to be made for the whole library, to avoid many different island solutions.

Remarks on documentation

The result of this discussion should be documented somewhere too (either a wiki page, a sticky discussion thread, or just inside a CONTRIBUTING.md file).

Originally posted by @ellmau in #8 (comment)

Due to changes in the code, the discussion turned obsolete in the PR, but it is still a question to be answered sooner rather than later.

Rule Parsing

What functionality is wanted?

Parsing of rules provided as a text file

Needed functionality

  • Come up with an internal representation of rules
  • Given a file containing a set of rules in the Rulewerk format, parse them into the format above

Related information

Support binary search

The seek() implementation in GenericColumnScan should support binary search. Right now, it only uses iteration. Binary search should be used while the interval of values to scan is not small (say, of size >10; this should be defined as a constant somewhere). Small intervals (say, below 10 elements) should always be scanned linearly.
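A sketch of the hybrid strategy on a sorted slice (the threshold value and names are placeholders):

// Placeholder constant; the issue suggests defining this somewhere central.
const LINEAR_SCAN_THRESHOLD: usize = 10;

// Find the first position in sorted data[lower..upper] whose value is
// >= target: binary search while the interval is large, then a linear
// scan for the small remainder.
fn seek(data: &[u64], lower: usize, upper: usize, target: u64) -> Option<usize> {
    let (mut lo, mut hi) = (lower, upper);
    while hi - lo > LINEAR_SCAN_THRESHOLD {
        let mid = lo + (hi - lo) / 2;
        if data[mid] < target {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    (lo..hi).find(|&i| data[i] >= target)
}

fn main() {
    let data: Vec<u64> = (0..100).map(|i| 2 * i).collect();
    assert_eq!(seek(&data, 0, data.len(), 31), Some(16)); // data[16] == 32
    assert_eq!(seek(&data, 0, data.len(), 999), None);
}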
