uccross / skyhookdm-ceph-cls Goto Github PK

View Code? Open in Web Editor NEW

13.0 13.0 9.0 773 KB

Skyhook Data Management: Storage and management of tabular data in Ceph.

Home Page: https://www.skyhookdm.com

License: GNU Lesser General Public License v2.1

CMake 0.23% C++ 99.23% Shell 0.45% Ruby 0.09%

skyhookdm-ceph-cls's People

Contributors

Stargazers

Watchers

Forkers

aditigupta17 jayjeetatgithub ivotron kingwind94 esmaeil-mirvakili manishprivet asmi10 annu2803 jhonxue

skyhookdm-ceph-cls's Issues

Add dev container for vstart

Add container for vstart for interactive debugging and development. Perhaps something like this, which requires Cmake "Debug" build instead of "Release" build.

- id: build-dev
  uses: docker://uccross/skyhookdm-builder:nautilus
  runs: [bash]
  env:
    CMAKE_FLAGS: "-DBOOST_J=4 -DCMAKE_BUILD_TYPE=Debug -DWITH_MANPAGE=OFF -DWITH_BABELTRACE=OFF -DWITH_MGR_DASHBOARD_FRONTEND=OFF"
    BUILD_THREADS: "4"
  args: [scripts/build.sh]

We should add another args script to build / run vstart.

Add indexing for Arrow data format

Implement indexing of arrow data format (build/use index methods) for point queries via omap interface, similar to indexing methods for flatbuffer format.

Fix epel repo issue

Builds are failing frequently due to mirror issues.

Step 3/5 : RUN yum update -y &&     yum install -y https://apache.bintray.com/arrow/centos/7/apache-arrow-release-latest.rpm &&     yum install -y       re2-devel       arrow-devel       parquet-devel &&     yum clean all
 ---> Running in 5b90ef187d1a
Loaded plugins: fastestmirror, ovl
Determining fastest mirrors
 * base: mirror.netdepot.com
 * epel: mirror.team-cymru.com
 * extras: mirror.grid.uchicago.edu
 * updates: mirror.compevo.com
http://mirror.team-cymru.com/epel/7/x86_64/repodata/repomd.xml: [Errno -1] repomd.xml does not match metalink for epel
Trying other mirror.
https://linux-mirrors.fnal.gov/linux/fedora/epel/7/x86_64/repodata/repomd.xml: [Errno -1] repomd.xml does not match metalink for epel
Trying other mirror.

logfiles attached:
travis.err.build54.log
travis.err.build56.log

as part of #3 , we ported the CI pipeline of the luminous branch to centos-7.8. However, this is causing unit tests to fail (see travis#13). Error being thrown is -95, which it's apparently Operation not supported on transport endpoint.

Create "Hello World" instructions

Hello world instructions are optimized for just trying out Skyhook but do not create a development environment. The instructions assume a running Nautilus Ceph deployment in Kubernetes.

Installation instructions of Skyhook in Rook/Ceph assuming sysadmin access to a functioning Ceph deployment in a Kubernetes cluster
Installation instructions of Python Skyhook client
Generation of example table with Python client
Example queries over that table

continuous release of skyhook binaries

Once packages are available (see uccross/skyhookdm-ceph#91), we can create a release pipeline that pushes deb packages to a repository. An alternative is to use https://packagecloud.io/ that supports open source projects by not charging for their service.

Add SQL GROUP BY to Skyhook Flatbuffer processing

This project will develop object-class methods methods to sort/group query result sets. This requires extending the current code (select/project/basic-aggregations - min/max/sum/count) to support groupby and/or orderby.

Skyhook flatbuffer data format is specified here.

Current processing includes select, project, simple aggregates min/max/count/sum. This is a row-oriented layout that uses Google Flatbuffers. Each row (or agg across all rows) is processed in a loop here.

For each row, first all of the predicates are checked for each column (data schema), and if they all pass, the projected columns (query_schema) are encoded into the return flatbuffer based upon their data type here.

Currently, simple aggregates are being accumulated during the applyPredicates function into a typed predicate object over all rows that pass (nothing is encoded into the return flatbuffer until the very end) here.

SQL GROUP BY statement re-organizes the rows returned by the query as indicated here / here.

This project requires implementing SQL GROUP BY by grouping the rows of the return flatbuffer by similar column values (such as group by the same ShippingDate). One straightforward way to do this is for each unique value of the grouping column, create a hashmap keyed on that value and store the associated row numbers in a list as that key's value. After all rows have been evaluated, iterate over the map and just encode the rows of each group using the existing encoding switch statement.

This project may require refactoring the encoding switch into a separate function so it can be reused by the existing code and the new groupby code. Testing can be done using our TPCH Lineitem table data, schema specified here.

A very similar method will also need implemented for our column-oriented processing function here, that uses Apache Arrow format.

Once the groupby functionality as described above is correct, the approach should be further optimized as appropriate to minimize memory footprint and/or processing complexity.

create high-level c++ api

create a high-level C++ API. A sketch (many details purposely omitted):

#include <string>
#include <skyhookdm.h>  // or something alike

int main(int argc, char **argv)
{
  // connect to a skyhookdm-capable ceph cluster
  skyhookdm::SkyhookDM sky;
  sky.init(NULL);

  // open pool
  skyhookdm::IoCtx ioctx;
  sky.ioctx_create("mypool", ioctx);
  // ...

  // buff contains an arrow table
  sky.write(ioctx, oid, "arrow", buff);

  // define query
  skyhookdm::Query query(
    columns="col2,col5",
    filters="col1 == 'foo' AND col3 > 5"
    // etc
  );

  // run query on given object and put result (arrow format) in buff
  sky.query(ioctx, oid, "arrow", query, &buff); 
}

subtasks for this:

implement API
create tests
document API (use doxygen)
create example(s) on how to create a program that uses the API (something like the above but more extensive and with lots of comments).

Add support for operations on List types to cls_tabular class

Scientific data often uses arrays, for example arrays (of non-uniform length) very commonly appear in ROOT data. These arrays are currently represented as List< > types in Apache Arrow within Skyhook.

Adding support for simple list operations on Arrow data allows us to "pushdown" selection predicates/filters into storage. This is beneficial for our scientific data and Arrow data more generally (lists are a native data type in Arrow). The most useful ops in our context are likely those for data reduction such as filters or summary/agg methods (min/max/count/sum/first), performing these in storage across billions of rows (rather than returning all data) will become important as we increase scientific data scales.

We need to identify (with help from data analysts and their workloads) some common and simple list ops used in scientific analysis, and then determine which of these can be implemented as pushdown predicates. Here is one example usage of array ops.

We can also use the Awkward arrays library to examine and motivate specific, simple ops to consider - though awkward will have much more complex ops as well. We do not want to re-implement any awkward specific operations, only to understand simple list ops that are generally useful for arrays, but in partcular ROOT analysis. Once this is well understood (which ops and how to offload) we may want to consider including the awkward python lib within Skyhook via a webassembly approach that we are developing. For example, here is a scientific python package compiled for webassembly.

This will involve the following steps.

We have jagged arrays types defined here, but do not yet support predicate operations on this data type: link

Apply predicates will be called here, which will include the specified new ops: link

Add a case to apply predicates on lists/jaggedarrays type: link

Add a compare (or similar) function, that applies the new ops on lists. Here is an example of a simple integer compare op: link. And a regex op and date op: link

Define any new operation types for list, here: link

Adding support for operations on list types will be useful in general in Skyhook for any supported data format, lists often appear in JSON and log/sensor data.

port the travis CI pipeline from skyhookdm-ceph repo

port the CI pipeline to this repo

Python front end for SQL query

This will create a python front end that will generate skyhook query and pushdown project for now (select later) via a SQL statement in a python shell such as:
SELECT a,b, from T;

And queries of the form:
SELECT a,b, from T WHERE a < x;
SELECT sum(a) from T;

Can report unsupported syntax at this time. Although skyhook supports those ops, we defer until the basic project works reliably with error checking via this python SQL interface.

This query
SELECT a,b, from T;
should pushdown the column projection for cols a,b into storage via the correct skyhook run-query flags. There will also be metadata DDL such as table create statement to import before querying. For now assume a default test table LINEITEM schema. Underlying data format should not be specified at query time as that is always invisible to the user. Output is reported within the python shell or use something such as an '-o filename' or '\output filename' to send results to file.

Start SkyhookDM documentation at Read the Docs

Start documentation with a "Getting started" experience for sysadmins that takes very little time and resources.

Implement Rados reads in CLS

Based on feedback from RedHat, in lieu of a specialized copy_from() (our original approach) instead implement the ability to invoke rados read() from within the CLS context. This is more general, and has some support already within RedHat. This will be a significant coding effort, and it should be driven with good use-cases in mind.

Idea: from within a cls context, read from a remote object - with the ability to invoke a method directly on the remote object (e.g., any registered cls method such as transform data), returning the data output from the method to the caller object.

Considerations: this must be asynchronous (due to PG blocking), and must work for all object types to include replicated objects (more straightforward) as well as erasure coded objects (more complicated).

Motivation
Skyhook (1) : Data originally written to an object in row format in one object is transposed to column format and written to another object. While normally this could be done outside Ceph in a client, here we prefer to avoid sending data back and forth between client/storage and rather do this directly within storage.
Skyhook (2): A step in a scatter-gather of data from multiple src objects into a single target object, all within the storage layer.

General (1): De/Encrypt data from src to target object.
General (2): Transcode data from src to target object, such downscaling resolution for video data.

deb packages for skyhook

create packages for debian-based distros for the library and the client. The client-side package will package the run-query for now. Once uccross/skyhookdm-ceph#90 is resolved, we could package the run-query separately as a skyhookdm-utils package and only package the client library (with librados as dependency)

enable unit tests

add an add_subdirectory(src/query) instruction to the root CMakeLists.txt so that the src/query/CMakeLists.txt gets picked up by the build. It likely won't work so it needs to be modified so that tests work fine.

Refactor physical off/len indexing of data structs in an object

Currently we have 2 types of indexes:

physical off/len of a data struct (e.g., flatbuffer or arrow), where each data struct is given a logical sequence number.
data content indexes (column values) referring to a data struct's logical sequence number.

These separate logical data from its physical placement. However our build index function currently overloads the building of the physical index when building the content indexes.

Task 1: The physical index building should be separated out of that function, such that it can be called either when building the content indexes, or during a maintenance/reorganization phase, or when appending new data structs to the current object.

Augment Ceph Copyfrom() interface

To perform data relocation within the storage layer, rather than externally via a client, we would like to provide the capability of
target_obj.copyfrom(source_obj,...,p)

This copies the full src obj to the target obj.

Since copyfrom is an existing interface in Ceph this can potentially be augmented to include a few extra parameters p. The parameters requires include an off/len pair to copyfrom src, rather than the entire object. Other parameters may include a transformation function to be applied or other metadata to specify functionality.

We may wish to implement a cls_skyhook_transform() method during the copy process, either at the source or target obj.

One way to do this is to modify the current copyfrom() to include new our new parameters and functionality required. Another way is to create our own cls_skyhook_copy_from() method that intiates and coordinates any transforms and then wraps the existing copyfrom() method or simply re-implements the necessary subset of Ceph's copyfrom() method inside our cls class (preferable just cut/paste existing code then modify as needed).

test: add flatbuffer layout

Taken from uccross/skyhookdm-ceph#9.

# TESTS BELOW with Skyhookdb Flatbuf layout (2 objs)
bin/rados mkpool tpchflatbuf 

# NOTE: existing objs 1 and 6 stored as below contain 14 rows total
bin/rados mkpool tpchflatbuf;
yes | PATH=$PATH:bin ../src/progly/rados-store-glob.sh tpchflatbuf  /mnt/storage1/tpchdata/flabuffer_test_objs/skyhookdb.testdb.lineitem.oid.*

# CHECK DATA OBJECTS
bin/rados df  # should see tpchflatbuf pool with our 2 objs

# TEST QUERY “FLATBUF” NO CLS
# selectivity=100%
# expect “total result row count: 14 / 14; nrows_processed=14"
bin/run-query --num-objs 2 --pool tpchflatbuf --wthreads 10 --qdepth 10 --quiet --query flatbuf 

# TEST QUERY “FLATBUF” WITH CLS
# selectivity=100%
# expect “total result row count: 14 / 14; nrows_processed=14"
bin/run-query --num-objs 2 --pool tpchflatbuf --wthreads 10 --qdepth 10 --quiet --query flatbuf --use-cls

create a librados cls sdk mock

implement arrow::dataset::Fragment on (invoke cls via librados)

Implement Fragment API on librados (invoke cls methods). Add tests as well

extract utility functions that are shared by server- and client-side code

Code that is shared between the tabular cls and clients resides in the same source files. This in turn is causing the build to exit with exit code 2, which results from adding the same CC code on multiple cmake targets. (see this failed build).

For this we have to:

refactor code so that functions that are shared between cls_tabular and client-side applications are defined in a standalone library.
define a target for this utility library.
include this library target in the cls and all client-side (executable) targets such as run-query, etc.

bring commit history from ceph fork repo

bring the commits from the https://github.com/uccross/skyhookdm-ceph repo so history about contributors is not lost

port wiki documentation to readthedocs

create outline on readthedocs (revise the github wiki one)
look at the github wiki documentation and decide what to change/add/remove
bring the wiki documentation to readthedocs

Add timestamp types

We should add timestamp to SDT_TYPEs here. Currently we only have SDT_DATE (only mmddyy), Ashay suggests to add new type names SDT_TSp, SDT_TIME, SDT_TIMETZ.

Corresponding Postgres types are here.

We can use posix time (ptime) structs as here, which have string constructors.

Currently dates are stored as strings and extracted here as a string, from the flatbuffer format; then converted/compared here.

If the date passes the compare function, then it is re-encoded into the output (flatbuffer format) here.

And also will need a corresponding output as csv print method, and a PostgreSQL binary format.

add step to CI pipeline to push rook-compatible image to dockerhub

Write append new flatbuffer layout to Ceph

Replace the local file write of binary flatbuffer object data with our Ceph writer client to perform write append of binary flatbuffers to their respective object ids (Ceph oid = flatbuffer bucket_id).

Originally Ceph write append flatbuffers was mixed in with issue uccross/skyhookdm-ceph#3; closing that issue and moving Ceph writer portion into this issue.

Supports vector/list column types in a table.

This needs to be done: https://github.com/uccross/skyhookdm-ceph/blob/skyhook-luminous/src/cls/tabular/cls_tabular_utils.h#L115

Add extensibility mechanism for sets of objects

Just as there is the cls mechanism for extending objects, we need an extensibility mechanisms for managing sets of objects. The most established concept of managing sets of objects is files. The MDS is already extensible thanks to Michael and Noah's work (Malacology, Mantle, Cudele, Tintenfish) of which some if it is already merged upstream (I think mostly Mantle, so load-balancing things). The connection point of RADOS and MDS is the CephFS client which exists as kernel module and as library (libcephfs). I think libcephfs is a great starting point for an extensibility mechanism that allows the definition of new types of files and directories, install them in the MDS, and allow CephFS clients to dynamically link to libraries similar to what OSDs do for cls extensions. There needs to be also a software package system that allows CephFS and OSD extensions to be bundled since they depend on each other. SkyhookDM for tables might be one of those packages. SkyhookDM for arrays another.

port skyhook to make use of the rados SDK

decouple the build from requiring the entire ceph source tree

Document how to use rados namespaces as buckets

Database statistics collection on partitioned data

This issue will develop object class methods (cls) that perform statistics collection on an object's formatted data partitions. develop object-class methods to compute data statistics (histograms) for each object and store them in a query-able format within each storage server’s local RocksDB, then write client code to accumulate all the object-local statistics into global statistics for a given database table.

Statistics should be collected as a histogram, then returned to the client calling function as en encoded struct representing the column name and a sequence of buckets with counts in each. Initially we should consider course grained histogram of 10 buckets per column, and later to be extended to a user-configurable value.

More information can be found here in the stats collection operation struct, which will contain the statistics collection instructions (col name, etc), and the col_stats struct that will contain the results to be returned to the caller.

Upgrade to new Apache Arrow v1.0.0

Arrow v1.0 was recently released. Update our code to this version. Guarantees backward compatibility from now forward.

move to sdk

the rados-sdk branch has WIP on porting the cls to the RADOS sdk. Known things that the SDK does not provide yet (not an exhaustive list, as we're still investigating):

the ENCODE, ENCODE_FINISH, ... macros. For example they’re used here https://github.com/uccross/skyhookdm-ceph-cls/blob/nautilus/src/cls/cls_tabular.h#L141 . Those are defined in a private include/encoding.h header.
the other methods that the sdk doesn’t expose are the ones that work with lists of keys for the object map (cls_cxx_map_set_vals and cls_cxx_map_get_vals)

some alternatives to solving these:

find different ways to accomplish the same without requiring more functionality exposed via the sdk header.
expose it on our fork, and open a PR.
open an issue on the upstream repo and wait for the community to address that.

create skyhook client-side library

allow applications to call a c/c++ library directly (instead of interfacing with the run-query binary). This entails defining the skyhookdm C++ API which may (or not) hide cls/cls_tabular.h. When deployed, this client library will have as dependency librados.

incorporate buildkit caching feature to CI pipeline

Docker's buildkit (19.03+) incorporates a feature for caching images in a registy (for more see here). We can incorporate this to the CI pipeline so that we build the builder as part of the travis build, which results in not needing to maintain a separate builder repo (which in turns has us having to manually sync it whenever we modify the builder logic).

Option to convert Arrow table to/from Parquet on disk.

After transforming from Flatbuffer to Arrow data format, we should include the option to save the arrow table as Arrow or Parquet format. This should probably go here. Our tests have shown the conversion from Arrow to Parquet for LINEITEM table does save disk space due to compression.

Similarly when reading Arrow format from disk, we should check if the data is stored in Parquet format if so convert to Arrow before processing. Probably here after unpacking the blob from fbmeta, we should convert from Parquet blob to Arrow in mem before calling processArrow(). We do not need a processParquet() function at this time. Since blob data will already be in memory we should always convert and treat it as Arrow once it is read from disk to memory.

We have already included Parquet libs in our install_deps.sh alongside Arrow, so the required Parquet API (if needed) is already present and can be included similarly to Arrow as here.

publish python library to pypi.org

once uccross/skyhookdm-ceph#93 is implemented, we can implement a release pipeline that pushes to python packages to pypi.org. This would allow people to execute:

pip install skyhookdm

use system boost libraries in CI pipeline instead to avoid building them on every job

most of the time it takes to run the CI job (approx. 30%) is spent building boost. We can reduce this by installing boost on the builder image and then configuring the build with -DWITH_SYSTEM_BOOST=ON.

create python library

Once a C/C++ API is defined (#90), we can create python bindings for skyhook (e.g. using cython). Similarly to the c/c++ library , this might require to have python-rados as a dependency.

arrow-cls: InMemoryFragment interface to object store's API

as a followup to #60, we will add the ability to write and read InMemoryFragment using the CLS SDK's interface. We will also add unit tests as part of this issue too.

ci: automated performance testing

uccross/skyhookdm-ceph#9 added correctness testing (i.e. build and unit tests) in Travis CI. We also want performance testing:

We need to implement 3 components:

Dockerhub builds Skyhook images
Cloudlab/Internal cluster templates for deploying, testing, and measuring performance
Jenkins interface for push-button testing

Images are here.

add step to CI pipeline to bump up the submodule commit of the ceph fork

Add numeric/decimal types

Numeric and Decimal types are structs in Postgres, and supported by other formats as well such as Arrow and Parquet, but not in Flatbuffers. Postgres has function for string to numeric types as well.

Skyhook can store these types as string, like we do for date type, such as here, that calls compare with 2 strings and the compare converts to date types. Or could convert first then use a custom compare for 2 structs.

We should add these 2 to SDT_TYPEs here.

setup readthedocs folder

add docs/ folder with setup for generating documentation via readthedocs.

arrow-cls: work on benchmarks

create a benchmark based on this https://blog.thedataincubator.com/2018/05/sqlite-vs-pandas-performance-benchmarks/ . Initially, we'll create a partitioning of parquet/ipc files and use that to benchmark IPC vs Parquet implementations. We'll then run this on rados (once it's ready)

Compaction of Arrow data structures

This project will develop object class methods that will merge (or conversely split) formatted data partitions within an object. Self-contained partitions are written (appended) to objects and over time objects may contain a sequence of independent formatted data structures. A compaction request will invoke this method that will iterate over the data structures, combining (or splitting) them into a single larger data structure representing the complete data partition. In essences, this methods will perform a read-modify-write operation on an object's local data.

Task 1: A given object may contain a sequence of serialized Arrow data structures on disk. This task will read 2 or more Arrow structures and combine them into a single Arrow structure. Essentially this can be thought of as compaction, which is needed since objects may contain multiple smaller Arrow structs after some number of row inserts over time. The inverse of compaction -- i.e., splitting 1 Arrow struct into 2 -- is also part of this task.

Assumptions: Each Arrow struct has a known number of entries (rows), and the combined (or separated) struct will have a target max_rows or size. If the new expected size would be larger than the target max_rows or size, do not perform the compaction or splitting.

Task 2: The current physical off/len of each data struct (Arrow in this case) is stored in omap as an idx_fb_entry struct.. After combining the 2 Arrow structs into one struct, the idx_fb_entries for the old structs must be removed. Also note the max fb_seq_number is stored in omap and needs updated as well. This value is is always set during idx creation process, it indicates the current (i.e., max) number of data structs in the object, given the object contains a sequence of structs. For example, if there were 2 Arrow structs, the previous fb_seq_num should have been 2 and after combining, it should be set to 1. Alternatively, for safety the fb_seq_num key could be removed from omap, and later set elsewhere, outside of the compaction task, for instance in optional task#3 below.

Task 3: Optionally - a new physical idx entry (off/len) should be added to omap for the new combined struct. The idx_entry is normally built here, by looping over all bufferlists in an object. That functionality needs duplicated, or preferably that code should first be refactored to support building the idx_fb_entry separately from building the index data entries (opening issue #21 for that).

Add format specific print methods for all Skyhook supported formats.

Add a print method for each format, including a generic print as csv method with specified delim.

First, this current method should be renamed as a print_generic(const char *dataptr, size_t len, int format_type) and that method should then call the format specific print method.

Second, add an enum for data structure types, which are then used here by our skyhook_root struct that wraps all data formats.

Third, duplicate this format-specific print function, for each format specified in enum formats (several currently being explored).

Each new print function can be declared here similarly to here.

Lastly, add a default print delimiter here.

Upgrade repo to Ceph-Nautilus

Move from Luminous to Nautilus. Although Octopus stable version was just released we will wait on this for more user adoption.

arrow-cls: setup for running unit tests

in the arrow-cls branch, add ability to run tests. At a high-level:

copy https://github.com/ceph/ceph/blob/octopus/src/test/cls_sdk/test_cls_sdk.cc to src/cls_sdk_test.cc
add test_util.cc with utility functions so that cls_sdk_test.cc runs fine out-of-tree (in the SDK setup)
update the builder Dockerfile to include gtest
modify the cmake config to add
add a new step to the Popper workflow to run the tests

Create development environment instructions

The Docker instructions are assuming a forked Ceph respository.

Adapt build instructions (e.g. these)
Explain the role of vstart.sh
Explain the role of micro-osd.sh
Add high-level introduction to Popper and how it's being used to automate development tasks