This code is made available in accordance with the availability expectations for VLDB publications.
The published code is based on Apache AsterixDB and contains modifications to support the columnar format AMAX as well as the code generation framework.
This is only a prototype; a cleaner, reviewed version will be made available soon at https://github.com/apache/asterixdb
For an easier installation experience, you first need to install the following:
- Maven
  Apache Maven is needed to compile AsterixDB. You need Maven 3.3.9 or newer.
- Ansible
  Ansible is used to deploy an AsterixDB cluster. Using Ansible is the easiest way to run the code in this repository, as one of the Ansible YAML files in it has been modified to accommodate using GraalVM.
- GraalVM
  You need to install GraalVM for the code generation part. If you already have Java installed, it is probably HotSpot, not GraalVM. GraalVM is required to compile the code generated from SQL++ queries; with HotSpot, the generated code is executed in interpreter mode only.
* Please download GraalVM v21.3, as this is the version we used and know works :-)
** You only need to download and unzip GraalVM. You do NOT have to make it your default JVM.
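For reference, fetching and unpacking GraalVM CE 21.3 on Linux x64 might look like the sketch below. The exact release asset name is an assumption; verify it against the GraalVM releases page for your OS and architecture (the download itself is left commented out):

```shell
# Sketch: fetch and unpack GraalVM CE 21.3 (Java 11, Linux x64).
# The asset name is an assumption -- check the GraalVM releases page.
GRAALVM_VERSION=21.3.0
ASSET="graalvm-ce-java11-linux-amd64-${GRAALVM_VERSION}.tar.gz"
URL="https://github.com/graalvm/graalvm-ce-builds/releases/download/vm-${GRAALVM_VERSION}/${ASSET}"
echo "Would download: ${URL}"
# mkdir -p ~/vms && curl -L -o "/tmp/${ASSET}" "${URL}" && tar -xzf "/tmp/${ASSET}" -C ~/vms
```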
After cloning this repo:
$ git clone https://github.com/wailyk/column.git
Run the following commands:
$ cd asterixdb
$ mvn clean package -DskipTests
Now, copy the compiled AsterixDB code to your HOME folder:
$ cp asterixdb/asterix-server/target/asterix-server-0.9.8-SNAPSHOT-binary-assembly.zip ~/
Unzip asterix-server-0.9.8-SNAPSHOT-binary-assembly.zip
$ cd ~
$ unzip asterix-server-0.9.8-SNAPSHOT-binary-assembly.zip
Good! Now you need to modify one file. Using your favorite editor, open:
apache-asterixdb-0.9.8-SNAPSHOT/opt/ansible/yaml/instance_start.yml
You need to change <PATH_TO_GRAALVM> to the location of GraalVM, for example:
environment:
  JAVA_HOME: /home/wail/vms/graalvm-ce-java11-21.3.2
*Make sure you do that for the two commented YAML blocks (i.e., the ones under [ncs] and [cc]).
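If you prefer scripting this substitution, a sed one-liner works. The sketch below demonstrates it on a throwaway copy; point GRAALVM_HOME at wherever you unzipped GraalVM, and run the same sed against apache-asterixdb-0.9.8-SNAPSHOT/opt/ansible/yaml/instance_start.yml:

```shell
# Demo of the placeholder substitution on a temporary copy.
GRAALVM_HOME=/home/wail/vms/graalvm-ce-java11-21.3.2   # example; use your own path
printf 'environment:\n  JAVA_HOME: <PATH_TO_GRAALVM>\n' > /tmp/instance_start_demo.yml
sed -i "s|<PATH_TO_GRAALVM>|${GRAALVM_HOME}|g" /tmp/instance_start_demo.yml
cat /tmp/instance_start_demo.yml
```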
Save and close instance_start.yml, then run the following commands:
$ cd ~/apache-asterixdb-0.9.8-SNAPSHOT/opt/ansible/bin
$ ./deploy.sh
Make sure you can see the folder asterixdb_column under your HOME folder. This confirms that AsterixDB was deployed successfully on your machine.
Start AsterixDB:
$ ./start.sh
Using your favorite browser, open the following link: http://localhost:19006. You should see AsterixDB's web console.
Perfect! You can now start using AsterixDB. Let's load some data and try a few queries. First, run the following DDLs:
DROP DATAVERSE ColumnTest IF EXISTS;
CREATE DATAVERSE ColumnTest;
USE ColumnTest;
CREATE TYPE OpenType AS {
uid: uuid
};
-- Create a Dataset using AsterixDB's default row format
CREATE DATASET RowDataset (OpenType)
PRIMARY KEY uid AUTOGENERATED;
-- Create a Dataset using our proposed columnar format
CREATE DATASET ColumnDataset (OpenType)
PRIMARY KEY uid AUTOGENERATED WITH {
-- Tells AsterixDB to store the records as columns
"dataset-format":{"format":"column"}
};
We provide a sample dataset with the code as well. It is a smaller version of the sensors dataset from the paper. Let's load the data into the two declared datasets, RowDataset and ColumnDataset:
USE ColumnTest;
LOAD DATASET RowDataset USING localfs (
("path" = "localhost:///<HOME>/apache-asterixdb-0.9.8-SNAPSHOT/opt/data/sensors.json"),
("format" = "json")
);
LOAD DATASET ColumnDataset USING localfs (
("path" = "localhost:///<HOME>/apache-asterixdb-0.9.8-SNAPSHOT/opt/data/sensors.json"),
("format" = "json")
);
**Make sure to change <HOME> to the location of your HOME folder. For example, mine is /home/wail, so the full path will be:
localhost:///home/wail/apache-asterixdb-0.9.8-SNAPSHOT/opt/data/sensors.json
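A quick way to print the exact path string for your machine (so you can paste it into the LOAD statements) is:

```shell
# Build the localfs path for the LOAD statements from your HOME folder.
SENSORS="localhost://${HOME}/apache-asterixdb-0.9.8-SNAPSHOT/opt/data/sensors.json"
echo "${SENSORS}"
```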
After loading the dataset, let us run the following queries:
Row
USE ColumnTest;
SELECT VALUE COUNT(*)
FROM RowDataset;
Column
USE ColumnTest;
SELECT VALUE COUNT(*)
FROM ColumnDataset;
Column with Code Generation
USE ColumnTest;
SET `compiler.codegen` "true";
SELECT VALUE COUNT(*)
FROM ColumnDataset;
Row
USE ColumnTest;
SELECT VALUE MAX(r.temp)
FROM RowDataset s, s.readings r;
Column
USE ColumnTest;
SELECT VALUE MAX(r.temp)
FROM ColumnDataset s, s.readings r;
Column with Code Generation
USE ColumnTest;
-- Enable code generation
SET `compiler.codegen` "true";
SELECT VALUE MAX(r.temp)
FROM ColumnDataset s, s.readings r;
From the web console, you can inspect the query plans. To get a detailed plan, change PLAN FORMAT to STRING. The dropdown menu for changing the plan format is located on top of the query box (where you write queries). After executing a query, click PLAN, located in the result box, to show the query plan. For example, Q2's plan with column + code generation looks like:
distribute result [$$45]
-- DISTRIBUTE_RESULT |UNPARTITIONED|
exchange
-- ONE_TO_ONE_EXCHANGE |UNPARTITIONED|
aggregate [$$45] <- [agg-global-sql-max($$48)]
-- AGGREGATE |UNPARTITIONED|
exchange
-- RANDOM_MERGE_EXCHANGE |PARTITIONED|
data-scan [$$48]<-[$$44, $$s, $$46, $$r, $$41, $$48] <- ColumnTest.ColumnDataset code >>
01| //reader0: {readings:[{temp:any}]}
02| function ColumnDataset0Func (cursor, resultWriter, reader0) {
03| var0 = NULL;
04| while (cursor.next()) {
05| reader0.next();
06| while (!reader0.isEndOfArray()) {
07| var1 = reader0.getValue();
08| var0 = var0 /\ var1;
09| reader0.next();
10| }
11| }
12| append(resultWriter, var0);
13| flush(resultWriter);
14| }
-- DATASOURCE_SCAN |PARTITIONED|
exchange
-- ONE_TO_ONE_EXCHANGE |PARTITIONED|
empty-tuple-source
-- EMPTY_TUPLE_SOURCE |PARTITIONED|
Unfortunately, due to redistribution restrictions, we cannot share most of the datasets used in our experiments; for example, Twitter restricts republishing data collected from its API. Due to their sizes, we also cannot host the datasets for download. However, for the synthetic sensors dataset, we provide a sample that can be used to reproduce the sensors queries. We also added a flag that scales the data loaded into AsterixDB, which can mimic the results produced in the paper. To do so, add the following configuration to the LOAD statement:
-- Scale factor = 4000
("expression" = "repeat,4000")
Thus, to load the data, you can do:
USE ColumnTest;
LOAD DATASET RowDataset USING localfs (
("path" = "localhost:///<HOME>/apache-asterixdb-0.9.8-SNAPSHOT/opt/data/sensors.json"),
("format" = "json"),
("expression" = "repeat,4000")
);
LOAD DATASET ColumnDataset USING localfs (
("path" = "localhost:///<HOME>/apache-asterixdb-0.9.8-SNAPSHOT/opt/data/sensors.json"),
("format" = "json"),
("expression" = "repeat,4000")
);
The scale factor 4000 produces a dataset of similar size to the one used in our experiments.
To test the ingestion performance, you can use an AsterixDB file feed, as in the following example for the ColumnDataset:
DROP DATAVERSE ColumnTest IF EXISTS;
CREATE DATAVERSE ColumnTest;
USE ColumnTest;
CREATE TYPE OpenType AS {
uid: uuid
};
-- Create type for feed
CREATE TYPE FeedType AS { };
-- Create a Dataset using our proposed columnar format
CREATE DATASET ColumnDataset (OpenType)
PRIMARY KEY uid AUTOGENERATED WITH {
-- Tells AsterixDB to store the records as columns
"dataset-format":{"format":"column"}
};
-- Create a Feed
DROP FEED Feed0 IF EXISTS;
CREATE FEED Feed0 WITH {
"adapter-name": "localfs",
"type-name": "FeedType",
"path": "localhost:///<HOME>/apache-asterixdb-0.9.8-SNAPSHOT/opt/data/sensors.json",
"format": "json",
-- Scale data
"expression": "repeat,2000"
};
-- Create a second feed
DROP FEED Feed1 IF EXISTS;
CREATE FEED Feed1 WITH {
"adapter-name": "localfs",
"type-name": "FeedType",
"path": "localhost:///<HOME>/apache-asterixdb-0.9.8-SNAPSHOT/opt/data/sensors.json",
"format": "json",
-- Scale data
"expression": "repeat,2000"
};
-- Connect Feed0 to ColumnDataset
CONNECT FEED Feed0 TO DATASET ColumnDataset;
-- Connect Feed1 to ColumnDataset
CONNECT FEED Feed1 TO DATASET ColumnDataset;
-- Start both feeds (start ingesting the data)
START FEED Feed0;
START FEED Feed1;
*As with LOAD, make sure to change <HOME> to the location of your HOME folder.
**Note that we used two feeds to ingest the data, similar to what we did in the paper. You can connect more or fewer feeds as you see fit.
***The previous DDL is for the ColumnDataset. You can do the same for the RowDataset by simply removing "dataset-format":{"format":"column"} from the CREATE DATASET statement.
Feeds in AsterixDB run in the background. Thus, we need to monitor when the ingestion is finished. To do so, run the following command:
$ tail -f ~/asterixdb_column/logs/nc-1.log | grep "Total ingestion time"
After ingestion, you need to ensure everything is flushed to disk by running the following commands:
For the RowDataset:
$ curl http://localhost:19002/connector\?datasetName\=RowDataset\&dataverseName\=ColumnTest
For the ColumnDataset:
$ curl http://localhost:19002/connector\?datasetName\=ColumnDataset\&dataverseName\=ColumnTest
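The two curl calls can also be issued in one small loop; the sketch below only prints each connector URL (uncomment the curl line to actually hit the endpoint on a running cluster):

```shell
# Print (and optionally issue) the connector request for each dataset.
for ds in RowDataset ColumnDataset; do
  url="http://localhost:19002/connector?datasetName=${ds}&dataverseName=ColumnTest"
  echo "${url}"
  # curl "${url}"
done
```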
After ingesting/loading the data, you can run the queries for the sensors data:
-- Q1
USE ColumnTest;
-- SET `compiler.codegen` "true";
SELECT VALUE COUNT(*)
FROM <DATASET>;
-- Q2
USE ColumnTest;
-- SET `compiler.codegen` "true";
SELECT VALUE MAX(r.temp)
FROM <DATASET> s, s.readings r;
-- Q3
USE ColumnTest;
-- SET `compiler.codegen` "true";
SELECT sid , max_temp
FROM <DATASET> s , s.readings as r
GROUP BY s.sensor_id as sid
WITH max_temp as MAX(r.temp)
ORDER BY max_temp DESC
LIMIT 10;
-- Q4
USE ColumnTest;
-- SET `compiler.codegen` "true";
SELECT sid , max_temp
FROM <DATASET> s , s.readings as r
WHERE s.report_time > 1556496000000
AND s.report_time < 1556496000000 + 24 * 60 * 60 * 1000
GROUP BY s.sensor_id as sid
WITH max_temp as MAX(r.temp)
ORDER BY max_temp DESC
LIMIT 10;
*You need to replace <DATASET> with either RowDataset or ColumnDataset.
** To enable code generation, uncomment the following flag by removing the leading --:
-- SET `compiler.codegen` "true";
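If you want to script timing runs instead of using the web console, queries can also be submitted to AsterixDB's HTTP query service on port 19002. A minimal sketch, which only prints the curl command here (drop the echo to actually run it against a live cluster):

```shell
# Submit Q1 on the columnar dataset with code generation enabled
# via AsterixDB's HTTP query service.
STMT='USE ColumnTest; SET `compiler.codegen` "true"; SELECT VALUE COUNT(*) FROM ColumnDataset;'
echo curl -s --data-urlencode "statement=${STMT}" http://localhost:19002/query/service
```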
To stop the cluster, run:
$ ./stop.sh
To delete the cluster, run:
$ ./erase.sh
For more information about AsterixDB, refer to AsterixDB's documentation.
The implementation of the language (AIL) used in our code generation is a fork of Truffle's SimpleLanguage.
If you have questions, please email me (my email is on the paper :-)), or open an issue here on GitHub. Again, please remember this is unreviewed code: it is only a research prototype and NOT meant to be used in production.
- Users mailing list: [email protected]
  Join the list by sending an email to [email protected]
- Developers and contributors mailing list: [email protected]
  Join the list by sending an email to [email protected]