intel-ai / omniscidb
This project forked from heavyai/heavydb
OmniSciDB (formerly MapD Core)
Home Page: https://intel-ai.github.io/omniscidb/
License: Apache License 2.0
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from Select
[ RUN ] Select.FilterAndSimpleAggregation
Tests/SQLiteComparator.cpp:96: Failure
Expected equality of these values:
ref_val
Which is: 15
omnisci_val
Which is: 0
CPU: SELECT COUNT(*) FROM test WHERE o1 > '1999-09-08';
Tests/SQLiteComparator.cpp:96: Failure
Expected equality of these values:
ref_val
Which is: 0
omnisci_val
Which is: 15
CPU: SELECT COUNT(*) FROM test WHERE o1 <= '1999-09-08';
2022-04-16T23:19:51.852455 F 149613 0 0 ArrowStorage.cpp:122 Check failed: elems > 0 (0 > 0)
Aborted (core dumped)
The current sort implementation generates a permutation buffer by allocating an indices buffer and sorting it using a comparator that fetches values from the data buffer by the given indices and compares them (see ResultSetComparator). This leads to random access to the data buffer, so the implementation does not scale well with multiple threads. We need to consider better options to avoid that, such as sorting the specified columns in place, recording the resulting permutation, and applying that permutation to the remaining columns.
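The difference can be sketched in Python (illustrative only; the engine's implementation is C++ and the function names here are made up):

```python
def sort_via_index_buffer(key_col, other_cols):
    # Current scheme: sort an index buffer with a comparator that fetches
    # key values through the indices; every comparison is a random access
    # into key_col.
    perm = sorted(range(len(key_col)), key=key_col.__getitem__)
    return [[col[i] for i in perm] for col in (key_col, *other_cols)]

def sort_key_then_permute(key_col, other_cols):
    # Proposed scheme: sort (value, original_index) pairs so the key data
    # is read sequentially, then apply the recorded permutation to the
    # remaining columns.
    pairs = sorted(zip(key_col, range(len(key_col))))
    perm = [i for _, i in pairs]
    return [[v for v, _ in pairs]] + [[col[i] for i in perm] for col in other_cols]

key, vals = [3, 1, 2], [30, 10, 20]
assert sort_via_index_buffer(key, [vals]) == [[1, 2, 3], [10, 20, 30]]
assert sort_key_then_permute(key, [vals]) == [[1, 2, 3], [10, 20, 30]]
```

Both produce the same result; the second touches the key data sequentially, which is the property that should help multi-threaded scaling.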
Compilation with no chunks metadata
no group by aggregate
perfect hash
baseline hash
projection + filters
joins
OmniSciDB has accumulated some technical debt. Here we collect related issues, accompanied by proposals for possible solutions.
Information about running the TimeSortValues test can be found in #216.
The given commit adds a call to convertToArrowTable with the following timing:
0ms start(0ms) columnar conversion ArrowResultSetConverter.cpp:1181
34380ms start(0ms) row-wise conversion ArrowResultSetConverter.cpp:1201
32564ms start(0ms) fetch data single thread ArrowResultSetConverter.cpp:1206
1754ms start(32564ms) append data single thread ArrowResultSetConverter.cpp:1219
0ms start(34319ms) finish builders single thread ArrowResultSetConverter.cpp:1228
$ asv run --launch-method=forkserver --config ./modin/asv_bench/asv.conf.omnisci.json -b ^omnisci.benchmarks.TimeSortValues -a repeat=1 -a number=1 --show-stderr --python=same --set-commit-hash HEAD
· Discovering benchmarks
· Running 1 total benchmarks (1 commits * 1 environments * 1 benchmarks)
[ 0.00%] · For modin commit be2716f3 <master>:
[ 0.00%] ·· Building for existing-py_localdisk2_signatov_miniconda3_envs_omnisci-dev_bin_python3.8
[ 0.00%] ·· Benchmarking existing-py_localdisk2_signatov_miniconda3_envs_omnisci-dev_bin_python3.8
[ 0.00%] ··· Importing benchmark suite produced output:
[ 0.00%] ···· UserWarning: The pandas version installed 1.3.4 does not match the supported pandas version in Modin 1.3.5. This may cause undesired side effects!
[ 50.00%] ··· Running (omnisci.benchmarks.TimeSortValues.time_sort_values--).
[100.00%] ··· omnisci.benchmarks.TimeSortValues.time_sort_values ok
[100.00%] ··· ================ ============
-- columns_number / ascending_list
---------------- ------------
shape 5 / False
================ ============
[10000000, 10] 44.8±0.03s
================ ============
So at = curs.getArrowTable() is much slower now than rb = curs.getArrowRecordBatch().
To do an in-place argsort in our new one-column sort optimization, we need to copy the column into a temporary buffer. This is already implemented in the ColumnarResult class, but we cannot use it directly because it copies the whole table. So we could extract the column materialization function into ResultSet and use it in the ColumnarResult class as well.
We have an output buffer initialization overhead. We previously worked on disabling such initialization, but new tests revealed another problem in this area. We might have a case where we initialize 4 bytes per output row. For series, such initialization becomes quite significant: 8-10% of the whole execution time. It should be possible to get rid of this initialization.
CUDA fails some tests on more recent versions of LLVM.
As stated in the subject.
We want to have regular, conda-accessible builds of OmniSci with Python bindings to enable a better user experience for our Modin users.
For some reason, initColumnarGroups restricts count distinct queries. We need to find out why, and possibly remove that restriction. See #224.
Compilation produces many warnings; we need to get rid of them.
Related to issue #175. We need to improve the performance of the function on a single column.
The plan:
Build and run ExecuteTest in CPU mode. Consider the ExecuteTest ArrowStorage variant to avoid having to build and run initdb and create a data dir.
This commit removed a set of global flag usages in NativeCodegen.cpp. Those flags were declared under the GEOS flag, which gave the impression that all their usages were GEO-related. We need to get them back.
The DBEngine::importArrowTable() function incorrectly imports tables with bools: it switches true to false and vice versa.
To reproduce, create a CSV file with the following values:
i,bv,val
0,False,100
1,True,200
Then load the CSV file using the Arrow API and print the content of the columns. After that, load the table using the DBEngine::importArrowTable() function, run SELECT * FROM <your_table>, get the table, and print its content. You will see something like this:
[
[
0,
1
]
]
[
[
true, # must be false
false # must be true
]
]
[
[
100,
200
]
]
"""
Issue 218 reproduction code
To be launched via:
PYTHONPATH=/nfs/site/home/gnovichk/shm/modin \
MODIN_OMNISCI_LAUNCH_PARAMETERS=enable-debug-timer=1, \
data=omnitmp MODIN_EXPERIMENTAL=true \
MODIN_STORAGE_FORMAT=omnisci MODIN_ENGINE=native python <name_of_this_script.py>
"""
import pandas
import pyarrow
import sys
prev = sys.getdlopenflags()
sys.setdlopenflags(1 | 256) # RTLD_LAZY+RTLD_GLOBAL
from omniscidbe import PyDbEngine
sys.setdlopenflags(prev)
def get_data():
df = pandas.DataFrame({"i": [0,1], "bv": [False, True], "val":[100,200] })
return df, pyarrow.Table.from_pandas(df)
server = PyDbEngine(
data="omnitmp",
enable_columnar_output=1,
enable_lazy_fetch=0,
enable_debug_timer=0,
calcite_port=4564
)
df, table = get_data()
server.importArrowTable("test_table", table)
cursor = server.executeDML("SELECT * FROM test_table")
df_at = pyarrow.Table.to_pandas(cursor.getArrowTable())
df_rb = pyarrow.Table.to_pandas(cursor.getArrowRecordBatch())
print ("===== Original Pandas Dataframe ======================")
print (df)
print ("===== Dataframe returned via getArrowTable() =========")
print (df_at)
print ("===== Dataframe returned via getArrowRecordBatch() ===")
print (df_rb)
print ("======================================================")
Execution results:
===== Original Pandas Dataframe ======================
i bv val
0 0 False 100
1 1 True 200
===== Dataframe returned via getArrowTable() =========
i bv val
0 0 True 100
1 1 False 200
===== Dataframe returned via getArrowRecordBatch() ===
i bv val
0 0 True 100
1 1 False 200
======================================================
The following usage (not intended by design) produces a double free (double free or corruption (!prev)):
std::ifstream is(g_data_path);
std::string line;
while (std::getline(is, line)) {
insertCsvValues("trips", line);
}
somewhere around:
#0 0x00007f03f961cc31 in arrow::ChunkedArray::Make(std::vector<std::shared_ptr<arrow::Array>, std::allocator<std::shared_ptr<arrow::Array> > >, std::shared_ptr<arrow::DataType>) () from /home/petr/anaconda3/envs/omnisci-dev/lib/libarrow.so.500
#1 0x0000557e67ead4fb in ArrowStorage::appendArrowTable (this=0x557e698a6010, at=..., table_id=<optimized out>) at /data/anaconda3/envs/omnisci-dev/x86_64-conda-linux-gnu/include/c++/9.4.0/bits/stl_vector.h:99
#2 0x0000557e67eb3674 in ArrowStorage::appendCsvData (this=0x557e698a6010, csv_data=..., table_id=<optimized out>, parse_options=...) at /data/anaconda3/envs/omnisci-dev/x86_64-conda-linux-gnu/include/c++/9.4.0/ext/atomicity.h:96
#3 0x0000557e67eb38d4 in ArrowStorage::appendCsvData (this=0x557e698a6010, csv_data=..., table_name=..., parse_options=...) at /home/petr/code/offloading/omni/ArrowStorage/ArrowStorage.cpp:589
#4 0x0000557e676bcd87 in TestHelpers::ArrowSQLRunner::(anonymous namespace)::ArrowSQLRunnerImpl::insertCsvValues (values=..., table_name=..., this=<optimized out>) at /data/anaconda3/envs/omnisci-dev/x86_64-conda-linux-gnu/include/c++/9.4.0/bits/shared_ptr_base.h:1020
#5 TestHelpers::ArrowSQLRunner::insertCsvValues (table_name=..., values=...) at /home/petr/code/offloading/omni/Tests/ArrowSQLRunner/ArrowSQLRunner.cpp:318
We have scripts which populate data and run queries using the Google Benchmark framework, checked into the repo: https://github.com/intel-ai/omniscidb/blob/jit-engine/Benchmarks/taxi/taxi_reduced_bench.cpp
We should run these nightly as part of our CI. Google Benchmark can output in various formats, which should be readable by Grafana or some other monitoring tool. We could also just output the raw values and visually compare different runs.
Google Benchmark also has a compare tool, which we might leverage to make runs report status automatically: https://github.com/google/benchmark/blob/main/docs/tools.md (see the bottom section on the U test).
Please extend NoCatalogRelAlgTest with a test for streaming aggregation (e.g., select sum(x) from table;).
Use the fexolm/streaming-comilation branch.
We need to find out which execute tests are suitable for streaming and adapt them.
Use GitHub actions and scripts from #369
Currently, we can only operate on data from a single database. Its ID is stored in RelAlgExecutor and Executor. We should get rid of this limitation by extending the proper structures (like InputDescriptor).
A subtask would be to change how a dictionary is received from DataMgr. Currently, we use database id + dictionary id. That means SQLTypeInfo doesn't fully describe the dictionary it refers to. The easiest fix is probably to encode the schema id in the highest byte of the dictionary id, so we can choose the proper data provider in PersistentDataMgr and leave the rest to the data providers.
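A sketch of the proposed packing (illustrative Python; the names and the 24-bit split are assumptions, not existing API):

```python
SCHEMA_SHIFT = 24                     # schema id lives in the highest byte
DICT_MASK = (1 << SCHEMA_SHIFT) - 1   # low 24 bits keep the dictionary id

def pack_dict_id(schema_id: int, dict_id: int) -> int:
    # Encode the schema id into the top byte of a 32-bit dictionary id so
    # the id alone is enough to select the proper data provider.
    assert 0 <= schema_id < 256 and 0 <= dict_id <= DICT_MASK
    return (schema_id << SCHEMA_SHIFT) | dict_id

def unpack_dict_id(packed: int):
    # Split the packed id back into (schema_id, dict_id).
    return packed >> SCHEMA_SHIFT, packed & DICT_MASK

assert unpack_dict_id(pack_dict_id(3, 42)) == (3, 42)
```

The same split could of course be done with a wider shift if 24 bits of dictionary ids ever become insufficient.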
Placeholder for tracking tasks about the subject.
Hash join fails when trying to join on a rowid column and throws the following exception:
TOmniSciException(error_msg=Hash join failed, reason(s): Could not build a 1-to-1 correspondence for columns involved in equijoin | Cannot join on rowid | Cannot fall back to loop join for non-trivial inner table size)
import sys
sys.setdlopenflags(1 | 256) # RTLD_LAZY+RTLD_GLOBAL
from omniscidbe import PyDbEngine
import pyarrow as pa
server = PyDbEngine()
# The behaviour is reproducible only if the length of the tables is greater than 1000
N = 1001
tbl1 = pa.Table.from_pydict({"a": range(N)})
tbl2 = pa.Table.from_pydict({"b": range(N)})
server.importArrowTable("table1", tbl1)
server.importArrowTable("table2", tbl2)
print("Join condition: rowid = b")
res = server.executeRA(
"SELECT table1.a, table2.b FROM table1 JOIN table2 ON table1.rowid = table2.b"
).getArrowRecordBatch() # Works
print(res)
print("Join condition: a = rowid")
res = server.executeRA(
"SELECT table1.a, table2.b FROM table1 JOIN table2 ON table1.a = table2.rowid"
).getArrowRecordBatch() # Fails
print(res)
print("Join condition: rowid = rowid")
res = server.executeRA(
"SELECT table1.a, table2.b FROM table1 JOIN table2 ON table1.rowid = table2.rowid"
).getArrowRecordBatch()  # Fails
print(res)
Output:
Join condition: rowid = b
pyarrow.RecordBatch
a: int64
b: int64
Join condition: a = rowid
2021-11-26T16:43:54.940381 E 2352382 0 0 DBHandler.cpp:1385 Hash join failed, reason(s): Could not build a 1-to-1 correspondence for columns involved in equijoin | Cannot join on rowid | Cannot fall back to loop join for non-trivial inner table size
Traceback (most recent call last):
File "/localdisk1/dchigare/repos/modin_bp/testing.py", line 24, in <module>
res = server.executeRA(
File "dbe.pyx", line 196, in omniscidbe.PyDbEngine.executeRA
RuntimeError: TException - service has thrown: TOmniSciException(error_msg=Hash join failed, reason(s): Could not build a 1-to-1 correspondence for columns involved in equijoin | Cannot join on rowid | Cannot fall back to loop join for non-trivial inner table size)
This behaviour is reproducible on the intel-ai/omniscidb/modin branch (commit 7871987) as well as on the latest OmniSci 5.8.0 release (linux-64/pyomniscidbe-5.8.0-py37h1234567_1_cpu).
Ilya investigated the slow performance of a binary operation (mul was used as an example) for series in Modin on OmniSci. Several problems were revealed:
- There is a per-row overhead in dynamic modules used for service purposes (e.g., an in-memory counter of matched strings which is not registerized). With series this overhead becomes more visible because it is paid per each 4 bytes of input data. It makes the kernel run ~1.5x slower compared to dataframes with 10 columns and the same amount of data. Currently we don't have a proposal for this problem.
- We have an output buffer initialization overhead. We previously worked on disabling such initialization, but new tests revealed another problem in this area. We might have a case where we initialize 4 bytes per output row. For series, such initialization becomes quite significant: 8-10% of the whole execution time. It should be possible to get rid of this initialization.
- We spend much time on conversion of the result set into the Arrow record batch format. The current converter runs the conversion of each column in parallel, which means we have a single-threaded conversion for series output. Conversion takes 67% of all execution time. The proper solution here would be to convert into an Arrow table instead of a record batch. This would allow zero-copy conversion and remove most of the conversion overhead (we would still spend some time on building the nulls map).
Currently, I've commented out all executor cleanup code in the prepareStreamingExecution method. We need to re-enable it at some point.
Looks like #226 broke UdfTests:
In file included from /localdisk/ilyaenko/omnisci/build/Tests/../../Tests/Udf/udf_sample.cpp:5:
In file included from /localdisk/ilyaenko/omnisci/build/Tests/../../Tests/Udf/../../QueryEngine/OmniSciTypes.h:24:
/localdisk/ilyaenko/omnisci/build/Tests/../../Tests/Udf/../../QueryEngine/../Shared/InlineNullValues.h:371:28: error: no template named 'is_same_v' in namespace 'std'; did you mean 'is_same'?
std::enable_if_t<!std::is_same_v<V, bool> && std::is_integral<V>::value, int> = 0>
~~~~~^~~~~~~~~
is_same
/usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/type_traits:1292:12: note: 'is_same' declared here
struct is_same
^
In file included from /localdisk/ilyaenko/omnisci/build/Tests/../../Tests/Udf/udf_sample.cpp:5:
In file included from /localdisk/ilyaenko/omnisci/build/Tests/../../Tests/Udf/../../QueryEngine/OmniSciTypes.h:24:
/localdisk/ilyaenko/omnisci/build/Tests/../../Tests/Udf/../../QueryEngine/../Shared/InlineNullValues.h:371:47: error: expected '(' for function-style cast or type construction
std::enable_if_t<!std::is_same_v<V, bool> && std::is_integral<V>::value, int> = 0>
~~~~~~~~~~~~~~~~~~~~~~~ ^
Refer to: heavyai@74ada1b
I used a conda environment:
commit 609edcff6b2614de59f7a11f0bf2c2f6c603fb28 (HEAD)
Author: ienkovich <[email protected]>
Date: Thu Feb 17 08:15:23 2022 -0600
Add ArrowBasedExecuteTest skeleton.
Signed-off-by: ienkovich <[email protected]>
/localdisk2/signatov/miniconda3/envs/omnisci-dev/x86_64-conda-linux-gnu/include/c++/9.4.0/bits/shared_ptr.h:359:59: required from 'std::shared_ptr<_Tp>::shared_ptr(std::_Sp_alloc_shared_tag<_Tp>, _Args&& ...) [with _Alloc = std::allocator<Calcite>; _Args = {int, const int&, const char (&)[1], int, int, bool, const char (&)[1]}; _Tp = Calcite]'
/localdisk2/signatov/miniconda3/envs/omnisci-dev/x86_64-conda-linux-gnu/include/c++/9.4.0/bits/shared_ptr.h:701:14: required from 'std::shared_ptr<_Tp> std::allocate_shared(const _Alloc&, _Args&& ...) [with _Tp = Calcite; _Alloc = std::allocator<Calcite>; _Args = {int, const int&, const char (&)[1], int, int, bool, const char (&)[1]}]'
/localdisk2/signatov/miniconda3/envs/omnisci-dev/x86_64-conda-linux-gnu/include/c++/9.4.0/bits/shared_ptr.h:717:39: required from 'std::shared_ptr<_Tp> std::make_shared(_Args&& ...) [with _Tp = Calcite; _Args = {int, const int&, const char (&)[1], int, int, bool, const char (&)[1]}]'
/localdisk2/signatov/jit-eng/Tests/ArrowBasedExecuteTest.cpp:390:84: required from here
/localdisk2/signatov/miniconda3/envs/omnisci-dev/x86_64-conda-linux-gnu/include/c++/9.4.0/ext/aligned_buffer.h:91:28: error: invalid application of 'sizeof' to incomplete type 'Calcite'
91 | : std::aligned_storage<sizeof(_Tp), __alignof__(_Tp)>
| ^~~~~~~~~~~
/localdisk2/signatov/miniconda3/envs/omnisci-dev/x86_64-conda-linux-gnu/include/c++/9.4.0/ext/aligned_buffer.h:91:28: error: invalid application of 'sizeof' to incomplete type 'Calcite'
make[2]: *** [Tests/CMakeFiles/ArrowBasedExecuteTest.dir/build.make:76: Tests/CMakeFiles/ArrowBasedExecuteTest.dir/ArrowBasedExecuteTest.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:2932: Tests/CMakeFiles/ArrowBasedExecuteTest.dir/all] Error 2
make: *** [Makefile:166: all] Error 2
commit 00431639e3f5071abe229fe37a531682d4b501ee (HEAD)
Author: ienkovich <[email protected]>
Date: Thu Feb 24 06:59:34 2022 -0600
Add Select tests to ArrowBasedExecuteTest.
Signed-off-by: ienkovich <[email protected]>
/localdisk2/signatov/jit-eng/Tests/ArrowBasedExecuteTest.cpp: In static member function 'static void ExecuteTestBase::import_array_test(const string&)':
/localdisk2/signatov/jit-eng/Tests/ArrowBasedExecuteTest.cpp:733:5: error: 'ASSERT' was not declared in this scope
733 | ASSERT(tinfo);
| ^~~~~~
commit 4ebdf37b5eda3e9f8a87dd6cc194712843b68f9a (HEAD)
Author: Alex Baden <[email protected]>
Date: Tue Mar 8 23:27:12 2022 -0700
Introduce BufferProvider
In file included from /localdisk2/signatov/jit-eng/QueryEngine/Descriptors/RelAlgExecutionDescriptor.h:23,
from /localdisk2/signatov/jit-eng/QueryEngine/ArrowResultSet.h:21,
from /localdisk2/signatov/jit-eng/Embedded/DBEngine.cpp:24:
/localdisk2/signatov/jit-eng/QueryEngine/ResultSet.h:198:3: note: candidate: 'ResultSet::ResultSet(int64_t, int64_t, std::shared_ptr<RowSetMemoryOwner>)'
198 | ResultSet(int64_t queue_time_ms,
| ^~~~~~~~~
/localdisk2/signatov/jit-eng/QueryEngine/ResultSet.h:198:3: note: candidate expects 3 arguments, 8 provided
/localdisk2/signatov/jit-eng/QueryEngine/ResultSet.h:196:3: note: candidate: 'ResultSet::ResultSet(const string&)'
196 | ResultSet(const std::string& explanation);
| ^~~~~~~~~
/localdisk2/signatov/jit-eng/QueryEngine/ResultSet.h:196:3: note: candidate expects 1 argument, 8 provided
/localdisk2/signatov/jit-eng/QueryEngine/ResultSet.h:189:3: note: candidate: 'ResultSet::ResultSet(std::shared_ptr<const Analyzer::Estimator>, ExecutorDeviceType, int, Data_Namespace::DataMgr*, BufferProvider*, int)'
189 | ResultSet(const std::shared_ptr<const Analyzer::Estimator>,
| ^~~~~~~~~
/localdisk2/signatov/jit-eng/QueryEngine/ResultSet.h:189:3: note: candidate expects 6 arguments, 8 provided
/localdisk2/signatov/jit-eng/QueryEngine/ResultSet.h:174:3: note: candidate: 'ResultSet::ResultSet(const std::vector<TargetInfo>&, const std::vector<ColumnLazyFetchInfo>&, const std::vector<std::vector<const signed char*> >&, const std::vector<std::vector<long int> >&, const std::vector<long int>&, ExecutorDeviceType, int, const QueryMemoryDescriptor&, std::shared_ptr<RowSetMemoryOwner>, Data_Namespace::DataMgr*, BufferProvider*, int, unsigned int, unsigned int)'
174 | ResultSet(const std::vector<TargetInfo>& targets,
| ^~~~~~~~~
/localdisk2/signatov/jit-eng/QueryEngine/ResultSet.h:174:3: note: candidate expects 14 arguments, 8 provided
/localdisk2/signatov/jit-eng/QueryEngine/ResultSet.h:164:3: note: candidate: 'ResultSet::ResultSet(const std::vector<TargetInfo>&, ExecutorDeviceType, const QueryMemoryDescriptor&, std::shared_ptr<RowSetMemoryOwner>, Data_Namespace::DataMgr*, BufferProvider*, int, unsigned int, unsigned int)'
164 | ResultSet(const std::vector<TargetInfo>& targets,
| ^~~~~~~~~
/localdisk2/signatov/jit-eng/QueryEngine/ResultSet.h:164:3: note: candidate expects 9 arguments, 8 provided
make[2]: *** [Embedded/CMakeFiles/DBEngine.dir/build.make:90: Embedded/CMakeFiles/DBEngine.dir/DBEngine.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:2592: Embedded/CMakeFiles/DBEngine.dir/all] Error 2
make: *** [Makefile:166: all] Error 2
Extend one-column sorting (#226) to numeric types.
At some point, we want to have generic structured type support. Let's remove the current hardcoded structured GEO types; we don't need them as separate first-class citizens in the JIT Engine.
To determine the performance improvement of getArrowTable over getArrowRecordBatch, we need to profile both functions. For that, perform the following:
1. Create a table with N rows (varies, could be 30 mln, 3 mln, TBD) and a specified fragment_size.
2. Run "SELECT * FROM <your_table>".
3. Call the getArrowTable function and measure its execution time.
4. Call the getArrowRecordBatch function and measure its execution time.
As the time measurement of a function's execution may vary from run to run, it is desirable to do repeated measurements for each of steps 3 and 4. The number of repetitions must be determined empirically (consider taking a larger number of repetitions to keep the std. deviation as small as possible).
The time measurements should be done for various numbers of fragments (equivalently, fragment sizes).
The data obtained by these efforts should shed light on the overall performance of each function. As both functions use parallel execution, we may see how different numbers of fragments affect the performance.
Before AVX-512 parallelization is done, it is better to get the best possible performance by tinkering with the current implementation of getArrowTable.
Related issue: #200
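The repeated measurements above can be driven by a small harness (a sketch; in the real experiment fn would be the cursor's getArrowTable or getArrowRecordBatch call, and the trivial stand-in below only keeps the example runnable):

```python
import statistics
import timeit

def measure(fn, repeats=10):
    # Run fn `repeats` times and report the mean and std. deviation of the
    # wall-clock time; raise `repeats` until the deviation is small enough.
    times = [timeit.timeit(fn, number=1) for _ in range(repeats)]
    return statistics.mean(times), statistics.stdev(times)

# Stand-in workload; replace with lambda: cursor.getArrowTable() etc.
mean, stdev = measure(lambda: sum(range(100_000)))
assert mean > 0 and stdev >= 0
```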
query:
"SELECT passenger_count, extract(year from "
"pickup_datetime), trip_type, count(*) FROM trips GROUP BY 1, 2, "
"3;",
backtrace:
#0 0x00007fffd766c876 in ?? () from /lib/x86_64-linux-gnu/libigc.so.1
#1 0x00007fffd73ec664 in ?? () from /lib/x86_64-linux-gnu/libigc.so.1
#2 0x00007fffd76d581b in ?? () from /lib/x86_64-linux-gnu/libigc.so.1
#3 0x00007fffd824ef37 in llvm::FPPassManager::runOnFunction(llvm::Function&) () from /lib/x86_64-linux-gnu/libigc.so.1
#4 0x00007fffd8250729 in llvm::FPPassManager::runOnModule(llvm::Module&) () from /lib/x86_64-linux-gnu/libigc.so.1
#5 0x00007fffd8250b91 in llvm::legacy::PassManagerImpl::run(llvm::Module&) () from /lib/x86_64-linux-gnu/libigc.so.1
#6 0x00007fffd74845cf in ?? () from /lib/x86_64-linux-gnu/libigc.so.1
#7 0x00007fffd74516ba in ?? () from /lib/x86_64-linux-gnu/libigc.so.1
#8 0x00007fffd72db551 in ?? () from /lib/x86_64-linux-gnu/libigc.so.1
#9 0x00007fffd73b4d8d in ?? () from /lib/x86_64-linux-gnu/libigc.so.1
#10 0x00007fffe7b3c597 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#11 0x00007fffe7ac5ebb in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#12 0x00007fffe7ac75c6 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#13 0x00007fffe7ac7965 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#14 0x00007fffe7aaf19d in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#15 0x0000555557e7e63e in l0::L0Device::create_module (this=0x5555592eea80, code=0x7fff3c37e840 "\003\002#\a", len=58616, log=true) at /omnisci/L0Mgr/L0Mgr.cpp:198
There is a per-row overhead in dynamic modules used for service purposes (e.g., an in-memory counter of matched strings which is not registerized). With series this overhead becomes more visible because it is paid per each 4 bytes of input data. It makes the kernel run ~1.5x slower compared to dataframes with 10 columns and the same amount of data. Currently we don't have a proposal for this problem.
parallel_for loop
FSI is not needed anymore; we need to get rid of it.
It was found that TBB has a strange problem when linked with libunwind.
The following code causes a segfault due to register corruption:
// compile with g++ -O0 -g reproducer.cpp -ltbb -lunwind -pthread -o reproducer
#include <tbb/task_arena.h>
#include <stdexcept>
#include <tbb/task_group.h>
int main() {
tbb::task_group tg;
tg.run_and_wait([] {
tbb::task_arena tbb_arena;
tbb_arena.execute([] {
throw std::runtime_error("work was canceled");
});
});
}
To bypass that problem, the global task arena was removed in #201, but the problem may still exist without visible artifacts.
GroupBy on multiple columns performance degraded ~152x between OmniSci v5.5.0 and v5.7.0 when a custom fragment size is specified for the table.
The behaviour from the following reproducer can be achieved if the custom fragment size is computed by the formula fragment_size = int(n_table_rows / n_cpus); perf starts to degrade from ~n_cpus = 16.
This reproducer generates a table of 10 integer columns and 30 million rows and puts it into OmniSci in two ways: with the unspecified default fragment size, and with the one computed by the mentioned formula.
It then runs a SELECT SUM(all_of_the_cols) FROM test_table GROUP BY columns_to_by; query for both tables and measures the execution time. It uses 4 columns to group on; there are 100 groups in the result.
You can uncomment the appropriate line and specify your logging directory in order to view the info from OmniSci's internal timers.
import sys
import os
from timeit import default_timer as timer
sys.setdlopenflags(1 | 256) # RTLD_LAZY+RTLD_GLOBAL
try:
from omniscidbe import PyDbEngine # for omnisci >= 5.7.0
except ImportError:
from dbe import PyDbEngine # for omnisci < 5.7.0
import numpy as np
import pyarrow
np.random.seed(42)
def generate_data(ncols=10, nrows=30_000_000, number_of_by_cols=2, number_of_groups=100):
data = {
f"col{i}": np.random.randint(low=0, high=1_000, size=nrows)
for i in range(ncols - number_of_by_cols)
}
data.update(
**{
f"by_col{i}": np.tile(
np.arange(number_of_groups), nrows // number_of_groups
)
for i in range(number_of_by_cols)
}
)
return pyarrow.Table.from_pydict(data)
number_of_by_cols = 4
test_table = generate_data(number_of_by_cols=number_of_by_cols)
server = PyDbEngine(
enable_debug_timer=1,
# log_directory="/logging_directory", uncomment this line and enter your logging directory to get
# info from internal OmniSci's timers
)
cpu_count = os.cpu_count() # 112 at the reproducing machine
fragment_size = test_table.num_rows // cpu_count # 267857 at the reproducing machine
server.importArrowTable("test_table_with_fragment_size", test_table, fragment_size=fragment_size)
server.importArrowTable("test_table_default_fragment_size", test_table)
groupby_arg = ", ".join(f"SUM({col})" for col in test_table.column_names)
by_cols = ", ".join([f"by_col{i}" for i in range(number_of_by_cols)])
t1 = timer()
server.select_df(f"SELECT {groupby_arg} FROM test_table_default_fragment_size GROUP BY {by_cols}")
print(f"Time for default fragment size: {timer() - t1} seconds")
t1 = timer()
server.select_df(f"SELECT {groupby_arg} FROM test_table_with_fragment_size GROUP BY {by_cols}")
print(f"Time for specified fragment size({fragment_size}): {timer() - t1} seconds")
Output:
omnisci v5.7.0
Time for default fragment size: 5.377382284961641 seconds
Time for specified fragment size(267857): 76.4291126029566 seconds
omnisci v5.5.0
Time for default fragment size: 5.003902015276253 seconds
Time for specified fragment size(267857): 0.5864222021773458 seconds
This commit (a2d77ce) was marked as suspicious by the person who was bisecting the regression.
cc @Garra1980
If this is still relevant, a patch needs to be applied.
Looks like the problem is in results comparison:
7: The difference between ref_val and omnisci_val is nan, which exceeds EPS * std::fabs(ref_val), where
7: ref_val evaluates to inf,
7: omnisci_val evaluates to inf, and
7: EPS * std::fabs(ref_val) evaluates to inf.
7: CPU: SELECT 2e308 FROM test;
Right now, we still know about all table data in our streaming test; that is not the case for real streaming workloads, so we need to fix the test to import data chunk by chunk.
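One possible shape for the fix, sketched with a made-up append_chunk callback (the real test would call the storage's append API once per chunk, so the full table is never visible up front):

```python
def import_in_chunks(rows, append_chunk, chunk_rows=1_000):
    # Feed the data to the engine piece by piece instead of importing the
    # whole table at once, mimicking a streaming source.
    for start in range(0, len(rows), chunk_rows):
        append_chunk(rows[start:start + chunk_rows])

# Demo with a plain list standing in for table data.
received = []
import_in_chunks(list(range(2_500)), received.append)
assert len(received) == 3 and len(received[-1]) == 500
```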
The commit
commit cc698f16e15a40a2c667f07b9ad8d2107408c40c (HEAD)
Author: Alex Baden <[email protected]>
Date: Wed Mar 2 14:44:43 2022 -0700
Disable ALWAYS_INLINE under GCC/G++
ALWAYS_INLINE is only neccessary for clang-compiled C++ to LLVM IR for the JIT runtime. Disable it for GNU compilers to prevent inlining errors.
causes Interrupt.Check_Query_Runs_After_Interruption from RuntimeInterruptTest to run for too long.
/home/alexb/Projects/intel-ai/omniscidb/QueryEngine/BitmapGenerators.cpp:5:23: error: 'target' attribute takes one argument
size_t __attribute__((target("avx512bw", "avx512f"), optimize("no-tree-vectorize")))
^
/home/alexb/Projects/intel-ai/omniscidb/QueryEngine/BitmapGenerators.cpp:29:23: error: 'target' attribute takes one argument
size_t __attribute__((target("avx512bw", "avx512f"), optimize("no-tree-vectorize")))
^
/home/alexb/Projects/intel-ai/omniscidb/QueryEngine/BitmapGenerators.cpp:56:23: error: 'target' attribute takes one argument
size_t __attribute__((target("avx512bw", "avx512f"), optimize("no-tree-vectorize")))
^
/home/alexb/Projects/intel-ai/omniscidb/QueryEngine/BitmapGenerators.cpp:82:23: error: 'target' attribute takes one argument
size_t __attribute__((target("avx512bw", "avx512f"), optimize("no-tree-vectorize")))
^
/home/alexb/Projects/intel-ai/omniscidb/QueryEngine/BitmapGenerators.cpp:136:1: error: redefinition of 'gen_null_bitmap_8'
gen_null_bitmap_8(uint8_t* dst, const uint8_t* src, size_t size, const uint8_t null_val) {
^
/home/alexb/Projects/intel-ai/omniscidb/QueryEngine/BitmapGenerators.cpp:6:1: note: previous definition is here
gen_null_bitmap_8(uint8_t* dst, const uint8_t* src, size_t size, const uint8_t null_val) {
^
/home/alexb/Projects/intel-ai/omniscidb/QueryEngine/BitmapGenerators.cpp:140:43: error: redefinition of 'gen_null_bitmap_16'
size_t __attribute__((target("default"))) gen_null_bitmap_16(uint8_t* dst,
^
/home/alexb/Projects/intel-ai/omniscidb/QueryEngine/BitmapGenerators.cpp:30:1: note: previous definition is here
gen_null_bitmap_16(uint8_t* dst,
^
/home/alexb/Projects/intel-ai/omniscidb/QueryEngine/BitmapGenerators.cpp:147:43: error: redefinition of 'gen_null_bitmap_32'
size_t __attribute__((target("default"))) gen_null_bitmap_32(uint8_t* dst,
^
/home/alexb/Projects/intel-ai/omniscidb/QueryEngine/BitmapGenerators.cpp:57:1: note: previous definition is here
gen_null_bitmap_32(uint8_t* dst,
^
/home/alexb/Projects/intel-ai/omniscidb/QueryEngine/BitmapGenerators.cpp:154:43: error: redefinition of 'gen_null_bitmap_64'
size_t __attribute__((target("default"))) gen_null_bitmap_64(uint8_t* dst,
^
/home/alexb/Projects/intel-ai/omniscidb/QueryEngine/BitmapGenerators.cpp:83:1: note: previous definition is here
gen_null_bitmap_64(uint8_t* dst,
^
8 errors generated.
Since we are no longer using Thrift, we can move the globals out of CommandLineOptions, remove the ThriftHandler folder, and remove all Thrift install info / licenses.
The "fastest" distinct count implementation that OmniSci always tries to use is the BitMap approach: it allocates a zeroed bit array of size (max_col_value - min_col_value), sets a 1 at the corresponding position when a value is encountered, and then counts the number of ones to get the number of unique values.
omniscidb/QueryEngine/GroupByAndAggregate.cpp
Lines 620 to 622 in a48c6a1
The described BitMap approach suffers from memory overhead for high-range data: when a column contains both 0 and 1_000_000_000, the buffer size is around 120 MB. For a table with 10 columns it is 1.2 GB, and since each kernel has its own buffer, the final amount of memory for such a table is num_kernels * 1.2 GB.
Operating with such BitMaps (allocating, initializing with zeros, reducing) is a heavy deal for OmniSci in the current master. For a table of 30 million rows and 10 columns, with 100 distinct values in each column in the range from zero to one billion, it takes about 135 seconds to perform a distinct count (112 kernels were used).
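For reference, the bitmap logic itself is tiny (an illustrative Python sketch, not the engine code); its cost is entirely in the buffer, whose size depends on the value range rather than on the number of distinct values:

```python
def bitmap_count_distinct(values, min_val, max_val):
    # One bit per possible value in [min_val, max_val].
    nbits = max_val - min_val + 1
    bitmap = bytearray((nbits + 7) // 8)  # zero-initialized, like the engine's buffer
    for v in values:
        pos = v - min_val
        bitmap[pos >> 3] |= 1 << (pos & 7)
    # Count the set bits to get the number of unique values.
    return sum(bin(b).count("1") for b in bitmap)

assert bitmap_count_distinct([0, 5, 5, 7], 0, 7) == 3
# A 0..1_000_000_000 range costs ~119 MiB per bitmap, i.e. the ~120 MB above.
assert round(1_000_000_000 / 8 / 2**20) == 119
```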
import sys
import os
from timeit import default_timer as timer

sys.setdlopenflags(1 | 256)  # RTLD_LAZY+RTLD_GLOBAL

from omniscidbe import PyDbEngine
import numpy as np
import pyarrow as pa


def get_data():
    shape = (30_000_000, 10)
    nunique = 100
    max_value = 1_000_000_000
    nrows, ncols = shape
    data = {
        "col{}".format(i): np.tile(
            np.concatenate([np.arange(nunique - 1), [max_value]]), nrows // nunique
        )
        for i in range(ncols)
    }
    return pa.Table.from_pydict(data)


table = get_data()
print(f"Generated ({table.num_rows}, {table.num_columns}) shape table")

server = PyDbEngine(
    enable_union=1,
    enable_columnar_output=1,
    enable_lazy_fetch=0,
    null_div_by_zero=1,
    enable_watchdog=0,
    enable_debug_timer=1,
    log_directory="/localdisk/dchigare/repos/omnisci_log",
)

fragment_size = max(min(table.num_rows // os.cpu_count(), 2 ** 25), 2 ** 18)
server.importArrowTable("test_table", table, fragment_size=fragment_size)

cols = ", ".join([f"COUNT(distinct {col})" for col in table.column_names])
execution_time = timer()
record_batch = server.executeRA(f"SELECT {cols} FROM test_table").getArrowRecordBatch()
execution_time = timer() - execution_time
print(f"Got ({record_batch.num_rows}, {record_batch.num_columns}) shape table for {execution_time} sec.")
There are a few bottlenecks that account for such a long execution time.
Timing for bitmap allocation in a single kernel looks like this:
New thread(110)
0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:95
5ms start(1ms) fetchChunks Execute.cpp:2570
83487ms start(6ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:676
83487ms start(6ms) allocateCountDistinctBuffers QueryMemoryInitializer.cpp:655
8678ms start(2357ms) allocateCountDistinctBuffer RowSetMemoryOwner.h:62
8600ms start(2357ms) awaiting for a mutex... RowSetMemoryOwner.h:68
78ms start(10958ms) allocateAndZero ArenaAllocator.h:110
0ms start(10958ms) actual allocation ArenaAllocator.h:112
78ms start(10958ms) memset zero ArenaAllocator.h:116
8612ms start(11036ms) allocateCountDistinctBuffer RowSetMemoryOwner.h:62
8532ms start(11036ms) awaiting for a mutex... RowSetMemoryOwner.h:68
79ms start(19568ms) allocateAndZero ArenaAllocator.h:110
0ms start(19568ms) actual allocation ArenaAllocator.h:112
79ms start(19568ms) memset zero ArenaAllocator.h:116
... allocation for the rest 8 columns ...
11ms start(83494ms) executePlanWithoutGroupBy Execute.cpp:2978
11ms start(83494ms) launchCpuCode QueryExecutionContext.cpp:596
End thread(110)
(full log, run on 547f3b9)
The mutex it waits on for most of the kernel execution time is the state mutex of RowSetMemoryOwner:
omniscidb/QueryEngine/Descriptors/RowSetMemoryOwner.h
Lines 63 to 67 in a48c6a1
allocateAndZero() just allocates memory and initializes it with zeros; the mutex lock is redundant here, because each kernel initializes its own non-overlapping memory region.
omniscidb/DataMgr/Allocators/ArenaAllocator.h
Lines 108 to 112 in a48c6a1
Moving the memset out of the lock reduced the overall execution time from 125 sec. to 75 sec., but occasionally increased the time of the memset itself...
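The change can be modeled like this (a toy Python sketch, not the actual ArenaAllocator/RowSetMemoryOwner code; the class and method names are illustrative):

```python
import threading

class ArenaAllocatorModel:
    """Toy arena: only the allocation bookkeeping needs the mutex."""

    def __init__(self):
        self.lock = threading.Lock()
        self.buffers = []

    def allocate_and_zero_locked(self, size: int) -> bytearray:
        # Original scheme: allocation AND memset both happen under the
        # lock, so kernels serialize on zeroing their large bitmaps.
        with self.lock:
            buf = bytearray(size)
            buf[:] = b"\x00" * size  # stands in for the C++ memset
            self.buffers.append(buf)
        return buf

    def allocate_then_zero(self, size: int) -> bytearray:
        # Improved scheme: take the lock only to reserve the region...
        with self.lock:
            buf = bytearray(size)
            self.buffers.append(buf)
        # ...and zero it outside the lock; buffers never overlap, so no
        # synchronization is needed for the memset itself.
        buf[:] = b"\x00" * size
        return buf

buf = ArenaAllocatorModel().allocate_then_zero(8)
print(bytes(buf))  # 8 zero bytes
```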
New thread(110)
0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:95
86ms start(0ms) fetchChunks Execute.cpp:2570
11433ms start(87ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:676
11433ms start(87ms) allocateCountDistinctBuffers QueryMemoryInitializer.cpp:655
1538ms start(87ms) allocateCountDistinctBuffer RowSetMemoryOwner.h:62
0ms start(87ms) awaiting for the allocator' mutex RowSetMemoryOwner.h:68
0ms start(87ms) actual allocation ArenaAllocator.h:104
1537ms start(87ms) memset zero RowSetMemoryOwner.h:75
0ms start(1625ms) awaiting for the vector' mutex RowSetMemoryOwner.h:81
... allocation for the rest 9 columns ...
11ms start(11520ms) executePlanWithoutGroupBy Execute.cpp:2978
11ms start(11520ms) launchCpuCode QueryExecutionContext.cpp:596
End thread(110)
(full log, run on 691e3f2)
The amount of memory and the way it is allocated and initialized stayed the same, but the memset time changed from ~80 ms to ~1500 ms. I've rerun this several times and still get these numbers; is there a logical explanation?
The second bottleneck, which takes most of the execution time, is the reduction. I do not have much info on this yet; further investigation is needed...
60976ms start(12304ms) collectAllDeviceResults Execute.cpp:1990
60976ms start(12304ms) reduceMultiDeviceResults Execute.cpp:935
60976ms start(12304ms) reduceMultiDeviceResultSets Execute.cpp:981
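For the bitmap representation, this reduction amounts to OR-ing every kernel's buffer into one and counting the set bits at the end, so it has to stream roughly num_kernels * buffer_size bytes. A minimal Python model of that step (assumed behavior, not the actual reduceMultiDeviceResultSets code):

```python
import numpy as np

def reduce_bitmaps(kernel_bitmaps):
    """OR per-kernel bitmaps together, then popcount the merged result."""
    merged = np.zeros_like(kernel_bitmaps[0])
    for bm in kernel_bitmaps:
        # every pass reads a full buffer: num_kernels * buffer_size bytes
        np.bitwise_or(merged, bm, out=merged)
    return int(np.unpackbits(merged).sum())

# two kernels observed overlapping value sets (bits 0,2 and bits 2,3)
k1 = np.packbits(np.array([1, 0, 1, 0, 0, 0, 0, 0], dtype=np.uint8))
k2 = np.packbits(np.array([0, 0, 1, 1, 0, 0, 0, 0], dtype=np.uint8))
print(reduce_bitmaps([k1, k2]))  # -> 3
```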
std::set
OmniSci also provides a "slow" distinct count implementation via std::set. In theory, for a small number of distinct values in a big range, it should work much better in terms of memory allocation, but it is unknown how long the reduction stage would take; I have no data on this yet.
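A sketch of the set-based alternative (assumed behavior, not OmniSci's actual code): memory grows with the number of distinct values rather than the value range, and the reduction becomes a union of per-kernel sets.

```python
def set_count_distinct(chunks):
    """Per-kernel sets, merged by set union at reduction time."""
    per_kernel = [set(chunk) for chunk in chunks]  # memory ~ ndistinct, not range
    merged = set()
    for s in per_kernel:  # reduction stage: union the per-kernel sets
        merged |= s
    return len(merged)

# 100 distinct values spread over a 0..1e9 range cost ~100 set entries
# here, versus ~1e9 bits for the bitmap approach.
chunks = [[0, 5, 1_000_000_000], [5, 7, 1_000_000_000]]
print(set_count_distinct(chunks))  # -> 4
```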
You can check my distinct_count_timers branch, where I did the following: