omniscidb's Issues

Select.FilterAndSimpleAggregation fails on CPU when CUDA is enabled

[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from Select
[ RUN      ] Select.FilterAndSimpleAggregation
Tests/SQLiteComparator.cpp:96: Failure
Expected equality of these values:
  ref_val
    Which is: 15
  omnisci_val
    Which is: 0
CPU: SELECT COUNT(*) FROM test WHERE o1 > '1999-09-08';
Tests/SQLiteComparator.cpp:96: Failure
Expected equality of these values:
  ref_val
    Which is: 0
  omnisci_val
    Which is: 15
CPU: SELECT COUNT(*) FROM test WHERE o1 <= '1999-09-08';
2022-04-16T23:19:51.852455 F 149613 0 0 ArrowStorage.cpp:122 Check failed: elems > 0 (0 > 0) 
Aborted (core dumped)

Current sort implementation is slow due to random memory access

The current sort implementation generates a permutation buffer by allocating an indices buffer and sorting it with a comparator that fetches values from the data buffer by the given indices and compares them (see ResultSetComparator). This leads to random access to the data buffer, so the implementation does not scale well with multiple threads. We should consider better options to avoid that, such as sorting the specified columns in-place, keeping their permutations, and applying those permutations to the remaining columns (see the sketch below).
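
A minimal sketch of that idea, assuming a single 64-bit key column (names are illustrative, not an existing API): pair the keys with row indices so the sort comparator reads one contiguous buffer, leave the key column sorted in-place, and apply the resulting permutation to each remaining column with a sequential pass.

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Sort the key column in-place and return the permutation of row indices.
std::vector<uint32_t> sort_keys_inplace(std::vector<int64_t>& keys) {
  std::vector<std::pair<int64_t, uint32_t>> pairs(keys.size());
  for (uint32_t i = 0; i < keys.size(); ++i) {
    pairs[i] = {keys[i], i};
  }
  std::sort(pairs.begin(), pairs.end());  // no random access into the table
  std::vector<uint32_t> perm(keys.size());
  for (uint32_t i = 0; i < keys.size(); ++i) {
    keys[i] = pairs[i].first;   // key column ends up sorted in-place
    perm[i] = pairs[i].second;  // permutation for the remaining columns
  }
  return perm;
}

// Apply the permutation to any other column; each output chunk can be
// filled by a separate thread.
template <typename T>
std::vector<T> apply_permutation(const std::vector<T>& col,
                                 const std::vector<uint32_t>& perm) {
  std::vector<T> out(col.size());
  for (size_t i = 0; i < col.size(); ++i) {
    out[i] = col[perm[i]];
  }
  return out;
}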

Implement streaming support for jit engine

  • #269

    • Execute for specific chunks
  • Compilation with no chunks metadata

    • introduce table metadata
    • optional table/chunk metadata
    • #266
    • perfect hash function agnostic codegen
  • no group by aggregate

    • state storing (execution context)
    • workers (shared_context, local_context)
    • reduction (on demand)
  • perfect hash

    • use table metadata for perfect hash
  • baseline hash

    • allocation + resizing
    • detect hash table load factor and possibly pause query execution for hash table resizing
    • plan with partial aggregation + reduction
    • relalg IR changes
    • hashtable as input (serialization)
  • projection + filters

    • result set allocation and merging
    • stop kernel if result set is full and resume with new result set
  • joins

    • split hash table build and probe phases
    • relalg changes (new build and probe nodes)
    • hash table as input/output

Commit #3799 in modin-project causes regression on the TimeSortValues test

Information about running the TimeSortValues test can be found in #216

The given commit adds the following call to convertToArrowTable, with this timing:

0ms start(0ms) columnar conversion ArrowResultSetConverter.cpp:1181
34380ms start(0ms) row-wise conversion ArrowResultSetConverter.cpp:1201
32564ms start(0ms) fetch data single thread ArrowResultSetConverter.cpp:1206
1754ms start(32564ms) append data single thread ArrowResultSetConverter.cpp:1219
0ms start(34319ms) finish builders single thread ArrowResultSetConverter.cpp:1228
$ asv run --launch-method=forkserver --config ./modin/asv_bench/asv.conf.omnisci.json -b ^omnisci.benchmarks.TimeSortValues -a repeat=1 -a number=1 --show-stderr --python=same --set-commit-hash HEAD
· Discovering benchmarks
· Running 1 total benchmarks (1 commits * 1 environments * 1 benchmarks)
[ 0.00%] · For modin commit be2716f3 <master>:
[ 0.00%] ·· Building for existing-py_localdisk2_signatov_miniconda3_envs_omnisci-dev_bin_python3.8
[ 0.00%] ·· Benchmarking existing-py_localdisk2_signatov_miniconda3_envs_omnisci-dev_bin_python3.8
[ 0.00%] ··· Importing benchmark suite produced output:
[ 0.00%] ···· UserWarning: The pandas version installed 1.3.4 does not match the supported pandas version in Modin 1.3.5. This may cause undesired side effects!
[ 50.00%] ··· Running (omnisci.benchmarks.TimeSortValues.time_sort_values--).
[100.00%] ··· omnisci.benchmarks.TimeSortValues.time_sort_values ok
[100.00%] ··· ================ ===============================
              --               columns_number / ascending_list
              ---------------- -------------------------------
                   shape                  5 / False
              ================ ===============================
               [10000000, 10]            44.8±0.03s
              ================ ===============================

so

at = curs.getArrowTable()

is much slower now than

rb = curs.getArrowRecordBatch()

Extract compact column materialization from ColumnarResult to ResultSet

To do an in-place argsort in our new one-column sort optimization, we need to copy a column into a temporary buffer. This is already implemented in the ColumnarResult class, but we cannot use it directly because it copies the whole table. We could extract the column materialization function into ResultSet and use it from ColumnarResult as well (see the sketch below).
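
A minimal sketch of the extracted routine, assuming a row-wise buffer layout (materializeColumn is an illustrative name, not an existing API):

#include <cstdint>
#include <cstring>
#include <vector>

// Copy a single fixed-width column of a row-wise result into a contiguous
// temporary buffer, so that both ColumnarResult and the one-column in-place
// argsort can share the same routine.
std::vector<int8_t> materializeColumn(const int8_t* rows,
                                      size_t row_count,
                                      size_t row_width,
                                      size_t col_offset,
                                      size_t col_width) {
  std::vector<int8_t> out(row_count * col_width);
  for (size_t i = 0; i < row_count; ++i) {
    std::memcpy(out.data() + i * col_width,
                rows + i * row_width + col_offset,
                col_width);
  }
  return out;
}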

Output buffer initialization overhead

We have an output buffer initialization overhead. We previously worked on disabling such initialization, but new tests revealed another problem in this area. We might have a case where we initialize 4 bytes per output row. For series, such initialization becomes quite significant: 8-10% of the whole execution time. It should be possible to get rid of this initialization.

#175

Fix CUDA build

CUDA is failing some tests on more recent versions of LLVM.

Warnings purge

Compilation produces many warnings; we need to get rid of them.

Improve performance of ArrowResultSetConverter::getArrowTable

Related to issue #175. We need to improve the performance of the function on a single column.

The plan:

  • Build OmniSciDb under Conda
  • In this environment, run it and reproduce the problem
  • Implement Zero Copy
  • Test Zero Copy implementation
  • Parallelization within a single column
  • Employ AVX512 vectorization for bitmap creation

Incorrect importing of bools by DBEngine::importArrowTable()

The DBEngine::importArrowTable() function incorrectly imports tables with bools: it switches true to false and vice versa.
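
For reference, Arrow bit-packs boolean values one bit per row, LSB-first. A minimal reading sketch (arrow_bool_at is an illustrative helper, not an existing API); an inverted comparison somewhere on the import path, or reading the validity bitmap instead of the value buffer, would produce exactly the swapped output shown below:

#include <cstdint>

// Extract the boolean value of row i from an Arrow bit-packed value buffer.
inline bool arrow_bool_at(const uint8_t* values, int64_t i) {
  return (values[i >> 3] >> (i & 7)) & 1;
}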

To reproduce, create a CSV file with the following values:

i,bv,val
0,False,100
1,True,200

Then load the CSV file using the Arrow API and print the contents of the columns. After that, load the table using the DBEngine::importArrowTable() function, run SELECT * FROM <your_table>, get the table, and print its contents. You will see something like this:

[
  [
  0,
  1
  ]
]
[
  [
  true,   # must be false
  false   # must be true
  ]
]
[
  [
  100,
  200
  ]
]
Reproducer with pydbe and its execution results
"""
Issue 218 reproduction code 

To be launched via:
PYTHONPATH=/nfs/site/home/gnovichk/shm/modin \
MODIN_OMNISCI_LAUNCH_PARAMETERS=enable-debug-timer=1, \
data=omnitmp MODIN_EXPERIMENTAL=true \
MODIN_STORAGE_FORMAT=omnisci MODIN_ENGINE=native python <name_of_this_script.py>
"""
import pandas
import pyarrow
import sys

prev = sys.getdlopenflags()
sys.setdlopenflags(1 | 256) # RTLD_LAZY+RTLD_GLOBAL

from omniscidbe import PyDbEngine
sys.setdlopenflags(prev)

def get_data():
    df = pandas.DataFrame({"i": [0,1], "bv": [False, True], "val":[100,200] })
    return df, pyarrow.Table.from_pandas(df)

server = PyDbEngine(
    data="omnitmp", 
    enable_columnar_output=1, 
    enable_lazy_fetch=0, 
    enable_debug_timer=0, 
    calcite_port=4564
)

df, table = get_data()
server.importArrowTable("test_table", table)
cursor = server.executeDML("SELECT * FROM test_table")
df_at = pyarrow.Table.to_pandas(cursor.getArrowTable())
df_rb = pyarrow.Table.to_pandas(cursor.getArrowRecordBatch())
print ("===== Original Pandas Dataframe ======================")
print (df)
print ("===== Dataframe returned via getArrowTable() =========")
print (df_at)
print ("===== Dataframe returned via getArrowRecordBatch() ===")
print (df_rb)
print ("======================================================")

Execution results:

===== Original Pandas Dataframe ======================
   i     bv  val
0  0  False  100
1  1   True  200
===== Dataframe returned via getArrowTable() =========
   i     bv  val
0  0   True  100
1  1  False  200
===== Dataframe returned via getArrowRecordBatch() ===
   i     bv  val
0  0   True  100
1  1  False  200
======================================================

insertCsvValues behaves incorrectly if misused

The following usage (not intended by design) produces double frees (double free or corruption (!prev)):

std::ifstream is(g_data_path);
std::string line;
while (std::getline(is, line)) {
  insertCsvValues("trips", line);
}

somewhere around:

#0  0x00007f03f961cc31 in arrow::ChunkedArray::Make(std::vector<std::shared_ptr<arrow::Array>, std::allocator<std::shared_ptr<arrow::Array> > >, std::shared_ptr<arrow::DataType>) () from /home/petr/anaconda3/envs/omnisci-dev/lib/libarrow.so.500
#1  0x0000557e67ead4fb in ArrowStorage::appendArrowTable (this=0x557e698a6010, at=..., table_id=<optimized out>) at /data/anaconda3/envs/omnisci-dev/x86_64-conda-linux-gnu/include/c++/9.4.0/bits/stl_vector.h:99
#2  0x0000557e67eb3674 in ArrowStorage::appendCsvData (this=0x557e698a6010, csv_data=..., table_id=<optimized out>, parse_options=...) at /data/anaconda3/envs/omnisci-dev/x86_64-conda-linux-gnu/include/c++/9.4.0/ext/atomicity.h:96
#3  0x0000557e67eb38d4 in ArrowStorage::appendCsvData (this=0x557e698a6010, csv_data=..., table_name=..., parse_options=...) at /home/petr/code/offloading/omni/ArrowStorage/ArrowStorage.cpp:589
#4  0x0000557e676bcd87 in TestHelpers::ArrowSQLRunner::(anonymous namespace)::ArrowSQLRunnerImpl::insertCsvValues (values=..., table_name=..., this=<optimized out>) at /data/anaconda3/envs/omnisci-dev/x86_64-conda-linux-gnu/include/c++/9.4.0/bits/shared_ptr_base.h:1020
#5  TestHelpers::ArrowSQLRunner::insertCsvValues (table_name=..., values=...) at /home/petr/code/offloading/omni/Tests/ArrowSQLRunner/ArrowSQLRunner.cpp:318

Add taxi benchmark to github actions CPU CI

We have scripts checked into the repo that populate data and run queries using the Google Benchmark framework: https://github.com/intel-ai/omniscidb/blob/jit-engine/Benchmarks/taxi/taxi_reduced_bench.cpp

We should run these nightly as part of our CI. Google benchmark can output in various formats which should be readable by grafana or some other monitoring tool. We could also just output the raw values and visually check between different runs.

Google benchmark also has a compare tool which we might leverage to make runs report status automatically: https://github.com/google/benchmark/blob/main/docs/tools.md (see bottom section on U test).

Remove the single DB ID restriction

Currently, we can only operate on data from a single database. Its ID is stored in RelAlgExecutor and Executor. We should get rid of this limitation by extending the proper structures (like InputDescriptor).

A subtask would be to change how a dictionary is received from DataMgr. Currently, we use database id + dictionary id. That means SQLTypeInfo doesn't actually fully describe the dictionary it refers to. I think the easiest way to fix that is to encode the schema id in the highest byte of the dictionary id, so the proper data provider can be chosen in PersistentDataMgr, and leave the rest to the data providers (see the sketch below).
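
A hypothetical packing scheme illustrating the idea (function names are assumptions, not an existing API):

#include <cstdint>

// Keep the schema id in the highest byte of the 32-bit dictionary id so a
// single int still fully identifies a dictionary and PersistentDataMgr can
// route the request to the proper data provider.
inline int32_t pack_dict_id(int32_t schema_id, int32_t dict_id) {
  // Assumes 0 <= schema_id < 128 and 0 <= dict_id < 2^24.
  return (schema_id << 24) | (dict_id & 0x00FFFFFF);
}
inline int32_t schema_id_of(int32_t packed) {
  return (packed >> 24) & 0x7F;
}
inline int32_t dict_id_of(int32_t packed) {
  return packed & 0x00FFFFFF;
}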

Equi-join fails when joining on a `rowid` meta-column

Hash join fails when trying to join on a rowid column and throws the following exception:

TOmniSciException(error_msg=Hash join failed, reason(s): Could not build a 1-to-1 correspondence for columns involved in equijoin | Cannot join on rowid | Cannot fall back to loop join for non-trivial inner table size)
Reproducer with pydbe
import sys

sys.setdlopenflags(1 | 256)  # RTLD_LAZY+RTLD_GLOBAL
from omniscidbe import PyDbEngine
import pyarrow as pa

server = PyDbEngine()

# The behaviour is reproducible only if the length of the tables is greater than 1000
N = 1001
tbl1 = pa.Table.from_pydict({"a": range(N)})
tbl2 = pa.Table.from_pydict({"b": range(N)})

server.importArrowTable("table1", tbl1)
server.importArrowTable("table2", tbl2)

print("Join condition: rowid = b")
res = server.executeRA(
    "SELECT table1.a, table2.b FROM table1 JOIN table2 ON table1.rowid = table2.b"
).getArrowRecordBatch()  # Works
print(res)

print("Join condition: a = rowid")
res = server.executeRA(
    "SELECT table1.a, table2.b FROM table1 JOIN table2 ON table1.a = table2.rowid"
).getArrowRecordBatch()  # Fails
print(res)

print("Join condition: rowid = rowid")
res = server.executeRA(
    "SELECT table1.a, table2.b FROM table1 JOIN table2 ON table1.rowid = table2.rowid"
).getArrowRecordBatch()  # Fails
print(res)

Output:

Join condition: rowid = b
pyarrow.RecordBatch
a: int64
b: int64
Join condition: a = rowid
2021-11-26T16:43:54.940381 E 2352382 0 0 DBHandler.cpp:1385 Hash join failed, reason(s): Could not build a 1-to-1 correspondence for columns involved in equijoin | Cannot join on rowid | Cannot fall back to loop join for non-trivial inner table size
Traceback (most recent call last):
  File "/localdisk1/dchigare/repos/modin_bp/testing.py", line 24, in <module>
    res = server.executeRA(
  File "dbe.pyx", line 196, in omniscidbe.PyDbEngine.executeRA
RuntimeError: TException - service has thrown: TOmniSciException(error_msg=Hash join failed, reason(s): Could not build a 1-to-1 correspondence for columns involved in equijoin | Cannot join on rowid | Cannot fall back to loop join for non-trivial inner table size)

This behaviour is reproducible on the intel-ai/omniscidb/modin branch (commit 7871987) as well as on the latest OmniSci 5.8.0 release (linux-64/pyomniscidbe-5.8.0-py37h1234567_1_cpu)

Perf issues for binary operations with Series

Ilya investigated the slow performance of a binary operation (mul was used as an example) for series in Modin on OmniSci. Several problems were revealed:

  • There is a per-row overhead in dynamic modules used for service purposes (e.g. an in-memory counter of matched strings which is not registerized). With series this overhead becomes more visible because it is paid for every 4 bytes of input data. It makes the kernel run ~1.5x slower compared to dataframes with 10 columns and the same amount of data. Currently we don't have a proposal for this problem.
  • We have an output buffer initialization overhead. We previously worked on disabling such initialization, but new tests revealed another problem in this area. We might have a case where we initialize 4 bytes per output row. For series, such initialization becomes quite significant: 8-10% of the whole execution time. It should be possible to get rid of this initialization.
  • We spend much time on conversion of the result set into the Arrow record batch format. The current converter runs the conversion of each column in parallel, which means we have a single-threaded conversion for series output. Conversion takes 67% of all execution time. The proper solution here would be to convert into an Arrow table instead of a record batch. This would allow zero-copy conversion and remove most of the conversion overhead (some time would still be spent building the nulls map).

Broken UdfTests [modin branch]

Looks like #226 broke UdfTests:

In file included from /localdisk/ilyaenko/omnisci/build/Tests/../../Tests/Udf/udf_sample.cpp:5:
In file included from /localdisk/ilyaenko/omnisci/build/Tests/../../Tests/Udf/../../QueryEngine/OmniSciTypes.h:24:
/localdisk/ilyaenko/omnisci/build/Tests/../../Tests/Udf/../../QueryEngine/../Shared/InlineNullValues.h:371:28: error: no template named 'is_same_v' in namespace 'std'; did you mean 'is_same'?
    std::enable_if_t<!std::is_same_v<V, bool> && std::is_integral<V>::value, int> = 0>
                      ~~~~~^~~~~~~~~
                           is_same
/usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/type_traits:1292:12: note: 'is_same' declared here
    struct is_same
           ^
In file included from /localdisk/ilyaenko/omnisci/build/Tests/../../Tests/Udf/udf_sample.cpp:5:
In file included from /localdisk/ilyaenko/omnisci/build/Tests/../../Tests/Udf/../../QueryEngine/OmniSciTypes.h:24:
/localdisk/ilyaenko/omnisci/build/Tests/../../Tests/Udf/../../QueryEngine/../Shared/InlineNullValues.h:371:47: error: expected '(' for function-style cast or type construction
    std::enable_if_t<!std::is_same_v<V, bool> && std::is_integral<V>::value, int> = 0>
                      ~~~~~~~~~~~~~~~~~~~~~~~ ^

Some recent commits in the jit-engine branch cause compile errors

I used a conda environment:

commit 609edcff6b2614de59f7a11f0bf2c2f6c603fb28 (HEAD)
Author: ienkovich <[email protected]>
Date:   Thu Feb 17 08:15:23 2022 -0600

    Add ArrowBasedExecuteTest skeleton.

    Signed-off-by: ienkovich <[email protected]>

/localdisk2/signatov/miniconda3/envs/omnisci-dev/x86_64-conda-linux-gnu/include/c++/9.4.0/bits/shared_ptr.h:359:59:   required from 'std::shared_ptr<_Tp>::shared_ptr(std::_Sp_alloc_shared_tag<_Tp>, _Args&& ...) [with _Alloc = std::allocator<Calcite>; _Args = {int, const int&, const char (&)[1], int, int, bool, const char (&)[1]}; _Tp = Calcite]'
/localdisk2/signatov/miniconda3/envs/omnisci-dev/x86_64-conda-linux-gnu/include/c++/9.4.0/bits/shared_ptr.h:701:14:   required from 'std::shared_ptr<_Tp> std::allocate_shared(const _Alloc&, _Args&& ...) [with _Tp = Calcite; _Alloc = std::allocator<Calcite>; _Args = {int, const int&, const char (&)[1], int, int, bool, const char (&)[1]}]'
/localdisk2/signatov/miniconda3/envs/omnisci-dev/x86_64-conda-linux-gnu/include/c++/9.4.0/bits/shared_ptr.h:717:39:   required from 'std::shared_ptr<_Tp> std::make_shared(_Args&& ...) [with _Tp = Calcite; _Args = {int, const int&, const char (&)[1], int, int, bool, const char (&)[1]}]'
/localdisk2/signatov/jit-eng/Tests/ArrowBasedExecuteTest.cpp:390:84:   required from here
/localdisk2/signatov/miniconda3/envs/omnisci-dev/x86_64-conda-linux-gnu/include/c++/9.4.0/ext/aligned_buffer.h:91:28: error: invalid application of 'sizeof' to incomplete type 'Calcite'
   91 |     : std::aligned_storage<sizeof(_Tp), __alignof__(_Tp)>
      |                            ^~~~~~~~~~~
/localdisk2/signatov/miniconda3/envs/omnisci-dev/x86_64-conda-linux-gnu/include/c++/9.4.0/ext/aligned_buffer.h:91:28: error: invalid application of 'sizeof' to incomplete type 'Calcite'
make[2]: *** [Tests/CMakeFiles/ArrowBasedExecuteTest.dir/build.make:76: Tests/CMakeFiles/ArrowBasedExecuteTest.dir/ArrowBasedExecuteTest.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:2932: Tests/CMakeFiles/ArrowBasedExecuteTest.dir/all] Error 2
make: *** [Makefile:166: all] Error 2
commit 00431639e3f5071abe229fe37a531682d4b501ee (HEAD)
Author: ienkovich <[email protected]>
Date:   Thu Feb 24 06:59:34 2022 -0600

    Add Select tests to ArrowBasedExecuteTest.

    Signed-off-by: ienkovich <[email protected]>

/localdisk2/signatov/jit-eng/Tests/ArrowBasedExecuteTest.cpp: In static member function 'static void ExecuteTestBase::import_array_test(const string&)':
/localdisk2/signatov/jit-eng/Tests/ArrowBasedExecuteTest.cpp:733:5: error: 'ASSERT' was not declared in this scope
  733 |     ASSERT(tinfo);
      |     ^~~~~~
commit 4ebdf37b5eda3e9f8a87dd6cc194712843b68f9a (HEAD)
Author: Alex Baden <[email protected]>
Date:   Tue Mar 8 23:27:12 2022 -0700

    Introduce BufferProvider

In file included from /localdisk2/signatov/jit-eng/QueryEngine/Descriptors/RelAlgExecutionDescriptor.h:23,
                 from /localdisk2/signatov/jit-eng/QueryEngine/ArrowResultSet.h:21,
                 from /localdisk2/signatov/jit-eng/Embedded/DBEngine.cpp:24:
/localdisk2/signatov/jit-eng/QueryEngine/ResultSet.h:198:3: note: candidate: 'ResultSet::ResultSet(int64_t, int64_t, std::shared_ptr<RowSetMemoryOwner>)'
  198 |   ResultSet(int64_t queue_time_ms,
      |   ^~~~~~~~~
/localdisk2/signatov/jit-eng/QueryEngine/ResultSet.h:198:3: note:   candidate expects 3 arguments, 8 provided
/localdisk2/signatov/jit-eng/QueryEngine/ResultSet.h:196:3: note: candidate: 'ResultSet::ResultSet(const string&)'
  196 |   ResultSet(const std::string& explanation);
      |   ^~~~~~~~~
/localdisk2/signatov/jit-eng/QueryEngine/ResultSet.h:196:3: note:   candidate expects 1 argument, 8 provided
/localdisk2/signatov/jit-eng/QueryEngine/ResultSet.h:189:3: note: candidate: 'ResultSet::ResultSet(std::shared_ptr<const Analyzer::Estimator>, ExecutorDeviceType, int, Data_Namespace::DataMgr*, BufferProvider*, int)'
  189 |   ResultSet(const std::shared_ptr<const Analyzer::Estimator>,
      |   ^~~~~~~~~
/localdisk2/signatov/jit-eng/QueryEngine/ResultSet.h:189:3: note:   candidate expects 6 arguments, 8 provided
/localdisk2/signatov/jit-eng/QueryEngine/ResultSet.h:174:3: note: candidate: 'ResultSet::ResultSet(const std::vector<TargetInfo>&, const std::vector<ColumnLazyFetchInfo>&, const std::vector<std::vector<const signed char*> >&, const std::vector<std::vector<long int> >&, const std::vector<long int>&, ExecutorDeviceType, int, const QueryMemoryDescriptor&, std::shared_ptr<RowSetMemoryOwner>, Data_Namespace::DataMgr*, BufferProvider*, int, unsigned int, unsigned int)'
  174 |   ResultSet(const std::vector<TargetInfo>& targets,
      |   ^~~~~~~~~
/localdisk2/signatov/jit-eng/QueryEngine/ResultSet.h:174:3: note:   candidate expects 14 arguments, 8 provided
/localdisk2/signatov/jit-eng/QueryEngine/ResultSet.h:164:3: note: candidate: 'ResultSet::ResultSet(const std::vector<TargetInfo>&, ExecutorDeviceType, const QueryMemoryDescriptor&, std::shared_ptr<RowSetMemoryOwner>, Data_Namespace::DataMgr*, BufferProvider*, int, unsigned int, unsigned int)'
  164 |   ResultSet(const std::vector<TargetInfo>& targets,
      |   ^~~~~~~~~
/localdisk2/signatov/jit-eng/QueryEngine/ResultSet.h:164:3: note:   candidate expects 9 arguments, 8 provided
make[2]: *** [Embedded/CMakeFiles/DBEngine.dir/build.make:90: Embedded/CMakeFiles/DBEngine.dir/DBEngine.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:2592: Embedded/CMakeFiles/DBEngine.dir/all] Error 2
make: *** [Makefile:166: all] Error 2

Profile the getArrowTable and getArrowRecordBatch functions, compare their performance, and determine the performance improvement

To determine the performance improvement of getArrowTable over getArrowRecordBatch, we need to profile both functions. For that, perform the following:

  1. With a database engine, create a table with a single column of size N (varies; could be 30 mln, 3 mln, TBD) and a specified fragment_size
  2. Run SQL query "SELECT * FROM <your_table>"
  3. Retrieve the query result by using getArrowTable function, measure its execution time
  4. Retrieve the query result by using getArrowRecordBatch function, measure its execution time

As the time measurement of the function's execution may vary from run to run, it is desirable to do repeated measurements for each of steps 3 and 4. The number of repetitions must be determined empirically (increase it until the standard deviation is acceptably small); see the measurement sketch below.
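
A hedged measurement sketch for steps 3 and 4 (generic C++, not an existing test harness): run the retrieval N times and report the mean and standard deviation, so the repetition count can be increased until the deviation is acceptably small.

#include <chrono>
#include <cmath>
#include <cstdio>
#include <functional>
#include <vector>

// Run fn() 'reps' times (e.g. a body that calls getArrowTable or
// getArrowRecordBatch on a prepared cursor) and print mean +/- stddev.
void measure(const char* name, const std::function<void()>& fn, int reps) {
  std::vector<double> times;
  for (int i = 0; i < reps; ++i) {
    const auto t0 = std::chrono::steady_clock::now();
    fn();
    const auto t1 = std::chrono::steady_clock::now();
    times.push_back(std::chrono::duration<double>(t1 - t0).count());
  }
  double mean = 0.0, var = 0.0;
  for (const double t : times) mean += t;
  mean /= reps;
  for (const double t : times) var += (t - mean) * (t - mean);
  std::printf("%s: %.3f s +/- %.3f s over %d reps\n",
              name, mean, std::sqrt(var / reps), reps);
}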

The time measurements should be done for various numbers of fragments (equivalently, fragment sizes).

The data obtained by these efforts should shed light on the overall performance of each function. As both functions use parallel execution, we may also see how different numbers of fragments affect performance.

Before the AVX-512 parallelization is done, it is better to get the best possible performance by tuning the current implementation of getArrowTable.

Related issue: #200

L0: segfault when trying to create spirv binary

query:

      "SELECT passenger_count, extract(year from "
      "pickup_datetime), trip_type, count(*) FROM trips GROUP BY 1, 2, "
      "3;",

backtrace:

#0 0x00007fffd766c876 in ?? () from /lib/x86_64-linux-gnu/libigc.so.1
#1 0x00007fffd73ec664 in ?? () from /lib/x86_64-linux-gnu/libigc.so.1
#2 0x00007fffd76d581b in ?? () from /lib/x86_64-linux-gnu/libigc.so.1
#3 0x00007fffd824ef37 in llvm::FPPassManager::runOnFunction(llvm::Function&) () from /lib/x86_64-linux-gnu/libigc.so.1
#4 0x00007fffd8250729 in llvm::FPPassManager::runOnModule(llvm::Module&) () from /lib/x86_64-linux-gnu/libigc.so.1
#5 0x00007fffd8250b91 in llvm::legacy::PassManagerImpl::run(llvm::Module&) () from /lib/x86_64-linux-gnu/libigc.so.1
#6 0x00007fffd74845cf in ?? () from /lib/x86_64-linux-gnu/libigc.so.1
#7 0x00007fffd74516ba in ?? () from /lib/x86_64-linux-gnu/libigc.so.1
#8 0x00007fffd72db551 in ?? () from /lib/x86_64-linux-gnu/libigc.so.1
#9 0x00007fffd73b4d8d in ?? () from /lib/x86_64-linux-gnu/libigc.so.1
#10 0x00007fffe7b3c597 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#11 0x00007fffe7ac5ebb in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#12 0x00007fffe7ac75c6 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#13 0x00007fffe7ac7965 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#14 0x00007fffe7aaf19d in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#15 0x0000555557e7e63e in l0::L0Device::create_module (this=0x5555592eea80, code=0x7fff3c37e840 "\003\002#\a", len=58616, log=true) at /omnisci/L0Mgr/L0Mgr.cpp:198

Improve codegen for series (columns)

There is a per-row overhead in dynamic modules used for service purposes (e.g. an in-memory counter of matched strings which is not registerized). With series this overhead becomes more visible because it is paid for every 4 bytes of input data. It makes the kernel run ~1.5x slower compared to dataframes with 10 columns and the same amount of data. Currently we don't have a proposal for this problem.

#175

Add AVX512 support for getArrowTable()

  • Use AVX512 intrinsics to vectorize bitmap creation (see the sketch after this list)
  • Provide default (non-AVX512) bitmap creation without use of intrinsics
  • Integrate AVX512-vectorized bitmap creation into parallel_for loop
  • Profile for each of the cases, AVX-512 and default (non-AVX-512)
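
A minimal sketch of the first two items, assuming nulls are encoded as a sentinel value in a 32-bit column (function name and signature are illustrative): compare 16 lanes at a time against the sentinel and write the resulting 16-bit validity mask straight into the bitmap (1 = non-null), with a scalar tail for the remainder.

#include <cstdint>
#include <cstring>
#include <immintrin.h>

__attribute__((target("avx512f,avx512bw")))
size_t gen_null_bitmap_32_avx512(uint8_t* dst,
                                 const int32_t* src,
                                 size_t size,
                                 const int32_t null_val) {
  size_t null_count = 0;
  const __m512i null_vec = _mm512_set1_epi32(null_val);
  size_t i = 0;
  for (; i + 16 <= size; i += 16) {
    const __m512i vals = _mm512_loadu_si512(src + i);
    // One mask bit per lane: 1 if the value is not the null sentinel.
    const __mmask16 valid = _mm512_cmpneq_epi32_mask(vals, null_vec);
    std::memcpy(dst + i / 8, &valid, sizeof(uint16_t));
    null_count += 16 - __builtin_popcount(valid);
  }
  // Scalar tail: zero the remaining bytes, then set bits one by one.
  std::memset(dst + i / 8, 0, (size - i + 7) / 8);
  for (; i < size; ++i) {
    if (src[i] != null_val) {
      dst[i / 8] |= uint8_t(1) << (i % 8);
    } else {
      ++null_count;
    }
  }
  return null_count;
}

The default (non-AVX512) variant is essentially the scalar tail loop applied to the whole input.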

TBB could work incorrectly with libunwind

It was found that TBB has a strange problem when linked with libunwind.
The following code causes a segfault due to register corruption:

// compile with g++ -O0 -g reproducer.cpp  -ltbb -lunwind -pthread -o reproducer

#include <tbb/task_arena.h>
#include <stdexcept>
#include <tbb/task_group.h>

int main() {
    tbb::task_group tg;
    tg.run_and_wait([] {
        tbb::task_arena tbb_arena;
        tbb_arena.execute([] {
            throw std::runtime_error("work was canceled");
        });
    });
}

To bypass the problem, the global task arena was removed in #201, but the problem may still exist without visible artifacts.

Performance regression in GroupBy since 5.7.0 for small fragment sizes

GroupBy performance on multiple columns degraded ~152x between OmniSci v5.5.0 and v5.7.0 when a custom fragment size is specified for the table.

The behaviour from the following reproducer can be achieved if the custom fragment size is computed by the formula fragment_size = int(n_table_rows / n_cpus); performance starts to degrade from ~n_cpus = 16.

Reproducer with PyDBE

This reproducer generates a table of 10 integer cols and 30 million rows and puts it into OmniSci in two ways: with the default (unspecified) fragment size, and with the one computed by the formula above.

Then it runs a SELECT SUM(all_of_the_cols) FROM test_table GROUP BY columns_to_by; query for both of the tables and measures the execution time. It groups on 4 columns; there are 100 groups in the result.

You can uncomment the appropriate line and specify your logging directory in order to view the info from OmniSci's internal timers.

import sys
import os
from timeit import default_timer as timer

sys.setdlopenflags(1 | 256)  # RTLD_LAZY+RTLD_GLOBAL
try:
    from omniscidbe import PyDbEngine # for omnisci >= 5.7.0
except ImportError:
    from dbe import PyDbEngine # for omnisci < 5.7.0

import numpy as np
import pyarrow

np.random.seed(42)

def generate_data(ncols=10, nrows=30_000_000, number_of_by_cols=2, number_of_groups=100):
    data = {
        f"col{i}": np.random.randint(low=0, high=1_000, size=nrows)
        for i in range(ncols - number_of_by_cols)
    }
    data.update(
        **{
            f"by_col{i}": np.tile(
                np.arange(number_of_groups), nrows // number_of_groups
            )
            for i in range(number_of_by_cols)
        }
    )
    return pyarrow.Table.from_pydict(data)

number_of_by_cols = 4
test_table = generate_data(number_of_by_cols=number_of_by_cols)

server = PyDbEngine(
    enable_debug_timer=1,
    # log_directory="/logging_directory", uncomment this line and enter your logging directory to get
    #                                     info from internal OmniSci's timers
)

cpu_count = os.cpu_count() # 112 at the reproducing machine
fragment_size = test_table.num_rows // cpu_count # 267857 at the reproducing machine 

server.importArrowTable("test_table_with_fragment_size", test_table, fragment_size=fragment_size)
server.importArrowTable("test_table_default_fragment_size", test_table)

groupby_arg = ", ".join(f"SUM({col})" for col in test_table.column_names)
by_cols = ", ".join([f"by_col{i}" for i in range(number_of_by_cols)])

t1 = timer()
server.select_df(f"SELECT {groupby_arg} FROM test_table_default_fragment_size GROUP BY {by_cols}")
print(f"Time for default fragment size: {timer() - t1} seconds")

t1 = timer()
server.select_df(f"SELECT {groupby_arg} FROM test_table_with_fragment_size GROUP BY {by_cols}")
print(f"Time for specified fragment size({fragment_size}): {timer() - t1} seconds")

Output:

omnisci v5.7.0
Time for default fragment size: 5.377382284961641 seconds
Time for specified fragment size(267857): 76.4291126029566 seconds

omnisci v5.5.0
Time for default fragment size: 5.003902015276253 seconds
Time for specified fragment size(267857): 0.5864222021773458 seconds

This commit (a2d77ce) was marked as suspicious during bisection of the regression.

cc @Garra1980

Select.ReturnInfFromDivByZero test fails on modin branch

Looks like the problem is in the results comparison:

7: The difference between ref_val and omnisci_val is nan, which exceeds EPS * std::fabs(ref_val), where
7: ref_val evaluates to inf,
7: omnisci_val evaluates to inf, and
7: EPS * std::fabs(ref_val) evaluates to inf.
7: CPU: SELECT 2e308 FROM test;

Interrupt.Check_Query_Runs_After_Interruption is running too long

The commit

commit cc698f16e15a40a2c667f07b9ad8d2107408c40c (HEAD)
Author: Alex Baden <[email protected]>
Date:   Wed Mar 2 14:44:43 2022 -0700

    Disable ALWAYS_INLINE under GCC/G++

    ALWAYS_INLINE is only neccessary for clang-compiled C++ to LLVM IR for the JIT runtime. Disable it for GNU compilers to prevent inlining errors.

causes Interrupt.Check_Query_Runs_After_Interruption from RuntimeInterruptTest to run for too long

Build errors w/ clang-13 and AVX commit

/home/alexb/Projects/intel-ai/omniscidb/QueryEngine/BitmapGenerators.cpp:5:23: error: 'target' attribute takes one argument
size_t __attribute__((target("avx512bw", "avx512f"), optimize("no-tree-vectorize")))
                      ^
/home/alexb/Projects/intel-ai/omniscidb/QueryEngine/BitmapGenerators.cpp:29:23: error: 'target' attribute takes one argument
size_t __attribute__((target("avx512bw", "avx512f"), optimize("no-tree-vectorize")))
                      ^
/home/alexb/Projects/intel-ai/omniscidb/QueryEngine/BitmapGenerators.cpp:56:23: error: 'target' attribute takes one argument
size_t __attribute__((target("avx512bw", "avx512f"), optimize("no-tree-vectorize")))
                      ^
/home/alexb/Projects/intel-ai/omniscidb/QueryEngine/BitmapGenerators.cpp:82:23: error: 'target' attribute takes one argument
size_t __attribute__((target("avx512bw", "avx512f"), optimize("no-tree-vectorize")))
                      ^
/home/alexb/Projects/intel-ai/omniscidb/QueryEngine/BitmapGenerators.cpp:136:1: error: redefinition of 'gen_null_bitmap_8'
gen_null_bitmap_8(uint8_t* dst, const uint8_t* src, size_t size, const uint8_t null_val) {
^
/home/alexb/Projects/intel-ai/omniscidb/QueryEngine/BitmapGenerators.cpp:6:1: note: previous definition is here
gen_null_bitmap_8(uint8_t* dst, const uint8_t* src, size_t size, const uint8_t null_val) {
^
/home/alexb/Projects/intel-ai/omniscidb/QueryEngine/BitmapGenerators.cpp:140:43: error: redefinition of 'gen_null_bitmap_16'
size_t __attribute__((target("default"))) gen_null_bitmap_16(uint8_t* dst,
                                          ^
/home/alexb/Projects/intel-ai/omniscidb/QueryEngine/BitmapGenerators.cpp:30:1: note: previous definition is here
gen_null_bitmap_16(uint8_t* dst,
^
/home/alexb/Projects/intel-ai/omniscidb/QueryEngine/BitmapGenerators.cpp:147:43: error: redefinition of 'gen_null_bitmap_32'
size_t __attribute__((target("default"))) gen_null_bitmap_32(uint8_t* dst,
                                          ^
/home/alexb/Projects/intel-ai/omniscidb/QueryEngine/BitmapGenerators.cpp:57:1: note: previous definition is here
gen_null_bitmap_32(uint8_t* dst,
^
/home/alexb/Projects/intel-ai/omniscidb/QueryEngine/BitmapGenerators.cpp:154:43: error: redefinition of 'gen_null_bitmap_64'
size_t __attribute__((target("default"))) gen_null_bitmap_64(uint8_t* dst,
                                          ^
/home/alexb/Projects/intel-ai/omniscidb/QueryEngine/BitmapGenerators.cpp:83:1: note: previous definition is here
gen_null_bitmap_64(uint8_t* dst,
^
8 errors generated.
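
A hypothetical portable spelling that would address the first four errors: clang's target attribute takes a single comma-separated feature string (hence "takes one argument"), and clang has no optimize attribute, so that part has to be guarded to GCC. The redefinition errors should then also disappear, since the AVX-512 definitions would again be recognized as multiversioned functions distinct from the target("default") ones.

#include <cstddef>
#include <cstdint>

// Single feature string for clang; GCC-only 'optimize' attribute.
#if defined(__clang__)
#define BITMAP_TARGET_AVX512 __attribute__((target("avx512bw,avx512f")))
#else
#define BITMAP_TARGET_AVX512 \
  __attribute__((target("avx512bw,avx512f"), optimize("no-tree-vectorize")))
#endif

BITMAP_TARGET_AVX512
size_t gen_null_bitmap_8(uint8_t* dst,
                         const uint8_t* src,
                         size_t size,
                         const uint8_t null_val);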

Remove references to thrift

Since we are no longer using Thrift, we can move the globals out of CommandLineOptions, remove the ThriftHandler folder, and remove all Thrift install info and licenses.

Distinct count is terribly slow for highly ranged data

The "fastest" distinct count implementation that OmniSci is always trying to use is the BitMap approach: it allocates a bit array of zeros in size of (max_col_value - min_col_value), puts a 1 at the corresponding position if the value was met, and then counts the amount of ones to get the number of unique values.

count_distinct_impl_type = CountDistinctImplType::Bitmap;
if (agg_info.agg_kind == kCOUNT) {
  bitmap_sz_bits = arg_range_info.max - arg_range_info.min + 1;

The described BitMap approach suffers from memory overhead for highly ranged data: when a column contains both 0 and 1_000_000_000, the buffer size would be around 120 MB (10^9 bits); for a table with 10 columns that is 1.2 GB, and since each kernel has its own buffer, the final amount of memory for such a table is num_kernels * 1.2 GB.

Operating with such BitMaps (allocating, initializing with zeros, reducing) appears to be quite heavy for OmniSci in the current master. For a table of 30 million rows and 10 columns, with 100 distinct values in each column in a range from zero to one billion, it takes about 135 seconds to perform a distinct count (112 kernels were used).

Reproducer with pydbe
import sys
import os
from timeit import default_timer as timer

sys.setdlopenflags(1 | 256)  # RTLD_LAZY+RTLD_GLOBAL
from omniscidbe import PyDbEngine

import numpy as np
import pyarrow as pa

def get_data():
    shape = (30_000_000, 10)
    nunique = 100
    max_value = 1_000_000_000

    nrows, ncols = shape
    data = {
        "col{}".format(i): np.tile(
            np.concatenate([np.arange(nunique - 1), [max_value]]), nrows // nunique
        )
        for i in range(ncols)
    }
    
    return pa.Table.from_pydict(data)

table = get_data()
print(f"Generated ({table.num_rows}, {table.num_columns}) shape table")

server = PyDbEngine(
    enable_union=1,
    enable_columnar_output=1,
    enable_lazy_fetch=0,
    null_div_by_zero=1,
    enable_watchdog=0,
    enable_debug_timer=1,
    log_directory="/localdisk/dchigare/repos/omnisci_log",
)
fragment_size = max(min(table.num_rows // os.cpu_count(), 2 ** 25), 2 ** 18)
server.importArrowTable("test_table", table, fragment_size=fragment_size)

cols = ", ".join([f"COUNT(distinct {col})" for col in table.column_names])

execution_time = timer()
record_batch = server.executeRA(f"SELECT {cols} FROM test_table").getArrowRecordBatch()
execution_time = timer() - execution_time

print(f"Got ({record_batch.num_rows}, {record_batch.num_columns}) shape table for {execution_time} sec.")

There are a few bottlenecks that result in such a long runtime:

BitMap allocation

Timing for BitMap allocation in a single kernel looks like this:

New thread(110)
  0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:95
  5ms start(1ms) fetchChunks Execute.cpp:2570
  83487ms start(6ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:676
    83487ms start(6ms) allocateCountDistinctBuffers QueryMemoryInitializer.cpp:655
      8678ms start(2357ms) allocateCountDistinctBuffer RowSetMemoryOwner.h:62
        8600ms start(2357ms) awaiting for a mutex... RowSetMemoryOwner.h:68
        78ms start(10958ms) allocateAndZero ArenaAllocator.h:110
          0ms start(10958ms) actual allocation ArenaAllocator.h:112
          78ms start(10958ms) memset zero ArenaAllocator.h:116
      8612ms start(11036ms) allocateCountDistinctBuffer RowSetMemoryOwner.h:62
        8532ms start(11036ms) awaiting for a mutex... RowSetMemoryOwner.h:68
        79ms start(19568ms) allocateAndZero ArenaAllocator.h:110
          0ms start(19568ms) actual allocation ArenaAllocator.h:112
          79ms start(19568ms) memset zero ArenaAllocator.h:116
      ... allocation for the rest 8 columns ...
  11ms start(83494ms) executePlanWithoutGroupBy Execute.cpp:2978
    11ms start(83494ms) launchCpuCode QueryExecutionContext.cpp:596
End thread(110)

(full log, run on 547f3b9)
The mutex it waits on for most of the kernel execution time is the state mutex of RowSetMemoryOwner:

auto allocator = allocators_[thread_idx].get();
std::lock_guard<std::mutex> lock(state_mutex_);
auto ret = reinterpret_cast<int8_t*>(allocator->allocateAndZero(num_bytes));
count_distinct_bitmaps_.emplace_back(
    CountDistinctBitmapBuffer{ret, num_bytes, /*physical_buffer=*/true});

allocateAndZero() just allocates memory and initializes it with zeros; the mutex lock is redundant in this case because each thread initializes a non-overlapping region:

void* allocateAndZero(const size_t size) {
  auto ret = allocate(size);
  std::memset(ret, 0, size);
  return ret;
}
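
A minimal sketch of the fix, reusing the member names from the snippets above: keep the state mutex only around the allocator call and the bookkeeping vector, and run the large memset outside of it, since every thread zeroes its own disjoint region.

int8_t* allocateCountDistinctBuffer(const size_t num_bytes,
                                    const size_t thread_idx) {
  int8_t* ret;
  {
    std::lock_guard<std::mutex> lock(state_mutex_);
    ret = reinterpret_cast<int8_t*>(
        allocators_[thread_idx]->allocate(num_bytes));
    count_distinct_bitmaps_.emplace_back(
        CountDistinctBitmapBuffer{ret, num_bytes, /*physical_buffer=*/true});
  }
  std::memset(ret, 0, num_bytes);  // zeroing is no longer serialized
  return ret;
}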

Freeing the memset from the lock, as sketched above, reduced overall execution time from 125 sec to 75 sec, but occasionally increased the time of the memset itself...

New thread(110)
0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:95
86ms start(0ms) fetchChunks Execute.cpp:2570
11433ms start(87ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:676
    11433ms start(87ms) allocateCountDistinctBuffers QueryMemoryInitializer.cpp:655
    1538ms start(87ms) allocateCountDistinctBuffer RowSetMemoryOwner.h:62
        0ms start(87ms) awaiting for the allocator' mutex RowSetMemoryOwner.h:68
        0ms start(87ms) actual allocation ArenaAllocator.h:104
        1537ms start(87ms) memset zero RowSetMemoryOwner.h:75
        0ms start(1625ms) awaiting for the vector' mutex RowSetMemoryOwner.h:81
    ... allocation for the rest 9 columns ...
11ms start(11520ms) executePlanWithoutGroupBy Execute.cpp:2978
    11ms start(11520ms) launchCpuCode QueryExecutionContext.cpp:596
End thread(110)

(full log, run on 691e3f2)
The amount of memory and the way it is allocated and initialized stayed the same, but the memset time changed from ~80 ms to ~1500 ms. I've rerun this several times and still get these numbers; is there a logical explanation?

Slow reduction

The second bottleneck, which takes most of the execution time, is the reduction. We do not have much info on this yet; investigation is needed...

60976ms start(12304ms) collectAllDeviceResults Execute.cpp:1990
    60976ms start(12304ms) reduceMultiDeviceResults Execute.cpp:935
        60976ms start(12304ms) reduceMultiDeviceResultSets Execute.cpp:981

(full log, run on 691e3f2)

Alternative implementation with std::set

OmniSci also provides a "slow" distinct count implementation via std::set (sketched below). In theory, for a small number of distinct values in a big range, it should work much better in terms of memory allocation, but it is unknown how long the reduction stage would take. We have no info on this.
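
A minimal sketch of how that path could look (names are illustrative): each kernel collects the distinct values it sees into its own std::set, the reduction is a set union, and the final count is the size of the merged set, so memory scales with the number of distinct values rather than with the value range.

#include <cstdint>
#include <set>
#include <vector>

int64_t reduce_count_distinct_sets(
    const std::vector<std::set<int64_t>>& per_kernel_sets) {
  std::set<int64_t> merged;
  for (const auto& s : per_kernel_sets) {
    merged.insert(s.begin(), s.end());  // reduction step
  }
  return static_cast<int64_t>(merged.size());
}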

P.S.

You can check my distinct_count_timers branch where I did the following:

  1. Placed more timers to track the distinct count execution flow (547f3b9)
  2. Freed the memset from the RowSetMemoryOwner state lock (691e3f2)
  3. Placed timing logs in the logs folder for both of these commits
