
clickhouse-connect's Introduction

ClickHouse Connect

A high-performance core database driver for connecting ClickHouse to Python, Pandas, and Superset

  • Pandas DataFrames
  • Numpy Arrays
  • PyArrow Tables
  • Superset Connector
  • SQLAlchemy 1.3 and 1.4 (limited feature set)

ClickHouse Connect currently uses the ClickHouse HTTP interface for maximum compatibility.

Installation

pip install clickhouse-connect

ClickHouse Connect requires Python 3.8 or higher.
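A minimal usage sketch for orientation (host and credentials are placeholders):

import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost', username='default', password='')

# command returns a single value for simple statements
print(client.command('SELECT version()'))

# query returns a result object with rows, column names, and summary information
result = client.query('SELECT number FROM system.numbers LIMIT 5')
print(result.result_rows)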

Superset Connectivity

ClickHouse Connect is fully integrated with Apache Superset. Previous versions of ClickHouse Connect utilized a dynamically loaded Superset Engine Spec, but as of Superset v2.1.0 the engine spec was incorporated into the main Apache Superset project and removed from clickhouse-connect in v0.6.0. If you have issues connecting to earlier versions of Superset, please use clickhouse-connect v0.5.25.

When creating a Superset Data Source, either use the provided connection dialog, or a SqlAlchemy DSN in the form clickhousedb://{username}:{password}@{host}:{port}.

SQLAlchemy Implementation

ClickHouse Connect incorporates a minimal SQLAlchemy implementation (without any ORM features) for compatibility with Superset. It has only been tested against SQLAlchemy versions 1.3.x and 1.4.x, and is unlikely to work with more complex SQLAlchemy applications.

Asyncio Support

ClickHouse Connect provides an async wrapper, so that it is possible to use the client in an asyncio environment. See the run_async example for more details.
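A minimal sketch, assuming the get_async_client factory available in recent releases (host and credentials are placeholders):

import asyncio

import clickhouse_connect


async def main():
    # get_async_client wraps the regular HTTP client for use with asyncio
    client = await clickhouse_connect.get_async_client(host='localhost', username='default', password='')
    result = await client.query('SELECT number FROM system.numbers LIMIT 3')
    print(result.result_rows)


asyncio.run(main())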

Complete Documentation

The documentation for ClickHouse Connect has moved to ClickHouse Docs

clickhouse-connect's People

Contributors

aastafiev, alexandrosandre, andy1xx8, barsoomx, biggerfisch, blinkov, captaincuddlecube, cwegener, dhruvit96, drew-talon, dylanmodesitt, elchyn-cheliabiyeu, eugenetorap, ewjoachim, feychenie, genzgd, guykoh, istrebitel-1, lesandie, martijnthe, mharrisb1, mshustov, notsimone, peter279k, ramlahaziz, reijnnn, slvrtrn, tema-popov, tstenson, wabscale


clickhouse-connect's Issues

Support upload for CSV files

Uploading CSV files is a popular way to import data into a DB.
The current implementation of the client requires a program to parse CSV file content before providing it to the insert method, which creates an unnecessary step.
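For reference, a minimal sketch of the workaround the current implementation requires: parse the CSV in Python and hand the rows to client.insert (file, table, and column names are hypothetical, and values arrive as strings unless converted):

import csv

import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost', username='default', password='')

with open('data.csv', newline='') as f:
    reader = csv.reader(f)
    column_names = next(reader)      # assume the first row is a header
    rows = list(reader)              # each row is a list of strings; convert types as needed

client.insert('my_table', rows, column_names=column_names)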

Question about format_query_value()

Hello.

I have a question: why does format_query_value() wrap its result in quotes ('')? This behaviour makes it impossible to build complex queries with optional (empty) WHERE parts. Would it be possible to change format_query_value() to return the result without quotes?

Thank you.

Setting send_progress_in_http_headers is not recognized by this ClickHouse server

dbt build --select tag:dbt_poc,tag:src_model -t src-poc --fail-fast 

16:40:35  Running with dbt=1.2.2
16:40:35  Partial parse save file not found. Starting full parse.
16:40:45  Found 539 models, 109 tests, 2 snapshots, 0 analyses, 594 macros, 0 operations, 13 seed files, 180 sources, 5 exposures, 0 metrics
16:40:45  
16:40:46  Encountered an error:
Database Error
  Setting send_progress_in_http_headers is not recognized by this ClickHouse server

Clickhouse version 21.12.4.1

The problem occurred on version v0.4.2, about 3 hours ago.

Querying DateTime64(1) columns results in exception

It seems that querying a DateTime64(1) column using the http_client.query_df (and probably any other query method) results in an exception as shown below. This does not happen with DateTime64(3) or DateTime64(6) columns. I verified that no exception is thrown if I remove the DateTime64(1) column from the select statement.

WARNING:clickhouse_connect.driver.httputil:Closed HTTP response due to unexpected exception
Traceback (most recent call last):
  File "/home/gpeev/.tox/default/lib/python3.9/site-packages/clickhouse_connect/driver/transform.py", line 42, in get_block
    column = col_type.read_column(source, num_rows, context)
  File "/home/gpeev/.tox/default/lib/python3.9/site-packages/clickhouse_connect/datatypes/base.py", line 128, in read_column
    column = self.read_column_data(source, num_rows, ctx)
  File "/home/gpeev/.tox/default/lib/python3.9/site-packages/clickhouse_connect/datatypes/base.py", line 143, in read_column_data
    return self._read_column_binary(source, num_rows, ctx)
  File "/home/gpeev/.tox/default/lib/python3.9/site-packages/clickhouse_connect/datatypes/temporal.py", line 151, in _read_binary_naive
    return numpy_conv.read_numpy_array(source, self.np_type, num_rows)
  File "clickhouse_connect/driverc/npconv.pyx", line 13, in clickhouse_connect.driverc.npconv.read_numpy_array
ValueError: cannot create an OBJECT array from memory buffer

Python version: 3.9
clickhouse-connect version: 0.5.7

JSON data insertion doesn't seem to work

Whenever passing in native JSON data, we get an exception:
[screenshot of the exception]

This occurs because, in the lower-level write_native_data method, the v var is a string, which is being appended to a byte array.
[screenshot of the relevant code]

I fixed this locally by encoding the v var into utf-8. Do I misunderstand something in the insertion process that I need to correct in the client-level code?

If it helps to debug, the JSON library chosen by clickhouse-connect is ujson.
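To illustrate the failure mode in plain Python (this is not the library's actual code, only the str-versus-bytes behavior described above):

dest = bytearray()
v = '{"key": "value"}'        # a JSON value that arrives as a Python string

try:
    dest += v                 # appending a str to a bytearray fails
except TypeError as err:
    print(err)

dest += v.encode('utf-8')     # the local fix: encode to bytes first
print(bytes(dest))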

When no rows are selected, query_df returns empty list, not empty DataFrame

The docstring of the method promises "Pandas dataframe representing the result set," yet an empty list is returned if the select statement yields no rows. Perhaps an empty dataframe would be a better return value, even better if it could have appropriate columns.

clickhouse-connect version: 0.5.7

Simplify Inserts of Default Values/Dictionaries

Currently the Client insert method requires the caller to create a complete, populated dataset, including any default values. There should be a mechanism where missing/default values are automatically supplied for None values when inserting into non-nullable columns.

This is closely related to an interface which allows building an insert of Map/Dictionaries using the map/dict keys to determine column names. Missing keys would use the ClickHouse default value.

Determining "non-standard" default values from the "Describe Table" query to ClickHouse is somewhat challenging, as the default expression is a string that ClickHouse Connect would have to evaluate.

Question: how to stream data

cat x.py
import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost', username='default', password='')

result = client.query('SELECT toString(cityHash64(number)) FROM  numbers(10000000000) where intDiv(1,number-100000)>-1000000000000000')
for i in result.result_rows:
    print( i )
python3 x.py|tail
('17933741768352543064',)
('11086356212394800338',)
('12749720657102764978',)
('10098974843304845854',)
('9676786211223313804',)
('17244299265287719294',)
('2925889024054896551',)
('7692628340459912383',)
('738548801493469449',)
('14340599725870999718',)

The same query in the ClickHouse client:

clickhouse :) SELECT toString(cityHash64(number)) FROM  numbers(10000000000) where intDiv(1,number-100000)>-1000000000000000;

65409 rows in set. Elapsed: 0.035 sec.

Received exception from server (version 23.1.2):
Code: 153. DB::Exception: Received from localhost:9000. DB::Exception: Division by zero: while executing 'FUNCTION intDiv(1 :: 1, minus(number, 100000) :: 4) -> intDiv(1, minus(number, 100000)) Int8 : 2'. (ILLEGAL_DIVISION)
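A sketch of how streaming can be done, assuming the query_row_block_stream method available in later releases of the HTTP client (the query is shortened from the example above):

import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost', username='default', password='')

# Blocks are read from the HTTP response as they arrive instead of materializing
# the whole result set in memory
with client.query_row_block_stream(
        'SELECT toString(cityHash64(number)) FROM numbers(1000000)') as stream:
    for block in stream:
        for row in block:
            print(row)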

AttributeError: 'tuple' object has no attribute 'dtype'

The above exception is thrown in any query that includes a LowCardinality(String) column. Regular String columns (including Nullable(String)) work fine. This problem does not happen with version 0.5.3.

Python version: 3.9
clickhouse_connect version: 0.5.4

Stack trace:

Closed HTTP response due to unexpected exception
Traceback (most recent call last):
  File "/home/gpeev/test.py", line 8, in <module>
    df = client.query_df("select ticker from vendor_data limit 10")
  File "/home/gpeev/venv/3.9/lib/python3.9/site-packages/clickhouse_connect/driver/client.py", line 255, in query_df
    return self._context_query(locals(), use_numpy=True).df_result
  File "/home/gpeev/venv/3.9/lib/python3.9/site-packages/clickhouse_connect/driver/client.py", line 583, in _context_query
    return self._query_with_context((self.create_query_context(**kwargs)))
  File "/home/gpeev/venv/3.9/lib/python3.9/site-packages/clickhouse_connect/driver/httpclient.py", line 195, in _query_with_context
    query_result = self._transform.parse_response(byte_source, context)
  File "/home/gpeev/venv/3.9/lib/python3.9/site-packages/clickhouse_connect/driver/transform.py", line 50, in parse_response
    first_block = get_block()
  File "/home/gpeev/venv/3.9/lib/python3.9/site-packages/clickhouse_connect/driver/transform.py", line 42, in get_block
    column = col_type.read_column(source, num_rows, context)
  File "/home/gpeev/venv/3.9/lib/python3.9/site-packages/clickhouse_connect/datatypes/base.py", line 128, in read_column
    column = self.read_column_data(source, num_rows, ctx)
  File "/home/gpeev/venv/3.9/lib/python3.9/site-packages/clickhouse_connect/datatypes/base.py", line 140, in read_column_data
    return self._read_low_card_column(source, num_rows, ctx)
  File "/home/gpeev/venv/3.9/lib/python3.9/site-packages/clickhouse_connect/datatypes/base.py", line 223, in _read_low_card_column
    if ctx.use_numpy and not (self.nullable and use_none) and keys.dtype != np.object_:
AttributeError: 'tuple' object has no attribute 'dtype'

Support UInt64 values to stay unsigned

Hi,

I noticed in the docs that, to better integrate with Superset, UInt64 values are interpreted as Int64 values, causing values greater than 9223372036854775807 (2^63 - 1) to appear negative.
In some use cases, like fetching a user_id, it is confusing to get a negative value and have to convert it back to unsigned.
Would it be possible to implement an option in the query method to disable the conversion?

Thanks in advance.
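If the driver's read-format API works as documented, a sketch like the following should keep UInt64 values unsigned; set_read_format and the 'native' format name are assumptions taken from the driver documentation, so verify them against your installed version:

import clickhouse_connect
from clickhouse_connect.datatypes.format import set_read_format

# Assumed API: ask the driver to return UInt64 values as native unsigned integers
set_read_format('UInt64', 'native')

client = clickhouse_connect.get_client(host='localhost', username='default', password='')
result = client.query('SELECT toUInt64(18446744073709551615) AS big')
print(result.result_rows)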

How to properly pass values?

I have
parameters = {'hash_id': hash_id, 'tbl_name': tbl_name}
and a query like
client.query('select %s from %s', parameters)
And I get the error: TypeError: not enough arguments for format string
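For reference, a sketch of the two parameter-binding styles the driver documents; identifiers such as table names generally need server-side binding with an Identifier type rather than plain value substitution (table and column names below are hypothetical):

import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost', username='default', password='')

# Client-side binding of values uses Python %(name)s placeholders
parameters = {'hash_id': 'abc123'}
result = client.query('SELECT * FROM events WHERE hash_id = %(hash_id)s', parameters=parameters)

# Server-side binding uses {name:Type} placeholders; an Identifier parameter can
# stand in for a table or column name
parameters = {'tbl_name': 'events'}
result = client.query('SELECT count() FROM {tbl_name:Identifier}', parameters=parameters)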

Streaming queries buffering?

It would be great to understand exactly how the streaming interface behaves in clickhouse-connect.

Does it request the next block from clickhouse-server on each iteration over stream_column_blocks()?
If so, it would be possible to keep a buffer of preloaded blocks in client memory, with a setting to control the size of that buffer in rows or blocks.

SSL: CERTIFICATE_VERIFY_FAILED with 0.4.8 version and Superset 2.0.0

Hi all,

We are using clickhouse-connect with a custom Superset 2.0.0 docker image, and since updating to version 0.4.8 of clickhouse-connect we are unable to establish a connection to a ClickHouse server (version 2.11) with SSL enabled using a self-signed certificate.

We were using the 0.4.7 version without problems by specifying the 'ca_cert' path in the URL connection params:

clickhousedb+connect://database:XXXXXX@ch:8443/database?ca_cert=%2Fusr%2Flocal%2Fshare%2Fca-certificates%2Fcustom_ca.crt&secure=true

After updating the clickhouse-connect package to version 0.4.8 (either by rebuilding the image, or by updating it directly via pip inside a container) we started getting the following error:

2023-01-11 21:00:02,539:DEBUG:urllib3.connectionpool:Starting new HTTPS connection (128): ch:8443
2023-01-11 21:00:02,546:ERROR:clickhouse_connect.driver.httpclient:Unexpected Http Driver Exception
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 699, in urlopen
httplib_response = self._make_request(
File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 382, in _make_request
self._validate_conn(conn)
File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 1010, in validate_conn
conn.connect()
File "/usr/local/lib/python3.8/site-packages/urllib3/connection.py", line 411, in connect
self.sock = ssl_wrap_socket(
File "/usr/local/lib/python3.8/site-packages/urllib3/util/ssl
.py", line 449, in ssl_wrap_socket
ssl_sock = ssl_wrap_socket_impl(
File "/usr/local/lib/python3.8/site-packages/urllib3/util/ssl
.py", line 493, in _ssl_wrap_socket_impl
return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
File "/usr/local/lib/python3.8/ssl.py", line 500, in wrap_socket
return self.sslsocket_class._create(
File "/usr/local/lib/python3.8/ssl.py", line 1040, in _create
self.do_handshake()
File "/usr/local/lib/python3.8/ssl.py", line 1309, in do_handshake
self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/requests/adapters.py", line 439, in send
resp = conn.urlopen(
File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 755, in urlopen
retries = retries.increment(
File "/usr/local/lib/python3.8/site-packages/urllib3/util/retry.py", line 574, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='ch', port=8443): Max retries exceeded with url: / (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)')))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/clickhouse_connect/driver/httpclient.py", line 308, in _raw_request
response: Response = self.session.request(method, self.url,
File "/usr/local/lib/python3.8/site-packages/requests/sessions.py", line 542, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python3.8/site-packages/requests/sessions.py", line 655, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python3.8/site-packages/requests/adapters.py", line 514, in send
raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='ch', port=8443): Max retries exceeded with url: / (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)')))
SupersetErrorsException
Traceback (most recent call last):
File "/app/superset/databases/commands/test_connection.py", line 123, in run
raise DBAPIError(None, None, None)
sqlalchemy.exc.DBAPIError: (builtins.NoneType) None
(Background on this error at: https://sqlalche.me/e/14/dbapi)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 1516, in full_dispatch_request
rv = self.dispatch_request()
File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 1502, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
File "/usr/local/lib/python3.8/site-packages/flask_appbuilder/security/decorators.py", line 89, in wraps
return f(self, *args, **kwargs)
File "/app/superset/views/base_api.py", line 114, in wraps
raise ex
File "/app/superset/views/base_api.py", line 111, in wraps
duration, response = time_function(f, self, *args, **kwargs)
File "/app/superset/utils/core.py", line 1566, in time_function
response = func(*args, **kwargs)
File "/app/superset/utils/log.py", line 244, in wrapper
value = f(*args, **kwargs)
File "/app/superset/views/base_api.py", line 84, in wraps
return f(self, *args, **kwargs)
File "/app/superset/databases/api.py", line 709, in test_connection
TestConnectionDatabaseCommand(item).run()
File "/app/superset/databases/commands/test_connection.py", line 148, in run
raise DatabaseTestConnectionFailedError(errors) from ex
superset.databases.commands.exceptions.DatabaseTestConnectionFailedError: [SupersetError(message='(builtins.NoneType) None\n(Background on this error at: https://sqlalche.me/e/14/dbapi)', error_type=<SupersetErrorType.GENERIC_DB_ENGINE_ERROR: 'GENERIC_DB_ENGINE_ERROR'>, level=<ErrorLevel.ERROR: 'error'>, extra={'engine_name': 'ClickHouse Connect', 'issue_codes': [{'code': 1002, 'message': 'Issue 1002 - The database returned an unexpected error.'}]})]
2023-01-11 21:00:02,553:WARNING:superset.views.base:SupersetErrorsException
Traceback (most recent call last):
File "/app/superset/databases/commands/test_connection.py", line 123, in run
raise DBAPIError(None, None, None)
sqlalchemy.exc.DBAPIError: (builtins.NoneType) None
(Background on this error at: https://sqlalche.me/e/14/dbapi)

We tried three different ways to specify the CA certificate, resulting in the same error:

  • Specify the 'ca_cert' param in the URL as described above.
  • Insert the CA cert contents in the Advanced > Security > Root Cert field from the Superset DB connection wizard.
  • Trust the CA cert at system level.

All three of these approaches work with the 0.4.7 version, but not with 0.4.8 (installed from the pip package repository).

Thank you.
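For comparison, a sketch of configuring the certificate directly on the Python client rather than through the DSN; the ca_cert and verify arguments are taken from the driver's HTTP options, and host, credentials, and paths are placeholders:

import clickhouse_connect

# Trust the self-signed CA explicitly
client = clickhouse_connect.get_client(
    host='ch',
    port=8443,
    username='database',
    password='XXXXXX',
    secure=True,
    ca_cert='/usr/local/share/ca-certificates/custom_ca.crt',
)

# Or, for testing only, disable certificate verification entirely
insecure_client = clickhouse_connect.get_client(
    host='ch', port=8443, username='database', password='XXXXXX',
    secure=True, verify=False,
)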

Queries with "like" are failing with AssertionError

Getting the below error when using "like" in queries with a pattern such as '%abc/def%'. The same query with like '%' or like '%abc%' works fine. There is no error from ClickHouse for any of these patterns; the same query runs in clickhouse-cli.
Query 100503: <class 'AssertionError'>
Traceback (most recent call last):
  File "/app/superset/sql_lab.py", line 273, in execute_sql_statement
    db_engine_spec.execute(cursor, sql, async_=True)
  File "/app/superset/db_engine_specs/base.py", line 1261, in execute
    raise cls.get_dbapi_mapped_exception(ex)
  File "/app/superset/db_engine_specs/base.py", line 1259, in execute
    cursor.execute(query)
  File "/app/superset_home/.local/lib/python3.8/site-packages/clickhouse_connect/dbapi/cursor.py", line 45, in execute
    query_result = self.client.query(operation, parameters)
  File "/app/superset_home/.local/lib/python3.8/site-packages/clickhouse_connect/driver/httpclient.py", line 172, in query
    data_result = self.parse_response(response.content, use_none)
  File "/app/superset_home/.local/lib/python3.8/site-packages/clickhouse_connect/driver/native.py", line 39, in parse_response
    column, loc = col_type.read_native_column(source, loc, num_rows, use_none=use_none)
  File "/app/superset_home/.local/lib/python3.8/site-packages/clickhouse_connect/datatypes/base.py", line 109, in read_native_column
    loc = self.read_native_prefix(source, loc)
  File "/app/superset_home/.local/lib/python3.8/site-packages/clickhouse_connect/datatypes/base.py", line 96, in read_native_prefix
    assert v == low_card_version
AssertionError

2022-07-06 10:08:06,920:ERROR:superset.sql_lab:Query 100503: <class 'AssertionError'>

[SupersetError(message='clickhousedb error: ', error_type=<SupersetErrorType.GENERIC_DB_ENGINE_ERROR: 'GENERIC_DB_ENGINE_ERROR'>, level=<ErrorLevel.ERROR: 'error'>, extra={'engine_name': 'ClickHouse Connect', 'issue_codes': [{'code': 1002, 'message': 'Issue 1002 - The database returned an unexpected error.'}]})]

Traceback (most recent call last):
  File "/app/superset/views/base.py", line 208, in wraps
    return f(self, *args, **kwargs)
  File "/app/superset/utils/log.py", line 244, in wrapper
    value = f(*args, **kwargs)
  File "/app/superset/views/core.py", line 2415, in sql_json
    command_result: CommandResult = command.run()
  File "/app/superset/sqllab/command.py", line 114, in run
    raise ex
  File "/app/superset/sqllab/command.py", line 98, in run
    status = self._run_sql_json_exec_from_scratch()
  File "/app/superset/sqllab/command.py", line 151, in _run_sql_json_exec_from_scratch
    raise ex
  File "/app/superset/sqllab/command.py", line 146, in _run_sql_json_exec_from_scratch
    return self._sql_json_executor.execute(
  File "/app/superset/sqllab/sql_json_executer.py", line 111, in execute
    raise SupersetErrorsException(
superset.exceptions.SupersetErrorsException: [SupersetError(message='clickhousedb error: ', error_type=<SupersetErrorType.GENERIC_DB_ENGINE_ERROR: 'GENERIC_DB_ENGINE_ERROR'>, level=<ErrorLevel.ERROR: 'error'>, extra={'engine_name': 'ClickHouse Connect', 'issue_codes': [{'code': 1002, 'message': 'Issue 1002 - The database returned an unexpected error.'}]})]
2022-07-06 10:08:06,941:WARNING:superset.views.base:[SupersetError(message='clickhousedb error: ', error_type=<SupersetErrorType.GENERIC_DB_ENGINE_ERROR: 'GENERIC_DB_ENGINE_ERROR'>, level=<ErrorLevel.ERROR: 'error'>, extra={'engine_name': 'ClickHouse Connect', 'issue_codes': [{'code': 1002, 'message': 'Issue 1002 - The database returned an unexpected error.'}]})]

insert None to DEFAULT defined column error

here is the table.

CREATE TABLE stock_list (code String, list_date Date, delist_date Date DEFAULT toDate('1970-01-01')) ENGINE ReplacingMergeTree ORDER BY code

This does not work if None is included:

client.insert('stock_list', [['000001', pd.to_datetime('2019-10-10'), None]])

This works when None is omitted, and the delist_date value is the default 1970-01-01 in the table.

client.insert('stock_list_tushare', [['000001', pd.to_datetime('2019-10-10')]], column_names=['code', 'list_date'])

client.insert_df has the same problem; neither None nor pd.NaT works.

Is there any mistake in my code?

Multiple hosts in connection string

Hello! psycopg2 supports multiple hosts in the connection string:

postgresql+psycopg2://user:password@/dbname?host=HostA:PortA&host=HostB&host=HostC:PortC

Does clickhouse-connect support multiple hosts? If not, can we add it?

HTTP driver does not respect the `query_limit` parameter on macOS

Test code:

import clickhouse_connect

client = clickhouse_connect.get_client(
    host='<clickhouse_host>',
    user='default',
    password='<pwd>',
    secure=False,
    database='<schema>',
    query_limit=0,
)
result = client.query('select name from system.databases')
print(result.summary)

On macOS 11.6.8 with Python 3.8.12 it prints:

{'read_rows': '8', 'read_bytes': '1220', 'written_rows': '0', 'written_bytes': '0', 'total_rows_to_read': '0'}

On Linux (docker image python:3.8.1-slim-buster) it prints:

{'read_rows': '0', 'read_bytes': '0', 'written_rows': '0', 'written_bytes': '0', 'total_rows_to_read': '0'}

Best practices with flask

Good Evening!
Thank you for the package. It seems very well written and everything works so far.
I am currently working on a project with Flask and I am unsure how to integrate it properly.
I have three possible implementation paths:

  • Connection Pooling
  • Create a new client on each request
  • Create a singleton for all requests

Maybe we can discuss this and then build another package on top of it, or write a small section in the readme, as I believe this is a common use case.
Please let me know what you think. Thank you!
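As a starting point for the discussion, a minimal sketch of the single shared client approach; whether one client can safely be shared across request threads should be confirmed against the driver documentation for your version:

import clickhouse_connect
from flask import Flask

app = Flask(__name__)

# One client for the whole application, created at startup and reused by every request.
# The HTTP client keeps an internal connection pool, so this avoids a new handshake
# per request.
ch_client = clickhouse_connect.get_client(host='localhost', username='default', password='')


@app.route('/databases')
def databases():
    result = ch_client.query('SELECT name FROM system.databases')
    return {'databases': [row[0] for row in result.result_rows]}


if __name__ == '__main__':
    app.run()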

http cannot support large data select request

Using HTTP to connect to ClickHouse is considerably slower than connecting over the native port 9000 like clickhouse-client or clickhouse_driver.Client.
I tried to query the same table for 5000 lines of data using the clickhouse_connect client and our old connection using clickhouse_driver.Client.
clickhouse_driver.Client takes about 15.1s to get all the data back into a dataframe.
clickhouse_connect takes 13min 22s.
[profiler screenshot]
I tried to stop the query partway through; almost all of the time is spent in the socket recv buffer.
[profiler screenshot]
Please consider using port 9000 directly instead of HTTP to improve select performance.

NULL value in Nullable(String) column returned as empty string

The clickhouse_connect.driver.httpclient.HttpClient.query_df method seems to return empty strings instead of None where the corresponding value in the ClickHouse table's Nullable(String) column is NULL.

Python version: 3.9
clickhouse-connect version: 0.5.7

Improve numpy and Pandas query performance

Currently building a Pandas or numpy query result requires two copies of the data -- the first in native Python format (in this case, in Python array.array objects), and the second to construct the numpy array (which is also used for most Pandas data structures). By reading the string/integer/float values directly off the wire into a numpy array, it should be possible to bring the performance of queries returning a numpy array or Pandas DataFrame close to that of queries that return Python objects.

Polars support

Hello, thank you for this great library!
Is it planned to support the Polars library at some point?
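Until first-class support exists, one possible bridge is to go through Arrow, since the driver already returns PyArrow tables (a sketch, assuming polars is installed):

import clickhouse_connect
import polars as pl

client = clickhouse_connect.get_client(host='localhost', username='default', password='')

# query_arrow returns a PyArrow Table, which Polars can wrap, usually without copying
arrow_table = client.query_arrow('SELECT number, toString(number) AS s FROM numbers(10)')
df = pl.from_arrow(arrow_table)
print(df)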

Wrong results for LIMIT N BY ... in Apache Superset

Very strange results using ORDER BY + LIMIT BY

Software:
ClickHouse server v.22.1.3 rev.54455 (in docker)
clickhouse-connect v.0.2.10
superset (in docker)

Data preparation:
create table default.xxx
(day_id Date,
attr1 int,
attr2 int,
sale Decimal64(2),
updated_dt Datetime
)
ENGINE = ReplacingMergeTree(updated_dt) PARTITION BY toYYYYMMDD(day_id)
ORDER BY (day_id, attr1, attr2)

insert into xxx
values( '2022-07-01', 1, 1, 100, now())
insert into xxx
values( '2022-07-01', 1, 1, 101, now())
insert into xxx
values( '2022-07-02', 1, 1, 200, now())
insert into xxx
values( '2022-07-02', 1, 2, 201, now())
insert into xxx
values( '2022-07-03', 1, 1, 200, now())
insert into xxx
values( '2022-07-03', 1, 2, 201, now())

Query1 in SQLLab in superset:

select day_id, updated_dt, sum(sale)
from default.xxx
where day_id between '2022-07-01' and '2022-07-02'
group by day_id, updated_dt
order by updated_dt DESC
limit 1 by day_id

Results1 :
results1.csv

Query2 in SQLLab in superset:

select * from (
select day_id, updated_dt, sum(sale)
from default.xxx
where day_id between '2022-07-01' and '2022-07-02'
group by day_id, updated_dt
order by updated_dt DESC
limit 1 by day_id
)
order by day_id

Results2 :
results2.csv

Query3 in SQLLab in superset:

select day_id, updated_dt, sum(sale)
from default.xxx
where day_id = '2022-07-01'
group by day_id, updated_dt
order by updated_dt desc
limit 1 by day_id

Results 3:
results3.csv

v0.3.2 SHOW CREATE TABLE doesn't fetch values

Since v0.3.2, SHOW CREATE TABLE doesn't fetch any values via dbt load_result()


Steps to reproduce:

Macros:

{% macro show_create_table2(relation) %}
    {% call statement('show_create_table', fetch_result=True) %}
        SELECT
            create_table_query AS ddl
        FROM system.tables
        WHERE
            name = '{{ relation.identifier }}' AND
            database = '{{ relation.schema }}'
    {% endcall %}

    {% do return(load_result('show_create_table').table) %}
{% endmacro %}


{% macro show_create_table3(relation) %}
    {% call statement('show_create_table', fetch_result=True) %}
        SHOW CREATE TABLE {{ relation.schema }}.{{ relation.identifier }}
    {% endcall %}

    {% do return(load_result('show_create_table').table) %}
{% endmacro %}

then in python console:

import os
import time
from clickhouse_connect.driver import HttpClient
from dbt.adapters.clickhouse import ClickHouseAdapter, ClickHouseRelation
from dbt.adapters.factory import get_adapter_by_type
from dbt.contracts.connection import Connection
from dbt.contracts.relation import Path
from dbt.lib import get_dbt_config
from dbt.tests.util import get_connection
config = get_dbt_config('/usr/app/')
ch_adapter = ClickHouseAdapter(config)
adapter = get_adapter_by_type(ch_adapter.type())
with get_connection(adapter) as conn:
    rel=adapter.get_relation(None, 'internal_analytics_src', 'src_dsas_datasources')
t1=adapter.execute_macro('show_create_table2', kwargs={"relation":rel})
t2=adapter.execute_macro('show_create_table3', kwargs={"relation":rel})

t2 is empty here; it should be a table.
[screenshot]

I assume this is due to "Handle queries that are actually commands, simply and clean up tests".

But this command should load a table value, and it's counter-intuitive that it doesn't now.

Support Native Interface

The native interface still retains certain advantages over HTTP, including better compression and progress packets. It would be nice to support both options. At this point this would mean:

  • creating a nativeclient subclass and implementing the required methods
  • building a transport layer that converted incoming native protocol payloads to byte blocks and sent the same generated payload we now use over HTTP

However, this looks like a lot less work now than it did last March.

Bug: Error message should display utf8 correctly

Reference

dbt-labs/dbt-core#5375 (comment)

Expected behavior

The error message should show the Chinese character as-is.

What actually happens

I got something like \xe5\xa5\xbd

How to Reproduce Error

import clickhouse_connect

client = clickhouse_connect.get_client(
  host='ec2-xx-131-225-xxx.cn-north-1.compute.amazonaws.com.cn', 
  port=8123, 
  username='default',
  password="[hide]",
  database='mp_production',
)

query_result = client.query("""
  SELECT * FROM apply_applications where id = '好'
""")

print (query_result.result_set) 

The SQL here is wrong, but the point is that the error message should display 好, not \xe5\xa5\xbd.


Insert of Pandas DataFrame loses nanoseconds

If inserting a Pandas Timestamp object with nanosecond precision into a ClickHouse column with datatype DateTime64(9) (which should have nanosecond precision), the nanosecond value is lost. The root cause is that the underlying transformation to a ClickHouse type for a Pandas Timestamp includes a conversion to a Python datetime.datetime, which only has microsecond precision.

Query performance degrades when using gzip compression

Using @gingerwizard's favorite query, ClickHouse Connect performs pretty well with compression disabled:

client = clickhouse_connect.get_client(query_limit=0, send_progress=False, compress=False)
start = time.time()
qr = client.query('SELECT number FROM numbers(500000000)', column_oriented=True)
rc = 0
for x in qr.stream_blocks():
    rc += len(x[0])
print (rc)
print (time.time() - start)

500000000
4.34486198425293

Unfortunately the same query with compress=True is almost 10x slower:

500000000
39.716081857681274

Profiling the two use cases shows that enabling compression causes a dramatic increase in the time spent in the socket.recv_into function (plus some overhead for ungzipping). We need to determine the source of this behavior. Until then, for very large queries we will recommend disabling compression in the documentation.

Error running SET Statement

SET Statement assigns value to the param setting for the current session, using the following syntax

SET param = value

However, when using it with client.query(), we get the following error.

client.query('SET input_format_csv_use_best_effort_in_schema_inference=0')
HTTPDriver url https://xxx.us-west-2.aws.clickhouse.cloud:8443/ returned response code 400)
Code: 62. DB::Exception: Syntax error: failed at position 61 ('FORMAT') (line 2, col 2): FORMAT Native. Expected end of query. (SYNTAX_ERROR) (version 22.10.1.15995 (official build))

Is there a way to get this to work, especially for settings that cannot be placed as part of the query statement but need to be enabled beforehand? e.g. SET allow_experimental_X=1
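Two documented alternatives may cover this case: pass the setting when creating the client, or per query via the settings argument (a sketch; the setting name is copied from the report and availability depends on the server version):

import clickhouse_connect

# Apply the setting to every request made by this client
client = clickhouse_connect.get_client(
    host='localhost', username='default', password='',
    settings={'input_format_csv_use_best_effort_in_schema_inference': 0},
)

# Or apply it for a single query only
result = client.query(
    'SELECT 1',
    settings={'input_format_csv_use_best_effort_in_schema_inference': 0},
)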

Weird rounding of floats without calculation

Given this scenario I would expect another outcome:

create table tmp_float_test
(
    flt Float32 default 0
)
    engine = MergeTree ORDER BY flt
        SETTINGS index_granularity = 8192;

INSERT INTO tmp_float_test(flt) VALUES (6.43);

SELECT flt from tmp_float_test ORDER BY flt ASC;

This gives flt: 6.43 as a result in DataGrip for example. But running this query in python gives another result:

data = clickhouse_client.query_dataframe("SELECT flt FROM tmp_float_test ORDER BY flt DESC;").to_dict("records")
print(data)  # [{'flt': 6.429999828338623}]

So is this a problem of the database itself, or a problem of this package? As you can see there are no calculations involved; I'm just selecting that value and it still breaks. I could apply rounding afterwards, but that shouldn't be necessary in the first place.
For decimals this doesn't happen, of course, so should I just switch to decimals and forget about floats altogether?
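This looks like Float32 precision rather than a driver problem: 6.43 has no exact 32-bit binary representation, which a quick check with numpy shows (a minimal sketch):

import numpy as np

value = np.float32(6.43)
print(value)                  # prints 6.43 because numpy rounds the displayed value
print(float(value))           # 6.429999828338623 -- the nearest representable Float32
print(float(value) == 6.43)   # False: the stored value differs from the 64-bit literal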

Streaming results question

I am looking for an opportunity to select many rows over HTTP similar to clickhouse-driver:

https://clickhouse-driver.readthedocs.io/en/latest/quickstart.html#streaming-results

and according the documentation:

https://github.com/clickhouse/clickhouse-connect#http-client-constructorinitialization-parameters

query_limit:int LIMIT value added to all queries.
Defaults to 5,000 rows. Unlimited queries are not supported to prevent crashing the driver

As far as I can tell, there is no way to do this with clickhouse-connect. Let me know if I am missing something.

Thank you and have a nice day.

Empty queries cause query_df to throw a ProgrammingError

Any query that results in 0 rows causes clickhouse_connect.driver.HttpClient.query_df to raise a ProgrammingError. Same query works fine if the result contains at least one row.

Python version: 3.9

clickhouse_connect version: 0.5.3 (possibly also 0.5.4)

Stack trace:

Traceback (most recent call last):
  File "/home/gpeev/test.py", line 9, in <module>
    df = client.query_df("select * from vendor_data where model_date = '2025-01-01'")
  File "/home/gpeev/venv/3.9/lib/python3.9/site-packages/clickhouse_connect/driver/client.py", line 242, in query_df
    return pandas_result(self.query(query,
  File "/home/gpeev/venv/3.9/lib/python3.9/site-packages/clickhouse_connect/driver/query.py", line 423, in pandas_result
    raise ProgrammingError('Pandas dataframes should only be constructed from column oriented query results')
clickhouse_connect.driver.exceptions.ProgrammingError: Pandas dataframes should only be constructed from column oriented query results 

insert_df has more undocumented side effects

I found two more problems related to #112 whereby the insert_df method is mutating its df parameter:

  • if df has a numpy.dtype('float64') column, it's converted to object
  • if df has a pandas.DatetimeTZDtype(tz='<some timezone>') column, it's converted to object

Inserts for Several Numpy Datatypes Fail

When inserting a Numpy array into ClickHouse using the client.insert method, the insert will fail if the destination column is a "big integer" type (larger than 64 bits), the destination column is a datetime type, or the insert contains NaN values.

Code: 354. DB::Exception: inflateReset failed: buffer error: While executing Native. (ZLIB_INFLATE_FAILED)

Using Python (pandas) to insert data into ClickHouse (24000 files (dataframes))

       df = pd.read_parquet(files[i], engine='fastparquet')
       df['t'] = pd.to_datetime(df['t'] * 1000000)    
       fname = os.path.basename(files[i])
       df['Ticker'] = fname[:len(fname)-8]
       df.dropna(inplace=True)
       client.insert_df('equities.m1m', df)

For some files (approx 1-2%) I (always) get the following error:
Code: 354. DB::Exception: inflateReset failed: buffer error: While executing Native. (ZLIB_INFLATE_FAILED) (version 22.13.1.1561 (official build))

I have checked the data for those files and everything looks OK, i.e. size-wise they are not big, the values are fine, etc.

How can I fix this error?

OS: Linux Mint 21.1 Cinnamon 5.6.5
Thanks

Incorrect DateTime values on applying timezone

clickhouse-connect does not take into account whether a timezone is specified in functions like toDateTime(<date>, <timezone>) and toTimezone(<date>, <timezone>).

Consider this table:

CREATE TABLE default.tbl
(
    `id` UInt32,
    `created_date` Nullable(DateTime())
)
ENGINE = MergeTree
ORDER BY id
SETTINGS index_granularity = 8192

Adding some data...

INSERT INTO default.tbl VALUES (1, toDateTime('2017-08-04 06:09:00'))

Querying through ClickHouse Play or Tabix:

SELECT toTimezone(created_date,'America/Chicago') AS created_date FROM default.tbl

Returns a different date, applying the timezone ✅

created_date
2017-08-04 01:09:00

Querying through Clickhouse connect:

client = clickhouse_connect.get_client(...)
client.query_df(query="SELECT toTimezone(created_date,'America/Chicago') AS created_date FROM default.tbl")

Returns the value without applying the timezone ❌

          created_date
0 2017-08-04 06:09:00

Alternatives tried that didn't work:

  1. Used a custom QueryContext with a different timezone through create_query_context(query=query_str).updated_copy()
  2. Tried editing the server_tz param on the client

Environment:
Clickhouse version: 23.1.3.5
Clickhouse connect version: 0.5.7
The application runs in a Docker image.

Any workarounds?
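One possible pandas-side workaround while this is investigated, assuming the returned timestamps are timezone-naive values in the column's base timezone (verify against your data before relying on it):

import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost', username='default', password='')
df = client.query_df("SELECT created_date FROM default.tbl")

# Treat the naive timestamps as UTC and convert them to the desired zone in pandas
df['created_date'] = df['created_date'].dt.tz_localize('UTC').dt.tz_convert('America/Chicago')
print(df)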

insert_df has undocumented side effects

The insert_df method can modify its df parameter, for example replacing pandas.NaT values with None. This is undocumented and unexpected behavior. Please either fix it or document it properly.

Python version: 3.9
clickhouse_connect version: 0.5.5
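A trivial defensive sketch until the behavior is fixed or documented: pass insert_df a copy so the caller's DataFrame is never mutated (table and columns are hypothetical):

import pandas as pd

import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost', username='default', password='')

df = pd.DataFrame({'code': ['000001'], 'list_date': [pd.Timestamp('2019-10-10')]})

# Any NaT/None replacement then happens on the copy, not on the caller's df
client.insert_df('stock_list', df.copy())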

get_client + small query_limit

>>> client = clickhouse_connect.get_client(send_progress=False, compress=False , query_limit=5000)

>>> client = clickhouse_connect.get_client(send_progress=False, compress=False , query_limit=100)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/.local/lib/python3.8/site-packages/clickhouse_connect/__init__.py", line 8, in get_client
    return create_client(**kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/clickhouse_connect/driver/__init__.py", line 59, in create_client
    return HttpClient(interface, host, port, username, password, database, settings=settings, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/clickhouse_connect/driver/httpclient.py", line 160, in __init__
    comp_setting = self.server_settings['enable_http_compression']
KeyError: 'enable_http_compression'

Also, you use query_limit for internal queries, like SELECT name, value, changed, description, type, readonly FROM system.settings limit 5000.

And by the way, I think you don't use the system.settings column description.

Support SQLAlchemy AggregateFunction type for Superset

Hi there!

I use ClickHouse for our BI (Superset). Sometimes we have data marts which contain AggregateFunction(xxx, yyy) columns. This is useful when we want to calculate aggregations at different levels (hourly, daily, etc.) but we don't want to store raw data.

I'd like to see such columns serialised in superset datasets. I won't use them as dimensions, filters, etc, but I'll create a metric like uniqMerge(aggregation_column) and will be able to compute aggregations at different levels as I wish without storing raw data.

Currently we can't import such datasets into Superset because of: clickhousedb error: AggregateFunction(uniq, UUID) deserialization not supported

My setup: clickhouse-connect==0.5.3 and superset==2.0.1

Part of table DDL (one of many):

CREATE TABLE table_with_aggregates
(
    hour        DateTime,
    platform    LowCardinality(String),
    count       SimpleAggregateFunction(sum, UInt64),
    users       AggregateFunction(uniq, UUID),
)

Improve Insert Performance

Insert data transformation is currently all in Python. This should be moved to C/Cython where expensive. It should also be possible to optimize Numpy/Pandas inserts by reading the Numpy buffer directly for int/float datatypes.

