
clickhouse-bulk's Introduction

ClickHouse-Bulk


Simple Yandex ClickHouse insert collector. It collects requests and sends them to ClickHouse servers.

Installation

Download the binary for your platform

or

Use the Docker image

or build from source (Go 1.13+):

git clone https://github.com/nikepan/clickhouse-bulk
cd clickhouse-bulk
go build

Features

  • Groups n requests and sends them to any of the ClickHouse servers
  • Sends collected data at a configurable interval
  • Tested with VALUES and TabSeparated formats
  • Supports multiple target servers
  • Supports passing the query in the query parameters or in the body
  • Supports other query parameters such as username, password, and database
  • Supports basic authentication

For example:

INSERT INTO table3 (c1, c2, c3) VALUES ('v1', 'v2', 'v3')
INSERT INTO table3 (c1, c2, c3) VALUES ('v4', 'v5', 'v6')

are combined and sent as

INSERT INTO table3 (c1, c2, c3) VALUES ('v1', 'v2', 'v3')('v4', 'v5', 'v6')

Options

  • -config - path to the config file (JSON); default config.json

Configuration file

{
  "listen": ":8124",
  "flush_count": 10000, // check by \n char
  "flush_interval": 1000, // milliseconds
  "clean_interval": 0, // how often cleanup internal tables - e.g. inserts to different temporary tables, or as workaround for query_id etc. milliseconds
  "remove_query_id": true, // some drivers sends query_id which prevents inserts to be batched
  "dump_check_interval": 300, // interval for try to send dumps (seconds); -1 to disable
  "debug": false, // log incoming requests
  "dump_dir": "dumps", // directory for dump unsended data (if clickhouse errors)
  "clickhouse": {
    "down_timeout": 60, // wait if server in down (seconds)
    "connect_timeout": 10, // wait for server connect (seconds)
    "tls_server_name": "", // override TLS serverName for certificate verification (e.g. in cases you share same "cluster" certificate across multiple nodes)
    "insecure_tls_skip_verify": false, // INSECURE - skip certificate verification at all
    "servers": [
      "http://127.0.0.1:8123"
    ]
  },
  "metrics_prefix": "prefix"
}

Environment variables (used by the docker image)

  • CLICKHOUSE_BULK_DEBUG - enable debug logging
  • CLICKHOUSE_SERVERS - comma-separated list of servers
  • CLICKHOUSE_FLUSH_COUNT - number of rows per insert
  • CLICKHOUSE_FLUSH_INTERVAL - insert interval
  • CLICKHOUSE_CLEAN_INTERVAL - internal tables clean interval
  • DUMP_CHECK_INTERVAL - interval for resending dumps
  • CLICKHOUSE_DOWN_TIMEOUT - wait time if a server is down
  • CLICKHOUSE_CONNECT_TIMEOUT - ClickHouse server connect timeout
  • CLICKHOUSE_TLS_SERVER_NAME - server name for TLS certificate verification
  • CLICKHOUSE_INSECURE_TLS_SKIP_VERIFY - skip certificate verification entirely
  • METRICS_PREFIX - prefix for Prometheus metrics

Quickstart

Run ./clickhouse-bulk and send queries to :8124.
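
A minimal client sketch, assuming the default listen address :8124 and the table3 from the example above: each request carries a regular INSERT, and clickhouse-bulk groups the rows before forwarding a single batched INSERT to ClickHouse.

package main

import (
	"log"
	"net/http"
	"strings"
)

func main() {
	// Hypothetical example: post one small INSERT; the collector batches it
	// with other requests before forwarding to ClickHouse.
	body := strings.NewReader("INSERT INTO table3 (c1, c2, c3) VALUES ('v1', 'v2', 'v3')")
	resp, err := http.Post("http://127.0.0.1:8124/", "text/plain", body)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("status:", resp.Status)
}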

Metrics

Check the main metrics manually: curl -s http://127.0.0.1:8124/metrics | grep "^ch_"

  • ch_bad_servers 0 - current count of bad servers
  • ch_dump_count 0 - dumps saved since launch
  • ch_queued_dumps 0 - dump files currently in the dump directory
  • ch_good_servers 1 - current count of good servers
  • ch_received_count 40 - requests received since launch
  • ch_sent_count 1 - requests sent since launch

Tips

For better performance, the keywords FORMAT and VALUES must be uppercase.

clickhouse-bulk's People

Contributors

dependabot[bot], dern3rd, fearlsgroove, khlystov, lenstr, nikepan, splichy, testwill, the-alchemist, tuxrace


clickhouse-bulk's Issues

Make an option to choose where logs are stored

Currently all logs go to syslog, which is not very convenient. Please add an option to choose where logs are stored (file, syslog, or no logging at all) and make it possible to set the log level (for example, I don't want to store lines like 'clickhouse-bulk[40680]: 2020/08/17 13:58:01 INFO: send 0 rows to ...').

Reducing log output

Hey there,

To install clickhouse-bulk on our server I added a systemd service for it. I noticed that we frequently run into issues with the journal log filling up the whole disk.
I've tried to reconfigure systemd-journald to limit disk usage, but because the clickhouse-bulk log produces so much output we then lose a lot of other data (the log file grows by more than 100 MB per hour).

Could we make the "sending x rows" and "sent x rows" messages configurable, so I can deactivate them and only log warn/error messages?

Thanks in advance

ERROR: server error (400) Wrong server status 400 after updating to ClickHouse server version 21.4.4.30 (official build)

Hi.
The proxy cannot execute queries and prints this log after the update; it worked fine before the update:

request: "INSERT INTO `display` (`uuid`,`user_id`,`app_uuid`,`uuid1`,`uuid2`,`created_at`) VALUES\n('7ecd41eb-58b6-44d2-ab59-01235bc32135',86,'00806453-89a0-4fd2-9f9f-2b012f45049e','0069f823-f901-48c6-b8bb-3d5a5d61d470','4264487b-fa40-47ae-939b-a492df46caaa','1618914155')"
2021/04/21 07:12:41.249482 INFO: sending 1 rows to http://192.168.88.1:8123 of INSERT INTO `display` (`uuid`,`user_id`,`app_uuid`,`uuid1`,`uuid1`,`created_at`) VALUES
2021/04/21 07:12:41.257645 INFO: sent 1 rows to http://192.168.88.1:8123 of INSERT INTO `display` (`uuid`,`user_id`,`app_uuid`,`uuid1`,`uuid1`,`created_at`) VALUES
2021/04/21 07:12:41.257817 ERROR: server error (400) Wrong server status 400:

Please take a look, it is very urgent for us.
Best Regards
Arthur

503 when I send queries to port 8124

Hi
I can't send queries to clickhouse-bulk.
I added clickhouse-bulk to my docker-compose:

clickhouse:
    image: yandex/clickhouse-server:21.1.2
    ports:
      - 8123:8123
      - 9090:9000

clickhouse-bulk:
    image: nikepan/clickhouse-bulk:1.3.3
    ports:
     - "8124:8124"
    environment:
     - CLICKHOUSE_SERVERS=http://0.0.0.0:8123

When I try to send queries to port 8124 I get a 503:

*   Trying 127.0.0.1:8124...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 8124 (#0)
> POST /?query=CREATE&DATABASE&IF&NO&EXISTS&test HTTP/1.1
> Host: 127.0.0.1:8124
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 503 Service Unavailable
< Content-Type: text/plain; charset=UTF-8
< Date: Mon, 26 Jul 2021 23:10:49 GMT
< Content-Length: 0
< 
* Connection #0 to host 127.0.0.1 left intact

Both containers are running and curl http://0.0.0.0:8124/metrics | grep "^ch_" works.
What am I doing wrong?

Incorrect ParseQuery logic

ParseQuery in collector.go incorrectly handles the case where there are params both before and after the query:

if eoq >= 0 {

It can produce incorrect results or panic: runtime error: slice bounds out of range, depending on the parameter lengths, apparently because eoq is an offset within the substring that follows query=, so the slice bounds on the full string must be shifted by i.

Instead of

if eoq >= 0 {
			q = queryString[i+6 : eoq+6]
			params = queryString[:i] + queryString[eoq+7:]
}

It should be:

if eoq >= 0 {
			q = queryString[i+6 : i+eoq+6]
			params = queryString[:i] + queryString[i+eoq+7:]
}

Example of problematic string:
queryString = "a=11111111111111111111111111111&query=insert into x format fmt&a=1"

Option to periodically flush collected data from memory to disk

This option could be helpful when the service might be killed by the OOM killer, an unexpected server reboot, etc. It would prevent losing all of the data collected in memory and guarantee delivery after the service recovers.
Useful new options:

  1. Enable flush to disk
  2. How often
  3. Retention policy

Could this feature be implemented?

Doesn't work with the phpClickHouse client

When making requests with https://github.com/smi2/phpClickHouse (which is the best library at the moment), it seems that clickhouse-bulk expects the auth configuration via query parameters, but the new version of the phpClickHouse lib authorizes via headers.

Unfortunately phpClickHouse doesn't provide a way to send custom query params, so I can't work around it by adding the username/password to the query string.

Is there any solution for this? Or could you update clickhouse-bulk to accept the parameters the new way as well?

One of the official ClickHouse auth methods uses the headers X-ClickHouse-Key and X-ClickHouse-User.

Here are all the options:

  1. Using HTTP Basic Authentication. Example:
    $ echo 'SELECT 1' | curl 'http://user:password@localhost:8123/' -d @-
  2. In the ‘user’ and ‘password’ URL parameters. Example:
    $ echo 'SELECT 1' | curl 'http://localhost:8123/?user=user&password=password' -d @-
  3. Using ‘X-ClickHouse-User’ and ‘X-ClickHouse-Key’ headers. Example:
    $ echo 'SELECT 1' | curl -H 'X-ClickHouse-User: user' -H 'X-ClickHouse-Key: password' 'http://localhost:8123/' -d @-

https://clickhouse.tech/docs/en/interfaces/http/
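
Until header authentication is supported natively, a possible workaround is to put a tiny reverse proxy in front of clickhouse-bulk that copies the X-ClickHouse-User / X-ClickHouse-Key headers into the user / password query parameters described above. A rough sketch in Go; the listen address and the clickhouse-bulk address are assumptions:

package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Assumed address of the clickhouse-bulk instance.
	target, err := url.Parse("http://127.0.0.1:8124")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(target)
	orig := proxy.Director
	proxy.Director = func(r *http.Request) {
		orig(r)
		q := r.URL.Query()
		// Map the ClickHouse auth headers onto the URL parameters.
		if u := r.Header.Get("X-ClickHouse-User"); u != "" {
			q.Set("user", u)
		}
		if p := r.Header.Get("X-ClickHouse-Key"); p != "" {
			q.Set("password", p)
		}
		r.URL.RawQuery = q.Encode()
	}
	// Point phpClickHouse at this address instead of at clickhouse-bulk directly.
	log.Fatal(http.ListenAndServe(":8125", proxy))
}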

High Memory Usage

I use the bulker in one of my projects via the binary. I receive logs from AWS Lambda and send them to the bulker. I noticed that memory usage increases linearly and never stabilizes. Do you have any clue what the reason could be?

Here is my config file:

  "listen": ":8124",
  "flush_count": 100000,
  "flush_interval": 5000,
  "dump_check_interval": 300,
  "debug": false,
  "dump_dir": "dumps",
  "clickhouse": {
    "down_timeout": 60,
    "connect_timeout": 10,
    "servers": [
      "http://X.X.X.X:8123",
      "http://X.X.X.X:8123",
      "http://X.X.X.X:8123",
      "http://X.X.X.X:8123"
    ]
  }
}

ERROR 502: No working clickhouse servers

{
  "listen": ":8123",
  "flush_count": 10000,
  "flush_interval": 3000,
  "debug": true,
  "dump_dir": "dumps",

  "clickhouse": {
    "down_timeout": 300,
    "servers": [
      "http://0.0.0.0:8070"
    ]
  }
}
2018/07/11 09:15:27 query query=Insert+into+Log_buffer+FORMAT+JSONEachRow&input_format_skip_unknown_fields=1 {"ts":"2018-07-11 09:15:27","level":"DEBUG","logger":"plugins.base_core","pid":19847,"procname":"wkr:1","file":"base_core.py:352","body":"Action start for '***********'","node":"US-2","jobid":"51399907","uid":"2","type":"monitor","plug":"*****"}
2018/07/11 09:15:27 Send ERROR 502: No working clickhouse servers

while a direct insert into CH works fine:

$ curl 0.0.0.0:8087
Ok.

🙏

P.S. Can't it handle the load?

Not working?

Hello. I wrote a simple test script in Go:

package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "strings"
)

func main() {
    fmt.Println("Hello, playground")
    for i := 0; i < 500; i++ {
        post(fmt.Sprintf("(%d)", i))
    }
    println("done")
}

func post(b string) {
    bod := strings.NewReader(b)
    req, err := http.NewRequest("POST", "http://127.0.0.1:8124/?query=INSERT%20INTO%20t%20VALUES", bod)
    if err != nil {
        panic(err)
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    _, err = ioutil.ReadAll(resp.Body)

    if err != nil {
        panic(err)
    }
}

I have the default params in the config.
I see these records in the log:

2019/11/18 18:14:07 DEBUG: query query=INSERT%20INTO%20t%20VALUES (493)
2019/11/18 18:14:07 DEBUG: query query=INSERT%20INTO%20t%20VALUES (494)
2019/11/18 18:14:07 DEBUG: query query=INSERT%20INTO%20t%20VALUES (495)
2019/11/18 18:14:07 DEBUG: query query=INSERT%20INTO%20t%20VALUES (496)
2019/11/18 18:14:07 DEBUG: query query=INSERT%20INTO%20t%20VALUES (497)
2019/11/18 18:14:07 DEBUG: query query=INSERT%20INTO%20t%20VALUES (498)
2019/11/18 18:14:07 DEBUG: query query=INSERT%20INTO%20t%20VALUES (499)
2019/11/18 18:14:08 INFO: send 500 rows to http://u:pass@ip:8123 of INSERT INTO t VALUES

But in CH I see:
curl 'some:8123?query=SELECT%20MAX(a)%20FROM%20t'
255
It looks like the first packet was sent twice:

250
251
252
253
254
255
0
1
2
3
4
5
6

Deadlocks while trying to write dumps and uploading them to database at the same time

There are two locks (FileDumper.mu and Clickhouse.mu) that are acquired in different orders.

When dumping files to disk, Clickhouse.Dump locks Clickhouse.mu, then FileDumper.Dump is called, which locks FileDumper.mu. At the same time, FileDumper.Listen calls ProcessNextDump while holding FileDumper.mu and then calls SendQuery->GetNextQuery, which locks Clickhouse.mu.

One potential solution is to remove the lock from Clickhouse.Dump, since it already locks FileDumper.mu in FileDumper.Dump.
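
For readers unfamiliar with the failure mode, here is a generic Go sketch (not the project's actual code) of the pattern described above: two goroutines taking the same pair of mutexes in opposite order, which can block each other forever.

package main

import "sync"

var dumperMu, clickhouseMu sync.Mutex

// Path 1 (like Clickhouse.Dump -> FileDumper.Dump): clickhouseMu first, then dumperMu.
func dumpToDisk() {
	clickhouseMu.Lock()
	defer clickhouseMu.Unlock()
	dumperMu.Lock() // may wait forever if resendDumps holds dumperMu and wants clickhouseMu
	defer dumperMu.Unlock()
	// ... write the dump file ...
}

// Path 2 (like FileDumper.Listen -> SendQuery): dumperMu first, then clickhouseMu.
func resendDumps() {
	dumperMu.Lock()
	defer dumperMu.Unlock()
	clickhouseMu.Lock() // may wait forever if dumpToDisk holds clickhouseMu and wants dumperMu
	defer clickhouseMu.Unlock()
	// ... send the dump to ClickHouse ...
}

func main() {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); dumpToDisk() }()
	go func() { defer wg.Done(); resendDumps() }()
	wg.Wait() // with unlucky timing the two goroutines deadlock on each other's mutex
}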

No working clickhouse servers

{
  "listen": ":8124",
  "flush_count": 10000,
  "flush_interval": 1000,
  "dump_check_interval": 300,
  "debug": false,
  "dump_dir": "dumps",
  "clickhouse": {
    "down_timeout": 60,
    "connect_timeout": 10,
    "servers": [
      "http://clickhouse:[email protected]:8123"
    ]
  }
}

$ curl -s http://127.0.0.1:8124/metrics | grep "^ch_"
ch_bad_servers 0
ch_dump_count 0
ch_good_servers 0
ch_queued_dumps 0
ch_received_count 0
ch_sent_count 0

$ curl http://clickhouse:[email protected]:8123
Ok.

It seems that it does not see the servers. How can I solve this problem?

Eternal cycle of requests with errors

For example, if I send an INSERT request with the wrong date format "2020-02-06 07:15:23.364727" (Ruby Time format), this request will never be executed and keeps wasting computation power.

Could you provide a solution for this?

For example: if the exception is not about the ClickHouse server connection, remove the bad request from the sending cycle into a separate dump file with the list of bad requests.

Very strange error on insert

Periodically I get either a 'database not found' error or an auth error.

-- auto-generated definition
create table test
(
id Int32,
name String
)
engine = MergeTree PARTITION BY id
PRIMARY KEY id
ORDER BY (id, name)
SETTINGS index_granularity = 8192;

⇨ http server started on [::]:8124
2021/09/10 21:26:10.503957 DEBUG: query INSERT INTO gc.test (id, name) VALUES (7, 'xcvbx')
2021/09/10 21:26:11.506905 INFO: sending 1 rows to http://10.0.10.141:8123 of INSERT INTO gc.test (id, name) VALUES
2021/09/10 21:26:11.521327 INFO: sent 1 rows to http://10.0.10.141:8123 of INSERT INTO gc.test (id, name) VALUES
2021/09/10 21:26:16.768973 DEBUG: query INSERT INTO gc.test (id, name) VALUES (8, 'xcvbx')
2021/09/10 21:26:17.504073 INFO: sending 1 rows to http://10.0.10.142:8123 of INSERT INTO gc.test (id, name) VALUES
2021/09/10 21:26:17.517043 INFO: sent 1 rows to http://10.0.10.142:8123 of INSERT INTO gc.test (id, name) VALUES
2021/09/10 21:26:17.517161 ERROR: Send (500) Wrong server status 500:
response: Code: 516, e.displayText() = DB::Exception: chtdidx: Authentication failed: password is incorrect or there is no user with such name (version 21.2.2.8 (official build))

request: "INSERT INTO gc.test (id, name) VALUES\n(8, 'xcvbx')"; response Code: 516, e.displayText() = DB::Exception: chtdidx: Authentication failed: password is incorrect or there is no user with such name (version 21.2.2.8 (official build))

2021/09/10 21:26:19.228692 DEBUG: query INSERT INTO gc.test (id, name) VALUES (8, 'xcvbx')
2021/09/10 21:26:19.508245 INFO: sending 1 rows to http://10.0.10.143:8123 of INSERT INTO gc.test (id, name) VALUES
2021/09/10 21:26:19.522896 INFO: sent 1 rows to http://10.0.10.143:8123 of INSERT INTO gc.test (id, name) VALUES
2021/09/10 21:26:19.523012 ERROR: Send (500) Wrong server status 500:
response: Code: 516, e.displayText() = DB::Exception: chtdidx: Authentication failed: password is incorrect or there is no user with such name (version 21.2.2.8 (official build))

request: "INSERT INTO gc.test (id, name) VALUES\n(8, 'xcvbx')"; response Code: 516, e.displayText() = DB::Exception: chtdidx: Authentication failed: password is incorrect or there is no user with such name (version 21.2.2.8 (official build))

2021/09/10 21:26:22.539692 DEBUG: query INSERT INTO gc.test (id, name) VALUES (8, 'xcvbx')
2021/09/10 21:26:23.503982 INFO: sending 1 rows to http://10.0.10.141:8123 of INSERT INTO gc.test (id, name) VALUES
2021/09/10 21:26:23.531772 INFO: sent 1 rows to http://10.0.10.141:8123 of INSERT INTO gc.test (id, name) VALUES

Many files in dumps directory

Hi!

Right now, on one of the instances there are 197 files in the dumps directory; they contain seemingly failed queries and it looks like this:

-rw-r--r--    1 app     app          5130 Nov 26 13:21 dump202311241214102-98-500.dmp
-rw-r--r--    1 app     app         12358 Nov 26 13:21 dump202311241214102-99-500.dmp
...
app@clickhouse-bulk-576cb9c658-h2dvx $ ls -1 /app/dumps | wc -l
197

What does clickhouse-bulk do with all these files, and should it remove them after resending?

Data is not sent with multiple ClickHouse servers

We have the following configuration:

{
  "listen": ":8124",
  "flush_count": 30000,
  "flush_interval": 1000,
  "debug": false,
  "dump_dir": "dumps",
  "clickhouse": {
    "down_timeout": 300,
    "servers": [
      "http://127.0.0.1:8123",
      "http://172.16.10.78:8123"
    ]
  }
}

ConnectTimeout option does not work properly

Hi,

The NewClickhouse method implements the wrong behavior for the ConnectTimeout option:

c.ConnectTimeout = connectTimeout
if c.ConnectTimeout > 0 {
    c.ConnectTimeout = 10
}

So if I set any positive value for ConnectTimeout, it will not be used but will be rewritten to 10 seconds.
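
Presumably the check is simply inverted; a minimal sketch of what the condition likely should be (a hypothetical helper, not the project's actual code):

// effectiveConnectTimeout returns the configured timeout in seconds,
// falling back to a 10-second default only when no positive value is given.
func effectiveConnectTimeout(connectTimeout int) int {
	if connectTimeout <= 0 {
		return 10
	}
	return connectTimeout
}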

ERROR: Send (503) No working clickhouse servers; response

Under load, this log periodically appears:

clickhouse-bulk_1 | 2021/03/05 11:03:10.847398 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.847752 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.847858 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.847954 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.848023 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.848081 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.848224 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.848425 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.848488 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.848615 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.848839 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.848950 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.849238 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.849771 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.850151 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.850361 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.850426 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.850513 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.850565 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:03:10.850753 ERROR: Send (503) No working clickhouse servers; response
clickhouse-bulk_1 | 2021/03/05 11:04:32.243796 INFO: sending 26 rows to http://default:root@11111111:8123 of INSERT INTO lkdn_profiles.employees (
clickhouse-bulk_1 | 2021/03/05 11:04:42.244128 ERROR: server down (502): Post http://default:***@11111111:8123: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
clickhouse-bulk_1 | 2021/03/05 11:04:42.244156 INFO: sending 26 rows to http://default:root@11111111:8123 of INSERT INTO lkdn_profiles.employees (
clickhouse-bulk_1 | 2021/03/05 11:04:52.244517 ERROR: server down (502): Post http://default:***@11111111:8123: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
clickhouse-bulk_1 | 2021/03/05 11:04:52.244552 INFO: sending 26 rows to http://default:root@11111111:8123 of INSERT INTO lkdn_profiles.employees (
clickhouse-bulk_1 | 2021/03/05 11:05:02.244919 ERROR: server down (502): Post http://default:***@11111111:8123: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
clickhouse-bulk_1 | 2021/03/05 11:05:02.244950 INFO: sending 26 rows to http://default:root@11111111:8123 of INSERT INTO lkdn_profiles.employees (
clickhouse-bulk_1 | 2021/03/05 11:05:12.245236 ERROR: server down (502): Post http://default:***@11111111:8123: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
clickhouse-bulk_1 | 2021/03/05 11:05:12.245261 INFO: sending 26 rows to http://default:root@11111111:8123 of INSERT INTO lkdn_profiles.employees (
clickhouse-bulk_1 | 2021/03/05 11:05:22.245596 ERROR: server down (502): Post http://default:***@11111111:8123: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
clickhouse-bulk_1 | 2021/03/05 11:05:22.245626 ERROR: server error (503) No working clickhouse servers

but at the same time the ClickHouse server itself is alive:
echo 'SELECT 1' | curl 'http://default:root@1111:8123/' --data-binary @-
1

Bulk inserting is not working

Hi, first of all, thank you for making clickhouse-bulk 💐

I am running with this config:

{
  "listen": ":8125",
  "flush_count": 10000,
  "flush_interval": 3000,
  "debug": true,
  "dump_dir": "dumps",
  "clickhouse": {
    "down_timeout": 300,
    "servers": [
      "http://127.0.0.1:8123"
    ]
  }
}

Shouldn't this config collect incoming requests and insert them in bulk every 3 seconds? I am watching the logs and see every insert I send via HTTP (I insert with Python's requests) being processed immediately as I send it. What am I doing wrong?

connect.Ping() results in bad connection error

One common practice after creating the connection is to check for a ping/pong from the server:

if err := connect.Ping(); err != nil {
	logger.Fatal(err)
	return nil, err
}

This method works as intended when connecting to a ClickHouse server via either HTTP or TCP.

When I connect to clickhouse-bulk over HTTP instead, I receive a bad connection error from the driver.
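
Since clickhouse-bulk is an insert collector rather than a full ClickHouse HTTP implementation, the driver's Ping (which runs a lightweight query or handshake against the server) may not be answered the way the driver expects. A hedged alternative for a liveness check is to hit the collector's documented /metrics endpoint directly; a sketch, assuming the default address:

package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Assumed clickhouse-bulk address; the /metrics endpoint is documented in the README.
	resp, err := http.Get("http://127.0.0.1:8124/metrics")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("clickhouse-bulk is up, status:", resp.Status)
}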

'FORMAT' in string field breaks the insert query

For example, this query:

INSERT INTO test (date, args) VALUES ('2019-06-13', 'query=select%20args%20from%20test%20group%20by%20date%20FORMAT%20JSON')

or this

INSERT INTO test (date, args) VALUES ('2019-06-13', 'query=select%2520args%2520from%2520test%2520group%2520by%2520date%2520FORMAT%2520JSON')

generates an error:

2019/06/13 12:49:46 Send ERROR 500: Code: 27, e.displayText() = DB::Exception: Cannot parse input: expected ( before: \'NDA\', ...: (at row 2)

Request Timeout when sending more than X rows at once

Hello, Nikolay!

I get a Request Timeout when sending more than X (about 1000 in my case) rows at once.
I batch requests on the client side (Node.js) and send the batch every 2500 ms. Everything goes well until my batch size reaches about 1000-1100 rows. I'm using the official ClickHouse client for Node.js.

The bulk instance is deployed on another host reachable via a 1 Gbit network.

Do you have any idea why and how this could happen?

App stops sending data to ClickHouse after receiving an error from it

The last lines in log file:

2023/04/21 08:29:37.756987 ERROR: server down (502): Post "CH_URL": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2023/04/21 08:29:37.756995 ERROR: Send (503) No working clickhouse servers; response

After that, data isn't sent to the CH server and the pod's RAM usage increases over time.

I've checked the app status and it looks OK:

/app # curl -s http://127.0.0.1:8124/metrics | grep "^ch_"
ch_bad_servers 0
ch_dump_count 14763
ch_good_servers 1
ch_queued_dumps 14743
ch_received_count 5.4660103e+07
ch_sent_count 2.154301e+06

Respect database specifier in connection string

So I have two databases, let's call them default and newdb, containing identical tables. If I connect to ClickHouse directly using http://username:password@localhost:8123/default or http://username:password@localhost:8123/newdb, I am able to submit queries to the correct database.

However, with clickhouse-bulk and the exact same connection strings as above, inserts to both databases are aggregated into the same DB.

I can run (and am now running) a copy of clickhouse-bulk per database, but this seems sub-optimal. At an absolute minimum, clickhouse-bulk should reject queries sent to /newdb if it is only going to insert into a single DB.
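
As a possible workaround until the path is respected, the README lists database among the supported query parameters, so a client could pass the target database explicitly on each request instead of in the URL path. A sketch with hypothetical table and database names:

package main

import (
	"log"
	"net/http"
	"net/url"
	"strings"
)

func main() {
	// Hypothetical example: direct the insert at newdb via the database query parameter.
	q := url.Values{}
	q.Set("database", "newdb")
	q.Set("query", "INSERT INTO mytable (a) VALUES")
	resp, err := http.Post(
		"http://127.0.0.1:8124/?"+q.Encode(),
		"text/plain",
		strings.NewReader("(1)"),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("status:", resp.Status)
}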

Q: is bulk a remedy for max_connections_count

Hi there,
Could you kindly confirm my initial thoughts about using your tool as a savior for my system?

I have a scenario where small inserts of data are posted to ClickHouse (e.g. GPS updates from a number of mobile devices).
ClickHouse often returns an HTTP 500 error because the max connection count is reached or because of a timeout.
There are some materialized views that are calculated on insert, which might slow things down.
I changed the default value from 100 to 500, but it doesn't seem to help; more queries just end up waiting.

I thought that using your tool could improve the situation, since bulk inserts are advised for the performance boost.
Another option I am considering is the use of Buffer tables.

Thanks!

clickhouse-bulk stops resending old dumps after a while

We have noticed that we have some old dumps with a 502 error. They are not resent. When the service is restarted, the dumps are resent after 5 minutes. It looks like there is a bug with d.LockedFiles and these dumps stay locked.
