pingcap / tiflow
This repo maintains DM (a data migration platform) and TiCDC (change data capture for TiDB)
License: Apache License 2.0
Is your feature request related to a problem? Please describe:
Currently a user can't detect whether a replication task is running normally.
Describe the feature you'd like:
TiCDC should collect replication status and provide a convenient way to query it; the status may include
Currently TiCDC may replicate some DDLs that are not compatible between the upstream and the downstream. If a DDL fails to execute in the downstream, the replication task will be stopped.
Such compatibility issues exist between a higher-version TiDB and a lower-version TiDB, or between TiDB and MySQL.
To improve user experience, we can:
Is your feature request related to a problem? Please describe:
CDC has a common use case as follows: restore or import a snapshot taken at a checkpoint ts into the downstream, then start replication from that checkpoint ts.
However, the restore/import procedure can be very long and TiKV doesn't have a long enough GC interval for us to catch data from the checkpoint ts, so in practice the replication often cannot start from the original checkpoint ts.
Describe the feature you'd like:
We need CDC to provide a new work mode:
When a user creates a new changefeed, TiCDC should perform some pre-verification of the changefeed config, including:
StartTs must be larger than tikv_gc_safe_point
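A minimal sketch of what such a pre-verification could look like; the getGCSafePoint helper and the error wording are assumptions for illustration, not TiCDC's actual code:

```go
package verification

import (
	"context"
	"fmt"
)

// getGCSafePoint is a hypothetical helper that would fetch the current
// tikv_gc_safe_point from PD or from the upstream TiDB cluster.
func getGCSafePoint(ctx context.Context) (uint64, error) {
	// ... query PD / the upstream cluster here ...
	return 0, nil
}

// verifyStartTs rejects a changefeed whose StartTs is not larger than the
// GC safe point, because data before the safe point may already have been
// garbage collected and can no longer be captured.
func verifyStartTs(ctx context.Context, startTs uint64) error {
	safePoint, err := getGCSafePoint(ctx)
	if err != nil {
		return err
	}
	if startTs <= safePoint {
		return fmt.Errorf("start-ts %d must be larger than tikv_gc_safe_point %d",
			startTs, safePoint)
	}
	return nil
}
```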
/ # ./cdc version
Release Version:
Git Commit Hash: 80ee381230d5b7a3181464ad874f9a54c9220184
Git Branch: master
UTC Build Time: 2019-12-13 09:54:47
Go Version: go version go1.13.4 linux/amd64
sh-4.2# ./tidb-server -V
Release Version: v4.0.0-alpha-516-g5466a3c31
Git Commit Hash: 5466a3c31bf4b93fb3a2c595dd6aeac46aca7b8e
Git Branch: HEAD
UTC Build Time: 2019-12-02 09:22:52
GoVersion: go version go1.13.4 linux/amd64
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false
sh-4.2# ./tikv-server -V
TiKV
Release Version: 4.0.0-alpha
Git Commit Hash: 56dc6d63ade182289c4ab1e37996746040bc07d6
Git Commit Branch: cdc
UTC Build Time: 2019-11-06 03:29:57
Rust Version: rustc 1.39.0-nightly (c6e9c76c5 2019-09-04)
/ # ./pd-server -V
Release Version: v4.0.0-alpha-200-gf7f643c61
Git Commit Hash: f7f643c6138cc5240d954bfa1a560e3b14bfdc6e
Git Branch: HEAD
UTC Build Time: 2019-12-13 11:42:23
Used test-infra to test cdc and found that some tables failed to sync to MySQL, with some error logs in the cdc log.
tidb tables:
+----------------+
| Tables_in_test |
+----------------+
| amfev |
| cmqwqm |
| dcnlvf |
| dofvv |
| eacnnohz |
| iyotimi |
| mnrbu |
| mvtabkee |
| nqjftinj |
| phfxkijuy |
| pklmxor |
| sflzhfns |
| sfrpmvpa |
| sxdho |
| t1576252933 |
| wzgzdkdwq |
| xlfmlygpi |
| zpmlqvdcr |
+----------------+
18 rows in set (0.00 sec)
mysql tables:
+----------------+
| Tables_in_test |
+----------------+
| amfev |
| cmqwqm |
| dcnlvf |
| dofvv |
| eacnnohz |
| mvtabkee |
| phfxkijuy |
| pklmxor |
| sflzhfns |
| sfrpmvpa |
| sxdho |
| wzgzdkdwq |
| xlfmlygpi |
| zpmlqvdcr |
+----------------+
14 rows in set (0.00 sec)
cdc log:
[2019/12/13 16:05:14.537 +00:00] [ERROR] [client.go:371] ["RPC error"] [error="rpc error: code = Canceled desc = context canceled"]
[2019/12/13 16:05:14.537 +00:00] [ERROR] [client.go:371] ["RPC error"] [error="rpc error: code = Canceled desc = context canceled"]
[2019/12/13 16:05:14.537 +00:00] [ERROR] [client.go:371] ["RPC error"] [error="rpc error: code = Canceled desc = context canceled"]
[2019/12/13 16:05:14.537 +00:00] [ERROR] [client.go:371] ["RPC error"] [error="rpc error: code = Canceled desc = context canceled"]
[2019/12/13 16:05:14.537 +00:00] [ERROR] [client.go:371] ["RPC error"] [error="rpc error: code = Canceled desc = context canceled"]
[2019/12/13 16:05:14.537 +00:00] [ERROR] [client.go:371] ["RPC error"] [error="rpc error: code = Canceled desc = context canceled"]
[2019/12/13 16:05:14.537 +00:00] [ERROR] [client.go:371] ["RPC error"] [error="rpc error: code = Canceled desc = context canceled"]
[2019/12/13 16:05:14.537 +00:00] [ERROR] [client.go:371] ["RPC error"] [error="rpc error: code = Canceled desc = context canceled"]
the full log: http://139.219.11.38:8000/qTl2Q/cdc-2019-12-13T16-05-14.537.log
Our first milestone is to make CDC work for syncing to MySQL/TiDB, which corresponds to the mysqlSink in our code.
We should update the way Capture interacts with Sink so that the necessary information can be passed to create a mysqlSink.
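As a rough illustration of the idea (the interface and constructor below are simplified assumptions, not the actual TiCDC Sink API), the Capture could hand a sink URI down and let a factory build the mysqlSink from it:

```go
package sink

import (
	"context"
	"database/sql"
	"fmt"
	"net/url"

	_ "github.com/go-sql-driver/mysql" // register the MySQL driver for database/sql
)

// Sink is a simplified sketch of a sink abstraction; the real TiCDC
// interface carries row events, resolved timestamps, and more.
type Sink interface {
	EmitDDL(ctx context.Context, query string) error
	Close() error
}

type mysqlSink struct {
	db *sql.DB
}

func (s *mysqlSink) EmitDDL(ctx context.Context, query string) error {
	_, err := s.db.ExecContext(ctx, query)
	return err
}

func (s *mysqlSink) Close() error { return s.db.Close() }

// NewMySQLSink shows how the information a Capture holds (here just a sink
// URI such as "mysql://root:pass@127.0.0.1:3306/test") could be passed to
// the sink layer to build a mysqlSink.
func NewMySQLSink(sinkURI string) (Sink, error) {
	u, err := url.Parse(sinkURI)
	if err != nil {
		return nil, err
	}
	user, pass := "root", ""
	if u.User != nil {
		user = u.User.Username()
		pass, _ = u.User.Password()
	}
	dsn := fmt.Sprintf("%s:%s@tcp(%s)%s", user, pass, u.Host, u.Path)
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		return nil, err
	}
	return &mysqlSink{db: db}, nil
}
```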
Please answer these questions before submitting your issue. Thanks!
create table t (a int, b int as (a + 1) stored primary key);
insert into t(a) values (1),(2), (3);
update t set a = 10 where a = 1;
mysql [email protected]:aa> select * from t;
+----+----+
| a | b |
+----+----+
| 2 | 3 |
| 3 | 4 |
| 10 | 11 |
+----+----+
mysql [email protected]:aa> select * from t;
+---+---+
| a | b |
+---+---+
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
+---+---+
3 rows in set
topology:
Upstream TiDB: 172.16.6.206
Downstream: MySQL 5.6.46
[tidb@localhost tidb-ansible]$ /data1/tidb/deploy/bin/tidb-server -V
Release Version: v4.0.0-alpha-516-g5466a3c31
Git Commit Hash: 5466a3c31bf4b93fb3a2c595dd6aeac46aca7b8e
Git Branch: master
UTC Build Time: 2019-10-14 03:55:02
GoVersion: go version go1.13 linux/amd64
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false
[tidb@localhost tidb-ansible]$ /data1/tidb/deploy/bin/tikv-server -V
TiKV
Release Version: 4.0.0-alpha
Git Commit Hash: 56dc6d63ade182289c4ab1e37996746040bc07d6
Git Commit Branch: cdc
UTC Build Time: 2019-11-06 03:29:57
Rust Version: rustc 1.39.0-nightly (c6e9c76c5 2019-09-04)
[tidb@localhost tidb-ansible]$ /data1/tidb/deploy/bin/pd-server -V
Release Version: v4.0.0-alpha-191-g7811255c
Git Commit Hash: 7811255c7345503ed5f44afb981bbf9712fd25c6
Git Branch: master
UTC Build Time: 2019-12-06 05:07:33
mysql -h 127.0.0.1 -P 4000 -u root
CREATE table test.simple1(id int primary key, val int);
CREATE table test.simple2(id int primary key, val int);
## start_ts=$(($(date +%s%N | cut -b1-13)<<18)) => 413114580074496000
INSERT INTO test.simple1(id, val) VALUES (1, 1);
INSERT INTO test.simple1(id, val) VALUES (2, 2);
INSERT INTO test.simple1(id, val) VALUES (3, 3);
UPDATE test.simple1 set val = 22 where id = 2;
DELETE from test.simple1 where id = 3
mysql -h 127.0.0.1 -P 3306 -u root -e 'create database test'
nohup /home/tidb/cdc server --pd-endpoints http://172.16.6.206:2379 &
/home/tidb/cdc cli --pd-addr http://172.16.6.206:2379 --start-ts=413114580074496000 --sink-uri 'root@tcp(127.0.0.1:3306)/test'
cdc.log:
[2019/12/09 21:20:10.843 -05:00] [DEBUG] [storage.go:302] ["handle job: "] ["sql query"="CREATE TABLE if not exists mysql.stats_top_n (\n\t\ttable_id bigint(64) NOT NULL,\n\t\tis_index tinyint(2) NOT NULL,\n\t\thist_id bigint(64) NOT NULL,\n\t\tvalue longblob,\n\t\tcount bigint(64) UNSIGNED NOT NULL,\n\t\tindex tbl(table_id, is_index, hist_id)\n\t);"] [job="ID:38, Type:create table, State:synced, SchemaState:public, SchemaID:3, TableID:37, RowCount:0, ArgLen:0, start time: 2019-12-09 21:14:55.003 -0500 EST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2019/12/09 21:20:10.843 -05:00] [DEBUG] [storage.go:221] ["create table success"] [name=mysql.stats_top_n] [id=37]
[2019/12/09 21:20:10.843 -05:00] [DEBUG] [storage.go:302] ["handle job: "] ["sql query"="CREATE TABLE IF NOT EXISTS mysql.expr_pushdown_blacklist (\n\t\tname char(100) NOT NULL\n\t);"] [job="ID:40, Type:create table, State:synced, SchemaState:public, SchemaID:3, TableID:39, RowCount:0, ArgLen:0, start time: 2019-12-09 21:14:55.103 -0500 EST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2019/12/09 21:20:10.843 -05:00] [DEBUG] [storage.go:221] ["create table success"] [name=mysql.expr_pushdown_blacklist] [id=39]
[2019/12/09 21:20:10.844 -05:00] [DEBUG] [storage.go:302] ["handle job: "] ["sql query"="CREATE TABLE IF NOT EXISTS mysql.opt_rule_blacklist (\n\t\tname char(100) NOT NULL\n\t);"] [job="ID:42, Type:create table, State:synced, SchemaState:public, SchemaID:3, TableID:41, RowCount:0, ArgLen:0, start time: 2019-12-09 21:14:55.153 -0500 EST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2019/12/09 21:20:10.844 -05:00] [DEBUG] [storage.go:221] ["create table success"] [name=mysql.opt_rule_blacklist] [id=41]
[2019/12/09 21:20:10.844 -05:00] [DEBUG] [storage.go:302] ["handle job: "] ["sql query"="CREATE table test.simple1(id int primary key, val int)"] [job="ID:44, Type:create table, State:synced, SchemaState:public, SchemaID:1, TableID:43, RowCount:0, ArgLen:0, start time: 2019-12-09 21:17:06.003 -0500 EST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2019/12/09 21:20:10.844 -05:00] [DEBUG] [storage.go:221] ["create table success"] [name=test.simple1] [id=43]
[2019/12/09 21:20:10.844 -05:00] [DEBUG] [storage.go:302] ["handle job: "] ["sql query"="CREATE table test.simple2(id int primary key, val int)"] [job="ID:46, Type:create table, State:synced, SchemaState:public, SchemaID:1, TableID:45, RowCount:0, ArgLen:0, start time: 2019-12-09 21:17:12.253 -0500 EST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2019/12/09 21:20:10.844 -05:00] [DEBUG] [storage.go:221] ["create table success"] [name=test.simple2] [id=45]
[2019/12/09 21:20:10.845 -05:00] [DEBUG] [client.go:228] ["singleEventFeed quit"]
[2019/12/09 21:20:10.845 -05:00] [INFO] [processor.go:353] ["Checkpoint worker exited"]
[2019/12/09 21:20:10.845 -05:00] [INFO] [client.go:235] ["EventFeed disconnected"] [span="{\"Start\":\"bURETEpvYkxp/3N0AAAAAAAA+QAAAAAAAABs\",\"End\":\"bURETEpvYkxp/3N0AAAAAAAA+QAAAAAAAABt\"}"] [checkpoint=413124368270098433] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/[email protected]/juju_adaptor.go:15\ngithub.com/pingcap/ticdc/cdc/kv.(*CDCClient).singleEventFeed\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:408\ngithub.com/pingcap/ticdc/cdc/kv.(*CDCClient).partialRegionFeed.func1\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:227\ngithub.com/pingcap/ticdc/pkg/retry.Run.func1\n\tgithub.com/pingcap/ticdc@/pkg/retry/retry.go:31\ngithub.com/cenkalti/backoff.RetryNotify\n\tgithub.com/cenkalti/[email protected]+incompatible/retry.go:37\ngithub.com/cenkalti/backoff.Retry\n\tgithub.com/cenkalti/[email protected]+incompatible/retry.go:24\ngithub.com/pingcap/ticdc/pkg/retry.Run\n\tgithub.com/pingcap/ticdc@/pkg/retry/retry.go:30\ngithub.com/pingcap/ticdc/cdc/kv.(*CDCClient).partialRegionFeed\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:215\ngithub.com/pingcap/ticdc/cdc/kv.(*CDCClient).EventFeed.func1.1\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:188\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/[email protected]/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1357"]
[2019/12/09 21:20:10.845 -05:00] [INFO] [scheduler.go:313] ["stop to run processor"] ["changefeed id"=245b6079-015f-4707-9f18-78bca094b6cf]
[2019/12/09 21:20:10.846 -05:00] [DEBUG] [client.go:228] ["singleEventFeed quit"]
[2019/12/09 21:20:10.846 -05:00] [ERROR] [server.go:80] ["run server"] [error="Error 1298: Unknown or incorrect time zone: 'UTC'\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/[email protected]/juju_adaptor.go:15\ngithub.com/pingcap/ticdc/cdc/sink.(*mysqlSink).Emit\n\tgithub.com/pingcap/ticdc@/cdc/sink/mysql.go:141\ngithub.com/pingcap/ticdc/cdc.(*processor).syncResolved\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:587\ngithub.com/pingcap/ticdc/cdc.(*processor).Run.func3\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:283\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/[email protected]/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1357"]
[2019/12/09 21:20:10.846 -05:00] [INFO] [client.go:235] ["EventFeed disconnected"] [span="{\"Start\":\"bURETEpvYkxp/3N0AAAAAAAA+QAAAAAAAABs\",\"End\":\"bURETEpvYkxp/3N0AAAAAAAA+QAAAAAAAABt\"}"] [checkpoint=413124368270098433] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/[email protected]/juju_adaptor.go:15\ngithub.com/pingcap/ticdc/cdc/kv.(*CDCClient).singleEventFeed\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:408\ngithub.com/pingcap/ticdc/cdc/kv.(*CDCClient).partialRegionFeed.func1\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:227\ngithub.com/pingcap/ticdc/pkg/retry.Run.func1\n\tgithub.com/pingcap/ticdc@/pkg/retry/retry.go:31\ngithub.com/cenkalti/backoff.RetryNotify\n\tgithub.com/cenkalti/[email protected]+incompatible/retry.go:37\ngithub.com/cenkalti/backoff.Retry\n\tgithub.com/cenkalti/[email protected]+incompatible/retry.go:24\ngithub.com/pingcap/ticdc/pkg/retry.Run\n\tgithub.com/pingcap/ticdc@/pkg/retry/retry.go:30\ngithub.com/pingcap/ticdc/cdc/kv.(*CDCClient).partialRegionFeed\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:215\ngithub.com/pingcap/ticdc/cdc/kv.(*CDCClient).EventFeed.func1.1\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:188\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/[email protected]/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1357"]
[2019/12/09 21:20:10.846 -05:00] [DEBUG] [capture_info.go:128] ["watchC from etcd close normally"]
[2019/12/09 21:20:10.846 -05:00] [INFO] [owner.go:372] ["handleWatchCapture quit"]
[2019/12/09 21:20:10.846 -05:00] [DEBUG] [etcd.go:205] ["update subchangefeed info success"] ["changefeed id"=6cdfb9e6-e0ec-4933-bd77-b269946cd685] ["capture id"=a3d0a077-497e-4b4a-a7c3-cb186e9e110d] [modRevision=232] [info="{\"checkpoint-ts\":0,\"resolved-ts\":413124368270098433,\"table-infos\":[{\"id\":45,\"start-ts\":413124328229699584}],\"table-p-lock\":null,\"table-c-lock\":null}"]
[2019/12/09 21:20:10.846 -05:00] [INFO] [processor.go:330] ["Local resolved worker exited"]
Please answer these questions before submitting your issue. Thanks!
start a CDC server
cdc cli capture list returns the server
[2020/03/18 19:44:09.797 +08:00] [INFO] [root.go:47] ["init log"] [file=ticdc_1.log] [level=debug]
[2020/03/18 19:44:09.797 +08:00] [INFO] [version.go:34] ["Welcome to Change Data Capture (CDC)"] [release-version=v4.0.0-beta.2] [git-hash=63b1db95df26ef914bc1f1dc29ddfa4936100ff8] [git-branch=master] [utc-build-time="2020-03-13 09:45:32"] [go-version="go version go1.13 linux/amd64"]
[2020/03/18 19:44:09.797 +08:00] [INFO] [server.go:76] ["creating CDC server"] [pd-addr=http://hw-dt-wms-warp1-tidb01:2379] [status-host=127.0.0.1] [status-port=8301]
[2020/03/18 19:44:09.804 +08:00] [INFO] [capture.go:96] ["creating capture"] [capture-id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462]
[2020/03/18 19:44:09.805 +08:00] [INFO] [client.go:134] ["[pd] create pd client with endpoints"] [pd-address="[http://hw-dt-wms-warp1-tidb01:2379]"]
[2020/03/18 19:44:09.812 +08:00] [INFO] [base_client.go:226] ["[pd] update member urls"] [old-urls="[http://hw-dt-wms-warp1-tidb01:2379]"] [new-urls="[http://10.232.0.109:2379,http://10.232.0.166:2379,http://10.232.0.212:2379]"]
[2020/03/18 19:44:09.812 +08:00] [INFO] [base_client.go:242] ["[pd] switch leader"] [new-leader=http://10.232.0.212:2379] [old-leader=]
[2020/03/18 19:44:09.812 +08:00] [INFO] [base_client.go:92] ["[pd] init cluster id"] [cluster-id=6804742633952162675]
[2020/03/18 19:44:09.812 +08:00] [INFO] [http_status.go:54] ["status http server is running"] [addr=127.0.0.1:8301]
[2020/03/18 19:44:09.819 +08:00] [INFO] [manager.go:253] ["get owner"] [ownerID=9fa3ce40-00e8-4153-873c-4dfcbf4b3462]
[2020/03/18 19:44:09.819 +08:00] [INFO] [manager.go:223] ["campaign to be owner"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462]
[2020/03/18 19:44:09.819 +08:00] [DEBUG] [manager.go:269] ["watch owner key"] [key=/tidb/cdc/capture/owner/6aab70e2c63ea25e]
[2020/03/18 19:44:10.317 +08:00] [INFO] [owner.go:1263] ["start to watch processors"]
[2020/03/18 19:44:10.318 +08:00] [INFO] [owner.go:1213] ["monitoring processors"] [key=/tidb/cdc/processor/info] [rev=93442]
[2020/03/18 21:49:11.764 +08:00] [DEBUG] [manager.go:274] ["lost owner role, send retire notification"]
[2020/03/18 21:49:11.764 +08:00] [WARN] [manager.go:229] ["lost owner"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462]
[2020/03/18 21:49:11.764 +08:00] [INFO] [manager.go:187] ["etcd session is done, creates a new one"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462]
[2020/03/18 21:49:11.765 +08:00] [ERROR] [owner.go:1272] ["watch processor failed"] []
[2020/03/18 21:49:13.448 +08:00] [INFO] [manager.go:253] ["get owner"] [ownerID=9fa3ce40-00e8-4153-873c-4dfcbf4b3462]
[2020/03/18 21:49:13.448 +08:00] [INFO] [manager.go:223] ["campaign to be owner"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462]
[2020/03/18 21:49:13.448 +08:00] [DEBUG] [manager.go:269] ["watch owner key"] [key=/tidb/cdc/capture/owner/6aab70e2c63ee261]
[2020/03/18 21:49:13.733 +08:00] [INFO] [owner.go:1263] ["start to watch processors"]
[2020/03/18 21:49:13.734 +08:00] [INFO] [owner.go:1213] ["monitoring processors"] [key=/tidb/cdc/processor/info] [rev=97243]
[2020/03/18 21:51:42.785 +08:00] [DEBUG] [manager.go:274] ["lost owner role, send retire notification"]
[2020/03/18 21:51:42.785 +08:00] [WARN] [manager.go:229] ["lost owner"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462]
[2020/03/18 21:51:42.785 +08:00] [INFO] [manager.go:187] ["etcd session is done, creates a new one"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462]
[2020/03/18 21:51:42.785 +08:00] [ERROR] [owner.go:1272] ["watch processor failed"] []
[2020/03/18 21:51:55.899 +08:00] [ERROR] [manager.go:215] ["failed to campaign"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462] [error="etcdserver: request timed out"]
[2020/03/18 21:52:06.900 +08:00] [ERROR] [manager.go:215] ["failed to campaign"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462] [error="etcdserver: request timed out"]
[2020/03/18 21:52:17.901 +08:00] [ERROR] [manager.go:215] ["failed to campaign"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462] [error="etcdserver: request timed out"]
[2020/03/18 21:52:28.901 +08:00] [ERROR] [manager.go:215] ["failed to campaign"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462] [error="etcdserver: request timed out"]
[2020/03/18 21:52:30.285 +08:00] [INFO] [manager.go:253] ["get owner"] [ownerID=9fa3ce40-00e8-4153-873c-4dfcbf4b3462]
[2020/03/18 21:52:30.286 +08:00] [INFO] [manager.go:223] ["campaign to be owner"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462]
[2020/03/18 21:52:30.286 +08:00] [DEBUG] [manager.go:269] ["watch owner key"] [key=/tidb/cdc/capture/owner/6aab70e2c63ee387]
[2020/03/18 21:52:30.410 +08:00] [INFO] [owner.go:1263] ["start to watch processors"]
[2020/03/18 21:52:30.411 +08:00] [INFO] [owner.go:1213] ["monitoring processors"] [key=/tidb/cdc/processor/info] [rev=97424]
[2020/03/18 21:54:06.330 +08:00] [INFO] [manager.go:301] ["watch failed, owner is deleted"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462]
[2020/03/18 21:54:06.330 +08:00] [DEBUG] [manager.go:274] ["lost owner role, send retire notification"]
[2020/03/18 21:54:06.330 +08:00] [WARN] [manager.go:229] ["lost owner"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462]
[2020/03/18 21:54:06.330 +08:00] [ERROR] [owner.go:1272] ["watch processor failed"] []
[2020/03/18 21:54:06.331 +08:00] [ERROR] [manager.go:215] ["failed to campaign"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462] [error="etcdserver: requested lease not found"]
[2020/03/18 21:54:06.333 +08:00] [INFO] [manager.go:207] ["etcd session encounters the error of lease not found, closes it"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462] [error="etcdserver: requested lease not found"]
[2020/03/18 21:54:06.333 +08:00] [INFO] [manager.go:187] ["etcd session is done, creates a new one"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462]
[2020/03/18 21:54:06.340 +08:00] [INFO] [manager.go:253] ["get owner"] [ownerID=9fa3ce40-00e8-4153-873c-4dfcbf4b3462]
[2020/03/18 21:54:06.340 +08:00] [INFO] [manager.go:223] ["campaign to be owner"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462]
[2020/03/18 21:54:06.340 +08:00] [DEBUG] [manager.go:269] ["watch owner key"] [key=/tidb/cdc/capture/owner/6aab70e2c63ee446]
In the current kv client, we process the kv event Entries in a *cdcpb.Event_Entries_ one by one.
put entry to eventCh -> put a sorter item to sorter ->
put entry to eventCh -> put a sorter item to sorter ->
put entry to eventCh -> put a sorter item to sorter -> ...
So we generate multiple commit event entries with the same commit ts to the puller, plus one resolved ts event generated by the sorter; all of them carry the same commit ts.
kv -> resolve -> kv -> kv
In fact this does not exactly match our design.
Describe alternatives you've considered:
We can remove the sorter mechanism and forward events based entirely on the resolved ts from TiKV.
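A small sketch of an alternative direction: batch all entries that share a commit ts from one *cdcpb.Event_Entries_ into a single item, so a resolved event can no longer be interleaved inside one transaction. The types here are simplified stand-ins for the real cdcpb/puller types:

```go
package kv

// Entry and sorterItem are simplified stand-ins for the real
// cdcpb.Event_Row and puller sorter types.
type Entry struct {
	CommitTs uint64
	Key      []byte
	Value    []byte
}

type sorterItem struct {
	commitTs uint64
	entries  []Entry
}

// groupByCommitTs turns one batch of entries into at most one sorter item
// per commit ts, instead of one item per entry, so the sorter no longer
// emits a resolved event in the middle of entries that belong to the same
// transaction.
func groupByCommitTs(entries []Entry) []sorterItem {
	var items []sorterItem
	for _, e := range entries {
		n := len(items)
		if n > 0 && items[n-1].commitTs == e.CommitTs {
			items[n-1].entries = append(items[n-1].entries, e)
			continue
		}
		items = append(items, sorterItem{commitTs: e.CommitTs, entries: []Entry{e}})
	}
	return items
}
```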
The processor can only handle about 400 txns, while the CPU usage of the server is below 10%.
https://github.com/pingcap/ticdc/blob/d2621b3f0f65f33567fa6bf772b93e8b2aee1128/cdc/processor.go#L712
After avoiding this sleep-then-do-nothing polling style, throughput improves significantly.
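The difference can be illustrated with a small sketch (purely illustrative, not the actual processor code): the first function is the sleep-then-check style, the second waits on a notification channel and only wakes up when there is real progress to handle:

```go
package processor

import (
	"context"
	"time"
)

// pollResolvedTs is the sleep-and-check style hinted at above: it wakes up
// on a fixed interval even when there is nothing to do.
func pollResolvedTs(ctx context.Context, ready func() bool, handle func()) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-time.After(time.Second):
			if ready() {
				handle()
			}
		}
	}
}

// notifyResolvedTs waits on a notification channel instead, so the goroutine
// runs exactly when new resolved-ts progress arrives.
func notifyResolvedTs(ctx context.Context, notifyCh <-chan struct{}, handle func()) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-notifyCh:
			handle()
		}
	}
}
```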
Support TLS and online reloading of new certificates.
Optional value: 1~5
1 Point for 1 Person/Work Day
Is your feature request related to a problem? Please describe:
The output from cdc -h is not straightforward and does not make it easy for a newbie to start using CDC quickly.
Describe the feature you'd like:
Please include at least the following information:
Is your feature request related to a problem? Please describe:
The block-allow-list config is not straightforward; we should provide a detailed usage document for it.
Is your feature request related to a problem? Please describe:
[coprocessor]
region-max-keys = 1200
region-split-keys = 1000
Describe the feature you'd like:
Currently CDC reuses the region cache lib in TiDB and is able to handle normal region splits. But in the above benchmark scenario, it always fails.
We need to dig into this problem and make the kv client more robust.
We can't run Kafka as we just don't like the legacy of Java.
It would be awesome if you also supported Liftbridge. It has a gRPC interface as well as a NATS interface.
It's basically like Kafka but written in Go.
https://github.com/liftbridge-io/liftbridge
Optional value: 1~5
1 Point for 1 Person/Work Day
The function partialRegionFeed (in cdc/kv/client.go) accepts a region info as a parameter and may reload region info from the region cache before sending a request. So it's possible that the region has changed after a split. As a result, it may get a smaller region after calling regionCache.LocateKey, one that doesn't cover the range held by the parameter regionInfo, so the remaining part of the range will be missing.
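A minimal sketch of one possible fix (span is a simplified stand-in for the real key-range type): after locating the region, compute the part of the requested range that the located region does not cover and schedule new region feeds for it instead of dropping it:

```go
package kv

import "bytes"

// span is a simplified key range; the real type in TiCDC differs.
type span struct {
	Start, End []byte
}

// uncovered returns the parts of want that the located region (got) does not
// cover, so the caller can schedule new region feeds for them instead of
// silently losing data after a region split.
func uncovered(want, got span) []span {
	var rest []span
	if bytes.Compare(got.Start, want.Start) > 0 {
		rest = append(rest, span{Start: want.Start, End: got.Start})
	}
	if bytes.Compare(got.End, want.End) < 0 {
		rest = append(rest, span{Start: got.End, End: want.End})
	}
	return rest
}
```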
topology:
Upstream TiDB: 172.16.5.86
Downstream: MySQL 5.7.28
/data1/deploy1/bin/tidb-server -V
Release Version: v4.0.0-alpha-1148-g5da10ffec
Git Commit Hash: 5da10ffecc280136b2041801b23034c557e41751
Git Branch: HEAD
UTC Build Time: 2019-12-12 03:12:21
GoVersion: go1.13
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
/data1/deploy1/bin/tikv-server -V
TiKV
Release Version: 4.0.0-alpha
Git Commit Hash: 38579ea3e2ed08dc5bd724b2c0cda82b4588c42f
Git Commit Branch: master
UTC Build Time: 2019-12-09 04:37:17
Rust Version: rustc 1.39.0-nightly (c6e9c76c5 2019-09-04)
/data1/deploy1/bin/tikv-server -V
TiKV
Release Version: 4.0.0-alpha
Git Commit Hash: 38579ea3e2ed08dc5bd724b2c0cda82b4588c42f
Git Commit Branch: master
UTC Build Time: 2019-12-09 04:37:17
Rust Version: rustc 1.39.0-nightly (c6e9c76c5 2019-09-04)
[tidb@localhost tidb-ansible]$ /data1/deploy1/bin/pd-server -V
Release Version: v4.0.0-alpha-197-gbd7b3f46
Git Commit Hash: bd7b3f46eef5dfb8241bcdcea27c68454b2f1f1c
Git Branch: master
UTC Build Time: 2019-12-12 02:16:14
../go-tpc/bin/go-tpc --time=400m tpch --host 172.16.5.86 -P 4000 -T 1 --sf=1 prepare // load data
// get ts
+ mysql -h 172.16.5.86 -uroot -P4000 -e 'drop database if exists tmp_db' // create table
+ mysql -h 172.16.5.86 -uroot -P4000 -e 'create database tmp_db'
./resources/bin/cdc server --pd-endpoints http://172.16.5.86:2379 > cdc_server.log
$ cat cdc_server.log
Error: run server: schema 68 not found
Usage:
cdc server [flags]
Flags:
-h, --help help for server
--pd-endpoints string endpoints of PD, separated by comma (default "http://127.0.0.1:2379")
--status-addr string bind address for http status server (default "127.0.0.1:8300")
Global Flags:
--log-file string log file path (default "cdc.log")
--log-level string log level (etc: debug|info|warn|error) (default "debug")
run server: schema 68 not found
Currently we pass a DSN of the mysql driver, which actually doesn't support other schemas; consider just using pkg/loader from tidb-binlog.
Replace Loader with Lightning
In DM, the loader.Loader struct implements loading data from mydumper output files into TiDB. Since v3.0.3, Lightning supports the TiDB backend, which enables Lightning to do the same thing.
We consider the TiDB backend mode of Lightning the better implementation of the two, because it also handles cases where loader.Loader would fail. So we propose to replace Loader with Lightning:
replace loader.Loader (which is an implementation of the Unit interface) with Lightning;
adjust the related configuration in task.yaml.
Enhancement
Enhance the ability of loading data.
1500
csuzhangxc
Close, Pause and Resume
Status, Error, Type, IsFreshTask
N/A
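For reference, a rough sketch of what the Unit interface mentioned above could look like; the method set follows the names listed here, but the real signatures in DM differ (they use protobuf status types and result channels), so treat this only as an illustration:

```go
package unit

import "context"

// Unit is an illustrative sketch of a DM processing unit; a Lightning-based
// loader would implement this interface so it can replace loader.Loader.
type Unit interface {
	Init(ctx context.Context) error
	Process(ctx context.Context) error
	Close()
	Pause()
	Resume(ctx context.Context) error
	Status() interface{}
	Error() interface{}
	Type() string
	IsFreshTask(ctx context.Context) (bool, error)
}
```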
[2020/03/07 10:53:45.345 -05:00] [INFO] [mysql.go:97] ["execute DDL failed, but error can be ignored"] [query="create database cdc_bench"] [error="Error 1049: Unknown database 'test'"] [errorVerbose="Error 1049: Unknown database 'test'
github.com/pingcap/errors.AddStack
github.com/pingcap/[email protected]/errors.go:174
github.com/pingcap/errors.Trace
github.com/pingcap/[email protected]/juju_adaptor.go:15
github.com/pingcap/ticdc/cdc/sink.(*mysqlSink).execDDL
github.com/pingcap/ticdc@/cdc/sink/mysql.go:109
github.com/pingcap/ticdc/cdc/sink.(*mysqlSink).execDDLWithMaxRetries.func1
github.com/pingcap/ticdc@/cdc/sink/mysql.go:95
github.com/pingcap/ticdc/pkg/retry.Run.func1
github.com/pingcap/ticdc@/pkg/retry/retry.go:31
github.com/cenkalti/backoff.RetryNotify
github.com/cenkalti/[email protected]+incompatible/retry.go:37
github.com/cenkalti/backoff.Retry
github.com/cenkalti/[email protected]+incompatible/retry.go:24
github.com/pingcap/ticdc/pkg/retry.Run
github.com/pingcap/ticdc@/pkg/retry/retry.go:30
github.com/pingcap/ticdc/cdc/sink.(*mysqlSink).execDDLWithMaxRetries
github.com/pingcap/ticdc@/cdc/sink/mysql.go:94
github.com/pingcap/ticdc/cdc/sink.(*mysqlSink).EmitDDLEvent
github.com/pingcap/ticdc@/cdc/sink/mysql.go:89
github.com/pingcap/ticdc/cdc.(*changeFeed).handleDDL
github.com/pingcap/ticdc@/cdc/owner.go:900
github.com/pingcap/ticdc/cdc.(*ownerImpl).handleDDL
github.com/pingcap/ticdc@/cdc/owner.go:811
github.com/pingcap/ticdc/cdc.(*ownerImpl).run
github.com/pingcap/ticdc@/cdc/owner.go:1118
github.com/pingcap/ticdc/cdc.(*ownerImpl).Run
github.com/pingcap/ticdc@/cdc/owner.go:1076
github.com/pingcap/ticdc/cdc.(*Capture).Start.func1
github.com/pingcap/ticdc@/cdc/capture.go:150
golang.org/x/sync/errgroup.(*Group).Go.func1
golang.org/x/[email protected]/errgroup/errgroup.go:57
runtime.goexit
runtime/asm_amd64.s:1357"]
/ # ./cdc version
Release Version:
Git Commit Hash: 80ee381230d5b7a3181464ad874f9a54c9220184
Git Branch: master
UTC Build Time: 2019-12-13 09:54:47
Go Version: go version go1.13.4 linux/amd64
sh-4.2# ./tidb-server -V
Release Version: v4.0.0-alpha-516-g5466a3c31
Git Commit Hash: 5466a3c31bf4b93fb3a2c595dd6aeac46aca7b8e
Git Branch: HEAD
UTC Build Time: 2019-12-02 09:22:52
GoVersion: go version go1.13.4 linux/amd64
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false
sh-4.2# ./tikv-server -V
TiKV
Release Version: 4.0.0-alpha
Git Commit Hash: 56dc6d63ade182289c4ab1e37996746040bc07d6
Git Commit Branch: cdc
UTC Build Time: 2019-11-06 03:29:57
Rust Version: rustc 1.39.0-nightly (c6e9c76c5 2019-09-04)
/ # ./pd-server -V
Release Version: v4.0.0-alpha-200-gf7f643c61
Git Commit Hash: f7f643c6138cc5240d954bfa1a560e3b14bfdc6e
Git Branch: HEAD
UTC Build Time: 2019-12-13 11:42:23
Used test-infra to test cdc and found that the cdc server produced 5 GiB of logs in ten minutes.
log:
[2019/12/13 16:05:14.537 +00:00] [ERROR] [client.go:371] ["RPC error"] [error="rpc error: code = Canceled desc = context canceled"]
[2019/12/13 16:05:14.537 +00:00] [ERROR] [client.go:371] ["RPC error"] [error="rpc error: code = Canceled desc = context canceled"]
[2019/12/13 16:05:14.537 +00:00] [ERROR] [client.go:371] ["RPC error"] [error="rpc error: code = Canceled desc = context canceled"]
Is your feature request related to a problem? Please describe:
Run alter table sbtest_pk add primary key(id), then check the downstream status.
Describe the feature you'd like:
Describe alternatives you've considered:
[2020/03/11 23:47:04.772 +08:00] [DEBUG] [schema_storage.go:445] ["handle job: "] ["sql query"="alter table sbtest_pk add primary key(id)"] [job="ID:3484, Type:add primary key, State:synced, SchemaState:public, SchemaID:3413, TableID:3482, RowCount:0, ArgLen:0, start time: 2020-03-11 23:37:14.406 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:415220638759256066"]
[2020/03/11 23:47:04.772 +08:00] [DEBUG] [schema_storage.go:445] ["handle job: "] ["sql query"="drop table sbtest_pk"] [job="ID:3485, Type:drop table, State:synced, SchemaState:none, SchemaID:3413, TableID:3482, RowCount:0, ArgLen:0, start time: 2020-03-11 23:45:02.856 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2020/03/11 23:47:04.772 +08:00] [DEBUG] [schema_storage.go:367] ["drop table success"] [name=sbtest_pk] [id=3482]
[2020/03/11 23:47:04.772 +08:00] [DEBUG] [schema_storage.go:445] ["handle job: "] ["sql query"="CREATE TABLE `sbtest_pk` ( `id` int(11) NOT NULL, `k` int(11) NOT NULL DEFAULT '0', `c` char(120) NOT NULL DEFAULT '', `pad` char(60) NOT NULL DEFAULT '' )"] [job="ID:3487, Type:create table, State:synced, SchemaState:public, SchemaID:3413, TableID:3486, RowCount:0, ArgLen:0, start time: 2020-03-11 23:45:08.256 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2020/03/11 23:47:04.772 +08:00] [DEBUG] [schema_storage.go:383] ["create table success"] [name=cdc_sbtest.sbtest_pk] [id=3486]
Currently every puller creates a client, and every table creates a puller, so we may create many gRPC clients.
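One possible mitigation (a sketch only, assuming pullers can share connections keyed by store address) is to keep a small pool of gRPC client connections instead of dialing one per puller:

```go
package kv

import (
	"sync"

	"google.golang.org/grpc"
)

// connPool shares one gRPC client connection per TiKV store address, so
// that every puller (one per table) does not have to dial its own
// connection.
type connPool struct {
	mu    sync.Mutex
	conns map[string]*grpc.ClientConn
}

func newConnPool() *connPool {
	return &connPool{conns: make(map[string]*grpc.ClientConn)}
}

func (p *connPool) get(addr string) (*grpc.ClientConn, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if conn, ok := p.conns[addr]; ok {
		return conn, nil
	}
	// Plaintext dial for illustration; a real deployment would pass TLS
	// credentials here.
	conn, err := grpc.Dial(addr, grpc.WithInsecure())
	if err != nil {
		return nil, err
	}
	p.conns[addr] = conn
	return conn, nil
}
```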
In CDC we have the following replication model
1. kv client receives data
2. kv client sends data to puller via an event chan
3. puller adds data to a buffer, sorts data and re-constructs transactions
4. puller sends transactions to tableInfo (managed in a processor) via a txn chan
5. processor pulls all txns from the txn chan of each tableInfo (with txn ts no more than the CDC GlobalResolvedTs)
In the test with a large number of regions, we found replication blocked: the buffer in step 3 and the chan in step 4 were full, and no data was pulled in step 5. This may also be part of the reason for slow replication and low throughput. We should have a better data-forwarding model, for the following considerations:
We can separate this refactor into multiple small changes, including:
so that we can support sinks that don't care about txns.
The sink implementation can reconstruct txns from resolved events internally if needed.
Currently, we stop consuming kv events from TiKV once the rest of the pipeline is slow.
Implement a memory-limited buffer and use it to buffer events from TiKV; we should consume events from TiKV as soon as possible and fail the changefeed if the rest of the pipeline is too slow.
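A minimal sketch of such a memory-limited buffer, under the assumption that events carry a size and the caller fails the changefeed when ErrBufferFull is returned:

```go
package puller

import (
	"context"
	"errors"
	"sync/atomic"
)

// ErrBufferFull is returned when the pipeline downstream of the kv client is
// too slow and buffered events exceed the memory quota; the caller is then
// expected to fail the changefeed instead of stalling the kv client.
var ErrBufferFull = errors.New("changefeed event buffer exceeds memory quota")

// event is a simplified stand-in for a kv event; only its memory footprint
// matters for this sketch.
type event struct {
	size int64
}

// limitBuffer accepts events from the kv client as fast as possible and only
// rejects them once the configured memory quota is exceeded.
type limitBuffer struct {
	ch    chan event
	quota int64
	used  int64 // tracked atomically
}

func newLimitBuffer(quota int64) *limitBuffer {
	// The channel capacity should be large enough that the memory quota,
	// not the channel, is the effective limit.
	return &limitBuffer{ch: make(chan event, 1024), quota: quota}
}

// add either enqueues the event or reports that the memory quota is
// exhausted.
func (b *limitBuffer) add(ctx context.Context, e event) error {
	if atomic.AddInt64(&b.used, e.size) > b.quota {
		atomic.AddInt64(&b.used, -e.size)
		return ErrBufferFull
	}
	select {
	case <-ctx.Done():
		atomic.AddInt64(&b.used, -e.size)
		return ctx.Err()
	case b.ch <- e:
		return nil
	}
}

// get hands one buffered event to the rest of the pipeline.
func (b *limitBuffer) get(ctx context.Context) (event, error) {
	select {
	case <-ctx.Done():
		return event{}, ctx.Err()
	case e := <-b.ch:
		atomic.AddInt64(&b.used, -e.size)
		return e, nil
	}
}
```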
[2020/02/18 19:05:47.839 +08:00] [WARN] [disk.go:56] ["Mkdir temporary file error"] [tmpDir=/var/folders/nw/c0ncybdd6gj2f5w5tmqvk9y40000gn/T/tidb-server-tidb-server] [error="mkdir /var/folders/nw/c0ncybdd6gj2f5w5tmqvk9y40000gn/T/tidb-server-tidb-server: file exists"]
➜ ticdc git:(ana) ✗ fd disk.go ./vendor
vendor/github.com/pingcap/tidb/util/chunk/disk.go
vendor/github.com/shirou/gopsutil/disk/disk.go
The cause: tidb/util/chunk/disk.go initializes a temporary dir in init() (and we start multiple instances of tidb):
func init() {
err := os.RemoveAll(tmpDir) // clean the uncleared temp file during the last run.
if err != nil {
log.Warn("Remove temporary file error", zap.String("tmpDir", tmpDir), zap.Error(err))
}
err = os.Mkdir(tmpDir, 0755)
if err != nil {
log.Warn("Mkdir temporary file error", zap.String("tmpDir", tmpDir), zap.Error(err))
}
}
At about 2020/03/08 05:27:06.141 -04:00 the replication stops advancing, because the resolved ts of one table (sbtest3) does not advance.
CDC is built from #308, and TiKV is built from 5kbpers/tikv@1765a5b.
➜ curl -s http://172.16.5.113:10080/tables/cdc_bench/sbtest3/regions |grep region_id
"region_id": 176,
"region_id": 192,
"region_id": 204,
"region_id": 160,
cdc log: http://139.219.11.38:8000/KJTrQ/issue_321_cdc.log.tar.gz
tikv log: http://139.219.11.38:8000/NK0s0/tikv.log.tar.gz
some abnormal behavior:
[2020/03/08 05:26:56.099 -04:00] [INFO] [endpoint.rs:242] ["cdc register region"] [region_id=160]
but last resolved ts in TiKV is
[2020/03/08 05:29:05.462 -04:00] [INFO] [delegate.rs:279] ["resolved ts updated"] [resolved_ts=415146900050149856] [region_id=160]
[2020/03/08 05:26:56.099 -04:00] [INFO] [endpoint.rs:242] ["cdc register region"] [region_id=176]
[2020/03/08 05:26:56.104 -04:00] [INFO] [endpoint.rs:169] ["cdc deregister region"] [error="Some(Request(message: \"peer is not leader for region 176, leader may None\" not_leader { region_id: 176 }))"] [conn_id=Some(ConnID(6))] [downstream_id=Some(DownstreamID(10))] [region_id=176]
make test
which bin/failpoint-ctl >/dev/null 2>&1 || CGO_ENABLED=0 GO111MODULE=on go build -trimpath -o bin/failpoint-ctl github.com/pingcap/failpoint/failpoint-ctl
mkdir -p "/tmp/tidb_cdc_test"
$(echo $(for p in $(go list ./...| grep -vE 'vendor|proto|ticdc\/tests'); do echo ${p#"github.com/pingcap/ticdc/"}|grep -v "github.com/pingcap/ticdc"; done) | xargs bin/failpoint-ctl enable >/dev/null)
ok github.com/pingcap/ticdc 0.070s coverage: 100.0% of statements
{"level":"info","ts":"2020-03-03T17:40:20.472+0800","caller":"embed/etcd.go:117","msg":"configuring peer listeners","listen-peer-urls":["http://localhost:50545"]}
{"level":"info","ts":"2020-03-03T17:40:20.473+0800","caller":"embed/etcd.go:127","msg":"configuring client listeners","listen-client-urls":["http://localhost:50546"]}
{"level":"info","ts":"2020-03-03T17:40:20.474+0800","caller":"embed/etcd.go:299","msg":"starting an etcd server","etcd-version":"3.4.3","git-sha":"Not provided (use ./build instead of go build)","go-version":"go1.14","go-os":"darwin","go-arch":"amd64","max-cpu-set":12,"max-cpu-available":12,"member-initialized":false,"name":"default","data-dir":"/var/folders/nw/c0ncybdd6gj2f5w5tmqvk9y40000gn/T/check-2797722976074707835/0","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/folders/nw/c0ncybdd6gj2f5w5tmqvk9y40000gn/T/check-2797722976074707835/0/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":100000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://localhost:2380"],"listen-peer-urls":["http://localhost:50545"],"advertise-client-urls":["http://localhost:2379"],"listen-client-urls":["http://localhost:50546"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"default=http://localhost:2380","initial-cluster-state":"new","initial-cluster-token":"etcd-cluster","quota-size-bytes":2147483648,"pre-vote":false,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":""}
{"level":"info","ts":"2020-03-03T17:40:20.566+0800","caller":"etcdserver/backend.go:79","msg":"opened backend db","path":"/var/folders/nw/c0ncybdd6gj2f5w5tmqvk9y40000gn/T/check-2797722976074707835/0/member/snap/db","took":"90.734245ms"}
fatal error: checkptr: unsafe pointer conversion
goroutine 157 [running]:
runtime.throw(0x6c41cad, 0x23)
/usr/local/go/src/runtime/panic.go:1112 +0x72 fp=0xc0007d97a0 sp=0xc0007d9770 pc=0x40379f2
runtime.checkptrAlignment(0xc0001d9370, 0x6a84ee0, 0x1)
/usr/local/go/src/runtime/checkptr.go:18 +0xb7 fp=0xc0007d97d0 sp=0xc0007d97a0 pc=0x4009617
go.etcd.io/bbolt.(*Bucket).write(0xc0007d9948, 0x0, 0x0, 0x0)
/Users/huangjiahao/go/pkg/mod/go.etcd.io/[email protected]/bucket.go:624 +0x15c fp=0xc0007d9838 sp=0xc0007d97d0 pc=0x59c87bc
go.etcd.io/bbolt.(*Bucket).CreateBucket(0xc000200018, 0x858a188, 0x7, 0x7, 0xc0007d9ba8, 0x5a862c
./resources/bin/br version
Release Version:
Git Commit Hash: 719cac031a89dff89e8c8d3f2c10d988bf401617
Git Branch: master
UTC Build Time: 2019-12-09 03:32:23
Race Enabled: false
../go-tpc/bin/go-tpc --time=400m tpch --host 172.16.5.86 -P 4000 -T 1 --sf=1 prepare
mysql -h 172.16.5.86 -uroot -P4000 -e 'drop database if exists tmp_db'
mysql -h 172.16.5.86 -uroot -P4000 -e 'create database tmp_db'
./resources/bin/cdc server --pd-endpoints http://172.16.5.86:2379 > cdc_server.log
./resources/bin/cdc cli --pd-addr http://172.16.5.86:2379 --start-ts 1 --sink-uri 'root@tcp(127.0.0.1:3306)/test'
kill -9 $(pgrep cdc)
./resources/bin/cdc server --pd-endpoints http://172.16.5.86:2379 > cdc_server.log
./resources/bin/cdc cli --pd-addr http://172.16.5.86:2379 --start-ts 1 --sink-uri 'root@tcp(127.0.0.1:3306)/test'
$ cat cdc_server.log
[2019/12/12 13:49:29.296 +08:00] [WARN] [disk.go:56] ["Mkdir temporary file error"] [tmpDir=/tmp/tidb-server-cdc] [error="mkdir /tmp/tidb-server-cdc: file exists"]
test on pr #308
fatal error: sync: RUnlock of unlocked RWMutex
goroutine 508 [running]:
runtime.throw(0x1e130ef, 0x21)
runtime/panic.go:774 +0x72 fp=0xc0017637f0 sp=0xc0017637c0 pc=0x42f612
sync.throw(0x1e130ef, 0x21)
runtime/panic.go:760 +0x35 fp=0xc001763810 sp=0xc0017637f0 pc=0x42f595
sync.(*RWMutex).rUnlockSlow(0xc0008b64a0, 0xc0bfffffff)
sync/rwmutex.go:80 +0x3f fp=0xc001763838 sp=0xc001763810 pc=0x46f42f
sync.(*RWMutex).RUnlock(...)
sync/rwmutex.go:70
github.com/pingcap/ticdc/cdc/kv.(*CDCClient).receiveFromStream(0xc0006365d0, 0x2128360, 0xc000944000, 0xc0001cef90, 0xc000051fe0, 0x12, 0x5, 0x2145bc0, 0xc0007405a0, 0xc000445860, ...)
github.com/pingcap/ticdc@/cdc/kv/client.go:563 +0x3a4 fp=0xc001763ec8 sp=0xc001763838 pc=0x14391b4
github.com/pingcap/ticdc/cdc/kv.(*CDCClient).dispatchRequest.func1(0xc0008ad768, 0x0)
github.com/pingcap/ticdc@/cdc/kv/client.go:290 +0xc2 fp=0xc001763f58 sp=0xc001763ec8 pc=0x1443612
golang.org/x/sync/errgroup.(*Group).Go.func1(0xc0001cef90, 0xc00016ad20)
golang.org/x/[email protected]/errgroup/errgroup.go:57 +0x64 fp=0xc001763fd0 sp=0xc001763f58 pc=0xe0da34
runtime.goexit()
runtime/asm_amd64.s:1357 +0x1 fp=0xc001763fd8 sp=0xc001763fd0 pc=0x45f131
created by golang.org/x/sync/errgroup.(*Group).Go
golang.org/x/[email protected]/errgroup/errgroup.go:54 +0x66
full stdout log: http://139.219.11.38:8000/suZMH/20200306_1258_cdc_stdout.log
If this tool will be used with TiKV alone in the future, then the name tidb-cdc is not accurate.
Is your feature request related to a problem? Please describe:
Set merge-schedule-limit = 0 in pd.toml to disable region merge.
sysbench --config-file=config oltp_insert --rand-seed=$RANDOM --tables=1 --table-size=8000000 prepare
mysql -h 172.16.5.113 -u root -P 4000 -e "split table cdc_bench.sbtest1 between (0) and (1100000) regions 1000"
mysql -h 172.16.5.113 -u root -P 4000 -e "split table cdc_bench.sbtest1 between (1100000) and (2200000) regions 1000"
mysql -h 172.16.5.113 -u root -P 4000 -e "split table cdc_bench.sbtest1 between (2200000) and (3300000) regions 1000"
mysql -h 172.16.5.113 -u root -P 4000 -e "split table cdc_bench.sbtest1 between (3300000) and (4400000) regions 1000"
➜ grep "cdc register region" tikv.log|wc -l
1024
Describe the feature you'd like:
TiKV EventFeed supports receiving data with regions more than 1024.
At about 2020/03/10 04:28:40.560 -04:00, one TiKV is killed because of OOM. CDC doesn't receive any region data anymore (except for region_id=8). Note: a special TiKV config is used to test frequent region splits:
[coprocessor]
region-max-keys = 3000
region-split-keys = 2500
I have some doubts:
Is your feature request related to a problem? Please describe:
[coprocessor]
region-max-keys = 6000
region-split-keys = 5000
sysbench oltp_write_only --create_secondary=off --rand-seed=$RANDOM --tables=1 --table-size=10000000 prepare
sysbench oltp_write_only --create_secondary=off --rand-seed=$RANDOM --tables=1 --table-size=10000000 run
Describe the feature you'd like:
Besides, we found that the span_frontier takes too much CPU:
profile file: http://139.219.11.38:8000/oitOt/pprof.cdc.samples.cpu.005.pb.gz
This happens very easily.
The GC settings look fine:
...
| tikv_gc_life_time | 10m0s | All versions within life time will not be collected by GC, at least 10m, in Go format. |
| tikv_gc_last_run_time | 20200218-18:51:14 +0800 | The time when last GC starts. (DO NOT EDIT) |
| tikv_gc_safe_point | 20200218-18:41:14 +0800 | All versions after safe point can be accessed. (DO NOT EDIT) |
+-----------------------+-------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+
create changefeed ID: 04ab9a26-a5fb-42a4-a89f-7853ebf725e4 detail {"sink-uri":"root@tcp(127.0.0.1:3306)/","opts":{},"create-time":"2020-02-18T18:51:18.754343+08:00","start-ts":414717857956102145,"target-ts":0,"admin-job-type":0,"config":{"filter-case-sensitive":false,"filter-rules":null,"ignore-txn-commit-ts":null}}
You may now debug from another terminal. Press [ENTER] to exit.
Log of the cdc server:
Error: run server: create change feed 04ab9a26-a5fb-42a4-a89f-7853ebf725e4: create schema store failed: [tikv:9001]PD server timeout
Usage:
cdc server [flags]
Flags:
-h, --help help for server
--pd-endpoints string endpoints of PD, separated by comma (default "http://127.0.0.1:2379")
--status-addr string bind address for http status server (default "127.0.0.1:8300")
Global Flags:
--log-file string log file path (default "cdc.log")
--log-level string log level (etc: debug|info|warn|error) (default "debug")
run server: create change feed 04ab9a26-a5fb-42a4-a89f-7853ebf725e4: create schema store failed: [tikv:9001]PD server timeout
+08:00] [INFO] [client.go:134] ["[pd] create pd client with endpoints"] [pd-address="[http://127.0.0.1:2379]"]
[2020/02/18 18:51:17.738 +08:00] [INFO] [base_client.go:242] ["[pd] switch leader"] [new-leader=http://127.0.0.1:2379] [old-leader=]
[2020/02/18 18:51:17.738 +08:00] [INFO] [base_client.go:92] ["[pd] init cluster id"] [cluster-id=6794737329153784617]
[2020/02/18 18:51:17.738 +08:00] [INFO] [http_status.go:54] ["status http server is running"] [addr=0.0.0.0:8300]
[2020/02/18 18:51:17.771 +08:00] [INFO] [manager.go:253] ["get owner"] [ownerID=8db4a77e-bf7d-4a20-bc13-de2975abc096]
Is your feature request related to a problem? Please describe:
We have a simple processor aliveness check: basically we check whether either resolvedTs or checkpointTs has been updated within one minute, which doesn't meet the requirement of RTO < 30s.
In some tests we found that the kv client may block, or for other reasons the resolvedTs and checkpointTs can't be updated in time, which means a stale replication status doesn't always indicate an abnormal processor.
Describe the feature you'd like:
Design a better aliveness check strategy, which satisfies the RTO requirement above.
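One possible direction, sketched under the assumption that each processor can report a heartbeat independently of resolvedTs/checkpointTs progress, so that a stalled resolved ts alone does not count as a dead processor:

```go
package owner

import (
	"sync/atomic"
	"time"
)

// processorState is an illustrative sketch: the processor records a
// heartbeat whenever it is making any kind of progress (reading events,
// flushing, waiting on the upstream), independent of whether resolvedTs
// actually advanced.
type processorState struct {
	lastHeartbeat int64  // unix nanoseconds, updated atomically by the processor
	resolvedTs    uint64 // may stay unchanged even while the processor is healthy
}

// beat is called by the processor on every loop iteration.
func (p *processorState) beat() {
	atomic.StoreInt64(&p.lastHeartbeat, time.Now().UnixNano())
}

// isAlive lets the owner distinguish "slow but alive" (resolvedTs stuck while
// heartbeats keep coming) from "dead" (no heartbeat within the timeout), so a
// check interval well under 30s becomes feasible.
func (p *processorState) isAlive(timeout time.Duration) bool {
	last := time.Unix(0, atomic.LoadInt64(&p.lastHeartbeat))
	return time.Since(last) < timeout
}
```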
Please answer these questions before submitting your issue. Thanks!
./cdc cli --pd-addr=172.16.5.83:2329,172.16.5.84:2329,172.16.5.89:2329 --sink-uri="mysql://root:[email protected]:13307/" --start-ts 0
Error: [pd] failed to get cluster id
start a changefeed
[pd] failed to get cluster id"] [url=http://172.16.5.83:2329,172.16.5.84:2329,172.16.5.89:2329] [error="error:rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp: address 172.16.5.83:2329,172.16.5.84:2329,172.16.5.89:2329: too many colons in address\" target:172.16.5.83:2329,172.16.5.84:2329,172.16.5.89:2329 status:TRANSIENT_FAILURE"]
Currently, TiCDC cli outputs logs to cdc.log by default; for a command-line tool this is not reasonable and makes it hard for users to debug problems. However, if we log to stdout directly, we may get a lot of noisy logs.
To improve the user experience when using cdc cli, we can:
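As one possible approach (an assumption for illustration, not necessarily what this issue proposed): keep the detailed log in cdc.log, but give the cli a separate logger that only prints warnings and errors to stderr, e.g. with zap:

```go
package main

import (
	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

// newCliLogger builds a logger for the cdc cli that prints only warnings and
// errors to stderr, while a separate file logger (not shown) could still keep
// the full debug output.
func newCliLogger() (*zap.Logger, error) {
	cfg := zap.NewProductionConfig()
	cfg.OutputPaths = []string{"stderr"}
	cfg.ErrorOutputPaths = []string{"stderr"}
	cfg.Level = zap.NewAtomicLevelAt(zapcore.WarnLevel)
	return cfg.Build()
}

func main() {
	logger, err := newCliLogger()
	if err != nil {
		panic(err)
	}
	defer logger.Sync()
	logger.Warn("cli would log user-facing problems here")
}
```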
I'm going to bench the performance of CDC. I populate some data into the upstream TiDB cluster and after it's finished, I run cdc cli, and get some info like:
create changefeed detail &{SinkURI:root@tcp(127.0.0.1:3306)/test Opts:map[] CreateTime:2019-12-09 21:11:09.219849281 -0500 EST m=+0.017868850 StartTs:413114580074496000 TargetTs:0 Info:<nil>}
The sync task is running on the server side, so how do I know when the data is synced?
Please answer these questions before submitting your issue. Thanks!
What did you do?
If possible, provide a recipe for reproducing the error.
sh tests/run.sh --debug
Use sysbench to load some data at the upstream.
What did you expect to see?
replication works normally.
What did you see instead?
Replication stops after some time (no more data arrives at the downstream).
tikv.log keeps printing the following (even after stopping the cdc server, so we can be sure there are no more requests to TiKV):
Endless retry inside TiKV?
[2020/02/19 11:48:13.256 +08:00] [WARN] [endpoint.rs:255] ["region not found on incremental scan"] [region_id=48]
[2020/02/19 11:48:13.256 +08:00] [WARN] [endpoint.rs:255] ["region not found on incremental scan"] [region_id=48]
[2020/02/19 11:48:13.256 +08:00] [WARN] [endpoint.rs:255] ["region not found on incremental scan"] [region_id=48]
[2020/02/19 11:48:13.256 +08:00] [WARN] [endpoint.rs:255] ["region not found on incremental scan"] [region_id=48]
[2020/02/19 11:48:13.257 +08:00] [WARN] [endpoint.rs:255] ["region not found on incremental scan"] [region_id=48]
version of tikv: ad59724513ab83461c54c1996f89235301a036d7
The "region not found on incremental scan" log in tikv is filtered.
cdc.tar.gz
issue268.tar.gz
Is your feature request related to a problem? Please describe:
The original TiKV EventFeed API will be changed to a duplex stream. ref:
https://docs.google.com/document/d/1SN3ztOXy2QTlCS1Qp9dUWTBfxowx-nIGpkuCw2ccULM/edit
Describe the feature you'd like:
The kv client has a clear input and output:
Things that need to be done:
This helps test lag & latency with regard to the sink.
Please answer these questions before submitting your issue. Thanks!
We have processors for each changefeed task; a processor is essentially a goroutine. When all the tables processed by a processor are removed, the processor should stop.
Reproduce steps:
Solution:
The processor exits when all its tables are dropped.
The processor keeps running.
Versions of the cluster
Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):
master
TiCDC version (execute cdc version):
master
TiCDC (TiDB Change Data Capture) is a new distributed incremental replication tool for the TiDB ecosystem. TiCDC is still in development, but it already works properly in the experimental environment. When the TiCDC cluster starts, an owner is elected and the other nodes are called processors. The processors pull change key-value logs from TiKV, assemble the logs into transactions, and output them to the downstream data target. The owner watches the replication progress of the processors and coordinates them to ensure transaction order.
Easy
2100
TiCDC distributed design (Chinese version)