
pd's Introduction

PD


PD is the abbreviation for Placement Driver. It manages and schedules TiKV clusters.

PD supports fault-tolerance by embedding etcd.


If you're interested in contributing to PD, see CONTRIBUTING.md for more information about where to start.

Build

  1. Make sure Go (version 1.21) is installed.
  2. Use make to install PD. pd-server will be installed in the bin directory.

Usage

PD can be configured using command-line flags. For more information, see PD Configuration Flags.

Single node with default ports

You can run pd-server directly on your local machine. If you want to connect to PD from outside, you can let PD listen on the host IP.

# Set HOST_IP to the address you want to listen on
export HOST_IP="192.168.199.105"

pd-server --name="pd" \
          --data-dir="pd" \
          --client-urls="http://${HOST_IP}:2379" \
          --peer-urls="http://${HOST_IP}:2380" \
          --log-file=pd.log

Using curl to view PD members:

curl http://${HOST_IP}:2379/pd/api/v1/members

{
  "members": [
    {
      "name": "pd",
      "member_id": 15980934438217023866,
      "peer_urls": [
        "http://192.168.199.105:2380"
      ],
      "client_urls": [
        "http://192.168.199.105:2379"
      ],
      "deploy_path": "/",
      "binary_version": "v6.1.3",
      "git_hash": "1a4e975892512a97fb0e5b45c9be69aa76148793"
    }
  ]
}

You can also use httpie to call the API:

http http://${HOST_IP}:2379/pd/api/v1/members

Access-Control-Allow-Headers: accept, content-type, authorization
Access-Control-Allow-Methods: POST, GET, OPTIONS, PUT, DELETE
Access-Control-Allow-Origin: *
Content-Length: 1003
Content-Type: application/json; charset=UTF-8
Date: Mon, 12 Dec 2022 13:46:33 GMT

{
  "members": [
    {
      "name": "pd",
      "member_id": 15980934438217023866,
      "peer_urls": [
        "http://192.168.199.105:2380"
      ],
      "client_urls": [
        "http://192.168.199.105:2379"
      ],
      "deploy_path": "/",
      "binary_version": "v6.1.3",
      "git_hash": "1a4e975892512a97fb0e5b45c9be69aa76148793"
    }
  ]
}

Docker

You can choose one of the following methods to get a PD image:

  • Build locally:

    docker build -t pingcap/pd .
  • Pull from Docker Hub:

    docker pull pingcap/pd

Then you can run a single node using the following command:

# Set HOST_IP to the address you want to listen on
export HOST_IP="192.168.199.105"

docker run -d -p 2379:2379 -p 2380:2380 --name pd pingcap/pd \
          --name="pd" \
          --data-dir="pd" \
          --client-urls="http://0.0.0.0:2379" \
          --advertise-client-urls="http://${HOST_IP}:2379" \
          --peer-urls="http://0.0.0.0:2380" \
          --advertise-peer-urls="http://${HOST_IP}:2380" \
          --log-file=pd.log

Cluster

As a component of the TiKV project, PD needs to run with TiKV to work. The cluster can also include TiDB to provide SQL services. For detailed instructions to deploy a cluster, refer to Deploy a TiDB Cluster Using TiUP or TiDB on Kubernetes Documentation.


pd's Issues

support HTTP interface, deprecate old one.

v1 URLs

GET /v1/cluster/tso
GET /v1/cluster/leader
POST /v1/cluster/id
POST /v1/cluster/bootstrap
GET /v1/cluster/bootstrapped
GET /v1/cluster/config
PUT /v1/cluster/config
GET /v1/cluster/store/id
PUT /v1/cluster/store/id  
GET /v1/cluster/region/key
POST /v1/cluster/ask_change_peer
POST /v1/cluster/ask_split

@qiuyesuifeng

move template into same place with pd-server

As we can see in the Dockerfile:

    cp -f ./bin/pd-server /go/bin/pd-server && \
    cp -rf ./templates /go/templates

It is strange that pd-server is placed in bin while the templates stay in the parent folder; we should keep them together in bin.

How to get key address?

I think the new flow for getting a key's address looks like this:

  1. (optional) ticlient calls getAllNode, which returns node_id and address for every node.
  2. ticlient asks PD for getRegion(key), and PD returns a region.
  3. ticlient picks region.peers[0], finds the corresponding node address (from the cache built in step 1, or via getNode), and sends the request to TiKV.
  4. If peers[0] happens to be the leader, everything is OK. Otherwise, go to the next step.
  5. If peers[0] is not the leader, TiKV returns a Not Leader error together with the leader info.
  6. ticlient receives the Not Leader error and resends the request to the right leader peer.

I want to add a new method `GetKeyAddress` to pd-client, because ticlient needs to pass the region ID and peer to TiKV.
// Sketch: return the region that contains key k together with its leader peer.
func (c *Client) GetKeyAddress(k []byte) (*metapb.Region, *metapb.Peer, error) {
    region := c.GetRegion(k)
    peerLeader := raftCluster.getRegionLeader(region.GetRegionID(), region.GetPeers()[0])
    return region, peerLeader, nil
}

add more metrics for etcd

We should add more metrics, such as:

  • etcd transaction duration.
  • etcd transaction counter, error counter, not success counter.
  • generate id counter.
  • generate TSO counter.
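
A rough sketch of what these could look like with the Prometheus Go client; the metric and label names here are only illustrative, not the final ones:

package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
    etcdTxnDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
        Namespace: "pd",
        Subsystem: "etcd",
        Name:      "txn_duration_seconds",
        Help:      "Duration of etcd transactions.",
    })
    etcdTxnCounter = prometheus.NewCounterVec(prometheus.CounterOpts{
        Namespace: "pd",
        Subsystem: "etcd",
        Name:      "txn_total",
        Help:      "Count of etcd transactions by result.",
    }, []string{"result"}) // result: success, not_success, error
    idCounter = prometheus.NewCounter(prometheus.CounterOpts{
        Namespace: "pd",
        Name:      "id_alloc_total",
        Help:      "Count of generated IDs.",
    })
    tsoCounter = prometheus.NewCounter(prometheus.CounterOpts{
        Namespace: "pd",
        Name:      "tso_total",
        Help:      "Count of generated TSOs.",
    })
)

func init() {
    // Register everything with the default registry so it shows up on /metrics.
    prometheus.MustRegister(etcdTxnDuration, etcdTxnCounter, idCounter, tsoCounter)
}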

Anything to add?
/cc @qiuyesuifeng

save next tso time in etcd

Let's assume TsoSaveInterval is 3s, which means we save the last TSO in etcd every 3s.
Now suppose we save the last TSO (say t1) in etcd; when this PD goes down and another PD starts, it has to wait until t1 + 2 * 3s to ensure it can generate TSOs safely, and even that may still be unsafe.

We can save the next TSO in etcd instead. E.g., if now is t1, we save t2 (t1 + 3s) in etcd, so when another PD leader starts, it only needs to wait until after t2 before it can generate TSOs.
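
A rough sketch of the idea; the load and save helpers are assumed to read and write the "next TSO" boundary in etcd, and saveInterval is the TsoSaveInterval above:

// syncTimestamp is called when a PD becomes leader, before it hands out any TSO.
func syncTimestamp(load func() (time.Time, error), save func(time.Time) error,
    saveInterval time.Duration) (time.Time, error) {
    last, err := load()
    if err != nil {
        return time.Time{}, err
    }
    now := time.Now()
    // The stored value is an upper bound on any TSO already handed out, so
    // starting right after it is safe -- no 2 * TsoSaveInterval wait needed.
    if !now.After(last) {
        now = last
    }
    // Persist the next boundary before issuing timestamps, so the next
    // leader only has to start after this value.
    if err := save(now.Add(saveInterval)); err != nil {
        return time.Time{}, err
    }
    return now, nil
}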

/cc @qiuyesuifeng @disksing

support get region by id

Now we only support getting a region by key; I think sometimes we may need to get a region by region ID too.
To support this, we should add a field to pdpb.proto, maybe in GetMetaRequest, like:

message GetMetaRequest {
    optional MetaType meta_type    = 1;
    optional uint64 node_id        = 2;
    optional uint64 store_id       = 3;
    optional bytes region_key      = 4;
    optional uint64 cluster_id     = 5;
    optional uint64 region_id      = 6;
}

If meta_type is Region and region_key is set, we get the region by key; otherwise, if region_id is set, we get the region by ID.
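
A small sketch of how the server side could dispatch on the new field, assuming the generated getters for the message above; the cluster methods here are illustrative, not the real PD ones:

func getRegionMeta(cluster *raftCluster, req *pdpb.GetMetaRequest) (*metapb.Region, error) {
    if key := req.GetRegionKey(); len(key) > 0 {
        return cluster.getRegionByKey(key)
    }
    if id := req.GetRegionId(); id > 0 {
        return cluster.getRegionByID(id)
    }
    return nil, errors.New("either region_key or region_id must be set")
}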

@qiuyesuifeng

redirect request to real leader

Like etcd, if we send a request to a follower, the follower should redirect the request to the leader instead.
The redirect can work in two ways:

  • the follower returns 301 and lets the client redirect to the leader
  • the follower proxies the command to the leader

We should check which approach etcd uses and decide which one we should use.
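
If we go with the redirect option, the handler could look roughly like this; IsLeader and GetLeaderClientURL are hypothetical helpers on the server:

func redirectToLeader(s *Server, next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if s.IsLeader() {
            next.ServeHTTP(w, r)
            return
        }
        // Tell the client to retry against the current leader. The other
        // option is to proxy the request to the leader here instead.
        http.Redirect(w, r, s.GetLeaderClientURL()+r.URL.RequestURI(), http.StatusMovedPermanently)
    })
}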

/cc @ngaut @qiuyesuifeng @huachaohuang

remove root path flag.

We used a root path before to distinguish different PD clusters in etcd, but now PD embeds etcd and each PD cluster has its own etcd, so we no longer need the root path.

We should change PD, TiKV, and TiDB accordingly.

/cc @disksing

API: list pd members

Like etcd's members API, but we should implement it ourselves.

GET http://127.0.0.1:9090/api/v1/members

members = [
    {
       "name"
       "client-urls"
       "peer-urls"
    }
]
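
A rough sketch of such a handler on top of the embedded etcd, assuming the server keeps a clientv3.Client; the JSON field names match the sketch above:

func (s *Server) handleMembers(w http.ResponseWriter, r *http.Request) {
    resp, err := s.etcdClient.MemberList(r.Context())
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    type member struct {
        Name       string   `json:"name"`
        ClientURLs []string `json:"client-urls"`
        PeerURLs   []string `json:"peer-urls"`
    }
    members := make([]member, 0, len(resp.Members))
    for _, m := range resp.Members {
        members = append(members, member{Name: m.Name, ClientURLs: m.ClientURLs, PeerURLs: m.PeerURLs})
    }
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(map[string][]member{"members": members})
}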

dynamic set rebalance speed with API

Assume our database contains many GBs or TBs of data. When we start our TiKV nodes one by one, the nodes started earlier campaign for election before the others, win, and become region leaders, so most region leaders end up on a few nodes. We want the leaders to be transferred from these few nodes to the others as soon as possible, which needs a high rebalance speed. But when the database is serving OLTP traffic, we should limit the rebalance speed.
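
A minimal sketch of what such an endpoint could look like; the path, JSON field, and SetBalanceSpeed scheduler knob are all hypothetical:

// PUT /pd/api/v1/balance-speed with body {"speed": 16}
func (s *Server) handleSetBalanceSpeed(w http.ResponseWriter, r *http.Request) {
    var req struct {
        Speed int `json:"speed"` // max leader/region transfers per unit of time
    }
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }
    s.scheduler.SetBalanceSpeed(req.Speed) // hypothetical scheduler setter
    w.WriteHeader(http.StatusOK)
}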

Auto-balance region leader number.

We can support basic leader auto-balance first; we don't consider region balance here for now, nor the store machine load, and only focus on auto-balancing the leader count.

Assume we have 3 stores (1, 2, 3) and the max peer count is 3.

Note that the balance is approximate, e.g., store 1 may have 10 leaders, store 2 has 11, and store 3 has 9.

How to:

  • Support leader transfer command.
  • The leader peer of a region reports a heartbeat to PD, so PD knows how many leaders each store has.
  • PD will auto-balance leaders regularly (the interval may be set in the config file or passed as a flag), or be invoked manually through a RESTful API later.
  • Support rules, e.g., we only want leaders in stores 1 and 2, so we can't transfer a leader to store 3.

Problems:

  • How do we decide to transfer a leader from one store to another? E.g., if store 1 has 100 leaders but store 2 has only 1, we probably should transfer some leaders from 1 to 2.
  • If we know we should move leaders from 1 to 2, how do we select which region leaders to move? Is picking them at random OK?
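
A rough sketch of the simplest answer to the first question, i.e., move leaders from the most-loaded store to the least-loaded one; Store here is a hypothetical type:

type Store struct {
    ID          uint64
    LeaderCount int
}

// pickLeaderTransfer returns the source and target stores for one leader
// transfer, or nil if the counts are already approximately balanced.
func pickLeaderTransfer(stores []*Store) (from, to *Store) {
    for _, s := range stores {
        if from == nil || s.LeaderCount > from.LeaderCount {
            from = s
        }
        if to == nil || s.LeaderCount < to.LeaderCount {
            to = s
        }
    }
    if from == nil || to == nil || from.LeaderCount-to.LeaderCount <= 1 {
        return nil, nil
    }
    return from, to
}

Which region leader to move could then simply be a random pick among the source store's leaders, per the second question.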

/cc @ngaut @qiuyesuifeng @disksing

refactor pd job mechanism

Problems

  1. Now we may have lots of jobs in PD, because we don't deduplicate commands for a region.
  2. Even worse: if leader peer 1 in region 1 asks to add a peer, PD may first create a ConfChange "add peer 2" for it but not execute it quickly; some time later peer 1 asks to add a peer again, and PD may create a ConfChange "add peer 3", so we have two different jobs for the same region at the same time.
  3. If a job for a region is not executed successfully because of a timeout or other errors, PD retries it in a loop, which also blocks the following jobs of other regions.

Principle

  1. One region can only do one thing at a time: if the leader of the region asks to add a peer, PD will skip any other commands for this region until the add-peer job is finished (success or failure are both OK).
  2. Expanding on 1: once PD has decided what to do for a command, e.g., adding peer 3 to region a, it will keep executing that job, no matter how many times it retries or how often the leader asks.
  3. The execution of one region's job won't block the jobs of other regions.

How to (Deprecated)

  1. Use the region ID as the job key, like /job/region_id -> job, so we can discard duplicated region jobs and guarantee one job per region at a time.
  2. PD can scan the jobs of multiple regions and execute them concurrently.
  3. Expanding on 2: the scanning above has a problem, because we scan in region-ID order; e.g., if regions 1 and 2 both have a job and region 2's is earlier, we still scan region 1 first. So we should also keep the job list as before, which gives us two KV pairs: one maps region_id -> job_id, the other maps job_id -> job.
    So the job meta is: /job_region/region_id -> job_id and /job/job_id -> job.
  4. Guarantee all job executions are retryable and can't corrupt the Raft group, e.g., when adding peer 3 to region a, we must be sure that if adding peer 3 has already succeeded, a retried add of peer 3 fails instead of corrupting the group.
  5. There should be a cancel mechanism: if a job can't be executed successfully for a long time, PD may cancel it or notify the user to handle it manually.

/cc @ngaut @qiuyesuifeng @disksing @tiancaiamao

rebalance policy bug

I started 3 TiKV nodes and configured PD with max-peer-count=1. After I inserted some data, the only region keeps being transferred from one node to another and never stops. Oops.

gc dead store

If a store is down, PD should be told about it, or PD should notice by itself that the store has been unreachable for a long time.

Once PD is sure that the store is down, it will rebalance the regions that contain the dead store; we could do it in the following cases:

  • When the leader of a region reports a heartbeat, PD finds that the region contains the dead store and then rebalances it.
  • PD regularly picks a region that contains the dead store and rebalances it.
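
A minimal sketch of the pieces both cases need, assuming a hypothetical StoreInfo type that records the last heartbeat time:

type StoreInfo struct {
    ID            uint64
    LastHeartbeat time.Time
}

// isStoreDown reports whether the store has been silent longer than maxDownTime.
func isStoreDown(s *StoreInfo, maxDownTime time.Duration) bool {
    return time.Since(s.LastHeartbeat) > maxDownTime
}

// regionsToBalance returns the regions that still have a peer on the dead store.
func regionsToBalance(regions []*metapb.Region, deadStoreID uint64) []*metapb.Region {
    var out []*metapb.Region
    for _, r := range regions {
        for _, p := range r.GetPeers() {
            if p.GetStoreId() == deadStoreID {
                out = append(out, r)
                break
            }
        }
    }
    return out
}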

/cc @qiuyesuifeng @disksing @zhangjinpeng1987

reduce flags.

Now we have many flags in PD, which is a horrible thing for users. We should simplify them.

how to remove peer

When a peer/store is down, PD should auto-balance to guarantee that the Raft group has the correct replicas.

  1. Each TiKV store reports information to PD regularly. If PD doesn't receive messages from a store for a long time and also can't send commands to it, PD can consider the store down and notify users about it. (Notice that PD doesn't auto-balance directly here.)
  2. The leader of a Raft group detects that it can't communicate with a follower for a long time, so it tells PD. Once PD is sure that the peer, or the whole store containing it, is down, it can send a ConfChange remove command to the leader, and the leader removes the dead peer. Later, the leader peer can ask PD for a ConfChange add command again if it finds the replica count is too low.
  3. The dead peer may be isolated, so it can't receive the ConfChange remove log and can't remove itself. If the peer finds that it hasn't received any message from the other peers for a long time, it asks PD whether it is still in the region; if not, it removes itself directly.

/cc @ngaut @qiuyesuifeng @tiancaiamao

API: get PD leader

GET http://127.0.0.1:9090/api/v1/leader

{
   "addr": 
   "pid": 
}

Be careful that this gets the PD leader, not the etcd leader. TiKV can use this new API to get the PD leader.

reduce listening port

After we embed etcd, we have four public ports (1234, 9090, 2379, and 2380), which is too many.
We can use only 2379 and 2380 instead, but we need to do a few things:

  • Fix etcd-io/etcd#6014
  • Register a gRPC handler for TiDB use.
  • Support gRPC Restful gateway, like etcd, for TiKV use.
  • For TiKV, if the gRPC gateway performs badly (we need to test with about 100,000+ region heartbeats), we may consider using WebSocket.

/cc @ngaut @qiuyesuifeng

add join support

Now adding a member dynamically is not easy, see https://coreos.com/etcd/docs/latest/runtime-configuration.html
And we also have to supply the initial-cluster-state flag. I think we can do it another way:

# start first pd
pd-server -name pd1 -host pd1

# start second pd and join the cluster
pd-server -name pd2 -host pd2 -join "pd1=http://pd1:2379"

# start third pd and join the cluster
pd-server -name pd3 -host pd3 -join "pd1=http://pd1:2379"
# or
pd-server -name pd3 -host pd3 -join "pd2=http://pd2:2379"
# or
pd-server -name pd3 -host pd3 -join "pd1=http://pd1:2379,pd2=http://pd2:2379"

We start the first PD and then use the join mechanism to let the following PDs join the cluster. The join argument can contain any valid PDs in the current cluster.

/cc @disksing @qiuyesuifeng @iamxy

[error] parse cmd flags err flag provided but not defined: -initial-cluster

Compiled from the latest master branch.

➜  bin git:(master) ✗ ./pd-server --cluster-id=1 \
          --host="127.0.0.1" \
          --name=pd1 \
          --port=11234 \
          --http-port=19090 \
          --client-port=12379 \
          --peer-port=12380 \
          --initial-cluster="pd1=http://127.0.0.1:12380"
flag provided but not defined: -initial-cluster
Usage of pd:
  -L string
        log level: debug, info, warn, error, fatal (default "info")
  -advertise-client-port uint
        advertise port for client traffic (default '{client-port}')
  -advertise-peer-port uint
        advertise port for peer traffic (default '{peer-port}')
  -advertise-port uint
        advertise server port (deprecate later) (default '{port}')
  -client-port uint
        port for client traffic (default 2379)
  -cluster-id uint
        initial cluster ID for the pd cluster
  -config string
        Config file
  -data-dir string
        path to the data directory (default 'default.{name}') (default "default.pd")
  -host string
        host for outer traffic (default "127.0.0.1")
  -http-port uint
        http port (deprecate later) (default 9090)
  -initial-cluser string
        initial cluster configuration for bootstrapping (default '{name}=http://{host}:{advertise-peer-port}')
  -initial-cluster-state string
        initial cluster state ('new' or 'existing') (default "new")
  -name string
        human-readable name for this pd member (default "pd")
  -peer-port uint
        port for peer traffic (default 2380)
  -port uint
        server port (deprecate later) (default 1234)
2016/07/28 15:09:40 main.go:37: [error] parse cmd flags err flag provided but not defined: -initial-cluster

use etcd leader directly

After we embed etcd, we can use the etcd leader as the PD leader directly; there is no need to elect a separate PD leader and watch it any more.

We can check the etcd leader every 1s, or every election timeout, or even less.
Notice that we may need to change the transactions later: now we use a "leader equals" comparison in the transaction, but that won't work any more.
We can use the keys' modified revisions instead: keep the last modified revision of each key and use it in the transaction's compare (check) operation.
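
A rough sketch of the 1s polling loop using the clientv3 maintenance Status call; the callback and the member-ID/endpoint plumbing are assumptions:

// watchEtcdLeader polls the embedded etcd and reports whether this member is
// currently the etcd (and therefore PD) leader.
func watchEtcdLeader(ctx context.Context, cli *clientv3.Client, endpoint string,
    myID uint64, onChange func(isLeader bool)) {
    ticker := time.NewTicker(time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            resp, err := cli.Status(ctx, endpoint)
            if err != nil {
                continue // transient error, try again on the next tick
            }
            onChange(resp.Leader == myID)
        }
    }
}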

/cc @qiuyesuifeng @huachaohuang

run_cluster.sh doesn't form etcd cluster

run_cluster.sh doesn't form an etcd cluster because the members can't connect to each other:

2016-04-06 12:55:08.363265 I | raft: 2a388ca50baf376d is starting a new election at term 7
2016-04-06 12:55:08.363301 I | raft: 2a388ca50baf376d became candidate at term 8
2016-04-06 12:55:08.363307 I | raft: 2a388ca50baf376d received vote from 2a388ca50baf376d at term 8
2016-04-06 12:55:08.363316 I | raft: 2a388ca50baf376d [logterm: 1, index: 3] sent vote request to 4246095a1fe1162a at term 8
2016-04-06 12:55:08.363323 I | raft: 2a388ca50baf376d [logterm: 1, index: 3] sent vote request to b305748795677bd4 at term 8
2016/04/06 12:55:08 grpc: Conn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:32379: getsockopt: connection refused"; Reconnecting to "127.0.0.1:32379"
2016-04-06 12:55:09.763264 I | raft: 2a388ca50baf376d is starting a new election at term 8
2016-04-06 12:55:09.763308 I | raft: 2a388ca50baf376d became candidate at term 9
2016-04-06 12:55:09.763318 I | raft: 2a388ca50baf376d received vote from 2a388ca50baf376d at term 9
2016-04-06 12:55:09.763326 I | raft: 2a388ca50baf376d [logterm: 1, index: 3] sent vote request to 4246095a1fe1162a at term 9
2016-04-06 12:55:09.763352 I | raft: 2a388ca50baf376d [logterm: 1, index: 3] sent vote request to b305748795677bd4 at term 9
2016/04/06 12:55:10 grpc: Conn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:32379: getsockopt: connection refused"; Reconnecting to "127.0.0.1:32379"
2016/04/06 12:55:10 main.go:38: [error] create pd server err grpc: timed out trying to connect

  • pd: 2ec824a
  • Env: Docker 1.11-snapshot, Ubuntu 15.10

It seems that docker run -p is not working well with etcd.
Is it possible to use Docker Compose so as to eliminate -p (except for PD_ADVERTISE_ADDR)?

use unix socket instead of free port

We frequently hit "address already in use" errors in CI, which are caused by the free-port mechanism.
I see that etcd supports unix sockets, so we may use those instead of ports.
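
A minimal sketch of serving a test endpoint over a unix socket instead of a TCP port; the socket path and handler are illustrative:

// serveUnix listens on a unix socket, which avoids "address already in use"
// races entirely.
func serveUnix(path string, handler http.Handler) error {
    _ = os.Remove(path) // remove a stale socket left by a previous test run
    l, err := net.Listen("unix", path)
    if err != nil {
        return err
    }
    return (&http.Server{Handler: handler}).Serve(l)
}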
