
pd's Introduction

PD


PD is the abbreviation for Placement Driver. It manages and schedules TiKV clusters.

PD supports fault-tolerance by embedding etcd.


If you're interested in contributing to PD, see CONTRIBUTING.md for more information about where to start.

Build

  1. Make sure Go (version 1.21) is installed.
  2. Use make to install PD. pd-server will be installed in the bin directory.

Usage

PD can be configured using command-line flags. For more information, see PD Configuration Flags.

Single node with default ports

You can run pd-server directly on your local machine. If you want to connect to PD from outside, you can let PD listen on the host IP.

# Set HOST_IP to the address you want to listen on
export HOST_IP="192.168.199.105"

pd-server --name="pd" \
          --data-dir="pd" \
          --client-urls="http://${HOST_IP}:2379" \
          --peer-urls="http://${HOST_IP}:2380" \
          --log-file=pd.log

Using curl to view PD members:

curl http://${HOST_IP}:2379/pd/api/v1/members

{
  "members": [
    {
      "name": "pd",
      "member_id": 15980934438217023866,
      "peer_urls": [
        "http://192.168.199.105:2380"
      ],
      "client_urls": [
        "http://192.168.199.105:2379"
      ],
      "deploy_path": "/",
      "binary_version": "v6.1.3",
      "git_hash": "1a4e975892512a97fb0e5b45c9be69aa76148793"
    }
  ]
}

You can also use httpie to call the API:

http http://${HOST_IP}:2379/pd/api/v1/members

Access-Control-Allow-Headers: accept, content-type, authorization
Access-Control-Allow-Methods: POST, GET, OPTIONS, PUT, DELETE
Access-Control-Allow-Origin: *
Content-Length: 1003
Content-Type: application/json; charset=UTF-8
Date: Mon, 12 Dec 2022 13:46:33 GMT

{
  "members": [
    {
      "name": "pd",
      "member_id": 15980934438217023866,
      "peer_urls": [
        "http://192.168.199.105:2380"
      ],
      "client_urls": [
        "http://192.168.199.105:2379"
      ],
      "deploy_path": "/",
      "binary_version": "v6.1.3",
      "git_hash": "1a4e975892512a97fb0e5b45c9be69aa76148793"
    }
  ]
}

Docker

You can choose one of the following methods to get a PD image:

  • Build locally:

    docker build -t pingcap/pd .
  • Pull from Docker Hub:

    docker pull pingcap/pd

Then you can run a single node using the following command:

# Set HOST_IP to the address you want to listen on
export HOST_IP="192.168.199.105"

docker run -d -p 2379:2379 -p 2380:2380 --name pd pingcap/pd \
          --name="pd" \
          --data-dir="pd" \
          --client-urls="http://0.0.0.0:2379" \
          --advertise-client-urls="http://${HOST_IP}:2379" \
          --peer-urls="http://0.0.0.0:2380" \
          --advertise-peer-urls="http://${HOST_IP}:2380" \
          --log-file=pd.log

Cluster

As a component of the TiKV project, PD needs to run with TiKV to work. The cluster can also include TiDB to provide SQL services. For detailed instructions to deploy a cluster, refer to Deploy a TiDB Cluster Using TiUP or TiDB on Kubernetes Documentation.


pd's Issues

support HTTP interface, deprecate old one.

v1 URLs

GET /v1/cluster/tso
GET /v1/cluster/leader
POST /v1/cluster/id
POST /v1/cluster/bootstrap
GET /v1/cluster/bootstrapped
GET /v1/cluster/config
PUT /v1/cluster/config
GET /v1/cluster/store/id
PUT /v1/cluster/store/id  
GET /v1/cluster/region/key
POST /v1/cluster/ask_change_peer
POST /v1/cluster/ask_split

@qiuyesuifeng

move template into same place with pd-server

As we can see in the Dockerfile:

    cp -f ./bin/pd-server /go/bin/pd-server && \
    cp -rf ./templates /go/templates

It is strange that pd-server is placed in bin while the templates stay in the parent folder; we should keep them together in bin.

How to get key address?

I think the new flow for getting a key's address looks like this:

  1. (optional) ticlient calls getAllNode, which returns node_id and address for every node.
  2. ticlient asks PD for getRegion(key), and PD returns a region.
  3. ticlient picks region.peers[0], finds the corresponding node address (from the cache built in step 1, or via getNode), and sends the request to TiKV.
  4. If peers[0] happens to be the leader, everything is OK. Otherwise, go to the next step.
  5. If peers[0] is not the leader, TiKV returns a Not Leader error together with the leader info.
  6. ticlient receives the Not Leader error and resends the request to the right leader peer.

I want to add a new method `GetKeyAddress` to pd-client, because ticlient needs to pass the region ID and peer to TiKV.
// Sketch: return the region that contains key k together with its leader peer.
func (c *Client) GetKeyAddress(k []byte) (*metapb.Region, *metapb.Peer, error) {
    region := c.GetRegion(k)
    peerLeader := raftCluster.getRegionLeader(region.GetRegionID(), region.GetPeers()[0])
    return region, peerLeader, nil
}

add more metrics for etcd

We should add more metrics, such as:

  • etcd transaction duration.
  • etcd transaction counter, error counter, not success counter.
  • generate id counter.
  • generate TSO counter.
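
A rough sketch of what these could look like with the Prometheus Go client; the metric and label names here are only illustrative, not the final ones:

package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
    etcdTxnDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
        Namespace: "pd",
        Subsystem: "etcd",
        Name:      "txn_duration_seconds",
        Help:      "Duration of etcd transactions.",
    })
    etcdTxnCounter = prometheus.NewCounterVec(prometheus.CounterOpts{
        Namespace: "pd",
        Subsystem: "etcd",
        Name:      "txn_total",
        Help:      "Count of etcd transactions by result.",
    }, []string{"result"}) // result: success, not_success, error
    idCounter = prometheus.NewCounter(prometheus.CounterOpts{
        Namespace: "pd",
        Name:      "id_alloc_total",
        Help:      "Count of generated IDs.",
    })
    tsoCounter = prometheus.NewCounter(prometheus.CounterOpts{
        Namespace: "pd",
        Name:      "tso_total",
        Help:      "Count of generated TSOs.",
    })
)

func init() {
    // Register everything with the default registry so it shows up on /metrics.
    prometheus.MustRegister(etcdTxnDuration, etcdTxnCounter, idCounter, tsoCounter)
}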

Anything to add?
/cc @qiuyesuifeng

save next tso time in etcd

Let's assume TsoSaveInterval is 3s, which means we save the last TSO in etcd every 3s.
Now suppose we save the last TSO (say t1) in etcd; when this PD goes down and another PD starts, it has to wait until t1 + 2 * 3s to ensure it can generate TSOs safely, and even that may still be unsafe.

We can save the next TSO in etcd instead. E.g., if now is t1, we save t2 (t1 + 3s) in etcd, so when another PD leader starts, it only needs to wait until after t2 before it can generate TSOs.
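
A rough sketch of the idea; the load and save helpers are assumed to read and write the "next TSO" boundary in etcd, and saveInterval is the TsoSaveInterval above:

// syncTimestamp is called when a PD becomes leader, before it hands out any TSO.
func syncTimestamp(load func() (time.Time, error), save func(time.Time) error,
    saveInterval time.Duration) (time.Time, error) {
    last, err := load()
    if err != nil {
        return time.Time{}, err
    }
    now := time.Now()
    // The stored value is an upper bound on any TSO already handed out, so
    // starting right after it is safe -- no 2 * TsoSaveInterval wait needed.
    if !now.After(last) {
        now = last
    }
    // Persist the next boundary before issuing timestamps, so the next
    // leader only has to start after this value.
    if err := save(now.Add(saveInterval)); err != nil {
        return time.Time{}, err
    }
    return now, nil
}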

/cc @qiuyesuifeng @disksing

support get region by id

Now we only support getting a region by key; I think sometimes we may need to get a region by region ID too.
To support this, we should add a field to pdpb.proto, maybe in GetMetaRequest, like:

message GetMetaRequest {
    optional MetaType meta_type    = 1;
    optional uint64 node_id        = 2;
    optional uint64 store_id       = 3;
    optional bytes region_key      = 4;
    optional uint64 cluster_id     = 5;
    optional uint64 region_id      = 6;
}

If meta_type is Region and region_key is set, we get the region by key; otherwise, if region_id is set, we get the region by ID.
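
A small sketch of how the server side could dispatch on the new field, assuming the generated getters for the message above; the cluster methods here are illustrative, not the real PD ones:

func getRegionMeta(cluster *raftCluster, req *pdpb.GetMetaRequest) (*metapb.Region, error) {
    if key := req.GetRegionKey(); len(key) > 0 {
        return cluster.getRegionByKey(key)
    }
    if id := req.GetRegionId(); id > 0 {
        return cluster.getRegionByID(id)
    }
    return nil, errors.New("either region_key or region_id must be set")
}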

@qiuyesuifeng

redirect request to real leader

Like etcd, if we send a request to a follower, the follower should redirect the request to the leader instead.
The redirect can work in two ways:

  • the follower returns 301 and lets the client redirect to the leader
  • the follower proxies the command to the leader

We should check which approach etcd uses and decide which one we should use.
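
If we go with the redirect option, the handler could look roughly like this; IsLeader and GetLeaderClientURL are hypothetical helpers on the server:

func redirectToLeader(s *Server, next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if s.IsLeader() {
            next.ServeHTTP(w, r)
            return
        }
        // Tell the client to retry against the current leader. The other
        // option is to proxy the request to the leader here instead.
        http.Redirect(w, r, s.GetLeaderClientURL()+r.URL.RequestURI(), http.StatusMovedPermanently)
    })
}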

/cc @ngaut @qiuyesuifeng @huachaohuang

remove root path flag.

We used a root path before to distinguish different PD clusters in etcd, but now PD embeds etcd and each PD cluster has its own etcd, so we no longer need the root path.

We should change PD, TiKV, and TiDB accordingly.

/cc @disksing

API: list pd members

Like etcd's members API, but we should implement it ourselves.

GET http://127.0.0.1:9090/api/v1/members

members = [
    {
       "name"
       "client-urls"
       "peer-urls"
    }
]
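
A rough sketch of such a handler on top of the embedded etcd, assuming the server keeps a clientv3.Client; the JSON field names match the sketch above:

func (s *Server) handleMembers(w http.ResponseWriter, r *http.Request) {
    resp, err := s.etcdClient.MemberList(r.Context())
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    type member struct {
        Name       string   `json:"name"`
        ClientURLs []string `json:"client-urls"`
        PeerURLs   []string `json:"peer-urls"`
    }
    members := make([]member, 0, len(resp.Members))
    for _, m := range resp.Members {
        members = append(members, member{Name: m.Name, ClientURLs: m.ClientURLs, PeerURLs: m.PeerURLs})
    }
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(map[string][]member{"members": members})
}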

dynamic set rebalance speed with API

Assume our database contains many GBs or TBs of data. When we start our TiKV nodes one by one, the nodes started earlier campaign for election before the others, win, and become region leaders, so most region leaders end up on a few nodes. We want the leaders to be transferred from these few nodes to the others as soon as possible, which needs a high rebalance speed. But when the database is serving OLTP traffic, we should limit the rebalance speed.
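
A minimal sketch of what such an endpoint could look like; the path, JSON field, and SetBalanceSpeed scheduler knob are all hypothetical:

// PUT /pd/api/v1/balance-speed with body {"speed": 16}
func (s *Server) handleSetBalanceSpeed(w http.ResponseWriter, r *http.Request) {
    var req struct {
        Speed int `json:"speed"` // max leader/region transfers per unit of time
    }
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }
    s.scheduler.SetBalanceSpeed(req.Speed) // hypothetical scheduler setter
    w.WriteHeader(http.StatusOK)
}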

Auto-balance region leader number.

We can support basic leader auto-balance first; we don't consider region balance here for now, nor the store machine load, and only focus on auto-balancing the leader count.

Assume we have 3 stores (1, 2, 3) and the max peer count is 3.

Note that the balance is approximate, e.g., store 1 may have 10 leaders, store 2 has 11, and store 3 has 9.

How to:

  • Support leader transfer command.
  • The leader peer of a region reports a heartbeat to PD, so PD knows how many leaders each store has.
  • PD will auto-balance leaders regularly (the interval may be set in the config file or passed as a flag), or be invoked manually through a RESTful API later.
  • Support rules, e.g., we only want leaders in stores 1 and 2, so we can't transfer a leader to store 3.

Problems:

  • How do we decide to transfer a leader from one store to another? E.g., if store 1 has 100 leaders but store 2 has only 1, we probably should transfer some leaders from 1 to 2.
  • If we know we should move leaders from 1 to 2, how do we select which region leaders to move? Is picking them at random OK?
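
A rough sketch of the simplest answer to the first question, i.e., move leaders from the most-loaded store to the least-loaded one; Store here is a hypothetical type:

type Store struct {
    ID          uint64
    LeaderCount int
}

// pickLeaderTransfer returns the source and target stores for one leader
// transfer, or nil if the counts are already approximately balanced.
func pickLeaderTransfer(stores []*Store) (from, to *Store) {
    for _, s := range stores {
        if from == nil || s.LeaderCount > from.LeaderCount {
            from = s
        }
        if to == nil || s.LeaderCount < to.LeaderCount {
            to = s
        }
    }
    if from == nil || to == nil || from.LeaderCount-to.LeaderCount <= 1 {
        return nil, nil
    }
    return from, to
}

Which region leader to move could then simply be a random pick among the source store's leaders, per the second question.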

/cc @ngaut @qiuyesuifeng @disksing

refactor pd job mechanism

Problems

  1. Now we may have lots of jobs in PD, because we don't deduplicate commands for a region.
  2. Even worse: if leader peer 1 in region 1 asks to add a peer, PD may first create a ConfChange "add peer 2" for it but not execute it quickly; some time later peer 1 asks to add a peer again, and PD may create a ConfChange "add peer 3", so we have two different jobs for the same region at the same time.
  3. If a job for a region is not executed successfully because of a timeout or other errors, PD retries it in a loop, which also blocks the following jobs of other regions.

Principle

  1. One region can only do one thing at a time: if the leader of the region asks to add a peer, PD will skip any other commands for this region until the add-peer job is finished (success or failure are both OK).
  2. Expanding on 1: once PD has decided what to do for a command, e.g., adding peer 3 to region a, it will keep executing that job, no matter how many times it retries or how often the leader asks.
  3. The execution of one region's job won't block the jobs of other regions.

How to (Deprecated)

  1. Use the region ID as the job key, like /job/region_id -> job, so we can discard duplicated region jobs and guarantee one job per region at a time.
  2. PD can scan the jobs of multiple regions and execute them concurrently.
  3. Expanding on 2: the scanning above has a problem, because we scan in region-ID order; e.g., if regions 1 and 2 both have a job and region 2's is earlier, we still scan region 1 first. So we should also keep the job list as before, which gives us two KV pairs: one maps region_id -> job_id, the other maps job_id -> job.
    So the job meta is: /job_region/region_id -> job_id and /job/job_id -> job.
  4. Guarantee all job executions are retryable and can't corrupt the Raft group, e.g., when adding peer 3 to region a, we must be sure that if adding peer 3 has already succeeded, a retried add of peer 3 fails instead of corrupting the group.
  5. There should be a cancel mechanism: if a job can't be executed successfully for a long time, PD may cancel it or notify the user to handle it manually.

/cc @ngaut @qiuyesuifeng @disksing @tiancaiamao

rebalance policy bug

I started 3 TiKV nodes and configured PD with max-peer-count=1. After I inserted some data, the only region keeps being transferred from one node to another and never stops. Oops.

gc dead store

If a store is down, PD should be told about it, or PD should notice by itself that the store has been unreachable for a long time.

Once PD is sure that the store is down, it will rebalance the regions that contain the dead store; we could do it in the following cases:

  • When the leader of a region reports a heartbeat, PD finds that the region contains the dead store and then rebalances it.
  • PD regularly picks a region that contains the dead store and rebalances it.
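
A minimal sketch of the pieces both cases need, assuming a hypothetical StoreInfo type that records the last heartbeat time:

type StoreInfo struct {
    ID            uint64
    LastHeartbeat time.Time
}

// isStoreDown reports whether the store has been silent longer than maxDownTime.
func isStoreDown(s *StoreInfo, maxDownTime time.Duration) bool {
    return time.Since(s.LastHeartbeat) > maxDownTime
}

// regionsToBalance returns the regions that still have a peer on the dead store.
func regionsToBalance(regions []*metapb.Region, deadStoreID uint64) []*metapb.Region {
    var out []*metapb.Region
    for _, r := range regions {
        for _, p := range r.GetPeers() {
            if p.GetStoreId() == deadStoreID {
                out = append(out, r)
                break
            }
        }
    }
    return out
}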

/cc @qiuyesuifeng @disksing @zhangjinpeng1987

reduce flags.

Now we have many flags in PD, which is a horrible thing for users. We should simplify them.

how to remove peer

When a peer/store is down, PD should auto-balance to guarantee that the Raft group has the correct replicas.

  1. Each TiKV store reports information to PD regularly. If PD doesn't receive messages from a store for a long time and also can't send commands to it, PD can consider the store down and notify users about it. (Notice that PD doesn't auto-balance directly here.)
  2. The leader of a Raft group detects that it can't communicate with a follower for a long time, so it tells PD. Once PD is sure that the peer, or the whole store containing it, is down, it can send a ConfChange remove command to the leader, and the leader removes the dead peer. Later, the leader peer can ask PD for a ConfChange add command again if it finds the replica count is too low.
  3. The dead peer may be isolated, so it can't receive the ConfChange remove log and can't remove itself. If the peer finds that it hasn't received any message from the other peers for a long time, it asks PD whether it is still in the region; if not, it removes itself directly.

/cc @ngaut @qiuyesuifeng @tiancaiamao

API: get PD leader

GET http://127.0.0.1:9090/api/v1/leader

{
   "addr": 
   "pid": 
}

Be careful that this gets the PD leader, not the etcd leader. TiKV can use this new API to get the PD leader.

reduce listening port

After we embed etcd, we have four public ports (1234, 9090, 2379, and 2380), which is too many.
We can use only 2379 and 2380 instead, but we need to do a few things:

  • Fix etcd-io/etcd#6014
  • Register a gRPC handler for TiDB use.
  • Support gRPC Restful gateway, like etcd, for TiKV use.
  • For TiKV, if the gRPC gateway performs badly (we need to test with about 100,000+ region heartbeats), we may consider using WebSocket.

/cc @ngaut @qiuyesuifeng

add join support

Now adding a member dynamically is not easy, see https://coreos.com/etcd/docs/latest/runtime-configuration.html
And we also have to supply the initial-cluster-state flag. I think we can do it another way:

# start first pd
pd-server -name pd1 -host pd1

# start second pd and join the cluster
pd-server -name pd2 -host pd2 -join "pd1=http://pd1:2379"

# start third pd and join the cluster
pd-server -name pd3 -host pd3 -join "pd1=http://pd1:2379"
# or
pd-server -name pd3 -host pd3 -join "pd2=http://pd2:2379"
# or
pd-server -name pd3 -host pd3 -join "pd1=http://pd1:2379,pd2=http://pd2:2379"

We start the first PD and then use the join mechanism to let the following PDs join the cluster. The join argument can contain any valid PDs in the current cluster.

/cc @disksing @qiuyesuifeng @iamxy

[error] parse cmd flags err flag provided but not defined: -initial-cluster

Compiled from the latest master branch.

➜  bin git:(master) ✗ ./pd-server --cluster-id=1 \
          --host="127.0.0.1" \
          --name=pd1 \
          --port=11234 \
          --http-port=19090 \
          --client-port=12379 \
          --peer-port=12380 \
          --initial-cluster="pd1=http://127.0.0.1:12380"
flag provided but not defined: -initial-cluster
Usage of pd:
  -L string
        log level: debug, info, warn, error, fatal (default "info")
  -advertise-client-port uint
        advertise port for client traffic (default '{client-port}')
  -advertise-peer-port uint
        advertise port for peer traffic (default '{peer-port}')
  -advertise-port uint
        advertise server port (deprecate later) (default '{port}')
  -client-port uint
        port for client traffic (default 2379)
  -cluster-id uint
        initial cluster ID for the pd cluster
  -config string
        Config file
  -data-dir string
        path to the data directory (default 'default.{name}') (default "default.pd")
  -host string
        host for outer traffic (default "127.0.0.1")
  -http-port uint
        http port (deprecate later) (default 9090)
  -initial-cluser string
        initial cluster configuration for bootstrapping (default '{name}=http://{host}:{advertise-peer-port}')
  -initial-cluster-state string
        initial cluster state ('new' or 'existing') (default "new")
  -name string
        human-readable name for this pd member (default "pd")
  -peer-port uint
        port for peer traffic (default 2380)
  -port uint
        server port (deprecate later) (default 1234)
2016/07/28 15:09:40 main.go:37: [error] parse cmd flags err flag provided but not defined: -initial-cluster

use etcd leader directly

After we embed etcd, we can use the etcd leader as the PD leader directly; there is no need to elect a separate PD leader and watch it any more.

We can check the etcd leader every 1s, or every election timeout, or even less.
Notice that we may need to change the transactions later: now we use a "leader equals" comparison in the transaction, but that won't work any more.
We can use the keys' modified revisions instead: keep the last modified revision of each key and use it in the transaction's compare (check) operation.
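
A rough sketch of the 1s polling loop using the clientv3 maintenance Status call; the callback and the member-ID/endpoint plumbing are assumptions:

// watchEtcdLeader polls the embedded etcd and reports whether this member is
// currently the etcd (and therefore PD) leader.
func watchEtcdLeader(ctx context.Context, cli *clientv3.Client, endpoint string,
    myID uint64, onChange func(isLeader bool)) {
    ticker := time.NewTicker(time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            resp, err := cli.Status(ctx, endpoint)
            if err != nil {
                continue // transient error, try again on the next tick
            }
            onChange(resp.Leader == myID)
        }
    }
}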

/cc @qiuyesuifeng @huachaohuang

run_cluster.sh doesn't form etcd cluster

run_cluster.sh doesn't form an etcd cluster because the members can't connect to each other:

2016-04-06 12:55:08.363265 I | raft: 2a388ca50baf376d is starting a new election at term 7
2016-04-06 12:55:08.363301 I | raft: 2a388ca50baf376d became candidate at term 8
2016-04-06 12:55:08.363307 I | raft: 2a388ca50baf376d received vote from 2a388ca50baf376d at term 8
2016-04-06 12:55:08.363316 I | raft: 2a388ca50baf376d [logterm: 1, index: 3] sent vote request to 4246095a1fe1162a at term 8
2016-04-06 12:55:08.363323 I | raft: 2a388ca50baf376d [logterm: 1, index: 3] sent vote request to b305748795677bd4 at term 8
2016/04/06 12:55:08 grpc: Conn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:32379: getsockopt: connection refused"; Reconnecting to "127.0.0.1:32379"
2016-04-06 12:55:09.763264 I | raft: 2a388ca50baf376d is starting a new election at term 8
2016-04-06 12:55:09.763308 I | raft: 2a388ca50baf376d became candidate at term 9
2016-04-06 12:55:09.763318 I | raft: 2a388ca50baf376d received vote from 2a388ca50baf376d at term 9
2016-04-06 12:55:09.763326 I | raft: 2a388ca50baf376d [logterm: 1, index: 3] sent vote request to 4246095a1fe1162a at term 9
2016-04-06 12:55:09.763352 I | raft: 2a388ca50baf376d [logterm: 1, index: 3] sent vote request to b305748795677bd4 at term 9
2016/04/06 12:55:10 grpc: Conn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:32379: getsockopt: connection refused"; Reconnecting to "127.0.0.1:32379"
2016/04/06 12:55:10 main.go:38: [error] create pd server err grpc: timed out trying to connect

  • pd: 2ec824a
  • Env: Docker 1.11-snapshot, Ubuntu 15.10

It seems that docker run -p is not working well with etcd.
Is it possible to use Docker Compose so as to eliminate -p (except for PD_ADVERTISE_ADDR)?

use unix socket instead of free port

We frequently hit "address already in use" errors in CI, which are caused by the free-port mechanism.
I see that etcd supports unix sockets, so we may use those instead of ports.
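
A minimal sketch of serving a test endpoint over a unix socket instead of a TCP port; the socket path and handler are illustrative:

// serveUnix listens on a unix socket, which avoids "address already in use"
// races entirely.
func serveUnix(path string, handler http.Handler) error {
    _ = os.Remove(path) // remove a stale socket left by a previous test run
    l, err := net.Listen("unix", path)
    if err != nil {
        return err
    }
    return (&http.Server{Handler: handler}).Serve(l)
}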
