sofastack / sofa-registry Goto Github PK

SOFARegistry is a production-level, low-latency, high-availability service registry powered by Ant Financial.

Home Page: https://www.sofastack.tech/sofa-registry/docs/Home

License: Apache License 2.0

Java 99.89% Shell 0.09% Makefile 0.01% Dockerfile 0.01%

sofa-registry's Introduction

SOFARegistry

SOFARegistry 是蚂蚁金服开源的一个生产级、高时效、高可用的服务注册中心。SOFARegistry 最早源自于淘宝的 ConfigServer，十年来，随着蚂蚁金服的业务发展，注册中心架构已经演进至第六代。目前 SOFARegistry 不仅全面服务于蚂蚁金服的自有业务，还随着蚂蚁金融科技服务众多合作伙伴，同时也兼容开源生态。SOFARegistry 采用 AP 架构，支持秒级时效性推送，同时采用分层架构支持无限水平扩展。

功能特性

支持服务发布与服务订阅
支持服务变更时的主动推送
丰富的 REST 接口
采用分层架构及数据分片，支持海量连接及海量数据
支持多副本备份，保证数据高可用
基于 SOFABolt 通信框架，服务上下线秒级通知
AP 架构，保证网络分区下的可用性

需要

编译需要 JDK 8 及以上、Maven 3.2.5 及以上。

运行需要 JDK 6 及以上，服务端运行需要 JDK 8及以上。

推荐使用JDK 8，JDK 16尚未被测试，可能会有兼容性问题

文档

贡献

如何参与 SOFARegistry 代码贡献

致谢

SOFARegistry 最早源于阿里内部的 ConfigServer，感谢毕玄创造了 ConfigServer，使 SOFARegistry 的发展有了良好的基础。同时，部分代码参考了 Netflix 的 Eureka，感谢 Netflix 开源了如此优秀框架。

开源许可

SOFARegistry 基于 Apache License 2.0 协议

sofa-registry's People

Contributors

Stargazers

Watchers

Forkers

jewin hadoop835 zhoudaqing caojie09 p79n6a liutw123 stateis0 dyb10101 synex-wh kqsmea8 yaoqi hqq2023623 softliumin fengjiachun geekcheng yatian wangqytot zhi-qi newlysoft huangxiaofeng10047 sky-tc atellwu jangocheng viyond chakra-coder hanzhaoxin dulong ygypc wangruihuano quanquan-dane heatwei scdxcc skymysky wangchengqun wooooshihui colinger 1008610010 weizai118 edtenz diffblue-benchmarks houjingchao ziyilin guanchao-yang zhaojigang purebaba easfire architecturetech fan2008shuai fangzhimeng colinbetter wangshuangquan-a youngercloud driphub cclearnjava xiaoyumo topgunbasten 00jy116 baidulinux qian-tan alishangtian captainjack0x7c8 anupis xuechaos sudarsan-sridharan s-xq ruider wanglf1207 timxim paulllb shuanys minsuoqingcheng zhaozhy andrew-wwj luozhiyun993 haipop tonycody shadowmanxx lightclouds917 straybirdzls applebite marcotansir seraphpan yangdaotan andypeng2015 wangyvqi zhatin lishuai2016 gooyoit uangguan charygao liuhangyang manu2013 seasidesky khotyn altafyanto cgb-tech-favorite whxiang draperchen rosegaller linjunqin

sofa-registry's Issues

有管理控制台吗

Datum version number is generated when writing to memory

Describe the bug

The version number of datum indicates that a datainfoid data is changed in a dataCenter record, because the previous version number directly uses the timestamp to be repeated in the case of concurrency, and a unique id is used instead of the timestamp in a timestamp related manner.
The datum version was originally generated when the pub was entered or the clientoff generated the unpub, because there is a 500ms delay, but for the snapshot to directly modify the memory, the version number may be inconsistent when put into the memory, and the large version may overwrite the small version, resulting in data. Serious problems such as loss of synchronization notifications

Modify

The datum version number is generated at the time of writing to the memory, but the version number is not updated for the synchronization entry and entering the memory version number, and only the version number is reconstructed for the access data pub and unpub. And for the case where the snapshot is directly written into the memory, the version prints the input before and after the log to ensure the sequential increment sequence.

Improving consistency between SessionServer and DataServer

The background

The reasons for the inconsistency between SessionServerServer and DataServer：

SessionServer sending clientOff is not absolutely reliable
SessionServer disconnection DataServer over 30 seconds triggers SessionServer Off, causing DataServer to be erased incorrectly
DataServer outage may cause datum loss in memory queues
DataServer outage exceeds N (N is the number of copies), and some datum will be permanently missible.
Disadvantage of consistency hash: When DataServer list changes frequently, SessionServers may send datum to incorrect DataServer at certain times, resulting in dirty datum and missing datum

How to ensure consistency?

Consistency between client and SessionServer:
- Pub / unpub sent by client: Write directly to SessionServer memory, which is 100% successful. Success means that client and SessionServer are identical.
- Client disconnection: SessionServer is connected aware and executes clientOff to delete pub datum in memory. Because of the bolt heartbeat, the reliability of the perception of connection survival can be guaranteed.
Eventual consistency between SessionServer and DataServer
- SessionServer asynchronous threads are responsible for sending specific pub/unpub/clientOff datum to DataServer, retry if failed, and drop and log if retry fails a certain number of times.
  - Special scenario: If the client disconnects/SessionServer downtime, the client will reconnect to other SessionServers and send all datum again, and the old datum will be deleted because of the client off/SessionServer off/expired.
- SessionServer regularly sends renew (including heartbeat and checksum) to ensure that the wrong datum can be corrected; Use expired mechanism to regularly clean up expired datum.

Design for SessionServer

Design for DataServer

DataServer reconnects meta logic bugs causing all dataServers to fail to connect to meta

Describe the bug

All sessions in the stable environment cannot be registered with the new pub, and the data node is calculated incorrectly. The link between the dataServer and the meta is invalid.log as:
Query the data node, all data nodes are continuously reporting errors, failing to establish a link with the meta-meta node(ip 214), and not connected with other remaining metas
The error code is as follows
SOFARegistry version:
JVM version (e.g. java -version):
OS version (e.g. uname -a):
Maven version:
IDE version:

Is there a long polling push-pull combination mechanism between client and session?

如题，如果没有做推拉结合的话，至少心跳要有吧
如果client断电了，session节点是发现不了的

Dataserver reaches working state but data is not synchronized

Describe the bug

Expanding new data nodes to achieve working state but the data is not fully synchronized, resulting in new node data holdings not meeting expectations

Analytical solution

The data synchronization of the new expansion node depends on the notification of other nodes. After receiving the notification, it starts to pull the corresponding version data, and knows that the data is synchronized before the status update of the data node is obtained. The code is as follows:

The actual situation has received the data pull notification, and there is no older version of the situation, but the data is really missing, only the pull data request is not initiated, the specific code modification point is as follows

Sometimes receive "Not leader" response from leader in OnStartingFollowing()

Describe the bug

Problem:
- OnStartingFollowing(): When you add Node to leader, you receive a response from leader that returns "Not leader" (our own implementation).
Reason
- In the internal process of jraft, node becomes leader - > sends leader heartbeat to all nodes - > other nodes become follower (asynchronous callback onStarting Following) - > leader receives all follower ACK (asynchronous callback onLeaderStart).
- So visit leader in onStarting Following, and leader will report "Not leader" because it has not yet executed onLeaderStart (because our code sets leaderTerm in onLeaderStart).
Restoration plan
- Do not judge leader based on leaderTerm, but directly on Node.isLeader().
Interim plan
- In the short term, you can sleep in on StartingFollowing() for a few seconds (for example, 3s) and then visit the leader.

provide an interface to get session data

In SOFADashboard, I hope to use SOFARegistry as a registry implementation, but currently SOFARegistry does not provide any interface to get service information.

add sofa hessian blacklist

Describe the bug

For security reasons, hessian needs to configure the blacklist file to take effect.
And add jvm parameter for list to take effect

Environment

SOFARegistry version:
JVM version (e.g. java -version):
OS version (e.g. uname -a):
Maven version:
IDE version:

Upgrade org.eclipse.jetty:jetty-server to version 9.4.17.v20190418

Describe the bug

fix security alert,Upgrade org.eclipse.jetty:jetty-server to version 9.4.17.v20190418

Expected behavior

Actual behavior

Steps to reproduce

Minimal yet complete reproducer code (or GitHub URL to code)

Environment

SOFARegistry version:
JVM version (e.g. java -version):
OS version (e.g. uname -a):
Maven version:
IDE version:

Provides querying all other roles ip list information interface on session and data

New Rest Api

Provides querying all other roles ip list information RESTFul api on session and data

session digest/<NodeType>/serverList/query

$ curl http://localhost:9603/digest/data/serverList/query
["x.x.x.43","x.x.x.58","x.x.x.128"]

data digest/<NodeType>/serverList/query

$ curl http://localhost:9622/digest/data/serverList/query
{"yansong-test":["x.x.x.58","x.x.x.128"]}

SOFARegistry version:
JVM version (e.g. java -version):
OS version (e.g. uname -a):
Maven version:
IDE version:

DataServer as a server to store other dataServer link information error

Describe the bug

The test verifies the data synchronization process between dataServers. When the link is normal, the data backup notification information in the cluster is reported. The ip error of the link cannot be found.
connect log eg:
The above logs are the link log and the broken link log printed at the time of the link event of the bolt. Note: The link triggered by the link 43:44066 is triggered before the link 43:55174 chain break event, and the last link status 43:44066 is on the successful link
However, the current dataserver stores the client dataserver link stored by the server as the Map, so if the storage relationship of the key is determined only by ip in the above process, the previously broken link event is triggered after the new connection is established (the above is). In this way, the event of the broken link is deleted by the same deletion logic of ip, and the link is actually connected but the storage relationship cannot be found. code show as below
Data synchronization logic and error exceptions are as follows
SOFARegistry version:
JVM version (e.g. java -version):
OS version (e.g. uname -a):
Maven version:
IDE version:

Dataserver expansion data synchronization process is affected by cleanup tasks

Describe the bug

The test process finds that if the regular cleaning task is set frequently, the expansion startup data may be cleaned up by the cleanup task. For example, the d1 d2 d3 node expands the d4 node because the final list of meta notifications to each node has a sequence, if d4 is at the end The notification is 4 nodes. If there are other nodes synchronizing the data before, the cleaning task will calculate according to the 3 nodes that the currently synchronized data does not belong to d4.
In addition, not only new nodes, but also other remaining node data may be relocated, so it is very likely that data will be cleaned up at this moment.

Modify

Even if the cleaning task frequency is very slow, it is still possible to encounter the cleaning task at the time of expansion, and the cleaning task time can only be reset when there is a data synchronization message acceptance time.

Wrong HttpCode of HealthResource

Describe the bug

The HTTP code that HealthResource returns when it is healthy must be 200, otherwise 500 should be returned.

注册中心启动失败,leadr选举地址为docker网卡地址，localhost:9622 health check failed.

使用默认配置文件，DefualtDataCenter:本机地址
启动报错，leader选举地址为docker网卡对应地址

Your scenes

错误信息如下
registry-raft.log

[2019-06-25 16:55:50,350][INFO][main][FSMCallerImpl] - Starts FSMCaller successfully.
[2019-06-25 16:55:50,489][WARN][main][LocalSnapshotStorage] - No data for snapshot reader /home/wsq/Applications/SOFA/sofa-registry-integration/raftData/snapshot
[2019-06-25 16:55:50,637][INFO][main][NodeImpl] - Node <RegistryGroup_DefaultDataCenter/172.17.0.1:9614> init, term: 0, lastLogId: LogId [index=0, term=0], conf: 192.168.155.60:9614, old_conf:
[2019-06-25 16:55:50,670][INFO][main][RaftGroupService] - Start the RaftGroupService successfully.
[2019-06-25 16:55:52,466][INFO][JRaft-ElectionTimer][NodeImpl] - Node <RegistryGroup_DefaultDataCenter/172.17.0.1:9614> term 0 start preVote
[2019-06-25 16:55:52,474][WARN][JRaft-ElectionTimer][NodeImpl] - Node <RegistryGroup_DefaultDataCenter/172.17.0.1:9614> can't do preVote as it is not in conf <192.168.155.60:9614>
[2019-06-25 16:55:53,597][INFO][JRaft-ElectionTimer][NodeImpl] - Node <RegistryGroup_DefaultDataCenter/172.17.0.1:9614> term 0 start preVote
[2019-06-25 16:55:53,598][WARN][JRaft-ElectionTimer][NodeImpl] - Node <RegistryGroup_DefaultDataCenter/172.17.0.1:9614> can't do preVote as it is not in conf <192.168.155.60:9614>
[2019-06-25 16:55:54,728][INFO][JRaft-ElectionTimer][NodeImpl] - Node <RegistryGroup_DefaultDataCenter/172.17.0.1:9614> term 0 start preVote
[2019-06-25 16:55:54,729][WARN][JRaft-ElectionTimer][NodeImpl] - Node <RegistryGroup_DefaultDataCenter/172.17.0.1:9614> can't do preVote as it is not in conf <192.168.155.60:9614>
registry-integration-std.out
Command: java -Dregistry.integration.home=/home/wsq/Applications/SOFA/sofa-registry-integration -Dspring.config.location=/home/wsq/Applications/SOFA/sofa-registry-integration/conf/application.properties -Duser.home=/home/wsq/Applications/SOFA/sofa-registry-integration -server -Xms512m -Xmx512m -Xmn256m -Xss256k -XX:+DisableExplicitGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/home/wsq/Applications/SOFA/sofa-registry-integration/logs/registry-integration-gc.log -verbose:gc -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/wsq/Applications/SOFA/sofa-registry-integration/logs -XX:ErrorFile=/home/wsq/Applications/SOFA/sofa-registry-integration/logs/registry-integration-hs_err_pid%p.log -XX:-OmitStackTraceInFastThrow -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:ParallelGCThreads=4 -XX:+CMSClassUnloadingEnabled -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 -jar /home/wsq/Applications/SOFA/sofa-registry-integration/registry-integration.jar --logging.config=/home/wsq/Applications/SOFA/sofa-registry-integration/conf/logback-spring.xml
Sofa-Middleware-Log SLF4J : Actual binding is of type [ com.alipay.remoting Logback ]
[2019-06-25 17:25:46,010][INFO][main][MetaServerBootstrap] - Open session node register server port 9610 success!
[2019-06-25 17:25:46,033][INFO][main][MetaServerBootstrap] - Open data node register server port 9611 success!
[2019-06-25 17:25:46,064][INFO][main][MetaServerBootstrap] - Open meta server port 9612 success!
[2019-06-25 17:25:51,753][INFO][main][MetaServerBootstrap] - Open http server port 9615 success!
[2019-06-25 17:25:53,651][INFO][main][MetaServerBootstrap] - Raft server port 9614 start success!group RegistryGroup
[2019-06-25 17:25:53,653][INFO][main][MetaServerBootstrap] - Raft client connect success!
[2019-06-25 17:25:53,682][INFO][main][MetaServerBootstrap] - Raft start CliService success!
[2019-06-25 17:25:53,692][INFO][main][MetaServerInitializerConfiguration] - Started MetaServer
[2019-06-25 17:25:59,264][INFO][main][RegistryApplication] - localhost:9615 health check success.
[2019-06-25 17:26:07,227][INFO][main][DataServerBootstrap] - [DataServerBootstrap] begin start server
[2019-06-25 17:26:07,349][INFO][main][DataServerBootstrap] - Data server for session started! port:9620
[2019-06-25 17:26:07,414][INFO][main][DataServerBootstrap] - Data server for sync started! port:9621
[2019-06-25 17:26:08,602][INFO][main][DataServerBootstrap] - Open http server port 9622 success!
[2019-06-25 17:26:09,085][INFO][main][DataServerBootstrap] - [DataServerBootstrap] raft client started!Leader is 172.17.0.1:9614
[2019-06-25 17:26:09,138][INFO][main][DataServerBootstrap] - [DataServerBootstrap] start server success
[2019-06-25 17:26:09,300][ERROR][main][RegistryApplication] - localhost:9622 health check failed.
[2019-06-25 17:26:10,342][ERROR][main][RegistryApplication] - localhost:9622 health check failed.
[2019-06-25 17:26:11,360][ERROR][main][RegistryApplication] - localhost:9622 health check failed.

Environment

---- 网卡信息

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: wlp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether a0:c5:89:4e:5e:26 brd ff:ff:ff:ff:ff:ff
inet 192.168.155.60/24 brd 192.168.155.255 scope global dynamic noprefixroute wlp2s0
valid_lft 71228sec preferred_lft 71228sec
inet6 fe80::1138:8b8e:eff4:7199/64 scope link noprefixroute
valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
link/ether 02:42:87:2f:e8:ef brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
valid_lft forever preferred_lft forever
inet6 fe80::42:87ff:fe2f:e8ef/64 scope link
valid_lft forever preferred_lft forever

SOFARegistry version:

V5.2.0

JVM version (e.g. java -version):

java version "1.8.0_211"
Java(TM) SE Runtime Environment (build 1.8.0_211-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.211-b12, mixed mode)

OS version (e.g. uname -a):

Linux wsq-pc 4.19.49-1-MANJARO #1 SMP PREEMPT Sun Jun 9 20:24:20 UTC 2019 x86_64 GNU/Linux

Maven version:

Apache Maven 3.6.1 (d66c9c0b3152b2e69ee9bac180bb8fcc8e6af555; 2019-04-05T03:00:29+08:00)

Regular version detection rotation training logic failure

Describe the bug

The stable environment finds that some machines can receive pushes after they have been online for a long time. Generally, the new registration pub is not pushed at the startup time, and the push is received after subsequent changes.
Troubleshoot the log, write the data from the pub data, the dataServer notifies the session change and then gets the data and pushes it, but the push fails.
According to the normal logic, the push fails, and the version number corresponding to the datainfoid should not be updated to the latest version number generated by the pub, so that the subsequent rotation training task compensates for the push failure and initiates the push guarantee again. However, after the current push fails, the version number of the datainfoid is updated immediately, and the subsequent rotation training task is not guaranteed to be pushed again.
We found that in the PushTaskClosure logic of the push task confirmation, the start time should add the push task to be confirmed in the map but the log shows

[2019-04-11 10:55:38,209][INFO][PushTaskClosureCheck-5-thread-29][PushTaskClosure] - Push task queue size 0,map size 0

So if the map is 0, and there is no task to be confirmed to join the queue, this will directly confirm the version directly, this logic has problems, the code is as follows

As for the process that the map is 啥 0, the logic initiates the push task. The earliest logic is that the task is added to the queue to be confirmed before the push task is initiated, and the subsequent waits are completed one by one. This process has become asynchronously added to the task due to the optimization of performance time, so there is a case where the map is size=0 at PushTaskClosure.start.

SOFARegistry version:
JVM version (e.g. java -version):
OS version (e.g. uname -a):
Maven version:
IDE version:

During meta startup, leader may not register itself

Describe the bug

During meta startup, leader may not register itself
- In the startProcess of setLeaderProcessListener, if sendNotify throw an exception, then registerCurrentNode() will not execute. This will result in incomplete metaLists for data and session, lack of leader ip, and subsequent data will fail to start.

引用了错误的jetty版本10.0.0-alpha0

Describe the bug

clone源码后，执行

mvn clean package -DskipTests

Expected behavior

编译成功

Actual behavior

报错:

/E:/sofastack/sofa-registry/server/remoting/http/src/main/java/com/alipay/sofa/registry/remoting/jersey/jetty/server/HttpChannelOverHttpCustom.java:[19,30] 无法访问org.eclipse.jetty.http.HttpVersion
[ERROR]   错误的类文件: D:\tool\apache-maven-3.5.3\repo\org\eclipse\jetty\jetty-http\10.0.0-alpha0\jetty-http-10.0.0-alpha0.jar(org/eclipse/jetty/http/HttpVersion.class)
[ERROR]     类文件具有错误的版本 55.0, 应为 52.0

Steps to reproduce

Minimal yet complete reproducer code (or GitHub URL to code)

sofa-registry/server/remoting/http/src/main/java/com/alipay/sofa/registry/remoting/jersey/jetty/server/HttpChannelOverHttpCustom.java:[19,30]

sofa-registry/server/pom.xml

 <jetty.version>[9.4.17.v20190418,)</jetty.version>

可能原因是Mavenc**仓库中的jetty最新版本是 10.0.0-alpha0 用是是jdk11编译的发布时间是:2019-07-13

Environment

SOFARegistry version: master and v5.2.0
JVM version (e.g. java -version): 1.8.0_171
OS version (e.g. uname -a): Win10 2018-11-10 14:38
Maven version: 3.5.3
IDE version: CMD

After the client sends a sub successfully, it is not guaranteed to subscribe to the data.

Describe the bug

current:
- When the session receives the sub, it will pull data from the memory cache or access data, and then push it to the client. There is a problem: If this process fails (retry 3 times), it will cause the client data to be missing. Even if the session has the logic of fetchData, it does not solve the problem that this single sub data is backward. Only once there is a data change, will there be a chance to push the sub again. Related log: grep -F "Discarding a task of SubscriberRegisterFetchTaskDispatcher-" common-default.log
solution:
- After the first time the sub-received result is re-triggered for 5 times, the logical dataInfoId version number of the fetchData is set to 0, so that the subsequent rotation training task will continue to believe that there is no sub-update success, and the continuous rotation training is updated.
- The rotation training version is set to 0, because we have already saved the dataInfoId version in each subscriber before, if the rotation training will not push again because of the low version, avoid unnecessary repeated push.
- For temporary push and empty push (specially the message dataid is pushed in the clientoff operation), the two scenarios do not affect the rotation of the rotation training version because there is no specific sub information or the sub information is deleted.

Environment

SOFARegistry version:5.2.0
JVM version (e.g. java -version):
OS version (e.g. uname -a):
Maven version:
IDE version:

是否有支持dubbo的计划？

有的话，计划什么版本什么时间？

当前master代码打包出来的服务端启动健康检查报错

Your question

hi, 按照 https://www.sofastack.tech/sofa-registry/docs/Server-QuickStart 中自己下载源码包打包的方式，最后启动 server，健康检查报错如下：

还有这个监控检查是改了吗？lsof -i:9603 发现没进程占用

Your scenes

describe your use scenes (why need this feature)

Your advice

describe the advice or solution you'd like

Environment

SOFARegistry version: 5.2.1-SNAPSHOT
JVM version (e.g. java -version): 1.8
OS version (e.g. uname -a): linux or mac
Maven version: 3.x
IDE version: idea

千万pub+sub部署咨询

咨询下，如果生产上需支持千万级别的pub+sub，以各500w为例，meta/session/data节点建议如何部署：是否需要部署多个区域？各个节点分别需要几台？服务器配置推荐什么规格？

各节点服务器配置咨询

请分别提供下meta/session/data节点运行的服务器推荐配置：包括OS版本/位数、CPU性能、存储要求、网卡要求、是否可使用虚拟机等。谢谢~

Push task to confirm thread blocking problem

Describe the bug

In the case of a large amount of pressure data, there is a large number of rejection errors for the pushconfirm push confirmation thread pool.
After each sub is pushed, the version is updated one by one to ensure that the push is successful one by one. This process requires each push task to complete the status update. The original method uses the blocking queue acquisition mode, which causes a large number of confirmation tasks to be completed. The timeout situation is serious and all threads cannot be used.

Modify

For confirmation that no separate thread is used, use the time wheel to periodically check the confirmation status, and all the confirmations cannot be completed within the time limit, that is, the version is not updated, and the next rotation is continued.

SessionCacheService: should throw exception when failed to fetch datum

Describe the bug

When datum acquisition fails, no exception is thrown, which eventually causes SessionServer to push empty data to Client, and the datum is not queried when SessionServer polls Data Server regularly.

Expected behavior

If datum acquisition fails, should throw exception, and invoke sessionInterests.checkAndUpdateInterestVersionZero(), so that the datum will fetch again when SessionServer polls Data Server regularly.

Actual behavior

When datum acquisition fails, no exception is thrown

Minimal yet complete reproducer code (or GitHub URL to code)

@Override
public Value getValue(final Key key) {
    Value payload = null;
    try {
        payload = readWriteCacheMap.get(key);
    } catch (Throwable t) {
        LOGGER.error("Cannot get value for key :" + key, t);
    }
    return payload;
}

@Override
public Map<Key, Value> getValues(final Iterable<Key> keys) {
    Map<Key, Value> valueMap = null;
    try {
        valueMap = readWriteCacheMap.getAll(keys);
    } catch (ExecutionException e) {
        LOGGER.error("Cannot get value for keys :" + keys, e);
    }
    return valueMap;
}

Environment

SOFARegistry version:
JVM version (e.g. java -version):
OS version (e.g. uname -a):
Maven version:
IDE version:

本文件注册中心

Your question

说明文档中有一段话，“本地文件注册中心一般用于测试” ，
是本地文件注册中心有性能问题么？

Your scenes

describe your use scenes (why need this feature)

Your advice

describe the advice or solution you'd like

Environment

SOFARegistry version:
JVM version (e.g. java -version):
OS version (e.g. uname -a):
Maven version:
IDE version:

请教和nacos的对比,有什么特色功能

Your question

请教和nacos的对比,有什么特色功能

Your scenes

describe your use scenes (why need this feature)

Your advice

describe the advice or solution you'd like

Environment

SOFARegistry version:
JVM version (e.g. java -version):
OS version (e.g. uname -a):
Maven version:
IDE version:

More data nodes cannot reach the working state problem

Describe the bug

The performance test process found that about 40 machines were started at the same time, and there would be no working problems.
And the problem continues to be unable to reach working

Analytical solution

Most machines that have not reached the working state are waiting for the same machine to send status, because both initialization machines have no data, and all the machine's init status messages are issued to update all current list status information to achieve working.
Through threaddump, this machine with no sending status, consistently unable to issue init messages and keep retrying, the thread has been stuck in the infinite loop part. This place may be retried for invalid links for infinite loop retry

Modify this part to get the data node information guarantee is normal, in addition to maintaining the continuous task in the DataServerNodeFactory here to update the link, the data node instance is also updated at any time, and the link is guaranteed to be valid when the failed link is obtained.

Context-inconsistent null check

Describe the bug

不一致的null检查

sofa-registry/server/server/session/src/main/java/com/alipay/sofa/registry/server/session/scheduler/task/DataChangeFetchTask.java

Lines 308 to 313 in ae8d216

 taskEvent.setSendTimeStamp(DatumVersionUtil.getRealTimestamp(datum.getVersion())); 

 taskEvent.setAttribute(Constant.PUSH_CLIENT_SUBSCRIBERS, subscribers); 

 taskEvent.setAttribute(Constant.PUSH_CLIENT_DATUM, datum); 

 taskEvent.setAttribute(Constant.PUSH_CLIENT_URL, new URL(address)); 

 int size = datum != null && datum.getPubMap() != null ? datum.getPubMap().size() : 0;

在308行处使用 datum没有做null检查,
在313行处使用 datum又做了null检查,
同时308行和313行之间的代码没有对datum进行赋值,
这样导致逻辑冲突，漏掉或者多余一处null检查,
同样的问题存在于文件

sofa-registry/server/server/session/src/main/java/com/alipay/sofa/registry/server/session/scheduler/task/DataChangeFetchCloudTask.java

Lines 281 to 286 in ae8d216

 int size = datum != null && datum.getPubMap() != null ? datum.getPubMap().size() : 0; 

 taskLogger.info( 

 "send {} taskURL:{},dataInfoId={},dataCenter={},pubSize={},subSize={},taskId={}", 

 taskEvent.getTaskType(), subscriber.getSourceAddress(), datum.getDataInfoId(), 

 datum.getDataCenter(), size, subscribers.size(), taskEvent.getTaskId());

等处

upgrade jraft from 1.2.4 to 1.2.5

Describe the bug

upgrade jraft from 1.2.4 to 1.2.5

TaskDispatcher

为什么使用“任务分发模型”？虽然它实现了流量塑形、任务过期、任务去重等功能，但Eureka的这种“生产者消费者模型”感觉实现有些复杂。用简单的生产者消费者模型好像也能实现这种功能。为什么要使用三个队列：acceptorQueue/reprocessQueue, pendingTasks, itemWorkQueue？真的有必要吗？
client在发现session server的时候需要先传一个session server的hostPort？需要物理机部署多个session server（如果是一个session server的话，有单点的风险）？现在在进行docker化，蚂蚁怎么部署的？如果是用物理机的话，一个session server物理机硬件宕机，还需要新启动的服务使用刚宕机的IP或者记得修改DNS？难道LVS或HaProxy？
辛苦解惑一下。谢谢！

Dataserver expansion or node change causes data migration process to increase memory

Describe the bug

All nodes on the dataserver are pressured to measure the maximum amount of storage. After the data is expanded, the new node immediately finds frequent fgc.
On the remaining machines after the same amount of data has been relocated, there are no more fgc occurrences than the new node.

Analytical solution

dump memory to see the new node's string type object is significantly more than the original node, and basically expands about 10 times, the main object of the resident old is the string takes up a larger
All previous access data including synchronous and backup data will perform the same character caching function for pub, but the data relocation at the initial stage of capacity expansion is not processed, and the processing problem is solved.

MetaNodeExchanger 的connectServer应该只是连接其中一台meatserver就好吧

代码中看到是连接这个数据中心所有的meatserver的，这个不太合理吧，连接其中一台就可以了吧

TaskQueue.cleanCompletedTasks has redundant code

Your question

TaskQueue.cleanCompletedTasks method return vod
so the taskMap.size() is redundant

Your scenes

in master brance

Your advice

delete this one or other reason?

Environment

SOFARegistry version: master
JVM version (e.g. java -version): jdk1.8
OS version (e.g. uname -a):
Maven version:
IDE version:

About RenewDecorate lastUpdateTimestamp

Describe the bug

RenewDecorate 的lastUpdateTimestamp属性定义不够明确

      public boolean isExpired() {
        return System.currentTimeMillis() > lastUpdateTimestamp + duration;
     }

    public void reNew() {
        lastUpdateTimestamp = System.currentTimeMillis() + duration;
    }

如果按isExpired 的实现，lastUpdateTimestamp 应该表示上次reNew 的时间，则reNew的实现应该为

   public void reNew() {
        lastUpdateTimestamp = System.currentTimeMillis();
    }

不知道是我理解有误还是这个设计别有用意？

Publish new task asynchronously, configure independent thread pool current limit

Describe the bug

Since the renew function requires that the pub addition is not successful after the dataserver is synchronized, the synchronization process is asynchronous, and the original data access queue pool AccessorExecutor is disabled.
For the case where the client playback traffic is huge, the new post task must be configured with independent threads for current limiting.

Modify

Configure an independent thread pool for asynchronous execution of the pub synchronization dataserver task to ensure that the task receives normal, and corrects the overnew error waiting for the renew task.
For the access test, all new Token bucket flow control is set, currently 10w cards are set, and subsequent restrictions are based on traffic.

服务数据是否无法保证多副本？

基础环境：3meta+3data+2session。各个data节点配置storeNodes=3。data节点ip：7，8，9
现增加一个data节点做扩容，storeNodes也配置为3。ip为53
服务注册时主副本写到53节点，53节点计算后给7和9发送NotifyDataSyncRequest，但是9在处理完该datum后，又执行LocalDataServerCleanHandle，clean了该服务数据
导致该服务只在53和7中存了两个副本
----引申之，是否存在以下可能性：主副本给其他2个data节点同步数据，但是这2个节点都将该数据clean了，导致服务数据只存在主副本这一个副本？
----是否集群中data节点越多，以上风险就越大？
----另外，验证若配置主副本data节点的storeNodes为1时，将主副本的data节点停机后，在其他data节点上查不到注册的服务数据。所以如果数据不能保证多副本，存在注册数据丢失的风险？

请确认是否为bug。

谢谢！

临时推送数据在合并推送逻辑存在问题

Describe the bug

Temporary push data will be specially treated.but in the merge data process, temporary data flag will be override by normal push data, and as same batch to put to cache map.But temporary data has not existed source address, it will invalidate normal data to be add to cache map that cause the NPE exception

Expected behavior

Actual behavior

Steps to reproduce

Minimal yet complete reproducer code (or GitHub URL to code)

Environment

SOFARegistry version:
JVM version (e.g. java -version):
OS version (e.g. uname -a):
Maven version:
IDE version:

data的多副本是怎么保证一致性的呢？

IO压力及日志文件咨询

测试环境3meta 3data 2session，5.2.0开源版本
200个tps注册200w提供方，data节点iops高达12000。这是否是写日志文件导致的？
日志目录下该天cache-digest.log为700多M，这是否是必须记的？有什么作用？
另外，请问，按生产运维经验，其他日志文件有哪些是必须记的，有哪些是可以提高日志级别不记或少记的？

Metadata related

注册中心获取当前发布的服务的元数据

请问：咱们的注册中心是否有考虑对外提供查询中心所有发布服务的详细原数据，比如
1.　服务名
2.　服务版本号
3.　服务方法名、入参、出参

多谢

Client fast chain breakage data can not be cleaned up

Describe the bug

The client fast chain break does not trigger the clientoff operation to delete data, causing garbage data to exist and push the client.

analysis

The query corresponds to ip and port does have a broken link record. Note that this link and broken link are triggered in unified seconds, which is a flash situation.

During the flash, the client made the pub data and successfully synchronized to the data.
By breaking the link and initiating the clientoff log, you can see that the pub sub wat memory that was queried at the time of the initiation is not stored, and all are false. This situation should be the new data pub has not come to the end of the dataserver to return, and then write the memory to start the clientoff, so the query will not launch the clientoff will not be able to clean up this data

solution

It is necessary to periodically check that the link does not exist but the memory data and the corresponding data are periodically cleaned up by the client.
SOFARegistry version:5.2.0
JVM version (e.g. java -version):
OS version (e.g. uname -a):
Maven version:
IDE version:

The call failed after the official project service of the official website was registered to the local registration center.

demo地址：https://github.com/sofastack-guides/kc-sofastack-demo.git
从官网上下载下来的demo项目，根据官网指南教程操作，我使用的是本地的部署的SOFARegistry注册中心，但是项目跑起来后就会出现如下情况，尝试过增加和删除service.unique.id=user99，都没有作用，现在不知道是哪里的问题了！

meta节点咨询

meta节点如果发生一台或多台宕机，会有哪些影响？
比如，是否影响服务注册、订阅；是否影响data或session节点上下线；是否影响服务数据备份？

我们测试集群配置3meta3data2session，5.2.0开源版本，启动服务提供方后，停掉一个meta节点，启动订阅方，报找不到提供方。
这结论是否正确？meta节点是否会成为瓶颈？

PublisherRegistration dosen't overwrite hashcode and equals

Your question

PublisherRegistration dosen't overwrite hashcode and equals.
In DefaultRegistryClient.register using the registrationPublisherMap judge whether
registration has registered, but this is not available.
This is in my opinion.

Your scenes

see source code find this question.

Your advice

Implement the hashCode and equals

Environment

SOFARegistry version:
JVM version (e.g. java -version):
OS version (e.g. uname -a):
Maven version:
IDE version:

DataServer does not work after startup, receives client off request processing error

Describe the bug

The dataserver receives the client off request when there is no working time, and the data synchronization and data deletion simultaneously cause the final data to be inconsistent.
Therefore, we must postpone the client off request processing of the current node, and process the client off request after the no working state has to synchronize the other node data. That is, we must prohibit the deletion of data in the no working state.
At the same time, other nodes also receive the same client off request. These requests are executed on other dataservers for data deletion, and the deleted logs are synchronized to the dataserver node without working, which also causes confusion. Therefore, we must postpone the execution of these synchronization logs to the post-working execution, and postpone the execution because the executed logs are idempotent operations, so the number can be completed normally.

SOFARegistry version:
JVM version (e.g. java -version):
OS version (e.g. uname -a):
Maven version:
IDE version:

update jraft version

Describe the bug

MetaServer store other node list dependent on sofa-jraft consistency，but old jraft version has some bug cause refresh leader error by client，and sometime receive the leader ip as 0.0.0.0:0 .In jraft 1.2.5 version has fixed it，so we update dependency version 1.2.5

Expected behavior

Actual behavior

Steps to reproduce

Minimal yet complete reproducer code (or GitHub URL to code)

Environment

SOFARegistry version:
JVM version (e.g. java -version):
OS version (e.g. uname -a):
Maven version:
IDE version:

DataServer changes from working state back to init state

Describe the bug

Sometime we found dataServer changes from working state back to init state during startup，logs eg.:
Analysis found that the data startup log found that the first time all the data ip list information obtained from the meta is the first list, this is not in the expected range, because all data startup is started one by one to register the meta node, before the system starts It is to get some or some of the node information one by one to carry out the subsequent operations. This environment is the first time to get the full list. It is probably because the meta raft protocol is caused by the persistence of the registration information before the local disk is performed.
Subsequent analysis of the code, the main code logic that causes the working state to change back to the init state is because the current node is already in the working state, and the subsequent incremental ip of the meta list change (relative to the newly added ip node of the existing ip) Including the current node ip will be backed up to init, this design has not been modified in the earliest dataserver code, originally involved in the idea of estimating in order to prevent the current node from breaking the chain for a long time and then the link back state recovery can restore the data, but this is very Inadequate retreat status will cause a lot of bans and other errors to occur according to the subsequent design, so follow up to remove this code, the detailed code is as follows

Then analyze the data log, the above raft-persistent content gets the history list data to get the full list faster, which causes the current node to become the working state speed ahead of time, so if it becomes working after receiving the meta The pushed list does not contain the current node, and then receives the current list list collection, which will eventually generate the list increment information including the current ip occurrence, which will eventually result in the change to init, for example
- The current data node ip is 41. If it has become the working state and the meta push list is received as [49, 48], the current change to the subsequent list change will not cause a state change, but will cause The memory list changes to [49, 48]
- The subsequent meta re-push list is [41, 49, 48], so the list increment for this push relative to the memory list is [41], and it is currently working, so Satisfy the requirement that the above code logic becomes init, and the transformation to the Init state occurs.
The above log process is verified as follows:
However, the above process has a special point. In the case where the data is already working, the meta will push a list that does not contain the current node [49, 48]. In theory, the current node becomes woring. Already registered on the meta, so each push list should contain 41, continue to analyze the leader node meta log found, after the working has not received the current node 41 renew update invalidation time , causing the meta to be culled by 41, so the list process that does not contain it is pushed, the log is as follows
That is, data 41 is not renewed. The analysis log finds that the time interval for initiating the renew task after the first registration is completed exceeds 30s. After the node is registered, the update expired configuration must be exceeded, so the action of rejecting 41 will be triggered for the first time. Once again, renew is pushed and the full list is pushed [41,49,48]
This time interval is very long, because the list change content of the first meta push contains the data list of other dataCenter. The current data is to start the renew task after all the data links are performed one by one. This process occurs when other dataCenter links timeout and repeated retry, resulting in This renew task can't be started for more than 30s, which ultimately leads to the above result.

Problem recurrence

Do not delete the meta raft persistent directory to keep the data for the first time to get the full list information, quickly into the working, the renew task does not initiate a delay of more than 30s, so the meta has been culled, followed by the woring into init

fix

In the dataserver initialization process, the renew current node task is started immediately after the node registration is completed, and the current node is renewed to keep the current node valid, so that the push node information does not include the current node, and eventually the node changes from the working state. Back to init state
SOFARegistry version:
JVM version (e.g. java -version):
OS version (e.g. uname -a):
Maven version:
IDE version:

Support blacklist function

Demand

pub data is invalid
- The data of all new pubs of the client corresponding to the ip is invalidated, that is, the pub content is not pushed to the subscriber. This process ensures that the new pub request client does not report an error and avoids repeatedly sending the pub.
- Clean up the data that has been published by the corresponding ip client, and call the clientoff interface execution according to ip, including the push-to-empty processing for the special message subscriber.
Special message subscribers push away
- The above ip cleanup clientoff process has cleaned up the data and pushed the special message dataId, but if the message client is restarted after the clientoff, the new subscription will also need to be pushed out. There is a corresponding pub push

Design realization

Increase the registration logic front face to ensure that the facet logic can continue to register through abnormal blocking or return directly without subsequent registration logic. The inside of the aspect is compatible with whether the original client is connected to determine the blocking logic. The blacklist filtering logic is divided into dataid filtering and ip filtering policies, and various policy hits are processed after the policy is customized

Specific logic
  - Disable the pub client's new pub data to directly prohibit the memory storage and send dataServer, directly return the client pub successfully, keep the client abnormal, and try again and again.
  - For the existing pub data and the old version consistently through the opsconfreg scheduling clientoff to clean up
  - For the new registration ip of the special message dataid, push the null data in the blacklist, and the subsequent sub does not perform the memory storage directly to report the sub success, avoiding the client repeatedly sub
  - The blacklist information opsconfreg management interface is unchanged, the data is synchronized to the meta to persist to multiple nodes to store metadata, and the changes are pushed to the session for memory update. Session gets configuration data during startup
The message dataId can not understand the content of the message. First, describe the function of the blacklist:
  - First, the blacklist currently provides the ability to hit according to the dataId and ip hit. The relationship between the two hit strategies is that the two hit strategies are configured to be filtered at the same time. The ip filter rule is defined in opsconfreg, and the filtered dataId defines the regular expression in the startup parameter.
  - Second, provide different hit behaviors for pub and sub
     - For pub, if it hits, it will not allow the new pub to register, and the registered pub needs to be deleted and push the deletion of the data that has subscribed to this pub, which is equivalent to the clientoff ip effect.
     - For sub, if hit, filtering the newly registered sub does not allow registration, and keeps the push data of this sub from being pushed back to avoid the client's listening has been waiting (currently this dimension according to dataid is mainly used for messages) Certain prefixes

Add monitoring logs

Describe the bug

Add monitoring logs

	taskEvent.setSendTimeStamp(DatumVersionUtil.getRealTimestamp(datum.getVersion()));
	taskEvent.setAttribute(Constant.PUSH_CLIENT_SUBSCRIBERS, subscribers);
	taskEvent.setAttribute(Constant.PUSH_CLIENT_DATUM, datum);
	taskEvent.setAttribute(Constant.PUSH_CLIENT_URL, new URL(address));

	int size = datum != null && datum.getPubMap() != null ? datum.getPubMap().size() : 0;

	int size = datum != null && datum.getPubMap() != null ? datum.getPubMap().size() : 0;

	taskLogger.info(
	"send {} taskURL:{},dataInfoId={},dataCenter={},pubSize={},subSize={},taskId={}",
	taskEvent.getTaskType(), subscriber.getSourceAddress(), datum.getDataInfoId(),
	datum.getDataCenter(), size, subscribers.size(), taskEvent.getTaskId());