ecchronos's Introduction

ecChronos

ecChronos is a decentralized scheduling framework primarily focused on performing automatic repairs in Apache Cassandra.

The aim of ecChronos is to provide a simple yet effective scheduler that helps maintain a Cassandra cluster. It is primarily used to run repairs but can be extended to run other kinds of maintenance work as well.

  • Automate the process of keeping Cassandra repaired.
  • Split a table repair job into many smaller subrange repairs.
  • Expose statistics on how well repair is keeping up with the churn of data.
  • Offer many different plug-in points so that it can be customized to your specific use case.

ecChronos is a helper application that runs next to each instance of Apache Cassandra. It handles maintenance operations for the local node. The repair tasks make sure that each node runs repair once every interval. The interval is configurable but defaults to seven days.

More details on the underlying infrastructure can be found in ARCHITECTURE.md.

The REST interface of ecChronos is described in REST.md.

Prerequisites

  • JDK 11
  • Python 3.8

Installation

Installation instructions can be found in SETUP.md.

Command line utility

In a standalone installation, a command line utility called ecctool is provided. For more information about ecctool, refer to ECCTOOL.md.

Getting Started

Instructions on how to use ecChronos and configure it to suit your needs can be found in GETTING_STARTED.md.

Upgrade

Upgrade instructions can be found in UPGRADE.md.

Compatibility with Cassandra versions

Information about which ecChronos versions have been tested with which Cassandra versions can be found in COMPATIBILITY.md.

Contributing

Please read CONTRIBUTING.md for details on our code of conduct, development, and the process for submitting pull requests to us.

Versioning

We try to adhere to SemVer for versioning.

  • Anything requiring changes to configuration or plugin APIs should be released in a new major version.
  • Anything extending configuration or plugins in a backwards compatible way should be released in a new minor version.
  • Bug fixes should be made for the first known version and merged forward.

Authors

  • Marcus Olsson - Initial work - emolsson

See also the list of contributors who participated in this project.

License

This project is licensed under the Apache License. See the LICENSE.md file for details.

ecchronos's People

Contributors

arcturusmengsk, ch1bbe, cssndrthrift, danielweriksson, dapc11, dependabot[bot], dlpartain, emolsson, etedpet, fraserc182, gkunz, itskarlsson, jwaeab, kaiyaok2, manmagic3, masokol, pthariensflame, sajidriaz138, tommystendahl, valmiranogueira, victorcavichioli

ecchronos's Issues

Start application if Cassandra is in UJ

Currently, ecChronos initially tries to connect to Cassandra through CQL. If the connection fails, the ecChronos process exits with an error.

ecChronos should recognize when Cassandra is joining the cluster and wait.

Locking failures log too much

Since some lock contention is expected, WARN logs shouldn't be printed for it. Example:

10:28:12.183 [pool-3-thread-1] WARN  c.e.b.c.e.c.r.RepairLockFactoryImpl - Lock (RepairResource-1232f20b-58e4-4088-a130-19048ffcc667-1 in datacenter datacenter1) got error Unable to lock resource RepairResource-1232f20b-58e4-4088-a130-19048ffcc667-1 in datacenter datacenter1
10:28:12.183 [pool-3-thread-1] WARN  c.e.b.c.e.c.r.RepairLockFactoryImpl - com.ericsson.bss.cassandra.ecchronos.core.repair.RepairLockFactoryImpl@39043b5e - Unable to get lock for repair resource 'RepairResource(dc=datacenter1,resource=1232f20b-58e4-4088-a130-19048ffcc667)', releasing previously acquired locks - Unable to lock resource RepairResource-1232f20b-58e4-4088-a130-19048ffcc667-1 in datacenter datacenter1
10:28:12.199 [pool-3-thread-1] WARN  c.e.b.c.e.c.r.RepairLockFactoryImpl - Lock (RepairResource-1232f20b-58e4-4088-a130-19048ffcc667-1 in datacenter datacenter1) got error Unable to lock resource RepairResource-1232f20b-58e4-4088-a130-19048ffcc667-1 in datacenter datacenter1
10:28:12.200 [pool-3-thread-1] WARN  c.e.b.c.e.c.r.RepairLockFactoryImpl - com.ericsson.bss.cassandra.ecchronos.core.repair.RepairLockFactoryImpl@7b1150f1 - Unable to get lock for repair resource 'RepairResource(dc=datacenter1,resource=1232f20b-58e4-4088-a130-19048ffcc667)', releasing previously acquired locks - Unable to lock resource RepairResource-1232f20b-58e4-4088-a130-19048ffcc667-1 in datacenter datacenter1
10:28:12.207 [pool-3-thread-1] WARN  c.e.b.c.e.c.r.RepairLockFactoryImpl - Lock (RepairResource-6c784338-2527-47f7-82e7-ade4714f74ea-1 in datacenter datacenter1) got error Unable to lock resource RepairResource-6c784338-2527-47f7-82e7-ade4714f74ea-1 in datacenter datacenter1
10:28:12.208 [pool-3-thread-1] WARN  c.e.b.c.e.c.r.RepairLockFactoryImpl - com.ericsson.bss.cassandra.ecchronos.core.repair.RepairLockFactoryImpl@25212018 - Unable to get lock for repair resource 'RepairResource(dc=datacenter1,resource=6c784338-2527-47f7-82e7-ade4714f74ea)', releasing previously acquired locks - Unable to lock resource RepairResource-6c784338-2527-47f7-82e7-ade4714f74ea-1 in datacenter datacenter1
10:28:12.223 [pool-3-thread-1] INFO  c.e.b.c.e.c.r.state.RepairStateImpl - Table test.test2 last repaired at 2019-04-24 10:26:30, next repair 2019-04-24 10:27:30
10:28:12.227 [pool-3-thread-1] INFO  c.e.b.c.e.f.i.LoggingFaultReporter - Ceasing alarm: REPAIR_WARNING - {TABLE=test2, KEYSPACE=test}
10:28:12.233 [pool-3-thread-1] WARN  c.e.b.c.e.c.r.RepairLockFactoryImpl - Lock (RepairResource-6c784338-2527-47f7-82e7-ade4714f74ea-1 in datacenter datacenter1) got error Unable to lock resource RepairResource-6c784338-2527-47f7-82e7-ade4714f74ea-1 in datacenter datacenter1
10:28:12.234 [pool-3-thread-1] WARN  c.e.b.c.e.c.r.RepairLockFactoryImpl - com.ericsson.bss.cassandra.ecchronos.core.repair.RepairLockFactoryImpl@26394a83 - Unable to get lock for repair resource 'RepairResource(dc=datacenter1,resource=6c784338-2527-47f7-82e7-ade4714f74ea)', releasing previously acquired locks - Unable to lock resource RepairResource-6c784338-2527-47f7-82e7-ade4714f74ea-1 in datacenter datacenter1
10:28:12.236 [pool-3-thread-1] WARN  c.e.b.c.e.c.r.RepairLockFactoryImpl - Lock (RepairResource-1232f20b-58e4-4088-a130-19048ffcc667-1 in datacenter datacenter1) got error Unable to lock resource RepairResource-1232f20b-58e4-4088-a130-19048ffcc667-1 in datacenter datacenter1
10:28:12.236 [pool-3-thread-1] WARN  c.e.b.c.e.c.r.RepairLockFactoryImpl - com.ericsson.bss.cassandra.ecchronos.core.repair.RepairLockFactoryImpl@63c92ea5 - Unable to get lock for repair resource 'RepairResource(dc=datacenter1,resource=1232f20b-58e4-4088-a130-19048ffcc667)', releasing previously acquired locks - Unable to lock resource RepairResource-1232f20b-58e4-4088-a130-19048ffcc667-1 in datacenter datacenter1
10:28:12.238 [pool-3-thread-1] WARN  c.e.b.c.e.c.r.RepairLockFactoryImpl - Lock (RepairResource-1232f20b-58e4-4088-a130-19048ffcc667-1 in datacenter datacenter1) got error Unable to lock resource RepairResource-1232f20b-58e4-4088-a130-19048ffcc667-1 in datacenter datacenter1
10:28:12.238 [pool-3-thread-1] WARN  c.e.b.c.e.c.r.RepairLockFactoryImpl - com.ericsson.bss.cassandra.ecchronos.core.repair.RepairLockFactoryImpl@3252b9d0 - Unable to get lock for repair resource 'RepairResource(dc=datacenter1,resource=1232f20b-58e4-4088-a130-19048ffcc667)', releasing previously acquired locks - Unable to lock resource RepairResource-1232f20b-58e4-4088-a130-19048ffcc667-1 in datacenter datacenter1

High memory usage

A lot of unnecessary memory is being used for virtual node repair history and ReplicaRepairGroup.

We should:

  • Remove usage of long token
    • As we create a ton of LongTokenRanges it's simpler to just keep the long values directly in the class.
  • Reuse replication structures across a keyspace
    • Including reusing replica sets across different virtual nodes
  • Not use deep copy in all repair state classes
    • We could use ImmutableSet/List, etc. directly instead of creating a new array/set for each instance of VnodeRepairState/ReplicaRepairGroup
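The idea of reusing replica sets across virtual nodes can be illustrated with a small interning sketch. It is written in Python for brevity and is not the ecChronos implementation; `ReplicaSetInterner` and its method are hypothetical names.

```python
from typing import Dict, FrozenSet, Iterable


class ReplicaSetInterner:
    """Hands out one shared immutable set per distinct group of replicas (hypothetical helper)."""

    def __init__(self) -> None:
        self._cache: Dict[FrozenSet[str], FrozenSet[str]] = {}

    def intern(self, replicas: Iterable[str]) -> FrozenSet[str]:
        key = frozenset(replicas)
        # setdefault returns the instance stored first, so equal sets share one object.
        return self._cache.setdefault(key, key)


interner = ReplicaSetInterner()
first = interner.intern(["node1", "node2", "node3"])
second = interner.intern(["node3", "node1", "node2"])
print(first is second)
```

Because the shared sets are immutable, they can be handed to many repair state objects without deep copies.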

Early indications show that the memory used by TableRepairJob(s) decreases to ~33%, going from ~18MB to ~6MB in a setup with:

  • 256 virtual nodes
  • 22 keyspaces
    • 4 tables in each
    • Different replication strategy
  • 17 nodes
    • DC1 = 9 nodes
    • DC2 = 4 nodes
    • DC3 = 4 nodes

Proper REST interface

The current REST interface is not very intuitive or user-friendly and should be updated.
The proposal is to change it from e.g.:

/repair-scheduler/v1/get/mykeyspace/mytable

to:

/repair-management/v1/status/keyspaces/mykeyspace/tables/mytable

In summary the interface should be:

GET /repair-management/v1/status
GET /repair-management/v1/status/keyspaces/<keyspace>
GET /repair-management/v1/status/keyspaces/<keyspace>/tables/<table>

GET /repair-management/v1/config
GET /repair-management/v1/config/keyspaces/<keyspace>
GET /repair-management/v1/config/keyspaces/<keyspace>/tables/<table>

This will hopefully align nicely with the on-demand repair service as well as future "POST" requests to update.
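To make the proposed resource layout concrete, here is a minimal client-side sketch in Python that builds the paths listed above. The function name `resource_path` is hypothetical and only illustrates the naming scheme; it is not part of ecChronos.

```python
from typing import Optional
from urllib.parse import quote

BASE_PATH = "/repair-management/v1"


def resource_path(kind: str, keyspace: Optional[str] = None, table: Optional[str] = None) -> str:
    """Build a path for the proposed layout; kind is 'status' or 'config'."""
    if kind not in ("status", "config"):
        raise ValueError("kind must be 'status' or 'config'")
    if table is not None and keyspace is None:
        raise ValueError("a table requires a keyspace")
    path = f"{BASE_PATH}/{kind}"
    if keyspace is not None:
        # Percent-encode names so they are safe to use as path segments.
        path += f"/keyspaces/{quote(keyspace, safe='')}"
    if table is not None:
        path += f"/tables/{quote(table, safe='')}"
    return path


print(resource_path("status", "mykeyspace", "mytable"))
```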

Ability to change the scheduler interval

Currently, the scheduler runs every 30 seconds. There might be use cases where this needs to be increased or decreased. Adding the ability to configure the value is a fairly trivial change and would make testing quicker since we can increase the frequency.

Split/merge token range history

The current mechanism for both keeping track of and running repairs is based on the token ranges of the node. This is heavily relying on a large number of virtual nodes in order to split repairs.

An alternative to this would be to split and merge both repair history and token ranges as needed. Basically that the repair tasks are performed with smaller token ranges which are later merged together again when parsing the repair history. This would remove the need to use virtual nodes by having a configurable "token range size" instead.

Repair history for adjacent ranges could be merged, for example if the repairs were performed within an hour of each other or if both are outside of the repair interval. The merged result would retain the timestamp of the "oldest" entry.
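The merge rule described above could look roughly like the following Python sketch. The `RepairEntry` type, the function name and the one-hour default are assumptions made for illustration, and ranges that wrap around the ring are ignored.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class RepairEntry:
    start: int        # exclusive start token
    end: int          # inclusive end token
    repaired_at: int  # epoch seconds


def merge_history(entries: Iterable[RepairEntry], now: int, repair_interval: int,
                  merge_window: int = 3600) -> List[RepairEntry]:
    """Merge adjacent entries repaired close in time, or both overdue."""
    merged: List[RepairEntry] = []
    for entry in sorted(entries, key=lambda e: e.start):
        if merged:
            prev = merged[-1]
            adjacent = prev.end == entry.start
            close_in_time = abs(prev.repaired_at - entry.repaired_at) <= merge_window
            both_overdue = (now - prev.repaired_at > repair_interval
                            and now - entry.repaired_at > repair_interval)
            if adjacent and (close_in_time or both_overdue):
                # The merged entry keeps the oldest timestamp.
                merged[-1] = RepairEntry(prev.start, entry.end,
                                         min(prev.repaired_at, entry.repaired_at))
                continue
        merged.append(entry)
    return merged
```

Keeping the oldest timestamp is the conservative choice: a merged range is never reported as more recently repaired than any of its parts.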

Additionally, the "token range size" could be dynamic, based on a "target repair size":

target token range = total token range / (current disk usage / target repair size)

For example, if a table has 10GB of data and we specify a target repair size of 1GB, we would divide the total token range into 10 equal pieces, assuming that data is uniformly distributed across the token range.
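As a worked example of the formula, here is a minimal Python sketch. The function names are hypothetical, the total token range assumes the Murmur3 partitioner (tokens from -2^63 to 2^63 - 1), and the piece count is rounded up so that no piece exceeds the target, which is a small deviation from the plain division in the formula.

```python
# Assumption: Murmur3 partitioner, tokens span [-2**63, 2**63 - 1].
TOTAL_TOKEN_RANGE = 2**64
GB = 1024**3


def subrange_count(disk_usage_bytes: int, target_repair_size_bytes: int) -> int:
    """Number of equal pieces to split the token range into (at least one)."""
    if target_repair_size_bytes <= 0:
        raise ValueError("target repair size must be positive")
    # Ceiling division, so each piece holds at most the target size.
    return max(1, -(-disk_usage_bytes // target_repair_size_bytes))


def target_token_range(disk_usage_bytes: int, target_repair_size_bytes: int,
                       total_token_range: int = TOTAL_TOKEN_RANGE) -> int:
    """Size in tokens of each piece, assuming uniformly distributed data."""
    return total_token_range // subrange_count(disk_usage_bytes, target_repair_size_bytes)


print(subrange_count(10 * GB, 1 * GB))  # 10 pieces for 10GB with a 1GB target
```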

Exception in RepairSchedulerImpl#close()

When closing RepairSchedulerImpl the following exception occurs:

java.util.ConcurrentModificationException
        at java.util.HashMap$HashIterator.nextNode(HashMap.java:1445)
        at java.util.HashMap$KeyIterator.next(HashMap.java:1469)
        at com.ericsson.bss.cassandra.ecchronos.core.repair.RepairSchedulerImpl.close(RepairSchedulerImpl.java:87)

This only happens when the schedule is for multiple tables.
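The stack trace is consistent with a map being modified while it is iterated, although the issue does not confirm the cause. The Python sketch below shows the analogous pattern and one common remedy, iterating over a snapshot of the keys. `TableJobRegistry` is a hypothetical stand-in, not the actual RepairSchedulerImpl.

```python
from typing import Dict


class TableJobRegistry:
    """Hypothetical stand-in for a scheduler that tracks one job per table."""

    def __init__(self) -> None:
        self._jobs: Dict[str, object] = {}

    def schedule(self, table: str, job: object) -> None:
        self._jobs[table] = job

    def _deschedule(self, table: str) -> None:
        self._jobs.pop(table, None)

    def close(self) -> None:
        # Iterate over a copy of the keys: removing entries while iterating
        # the dict itself raises RuntimeError in Python, much like
        # ConcurrentModificationException in Java.
        for table in list(self._jobs):
            self._deschedule(table)

    def job_count(self) -> int:
        return len(self._jobs)
```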

Repair does not start after changing interval period

Before reading it might be good to know that I'm running ecChronos in a container, so all temporary files it creates will be deleted at restart. Anything stored in Cassandra is stored in a persistent volume.

I first started ecChronos yesterday with the default configuration. All 1024 ranges (4 nodes, 256 vnodes) were repaired about 20 hours ago according to records in the repair_history table.

I recently changed the repair interval from 7 days to 2 hours and restarted all instances. 20 hours since the last repair is a lot more than 2 hours, so I expected a new repair to start immediately. But even after waiting more than an hour, no repair started. All I can see in the logs is that ecChronos picked up my change:

...
17:46:40.148 [main] INFO  c.e.b.c.e.a.DefaultJmxConnectionProvider - Connecting through JMX using JmxConnection(localhost:7199)
17:46:40.597 [pool-4-thread-1] INFO  c.e.b.c.e.c.r.state.RepairStateImpl - Assuming the table k1.keyvalue is new, next repair 2020-06-04 19:46:40
17:46:40.652 [pool-4-thread-1] INFO  c.e.b.c.e.f.i.LoggingFaultReporter - Ceasing alarm: REPAIR_WARNING - {TABLE=keyvalue, KEYSPACE=k1}
17:46:40.670 [pool-4-thread-1] INFO  c.e.b.c.e.c.r.state.RepairStateImpl - Assuming the table k2.keyvalue is new, next repair 2020-06-04 19:46:40
...

The table is not new so I thought a repair should've started already. Since it did not, I just assumed that ecChronos scheduled the next repair 2 hours from when the ecChronos started up with the new configuration. If that's what's expected, that's fine.

At 19:09 (before the assumed 19:46 repair start), I restarted ecChronos. I still expect that the next repair will be at the latest at 19:46 since that's two hours since I first started ecChronos with my new configuration.

So I waited until 20:00 but no repair started. I checked the logs of ecChronos, Cassandra and the repair_history table but no repair has run on any node.

...
19:09:26.171 [main] INFO  c.e.b.c.e.a.DefaultJmxConnectionProvider - Connecting through JMX using JmxConnection(localhost:7199)
19:09:26.541 [pool-4-thread-1] INFO  c.e.b.c.e.c.r.state.RepairStateImpl - Assuming the table k1.keyvalue is new, next repair 2020-06-04 21:09:26
19:09:26.570 [pool-4-thread-1] INFO  c.e.b.c.e.f.i.LoggingFaultReporter - Ceasing alarm: REPAIR_WARNING - {TABLE=keyvalue, KEYSPACE=k1}
19:09:26.588 [pool-4-thread-1] INFO  c.e.b.c.e.c.r.state.RepairStateImpl - Assuming the table k2.keyvalue is new, next repair 2020-06-04 21:09:26
19:09:26.608 [pool-4-thread-1] INFO  c.e.b.c.e.f.i.LoggingFaultReporter - Ceasing alarm: REPAIR_WARNING - {TABLE=keyvalue, KEYSPACE=k2}
...

I will check tomorrow to see if repair ran. This seems like a bug either way since it is now long past my configured interval time.

Add health check endpoint

Add a health endpoint to the REST API. This is useful to know if the ecChronos application is running and responsive, compared to if the process is running but hanging.

Expose repair scheduling information

It would be useful to expose information about current repairs like:

  • Which tables are scheduled for repair?
  • When was the table repaired last?
  • Individual vnode status per table

I propose to use a REST interface for this that users can create their own GUIs around if necessary. It would also be useful to have some scripts to show the local status on demand.

A REST interface could later on be used to alter state in the repair scheduler when needed.

Repair policies are ignored within a task (RepairGroup)

Currently, run policies are used to prevent a job from running. The current structure is:

Repair Job -> *Repair Group -> *Repair Task

With this structure, when a large table is being repaired, a run policy that becomes active after the job has started may not take effect for hours.

Ideally specific targeted repair policies should prevent any new repair tasks from being executed. This would require restructuring policies to bind to the jobs rather than the schedule manager (potentially both).

A good first step for 1.x versions of ecc could be to at least prevent new repair groups from starting which should be relatively trivial.

Is it possible to change configuration dynamically?

Hi,

maybe it is obvious but I am sorry I have not had a chance to read the code in depth.

Is there any way to set some configuration properties after I start the binary? I am not completely sure what property that would be; I am more interested in the general possibility of doing that. If something has been running for a very long time, I do not want to touch it and restart it manually just to configure it differently. Maybe something based on JMX would help here?

Add manual repairs

Running repairs on-demand is a useful feature. A first version could trigger repair(s) on the local node only.

In order to perform this kind of job, some additional features are needed:

  • Scheduled jobs should have the option to be de-scheduled after running successfully.
  • Repair jobs should be abstracted a bit to allow for different types/actions.
  • It would make sense to keep some metrics separate from the automatic scheduled jobs.
  • Repair state needs to be redesigned slightly. Repair should run on all vnodes after it was triggered, it would be a hack to use the current interval for this.

One way to achieve this could be to use something like a "should be repaired after X" for the manual repair job so that automatic repairs could clear the task as well.

Possibility to customize RepairConfigurationProvider in standalone application

The DefaultRepairConfigurationProvider has a static configuration for repairs and some tables might have different needs.

It would be useful to use a custom RepairConfigurationProvider in the standalone application. Currently there is no interface for repair configuration providers, but none might be needed (Closeable might be enough here).

Converting from Joda-Time to java.time

Convert from Joda-Time to java.time to avoid having an "unnecessary" 3PP dependency, which can lead to security issues if not continuously stepped...

Re-add support for incremental repairs

As part of implementing concurrent virtual node repairs the ability to run incremental repairs was temporarily removed. Incremental repairs should be added again before 1.0.0 release.

Improve error message when statistics are unable to be saved to file

This might be difficult to do considering that this comes from codahale.metrics.csvreporter, but currently the warning message is rather confusing.

WARN  com.codahale.metrics.CsvReporter - Error writing to test2.table2-88bdb730-a63a-11ea-9604-e5f068acdab1-RepairSuccessTime 
java.io.IOException: No such file or directory
	at java.io.UnixFileSystem.createFileExclusively(Native Method)
	at java.io.File.createNewFile(File.java:1012)
	at com.codahale.metrics.CsvReporter.report(CsvReporter.java:287)
	at com.codahale.metrics.CsvReporter.reportTimer(CsvReporter.java:219)
	at com.codahale.metrics.CsvReporter.report(CsvReporter.java:212)
	at com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:251)
	at com.ericsson.bss.cassandra.ecchronos.core.metrics.TableRepairMetricsImpl.close(TableRepairMetricsImpl.java:96)
	at com.ericsson.bss.cassandra.ecchronos.application.ECChronosInternals.close(ECChronosInternals.java:164)
	at com.ericsson.bss.cassandra.ecchronos.application.spring.ECChronos.close(ECChronos.java:146)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.springframework.beans.factory.support.DisposableBeanAdapter.invokeCustomDestroyMethod(DisposableBeanAdapter.java:339)
	at org.springframework.beans.factory.support.DisposableBeanAdapter.destroy(DisposableBeanAdapter.java:273)
	at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroyBean(DefaultSingletonBeanRegistry.java:587)
	at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroySingleton(DefaultSingletonBeanRegistry.java:559)
	at org.springframework.beans.factory.support.DefaultListableBeanFactory.destroySingleton(DefaultListableBeanFactory.java:1092)
	at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroySingletons(DefaultSingletonBeanRegistry.java:520)
	at org.springframework.beans.factory.support.DefaultListableBeanFactory.destroySingletons(DefaultListableBeanFactory.java:1085)
	at org.springframework.context.support.AbstractApplicationContext.destroyBeans(AbstractApplicationContext.java:1061)
	at org.springframework.context.support.AbstractApplicationContext.doClose(AbstractApplicationContext.java:1030)
	at org.springframework.boot.web.servlet.context.ServletWebServerApplicationContext.doClose(ServletWebServerApplicationContext.java:170)
	at org.springframework.context.support.AbstractApplicationContext$1.run(AbstractApplicationContext.java:949)

Including the folder path it is trying to access would go a long way here.

Locking failures trigger too often

If we have ten tables that need to be repaired within the same keyspace and the first one fails to lock a resource, we should cache the result and fail fast instead of trying again with each of the other nine tables.
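A minimal Python sketch of such a fail-fast cache follows. The class name, the method names and the 30-second expiry are assumptions chosen for illustration, not taken from ecChronos.

```python
import time
from typing import Callable, Dict


class LockFailureCache:
    """Remembers recently failed lock attempts per resource (hypothetical helper)."""

    def __init__(self, ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._failed_at: Dict[str, float] = {}

    def record_failure(self, resource: str) -> None:
        self._failed_at[resource] = self._clock()

    def should_fail_fast(self, resource: str) -> bool:
        failed_at = self._failed_at.get(resource)
        if failed_at is None:
            return False
        if self._clock() - failed_at >= self._ttl:
            # The cached failure has expired, so allow a new lock attempt.
            del self._failed_at[resource]
            return False
        return True
```

A caller would check `should_fail_fast` before trying to lock a resource and call `record_failure` when the attempt fails, so the remaining tables skip the lock attempt until the entry expires.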

(On-demand) Repair scheduling tool

Currently we are lacking the possibility to schedule repairs using a tool, rather than using pure REST requests.
It's also not possible to schedule repairs for all keyspaces or all tables within a keyspace.

Items to complete:

  • Repair multiple tables in a single command (REST or tool based).
  • Add a python tool for scheduling repairs.
  • Make sure we use the correct configuration when repairing tables on-demand.

Consolidate validation into one command

Currently there are a few integration tests that should be run before pushing. In the future there will also be checks for code quality and code style that one would want to run before pushing. I suggest we create a common profile for all of them so that the developer can simply run one command.

Example:
mvn clean install -P all_validation

DefaultRepairConfigurationProvider not handling replication changes properly

When #onKeyspaceChanged() is called it checks with ReplicatedTableProvider if the keyspace is replicated enough for repair.

In the case when a keyspace changes replication the token metadata is not properly updated until after the listeners have been notified of the keyspace update. This means that going from 1 -> 3 in rf won't cause the table to be scheduled for repair. But changing rf from 3 -> 1 will schedule it.

Trigger table repairs more often

Currently we perform repairs of one virtual node at a time. Then we wait for the schedule interval before continuing.

We should investigate what impact it would have to try and continue to repair the rest of the virtual nodes especially after Issue #70 is resolved.

Early tests showed an improvement in repair session throughput of about 2.8 times.

Add all parameters to ecc-config

Currently we are missing job id, size_target and recurring in the repair config tool. Current output looks like the following:

./ecc-config.py test tbl
-------------------------------------------------------------------------------------------------------------------------
| Keyspace | Table | Interval              | Parallelism | Unwind ratio | Warning time          | Error time            | 
-------------------------------------------------------------------------------------------------------------------------
| test     | tbl   |  7 day(s) 00h 00m 00s | PARALLEL    | 0.0          |  8 day(s) 00h 00m 00s | 10 day(s) 00h 00m 00s | 
-------------------------------------------------------------------------------------------------------------------------

The output should be changed to something similar to:

./ecc-config.py test tbl
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Keyspace | Table | Interval              | Parallelism | Unwind ratio | Warning time          | Error time            | Recurring | Id                                   | Size target |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| test     | tbl   |  7 day(s) 00h 00m 00s | PARALLEL    | 0.0          |  8 day(s) 00h 00m 00s | 10 day(s) 00h 00m 00s | True      | 32a38ae0-138c-11eb-81f2-a1bdc1ab9c38 | 1m          |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

and it should also be possible to run e.g.:

./ecc-config.py test tbl 32a38ae0-138c-11eb-81f2-a1bdc1ab9c38

to get config for a specific job id.

Metrics exposed via JMX?

Hi,

it would be cool if metrics were accessible via JMX so we can collect them and dashboard them, e.g. in Grafana via Prometheus. Is this on the road map at least?

Calculate previous run-time to determine when to run next

Currently the repair scheduler is dynamically shifting around the schedule based on when ranges were last repaired.
While this is good for spreading the schedule, there is still some inaccuracy with regard to repair interval targets. The current approach calculates "start of repair of range" + repair interval as the next time we should start a repair, which causes a constant shift and misses the target every time.

Since we have both a start and finish time per range we could instead estimate how long the repair will take and subtract that from the "next start" to make the next start be "start of repair of range" + repair interval - estimated repair time instead.

On a table level this needs to be a summarized estimated repair time for all ranges that should be repaired.
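The adjusted calculation can be sketched in Python as follows. The function names are hypothetical, and the estimate simply sums the durations of the previous run of each range, which is one possible reading of "summarized estimated repair time".

```python
from typing import Iterable, Tuple


def estimated_repair_time(previous_runs: Iterable[Tuple[int, int]]) -> int:
    """Sum of (finish - start) in seconds over the ranges to be repaired."""
    return sum(max(0, finish - start) for start, finish in previous_runs)


def next_start(last_start: int, repair_interval: int,
               previous_runs: Iterable[Tuple[int, int]]) -> int:
    """'start of repair of range' + repair interval - estimated repair time."""
    return last_start + repair_interval - estimated_repair_time(previous_runs)


week = 7 * 24 * 3600
# Two ranges that took 600 and 900 seconds during the previous run.
print(next_start(1000, week, [(1000, 1600), (1600, 2500)]))
```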


While this won't solve the issue of estimating repair time between tables (tableX and tableY have the same interval and want to start at roughly the same time) that should in theory be shifted enough over time that there no longer is an overlap between the two tables.

Don't release test-jar from core module

Currently the test JAR from the core module is part of the release and has to be manually removed in the staging area.

We should not include the test jar artifact while deploying.

Expose metrics in REST API

Expose metrics with Prometheus support through an endpoint in the REST API.

Metrics should be exposed and secured with TLS (mutual authentication).

Statistics writes NaN

DataRepairState.csv is generating NaN, example:

t,value
1556094425,NaN
1556094522,NaN
1556094582,NaN
1556094642,NaN
1556094651,NaN

This is likely happening when total node data size is 0. A similar issue could potentially happen for TableRepairState.csv as well if we don't have any tables to repair.
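If the cause is a division by a zero total, a guard like the following Python sketch would avoid the undefined value. The function name is hypothetical, and reporting 1.0 when there is no data is an assumption, since the issue does not say what value should be written in that case. In Java, dividing the double 0.0 by 0.0 yields NaN, whereas Python raises ZeroDivisionError for the same operation.

```python
def repaired_ratio(repaired_bytes: float, total_bytes: float) -> float:
    """Fraction of data that is repaired, defined even when there is no data."""
    if total_bytes <= 0:
        # Assumption: with nothing to repair, report everything as repaired.
        return 1.0
    return repaired_bytes / total_bytes


print(repaired_ratio(0, 0))
```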

Persist on-demand repairs

On-demand repairs should persist after they are completed.
They should also persist between restarts (incl. crash) and in that case automatically continue.
Status on completed/failed repairs should be available in the REST API.

Add ability to read repair history for cluster wide repair statistics

This can be done by exposing an endpoint over rest, updating ecc-status or creating a separate tool.

With the repair history we could provide an accurate visual representation of when repairs were run to give an overview. It would allow the user to see if repair is keeping up with the churn of data and give a better view of the current status of repairs.

API packages

To make it clearer what is considered the public API, it would be good to have that as part of the package. In some cases, e.g. the connection module, the whole module is considered to be public API. In others, like the core module, only specific classes like ScheduledJob are considered public API.

Support multiple Cassandra versions

Currently ecChronos is verified to work with Apache Cassandra 3.0.x versions. It is known to not work with 2.2.x due to this line.

Intended goal of this issue is to:

  • Verify behavior with 3.x and 4.0 versions.
  • Identify the changes needed to support multiple versions.
  • Add separate builds if necessary to run tests towards different versions.

Logback config needs update

In the standalone packaging we currently log DEBUG output to a file and rotate it with a max history of 30 days.
It should be split into two log files, one for DEBUG and one for INFO. The history should also be size-based with a reasonable size limit.

Use yaml format for configuration

The current property based approach is not easy to deal with as every new property requires a lot of code and test additions.

A YAML-based approach should be easier to handle with a good YAML parser like Jackson.

Faulty ReplicaRepairGroup created initially

Example output below:

10:36:30.114 [pool-4-thread-1] INFO  c.e.b.c.e.c.r.state.RepairStateImpl - Assuming the table test.test2 is new, next repair 2019-04-24 10:37:30
10:36:30.132 [pool-4-thread-1] DEBUG c.e.b.c.e.c.r.state.RepairStateImpl - Table test.test2 switched to repair state (canRepair=true,lastRepaired=1556094990114,replicaRepairGroup=[(replicas=[/127.0.0.4:9042, /127.0.0.1:9042, /127.0.0.2:9042],vnodes=[(5865110601664802360,5872668701990616434], (1707638222063213603,1713564121870665705], (-1748418626064078896,-1635185970846513576], (7019030916469767600,7033546010821571589], (-1516078844314334718,-1458943644367007333], (-323352036129999922,-314690946108749909], (-3596175493781184566,-3587793484282907969], (-8762582900180887739,-8759828443919510431], (6949902150454688705,6969904547592418668], (-411505147072181762,-410842971601908983], (-2548111725870617098,-2538426633468712412], (2163602500729925663,2166161838665488127], (-3720649383903394554,-3713741836932883778], (2163503248952254881,2163602500729925663], (5854557320086906102,5865110601664802360], (-5177921320133811005,-5174587106404923125], (-1264979646568308238,-1237321012003848906], (-5201786378686468947,-5177921320133811005], (1688501330680648338,1707638222063213603], (-6275075007739433698,-6268302824669324579], (5872668701990616434,5901160086725571488], (-6291174479881429322,-6290734303635263919], (-6211176598209419705,-6179582759489884647], (904626422747834300,909141497291312508], (3294162739508016805,3331624608553698780], (5926229946233930553,5931509687880992881], (2166161838665488127,2269340123807757678], (8509066610140723557,8520479384547097272], (8621046863113556216,8626931040711855375], (-8306736635692113851,-8306482243595399606], (-429201408411695824,-411505147072181762], (-6489508266518005130,-6479912609303740432], (6992017166272278355,7004102467640863079], (6189848803672173013,6204802985729823188], (-1569222170998684811,-1542749510179766429], (5137826813072792418,5151253925285384160], (-2610229333535183387,-2605003058823314073], (4146280479472649988,4183778838307753156], (-2538426633468712412,-2510394669336755057], 
(5293076878224373907,5294597217344761909], (2077470801583532320,2100465152749702283], (-4930568659988294766,-4909146613011801829], (2269340123807757678,2272486200862048847], (7224225483693544489,7229126180239274387], (8353471612924879580,8355997482308738932], (912128156167090743,933586841925457475], (-4549738343394257465,-4537280039133286622], (5151253925285384160,5246903589245804881], (-4909146613011801829,-4885613665582363312], (-3530332683374685047,-3445348267714842536], (423636490468909441,437734320414290565], (7762460632173064601,7779955422194824080], (-6490023601277932489,-6489508266518005130], (-6766741423716583331,-6750959180340598188], (-6214956105623688363,-6211176598209419705], (5246903589245804881,5254760090923284648], (7730207248544796606,7762460632173064601], (1097721320254344991,1113575929401780169], (398870656012263016,423636490468909441], (7779955422194824080,7890215595895495306], (395929270276418217,398870656012263016], (1525799912377572417,1542266724634658850], (-3418133689437057648,-3412762482864301645], (-5228764097479169205,-5220807474927517370], (-8877726394992241173,-8877641435029451551], (-8694492312556398435,-8677992421294978833], (1131291184902649803,1146400534703038307], (8343397443059015388,8353471612924879580], (5280298757768807625,5282458503178761207], (5272308250160930201,5276524579298445036], (6944188551164548602,6949902150454688705], (617219443249087917,650662126875556431], (-7641848610421243617,-7637641514133584757], (1146400534703038307,1146866732393723016], (8604492338144654001,8621046863113556216], (-5256874662171154338,-5252583152179811091], (599509332826805038,617219443249087917], (1225330282445579115,1228706603710723881], (9053009751296138984,9132635580595382717], (-3445348267714842536,-3418133689437057648], (6111008095610196791,6123567665163451190], (-3698077144093230565,-3685429585022215121], (5268852051162723541,5272308250160930201], (-7879262946045901197,-7874704495847950937], (-432576444157372309,-429201408411695824], 
(2550850111059835717,2593392468936236393], (3645283065254949673,3677458290328783005], (-1751775327470895498,-1748418626064078896], (-3727081816655181962,-3723362598727264421], (-4583802763964096024,-4581729154367222009], (2528185767878440158,2550850111059835717], (194420953442488830,217271557314014778], (3853944801781605097,3870407445372540568], (-5940365322928062919,-5929099427194152106], (1033273149652736222,1037605910497831797], (-3348574762510418048,-3338621913080405410], (7004102467640863079,7019030916469767600], (-7680841623914968864,-7641848610421243617], (-5240233271783617793,-5228764097479169205], (5470884543153865248,5518067652677054212], (3485759404712264832,3575431245922176983], (3575431245922176983,3585822675029499260], (-1611895247728016610,-1605813028235077768], (-364365161291321674,-345317700026830156], (-8759828443919510431,-8758150526595839254], (-2085489634079961149,-2079737485420203671], (-410842971601908983,-398255204442847598], (7186292960893205053,7207735552416671341], (-5561781046051128867,-5543188096295948636], (5276692577534975197,5280298757768807625], (-3723362598727264421,-3720649383903394554], (2075816711290817847,2077470801583532320], (-1577905002572988726,-1569222170998684811], (-7697316165365897035,-7688555265872611303], (894289529623906122,904626422747834300], (5518067652677054212,5532300777250430563], (-6465134643226675171,-6337014050776494940], (4693764855567020908,4698960547274402590], (-6290734303635263919,-6275075007739433698], (6618575016333657870,6621522116097773286], (-6174998002076058104,-6162740120560363776], (3677458290328783005,3678024011679207416], (3585822675029499260,3596886025586214375], (-3356173278125830420,-3348574762510418048], (-1605813028235077768,-1598014663671333309], (55857686471190552,175687164208689053], (3474764612062391705,3485759404712264832], (6430691014443470662,6434031125711022087], (5562874535831802135,5568388867263304803], (-1598014663671333309,-1577905002572988726], 
(7207735552416671341,7215711263652568967], (-5577282857635277323,-5561781046051128867], (3752365928748437564,3853944801781605097], (-6268302824669324579,-6214956105623688363], (-974083684084645752,-954497641070828269], (6165093866950915396,6177127156832058350], (-9043004961018730579,-9028880999266673671], (591701533044512705,599509332826805038], (-2322693562396586793,-2317931223914439040], (5276524579298445036,5276692577534975197], (6189786695072947405,6189848803672173013], (4698960547274402590,4723577048387550300], (8337626372352882259,8343004019854518611], (-2317931223914439040,-2292590827822032184], (-5301962115623367714,-5285965254221540793], (-3675178267152060423,-3622787130165106527], (-1542749510179766429,-1522301528810445928], (-6179582759489884647,-6177332647764392024], (5254760090923284648,5268852051162723541], (-5174587106404923125,-5163855802100783610], (2493752807164890732,2528185767878440158], (-3587793484282907969,-3570607975811585929], (-2556181655832929341,-2548111725870617098], (2593392468936236393,2642098029248458616], (-981396908144098956,-974083684084645752], (-5994846117350980881,-5989552782185817903], (-6479912609303740432,-6465134643226675171], (5924806116876622972,5926229946233930553], (5549405677219893887,5562874535831802135], (3721922645791000708,3752365928748437564], (7890215595895495306,7903791430332987822], (4915729485604519158,4949248785384939975], (4183778838307753156,4195989899753122385], (5532300777250430563,5532345109051942077], (-4537280039133286622,-4534970476101949022], (-9028880999266673671,-8992700479635465667], (5446319390890546385,5457355108977371736], (-6316550980332592407,-6291174479881429322], (-5929099427194152106,-5921879880539631595], (6969904547592418668,6992017166272278355], (7052227462767726078,7086799441920613290], (-3536205874002446550,-3535944234349575683], (6163654697774163929,6165093866950915396], (1214282855696741607,1225330282445579115], (-3713741836932883778,-3698687427349058181], 
(4615265643420033498,4625433685915714086], (-7874704495847950937,-7836387469841337385], (3600286378600912414,3602558203419136453], (-8698000266913213056,-8694492312556398435], (-3622787130165106527,-3596175493781184566], (1113575929401780169,1127818832426368072], (-8318751880220712930,-8306736635692113851], (5901160086725571488,5924806116876622972], (1037605910497831797,1097721320254344991], (2464716401439152164,2480459946168668893], (-5204763410890895560,-5201786378686468947], (352267798763708589,395929270276418217], (-3570607975811585929,-3554671213151714582], (-4931780317380069194,-4930568659988294766], (-6337014050776494940,-6335054186285864832], (-7836387469841337385,-7834912549661522516], (-2336775960286457978,-2322693562396586793], (-4421463747262855904,-4418145671382827454], (8343004019854518611,8343397443059015388], (-3554671213151714582,-3536205874002446550], (-7732007273701699621,-7697316165365897035], (-5220807474927517370,-5204763410890895560], (5457355108977371736,5470625983095903154], (7215711263652568967,7224225483693544489], (-7688555265872611303,-7680841623914968864], (-345317700026830156,-344780936523856021], (449849081497196706,450696698287780033], (-2605003058823314073,-2604034762611144388], (-1635185970846513576,-1611895247728016610], (909141497291312508,912128156167090743], (6590751478213731121,6618575016333657870], (-4581729154367222009,-4549738343394257465], (933586841925457475,1033273149652736222], (-8464282757426723735,-8459016353660004912], (-5285965254221540793,-5256874662171154338], (1507648825300314184,1525799912377572417], (7495192102529320322,7498863195942861640], (-3698687427349058181,-3698077144093230565], (-3535944234349575683,-3530332683374685047], (-5989552782185817903,-5940365322928062919], (-6335054186285864832,-6316550980332592407], (-954497641070828269,-947764842554960180], (1146866732393723016,1214282855696741607], (-6177332647764392024,-6174998002076058104], (-3685429585022215121,-3675178267152060423], 
(-8327060983529623988,-8318751880220712930], (-5003243368467649375,-4931780317380069194], (-3412762482864301645,-3356173278125830420], (-8336012076377779770,-8327060983529623988], (-4063374191739390378,-4049543938908731350], (4333327706380638473,4377205407491323039], (1127818832426368072,1131291184902649803], (5470625983095903154,5470884543153865248], (6177127156832058350,6189786695072947405], (-1522301528810445928,-1516078844314334718], (3596886025586214375,3600286378600912414], (5768079726638834108,5788851882398406371], (-8124603138985911160,-8115552572523121883], (-4885613665582363312,-4860308298438233166], (2480459946168668893,2493752807164890732], (4873909594335701588,4915729485604519158], (-4084774479874245955,-4063374191739390378], (437734320414290565,449849081497196706], (4949248785384939975,4958144072368701209], (7033546010821571589,7052227462767726078]]), (replicas=[/127.0.0.1:9042, /127.0.0.3:9042, /127.0.0.2:9042],vnodes=[(-6085079478493063748,-6064086376772932949], (7323874428765842391,7358834636086600926], (8747480903393112490,8782208445422058674], (-176111701355425003,-169760120569591771], (9145392594660429319,9150763428177360310], (7569060659075402759,7592824695776964355], (3936113910939518155,4005213648320906733], (-7601996801790670654,-7576792551450546071], (-2139307512592331431,-2124034205294471676], (8421666339138659133,8439195032986552636], (4291828310922574131,4300702368990194247], (-8447582647674991288,-8426642012422690733], (4300702368990194247,4305639271493167825], (-8598564670286207949,-8544939724431758413], (-6970937630106549900,-6955530490938078420], (7565366154738375602,7569060659075402759], (-8603830489496245345,-8598564670286207949], (-2216958276566737349,-2209672631108973107], (8819045943949213038,8867743884527926764], (-8426642012422690733,-8424184001168593343], (-6009495873341475961,-5997272053738268139], (-4467343772903903461,-4461074660245746669], (1597664708988256769,1614722586896137799], 
(-6771446901661685303,-6766741423716583331], (6330463775396557171,6331823557027939481], (-209111744291119143,-176111701355425003], (-8336866429175853518,-8336012076377779770], (1671497933049700904,1688501330680648338], (8663152070628535017,8679535761224396473], (2100465152749702283,2113024155883041601], (1617837092953979212,1618825263616910351], (-9125940794246425887,-9111488804514277692], (-9146823683067732948,-9144820982928368952], (6842175710329267203,6845049554775927098], (-3947983447321572542,-3905022741177590404], (-3976346251680084856,-3947983447321572542], (8998171168660646190,9003896970053295890], (-7045758213515074082,-7043271228260149001], (-8724721081470977419,-8716001418909823795], (6941982057708876183,6944188551164548602], (-3078652673540316606,-3061041098277641318], (-6162740120560363776,-6149099949787153176], (-4282090163051883517,-4264849592369943372], (1228706603710723881,1229829227897079901], (-4352633219073212574,-4320097879524891096], (-5131320385707577589,-5081068369891447593], (-1938709368664353801,-1931185718042446687], (-4603361068021482497,-4593421659205850810], (4625433685915714086,4632469554352491391], (9003896970053295890,9008780243899802420], (-4479988857817902050,-4467343772903903461], (-1010141748756392499,-1005140044741574936], (-5062114138935044786,-5023342262039022830], (3366508608717533537,3385087227595902653], (-4591817681685041184,-4583802763964096024], (6123567665163451190,6128655848084060787], (6621522116097773286,6631972872192633892], (9021768472554452831,9031303309867349265], (-8125090791303469289,-8124603138985911160], (-5018840483591250455,-5013268481472046447], (-1455637857949792482,-1392704525356113323], (9008780243899802420,9021768472554452831], (8873385172924959020,8924369100024571785], (-7043271228260149001,-7037842878478131680], (-1379680765407185741,-1377216344153416241], (-1324163084079963442,-1315493448982092640], (3926139287145951568,3936113910939518155], (7632179137742305810,7663187432855754872], 
(1406477802687866615,1406687990467387376], (-7029803186678738280,-6987141877540914971], (7143116601883080812,7148745436923418955], (-7457507363873839154,-7389646866273333626], (-4264849592369943372,-4228242017112795044], (-7306014358625409521,-7298490723013439731], (-8539157977133443971,-8529944247620436679], (-5325523806542915314,-5301962115623367714], (-8512752697594657271,-8464282757426723735], (9155318056920308985,9200549186172059323], (5094761537064668308,5116823998192924159], (-6551023572387362104,-6498957312507669021], (-6987141877540914971,-6980697792462610337], (-5023342262039022830,-5018840483591250455], (-943859730138465885,-926953834104208255], (180076413392258018,180170472863302283], (-2193754557006218959,-2193147540903996267], (-1963060129846663904,-1938709368664353801], (-4461074660245746669,-4435785125182221653], (1568187754177264301,1597664708988256769], (7545403286942224128,7565366154738375602], (6332509013925774151,6333123985695867402], (1371644617543334836,1406477802687866615], (6291594931548181052,6330463775396557171], (6333123985695867402,6409610248910455893], (-215991459561082993,-209111744291119143], (1565430038942130049,1568187754177264301], (7663187432855754872,7663618827608961141], (-2742227664864922100,-2698966707765996850], (-1010581946183551305,-1010141748756392499], (6266957361669943728,6291594931548181052], (5116823998192924159,5137826813072792418], (-5244945931128744673,-5242744687555085505], (-398255204442847598,-394578796135138146], (2822674138412598871,2883132575029009350], (7592824695776964355,7632179137742305810], (-5543188096295948636,-5543115361842095614], (5049504929581508706,5094761537064668308], (7903791430332987822,7935377308528532614], (6835339102813802894,6842175710329267203], (180170472863302283,182117263077597323], (-8624532491157173974,-8606073937404571436], (3426893667885521180,3467784646367129552], (-7612335745365489824,-7601996801790670654], (4262358669709550929,4288437054735113792], 
(3407682730579875853,3411923394679121674], (-8635321429158920378,-8624532491157173974], (8439195032986552636,8507644851398813676], (4195989899753122385,4262358669709550929], (2331229335263849761,2335555860630581823], (9031303309867349265,9053009751296138984], (6849530990070116870,6893188040095947079], (7240452458364991302,7243055387142598474], (-1458943644367007333,-1455637857949792482], (1628585380588143280,1667957094107881339], (6893188040095947079,6911783197682337119], (-4421532926824512614,-4421463747262855904], (-3995788751478011102,-3976346251680084856], (1879232046276361682,1923380441637559262], (-4004711672964930178,-3995788751478011102], (-1392704525356113323,-1379680765407185741], (9150763428177360310,9155318056920308985], (-9131148688457292041,-9125940794246425887], (-5347049098193336450,-5325523806542915314], (-8529944247620436679,-8529621456292135747], (-385513118966807083,-379818273450251275], (3678024011679207416,3694506724843411052], (-2193147540903996267,-2139307512592331431], (-8529621456292135747,-8512752697594657271], (4755968582471395536,4757648850966479028], (-7262785772456384769,-7229015904537660797], (-7037842878478131680,-7029803186678738280], (4982761366001356396,5005890829459468615], (-4228242017112795044,-4211107674876986349], (-7295973535513917109,-7273180598763621919], (1542266724634658850,1565430038942130049], (-5081068369891447593,-5062114138935044786], (1923380441637559262,1937721953694846923], (3413750590217750494,3426893667885521180], (8954706127887322095,8998171168660646190], (-7142529860060505342,-7132471250423494779], (2745130875852646599,2783130181211648220], (3411923394679121674,3413750590217750494], (-7166238756596614356,-7142529860060505342], (657163451114422339,699230638988207512], (8626931040711855375,8663152070628535017], (2737124566175867171,2745130875852646599], (6631972872192633892,6667332082343368647], (-8544939724431758413,-8539157977133443971], (3694506724843411052,3704614391287501636], 
(7127493621181351327,7143116601883080812], (7229126180239274387,7231458959087123218], (4305639271493167825,4331018248413823717], (4381378670981013950,4396097878371041260], (8692588889034362378,8696225449101721190], (-3049205761861887758,-2999689242488322342], (-2604034762611144388,-2603457227686304697], (-4320097879524891096,-4282090163051883517], (8867743884527926764,8873385172924959020], (3337654787467425507,3365408741878601749], (182117263077597323,194420953442488830], (2792899662231065291,2815923473212160116], (-7273180598763621919,-7262785772456384769], (7262199637358577075,7268024507082671891], (5282458503178761207,5287399068785747095], (2883132575029009350,2891054973889704525], (-4609802893058310591,-4609109637270475929], (-7046951259520300422,-7045758213515074082], (6331823557027939481,6332509013925774151], (-6498957312507669021,-6490023601277932489], (7935377308528532614,7942584713614810868], (-6948231922344728495,-6892825980726903852], (4288437054735113792,4291828310922574131], (4962407184790712377,4975551510740206839], (2815923473212160116,2822674138412598871], (-7132471250423494779,-7046951259520300422], (-226353260410211219,-221517837349377584], (-2781905190148377665,-2781196421955373197], (2737008108506815408,2737124566175867171], (8532715472827672542,8537430521544094642], (1667957094107881339,1671497933049700904], (5035109903748642920,5049504929581508706], (8336247149868461448,8337626372352882259], (-4006062565185813321,-4004711672964930178], (-7474466951776339105,-7457507363873839154], (-4211107674876986349,-4130628848800772014], (-2124034205294471676,-2113656441134307440], (-1997854129988023928,-1963060129846663904], (-312441533525873368,-226353260410211219], (8935114439710069064,8954706127887322095], (3385087227595902653,3407682730579875853], (-8758150526595839254,-8724721081470977419], (4975551510740206839,4982761366001356396], (-4593421659205850810,-4591817681685041184], (6845049554775927098,6849530990070116870], 
(4396097878371041260,4421003088420840985], (-6039801029413117127,-6015135119263493000], (6911783197682337119,6932110804403189827], (-5242744687555085505,-5241049529562487898], (5746844912511956054,5768079726638834108], (6696937289141285464,6723127974976259825], (-6097339166480013850,-6093138274745816947], (6932110804403189827,6941982057708876183], (4723577048387550300,4755968582471395536], (4331018248413823717,4333327706380638473], (7243055387142598474,7243932460218177663], (-2781196421955373197,-2743478415635922542], (-5997272053738268139,-5994846117350980881], (5005890829459468615,5025219494516970029], (-6056809593268285808,-6039801029413117127], (8537430521544094642,8598852718013803893], (-4435785125182221653,-4421532926824512614], (-1031499470180538512,-1010581946183551305], (-5252583152179811091,-5244945931128744673], (6409610248910455893,6428586419865279467], (8726456351986111146,8747480903393112490], (1614722586896137799,1617837092953979212], (-394578796135138146,-385513118966807083], (-2743478415635922542,-2742227664864922100], (4872505285400466299,4873909594335701588], (6204802985729823188,6224094934199833854], (-2209672631108973107,-2193754557006218959], (-6093138274745816947,-6085079478493063748], (8696225449101721190,8726456351986111146], (1792758068080755003,1879232046276361682], (-7298490723013439731,-7295973535513917109], (2113024155883041601,2152158206756067479], (7180592675795233944,7186292960893205053], (483718895129583087,494722471047124582], (3331624608553698780,3337654787467425507], (7243932460218177663,7262199637358577075], (8598852718013803893,8600753507102008275], (-6980697792462610337,-6970937630106549900], (8679535761224396473,8689845424546294624], (-8406941645189187721,-8336866429175853518], (-1005140044741574936,-1000008012173794466], (-8606073937404571436,-8603830489496245345], (7231458959087123218,7240452458364991302], (-6955530490938078420,-6948231922344728495], (-4609109637270475929,-4603361068021482497], 
(-3061041098277641318,-3049205761861887758], (6128655848084060787,6145049633715407522], (1618825263616910351,1628585380588143280], (4632469554352491391,4640582204614908410], (650662126875556431,657163451114422339], (-6892825980726903852,-6810790644528379216], (2783130181211648220,2792899662231065291], (-3218267175574991965,-3078652673540316606], (-8424184001168593343,-8406941645189187721], (-7229015904537660797,-7166238756596614356], (-5378405653617552316,-5363322127376271082], (3910448106136124717,3926139287145951568], (-2842269572001428808,-2781905190148377665], (3365408741878601749,3366508608717533537], (-8716001418909823795,-8698000266913213056], (-9144820982928368952,-9131148688457292041], (8508905548497491571,8509066610140723557], (7148745436923418955,7180592675795233944], (8507644851398813676,8508905548497491571], (-221517837349377584,-215991459561082993], (8924369100024571785,8935114439710069064], (-5893259517463730069,-5887911526306930666], (-5363322127376271082,-5347049098193336450], (494722471047124582,501725220371970741], (-438656553040500476,-432576444157372309], (-8055389242851379289,-8050062324743967649], (7663618827608961141,7683673725814486073], (-1315493448982092640,-1264979646568308238], (-6810790644528379216,-6791336121455681497], (8689845424546294624,8692588889034362378], (-6064086376772932949,-6056809593268285808], (6821987745330247523,6835339102813802894], (2891054973889704525,2933613951988115867], (-5921879880539631595,-5893259517463730069], (3704614391287501636,3706166178114075110], (8782208445422058674,8819045943949213038], (5287399068785747095,5291055976386199291], (-2603457227686304697,-2602490583877172941], (-947764842554960180,-945857009714646027], (6224094934199833854,6266957361669943728], (-5159048046179125183,-5131320385707577589], (4640582204614908410,4644785017456752545], (2933613951988115867,2936432957359492123], (478378941901948917,483718895129583087], (-314690946108749909,-312441533525873368], 
(-8131686880192759082,-8125090791303469289], (-6015135119263493000,-6009495873341475961], (52305926231006582,55857686471190552], (-945857009714646027,-943859730138465885], (-7576792551450546071,-7563497441193122994], (5025219494516970029,5035109903748642920]]), (replicas=[/127.0.0.1:9042, /127.0.0.3:9042, /127.0.0.4:9042],vnodes=[(7703644075782778864,7712696879772336064], (8600753507102008275,8604492338144654001], (-8891120499589539919,-8877726394992241173], (5327253174164511172,5349688408152187779], (-743915602802026964,-695501524240740609], (-4418145671382827454,-4417209963629780346], (811493037936306242,838114257707170419], (1778789248487877986,1785523483131416082], (-7981650950861547541,-7974031291962188145], (-5769150864113570950,-5764769525657463596], (-379818273450251275,-364365161291321674], (-3262687312685671724,-3260100044773516992], (4452829111707603271,4456649555696632792], (4456649555696632792,4472671370174786162], (6667332082343368647,6673603829776876082], (7539025352140692780,7542873976949229957], (4472671370174786162,4476375217671846091], (-5793126567009200631,-5769150864113570950], (4421003088420840985,4422711856039661790], (-4534970476101949022,-4511706687531293796], (-5811516027968492509,-5793126567009200631], (-441308398020558005,-438656553040500476], (6153918562402789886,6163654697774163929], (-7903112515119447925,-7902156457259253187], (-4511706687531293796,-4501815084243080961], (-7353601039635545101,-7331131303674436732], (6673603829776876082,6694564977798542436], (-169760120569591771,-138134739059630745], (-686029428662602715,-648460789833124834], (-4696686542483557733,-4683614105962249082], (-8006087791596325977,-7981650950861547541], (4520015957392043219,4594519386865403509], (2305569917727005648,2321123912078476434], (-338587694711880264,-323352036129999922], (450696698287780033,477666097902629438], (-1343964826172426532,-1343250488378321706], (-8278520908611436295,-8270562038285440338], (-8918648602644904395,-8891120499589539919], 
(-545416937814284702,-513887612600276443], (744550750487592120,759799652392053883], (-2079737485420203671,-2073727144662490980], (9200549186172059323,9203901563962430280], (-2054532383838136087,-2051654332207897653], (175687164208689053,178450227193699837], (-7937238240168352056,-7927332630108798125], (8176339689490245639,8187297856103723510], (-828388405255968767,-824627313280778391], (-1894174198166601531,-1881395826445156725], (7712696879772336064,7718678366215062010], (-9075599732774102487,-9055902421430308862], (-1927477608795972921,-1915308859386544553], (-695501524240740609,-686029428662602715], (-478338263667205877,-465854095811319373], (-7389646866273333626,-7353601039635545101], (3624522838844352561,3645283065254949673], (804360716116598750,811311278724568904], (1301026617204275214,1330309943177182822], (-6619709420432773342,-6594901151671861059], (3602558203419136453,3616483922783895571], (7525280378803624684,7537914949567333589], (551087901469104814,577261665361371524], (-8677992421294978833,-8641670289656597103], (-7927332630108798125,-7926170863375736590], (811311278724568904,811493037936306242], (-612190406272543890,-589995666261167981], (-8453859899564123810,-8452178421274760207], (-8452178421274760207,-8447582647674991288], (536608914310835172,551087901469104814], (5574081941217454402,5656765982167526395], (-5764769525657463596,-5753511555644635837], (4958144072368701209,4959368480587379685], (265496722633474953,279850159027127023], (4449736174217920655,4452829111707603271], (-9111200051009241721,-9102172398924204155], (-4651097678281732810,-4624207684903946181], (5349688408152187779,5375792675427774657], (-8176874699620897719,-8163194612470817128], (-78042297262686438,-25741870359790004], (-7974031291962188145,-7937238240168352056], (3878416626240567256,3882906062717662509], (-4367957394803852508,-4356657427961455361], (-1000008012173794466,-981396908144098956], (-6638018660008035578,-6622631041486007651], 
(5532345109051942077,5532657207249437180], (1229829227897079901,1267972429814396909], (7718678366215062010,7730207248544796606], (5743748943652236857,5746844912511956054], (-4024507808596123620,-4006062565185813321], (-1931185718042446687,-1927477608795972921], (-1874680315975530350,-1869396875748430305], (279850159027127023,351600788772328829], (4959368480587379685,4962407184790712377], (-4049543938908731350,-4024507808596123620], (7293028483534359691,7307080044161508013], (-8163194612470817128,-8161949808914120202], (8187297856103723510,8202117223008600915], (2321123912078476434,2331229335263849761], (-25741870359790004,37391337322169607], (3882906062717662509,3910448106136124717], (8202117223008600915,8203392765442852489], (-5887911526306930666,-5878786481540633130], (-810626947782988593,-810463674058589911], (-587140372674584176,-545416937814284702], (37391337322169607,37585399164016158], (9213702739355436020,9221339561787446436], (767524456110932003,802448816165765704], (8001917955950206476,8083054838322377993], (-138134739059630745,-126188366545828940], (-8115552572523121883,-8087379021231202613], (8000690451681265940,8001917955950206476], (5656765982167526395,5739623051348905722], (-2569640731642961824,-2556181655832929341], (-3278988046142262526,-3262687312685671724], (351600788772328829,352267798763708589], (-1850346434989347856,-1804404047198168355], (-5866999542773939801,-5829411324960077782], (-1763084380735213841,-1751775327470895498], (-1869396875748430305,-1855882454000892615], (-589995666261167981,-587140372674584176], (-1881395826445156725,-1874680315975530350], (-926953834104208255,-925742045144146197], (-642490621565945985,-612190406272543890], (4005213648320906733,4024641047681152714], (-8927315660663505688,-8918648602644904395], (-8050062324743967649,-8045046882076598680], (-3260100044773516992,-3239159723904851739], (217271557314014778,265496722633474953], (-9090059604757023662,-9075599732774102487], (7683673725814486073,7703644075782778864], 
(-7906217694864170236,-7905279975365392755], (-3287476619175450337,-3278988046142262526], (5739623051348905722,5742271562156289998], (759799652392053883,767524456110932003], (-8641670289656597103,-8635321429158920378], (-5878786481540633130,-5866999542773939801], (-3338621913080405410,-3337927698010214471], (-465854095811319373,-441308398020558005], (-2617898614825145021,-2613318404002197552], (-2586310675615823660,-2569640731642961824], (-9111488804514277692,-9111200051009241721], (802448816165765704,804360716116598750], (-8973307681754649790,-8927315660663505688], (9221339561787446436,-9146823683067732948], (-8979891822333170035,-8973307681754649790], (-126188366545828940,-78042297262686438], (37585399164016158,38937140242194447], (178450227193699837,180076413392258018], (-8270562038285440338,-8268281594008954042], (1755772011051335753,1778789248487877986], (-774353422270158415,-743915602802026964], (-9044362253486487732,-9043004961018730579], (4024641047681152714,4039943296310483885], (1330309943177182822,1371644617543334836], (-9055902421430308862,-9044362253486487732], (-2073727144662490980,-2054532383838136087], (5742271562156289998,5743748943652236857], (-8233560447132820524,-8213620055354472793], (-2620256037442726132,-2617898614825145021], (1785523483131416082,1792758068080755003], (4594519386865403509,4615265643420033498], (882448096709896781,894289529623906122], (-3327028201895063724,-3301182765143616815], (-7905279975365392755,-7903112515119447925], (-513887612600276443,-478338263667205877], (-4501815084243080961,-4479988857817902050], (-1855882454000892615,-1850346434989347856], (-7902156457259253187,-7879262946045901197], (-8985523619544057307,-8979891822333170035], (-824627313280778391,-810626947782988593], (-3301182765143616815,-3287476619175450337], (8216081785310910497,8299590521817923717], (7307080044161508013,7323874428765842391], (-4714889179292825652,-4696686542483557733], (6694564977798542436,6696937289141285464], 
(1267972429814396909,1301026617204275214], (501725220371970741,536608914310835172], (-6594901151671861059,-6551023572387362104], (-1343250488378321706,-1324163084079963442], (-4683614105962249082,-4651097678281732810], (-808838586328986394,-774353422270158415], (7504250122685595240,7525280378803624684], (-9102172398924204155,-9090059604757023662], (4422711856039661790,4432943244673542552], (7542873976949229957,7545403286942224128], (-8213620055354472793,-8190536656378443467], (-1374661530207959904,-1359632809749919920], (6428586419865279467,6430691014443470662], (-1377216344153416241,-1374661530207959904], (-1359632809749919920,-1343964826172426532], (7537914949567333589,7539025352140692780], (4476375217671846091,4520015957392043219], (-648460789833124834,-642490621565945985], (-3337927698010214471,-3327028201895063724], (2272486200862048847,2305569917727005648], (-6783595270416147594,-6771446901661685303], (-8045046882076598680,-8015570680950170413], (-8190536656378443467,-8176874699620897719], (-8268281594008954042,-8233560447132820524], (1713564121870665705,1726037723169685355], (-1237321012003848906,-1227384327962189184], (-4417209963629780346,-4397161038179248593], (-836886172037637404,-828388405255968767], (4435843140798776592,4449736174217920655], (-6622631041486007651,-6619709420432773342], (9203901563962430280,9213702739355436020], (3619679439769329559,3624522838844352561], (4432943244673542552,4435060965900542357], (5294597217344761909,5327253174164511172], (7289281844331830535,7293028483534359691], (-2613318404002197552,-2610229333535183387], (-5743203162634756559,-5713321773448208456], (-8015570680950170413,-8006087791596325977], (-4723152534836858034,-4714889179292825652], (477666097902629438,478378941901948917], (-5753511555644635837,-5743203162634756559], (7979783289869702676,8000246103949993020], (4435060965900542357,4435843140798776592], (5568388867263304803,5574081941217454402], (-5829411324960077782,-5811516027968492509], 
(838114257707170419,865514027366834513], (-1915308859386544553,-1894174198166601531], (1726037723169685355,1755772011051335753], (-4624207684903946181,-4609802893058310591], (3616483922783895571,3619679439769329559], (-4397161038179248593,-4367957394803852508], (7498863195942861640,7502930717026834559], (865514027366834513,882448096709896781], (-810463674058589911,-808838586328986394], (8203392765442852489,8216081785310910497], (8000246103949993020,8000690451681265940], (-4740225573451678732,-4723152534836858034], (-8161949808914120202,-8131686880192759082], (-1804404047198168355,-1763084380735213841], (577261665361371524,591701533044512705], (7502930717026834559,7504250122685595240]])])

The likely culprit is this line, where we use vnodeRepairStates rather than updatedVnodeRepairStates.

Slow query of repair_history at start-up

ecChronos queries system_distributed.repair_history at start-up to be able to schedule new repairs. If the history table is large and contains many tombstones, a warning is printed in the Cassandra logs:
WARN o.a.c.d.ReadCommand$1MetricRecording:569 onClose Read 5000 live rows and 6144 tombstone cells for query SELECT * FROM system_distributed.repair_history WHERE keyspace_name, columnfamily_name = , AND id <= 8be5a6ef-0790-11ea-7f7f-7f7f7f7f7f7f LIMIT 5000 (see tombstone_warn_threshold)

This query can also be very slow, since it iterates through all data using the prepared statement:
"SELECT id, range_begin, range_end, status, participants FROM %s.%s WHERE keyspace_name=? AND columnfamily_name=? AND id <= maxTimeuuid(?)"

If the "last repaired at" date is known (after start-up), the iteration is limited to that date, using the following prepared statement:
"SELECT id, range_begin, range_end, status, participants FROM %s.%s WHERE keyspace_name=? AND columnfamily_name=? AND id >= minTimeuuid(?) and id <= maxTimeuuid(?)"

We should introduce a configurable limit for how far back in time this iteration should continue, even if "last repaired at" is unknown.
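
One way such a limit could work is sketched below. This is a minimal illustration and not the ecChronos implementation: the class name `RepairHistoryLookback`, the method `lowerBound`, and the `maxLookback` setting are all hypothetical. The idea is that the lower bound passed to minTimeuuid(?) is never older than the configured lookback, whether or not "last repaired at" is known.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

// Hypothetical helper, not part of ecChronos: computes the oldest point in
// time that a repair_history iteration would be allowed to reach.
final class RepairHistoryLookback {
    // Assumed configuration value, e.g. 30 days.
    private final Duration maxLookback;

    RepairHistoryLookback(Duration maxLookback) {
        if (maxLookback.isNegative() || maxLookback.isZero()) {
            throw new IllegalArgumentException("maxLookback must be positive");
        }
        this.maxLookback = maxLookback;
    }

    // Returns the timestamp to use as the lower bound of the query.
    // If "last repaired at" is known and newer than the limit, it is used
    // as-is; if it is unknown or older than the limit, the limit applies.
    Instant lowerBound(Optional<Instant> lastRepairedAt, Instant now) {
        Instant earliestAllowed = now.minus(maxLookback);
        return lastRepairedAt
                .filter(t -> t.isAfter(earliestAllowed))
                .orElse(earliestAllowed);
    }
}
```

With a bound like this, the second prepared statement (the one with both minTimeuuid(?) and maxTimeuuid(?)) could be used at start-up as well, so the unbounded scan would no longer be needed. The trade-off is that repairs older than the lookback are treated as if they never happened.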
