Giter Club home page Giter Club logo

servicefabric.watchdog's Introduction

ServiceFabric.Watchdog

ServiceFabric.Watchdog is a little framework that helps you emit custom metrics from your Azure Service Fabric Service and monitor/act on those metrics in a watchdog. An example of what you can do is to expose metrics like AverageResponseTime and RequestsPerSecond, and based on those metrics you can setup rules that scales up or down the number of instances of that service. Interesting? Here are the details!

Emitting metrics

The first thing you have to do is emit custom metrics. In the ServiceFabric.Watchdog solution you will find two projects:

  • ServiceFabric.Metrics, which is a .NET Standard 2.0 library that contains various base classes that implements different types of metrics, like:

    • Value
    • Sum
    • Average
    • High
    • Low

    You derive your own metrics from one of these base classes, and either submit the values yourself as they happen via SetValue/AddValue or you override ProvideValue and call SetValue/AddValue there. This needs to be done for all your metrics that you want to expose. Here is an example from the sample:

    internal class IterationsCountMetric : ValueMetric
    {
        private MetricsBuilderSampleService _service;
        public IterationsCountMetric(MetricsBuilderSampleService service)
            : base("Iterations")
        {
            _service = service;
        }
    
        public override Task ProvideValue()
        {
            SetValue((int)_service.Iterations);
            return Task.CompletedTask;
        }
    }
  • ServiceFabric.Watchdog is a .NET Framework that contains the rest of the implementation. Of interest is the MetricsBuilder class. Create an instance of it, add your metrics and call it's Start method. In the sample we do this in RunASync like this:

    protected override async Task RunAsync(CancellationToken cancellationToken)
    {
        // setup the metrics and start the metrics builder on a 5 second interval
        var metricsBuilder = new MetricsBuilder();
        metricsBuilder.Add(new CPUMetric());
        metricsBuilder.Add(new WorkingSetMetric());
        metricsBuilder.Add(new IterationsCountMetric(this));
        metricsBuilder.Start(this.Partition, new TimeSpan(0, 0, 5));
    
        try
        {
            while (true)
            {
                cancellationToken.ThrowIfCancellationRequested();
    
                ServiceEventSource.Current.ServiceMessage(this.Context, "Working-{0}", ++_iterations);
    
                await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
            }
        }
        catch(Exception)
        {
            metricsBuilder.Stop();
        }
    }

Now, in addition to this we need to specify load metrics when we create the service. You can do this in many ways depending on how you create the service - I do it through the Service Fabric Explorer UI. On the MetricsBuilderSample Service Type, press Create and specify the metrics like this:

Create Service with load metrics

Creating the watchdog

Ok, we emit the metrics we want. Now we want to monitor them and take actions on them as well. For that we use the RuleWatchdog class and add rules to it. The properties in the Rule class are:

  • RuleFilter. Here you can filter on which objects you want to apply the rule on. You can filter on:

    • Application
    • ApplicationType
    • Service
    • ServiceType

    You can filter on exact match, starts with, ends with and contains. You can combine filters with and or or, for example like this: "Application = '*Watchdog' && Service = 'MetricsBuilderSample'". If you leave this empty no service is ruled out.

  • TriggerExpression. This is the expression that uses the metrics emitted. In our sample we have a fairly simple version, but you can build as complex patterns as you like. In the sample we use: "Iterations > 45"

  • AggregateData. This property defines if we should aggregate the metrics across all running instances or not.

  • TriggerAction. This is the action we want to take if the TriggerExpression evaluates to true. This could be anything; right now we support the following actions (but more can be added by deriving from the Action base class):

    • Scaling stateless services up or down.
    • Changing the MoveCost property for a service.
    • Deleting the running service instance.
  • TriggerPeriod. To avoid suddens spikes in the metrics causing undesired actions, you can specify a time interval here. This interval defines the time the TriggerExpression continously triggers before the action is taken.

  • ActionGracePeriod. To stop actions firing rapidly you can assign a time interval that hinders an action to fire again. This interval will then stop this action to fire again before it has expired.

Putting it together, here is the code in WatchdogSampleService that set up the rule and starts the RuleWatchdog instance:

protected override async Task RunAsync(CancellationToken cancellationToken)
{
    // create the watchdog and create our sample rule
    var ruleWatchdog = new RuleWatchdog();
    ruleWatchdog.Rules.Add(new Rule()
    {
        // only applicable for applications with Watchdog in their name
        RuleFilter = new StringExpression("Application == \'*Watchdog*\'"),
        // action will trigger when Iterations are above 45
        TriggerExpression = new IntExpression("Iterations > 45"),
        // We should aggretage data for the instances of the service
        AggregateData = true,
        // The expression need to trigger for 1 minute before action kicks in
        TriggerPeriod = new TimeSpan(0, 1, 0),
        // 2 minutes need to expire before the expression is considered again after an action has been done
        ActionGracePeriod = new TimeSpan(0, 2, 0),
        // we want to scale the service up, 1 at the time until we have the service on all nodes in the cluster
        TriggerAction = new ScaleStatelessServiceUpRuleAction()
        {
            MaxNumInstances = -1,
            MinNumInstances = 1,
            ScaleDeltaNumInstances = 1
        }
    });
    // start the watchdog and check the rules every 10 seconds
    ruleWatchdog.Start(new TimeSpan(0, 0, 10));

    try
    {
        while (true)
        {
            // just to keep the loop running
            cancellationToken.ThrowIfCancellationRequested();
            await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
        }
    }
    catch (Exception)
    {
        ruleWatchdog.Stop();
    }
}

That's it! Test out the sample, implement it in your own solution and improve it by doing a pull request!

servicefabric.watchdog's People

Contributors

krist00fer avatar mpervovskiy avatar peterbryntesson avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

servicefabric.watchdog's Issues

Watchdog doesn't use ReliableState to store the state of the rule

Suggestion: When creating the Watchdog class - have an optional parameter that sends in the StateManager. If set - we use the state manager to persist the state of the rules, otherwise we run it as it is today. This will allow a stateful service that runs Watchdog to fail over with persistent rules and current state.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.