AIOps 2020

This is our group's final project for Advanced Network Management 2020.

Repository Guide

This repository contains the final project of our team, The Anomalies.

Directory

  • Docs/: Documentation for this project, including the project specification, our presentation, and our final report.
  • Legacy/: All of our previous prototypes and pre-trained models.
  • Processing/: The notebooks we created to process data from the different KPI sources.
  • Scripts/: The main and test scripts.

Running Scripts on Tencent Cloud

  • Make sure the Python script starts with the following shebang line:

    #!/usr/bin/env python3
  • Make it executable:

    chmod +x Consumer.py
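  • Run it in the background with nohup so it keeps running after you log out (the -u flag makes Python's output unbuffered, so the log is updated immediately):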

    nohup python -u ./Consumer.py > [output file name].log &
  • View the output log:

    cat [output file name].log
  • To stop the process, look up its PID and kill it:

    ps ax | grep Consumer.py
    kill PID

Problem statement

Find the root cause of an anomaly in a microservice-based software system.

Definitions

  1. Anomaly detection: given the time-series behavior of the system, label the values that exceed a chosen threshold as anomalies.

  2. Troubleshooting: the task of finding the root cause of a failure and fixing it. It has three steps:

    1. Find the time t when the success rate is much lower than 1.
    2. Around that time, check the behavior of the microservices, hosts, and containers.
    3. After finding the abnormal source, determine which of its KPIs behave anomalously.
  3. Microservice system: a user request is processed as follows.

    1. The user sends a request (UUID-n).
    2. The Remote Procedure Call (RPC) framework makes consecutive calls to different microservices to process the user's request.
    3. The web service returns a response (UUID-n).

    MSG Order  UUID    Sent at     Received at  MSG (m->n)
    1          UUID-1  1516171819  1516171821   call(start->a)
    2          UUID-1  1516171820  1516171821   call(a->b)
    3          UUID-1  1516171821  1516171822   response(b->a)
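
The threshold in Definition 1 and the "much lower than 1" in troubleshooting step 1 are left open. The sketch below is one possible reading, not the project's method: it assumes a k-sigma rule (flag values more than k sample standard deviations from the mean) for Definition 1 and a fixed success-rate threshold of 0.9 for step 1. The function names and the per-minute success rates are hypothetical.

```python
from statistics import mean, stdev


def k_sigma_anomalies(values, k=3.0):
    """Definition 1 (one possible threshold): indices of values lying more
    than k sample standard deviations away from the series mean."""
    if len(values) < 2:
        return []  # stdev() needs at least two points
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # a constant series has no outliers
    return [i for i, v in enumerate(values) if abs(v - mu) > k * sigma]


def failure_times(timestamps, success_rates, threshold=0.9):
    """Troubleshooting step 1: timestamps whose success rate is below threshold."""
    return [t for t, r in zip(timestamps, success_rates) if r < threshold]


# Hypothetical per-minute ESB success rates with one bad minute.
minutes = [1516171800 + 60 * i for i in range(10)]
rates = [1.0, 1.0, 0.99, 1.0, 1.0, 0.42, 1.0, 1.0, 0.98, 1.0]

print(failure_times(minutes, rates))      # [1516172100]
print(k_sigma_anomalies(rates, k=2.0))    # [5]
print(k_sigma_anomalies(rates, k=3.0))    # []
```

Note that k = 3 flags nothing here: for n points, no value can lie more than (n-1)/√n sample standard deviations from the mean (about 2.85 for n = 10), because a single spike inflates the standard deviation itself. For short windows, a smaller k or a robust statistic such as the median absolute deviation may work better.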

Data sources

  1. ESB business indicator (ESB)

    Service name  Start time  Average time  num (# requests)  # success  Success rate
    osb_001       1516171819  0.45678       360               360        1.0
    osb_001       1516171819  0.45678       461               461        1.0

    Since osb_001 is the only service, the Service name column can be ignored. The data is recorded every minute.

  2. Trace: a user request (with a unique ID) made up of several microservice calls, also known as spans. The spans of a trace form a tree: every span except the root span has a parent span. Spans are of two types, inside and outside.

    ID  Parent ID  Trace ID  Start time  Elapsed time  Service name  cmdb ID  Call type  Success  ds name
    1   None       1         t1          t2-t1         foo           db_008   osb        True     -
    2   1          2         t3          t4-t3         bar           db_008   csf        True     -
    3   2          3         t5          t6-t5         bar           db_009   local      False    ANM

    callType: there are six call types.

    inside spans: 1. osb 2. remoteprocess 3. flyremote

    outside spans: 4. csf 5. local 6. jdbc

    dsName: the name of the database accessed by the microservice. It is present only for local or jdbc calls, where the database access itself can be regarded as a microservice.

  3. Host KPI data

    Item ID  Name      Bomc ID  Timestamp     Value   cmdb ID
    1111     CPU_free  ZJ02     163249574938  420.69  db_008
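
The three data sources can be linked to carry out troubleshooting steps 2 and 3. A minimal sketch, assuming spans and host KPI rows have already been parsed into Python objects (the Span class, the helper names, and every value below are hypothetical): it rebuilds a trace's span tree from the parent IDs, treats failed spans whose children all succeeded as candidate root-cause locations (a heuristic, not something the data guarantees), and collects that cmdb ID's host KPIs in a window around the failure time. Trace and KPI timestamps are assumed to share the same unit.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

INSIDE = {"osb", "remoteprocess", "flyremote"}
OUTSIDE = {"csf", "local", "jdbc"}


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]       # None for the root span
    cmdb_id: str
    call_type: str
    start_time: int
    success: bool
    ds_name: Optional[str] = None  # only set for local / jdbc calls


def build_tree(spans):
    """Return (root span, {parent_id: [child spans]}) for the spans of one trace."""
    root, children = None, defaultdict(list)
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            children[s.parent_id].append(s)
    return root, children


def walk(span, children, depth=0):
    """Yield (depth, span) pairs in depth-first order, starting at span."""
    yield depth, span
    for child in children.get(span.span_id, []):
        yield from walk(child, children, depth + 1)


def deepest_failures(root, children):
    """Heuristic: failed spans whose children all succeeded."""
    return [s for _, s in walk(root, children)
            if not s.success
            and all(c.success for c in children.get(s.span_id, []))]


def kpi_window(kpi_rows, cmdb_id, t, radius):
    """Group one cmdb ID's host KPI rows within [t - radius, t + radius] by KPI name.

    Each row is (item_id, name, bomc_id, timestamp, value, cmdb_id),
    mirroring the columns of the host KPI table.
    """
    series = defaultdict(list)
    for _item_id, name, _bomc_id, ts, value, cid in kpi_rows:
        if cid == cmdb_id and t - radius <= ts <= t + radius:
            series[name].append((ts, value))
    return {name: sorted(points) for name, points in series.items()}


# Hypothetical trace modeled on the span table, with all spans in one trace.
trace = [
    Span("1", None, "db_008", "osb", 1000, True),
    Span("2", "1", "db_008", "csf", 1001, True),
    Span("3", "2", "db_009", "local", 1002, False, "ANM"),
]
root, children = build_tree(trace)
for depth, s in walk(root, children):
    if s.call_type in INSIDE:
        kind = "inside"
    elif s.call_type in OUTSIDE:
        kind = "outside"
    else:
        kind = "unknown"
    print("  " * depth + f"{s.span_id}: {s.call_type} ({kind}) on {s.cmdb_id}")

# Troubleshooting steps 2 and 3: inspect the suspect's host KPIs around the failure.
suspect = deepest_failures(root, children)[0]
kpis = [
    (1111, "CPU_free", "ZJ02", 960, 420.69, "db_009"),
    (1111, "CPU_free", "ZJ02", 1000, 12.5, "db_009"),
    (1111, "CPU_free", "ZJ02", 1000, 418.0, "db_008"),
]
print(kpi_window(kpis, suspect.cmdb_id, t=suspect.start_time, radius=60))
```

The printed tree shows the failing local call on db_009 as the deepest span, so its CPU_free readings are the ones returned. A failed span with a failed child is skipped on the assumption that the failure originates further down; the 60-unit window radius is an arbitrary choice.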
