meshping's Introduction

meshping

Pings a number of targets at once, collecting their response times in histograms. Deploy at strategic points in your network to detect weak links and gain insight into your network's topology.

Features

  • Graphs show latencies as they are, not aggregated into an average.
  • Runs traceroute to show the hops between your monitoring node and the targets.
  • Uses traced routes to draw a map of your network, rendered as an SVG.
  • Performs Path MTU discovery for each hop along the route, so you can see where MTUs get smaller.
  • Detects and displays routing loops.
  • Shows AS info about the hops along the route.
  • Shows where exactly an outage occurs by coloring nodes in the network map red, even if those aren't targets.
  • Multiple targets can be rendered in a single graph for comparison.
  • Meshping instances can be peered with one another and will then ping the same targets.
  • Scrapeable by Prometheus.
  • Targets can be added and removed on-the-fly, without restarting or reloading anything.
  • IPv6 supported.
  • Docker images: https://hub.docker.com/r/svedrin/meshping

UI

Here's a screenshot of the main Web UI:

web_ui

There's a mobile-friendly version too:

web_ui-mobile

Loop detection looks like this:

web_ui-loop-detected

Here's a view of the traced route, including the Path MTU up to each hop and the AS information:

web_ui-traceroute

Last but not least, here's an example for a network map, also including AS information:

web_ui-netmap

When a target stops responding, nodes in the network map are colored to show where an outage might have occurred (I faked this one for demo purposes by dropping their responses using iptables):

web_ui-netmap

Heatmaps

Meshping can render heatmaps from the pings measured over the last few (by default, three) days. They look like this:

built-in heatmap

You can nicely see that, while most of the pings are between 11 and 16ms, a significant number take around 22ms. This indicates that under some conditions, packets may take a different route to the recipient, or the recipient may just take longer to send a reply.

Here's one we recently used to debug connectivity issues a customer was having in one of our datacenters. One of the firewalls had gone bonkers, occasionally delaying packets. The average ping had gone up to 7ms, which maybe would not have looked all that bad, but the histogram very clearly shows that something's wrong:

datacenter heatmap

This was actually bad enough that RDP sessions would drop and file shares would become unavailable. Being able to clearly see the issue (and also verify the fix!) was invaluable.

Here's the heatmap for a WiFi point-to-point link that spans a few kilometers:

wifi

Most pings are fine, but there does appear to be a fair bit of disturbance - maybe there's a tree in the way.

The time span covered by these can be configured by setting the MESHPING_HISTOGRAM_DAYS environment variable to a value other than 3.

Deploying

Deploying meshping is easiest using docker-compose, with the docker-compose.yaml file from this repo. This will deploy meshping, along with a Watchtower instance that keeps Meshping up-to-date. It can be deployed as-is by adding a Stack through Portainer, or using docker-compose:

mkdir meshping
cd meshping
wget https://raw.githubusercontent.com/Svedrin/meshping/master/examples/docker-compose.yaml
docker compose up --detach

Meshping should now be reachable at http://<your-ip>:9922.

Running on a Raspberry Pi

A Docker image for the Raspberry Pi 4 is also available. To use it, you need to have:

  • Docker version 19.03.9 or newer, and
  • libseccomp version 2.4.2 or newer.

See issue #30 for details and instructions on how you can check if you have them, and provide them if not.

Running on Windows

Running meshping on Windows is easiest using Docker Desktop and the WSL2 backend. To use this method, you need to have WSL2 and Docker installed. Then run these commands in a PowerShell terminal:

docker volume create meshping-db
docker run -d --name meshping -p 9922:9922 --restart=always --hostname (Get-Item env:\Computername).Value -v meshping-db:/opt/meshping/db svedrin/meshping

This will start MeshPing and configure it to start automatically and show the correct hostname in the UI.

Distributed Meshping

If you have set up multiple Meshping instances, you can have them exchange targets via peering. To do this, set the MESHPING_PEERS env var in each instance to point to each other. That way, they will exchange target lists regularly, and you will be able to retrieve statistics from both sides to see how your links are doing.

Latency Analysis

When doing mathematical analyses on measurements, monitoring tools usually apply calculations based on averages and standard deviations. With latency data however, these methods yield unsatisfactory results.

Let's take another look at this heatmap:

built-in heatmap

You'll see that, while most of the pings are between 11 and 16ms, a significant number take around 22ms. The average as calculated by meshping is 16ms, and the standard deviation is probably somewhere around 2ms.

Now suppose you're trying to formulate an alerting rule based on those numbers. Say you'd want to be alerted whenever ping results differ from the average by more than two standard deviations. With an average of 16ms and a standard deviation of 2ms, this means that data points smaller than 12ms or greater than 20ms would potentially trigger an alert. Since this would probably be a bit noisy, let's assume you'll only alert after a bunch of those arrive over a given time span.

But as you can see from the graph, there's a significant number of pings that just take 22ms, for whatever reason. Since this is a WAN link that we don't have control over, we won't be able to fix it - we just have to take it for what it is. Now how do you express that in terms of averages and standard deviations? The answer is: you can't, because the data does not follow a Normal distribution. Instead, this signal consists of two separate signals (because it seems that these packets can take two different routes, resulting in a small difference in latency), each of which can be described using those terms: one lives at 13±3ms, the other one at 22±2ms. To conduct a meaningful analysis of this data, you can't just approach it as if it consisted of one single signal.
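To make this concrete, here is a small sketch using synthetic data (not meshping's measurements). The mix of 70% "fast route" and 30% "slow route" pings and their spreads are assumptions chosen for illustration only. It shows that the overall mean of a two-mode signal sits in a region where hardly any ping actually lands.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Synthetic data, assumed for illustration: two routes with different latencies.
fast_route = [rng.gauss(13.0, 1.0) for _ in range(7000)]
slow_route = [rng.gauss(22.0, 0.7) for _ in range(3000)]
samples = fast_route + slow_route

# Treat everything as one signal, like an average-based alerting rule would.
mean = statistics.fmean(samples)
stdev = statistics.pstdev(samples)

# Share of pings that actually land within 1 ms of the overall mean.
near_mean = sum(1 for s in samples if abs(s - mean) <= 1.0) / len(samples)

print(f"overall: {mean:.1f} ms +/- {stdev:.1f} ms")
print(f"fast route mean: {statistics.fmean(fast_route):.1f} ms")
print(f"slow route mean: {statistics.fmean(slow_route):.1f} ms")
print(f"pings within 1 ms of the overall mean: {near_mean:.1%}")
```

The overall standard deviation comes out much larger than that of either route, because it mostly measures the distance between the two modes rather than the jitter within them.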

I'd like to start looking for a solution to this. At the moment I'm focusing on getting the graphs to a point that they visualize this, and I'm pretty satisfied with the heatmaps as they are currently. Next, I'll probably look into modality detection and finding ways to extract patterns out of the data that I can then use to draw conclusions from.

Much of this is derived from Theo Schlossnagle's talk about Math in big systems, go check it out if you want to know more.

Prometheus

Meshping provides a /metrics endpoint that is meant to be scraped by Prometheus. You can run queries on the data for things such as:

  • loss rate in %: rate(meshping_lost{target="$target"}[2m]) / rate(meshping_sent[2m]) * 100
  • quantiles: histogram_quantile(0.95, rate(meshping_pings_bucket{target="$target"}[2m]))
  • averages: rate(meshping_pings_sum{target="$target"}[2m]) / rate(meshping_pings_count[2m])
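For readers wondering how a quantile can be derived from bucket counters at all, here is a simplified sketch of the idea: find the bucket that contains the requested rank and interpolate linearly inside it. This is not Prometheus' actual code (it ignores the +Inf bucket and other edge cases), and the function name and bucket layout are assumptions made for this example.

```python
def bucket_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound, similar in shape to the `le` labels of `_bucket` series.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if count == lower_count:
                # Empty bucket: nothing to interpolate over.
                return upper_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]


# Hypothetical counts: no pings up to 10, 50 pings up to 20, 100 pings up to 40.
example = [(10, 0), (20, 50), (40, 100)]
print(bucket_quantile(0.5, example))
print(bucket_quantile(0.75, example))
```

Because of the interpolation, the result is only as precise as the bucket boundaries allow.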

Configuration options

Meshping is configured through environment variables. These exist:

  • MESHPING_TIMEOUT: Ping timeout (default: 5s).
  • MESHPING_PEERS: Comma-separated list of other Meshping instances to peer with (only ip:port, no URLs).
  • MESHPING_HISTOGRAM_DAYS: How many days of data to keep in the histograms (default: 3).
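As an illustration of how these variables fit together, here is a minimal sketch of reading them with their documented defaults. It is not meshping's actual startup code: the function name `load_config`, the returned dictionary, and the assumption that the timeout is given as a plain number of seconds are all made up for this example.

```python
import os


def load_config(env=None):
    """Read Meshping-style settings from a mapping of environment variables."""
    env = os.environ if env is None else env
    # Assumption: the timeout is a plain number of seconds.
    timeout = float(env.get("MESHPING_TIMEOUT", "5"))
    # Comma-separated ip:port entries; empty entries are ignored.
    peers = [p.strip() for p in env.get("MESHPING_PEERS", "").split(",") if p.strip()]
    histogram_days = int(env.get("MESHPING_HISTOGRAM_DAYS", "3"))
    return {"timeout": timeout, "peers": peers, "histogram_days": histogram_days}


print(load_config({"MESHPING_PEERS": "10.0.0.1:9922, 10.0.0.2:9922"}))
```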

Dev build

Building locally for development is easiest by running the ./run-dev.sh script. It will build the container and start up Meshping.

Known issues

  • If you're running meshping behind nginx, be sure to set proxy_http_version 1.1; or it'll be unbearably slow.
  • Only scores 11/12 in the Joel Test.

Stargazers over time

Stargazers over time

Who do I talk to?

  • First and foremost: Feel free to open an issue in this repository. :)
  • If you'd like to get in touch, you can send me an email.
  • I also regularly hang out at the Linux User Group in Fulda.

meshping's People

Contributors

cglewis, dependabot[bot], mfuhrmann, svedrin, wolph

meshping's Issues

Add traceroute feature

I'm wondering if we should add route tracing to meshping, so that you can click on a target and see the route that packets take in a similar fashion to what mtr shows, including the pings of the individual hops. Also, buttons to easily add those hops as targets would be nice.

Integration with an IPAM such as NetBox or phpipam, or a generic source plugin API

I'd like to integrate meshping with an IPAM tool such as netbox or phpipam, to automatically fetch targets from those. Also, UniFi comes to mind, which makes clients and devices available via the API.

These projects also seem very interesting:

TBD:

  • What happens to devices/clients that vanish from the list of active clients? Should MeshPing delete them straight away, or should there be some kind of grace period for a device to come back? How does UniFi handle this?
  • Could we integrate with NetBox or phpipam in a way such that graphs can be viewed directly from within them? NetBox's modular architecture probably allows this quite easily; for phpipam we'd probably have to provide patches or something. With UniFi it's not realistically possible.
  • Should we specify an API that makes it easy to write "IP source plugins" or something? Just have a directory of binaries, call them one by one and have them dump their data to stdout in some format we specify. Include examples for things we want to support natively, so that users have something to start from.

Alternative Grafana dashboard with configurable interval

From bitbucket:

Since the 90 day default doesn’t show anything initially it’s a tad hard to debug initially. So here’s a dashboard where you can configure both the total duration and the interval :)

{
  "__inputs": [
    {
      "name": "DS_PROMETHEUS",
      "label": "Prometheus",
      "description": "",
      "type": "datasource",
      "pluginId": "prometheus",
      "pluginName": "Prometheus"
    }
  ],
  "__requires": [
    {
      "type": "grafana",
      "id": "grafana",
      "name": "Grafana",
      "version": "6.3.5"
    },
    {
      "type": "panel",
      "id": "heatmap",
      "name": "Heatmap",
      "version": ""
    },
    {
      "type": "datasource",
      "id": "prometheus",
      "name": "Prometheus",
      "version": "1.0.0"
    }
  ],
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": "-- Grafana --",
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "gnetId": null,
  "graphTooltip": 0,
  "id": null,
  "iteration": 1575568444654,
  "links": [],
  "panels": [
    {
      "cards": {
        "cardPadding": null,
        "cardRound": null
      },
      "color": {
        "cardColor": "#0a50a1",
        "colorScale": "sqrt",
        "colorScheme": "interpolateBlues",
        "exponent": 0.5,
        "mode": "opacity"
      },
      "dataFormat": "tsbuckets",
      "datasource": "${DS_PROMETHEUS}",
      "gridPos": {
        "h": 10,
        "w": 24,
        "x": 0,
        "y": 0
      },
      "heatmap": {},
      "hideZeroBuckets": false,
      "highlightCards": true,
      "id": 2,
      "legend": {
        "show": true
      },
      "links": [],
      "options": {},
      "reverseYBuckets": false,
      "targets": [
        {
          "expr": "increase(meshping_pings_bucket{target=\"$target\"}[$interval])",
          "format": "heatmap",
          "interval": "$interval",
          "intervalFactor": 2,
          "legendFormat": "{{ le }}",
          "refId": "A"
        }
      ],
      "title": "Heat Map",
      "tooltip": {
        "show": true,
        "showHistogram": true
      },
      "tooltipDecimals": null,
      "type": "heatmap",
      "xAxis": {
        "show": true
      },
      "xBucketNumber": null,
      "xBucketSize": null,
      "yAxis": {
        "decimals": null,
        "format": "ms",
        "logBase": 1,
        "max": null,
        "min": null,
        "show": true,
        "splitFactor": null
      },
      "yBucketBound": "auto",
      "yBucketNumber": null,
      "yBucketSize": null
    }
  ],
  "refresh": "5m",
  "schemaVersion": 19,
  "style": "dark",
  "tags": [],
  "templating": {
    "list": [
      {
        "allValue": null,
        "current": {},
        "datasource": "${DS_PROMETHEUS}",
        "definition": "",
        "hide": 0,
        "includeAll": false,
        "label": null,
        "multi": false,
        "name": "target",
        "options": [],
        "query": "meshping_min",
        "refresh": 1,
        "regex": "/.*target=\"(.*)\".*/",
        "skipUrlSync": false,
        "sort": 3,
        "tagValuesQuery": "",
        "tags": [],
        "tagsQuery": "",
        "type": "query",
        "useTags": false
      },
      {
        "auto": true,
        "auto_count": 30,
        "auto_min": "10s",
        "current": {
          "text": "auto",
          "value": "$__auto_interval_interval"
        },
        "hide": 0,
        "label": null,
        "name": "interval",
        "options": [
          {
            "selected": true,
            "text": "auto",
            "value": "$__auto_interval_interval"
          },
          {
            "selected": false,
            "text": "1m",
            "value": "1m"
          },
          {
            "selected": false,
            "text": "10m",
            "value": "10m"
          },
          {
            "selected": false,
            "text": "30m",
            "value": "30m"
          },
          {
            "selected": false,
            "text": "1h",
            "value": "1h"
          },
          {
            "selected": false,
            "text": "6h",
            "value": "6h"
          },
          {
            "selected": false,
            "text": "12h",
            "value": "12h"
          },
          {
            "selected": false,
            "text": "1d",
            "value": "1d"
          },
          {
            "selected": false,
            "text": "7d",
            "value": "7d"
          },
          {
            "selected": false,
            "text": "14d",
            "value": "14d"
          },
          {
            "selected": false,
            "text": "30d",
            "value": "30d"
          }
        ],
        "query": "1m,10m,30m,1h,6h,12h,1d,7d,14d,30d",
        "refresh": 2,
        "skipUrlSync": false,
        "type": "interval"
      }
    ]
  },
  "time": {
    "from": "now-1h",
    "to": "now"
  },
  "timepicker": {
    "refresh_intervals": [
      "5s",
      "10s",
      "30s",
      "1m",
      "5m",
      "15m",
      "30m",
      "1h",
      "2h",
      "1d"
    ],
    "time_options": [
      "5m",
      "15m",
      "1h",
      "6h",
      "12h",
      "24h",
      "2d",
      "7d",
      "30d"
    ]
  },
  "timezone": "browser",
  "title": "Meshping",
  "uid": "000000004",
  "version": 3
}

Configure targets with file

Hello,

Is it possible to configure targets using a configuration file, so they can be managed with tools like Ansible / Puppet / ... ?

Thanks :)

Distributed Meshping Redesign

Distributed Meshping in its current form works by allowing multiple instances of meshping to use the same Redis, so they ping the same hosts. While nice in theory, in practice this sucks:

  • It requires both nodes to have reliable access to Redis, thus no WAN links may sit in between. Monitoring WAN links from both sides is the main reason for meshping to exist though.
  • Meshping seduces you into adding a number of local targets, which are not necessarily interesting when pinging across WAN links.

Thus, we should redesign this feature as such:

  • Introduce a "peers" list. Peers are expected to be other Meshping instances of roughly the same version.

  • Since we have an HTTP API now, expose an endpoint for our peers to call us at.

  • At regular intervals, we call the same interface of our peers, sending them a dump of our targets, whether or not that target is local, and on which interface we see it (indicated by its network address).

  • Our peers will respond to that call with statistics for those targets that they have gathered for us, and will add any target to their list that is

    • either not local to us,
    • or local to us on a network that is also local to them.

    This way we can have multiple meshping instances run on the same LAN segment and they will ping local targets, but when we add a remote Meshping in the mix, that one will ignore targets on the LAN and only ping the other Meshpings.
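The acceptance rule described above could be sketched roughly like this. The function name and the way networks are passed around (CIDR strings, with `None` meaning "not local to the sender") are assumptions for illustration, not the implemented wire format.

```python
import ipaddress


def peer_should_add(sender_network, own_networks):
    """Decide whether a receiving peer adds a target sent by another instance.

    sender_network: CIDR string of the network on which the sender sees the
        target locally, or None if the target is not local to the sender.
    own_networks: CIDR strings of the networks local to the receiving peer.
    """
    if sender_network is None:
        return True  # not local to the sender: always accept
    sender_net = ipaddress.ip_network(sender_network)
    for own in own_networks:
        own_net = ipaddress.ip_network(own)
        if own_net.version == sender_net.version and own_net.overlaps(sender_net):
            return True  # local to the sender on a network we share
    return False  # local to the sender only: ignore it


# A remote peer ignores LAN targets of the sender, but accepts everything else.
print(peer_should_add("192.168.1.0/24", ["10.0.0.0/8"]))
print(peer_should_add(None, ["10.0.0.0/8"]))
```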

Clustering

Adding a cluster feature would be nice. We could:

  • Correctly alert on a node being down (as opposed to multiple separate alerts that pings are dropping)
  • When that happens, use a surviving node to view the dead node's data up until it died (see also #19 )
  • Sync targets in a much nicer way (the current implementation handles target deletions and renames pretty badly)

Rough idea for an implementation:

  • We'll probably need some kind of Raft implementation, so that nodes can elect a leader (or at least know if they're in Quorum)
  • While nodes are out of quorum, they should continue to ping their targets and collect data. They should not send out alerts or try to send their data to other nodes though.
  • It would probably be awesome if it didn't matter which other node a node talks to, as long as that other node is in quorum. This would make it easier to use MeshPing in a network where not all the nodes can see each other.

Tech ideas:

  • Nodes will have to distribute information to other nodes. Something like ZeroMQ would probably make this tremendously easier because it supports PubSub. So, a node could connect to another node, and if that node is in quorum, subscribe to its updates and get all the cluster's data. ZMQ could run on port 9923.
  • Join should work via the HTTP API. Accepting joins should probably be the master's responsibility.
  • For authentication, nodes should sign their messages using libsodium. Unsigned messages should be discarded.
  • Local targets should be in a separate database table. This makes them easier to distinguish from cluster targets and probably also makes it easier to operate a single node.
  • It would be awesome if nodes didn't even have to be a Meshping instance, but could also be an ESP or Arduino that just pulls a list of targets from an actual Meshping instance and delivers ping data back. Probably not all that helpful in a professional setting, but could be interesting for home users or people monitoring a distributed WiFi network.

Auto-clear the statistics on the home page in regular intervals

The longer meshping has been running, the less useful the statistics page becomes because it now displays an average over like months of data. These stats should probably be reset in a fixed interval, or maybe display the stats for the current hour before those are molded into a histogram? Or should it just be a moving window over the last hour (as in, 60 minutes, not the current-hour-until-now)? At any rate, it seems like one hour is a sensible time window for the summary.
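To compare the options, here is a minimal sketch of the moving-window variant: keep only the samples of the last 60 minutes and compute the summary from those. The class name, the use of plain numeric timestamps, and `None` as the marker for a lost ping are assumptions of this sketch.

```python
from collections import deque


class RollingStats:
    """Summary statistics over a moving time window (default: one hour)."""

    def __init__(self, window_seconds=3600):
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp, latency in ms, or None if lost)

    def add(self, timestamp, latency_ms):
        self.samples.append((timestamp, latency_ms))
        self._expire(timestamp)

    def _expire(self, now):
        # Drop samples that have fallen out of the window.
        while self.samples and self.samples[0][0] <= now - self.window_seconds:
            self.samples.popleft()

    def summary(self, now):
        self._expire(now)
        received = [lat for _, lat in self.samples if lat is not None]
        sent = len(self.samples)
        return {
            "sent": sent,
            "lost": sent - len(received),
            "avg": sum(received) / len(received) if received else None,
        }


stats = RollingStats()
stats.add(0, 10.0)
stats.add(1800, None)
stats.add(3600, 20.0)
print(stats.summary(3600))
```

A fixed-interval reset would be simpler to implement, but the summary would be nearly empty right after each reset, which the moving window avoids.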

Support dyndns targets

Currently, DNS names are only resolved when adding a target, and from then on, IPs are assumed to be static. Let's add some kind of support for dyndns targets.

Auto-Reset statistics for a target if it has been down for a longer time

similar to #28

When a target such as a notebook or workstation has been offline for e.g. 3 hours, the statistics show a huge packet loss that is misleading - all packets have been lost for the last hours because the device was gone, but it's not indicative of an ongoing problem with the device. Thus it would be helpful if the statistics would auto-reset for that one target when it comes back. To make sure we don't hide any actual problems though, it must've been down for a longer period of time (e.g. 3 hours) for the auto-reset to trigger.
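A rough sketch of that rule, assuming per-target counters and numeric timestamps (all names here are hypothetical): when a reply arrives after the target has been silent for at least the threshold, the counters start over.

```python
class TargetStats:
    """Per-target counters that reset after a long outage ends."""

    def __init__(self, reset_after_seconds=3 * 3600):
        self.reset_after_seconds = reset_after_seconds
        self.sent = 0
        self.lost = 0
        self.last_reply_at = None

    def record(self, timestamp, got_reply):
        if got_reply:
            if self.last_reply_at is not None:
                down_for = timestamp - self.last_reply_at
                if down_for >= self.reset_after_seconds:
                    # Long outage is over: start from a clean slate.
                    self.sent = 0
                    self.lost = 0
            self.last_reply_at = timestamp
        self.sent += 1
        if not got_reply:
            self.lost += 1


stats = TargetStats(reset_after_seconds=100)
for ts, ok in [(0, True), (50, False), (60, True), (100, False), (150, False), (220, True)]:
    stats.record(ts, ok)
print(stats.sent, stats.lost)
```

Short outages stay visible in the counters, since the reset only triggers once the silence exceeds the threshold. A target that has never replied is not reset in this sketch.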

Update Alpine to 3.13

I'd like to update the Images to Alpine 3.13, thereby shrinking them because I will no longer need to build pandas myself but can just install it through apk.

Unfortunately this does not yet make sense, because it would break our support for Arm: Alpine 3.13 has a new musl which comes with 64-bit time support. This is not yet supported in Raspbian buster's version of Docker though: https://gitlab.alpinelinux.org/alpine/aports/-/issues/11774 This means we'll need a new Docker, which comes in a new Raspbian, which only comes after the release of Debian bullseye. And for that, there is no release date yet.

Until then, we'll need to stick with 3.12 unfortunately. :(

Unable to bring up Docker - error

When running the command docker-compose up --detach, the container keeps restarting with the following error:

meshping_1    | standard_init_linux.go:211: exec user process caused "exec format error"
meshping_1    | standard_init_linux.go:211: exec user process caused "exec format error"

Draw Histograms using peers' data

child of #40

I'd like to be able to draw histograms where a node's own data is included, and the data from its peers, so that I can easily compare the two.

For that, a node probably needs to store which of its peers it sent a target to, so that it knows which peers to query. Or do we just not care and just query all the MESHPING_PEERS, because who else could it be? I need to think about this.

Add vue.js UI

Let's have a responsive UI that allows searching and probably adding/removing targets.

Use InfluxDB as database for histograms

I'm thinking about switching the histogram storage to InfluxDB, where I could simply store the raw data and nicely encapsulate all the histogramming and bucketing in histodraw.

Will try that in staging.

adding an IPv6 target creates an exception

When I add an IPv6 target, meshping stops pinging and this shows up in the logs:

ERROR:hypercorn.error:ASGI Framework Lifespan error, shutdown without Lifespan support
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/hypercorn/trio/lifespan.py", line 31, in handle_lifespan
    await invoke_asgi(self.app, scope, self.asgi_receive, self.asgi_send)
  File "/usr/lib/python3.9/site-packages/hypercorn/utils.py", line 247, in invoke_asgi
    await app(scope, receive, send)
  File "/usr/lib/python3.9/site-packages/quart/app.py", line 1741, in __call__
    await self.asgi_app(scope, receive, send)
  File "/usr/lib/python3.9/site-packages/quart/app.py", line 1767, in asgi_app
    await asgi_handler(receive, send)
  File "/usr/lib/python3.9/site-packages/quart_trio/asgi.py", line 148, in __call__
    break
  File "/usr/lib/python3.9/site-packages/trio/_core/_run.py", line 850, in __aexit__
    raise combined_error_from_nursery
  File "/opt/meshping/src/peers.py", line 18, in run_peers
    peer_targets = [
  File "/opt/meshping/src/peers.py", line 22, in <listcomp>
    local = if4.is_local(target.addr)
  File "/opt/meshping/src/ifaces.py", line 43, in is_local
    return self.find_iface_for_network(target) is not None
  File "/opt/meshping/src/ifaces.py", line 36, in find_iface_for_network
    target = ipaddress.IPv4Address(target)
  File "/usr/lib/python3.9/ipaddress.py", line 1304, in __init__
    self._ip = self._ip_int_from_string(addr_str)
  File "/usr/lib/python3.9/ipaddress.py", line 1191, in _ip_int_from_string
    raise AddressValueError("Expected 4 octets in %r" % ip_str)
ipaddress.AddressValueError: Expected 4 octets in '2a03:4000:36:64::1'
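The traceback ends in ipaddress.IPv4Address, which only accepts IPv4 strings. The standard library's ipaddress.ip_address() returns an IPv4Address or IPv6Address depending on the input, so a version-agnostic lookup could look roughly like this. The function name and the list of CIDR strings are assumptions for this sketch, not the code in ifaces.py.

```python
import ipaddress


def find_network_for(target, local_networks):
    """Return the first local network containing `target`, or None."""
    addr = ipaddress.ip_address(target)  # IPv4Address or IPv6Address
    for network in local_networks:
        net = ipaddress.ip_network(network)
        # Only compare addresses and networks of the same IP version.
        if net.version == addr.version and addr in net:
            return net
    return None


networks = ["192.168.0.0/24", "2a03:4000:36:64::/64"]
print(find_network_for("2a03:4000:36:64::1", networks))
print(find_network_for("10.1.2.3", networks))
```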

pylint: Cell variable target defined in loop

Pylint flagged these lines we introduced in #49:

meshping/src/meshping.py

Lines 64 to 66 in 2272944

trace = await trio.to_thread.run_sync(
    lambda: traceroute(target.addr, fast=True, timeout=0.5, count=1)
)

Pylint says:

src/meshping.py:65:39: W0640: Cell variable target defined in loop (cell-var-from-loop)

More info: https://pylint.pycqa.org/en/latest/user_guide/messages/warning/cell-var-from-loop.html

I'm not sure if this is actually a problem here because we're running the closure right away and awaiting on it, but I want to look into it in more detail and perhaps get rid of it anyway.
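For reference, here is a standalone sketch of what the warning is about, without trio or icmplib (`describe` is a stand-in for the real call). A closure called within the same loop iteration sees the current value, one called after the loop sees the last value, and functools.partial binds the value when it is created.

```python
import functools


def describe(addr):
    return f"tracing {addr}"


def demo(targets):
    immediate, deferred, bound = [], [], []
    for addr in targets:
        # Called within the same iteration: sees the current value.
        immediate.append((lambda: describe(addr))())
        # Called after the loop: every closure sees the last value of `addr`.
        deferred.append(lambda: describe(addr))
        # functools.partial binds the current value at creation time.
        bound.append(functools.partial(describe, addr))
    return immediate, [f() for f in deferred], [f() for f in bound]


immediate, deferred, bound = demo(["10.0.0.1", "10.0.0.2"])
print(immediate)
print(deferred)
print(bound)
```

The flagged lines resemble the "immediate" case, since the closure is awaited before the loop moves on. Switching to functools.partial would likely silence the warning either way, though I haven't verified that against the actual code.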

Show outages on the network map

child of #52

When a target goes down, have the network map show exactly where the outage occurred by drawing all the nodes red that don't respond anymore.

For this to work, we need to store a separate copy of the traceroute while the target is still alive ("last known good traceroute", so to speak), and then render all the nodes red that are missing from the current one when the target is down.
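The comparison itself could be as simple as the following sketch, assuming both traces are available as lists of hop addresses (the function name and the sample addresses are made up for illustration):

```python
def nodes_down(last_known_good, current):
    """Return hops from the last good trace that are missing from the current one."""
    responding = set(current)
    return [hop for hop in last_known_good if hop not in responding]


good_trace = ["10.0.0.1", "172.16.0.1", "192.0.2.10", "192.0.2.20"]
current_trace = ["10.0.0.1", "172.16.0.1"]
print(nodes_down(good_trace, current_trace))
```

Every address returned here would get the nodeDown style when the map is rendered.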

Necessary PlantUML syntax:

<style>
.nodeDown {
    BackgroundColor Crimson
    FontColor White
}
</style>

hide <<nodeDown>> stereotype

node "meshpingpi3" <<nodeDown>> as SELF

Result:

image

Docker Environment Variable to Change Port

Issue:

  • I have a port numbering scheme to keep track of services
  • Don't see a way to change the port natively
  • Only way to change the port is to disable network host mode and map the port, which is not ideal.

Thanks for a good project

Display whois information

Display whois information for targets and route hops.

Exceptions:

  • RFC1918, RFC6598 (100.64.0.0/10)
  • IPv6 addresses not in GUA scope (2000::/3)
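A small sketch of how those exceptions could be checked with Python's ipaddress module. The function name is hypothetical, and the network list simply spells out the RFC1918 ranges plus the RFC6598 range named above.

```python
import ipaddress

# RFC1918 private ranges plus the RFC6598 shared address space.
EXCLUDED_V4 = [
    ipaddress.ip_network(net)
    for net in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "100.64.0.0/10")
]
# IPv6 global unicast addresses.
GUA_V6 = ipaddress.ip_network("2000::/3")


def wants_whois(address):
    """Return True if a whois lookup makes sense for this address."""
    addr = ipaddress.ip_address(address)
    if addr.version == 4:
        return not any(addr in net for net in EXCLUDED_V4)
    return addr in GUA_V6


for candidate in ("8.8.8.8", "192.168.1.1", "100.64.0.1", "2a03:4000:36:64::1", "fe80::1"):
    print(candidate, wants_whois(candidate))
```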

Draw Histograms for multiple targets

I'd like to be able to draw a histogram of multiple targets simultaneously so I can compare the two. For instance, compare the histogram of the default gateway with one of 8.8.8.8.

Network map and traceroute panel does not show hops that don't send "TTL exceeded"

child of #52

Sometimes, hops don't send a "TTL exceeded" message. tracert then shows a line containing * * * * * for them. In the map, they just don't show up at all.

This is because icmplib does not return a hop for them. Instead, it sets the distance attribute on the responding hops to indicate what's going on:

>>> from icmplib import traceroute
>>> hopz = traceroute("192.168.xxx.7")
>>> hopz
[<Hop 1 [10.aaa.bbb.253]>, <Hop 3 [192.168.xxx.7]>]
>>> hopz[0].distance
1
>>> hopz[1].distance
3

Let's make sure the network map and routing panel account for this.
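One way to account for the gaps, sketched without icmplib: take the (distance, address) pairs of the responding hops and fill every missing distance with a placeholder. The function name and the tuple representation are assumptions for this example.

```python
def fill_missing_hops(hops):
    """Return one (distance, address) entry per hop, using None for silent hops.

    `hops` is a list of (distance, address) pairs for the hops that responded.
    """
    by_distance = dict(hops)
    if not by_distance:
        return []
    last = max(by_distance)
    return [(distance, by_distance.get(distance)) for distance in range(1, last + 1)]


# Hop 2 never sent a "TTL exceeded" message, so it shows up as None.
print(fill_missing_hops([(1, "10.0.0.253"), (3, "192.168.0.7")]))
```

The map and the traceroute panel could then draw the None entries as anonymous nodes instead of skipping them.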

Switch from VueJS 2 to alpine.js?

We're still using VueJS 2, which is deprecated. I don't want to use VueJS 3 though, because I feel it's way too large and complicated for what we need here. Instead I'm thinking about switching to alpine.js, which is more in the spirit of what we need here.

Currently this is an enhancement because everything works using VueJS 2, and as long as nothing's broken, it's not a bug.

Add PMTUD feature

I'm wondering if we should add a PMTUD feature. That is, regularly figure out what the path MTU to a certain target is, and show that in the target list.

Work is underway to support this in icmplib:

Let's see if we can add a rogue implementation to MeshPing and then migrate once it's ready upstream.

More interesting maths literature on heatmaps and on automatic error detection

Over the years I've found a few links that have inspired me and that I'd like to turn into Meshping features someday. I'm not sure when that may happen or what else to do with those links right now, so I'm bookmarking them here to come back to when I need them.

Heatmaps and their visualization: https://queue.acm.org/detail.cfm?id=1809426

Theo Schlossnagle mentioned CUSUM as a method to detect from the latencies when it looks like we're screwed: https://de.wikipedia.org/wiki/CUSUM (might be interesting to explore, however we might just not ping often enough for that to actually work)

Brendan Gregg on how to do modality detection: https://www.brendangregg.com/FrequencyTrails/modes.html

Maybe we could combine those indicators with the actual packet loss metric in some way, to see if we can find a model that's able to predict those from measured latencies.

Make time between and number of pings configurable

First of all, thanks for creating meshping, it's just what I needed to properly monitor my home internet connection!

But having a resolution of 1 ping/minute is not enough when trying to find out if you often have intermittent connection issues.

I would like to say e.g. "ping each target 10 times every 15 seconds".

Especially regarding packet loss, this would give me much more data to work with.

Add `Error` state for targets

Add an error state that a target can enter when for instance its name doesn't resolve, so that this does not kill meshping:

Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/hypercorn/trio/lifespan.py", line 29, in handle_lifespan
[...]
  File "/opt/meshping/src/meshping.py", line 78, in run
    pingobj.add_host(target.addr.encode("utf-8"))
  File "oping.pyx", line 96, in oping.PingObj.add_host
oping.PingError: b'getaddrinfo: Name does not resolve'
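A minimal sketch of the idea, using the standard library's resolver instead of oping so it stays self-contained. The Target class, its field names and the state values are assumptions for illustration. The point is that a resolution failure is caught and stored on the target rather than propagating out of the ping loop.

```python
import socket
from dataclasses import dataclass


@dataclass
class Target:
    name: str
    addr: str = ""
    state: str = "unknown"
    error: str = ""


def resolve(target, resolver=socket.gethostbyname):
    """Resolve the target's name; record failures instead of raising."""
    try:
        target.addr = resolver(target.name)
    except OSError as err:  # socket.gaierror is a subclass of OSError
        target.state = "error"
        target.error = str(err)
    else:
        target.state = "ok"
        target.error = ""
    return target


def failing_resolver(name):
    # Stand-in for a DNS failure, so the example does not need a network.
    raise socket.gaierror(-2, "Name does not resolve")


print(resolve(Target("does-not-exist.example"), resolver=failing_resolver))
```

Targets in the error state could be skipped when building the ping object and retried on a later pass.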

The Joel Test: 12 Steps to Better Code

Let's run The Joel Test, just for fun.

  1. Do you use source control? → yes
  2. Can you make a build in one step? → yes, ./run-dev.sh – prod takes zero steps
  3. Do you make daily builds? → yes, CI is automated and runs on every PR/commit to master
  4. Do you have a bug database? → yes
  5. Do you fix bugs before writing new code? → mostly, but not religiously. I'll count this as a 0.8.
  6. Do you have an up-to-date schedule? → I have a rough idea, but nothing formal. Let's count this as a 0.2.
  7. Do you have a spec? → yes 😛
  8. Do programmers have quiet working conditions? → quiet enough for flow to happen, so that'll count as a yes.
  9. Do you use the best tools money can buy? → I use the open-source tools that I enjoy most for which no money is necessary to acquire, so that'll be a yes.
  10. Do you have testers? → I use it daily, so that'll count.
  11. Do new candidates write code during their interview? → My "hiring" process would be "you submit a PR", so I guess yes.
  12. Do you do hallway usability testing? → I have a coupl'a friends using it who I ask what they think, so yes.

Result: 11.

A score of 12 is perfect, 11 is tolerable, but 10 or lower and you’ve got serious problems.

That's super good enough 😄

This is obviously a bit tongue-in-cheek, but I'm actually quite happy with the way the project works overall: It's fun to work with, tech debt feels low and dependencies are manageable. This is way more important to me than a perfect score in any formal methodology, whichever one it may be.
