synadia-io / nex Goto Github PK

View Code? Open in Web Editor NEW

120.0 13.0 12.0 20.28 MB

The NATS execution engine

Home Page: https://docs.nats.io/using-nats/nex

License: Apache License 2.0

Go 99.45% Shell 0.55%

distributed-systems nats scheduler scheduler-service workload

nex's People

Contributors

Stargazers

Watchers

Forkers

jordan-rash kthomas linux-china rodneyosodo k1nho jarema brianmcgee yordis cbnsndwch shono-io fleischr strogo

nex's Issues

msg="Did not receive NATS handshake from agent within timeout."

Observed behavior

When using the latest nex cli with arm64 support update, there seems to be a problem when NATS tries to hand shake agents within microVMs.

admin@ip-x-x-x-x:~$ nats-server -js

[11826] 2024/03/18 07:15:11.309007 [INF] Starting nats-server
[11826] 2024/03/18 07:15:11.309159 [INF]   Version:  2.10.12
[11826] 2024/03/18 07:15:11.309179 [INF]   Git:      [121169ea]
[11826] 2024/03/18 07:15:11.309199 [INF]   Name:     NCXQ6Y3BGPS4XJ4AXQWNTUO7SUYN65FH3UJPUSLNPFX4LE6LSAWRQSNN
[11826] 2024/03/18 07:15:11.309220 [INF]   Node:     8nbUCYL1
[11826] 2024/03/18 07:15:11.309238 [INF]   ID:       NCXQ6Y3BGPS4XJ4AXQWNTUO7SUYN65FH3UJPUSLNPFX4LE6LSAWRQSNN
[11826] 2024/03/18 07:15:11.309683 [INF] Starting JetStream
[11826] 2024/03/18 07:15:11.310617 [INF]     _ ___ _____ ___ _____ ___ ___   _   __  __
[11826] 2024/03/18 07:15:11.310643 [INF]  _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
[11826] 2024/03/18 07:15:11.310661 [INF] | || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
[11826] 2024/03/18 07:15:11.310679 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|
[11826] 2024/03/18 07:15:11.310695 [INF]
[11826] 2024/03/18 07:15:11.310712 [INF]          https://docs.nats.io/jetstream
[11826] 2024/03/18 07:15:11.310729 [INF]
[11826] 2024/03/18 07:15:11.310746 [INF] ---------------- JETSTREAM ----------------
[11826] 2024/03/18 07:15:11.310766 [INF]   Max Memory:      23.47 GB
[11826] 2024/03/18 07:15:11.310788 [INF]   Max Storage:     18.52 GB
[11826] 2024/03/18 07:15:11.310807 [INF]   Store Directory: "/tmp/nats/jetstream"
[11826] 2024/03/18 07:15:11.310825 [INF] -------------------------------------------
[11826] 2024/03/18 07:15:11.311415 [INF] Listening for client connections on 0.0.0.0:4222
[11826] 2024/03/18 07:15:11.312698 [INF] Server is ready

admin@ip-x-x-x-x:~$ sudo nex node preflight

Validating - Required CNI Plugins
          🔎Searching - /opt/cni/bin
          ✅ Dependency Satisfied - /opt/cni/bin/host-local [host-local CNI plugin]
          ✅ Dependency Satisfied - /opt/cni/bin/ptp [ptp CNI plugin]
          ✅ Dependency Satisfied - /opt/cni/bin/tc-redirect-tap [tc-redirect-tap CNI plugin]

Validating - Required binaries
          🔎Searching - /usr/local/bin
          ✅ Dependency Satisfied - /usr/local/bin/firecracker [Firecracker VM binary]

Validating - CNI configuration requirements
          🔎Searching - /etc/cni/conf.d
          ✅ Dependency Satisfied - /etc/cni/conf.d/fcnet.conflist [CNI Configuration]

Validating - VMLinux Kernel
          ✅ Dependency Satisfied - /tmp/wd/vmlinux [VMLinux Kernel]

Validating - Root Filesystem Template
          ✅ Dependency Satisfied - /tmp/wd/rootfs.ext4 [Root Filesystem Template]

admin@ip-x-x-x-x:~$ sudo nex node up

time=2024-03-18T07:17:28.694Z level=INFO msg="Generated keypair for node" public_key=NAJ25OZ52JHNPU43HPCYAZW75MJ3US3ORABO3JYYIRLK4RBTPFI5AWKH
time=2024-03-18T07:17:28.695Z level=INFO msg="Loaded node configuration" config_path=./config.json
time=2024-03-18T07:17:28.695Z level=INFO msg="Initialized telemetry"
time=2024-03-18T07:17:28.696Z level=INFO msg="Established node NATS connection" servers=""
time=2024-03-18T07:17:28.704Z level=INFO msg="Internal NATS server started" client_url=nats://0.0.0.0:44253
time=2024-03-18T07:17:28.705Z level=INFO msg="Virtual machine manager starting"
time=2024-03-18T07:17:28.705Z level=INFO msg="Resetting network"
time=2024-03-18T07:17:28.705Z level=INFO msg="Use this key as the recipient for encrypted run requests" public_xkey=XDIZCIFCIDEGBDR23JGIEKBX43RVXNC3MOJGTU2Q3VXNIQ3ZSH7HT4VC
time=2024-03-18T07:17:28.705Z level=INFO msg="NATS execution engine awaiting commands" id=NAJ25OZ52JHNPU43HPCYAZW75MJ3US3ORABO3JYYIRLK4RBTPFI5AWKH version=0.1.5
time=2024-03-18T07:17:28.707Z level=INFO msg="Publishing node started event"
time=2024-03-18T07:17:29.014Z level=INFO msg="Called startVMM(), setting up a VMM" firecracker=true vmmid=cnrul21t6e8i4vc1sf6g socket_path=/tmp/.firecracker.sock-11865-cnrul21t6e8i4vc1sf6g
time=2024-03-18T07:17:29.026Z level=INFO msg="VMM metrics disabled." firecracker=true vmmid=cnrul21t6e8i4vc1sf6g
time=2024-03-18T07:17:29.026Z level=INFO msg=refreshMachineConfiguration firecracker=true vmmid=cnrul21t6e8i4vc1sf6g err="[GET /machine-config][200] getMachineConfigurationOK  &{CPUTemplate: MemSizeMib:0x40002df748 Smt:0x40002df753 TrackDirtyPages:0x40002df754 VcpuCount:0x40002df740}"
time=2024-03-18T07:17:29.026Z level=INFO msg=PutGuestBootSource firecracker=true vmmid=cnrul21t6e8i4vc1sf6g err="[PUT /boot-source][204] putGuestBootSourceNoContent "
time=2024-03-18T07:17:29.026Z level=INFO msg="Attaching drive" firecracker=true vmmid=cnrul21t6e8i4vc1sf6g drive_path=/tmp/rootfs-cnrul21t6e8i4vc1sf6g.ext4 slot=1 root=true
time=2024-03-18T07:17:29.027Z level=INFO msg="Attached drive" firecracker=true vmmid=cnrul21t6e8i4vc1sf6g drive_path=/tmp/rootfs-cnrul21t6e8i4vc1sf6g.ext4 err="[PUT /drives/{drive_id}][204] putGuestDriveByIdNoContent "
time=2024-03-18T07:17:29.027Z level=INFO msg="Attaching NIC at index" firecracker=true vmmid=cnrul21t6e8i4vc1sf6g device_name=tap0 mac_addr=ca:7d:05:49:f0:b2 interface_id=1
time=2024-03-18T07:17:29.050Z level=INFO msg="startInstance successful" firecracker=true vmmid=cnrul21t6e8i4vc1sf6g err="[PUT /actions][204] createSyncActionNoContent "
time=2024-03-18T07:17:29.050Z level=INFO msg="Machine started" vmid=cnrul21t6e8i4vc1sf6g ip=192.168.127.2 gateway=192.168.127.1 netmask=ffffff00 hosttap=tap0 nats_host=192.168.127.1 nats_port=44253
time=2024-03-18T07:17:29.051Z level=INFO msg="SetMetadata successful" firecracker=true vmmid=cnrul21t6e8i4vc1sf6g
time=2024-03-18T07:17:29.051Z level=INFO msg="Adding new VM to warm pool" ip=192.168.127.2 vmid=cnrul21t6e8i4vc1sf6g
time=2024-03-18T07:17:29.358Z level=INFO msg="Called startVMM(), setting up a VMM" firecracker=true vmmid=cnrul29t6e8i4vc1sf70 socket_path=/tmp/.firecracker.sock-11865-cnrul29t6e8i4vc1sf70
time=2024-03-18T07:17:29.370Z level=INFO msg="VMM metrics disabled." firecracker=true vmmid=cnrul29t6e8i4vc1sf70
time=2024-03-18T07:17:29.371Z level=INFO msg=refreshMachineConfiguration firecracker=true vmmid=cnrul29t6e8i4vc1sf70 err="[GET /machine-config][200] getMachineConfigurationOK  &{CPUTemplate: MemSizeMib:0x40004d9398 Smt:0x40004d93a3 TrackDirtyPages:0x40004d93a4 VcpuCount:0x40004d9390}"
time=2024-03-18T07:17:29.371Z level=INFO msg=PutGuestBootSource firecracker=true vmmid=cnrul29t6e8i4vc1sf70 err="[PUT /boot-source][204] putGuestBootSourceNoContent "
time=2024-03-18T07:17:29.371Z level=INFO msg="Attaching drive" firecracker=true vmmid=cnrul29t6e8i4vc1sf70 drive_path=/tmp/rootfs-cnrul29t6e8i4vc1sf70.ext4 slot=1 root=true
time=2024-03-18T07:17:29.372Z level=INFO msg="Attached drive" firecracker=true vmmid=cnrul29t6e8i4vc1sf70 drive_path=/tmp/rootfs-cnrul29t6e8i4vc1sf70.ext4 err="[PUT /drives/{drive_id}][204] putGuestDriveByIdNoContent "
time=2024-03-18T07:17:29.372Z level=INFO msg="Attaching NIC at index" firecracker=true vmmid=cnrul29t6e8i4vc1sf70 device_name=tap0 mac_addr=4a:a8:e5:66:f8:a2 interface_id=1
time=2024-03-18T07:17:29.396Z level=INFO msg="startInstance successful" firecracker=true vmmid=cnrul29t6e8i4vc1sf70 err="[PUT /actions][204] createSyncActionNoContent "
time=2024-03-18T07:17:29.397Z level=INFO msg="Machine started" vmid=cnrul29t6e8i4vc1sf70 ip=192.168.127.3 gateway=192.168.127.1 netmask=ffffff00 hosttap=tap0 nats_host=192.168.127.1 nats_port=44253
time=2024-03-18T07:17:29.397Z level=INFO msg="SetMetadata successful" firecracker=true vmmid=cnrul29t6e8i4vc1sf70
time=2024-03-18T07:17:29.397Z level=INFO msg="Adding new VM to warm pool" ip=192.168.127.3 vmid=cnrul29t6e8i4vc1sf70
time=2024-03-18T07:17:29.701Z level=INFO msg="Called startVMM(), setting up a VMM" firecracker=true vmmid=cnrul29t6e8i4vc1sf7g socket_path=/tmp/.firecracker.sock-11865-cnrul29t6e8i4vc1sf7g
time=2024-03-18T07:17:29.713Z level=INFO msg="VMM metrics disabled." firecracker=true vmmid=cnrul29t6e8i4vc1sf7g
time=2024-03-18T07:17:29.714Z level=INFO msg=refreshMachineConfiguration firecracker=true vmmid=cnrul29t6e8i4vc1sf7g err="[GET /machine-config][200] getMachineConfigurationOK  &{CPUTemplate: MemSizeMib:0x4000408bb8 Smt:0x4000408bc3 TrackDirtyPages:0x4000408bc4 VcpuCount:0x4000408bb0}"
time=2024-03-18T07:17:29.714Z level=INFO msg=PutGuestBootSource firecracker=true vmmid=cnrul29t6e8i4vc1sf7g err="[PUT /boot-source][204] putGuestBootSourceNoContent "
time=2024-03-18T07:17:29.715Z level=INFO msg="Attaching drive" firecracker=true vmmid=cnrul29t6e8i4vc1sf7g drive_path=/tmp/rootfs-cnrul29t6e8i4vc1sf7g.ext4 slot=1 root=true
time=2024-03-18T07:17:29.715Z level=INFO msg="Attached drive" firecracker=true vmmid=cnrul29t6e8i4vc1sf7g drive_path=/tmp/rootfs-cnrul29t6e8i4vc1sf7g.ext4 err="[PUT /drives/{drive_id}][204] putGuestDriveByIdNoContent "
time=2024-03-18T07:17:29.715Z level=INFO msg="Attaching NIC at index" firecracker=true vmmid=cnrul29t6e8i4vc1sf7g device_name=tap0 mac_addr=36:88:7a:63:c5:62 interface_id=1
time=2024-03-18T07:17:29.740Z level=INFO msg="startInstance successful" firecracker=true vmmid=cnrul29t6e8i4vc1sf7g err="[PUT /actions][204] createSyncActionNoContent "
time=2024-03-18T07:17:29.740Z level=INFO msg="Machine started" vmid=cnrul29t6e8i4vc1sf7g ip=192.168.127.4 gateway=192.168.127.1 netmask=ffffff00 hosttap=tap0 nats_host=192.168.127.1 nats_port=44253
time=2024-03-18T07:17:29.741Z level=INFO msg="SetMetadata successful" firecracker=true vmmid=cnrul29t6e8i4vc1sf7g
time=2024-03-18T07:17:29.741Z level=INFO msg="Adding new VM to warm pool" ip=192.168.127.4 vmid=cnrul29t6e8i4vc1sf7g
time=2024-03-18T07:17:34.074Z level=ERROR msg="Did not receive NATS handshake from agent within timeout." vmid=cnrul21t6e8i4vc1sf6g
time=2024-03-18T07:17:34.074Z level=ERROR msg="First handshake failed, shutting down to avoid inconsistent behavior"

Problem:

time=2024-03-18T07:17:34.074Z level=ERROR msg="Did not receive NATS handshake from agent within timeout." vmid=cnrul21t6e8i4vc1sf6g
time=2024-03-18T07:17:34.074Z level=ERROR msg="First handshake failed, shutting down to avoid inconsistent behavior"

Expected behavior

I tested on a linux-amd64(x86_64) machine following the same steps and successfully set up the node.

admin@ip-x-x-x-x:~$ sudo nex node up
time=2024-03-18T07:32:19.250Z level=INFO msg="Generated keypair for node" public_key=NAAFGJURU5O2A7Y3LW4ADW46C3FGO2HQFQXR4ZDVUW4QOPCL5DSZ65D7
time=2024-03-18T07:32:19.251Z level=INFO msg="Loaded node configuration" config_path=./config.json
time=2024-03-18T07:32:19.251Z level=INFO msg="Initialized telemetry"
time=2024-03-18T07:32:19.252Z level=INFO msg="Established node NATS connection" servers=""
time=2024-03-18T07:32:19.462Z level=INFO msg="Internal NATS server started" client_url=nats://0.0.0.0:33653
time=2024-03-18T07:32:19.462Z level=INFO msg="Use this key as the recipient for encrypted run requests" public_xkey=XDSYXBPAWDSXFRNJWNLNCE3IGU2ZRCYZAMRDJETPVR5YNICZVBNQ5NDX
time=2024-03-18T07:32:19.462Z level=INFO msg="Virtual machine manager starting"
time=2024-03-18T07:32:19.462Z level=INFO msg="NATS execution engine awaiting commands" id=NAAFGJURU5O2A7Y3LW4ADW46C3FGO2HQFQXR4ZDVUW4QOPCL5DSZ65D7 version=0.1.5
time=2024-03-18T07:32:19.462Z level=INFO msg="Resetting network"
time=2024-03-18T07:32:19.462Z level=INFO msg="Publishing node started event"
time=2024-03-18T07:32:20.311Z level=INFO msg="Called startVMM(), setting up a VMM" firecracker=true vmmid=cnrus0v76f30gvd23590 socket_path=/tmp/.firecracker.sock-1113-cnrus0v76f30gvd23590
time=2024-03-18T07:32:20.358Z level=INFO msg="VMM metrics disabled." firecracker=true vmmid=cnrus0v76f30gvd23590
time=2024-03-18T07:32:20.359Z level=INFO msg=refreshMachineConfiguration firecracker=true vmmid=cnrus0v76f30gvd23590 err="[GET /machine-config][200] getMachineConfigurationOK  &{CPUTemplate: MemSizeMib:0xc00040cb38 Smt:0xc00040cb43 TrackDirtyPages:0xc00040cb44 VcpuCount:0xc00040cb30}"
time=2024-03-18T07:32:20.359Z level=INFO msg=PutGuestBootSource firecracker=true vmmid=cnrus0v76f30gvd23590 err="[PUT /boot-source][204] putGuestBootSourceNoContent "
time=2024-03-18T07:32:20.359Z level=INFO msg="Attaching drive" firecracker=true vmmid=cnrus0v76f30gvd23590 drive_path=/tmp/rootfs-cnrus0v76f30gvd23590.ext4 slot=1 root=true
time=2024-03-18T07:32:20.360Z level=INFO msg="Attached drive" firecracker=true vmmid=cnrus0v76f30gvd23590 drive_path=/tmp/rootfs-cnrus0v76f30gvd23590.ext4 err="[PUT /drives/{drive_id}][204] putGuestDriveByIdNoContent "
time=2024-03-18T07:32:20.360Z level=INFO msg="Attaching NIC at index" firecracker=true vmmid=cnrus0v76f30gvd23590 device_name=tap0 mac_addr=16:ee:e2:d1:5f:37 interface_id=1
time=2024-03-18T07:32:20.432Z level=INFO msg="startInstance successful" firecracker=true vmmid=cnrus0v76f30gvd23590 err="[PUT /actions][204] createSyncActionNoContent "
time=2024-03-18T07:32:20.432Z level=INFO msg="Machine started" vmid=cnrus0v76f30gvd23590 ip=192.168.127.2 gateway=192.168.127.1 netmask=ffffff00 hosttap=tap0 nats_host=192.168.127.1 nats_port=33653
time=2024-03-18T07:32:20.433Z level=INFO msg="SetMetadata successful" firecracker=true vmmid=cnrus0v76f30gvd23590
time=2024-03-18T07:32:20.433Z level=INFO msg="Adding new VM to warm pool" ip=192.168.127.2 vmid=cnrus0v76f30gvd23590
time=2024-03-18T07:32:20.678Z level=INFO msg="Called startVMM(), setting up a VMM" firecracker=true vmmid=cnrus1776f30gvd2359g socket_path=/tmp/.firecracker.sock-1113-cnrus1776f30gvd2359g
time=2024-03-18T07:32:20.690Z level=INFO msg="VMM metrics disabled." firecracker=true vmmid=cnrus1776f30gvd2359g
time=2024-03-18T07:32:20.690Z level=INFO msg=refreshMachineConfiguration firecracker=true vmmid=cnrus1776f30gvd2359g err="[GET /machine-config][200] getMachineConfigurationOK  &{CPUTemplate: MemSizeMib:0xc000015068 Smt:0xc000015073 TrackDirtyPages:0xc000015074 VcpuCount:0xc000015060}"
time=2024-03-18T07:32:20.691Z level=INFO msg=PutGuestBootSource firecracker=true vmmid=cnrus1776f30gvd2359g err="[PUT /boot-source][204] putGuestBootSourceNoContent "
time=2024-03-18T07:32:20.691Z level=INFO msg="Attaching drive" firecracker=true vmmid=cnrus1776f30gvd2359g drive_path=/tmp/rootfs-cnrus1776f30gvd2359g.ext4 slot=1 root=true
time=2024-03-18T07:32:20.691Z level=INFO msg="Attached drive" firecracker=true vmmid=cnrus1776f30gvd2359g drive_path=/tmp/rootfs-cnrus1776f30gvd2359g.ext4 err="[PUT /drives/{drive_id}][204] putGuestDriveByIdNoContent "
time=2024-03-18T07:32:20.691Z level=INFO msg="Attaching NIC at index" firecracker=true vmmid=cnrus1776f30gvd2359g device_name=tap0 mac_addr=26:48:5a:18:8f:de interface_id=1
time=2024-03-18T07:32:20.765Z level=INFO msg="startInstance successful" firecracker=true vmmid=cnrus1776f30gvd2359g err="[PUT /actions][204] createSyncActionNoContent "
time=2024-03-18T07:32:20.765Z level=INFO msg="Machine started" vmid=cnrus1776f30gvd2359g ip=192.168.127.3 gateway=192.168.127.1 netmask=ffffff00 hosttap=tap0 nats_host=192.168.127.1 nats_port=33653
time=2024-03-18T07:32:20.765Z level=INFO msg="SetMetadata successful" firecracker=true vmmid=cnrus1776f30gvd2359g
time=2024-03-18T07:32:20.765Z level=INFO msg="Adding new VM to warm pool" ip=192.168.127.3 vmid=cnrus1776f30gvd2359g
time=2024-03-18T07:32:21.018Z level=INFO msg="Called startVMM(), setting up a VMM" firecracker=true vmmid=cnrus1776f30gvd235a0 socket_path=/tmp/.firecracker.sock-1113-cnrus1776f30gvd235a0
time=2024-03-18T07:32:21.030Z level=INFO msg="VMM metrics disabled." firecracker=true vmmid=cnrus1776f30gvd235a0
time=2024-03-18T07:32:21.031Z level=INFO msg=refreshMachineConfiguration firecracker=true vmmid=cnrus1776f30gvd235a0 err="[GET /machine-config][200] getMachineConfigurationOK  &{CPUTemplate: MemSizeMib:0xc0001222c8 Smt:0xc0001222d3 TrackDirtyPages:0xc0001222d4 VcpuCount:0xc0001222c0}"
time=2024-03-18T07:32:21.031Z level=INFO msg=PutGuestBootSource firecracker=true vmmid=cnrus1776f30gvd235a0 err="[PUT /boot-source][204] putGuestBootSourceNoContent "
time=2024-03-18T07:32:21.031Z level=INFO msg="Attaching drive" firecracker=true vmmid=cnrus1776f30gvd235a0 drive_path=/tmp/rootfs-cnrus1776f30gvd235a0.ext4 slot=1 root=true
time=2024-03-18T07:32:21.032Z level=INFO msg="Attached drive" firecracker=true vmmid=cnrus1776f30gvd235a0 drive_path=/tmp/rootfs-cnrus1776f30gvd235a0.ext4 err="[PUT /drives/{drive_id}][204] putGuestDriveByIdNoContent "
time=2024-03-18T07:32:21.032Z level=INFO msg="Attaching NIC at index" firecracker=true vmmid=cnrus1776f30gvd235a0 device_name=tap0 mac_addr=76:a8:99:c5:0c:e6 interface_id=1
time=2024-03-18T07:32:21.095Z level=INFO msg="startInstance successful" firecracker=true vmmid=cnrus1776f30gvd235a0 err="[PUT /actions][204] createSyncActionNoContent "
time=2024-03-18T07:32:21.096Z level=INFO msg="Machine started" vmid=cnrus1776f30gvd235a0 ip=192.168.127.4 gateway=192.168.127.1 netmask=ffffff00 hosttap=tap0 nats_host=192.168.127.1 nats_port=33653
time=2024-03-18T07:32:21.096Z level=INFO msg="SetMetadata successful" firecracker=true vmmid=cnrus1776f30gvd235a0
time=2024-03-18T07:32:21.096Z level=INFO msg="Adding new VM to warm pool" ip=192.168.127.4 vmid=cnrus1776f30gvd235a0
time=2024-03-18T07:32:22.195Z level=INFO msg="Received agent handshake" vmid=cnrus0v76f30gvd23590 message="Host-supplied metadata"
time=2024-03-18T07:32:22.554Z level=INFO msg="Received agent handshake" vmid=cnrus1776f30gvd2359g message="Host-supplied metadata"
time=2024-03-18T07:32:22.857Z level=INFO msg="Received agent handshake" vmid=cnrus1776f30gvd235a0 message="Host-supplied metadata"

Nex and NATS version

admin@ip-x-x-x-x:~$ nats-server --version
nats-server: v2.10.12
admin@ip-x-x-x-x:~$ nex --version
v0.1.5 [567f45b355e71d0a186961a8aaca95d7c3a7fd3b] | Built-on: 2024-03-14T19:45:23Z

Host environment

Host: AWS a1.metal
vCPUs: 16
Memory: 32GiB
OS: debian-12
CPU Arch: arm64

admin@ip-x-x-x-x:~$ lsb_release -a

No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 12 (bookworm)
Release:        12
Codename:       bookworm

admin@ip-x-x-x-x:~$ lscpu

Architecture:           aarch64
  CPU op-mode(s):       32-bit, 64-bit
  Byte Order:           Little Endian
CPU(s):                 16
  On-line CPU(s) list:  0-15
Vendor ID:              ARM
  Model name:           Cortex-A72
    Model:              3
    Thread(s) per core: 1
    Core(s) per socket: 4
    Socket(s):          4
    Stepping:           r0p3
    BogoMIPS:           166.66
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
Caches (sum of all):
  L1d:                  512 KiB (16 instances)
  L1i:                  768 KiB (16 instances)
  L2:                   8 MiB (4 instances)
NUMA:
  NUMA node(s):         1
  NUMA node0 CPU(s):    0-15
Vulnerabilities:
  Gather data sampling: Not affected
  Itlb multihit:        Not affected
  L1tf:                 Not affected
  Mds:                  Not affected
  Meltdown:             Not affected
  Mmio stale data:      Not affected
  Retbleed:             Not affected
  Spec rstack overflow: Not affected
  Spec store bypass:    Not affected
  Spectre v1:           Mitigation; __user pointer sanitization
  Spectre v2:           Mitigation; Branch predictor hardening, BHB
  Srbds:                Not affected
  Tsx async abort:      Not affected

Steps to reproduce

No response

Badges

This is just to discuss what other, if any, badges we want to add the the README. At the moment, all the have is the status of the two actions. I would like some sort of code coverage indicator to show the platform is (er, uh, will be) well tested

Support a different runtime for Windows and Macos.

Proposed change

Many of us are on corporate Windows or MacOS computers, with no way to run firecracker easily. For people in this situation it would be great if nex could also run on top of another execution engine such as, e.g, podman, with podman desktop.

Use case

Developers on corporate Windows of MacOS machines who want to test nex locally.

Contribution

Not software but if we havet his, then I can probably get my employer involved for buying a subscription from Synadia.

Emit `function_executed` events

Proposed change

When a function is triggered, we should emit an event indicating the completion status of the function:

success or failure
error message if we have one
elapsed execution time

Use case

Monitoring tools (other than prometheus/otel) would like to be able to observe execution counts, execution time, and failure rates

Contribution

No response

Preflight checks fail when directories are not present

Observed behavior

When using the simple.json preflight checks, the command fails if the directories are not already present.

Expected behavior

Preflight checks should create any required nested directories.

Nex and NATS version

root@nats-client:~# nex --version
vdevelopment [] | Built-on:

Host environment

Ubuntu 23.10 x64

Steps to reproduce

I'm using a clean Ubuntu 23.10 x64 droplet in Digital Ocean, with golang and the nats CLI installed.

Install nex cli
Save https://github.com/synadia-io/nex/blob/main/examples/nodeconfigs/simple.json to your local directory

root@nats-client:~# nex node preflight --config simple.json 
Validating - Required CNI Plugins [/opt/cni/bin]
	⛔ Missing Dependency - /opt/cni/bin/host-local [host-local CNI plugin]
	⛔ Missing Dependency - /opt/cni/bin/ptp [ptp CNI plugin]
	⛔ Missing Dependency - /opt/cni/bin/tc-redirect-tap [tc-redirect-tap CNI plugin]

Validating - Required binaries [/usr/local/bin]
	⛔ Missing Dependency - /usr/local/bin/firecracker [Firecracker VM binary]

Validating - CNI configuration requirements [/etc/cni/conf.d]
	⛔ Missing Dependency - /etc/cni/conf.d/fcnet.conflist [CNI Configuration]

Validating - User provided files []
	⛔ Missing Dependency - /tmp/wd/vmlinux [VMLinux Kernel]
	⛔ Missing Dependency - /tmp/wd/rootfs.ext4 [Root Filesystem Template]

⛔ You are missing required dependencies for [Required CNI Plugins], do you want to install? [y/N] y
open /opt/cni/bin/ptp: no such file or directory
panic: preflight checks failed: open /opt/cni/bin/ptp: no such file or directory

goroutine 1 [running]:
github.com/synadia-io/nex/internal/node.CmdPreflight(0x1114db0?, 0x1784a80, {0xe58680?, 0xc0003fab40?}, 0xc0003fab40?, 0xc00046c5f0?)
	/root/go/pkg/mod/github.com/synadia-io/[email protected]/internal/node/init.go:140 +0x96
main.RunNodePreflight(0x40e44c?)
	/root/go/pkg/mod/github.com/synadia-io/[email protected]/nex/node_linux_amd64.go:33 +0x179
github.com/choria-io/fisk.(*actionMixin).applyActions(...)
	/root/go/pkg/mod/github.com/choria-io/[email protected]/actions.go:28
github.com/choria-io/fisk.(*Application).applyActions(0xc00047f400?, 0xc00019e000)
	/root/go/pkg/mod/github.com/choria-io/[email protected]/app.go:823 +0xdf
github.com/choria-io/fisk.(*Application).execute(0xc00047f400, 0xc00019e000?, {0xc0003fab80, 0x2, 0x2})
	/root/go/pkg/mod/github.com/choria-io/[email protected]/app.go:609 +0x65
github.com/choria-io/fisk.(*Application).Parse(0xc00047f400, {0xc000100060?, 0xc0004eb650?, 0x30?})
	/root/go/pkg/mod/github.com/choria-io/[email protected]/app.go:272 +0x16c
github.com/choria-io/fisk.(*Application).MustParseWithUsage(0xc00047f400, {0xc000100060, 0x4, 0x4})
	/root/go/pkg/mod/github.com/choria-io/[email protected]/app.go:889 +0x45
main.main()
	/root/go/pkg/mod/github.com/synadia-io/[email protected]/nex/nex.go:94 +0x41f7

If you run mkdir /opt/cni/bin and run again, the CNI Plugins step succeeds but then fails on CNI configuration requirements. After resolving that issue, it fails again at User provided files. After creating all required directories, preflight succeeds.

Condense NodeConfiguration and NodeOptions

Proposed change

The two structs are almost identical

Use case

This will allow us to drop a config at preflight

Contribution

No response

Visualized end to end open telemetry spans

Proposed change

Whether it's a set of instructions or whether it's through a docker compose, we should have an "easy button" for people to be able to capture and see the open telemetry spans that we're collection

Use case

Observability shouldn't be an afterthought

Contribution

No response

Support OTLP exporting for metrics

Proposed change

We should be able to export to an OTLP collector rather than exposing an inbound HTTP Prometheus scrape URL like we're doing now. Firstly, we should be using OTLP instead of something vendor-specific like Prometheus. Secondly, we should support exporting because it's typically easier in secure production environments than supporting inbound HTTP

Use case

I want to be able to export metrics and traces to OTLP collector(s)

Contribution

No response

Audit codebase for `len(strval) == 0` and `strval == ""` vs `strval == nil`

Using default value "" for string makes it difficult to determine if clients are providing "garbage in" -- it feels more idiomatic to use *string and having a basic utility method such as:

// returns the given string or nil when empty
func StringOrNil(str string) *string {
	if str == "" {
		return nil
	}
	return &str
}

A good chore for the near future is to audit and refactor the codebase to (i) use *string in struct types, (ii) implement the above StringOrNil utility method and (iii) dereference string pointers where necessary.

Use ginkgo as BDD test harness

Setup ginkgo and code coverage tooling.

Bring Custom RootFS Generation to CLI

Proposed change

Right now, to build a custom rootfs, or even develop on the agent, one must use scripts nested in the repo and is limited to x86. This proposal brings that funcationality so nex rootfs ... or something similar.

Use case

Better devx around the rootfs creation

Contribution

Will contribute over the next week

Emit heartbeats

Proposed change

Right now the only way to observe the state of a Nexus (group of Nex nodes) is to monitor events. If one of those events is lost/missed, then watchers need to make another round of ping requests. It would be easier if nex nodes emitted regular heartbeats with critical information on them so the state of a Nexus can be observed.

Use case

Receive periodic heartbeats containing node status as a supplement to the "started", "stopped", etc events.

Contribution

No response

msg="failed to devrun workload" err="nats: no responders available for request"

Observed behavior

nats-server
[128423] 2024/03/01 16:16:19.229992 [INF] Starting nats-server
[128423] 2024/03/01 16:16:19.230314 [INF]   Version:  2.10.11
[128423] 2024/03/01 16:16:19.230323 [INF]   Git:      [5fc6051]
[128423] 2024/03/01 16:16:19.230329 [INF]   Name:     NA3UPF2KFNGQWU7KKHG4ESWKGIO2XGZYVYOHMXC3LKX4Y4A37ECR7E2V
[128423] 2024/03/01 16:16:19.230349 [INF]   ID:       NA3UPF2KFNGQWU7KKHG4ESWKGIO2XGZYVYOHMXC3LKX4Y4A37ECR7E2V
[128423] 2024/03/01 16:16:19.232238 [INF] Listening for client connections on 0.0.0.0:4222
[128423] 2024/03/01 16:16:19.232916 [INF] Server is ready

sudo nex node preflight --config=./simple.json
[sudo] password for user: 
Validating - Required CNI Plugins
	  🔎Searching - /opt/cni/bin 
	  ✅ Dependency Satisfied - /opt/cni/bin/host-local [host-local CNI plugin]
	  ✅ Dependency Satisfied - /opt/cni/bin/ptp [ptp CNI plugin]
	  ✅ Dependency Satisfied - /opt/cni/bin/tc-redirect-tap [tc-redirect-tap CNI plugin]

Validating - Required binaries
	  🔎Searching - /usr/local/bin 
	  ✅ Dependency Satisfied - /usr/local/bin/firecracker [Firecracker VM binary]

Validating - CNI configuration requirements
	  🔎Searching - /etc/cni/conf.d 
	  ✅ Dependency Satisfied - /etc/cni/conf.d/fcnet.conflist [CNI Configuration]

Validating - VMLinux Kernel
	  ✅ Dependency Satisfied - /path/to/vmlinux-5.10 [VMLinux Kernel]

Validating - Root Filesystem Template
	  ✅ Dependency Satisfied - /path/to/rootfs.ext4 [Root Filesystem Template]

user@userbn:~/Desktop/nats$ sudo nex node up --config=./simple.json
time=2024-03-01T16:36:30.686+03:00 level=INFO msg="Generated keypair for node" public_key=NDDCGXKZ5Z3P43KDTMNPLXZYBCCL7REOXYZ6GMQKBWHWQDLNCZSMMXRC
time=2024-03-01T16:36:30.686+03:00 level=INFO msg="Loaded node configuration" config_path=./simple.json
time=2024-03-01T16:36:30.686+03:00 level=INFO msg="Initialized telemetry"
time=2024-03-01T16:36:30.689+03:00 level=INFO msg="Established node NATS connection" servers=""
time=2024-03-01T16:36:30.702+03:00 level=INFO msg="Internal NATS server started" client_url=nats://0.0.0.0:38009
time=2024-03-01T16:36:30.702+03:00 level=INFO msg="Use this key as the recipient for encrypted run requests" public_xkey=XDEHXWJMQ6ZCLFRFKSWLX2BQ7MLCUCYZGEQAPAJL7VMZUZK2JPICVKCT
time=2024-03-01T16:36:30.702+03:00 level=INFO msg="NATS execution engine awaiting commands" id=NDDCGXKZ5Z3P43KDTMNPLXZYBCCL7REOXYZ6GMQKBWHWQDLNCZSMMXRC version=0.1.4
time=2024-03-01T16:36:30.702+03:00 level=INFO msg="Virtual machine manager starting"
time=2024-03-01T16:36:30.702+03:00 level=INFO msg="Resetting network"
time=2024-03-01T16:36:30.703+03:00 level=INFO msg="Publishing node started event"
time=2024-03-01T16:36:31.194+03:00 level=INFO msg="Called startVMM(), setting up a VMM" firecracker=true vmmid=cngtjngpd7rvbdfklck0 socket_path=/tmp/.firecracker.sock-129425-cngtjngpd7rvbdfklck0
time=2024-03-01T16:36:31.208+03:00 level=INFO msg="VMM metrics disabled." firecracker=true vmmid=cngtjngpd7rvbdfklck0
time=2024-03-01T16:36:31.209+03:00 level=INFO msg=refreshMachineConfiguration firecracker=true vmmid=cngtjngpd7rvbdfklck0 err="[GET /machine-config][200] getMachineConfigurationOK  &{CPUTemplate: MemSizeMib:0xc0003574f8 Smt:0xc000357503 TrackDirtyPages:0xc000357504 VcpuCount:0xc0003574f0}"
time=2024-03-01T16:36:31.210+03:00 level=INFO msg=PutGuestBootSource firecracker=true vmmid=cngtjngpd7rvbdfklck0 err="[PUT /boot-source][204] putGuestBootSourceNoContent "
time=2024-03-01T16:36:31.210+03:00 level=INFO msg="Attaching drive" firecracker=true vmmid=cngtjngpd7rvbdfklck0 drive_path=/tmp/rootfs-cngtjngpd7rvbdfklck0.ext4 slot=1 root=true
time=2024-03-01T16:36:31.211+03:00 level=INFO msg="Attached drive" firecracker=true vmmid=cngtjngpd7rvbdfklck0 drive_path=/tmp/rootfs-cngtjngpd7rvbdfklck0.ext4 err="[PUT /drives/{drive_id}][204] putGuestDriveByIdNoContent "
time=2024-03-01T16:36:31.211+03:00 level=INFO msg="Attaching NIC at index" firecracker=true vmmid=cngtjngpd7rvbdfklck0 device_name=tap0 mac_addr=ea:a3:32:2f:7b:4e interface_id=1
time=2024-03-01T16:36:31.369+03:00 level=INFO msg="startInstance successful" firecracker=true vmmid=cngtjngpd7rvbdfklck0 err="[PUT /actions][204] createSyncActionNoContent "
time=2024-03-01T16:36:31.369+03:00 level=INFO msg="Machine started" vmid=cngtjngpd7rvbdfklck0 ip=192.168.127.2 gateway=192.168.127.1 netmask=ffffff00 hosttap=tap0 nats_host=192.168.127.1 nats_port=38009
time=2024-03-01T16:36:31.370+03:00 level=INFO msg="SetMetadata successful" firecracker=true vmmid=cngtjngpd7rvbdfklck0
time=2024-03-01T16:36:31.370+03:00 level=INFO msg="Adding new VM to warm pool" ip=192.168.127.2 vmid=cngtjngpd7rvbdfklck0
time=2024-03-01T16:36:33.246+03:00 level=INFO msg="Received agent handshake" vmid=cngtjngpd7rvbdfklck0 message="Host-supplied metadata"


nex node info NDDCGXKZ5Z3P43KDTMNPLXZYBCCL7REOXYZ6GMQKBWHWQDLNCZSMMXRC
NEX Node Information

       Node: NDDCGXKZ5Z3P43KDTMNPLXZYBCCL7REOXYZ6GMQKBWHWQDLNCZSMMXRC
       Xkey: XDEHXWJMQ6ZCLFRFKSWLX2BQ7MLCUCYZGEQAPAJL7VMZUZK2JPICVKCT
    Version: 0.1.4
     Uptime: 16m18s
       Tags: nex.arch=amd64, nex.cpucount=16, nex.os=linux, simple=true

Memory in kB:

         Free: 435,036
    Available: 12,001,120
        Total: 16,365,564

Problem:

nex devrun ./echoservice/echoservice nats_url="nats://192.168.127.1:4222"
Reusing existing issuer account key: /home/user/.nex/issuer.nk
Reusing existing publisher xkey: /home/user/.nex/publisher.xk
time=2024-03-01T16:57:20.172+03:00 level=ERROR msg="failed to devrun workload" err="nats: no responders available for request"

Expected behavior

$ nex devrun ../examples/echoservice/echoservice nats_url=nats://192.168.127.1:4222
Reusing existing issuer account key: /home/kevin/.nex/issuer.nk
Reusing existing publisher xkey: /home/kevin/.nex/publisher.xk
🚀 Workload 'echoservice' accepted. You can now refer to this workload with ID: cmji29n52omrb71g07a0 on node NBS3Y3NWXLTFNC73XMVD6USFJF2H5QXTLEJQNOPEBPYDUDVB5YYYZOGI

Nex and NATS version

user@userbn:~$ nats-server --version
nats-server: v2.10.11
user@userbn:~$ nex --version
v0.1.4 [7a6e9b3a5d68199fddc85c8c2b880adc94b7412c] | Built-on: 2024-02-23T18:36:07Z

Host environment

lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.4 LTS
Release:	22.04
Codename:	jammy

lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  16
  On-line CPU(s) list:   0-15
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz
    CPU family:          6
    Model:               79
    Thread(s) per core:  1
    Core(s) per socket:  4
    Socket(s):           4
    Stepping:            1
    BogoMIPS:            4394.90
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse 
                         sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpu
                         id tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcn
                         t tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fau
                         lt invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsb
                         ase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat umip ar
                         ch_capabilities
Virtualization features: 
  Virtualization:        VT-x
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):     
  L1d:                   512 KiB (16 instances)
  L1i:                   512 KiB (16 instances)
  L2:                    64 MiB (16 instances)
  L3:                    64 MiB (4 instances)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-15
Vulnerabilities:         
  Itlb multihit:         Not affected
  L1tf:                  Mitigation; PTE Inversion; VMX flush not necessary, SMT disabled
  Mds:                   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Meltdown:              Mitigation; PTI
  Mmio stale data:       Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS N
                         ot affected
  Srbds:                 Not affected
  Tsx async abort:       Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown

Steps to reproduce

No response

Create Trace Logging Level

          May be useful to support `trace` level here also.

Originally posted by @kthomas in #82 (comment)

Regression: Ping now returns an incorrect workload count

Observed behavior

As soon as a nex node starts, the results of ping will report an incorrect workload count. For example, with the default pool size of 1, ping will report that there is 1 workload running. However, when you query via info, you get an accurate list of running machines (which is none in this case)

Expected behavior

When we query a newly running node, it should report 0 workloads. It should continue to report an accurate count based on what is and is not actually running, so it should take into account machine failures and terminations as well.

We're not currently testing our API endpoints, so we should make sure to add spec for this so that we can assert that the control API behaves properly

Nex and NATS version

main

Host environment

Ubuntu 20.10, x86_64

Steps to reproduce

Just start a node and then ping it or do nex node ls

Implement lame duck mode

Proposed change

A node can be placed into lame duck mode by a remote request. This lame duck mode would:

reject all new requests for workload start
remove the essential label from all applicable running workloads
publish a lame_duck_entered event
all subsequent heartbeats would include the lame_duck flag

Once in lame duck mode, a node cannot be "un-lame-duck'ed". The only acceptable state after lame duck is termination.

Use case

We may want to prepare a node for maintenance, update, pending network partition, or maybe just a particularly long graceful shutdown. Implementing a request to enter lame duck mode would accomplish this.

Contribution

No response

Preflight Known good binaries

Have preflight check for these binaries and install if not present

Kernel: https://s3.amazonaws.com/spec.ccfc.min/firecracker-ci/v1.5/x86_64/vmlinux-5.10.186
Plugins: https://github.com/containernetworking/plugins/releases/download/v1.3.0/cni-plugins-linux-arm-v1.3.0.tgz
Plugins SHA: https://github.com/containernetworking/plugins/releases/download/v1.3.0/cni-plugins-linux-arm-v1.3.0.tgz.sha256
tc-redirect-tap: https://github.com/awslabs/tc-redirect-tap (will need to build 😫 )

Couple rootfs with agent

We need to rebuild the rootfs when the agent changes, we will do this in CI and add the rootfs as a artifact, but it would also be nice to have some more fluent devex around this for motivated engineers that are hacking away at the agent.

Nex CLI should be aware of nats context

Proposed change

When the Nex CLI is performing a task that requires a connection to the NATS server, we should be able to detect if a nats context is available and selected and even potentially offer the ability to override the nats context.

Use case

If someone isn't using the default NATS URL for their development server, then they have to add (at least) --server to all of their nex commands. We should make it much easier and detect/reuse the nats context.

Contribution

No response

Slog through the logging with SLOG

Migrate to log/slog before we get to the point of no return.

node up should always start the first vm on the first available IP in the CIDR

Right now node up just continues allocating IP addresses. In production it won't take long for the IPs to run out.

Note that this behavior will happen regardless of how many times the node is started. Basically after 254 VMs from a fresh start, we're going to run out. So I see the requirements as follows:

Start the node fresh with an empty /var/lib/cni so the first VM gets the .2 address
Before putting the 253rd VM into the pool, we should reset /var/lib/cni so that the next VM gets .2

Update README

Proposed change

Need to remove go install method of installation (while replace stanza in go.mod) and provide better solution.

Use case

Better DevEx

Contribution

No response

devrun fails if issuer.nk doesnt exist

if $HOME/.nex/issuer.nk doesnt already exist, devrun will fail with a no such file error

All i needed to do was touch the file and the next time I ran nex, is created the issuer and publisher nkeys correctly.

Probably as simple as stat-ing the file before trying to open it...will look into it tomorrow

Expose open telemetry metrics

Proposed change

Follow the same pattern(s) as the NATS server (where/if appropriate) to allow Nex nodes to emit (or be scraped for) metrics. The ability to export should be considered the primary goal (OTLP) while supporting scraping should only be done if we get that for free or "for cheap" when we're setting up the metrics and exporting. We shouldn't use any specific vendor libraries like Jaeger or Prometheus, everything should be otel/OTLP standards.

Some gauges that I think we'll want ("deployed" includes all workload types, incl. functions):

deployed workload count
deployed workload count in namespace
deployed bytes count (size of total deployed workloads)
deployed bytes count in namespace
allocated memory total (based on firecracker machine config)
allocated memory in namespace
allocated vcpu count (based on fc vm config)
allocated vcpu count in namespace

Counters:

function trigger count total
function trigger count for workload
function trigger count for namespace
function failure count total
function failure count for workload
function failure count for namespace
function execution elapsed time
function execution elapsed time for namespace
total function execution time across all funcs

Use case

Operators managing a fleet of Nex nodes will need this feature

Contribution

No response

Implement and support OpenTelemetry spans

Proposed change

We should implement the core functionality that allows for the nex node to emit OpenTelemetry spans. On top of that, we should keep track of function execution as a span.

Use case

As a Nex developer or operator, I want to be able to track activity through my own OpenTelemetry infrastructure. The primary desire is that I can track function executions and correlate all of the host service calls that each function makes.

So, for example, a "span" might look something like this:

Triggered
- Host Service Call - KV Get
- Host Service Call - KV Put
- Host Service Call - Message Publish
Finished

Contribution

No response

Nex tries to deploy workload into a VM that hasn't reported a handshake

Observed behavior

Start a nex node and have the agent within the first VM not report a handshake (this is pretty difficult to reproduce). A request to deploy a workload comes in and the node will attempt to deploy into that VM, and since the VM is effectively dead, the result will be a "nats: timeout"

Expected behavior

The Nex node should abort and terminate if an agent fails to handshake. Agents failing to handshake is an indication of instability that is enough to warrant stopping the node. A missing handshake used to stop the node and no longer does, so this is a regression.

This should then never allow the situation where we attempt to deploy into a VM that hasn't given a handshake

Nex and NATS version

Latest release of both

Host environment

No response

Steps to reproduce

No response

Support per-workload-request machine configuration overrides

Proposed change

If a workload request contains a machine configuration override field, then the workload will use that override and not the machine configuration template that was specified via the --config option when the nex node was started (nex node up --config ....).

Use case

When applications have more specific requirements , such as controlling inbound TCP port access, and those requirements don't apply to all firecracker VMs on a given node, then we want the ability to specify per-machine configuration. For example, if we want all instances of a given workload, let's say alice-service to listen on port 8080 on the node host versus listening on the firecracker IP (e.g. 192.168.127.x:8080), then a configuration that contains the use of the CNI portmap plugin needs to be specified, and likely only applies to a given workload and not all workloads on that node.

This should be labeled as an advanced feature and the Nex node itself should emit a warning whenever a per-workload machine configuration override is included in a deployment request. The use of this feature will also impact startup performance. Since the workload deployment request is for a bespoke machine configuration, we can't reuse one of the "warm" VMs in the firecracker pool, and it has to be started specifically for this request. For more workloads this startup penalty is negligible, but it's definitely something an operator needs to be aware of.

The warning for using this feature should also remind developers that whatever validation was performed during the most recent preflight check does not apply to this workload. So we can't pre-verify that the workload configuration override is referring to real, accessible linux kernel and rootfs images or installed CNI plugins.

In short, this feature would be "use at your own risk"

Contribution

No response

Support terminal interactive mode when running a node server

Proposed change

When we run node up, we should be able to specify that we want to run that in interactive mode. The default should continue to operate the way it does now where it's just emitting logs to stdout (e.g. the production-ready mode).

We should be able to opt into interactive mode (e.g. with --interactive flag on node up). This would give us a full screen tui that lets us navigate around the various control operations for that specific node, as well as view the logs and other important information.

Use case

When developing workloads and functions for Nex, I want to be able to have more easy interaction with my development node server. I imagine myself being able to use arrow keys to navigate through the list of namespaces in use and the running workloads within those, and to be able to view and tweak whatever configuration I might like.
This lets me have my code in one window and an easy and fun to use terminal window for the node.

Contribution

No response

ARM Support

Proposed change

arm architecture support is required to use in A1.metal,
the cheapest bare metal environment that does not use x86_64.

Use case

aws ec2 a1.metal

Contribution

No response

Add host services tracing spans

Proposed change

Right now the open telemetry tracing support is limited to just the execution trigger handler on functions. We want to expand this so that we can create child spans that maintain the appropriate parent for all of the host services, including kv, object store, and messaging.

Use case

We want to see child spans every time a function utilizes a host service capability

Contribution

No response

Support configurable and overridable limits and boundaries for nodes and workloads

In the current implementation, the only time you can specify things like rate limiters is in the machineconfig template that is defined in the node configuration (which is really just a firecracker rate limiter JSON definition). We should be able to provide more constraints around the node limits as well as be able to override some or all of the machineconfig template in the individual workload.

Note when we do allow overriding the firecracker constraints on a per-workload basis, we will have to document the caveat that if you override the machine config template for a given workload, that workload will be started from scratch, and not use a warm VM from the pool.

Incomplete list of Node limitations we should probably support:

Max machines deployed
Max total bytes of all workloads deployed
Max bytes of individual workloads
Max aggregate memory allocated to machines
Max total CPU cores allocated to machines

The above limitations can be specified either as per node (within a single nex process) or per namespace. In the case of the latter, e.g. the max machines deployed number will be enforced per namespace rather than for the entire node as a whole.

Overridable Limits that can be supplied with an individual workload:

CPU cores
Memory
Rate limiters as supported by firecracker

When we emit the workload_deployed event, we should include the effective limitations, which should be representative of the actual limitations used in the firecracker definition for that VM.

nex-node preflight should download well-known dependencies

In the current version, nex-node preflight will tell you when you're missing the CNI plugins, your kernel, and your rootfs. We should publish "known good" versions of the CNI plugins, a linux kernel, and a default rootfs so that preflight can download those all to help people out and potentially make it easier for ops teams to manage a nex-node fleet.

Bonus: preflight could even check for newer versions of nex-node itself and update automatically if the user wants.

Consolidate 5 mini robots into a single super robot

Since we're already in a monorepo and there's already a lot of shared code between the nex node and the nex CLI, we'd like to consolidate them.

New single binary will be in the nex-cli folder
All of the node functionality will be in a node sub-command, e.g. nex node up and nex node preflight
Make sure the awesome ASCII art is still visible in the help

nex-ui web/dist folder generated dynamically

Right now the web/dist folder is being generated locally and committed. I think a better way to do this might be to leave that directory uncommitted and run pnpm install and pnpm build in the nex-ui/web directory prior to building the binary.

OCI roadmap

Proposed change

Limited-environment languages such as Cloudflare Workers are suitable for the needs of rapid prototyping like startups, but realistically, they are difficult to adopt in companies.
If you can run an image of a personally created service on the NATS Edge
I think it will be a really good execution engine.
I would like to ask if there is a plan for priorities.
(This sentence was written using a translator, so there are some awkward parts, so please understand.)

Use case

run an image created at nats edge

Contribution

No response

Allow starting a userland NATS server

Proposed change

The Nex binary already embeds an internal NATS server for private use. It would be cool if we could also have the Nex server start a "regular" (aka userland) NATS server by passing a configuration file.

Use case

When developing locally it might be more convenient if the NATS node started its own NATS server. Also, on constrained devices it might be easier to just deploy the nex binary rather than both nex and nats

Contribution

No response

Change uptime to runtime and modify calculation to support functions

Current implementation uses the difference between now and the original deploy/launch time to determine the uptime value for a given workload. This value has the potential to confuse people. By changing to runtime, that value should now have the same meaning for both functions and services/elf, though calculated differently.

For services - the value will continue to be calculated as it is currently: now - deploy time
For functions - the value is an accumulated sum of actual execution time. We will time each execution of the function and add that time to a growing counter that will be used for the total elapsed execution time (runtime).

Having runtime be an accumulator for functions also makes for a nice prometheus metric.

We should also continue to maintain the original deployment time and make that available to control client queries, as it might be really useful to know that a function was deployed 3 weeks ago and has only accrued 12 minutes of total execution time.

Awaiting agent advertise timeout should be calculated based on workload type

The nex-node calls awaitHandshake each time a new firecracker VM is started.

The timeout used by awaitHandshake should be based on the workload type (e.g., OCI requests may need more time).

Implement host services for JavaScript functions (spec included)

In the initial implementation of function support, the functions have no access to side-effects (I/O). This doesn't matter for the native elf binaries because they are free to create sockets and perform all kinds of activities that functions can't.

To make functions practical and useful, we want to give them access to a suite of "host services", similar to the functionality sets provided in "cloud function SDKs", e.g. AWS SDK, Azure, Google, CloudFlare, etc.

This issue covers implementing host services JavaScript functions and hopefully scoping out the work it will take to make similar enhancements to WebAssembly workloads. Implementing host workloads involves a few core pieces:

Implement a host services server in the node
Implement a host services client in the agent
Expose the host services client to the workload itself (e.g. through a JavaScript global object wrapper)

Host Services - Rough Spec

The following is a list of services to be exposed to functions.

Key/Value Store

Each function-type workload will have access to a single key-value bucket. By convention, this bucket will be named {namespace}-{workload} where namespace is the namespace in which the workload is deployed (defaults to default) and workload is the name of the workload. For example, if a workload called echofunction is deployed in the default namespace, it will be given access to the default-echofunction key-value store.

The key-value store will only be created on demand, and so functions that never make KV requests will not consume unnecessary resources.

Operations available:

Get
Set
Delete
Keys (this can fail beyond some upper limit and/or require some form of paging)

History or revision control is not available in host key-value services.

Object Store

Object stores are also dynamically allocated on demand and also follow the same naming convention as the key-value stores. Therefore a workload name echofunction deployed in the default namespace will be provisioned an object store default-echofunction. Just like key-value buckets, the Nex node configuration can contain limits and constraints for buckets or people can even pre-create these buckets and so Nex will just reuse the existing resource.

If the developer wants multi-level or hierarchical functionality, they can implement that by using their own tokenization / prefix scheme.

Operations available:

PutChunk
GetChunk
Delete
Keys (this can fail beyond some upper limit and/or require some form of paging)

There is no implied revision list or history available. Delete here should be considered an operation that cannot be undone.

Messaging

Functions have access to basic NATS messaging, using the connection proxied on behalf of the function by the host.

Operations available:

Publish
Request
RequestMany

Functions cannot create subscriptions nor can they interact with any other low-level aspects of the messaging system. Remember that functions "subscribe" by virtue of their configured triggers and not by executing code.

HTTP Client

I feel like we might want to be able to provide a simple HTTP client proxy so that operators don't need to allow CNI configurations to let the v8 engines inside firecracker make outbound HTTP calls.

Operations available:

Request (payload contains method, headers, etc)

URLs and Credentials

Functions will need access to a specific NATS server, but in real-world deployments we likely will not want them to be communicating with the Nex control plane (e.g. the topic space in which workload start/stop and node interaction commands exist). Rather, we want the functions to use a different NATS server. In other words, you don't want functions to be able to shut down workloads (unless you explicitly want that).

As such, Nex will examine the environment variables in the work request and if a suite of NATS_xxx variables such as NATS_URL exist, then it will use those to create a connection on behalf of the workload. Whenever host services requests are received from the agent on behalf of the workload, those requests will be satisfied using the proxy connection.

Note that if a start request for a function-type workload doesn't include NATS connection parameters, then the host will use the default (control plane) connection and will emit a warning in its log. We can add a "strict mode" to node configuration that will instead reject any request to deploy a function that doesn't have explicit connection values. This would be something we'd want to enable in production and leave off during the development cycle.

This kind of separation is essential to be able to securely support multi-tenant workloads.

Not in Scope

Using the forthcoming WASI component model as the set of contracts between guest and host. We'll re-evaluate that in the coming months as, hopefully, the spec and tooling for components improve.

Ensure that Nex can be built on non-Linux operating systems

Proposed change

Allows for Nex to be built/released for non-Linux operating systems

Use case

The primary use case for this is for people who want to remotely control/operate Nex nodes from machines that don't meet the node execution requirements but can still act as control clients. This could be either Windows or MacOS.

Ideally if you run nex node ... on non-Linux, you'd get an error message rather than the current behavior, which I believe fails to compile on MacOS.

Contribution

No response

Formalize host services SDK

Proposed change

Host services is the set of functionality provided to Nex workloads as a built-in convenience. For example, JavaScript functions get access to a key-value bucket, an object store, basic pub/sub, and an HTTP client.

This would extract host services client code into its own "package"

Use case

This would let code in long-running services utilize host services if they wanted to, giving people more built-in goodness when they use Nex deployments.

Contribution

No response

Support pull consumers for function triggers

In addition to trigger subjects, we should allow developers to specify the name of a pull consumer which will trigger the function each time that consumer pulls a message. For now, we can wrap the execution in an ack so if the function works, we ack, if the function fails, we nack. We can always add support for manual ack/nack to functions in the future if we want.

Complete nex UI integration

Proposed change

The current UI implementation is stubs and placeholders. The following features need to be "live" for the UI to be considered MVP:

Monitor logs
Monitor events
Filter logs/events by namespace
Discover and browse workloads
Start a workload via devrun and an upload button/form

These were all built and completed in the Elixir Phoenix prototype in under a week so hopefully the Go version won't take much longer.

Use case

I want to be able to use nex ui to make it easier and more comfortable to perform some common nex tasks like monitoring logs and events, discovering workloads, and starting (devrun) workloads

Contribution

No response

Support namespace pings

Proposed change

Support the ability to "probe" a nexus (cluster of nex nodes) for a given namespace. So instead of pinging and getting a reply from all machines, instead get a reply from each workload in a given namespace.

The ping and ping filtering system could use a bit of a refactor anyway, especially to accommodate another filter like this.

Use case

Probe for "everything running in this namespace".

Contribution

No response

Add update command to the CLI

Proposed change

Run nex update and it will automatically update the nex binary.

Use case

For people who don't like having to remember what toolchain they used to install a given binary (was it brew? apt? nix? asdf? curl?), it just makes life so much easier if we can do nex update and have it update to the latest version. note that this just applies to replacing the binary being run, and doesn't affect a running process

Contribution

No response

nex-node should propagate async workload validation failures to clients

This PR moved the responsibility of validating workloads to the agent. This means validation now happens asynchronously.

nex-node needs to handle workload validations asynchronously and propagate failures to clients.

Workloads need to be given a chance to gracefully shut down (especially NATS subscribers)

Proposed change

When a request comes in to stop a workload, we should send a message into the agent telling it to gracefully terminate its workload prior to invoking stopVMM on the firecracker machine.

Use case

In one of our simplest use cases, echoservice, if you use devrun to launch it and then use devrun again, the running workload is stopped and the new one is started. This is useful in the developer iteration loop, but if the workload isn't given a chance to gracefully shut down (explicitly drain and disconnect), then the NATS server to which echoservice connects will have a zombie subscription until the NATS server's idle client detection timer expires.

NOTE - The current workaround to this is to set the NATS server's client/subscription timeout value to something very small so that developers don't experience the black hole subscription. Once this change is implemented, a developer should be able to run devrun twice in a row with little to no noticeable overlap in subscriptions for echoservice

This was reported by a user in the NATS slack

Contribution

No response