synadia-io / nex
The NATS execution engine
Home Page: https://docs.nats.io/using-nats/nex
License: Apache License 2.0
When using the latest nex CLI with the arm64 support update, NATS fails to complete the handshake with agents inside the microVMs.
admin@ip-x-x-x-x:~$ nats-server -js
[11826] 2024/03/18 07:15:11.309007 [INF] Starting nats-server
[11826] 2024/03/18 07:15:11.309159 [INF] Version: 2.10.12
[11826] 2024/03/18 07:15:11.309179 [INF] Git: [121169ea]
[11826] 2024/03/18 07:15:11.309199 [INF] Name: NCXQ6Y3BGPS4XJ4AXQWNTUO7SUYN65FH3UJPUSLNPFX4LE6LSAWRQSNN
[11826] 2024/03/18 07:15:11.309220 [INF] Node: 8nbUCYL1
[11826] 2024/03/18 07:15:11.309238 [INF] ID: NCXQ6Y3BGPS4XJ4AXQWNTUO7SUYN65FH3UJPUSLNPFX4LE6LSAWRQSNN
[11826] 2024/03/18 07:15:11.309683 [INF] Starting JetStream
[11826] 2024/03/18 07:15:11.310617 [INF] _ ___ _____ ___ _____ ___ ___ _ __ __
[11826] 2024/03/18 07:15:11.310643 [INF] _ | | __|_ _/ __|_ _| _ \ __| /_\ | \/ |
[11826] 2024/03/18 07:15:11.310661 [INF] | || | _| | | \__ \ | | | / _| / _ \| |\/| |
[11826] 2024/03/18 07:15:11.310679 [INF] \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_| |_|
[11826] 2024/03/18 07:15:11.310695 [INF]
[11826] 2024/03/18 07:15:11.310712 [INF] https://docs.nats.io/jetstream
[11826] 2024/03/18 07:15:11.310729 [INF]
[11826] 2024/03/18 07:15:11.310746 [INF] ---------------- JETSTREAM ----------------
[11826] 2024/03/18 07:15:11.310766 [INF] Max Memory: 23.47 GB
[11826] 2024/03/18 07:15:11.310788 [INF] Max Storage: 18.52 GB
[11826] 2024/03/18 07:15:11.310807 [INF] Store Directory: "/tmp/nats/jetstream"
[11826] 2024/03/18 07:15:11.310825 [INF] -------------------------------------------
[11826] 2024/03/18 07:15:11.311415 [INF] Listening for client connections on 0.0.0.0:4222
[11826] 2024/03/18 07:15:11.312698 [INF] Server is ready
admin@ip-x-x-x-x:~$ sudo nex node preflight
Validating - Required CNI Plugins
🔎Searching - /opt/cni/bin
✅ Dependency Satisfied - /opt/cni/bin/host-local [host-local CNI plugin]
✅ Dependency Satisfied - /opt/cni/bin/ptp [ptp CNI plugin]
✅ Dependency Satisfied - /opt/cni/bin/tc-redirect-tap [tc-redirect-tap CNI plugin]
Validating - Required binaries
🔎Searching - /usr/local/bin
✅ Dependency Satisfied - /usr/local/bin/firecracker [Firecracker VM binary]
Validating - CNI configuration requirements
🔎Searching - /etc/cni/conf.d
✅ Dependency Satisfied - /etc/cni/conf.d/fcnet.conflist [CNI Configuration]
Validating - VMLinux Kernel
✅ Dependency Satisfied - /tmp/wd/vmlinux [VMLinux Kernel]
Validating - Root Filesystem Template
✅ Dependency Satisfied - /tmp/wd/rootfs.ext4 [Root Filesystem Template]
admin@ip-x-x-x-x:~$ sudo nex node up
time=2024-03-18T07:17:28.694Z level=INFO msg="Generated keypair for node" public_key=NAJ25OZ52JHNPU43HPCYAZW75MJ3US3ORABO3JYYIRLK4RBTPFI5AWKH
time=2024-03-18T07:17:28.695Z level=INFO msg="Loaded node configuration" config_path=./config.json
time=2024-03-18T07:17:28.695Z level=INFO msg="Initialized telemetry"
time=2024-03-18T07:17:28.696Z level=INFO msg="Established node NATS connection" servers=""
time=2024-03-18T07:17:28.704Z level=INFO msg="Internal NATS server started" client_url=nats://0.0.0.0:44253
time=2024-03-18T07:17:28.705Z level=INFO msg="Virtual machine manager starting"
time=2024-03-18T07:17:28.705Z level=INFO msg="Resetting network"
time=2024-03-18T07:17:28.705Z level=INFO msg="Use this key as the recipient for encrypted run requests" public_xkey=XDIZCIFCIDEGBDR23JGIEKBX43RVXNC3MOJGTU2Q3VXNIQ3ZSH7HT4VC
time=2024-03-18T07:17:28.705Z level=INFO msg="NATS execution engine awaiting commands" id=NAJ25OZ52JHNPU43HPCYAZW75MJ3US3ORABO3JYYIRLK4RBTPFI5AWKH version=0.1.5
time=2024-03-18T07:17:28.707Z level=INFO msg="Publishing node started event"
time=2024-03-18T07:17:29.014Z level=INFO msg="Called startVMM(), setting up a VMM" firecracker=true vmmid=cnrul21t6e8i4vc1sf6g socket_path=/tmp/.firecracker.sock-11865-cnrul21t6e8i4vc1sf6g
time=2024-03-18T07:17:29.026Z level=INFO msg="VMM metrics disabled." firecracker=true vmmid=cnrul21t6e8i4vc1sf6g
time=2024-03-18T07:17:29.026Z level=INFO msg=refreshMachineConfiguration firecracker=true vmmid=cnrul21t6e8i4vc1sf6g err="[GET /machine-config][200] getMachineConfigurationOK &{CPUTemplate: MemSizeMib:0x40002df748 Smt:0x40002df753 TrackDirtyPages:0x40002df754 VcpuCount:0x40002df740}"
time=2024-03-18T07:17:29.026Z level=INFO msg=PutGuestBootSource firecracker=true vmmid=cnrul21t6e8i4vc1sf6g err="[PUT /boot-source][204] putGuestBootSourceNoContent "
time=2024-03-18T07:17:29.026Z level=INFO msg="Attaching drive" firecracker=true vmmid=cnrul21t6e8i4vc1sf6g drive_path=/tmp/rootfs-cnrul21t6e8i4vc1sf6g.ext4 slot=1 root=true
time=2024-03-18T07:17:29.027Z level=INFO msg="Attached drive" firecracker=true vmmid=cnrul21t6e8i4vc1sf6g drive_path=/tmp/rootfs-cnrul21t6e8i4vc1sf6g.ext4 err="[PUT /drives/{drive_id}][204] putGuestDriveByIdNoContent "
time=2024-03-18T07:17:29.027Z level=INFO msg="Attaching NIC at index" firecracker=true vmmid=cnrul21t6e8i4vc1sf6g device_name=tap0 mac_addr=ca:7d:05:49:f0:b2 interface_id=1
time=2024-03-18T07:17:29.050Z level=INFO msg="startInstance successful" firecracker=true vmmid=cnrul21t6e8i4vc1sf6g err="[PUT /actions][204] createSyncActionNoContent "
time=2024-03-18T07:17:29.050Z level=INFO msg="Machine started" vmid=cnrul21t6e8i4vc1sf6g ip=192.168.127.2 gateway=192.168.127.1 netmask=ffffff00 hosttap=tap0 nats_host=192.168.127.1 nats_port=44253
time=2024-03-18T07:17:29.051Z level=INFO msg="SetMetadata successful" firecracker=true vmmid=cnrul21t6e8i4vc1sf6g
time=2024-03-18T07:17:29.051Z level=INFO msg="Adding new VM to warm pool" ip=192.168.127.2 vmid=cnrul21t6e8i4vc1sf6g
time=2024-03-18T07:17:29.358Z level=INFO msg="Called startVMM(), setting up a VMM" firecracker=true vmmid=cnrul29t6e8i4vc1sf70 socket_path=/tmp/.firecracker.sock-11865-cnrul29t6e8i4vc1sf70
time=2024-03-18T07:17:29.370Z level=INFO msg="VMM metrics disabled." firecracker=true vmmid=cnrul29t6e8i4vc1sf70
time=2024-03-18T07:17:29.371Z level=INFO msg=refreshMachineConfiguration firecracker=true vmmid=cnrul29t6e8i4vc1sf70 err="[GET /machine-config][200] getMachineConfigurationOK &{CPUTemplate: MemSizeMib:0x40004d9398 Smt:0x40004d93a3 TrackDirtyPages:0x40004d93a4 VcpuCount:0x40004d9390}"
time=2024-03-18T07:17:29.371Z level=INFO msg=PutGuestBootSource firecracker=true vmmid=cnrul29t6e8i4vc1sf70 err="[PUT /boot-source][204] putGuestBootSourceNoContent "
time=2024-03-18T07:17:29.371Z level=INFO msg="Attaching drive" firecracker=true vmmid=cnrul29t6e8i4vc1sf70 drive_path=/tmp/rootfs-cnrul29t6e8i4vc1sf70.ext4 slot=1 root=true
time=2024-03-18T07:17:29.372Z level=INFO msg="Attached drive" firecracker=true vmmid=cnrul29t6e8i4vc1sf70 drive_path=/tmp/rootfs-cnrul29t6e8i4vc1sf70.ext4 err="[PUT /drives/{drive_id}][204] putGuestDriveByIdNoContent "
time=2024-03-18T07:17:29.372Z level=INFO msg="Attaching NIC at index" firecracker=true vmmid=cnrul29t6e8i4vc1sf70 device_name=tap0 mac_addr=4a:a8:e5:66:f8:a2 interface_id=1
time=2024-03-18T07:17:29.396Z level=INFO msg="startInstance successful" firecracker=true vmmid=cnrul29t6e8i4vc1sf70 err="[PUT /actions][204] createSyncActionNoContent "
time=2024-03-18T07:17:29.397Z level=INFO msg="Machine started" vmid=cnrul29t6e8i4vc1sf70 ip=192.168.127.3 gateway=192.168.127.1 netmask=ffffff00 hosttap=tap0 nats_host=192.168.127.1 nats_port=44253
time=2024-03-18T07:17:29.397Z level=INFO msg="SetMetadata successful" firecracker=true vmmid=cnrul29t6e8i4vc1sf70
time=2024-03-18T07:17:29.397Z level=INFO msg="Adding new VM to warm pool" ip=192.168.127.3 vmid=cnrul29t6e8i4vc1sf70
time=2024-03-18T07:17:29.701Z level=INFO msg="Called startVMM(), setting up a VMM" firecracker=true vmmid=cnrul29t6e8i4vc1sf7g socket_path=/tmp/.firecracker.sock-11865-cnrul29t6e8i4vc1sf7g
time=2024-03-18T07:17:29.713Z level=INFO msg="VMM metrics disabled." firecracker=true vmmid=cnrul29t6e8i4vc1sf7g
time=2024-03-18T07:17:29.714Z level=INFO msg=refreshMachineConfiguration firecracker=true vmmid=cnrul29t6e8i4vc1sf7g err="[GET /machine-config][200] getMachineConfigurationOK &{CPUTemplate: MemSizeMib:0x4000408bb8 Smt:0x4000408bc3 TrackDirtyPages:0x4000408bc4 VcpuCount:0x4000408bb0}"
time=2024-03-18T07:17:29.714Z level=INFO msg=PutGuestBootSource firecracker=true vmmid=cnrul29t6e8i4vc1sf7g err="[PUT /boot-source][204] putGuestBootSourceNoContent "
time=2024-03-18T07:17:29.715Z level=INFO msg="Attaching drive" firecracker=true vmmid=cnrul29t6e8i4vc1sf7g drive_path=/tmp/rootfs-cnrul29t6e8i4vc1sf7g.ext4 slot=1 root=true
time=2024-03-18T07:17:29.715Z level=INFO msg="Attached drive" firecracker=true vmmid=cnrul29t6e8i4vc1sf7g drive_path=/tmp/rootfs-cnrul29t6e8i4vc1sf7g.ext4 err="[PUT /drives/{drive_id}][204] putGuestDriveByIdNoContent "
time=2024-03-18T07:17:29.715Z level=INFO msg="Attaching NIC at index" firecracker=true vmmid=cnrul29t6e8i4vc1sf7g device_name=tap0 mac_addr=36:88:7a:63:c5:62 interface_id=1
time=2024-03-18T07:17:29.740Z level=INFO msg="startInstance successful" firecracker=true vmmid=cnrul29t6e8i4vc1sf7g err="[PUT /actions][204] createSyncActionNoContent "
time=2024-03-18T07:17:29.740Z level=INFO msg="Machine started" vmid=cnrul29t6e8i4vc1sf7g ip=192.168.127.4 gateway=192.168.127.1 netmask=ffffff00 hosttap=tap0 nats_host=192.168.127.1 nats_port=44253
time=2024-03-18T07:17:29.741Z level=INFO msg="SetMetadata successful" firecracker=true vmmid=cnrul29t6e8i4vc1sf7g
time=2024-03-18T07:17:29.741Z level=INFO msg="Adding new VM to warm pool" ip=192.168.127.4 vmid=cnrul29t6e8i4vc1sf7g
time=2024-03-18T07:17:34.074Z level=ERROR msg="Did not receive NATS handshake from agent within timeout." vmid=cnrul21t6e8i4vc1sf6g
time=2024-03-18T07:17:34.074Z level=ERROR msg="First handshake failed, shutting down to avoid inconsistent behavior"
Problem:
time=2024-03-18T07:17:34.074Z level=ERROR msg="Did not receive NATS handshake from agent within timeout." vmid=cnrul21t6e8i4vc1sf6g
time=2024-03-18T07:17:34.074Z level=ERROR msg="First handshake failed, shutting down to avoid inconsistent behavior"
I tested on a linux-amd64(x86_64) machine following the same steps and successfully set up the node.
admin@ip-x-x-x-x:~$ sudo nex node up
time=2024-03-18T07:32:19.250Z level=INFO msg="Generated keypair for node" public_key=NAAFGJURU5O2A7Y3LW4ADW46C3FGO2HQFQXR4ZDVUW4QOPCL5DSZ65D7
time=2024-03-18T07:32:19.251Z level=INFO msg="Loaded node configuration" config_path=./config.json
time=2024-03-18T07:32:19.251Z level=INFO msg="Initialized telemetry"
time=2024-03-18T07:32:19.252Z level=INFO msg="Established node NATS connection" servers=""
time=2024-03-18T07:32:19.462Z level=INFO msg="Internal NATS server started" client_url=nats://0.0.0.0:33653
time=2024-03-18T07:32:19.462Z level=INFO msg="Use this key as the recipient for encrypted run requests" public_xkey=XDSYXBPAWDSXFRNJWNLNCE3IGU2ZRCYZAMRDJETPVR5YNICZVBNQ5NDX
time=2024-03-18T07:32:19.462Z level=INFO msg="Virtual machine manager starting"
time=2024-03-18T07:32:19.462Z level=INFO msg="NATS execution engine awaiting commands" id=NAAFGJURU5O2A7Y3LW4ADW46C3FGO2HQFQXR4ZDVUW4QOPCL5DSZ65D7 version=0.1.5
time=2024-03-18T07:32:19.462Z level=INFO msg="Resetting network"
time=2024-03-18T07:32:19.462Z level=INFO msg="Publishing node started event"
time=2024-03-18T07:32:20.311Z level=INFO msg="Called startVMM(), setting up a VMM" firecracker=true vmmid=cnrus0v76f30gvd23590 socket_path=/tmp/.firecracker.sock-1113-cnrus0v76f30gvd23590
time=2024-03-18T07:32:20.358Z level=INFO msg="VMM metrics disabled." firecracker=true vmmid=cnrus0v76f30gvd23590
time=2024-03-18T07:32:20.359Z level=INFO msg=refreshMachineConfiguration firecracker=true vmmid=cnrus0v76f30gvd23590 err="[GET /machine-config][200] getMachineConfigurationOK &{CPUTemplate: MemSizeMib:0xc00040cb38 Smt:0xc00040cb43 TrackDirtyPages:0xc00040cb44 VcpuCount:0xc00040cb30}"
time=2024-03-18T07:32:20.359Z level=INFO msg=PutGuestBootSource firecracker=true vmmid=cnrus0v76f30gvd23590 err="[PUT /boot-source][204] putGuestBootSourceNoContent "
time=2024-03-18T07:32:20.359Z level=INFO msg="Attaching drive" firecracker=true vmmid=cnrus0v76f30gvd23590 drive_path=/tmp/rootfs-cnrus0v76f30gvd23590.ext4 slot=1 root=true
time=2024-03-18T07:32:20.360Z level=INFO msg="Attached drive" firecracker=true vmmid=cnrus0v76f30gvd23590 drive_path=/tmp/rootfs-cnrus0v76f30gvd23590.ext4 err="[PUT /drives/{drive_id}][204] putGuestDriveByIdNoContent "
time=2024-03-18T07:32:20.360Z level=INFO msg="Attaching NIC at index" firecracker=true vmmid=cnrus0v76f30gvd23590 device_name=tap0 mac_addr=16:ee:e2:d1:5f:37 interface_id=1
time=2024-03-18T07:32:20.432Z level=INFO msg="startInstance successful" firecracker=true vmmid=cnrus0v76f30gvd23590 err="[PUT /actions][204] createSyncActionNoContent "
time=2024-03-18T07:32:20.432Z level=INFO msg="Machine started" vmid=cnrus0v76f30gvd23590 ip=192.168.127.2 gateway=192.168.127.1 netmask=ffffff00 hosttap=tap0 nats_host=192.168.127.1 nats_port=33653
time=2024-03-18T07:32:20.433Z level=INFO msg="SetMetadata successful" firecracker=true vmmid=cnrus0v76f30gvd23590
time=2024-03-18T07:32:20.433Z level=INFO msg="Adding new VM to warm pool" ip=192.168.127.2 vmid=cnrus0v76f30gvd23590
time=2024-03-18T07:32:20.678Z level=INFO msg="Called startVMM(), setting up a VMM" firecracker=true vmmid=cnrus1776f30gvd2359g socket_path=/tmp/.firecracker.sock-1113-cnrus1776f30gvd2359g
time=2024-03-18T07:32:20.690Z level=INFO msg="VMM metrics disabled." firecracker=true vmmid=cnrus1776f30gvd2359g
time=2024-03-18T07:32:20.690Z level=INFO msg=refreshMachineConfiguration firecracker=true vmmid=cnrus1776f30gvd2359g err="[GET /machine-config][200] getMachineConfigurationOK &{CPUTemplate: MemSizeMib:0xc000015068 Smt:0xc000015073 TrackDirtyPages:0xc000015074 VcpuCount:0xc000015060}"
time=2024-03-18T07:32:20.691Z level=INFO msg=PutGuestBootSource firecracker=true vmmid=cnrus1776f30gvd2359g err="[PUT /boot-source][204] putGuestBootSourceNoContent "
time=2024-03-18T07:32:20.691Z level=INFO msg="Attaching drive" firecracker=true vmmid=cnrus1776f30gvd2359g drive_path=/tmp/rootfs-cnrus1776f30gvd2359g.ext4 slot=1 root=true
time=2024-03-18T07:32:20.691Z level=INFO msg="Attached drive" firecracker=true vmmid=cnrus1776f30gvd2359g drive_path=/tmp/rootfs-cnrus1776f30gvd2359g.ext4 err="[PUT /drives/{drive_id}][204] putGuestDriveByIdNoContent "
time=2024-03-18T07:32:20.691Z level=INFO msg="Attaching NIC at index" firecracker=true vmmid=cnrus1776f30gvd2359g device_name=tap0 mac_addr=26:48:5a:18:8f:de interface_id=1
time=2024-03-18T07:32:20.765Z level=INFO msg="startInstance successful" firecracker=true vmmid=cnrus1776f30gvd2359g err="[PUT /actions][204] createSyncActionNoContent "
time=2024-03-18T07:32:20.765Z level=INFO msg="Machine started" vmid=cnrus1776f30gvd2359g ip=192.168.127.3 gateway=192.168.127.1 netmask=ffffff00 hosttap=tap0 nats_host=192.168.127.1 nats_port=33653
time=2024-03-18T07:32:20.765Z level=INFO msg="SetMetadata successful" firecracker=true vmmid=cnrus1776f30gvd2359g
time=2024-03-18T07:32:20.765Z level=INFO msg="Adding new VM to warm pool" ip=192.168.127.3 vmid=cnrus1776f30gvd2359g
time=2024-03-18T07:32:21.018Z level=INFO msg="Called startVMM(), setting up a VMM" firecracker=true vmmid=cnrus1776f30gvd235a0 socket_path=/tmp/.firecracker.sock-1113-cnrus1776f30gvd235a0
time=2024-03-18T07:32:21.030Z level=INFO msg="VMM metrics disabled." firecracker=true vmmid=cnrus1776f30gvd235a0
time=2024-03-18T07:32:21.031Z level=INFO msg=refreshMachineConfiguration firecracker=true vmmid=cnrus1776f30gvd235a0 err="[GET /machine-config][200] getMachineConfigurationOK &{CPUTemplate: MemSizeMib:0xc0001222c8 Smt:0xc0001222d3 TrackDirtyPages:0xc0001222d4 VcpuCount:0xc0001222c0}"
time=2024-03-18T07:32:21.031Z level=INFO msg=PutGuestBootSource firecracker=true vmmid=cnrus1776f30gvd235a0 err="[PUT /boot-source][204] putGuestBootSourceNoContent "
time=2024-03-18T07:32:21.031Z level=INFO msg="Attaching drive" firecracker=true vmmid=cnrus1776f30gvd235a0 drive_path=/tmp/rootfs-cnrus1776f30gvd235a0.ext4 slot=1 root=true
time=2024-03-18T07:32:21.032Z level=INFO msg="Attached drive" firecracker=true vmmid=cnrus1776f30gvd235a0 drive_path=/tmp/rootfs-cnrus1776f30gvd235a0.ext4 err="[PUT /drives/{drive_id}][204] putGuestDriveByIdNoContent "
time=2024-03-18T07:32:21.032Z level=INFO msg="Attaching NIC at index" firecracker=true vmmid=cnrus1776f30gvd235a0 device_name=tap0 mac_addr=76:a8:99:c5:0c:e6 interface_id=1
time=2024-03-18T07:32:21.095Z level=INFO msg="startInstance successful" firecracker=true vmmid=cnrus1776f30gvd235a0 err="[PUT /actions][204] createSyncActionNoContent "
time=2024-03-18T07:32:21.096Z level=INFO msg="Machine started" vmid=cnrus1776f30gvd235a0 ip=192.168.127.4 gateway=192.168.127.1 netmask=ffffff00 hosttap=tap0 nats_host=192.168.127.1 nats_port=33653
time=2024-03-18T07:32:21.096Z level=INFO msg="SetMetadata successful" firecracker=true vmmid=cnrus1776f30gvd235a0
time=2024-03-18T07:32:21.096Z level=INFO msg="Adding new VM to warm pool" ip=192.168.127.4 vmid=cnrus1776f30gvd235a0
time=2024-03-18T07:32:22.195Z level=INFO msg="Received agent handshake" vmid=cnrus0v76f30gvd23590 message="Host-supplied metadata"
time=2024-03-18T07:32:22.554Z level=INFO msg="Received agent handshake" vmid=cnrus1776f30gvd2359g message="Host-supplied metadata"
time=2024-03-18T07:32:22.857Z level=INFO msg="Received agent handshake" vmid=cnrus1776f30gvd235a0 message="Host-supplied metadata"
admin@ip-x-x-x-x:~$ nats-server --version
nats-server: v2.10.12
admin@ip-x-x-x-x:~$ nex --version
v0.1.5 [567f45b355e71d0a186961a8aaca95d7c3a7fd3b] | Built-on: 2024-03-14T19:45:23Z
Host: AWS a1.metal
vCPUs: 16
Memory: 32GiB
OS: debian-12
CPU Arch: arm64
admin@ip-x-x-x-x:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux 12 (bookworm)
Release: 12
Codename: bookworm
admin@ip-x-x-x-x:~$ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: ARM
Model name: Cortex-A72
Model: 3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 4
Stepping: r0p3
BogoMIPS: 166.66
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
Caches (sum of all):
L1d: 512 KiB (16 instances)
L1i: 768 KiB (16 instances)
L2: 8 MiB (4 instances)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Not affected
Spectre v1: Mitigation; __user pointer sanitization
Spectre v2: Mitigation; Branch predictor hardening, BHB
Srbds: Not affected
Tsx async abort: Not affected
No response
This is just to discuss what other badges, if any, we want to add to the README. At the moment, all we have is the status of the two actions. I would like some sort of code coverage indicator to show the platform is (er, uh, will be) well tested.
Many of us are on corporate Windows or macOS computers, with no easy way to run Firecracker. For people in this situation it would be great if nex could also run on top of another execution engine, e.g. podman with Podman Desktop.
Developers on corporate Windows or macOS machines who want to test nex locally.
Not software, but if we have this, then I can probably get my employer to buy a subscription from Synadia.
When a function is triggered, we should emit an event indicating the completion status of the function:
Monitoring tools (other than prometheus/otel) would like to be able to observe execution counts, execution time, and failure rates
No response
When using the simple.json preflight checks, the command fails if the directories are not already present.
Preflight checks should create any required nested directories.
root@nats-client:~# nex --version
vdevelopment [] | Built-on:
Ubuntu 23.10 x64
I'm using a clean Ubuntu 23.10 x64 droplet in Digital Ocean, with golang and the nats CLI installed.
root@nats-client:~# nex node preflight --config simple.json
Validating - Required CNI Plugins [/opt/cni/bin]
⛔ Missing Dependency - /opt/cni/bin/host-local [host-local CNI plugin]
⛔ Missing Dependency - /opt/cni/bin/ptp [ptp CNI plugin]
⛔ Missing Dependency - /opt/cni/bin/tc-redirect-tap [tc-redirect-tap CNI plugin]
Validating - Required binaries [/usr/local/bin]
⛔ Missing Dependency - /usr/local/bin/firecracker [Firecracker VM binary]
Validating - CNI configuration requirements [/etc/cni/conf.d]
⛔ Missing Dependency - /etc/cni/conf.d/fcnet.conflist [CNI Configuration]
Validating - User provided files []
⛔ Missing Dependency - /tmp/wd/vmlinux [VMLinux Kernel]
⛔ Missing Dependency - /tmp/wd/rootfs.ext4 [Root Filesystem Template]
⛔ You are missing required dependencies for [Required CNI Plugins], do you want to install? [y/N] y
open /opt/cni/bin/ptp: no such file or directory
panic: preflight checks failed: open /opt/cni/bin/ptp: no such file or directory
goroutine 1 [running]:
github.com/synadia-io/nex/internal/node.CmdPreflight(0x1114db0?, 0x1784a80, {0xe58680?, 0xc0003fab40?}, 0xc0003fab40?, 0xc00046c5f0?)
/root/go/pkg/mod/github.com/synadia-io/[email protected]/internal/node/init.go:140 +0x96
main.RunNodePreflight(0x40e44c?)
/root/go/pkg/mod/github.com/synadia-io/[email protected]/nex/node_linux_amd64.go:33 +0x179
github.com/choria-io/fisk.(*actionMixin).applyActions(...)
/root/go/pkg/mod/github.com/choria-io/[email protected]/actions.go:28
github.com/choria-io/fisk.(*Application).applyActions(0xc00047f400?, 0xc00019e000)
/root/go/pkg/mod/github.com/choria-io/[email protected]/app.go:823 +0xdf
github.com/choria-io/fisk.(*Application).execute(0xc00047f400, 0xc00019e000?, {0xc0003fab80, 0x2, 0x2})
/root/go/pkg/mod/github.com/choria-io/[email protected]/app.go:609 +0x65
github.com/choria-io/fisk.(*Application).Parse(0xc00047f400, {0xc000100060?, 0xc0004eb650?, 0x30?})
/root/go/pkg/mod/github.com/choria-io/[email protected]/app.go:272 +0x16c
github.com/choria-io/fisk.(*Application).MustParseWithUsage(0xc00047f400, {0xc000100060, 0x4, 0x4})
/root/go/pkg/mod/github.com/choria-io/[email protected]/app.go:889 +0x45
main.main()
/root/go/pkg/mod/github.com/synadia-io/[email protected]/nex/nex.go:94 +0x41f7
If you run `mkdir /opt/cni/bin` and run again, the CNI Plugins step succeeds but then fails on CNI configuration requirements. After resolving that issue, it fails again at User provided files. After creating all required directories, preflight succeeds.
The two structs are almost identical
This will allow us to drop a config at preflight
No response
Whether it's a set of instructions or a docker compose file, we should have an "easy button" for people to capture and see the OpenTelemetry spans that we're collecting.
Observability shouldn't be an afterthought
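One possible "easy button" is a small otel-collector config that a docker compose file could mount; this is a sketch under assumptions (a recent collector image where the console exporter is named `debug`; older releases call it `logging`), not something nex currently ships:

```yaml
# otel-collector-config.yaml (illustrative, not part of the nex repo)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  debug:
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
```

Running `otel/opentelemetry-collector` with this config would print any spans nex exports over OTLP gRPC straight to the collector's console, which is enough to verify instrumentation locally.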
No response
We should be able to export to an OTLP collector rather than exposing an inbound HTTP Prometheus scrape URL like we're doing now. Firstly, we should be using OTLP instead of something vendor-specific like Prometheus. Secondly, we should support exporting because it's typically easier in secure production environments than supporting inbound HTTP.
I want to be able to export metrics and traces to OTLP collector(s)
No response
Using the default value `""` for `string` makes it difficult to determine if clients are providing "garbage in" -- it feels more idiomatic to use `*string` and to have a basic utility method such as:

```go
// StringOrNil returns the given string or nil when empty
func StringOrNil(str string) *string {
	if str == "" {
		return nil
	}
	return &str
}
```

A good chore for the near future is to audit and refactor the codebase to (i) use `*string` in `struct` types, (ii) implement the above `StringOrNil` utility method and (iii) dereference string pointers where necessary.
Setup ginkgo and code coverage tooling.
Right now, to build a custom rootfs, or even develop on the agent, one must use scripts nested in the repo and is limited to x86. This proposal brings that functionality into the CLI as `nex rootfs ...` or something similar.
Better devx around the rootfs creation
Will contribute over the next week
Right now the only way to observe the state of a Nexus (group of Nex nodes) is to monitor events. If one of those events is lost/missed, then watchers need to make another round of `ping` requests. It would be easier if Nex nodes emitted regular heartbeats with critical information on them so the state of a Nexus can be observed.
Receive periodic heartbeats containing node status as a supplement to the "started", "stopped", etc. events.
No response
nats-server
[128423] 2024/03/01 16:16:19.229992 [INF] Starting nats-server
[128423] 2024/03/01 16:16:19.230314 [INF] Version: 2.10.11
[128423] 2024/03/01 16:16:19.230323 [INF] Git: [5fc6051]
[128423] 2024/03/01 16:16:19.230329 [INF] Name: NA3UPF2KFNGQWU7KKHG4ESWKGIO2XGZYVYOHMXC3LKX4Y4A37ECR7E2V
[128423] 2024/03/01 16:16:19.230349 [INF] ID: NA3UPF2KFNGQWU7KKHG4ESWKGIO2XGZYVYOHMXC3LKX4Y4A37ECR7E2V
[128423] 2024/03/01 16:16:19.232238 [INF] Listening for client connections on 0.0.0.0:4222
[128423] 2024/03/01 16:16:19.232916 [INF] Server is ready
sudo nex node preflight --config=./simple.json
[sudo] password for user:
Validating - Required CNI Plugins
🔎Searching - /opt/cni/bin
✅ Dependency Satisfied - /opt/cni/bin/host-local [host-local CNI plugin]
✅ Dependency Satisfied - /opt/cni/bin/ptp [ptp CNI plugin]
✅ Dependency Satisfied - /opt/cni/bin/tc-redirect-tap [tc-redirect-tap CNI plugin]
Validating - Required binaries
🔎Searching - /usr/local/bin
✅ Dependency Satisfied - /usr/local/bin/firecracker [Firecracker VM binary]
Validating - CNI configuration requirements
🔎Searching - /etc/cni/conf.d
✅ Dependency Satisfied - /etc/cni/conf.d/fcnet.conflist [CNI Configuration]
Validating - VMLinux Kernel
✅ Dependency Satisfied - /path/to/vmlinux-5.10 [VMLinux Kernel]
Validating - Root Filesystem Template
✅ Dependency Satisfied - /path/to/rootfs.ext4 [Root Filesystem Template]
user@userbn:~/Desktop/nats$ sudo nex node up --config=./simple.json
time=2024-03-01T16:36:30.686+03:00 level=INFO msg="Generated keypair for node" public_key=NDDCGXKZ5Z3P43KDTMNPLXZYBCCL7REOXYZ6GMQKBWHWQDLNCZSMMXRC
time=2024-03-01T16:36:30.686+03:00 level=INFO msg="Loaded node configuration" config_path=./simple.json
time=2024-03-01T16:36:30.686+03:00 level=INFO msg="Initialized telemetry"
time=2024-03-01T16:36:30.689+03:00 level=INFO msg="Established node NATS connection" servers=""
time=2024-03-01T16:36:30.702+03:00 level=INFO msg="Internal NATS server started" client_url=nats://0.0.0.0:38009
time=2024-03-01T16:36:30.702+03:00 level=INFO msg="Use this key as the recipient for encrypted run requests" public_xkey=XDEHXWJMQ6ZCLFRFKSWLX2BQ7MLCUCYZGEQAPAJL7VMZUZK2JPICVKCT
time=2024-03-01T16:36:30.702+03:00 level=INFO msg="NATS execution engine awaiting commands" id=NDDCGXKZ5Z3P43KDTMNPLXZYBCCL7REOXYZ6GMQKBWHWQDLNCZSMMXRC version=0.1.4
time=2024-03-01T16:36:30.702+03:00 level=INFO msg="Virtual machine manager starting"
time=2024-03-01T16:36:30.702+03:00 level=INFO msg="Resetting network"
time=2024-03-01T16:36:30.703+03:00 level=INFO msg="Publishing node started event"
time=2024-03-01T16:36:31.194+03:00 level=INFO msg="Called startVMM(), setting up a VMM" firecracker=true vmmid=cngtjngpd7rvbdfklck0 socket_path=/tmp/.firecracker.sock-129425-cngtjngpd7rvbdfklck0
time=2024-03-01T16:36:31.208+03:00 level=INFO msg="VMM metrics disabled." firecracker=true vmmid=cngtjngpd7rvbdfklck0
time=2024-03-01T16:36:31.209+03:00 level=INFO msg=refreshMachineConfiguration firecracker=true vmmid=cngtjngpd7rvbdfklck0 err="[GET /machine-config][200] getMachineConfigurationOK &{CPUTemplate: MemSizeMib:0xc0003574f8 Smt:0xc000357503 TrackDirtyPages:0xc000357504 VcpuCount:0xc0003574f0}"
time=2024-03-01T16:36:31.210+03:00 level=INFO msg=PutGuestBootSource firecracker=true vmmid=cngtjngpd7rvbdfklck0 err="[PUT /boot-source][204] putGuestBootSourceNoContent "
time=2024-03-01T16:36:31.210+03:00 level=INFO msg="Attaching drive" firecracker=true vmmid=cngtjngpd7rvbdfklck0 drive_path=/tmp/rootfs-cngtjngpd7rvbdfklck0.ext4 slot=1 root=true
time=2024-03-01T16:36:31.211+03:00 level=INFO msg="Attached drive" firecracker=true vmmid=cngtjngpd7rvbdfklck0 drive_path=/tmp/rootfs-cngtjngpd7rvbdfklck0.ext4 err="[PUT /drives/{drive_id}][204] putGuestDriveByIdNoContent "
time=2024-03-01T16:36:31.211+03:00 level=INFO msg="Attaching NIC at index" firecracker=true vmmid=cngtjngpd7rvbdfklck0 device_name=tap0 mac_addr=ea:a3:32:2f:7b:4e interface_id=1
time=2024-03-01T16:36:31.369+03:00 level=INFO msg="startInstance successful" firecracker=true vmmid=cngtjngpd7rvbdfklck0 err="[PUT /actions][204] createSyncActionNoContent "
time=2024-03-01T16:36:31.369+03:00 level=INFO msg="Machine started" vmid=cngtjngpd7rvbdfklck0 ip=192.168.127.2 gateway=192.168.127.1 netmask=ffffff00 hosttap=tap0 nats_host=192.168.127.1 nats_port=38009
time=2024-03-01T16:36:31.370+03:00 level=INFO msg="SetMetadata successful" firecracker=true vmmid=cngtjngpd7rvbdfklck0
time=2024-03-01T16:36:31.370+03:00 level=INFO msg="Adding new VM to warm pool" ip=192.168.127.2 vmid=cngtjngpd7rvbdfklck0
time=2024-03-01T16:36:33.246+03:00 level=INFO msg="Received agent handshake" vmid=cngtjngpd7rvbdfklck0 message="Host-supplied metadata"
nex node info NDDCGXKZ5Z3P43KDTMNPLXZYBCCL7REOXYZ6GMQKBWHWQDLNCZSMMXRC
NEX Node Information
Node: NDDCGXKZ5Z3P43KDTMNPLXZYBCCL7REOXYZ6GMQKBWHWQDLNCZSMMXRC
Xkey: XDEHXWJMQ6ZCLFRFKSWLX2BQ7MLCUCYZGEQAPAJL7VMZUZK2JPICVKCT
Version: 0.1.4
Uptime: 16m18s
Tags: nex.arch=amd64, nex.cpucount=16, nex.os=linux, simple=true
Memory in kB:
Free: 435,036
Available: 12,001,120
Total: 16,365,564
Problem:
nex devrun ./echoservice/echoservice nats_url="nats://192.168.127.1:4222"
Reusing existing issuer account key: /home/user/.nex/issuer.nk
Reusing existing publisher xkey: /home/user/.nex/publisher.xk
time=2024-03-01T16:57:20.172+03:00 level=ERROR msg="failed to devrun workload" err="nats: no responders available for request"
$ nex devrun ../examples/echoservice/echoservice nats_url=nats://192.168.127.1:4222
Reusing existing issuer account key: /home/kevin/.nex/issuer.nk
Reusing existing publisher xkey: /home/kevin/.nex/publisher.xk
🚀 Workload 'echoservice' accepted. You can now refer to this workload with ID: cmji29n52omrb71g07a0 on node NBS3Y3NWXLTFNC73XMVD6USFJF2H5QXTLEJQNOPEBPYDUDVB5YYYZOGI
user@userbn:~$ nats-server --version
nats-server: v2.10.11
user@userbn:~$ nex --version
v0.1.4 [7a6e9b3a5d68199fddc85c8c2b880adc94b7412c] | Built-on: 2024-02-23T18:36:07Z
lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.4 LTS
Release: 22.04
Codename: jammy
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz
CPU family: 6
Model: 79
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 4
Stepping: 1
BogoMIPS: 4394.90
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse
sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpu
id tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcn
t tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fau
lt invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsb
ase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat umip ar
ch_capabilities
Virtualization features:
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
Caches (sum of all):
L1d: 512 KiB (16 instances)
L1i: 512 KiB (16 instances)
L2: 64 MiB (16 instances)
L3: 64 MiB (4 instances)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Mitigation; PTE Inversion; VMX flush not necessary, SMT disabled
Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Meltdown: Mitigation; PTI
Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Retbleed: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS N
ot affected
Srbds: Not affected
Tsx async abort: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
No response
May be useful to support `trace` level here also.
Originally posted by @kthomas in #82 (comment)
As soon as a nex node starts, `ping` will report an incorrect workload count. For example, with the default pool size of 1, `ping` will report that there is 1 workload running. However, when you query via `info`, you get an accurate list of running machines (which is none in this case).
When we query a newly running node, it should report 0 workloads. It should continue to report an accurate count based on what is and is not actually running, so it should take into account machine failures and terminations as well.
We're not currently testing our API endpoints, so we should add specs that assert the control API behaves properly.
main
Ubuntu 20.10, x86_64
Just start a node and then ping it or do nex node ls
A node can be placed into lame duck mode by a remote request. This lame duck mode would:
- remove the `essential` label from all applicable running workloads
- emit a `lame_duck_entered` event
- set a `lame_duck` flag

Once in lame duck mode, a node cannot be "un-lame-duck'ed". The only acceptable state after lame duck is termination.
We may want to prepare a node for maintenance, an update, a pending network partition, or maybe just a particularly long graceful shutdown. Implementing a request to enter lame duck mode would accomplish this.
No response
Have preflight check for these binaries and install if not present
Kernel: https://s3.amazonaws.com/spec.ccfc.min/firecracker-ci/v1.5/x86_64/vmlinux-5.10.186
Plugins: https://github.com/containernetworking/plugins/releases/download/v1.3.0/cni-plugins-linux-arm-v1.3.0.tgz
Plugins SHA: https://github.com/containernetworking/plugins/releases/download/v1.3.0/cni-plugins-linux-arm-v1.3.0.tgz.sha256
tc-redirect-tap: https://github.com/awslabs/tc-redirect-tap (will need to build 😫 )
We need to rebuild the rootfs when the agent changes. We will do this in CI and add the rootfs as an artifact, but it would also be nice to have some more fluent devex around this for motivated engineers who are hacking away at the agent.
When the Nex CLI is performing a task that requires a connection to the NATS server, we should be able to detect if a nats context is available and selected and even potentially offer the ability to override the nats context.
If someone isn't using the default NATS URL for their development server, then they have to add (at least) `--server` to all of their nex commands. We should make it much easier and detect/reuse the nats context.
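A sketch of how detection might work. As I understand the natscli layout (worth verifying against its docs before relying on it), the selected context name lives in `<config>/nats/context.txt` and each context is a JSON file under `<config>/nats/context/<name>.json`; the function and helper names here are illustrative, not the nex CLI API.

```go
package main

import (
	"encoding/json"
	"os"
	"path/filepath"
	"strings"
)

// selectedContextURL reads the active nats CLI context and returns its
// server URL, or false when no usable context is selected.
func selectedContextURL(configDir string) (string, bool) {
	nameBytes, err := os.ReadFile(filepath.Join(configDir, "nats", "context.txt"))
	if err != nil {
		return "", false
	}
	name := strings.TrimSpace(string(nameBytes))
	raw, err := os.ReadFile(filepath.Join(configDir, "nats", "context", name+".json"))
	if err != nil {
		return "", false
	}
	var ctx struct {
		URL string `json:"url"`
	}
	if json.Unmarshal(raw, &ctx) != nil || ctx.URL == "" {
		return "", false
	}
	return ctx.URL, true
}

// writeContextFixture creates a minimal context on disk, handy for
// demos and tests of the detection logic above.
func writeContextFixture(configDir, name, url string) error {
	ctxDir := filepath.Join(configDir, "nats", "context")
	if err := os.MkdirAll(ctxDir, 0o755); err != nil {
		return err
	}
	if err := os.WriteFile(filepath.Join(configDir, "nats", "context.txt"), []byte(name+"\n"), 0o644); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(ctxDir, name+".json"), []byte(`{"url":"`+url+`"}`), 0o644)
}
```

A `--server` flag (or an explicit override flag) would still take precedence over the detected context.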
No response
Migrate to log/slog before we get to the point of no return.
Right now `node up` just continues allocating IP addresses. In production it won't take long for the IPs to run out.
Note that this behavior will happen regardless of how many times the node is started. Basically, after 254 VMs from a fresh start, we're going to run out. So I see the requirements as follows: the node should reclaim addresses from terminated VMs, and allocation should start from the `.2` address rather than growing monotonically.
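The requirements above could be sketched with a small reclaiming allocator. This is a minimal illustration, not the actual nex implementation: `IPPool` and its methods are hypothetical names, assuming a /24 subnet where `.1` is the host-side gateway.

```go
package main

import (
	"errors"
	"fmt"
)

// IPPool hands out host octets starting at .2 and reuses addresses
// released by terminated VMs instead of growing until exhaustion.
type IPPool struct {
	prefix string // e.g. "192.168.127."
	next   int    // next never-used host octet
	free   []int  // octets returned via Release
	inUse  map[int]bool
}

func NewIPPool(prefix string) *IPPool {
	return &IPPool{prefix: prefix, next: 2, inUse: map[int]bool{}}
}

// Allocate prefers recycled addresses, then fresh ones up to .254.
func (p *IPPool) Allocate() (string, error) {
	var octet int
	if n := len(p.free); n > 0 {
		octet = p.free[n-1]
		p.free = p.free[:n-1]
	} else if p.next <= 254 {
		octet = p.next
		p.next++
	} else {
		return "", errors.New("ip pool exhausted")
	}
	p.inUse[octet] = true
	return fmt.Sprintf("%s%d", p.prefix, octet), nil
}

// Release returns an address to the pool when its VM terminates.
func (p *IPPool) Release(octet int) {
	if p.inUse[octet] {
		delete(p.inUse, octet)
		p.free = append(p.free, octet)
	}
}
```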
Need to remove the `go install` method of installation (broken by the `replace` stanza in go.mod) and provide a better solution.
Better DevEx
No response
If `$HOME/.nex/issuer.nk` doesn't already exist, `devrun` will fail with a "no such file" error.
All I needed to do was `touch` the file, and the next time I ran `nex`, it created the issuer and publisher nkeys correctly.
Probably as simple as stat-ing the file before trying to open it. Will look into it tomorrow.
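The stat-before-open fix could look roughly like this. This is a sketch: `ensureKeyFile` and the `generate` callback are hypothetical names standing in for wherever nex creates its nkey seeds (in the real code, the nkeys library).

```go
package main

import (
	"errors"
	"os"
	"path/filepath"
)

// ensureKeyFile stats the key file first and, when it doesn't exist,
// creates it with a freshly generated seed instead of failing with a
// "no such file" error. Existing files are read back unchanged.
func ensureKeyFile(path string, generate func() ([]byte, error)) ([]byte, error) {
	if _, err := os.Stat(path); errors.Is(err, os.ErrNotExist) {
		seed, genErr := generate()
		if genErr != nil {
			return nil, genErr
		}
		if err := os.MkdirAll(filepath.Dir(path), 0o700); err != nil {
			return nil, err
		}
		// Key material: restrict permissions to the owner.
		if err := os.WriteFile(path, seed, 0o600); err != nil {
			return nil, err
		}
		return seed, nil
	}
	return os.ReadFile(path)
}
```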
Follow the same pattern(s) as the NATS server (where/if appropriate) to allow Nex nodes to emit (or be scraped for) metrics. The ability to export (OTLP) should be the primary goal; supporting scraping should only be done if we get it for free or "for cheap" while setting up metrics and exporting. We shouldn't use any vendor-specific libraries like Jaeger or Prometheus clients; everything should follow otel/OTLP standards.
Some gauges that I think we'll want ("deployed" includes all workload types, incl. functions):
Counters:
Operators managing a fleet of Nex nodes will need this feature
No response
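The gauges and counters described above could be modeled roughly like this. This is a stdlib sketch with illustrative metric names; in the real implementation each field would be an OpenTelemetry instrument registered through the otel metric API and exported over OTLP, not hand-rolled atomics.

```go
package main

import "sync/atomic"

// nodeMetrics sketches the instruments a Nex node might track.
type nodeMetrics struct {
	workloadsDeployed   atomic.Int64 // gauge: currently deployed, all workload types incl. functions
	deployRequestsTotal atomic.Int64 // counter: deploy requests received
	functionsExecuted   atomic.Int64 // counter: total function trigger executions
}

func (m *nodeMetrics) onDeploy() {
	m.deployRequestsTotal.Add(1)
	m.workloadsDeployed.Add(1)
}

func (m *nodeMetrics) onTerminate() {
	m.workloadsDeployed.Add(-1)
}

func (m *nodeMetrics) onExecution() {
	m.functionsExecuted.Add(1)
}
```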
We should implement the core functionality that allows for the nex node to emit OpenTelemetry spans. On top of that, we should keep track of function execution as a span.
As a Nex developer or operator, I want to be able to track activity through my own OpenTelemetry infrastructure. The primary desire is that I can track function executions and correlate all of the host service calls that each function makes.
So, for example, a "span" might look something like this:
No response
Start a nex node and have the agent within the first VM not report a handshake (this is pretty difficult to reproduce). A request to deploy a workload comes in, the node attempts to deploy into that VM, and since the VM is effectively dead, the result is a `nats: timeout` error.
The Nex node should abort and terminate if an agent fails to handshake. Agents failing to handshake is an indication of instability that is enough to warrant stopping the node. A missing handshake used to stop the node and no longer does, so this is a regression.
This would then never allow the situation where we attempt to deploy into a VM that hasn't completed its handshake.
Latest release of both
No response
No response
If a workload request contains a machine configuration override field, then the workload will use that override and not the machine configuration template that was specified via the `--config` option when the nex node was started (`nex node up --config ...`).
When applications have more specific requirements, such as controlling inbound TCP port access, and those requirements don't apply to all firecracker VMs on a given node, then we want the ability to specify per-machine configuration. For example, if we want all instances of a given workload, let's say `alice-service`, to listen on port `8080` on the node host versus listening on the firecracker IP (e.g. `192.168.127.x:8080`), then a configuration that uses the CNI `portmap` plugin needs to be specified, and it likely only applies to that workload and not all workloads on the node.
This should be labeled as an advanced feature, and the Nex node itself should emit a warning whenever a per-workload machine configuration override is included in a deployment request. The use of this feature will also impact startup performance: since the workload deployment request is for a bespoke machine configuration, we can't reuse one of the "warm" VMs in the firecracker pool, and it has to be started specifically for this request. For most workloads this startup penalty is negligible, but it's definitely something an operator needs to be aware of.
The warning for using this feature should also remind developers that whatever validation was performed during the most recent `preflight` check does not apply to this workload. So we can't pre-verify that the workload configuration override refers to real, accessible linux kernel and rootfs images or installed CNI plugins.
In short, this feature would be "use at your own risk".
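The override-plus-warning behavior described above could be sketched like this. `MachineConfig`, its fields, and `effectiveConfig` are hypothetical names for illustration; nex's actual machine template has its own shape.

```go
package main

import "log"

// MachineConfig stands in for the firecracker machine template fields
// that nex reads from --config.
type MachineConfig struct {
	KernelPath string
	RootFSPath string
	VcpuCount  int
	MemSizeMib int
}

// effectiveConfig applies a per-workload override on top of the node
// template. Zero-valued override fields fall back to the template, and
// any override triggers the advanced-feature warning.
func effectiveConfig(template MachineConfig, override *MachineConfig) MachineConfig {
	if override == nil {
		return template
	}
	log.Println("warning: per-workload machine config override in use; preflight validation does not apply and a cold (non-pooled) VM will be started")
	out := template
	if override.KernelPath != "" {
		out.KernelPath = override.KernelPath
	}
	if override.RootFSPath != "" {
		out.RootFSPath = override.RootFSPath
	}
	if override.VcpuCount > 0 {
		out.VcpuCount = override.VcpuCount
	}
	if override.MemSizeMib > 0 {
		out.MemSizeMib = override.MemSizeMib
	}
	return out
}
```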
No response
When we run `node up`, we should be able to specify that we want to run in interactive mode. The default should continue to operate the way it does now, just emitting logs to stdout (e.g. the production-ready mode).
We should be able to opt into interactive mode (e.g. with an `--interactive` flag on `node up`). This would give us a full-screen TUI that lets us navigate around the various control operations for that specific node, as well as view the logs and other important information.
When developing workloads and functions for Nex, I want to be able to have more easy interaction with my development node server. I imagine myself being able to use arrow keys to navigate through the list of namespaces in use and the running workloads within those, and to be able to view and tweak whatever configuration I might like.
This lets me have my code in one window and an easy and fun to use terminal window for the node.
No response
ARM architecture support is required to use Nex on `a1.metal`, the cheapest AWS EC2 bare-metal instance type that does not use x86_64.
aws ec2 a1.metal
No response
Right now the OpenTelemetry tracing support is limited to just the execution trigger handler on functions. We want to expand this so that we can create child spans that maintain the appropriate parent for all of the host services, including kv, object store, and messaging.
We want to see child spans every time a function utilizes a host service capability
No response
In the current implementation, the only time you can specify things like rate limiters is in the `machineconfig` template that is defined in the node configuration (which is really just a firecracker rate limiter JSON definition). We should be able to provide more constraints around the node limits as well as be able to override some or all of the machineconfig template in the individual workload.
Note that when we do allow overriding the firecracker constraints on a per-workload basis, we will have to document the caveat that if you override the machine config template for a given workload, that workload will be started from scratch rather than using a warm VM from the pool.
Incomplete list of Node limitations we should probably support:
The above limitations can be specified either per node (within a single `nex` process) or per namespace. In the latter case, e.g., the max machines deployed number will be enforced per namespace rather than for the entire node as a whole.
Overridable Limits that can be supplied with an individual workload:
When we emit the `workload_deployed` event, we should include the effective limitations, which should be representative of the actual limitations used in the firecracker definition for that VM.
In the current version, `nex-node preflight` will tell you when you're missing the CNI plugins, your kernel, and your rootfs. We should publish "known good" versions of the CNI plugins, a linux kernel, and a default rootfs so that `preflight` can download them all to help people out and potentially make it easier for ops teams to manage a nex-node fleet.
Bonus: preflight could even check for newer versions of `nex-node` itself and update automatically if the user wants.
Since we're already in a monorepo and there's already a lot of shared code between the nex node and the nex CLI, we'd like to consolidate them. The node functionality would become a `node` sub-command, e.g. `nex node up` and `nex node preflight`.
Right now the `web/dist` folder is being generated locally and committed. I think a better way to do this might be to leave that directory uncommitted and run `pnpm install` and `pnpm build` in the nex-ui/web directory prior to building the binary.
Limited-environment platforms such as Cloudflare Workers are suitable for the rapid prototyping needs of startups, but realistically they are difficult to adopt in companies. If you could run an image of a personally created service on the NATS edge, I think Nex would be a really good execution engine. I would like to ask whether there is a plan for this and how it is prioritized.
(This was written using a translator, so please excuse any awkward phrasing.)
run an image created at the NATS edge
No response
The Nex binary already embeds an internal NATS server for private use. It would be cool if we could also have the Nex server start a "regular" (aka userland) NATS server by passing a configuration file.
When developing locally it might be more convenient if the Nex node started its own NATS server. Also, on constrained devices it might be easier to deploy just the `nex` binary rather than both `nex` and `nats`.
The current implementation uses the difference between now and the original deploy/launch time (`now - deploy time`) to determine the `uptime` value for a given workload. This value has the potential to confuse people. By changing to `runtime`, that value would have the same meaning for both functions and services/elf, though calculated differently.
Having runtime be an accumulator for functions also makes for a nice prometheus metric.
We should also continue to maintain the original deployment time and make that available to control client queries, as it might be really useful to know that a function was deployed 3 weeks ago and has only accrued 12 minutes of total execution time.
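The split between accumulated runtime and deployment time could be sketched like this. `workloadClock` and its methods are hypothetical names for illustration, not the nex API.

```go
package main

import "time"

// workloadClock tracks both the original deployment timestamp and the
// accumulated execution time, which are reported separately.
type workloadClock struct {
	deployedAt time.Time
	accrued    time.Duration
}

// recordExecution adds one function execution's duration; a
// long-running service would instead add elapsed wall-clock time.
func (w *workloadClock) recordExecution(d time.Duration) {
	w.accrued += d
}

// runtimeValue means the same thing for functions and services: total
// time actually spent executing, not now - deploy time.
func (w *workloadClock) runtimeValue() time.Duration { return w.accrued }

// deployedAgo preserves the original deployment age for control
// client queries.
func (w *workloadClock) deployedAgo(now time.Time) time.Duration {
	return now.Sub(w.deployedAt)
}

// Small helpers so callers don't need the time package directly.
func seconds(n int) time.Duration { return time.Duration(n) * time.Second }
func atUnix(s int64) time.Time    { return time.Unix(s, 0) }
```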
The nex-node calls `awaitHandshake` each time a new firecracker VM is started. The timeout used by `awaitHandshake` should be based on the workload type (e.g., OCI requests may need more time).
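Per-type timeouts could be selected with something as simple as this. The type names and durations are assumptions for illustration; only the idea of a workload-type-dependent handshake deadline comes from the issue.

```go
package main

import "time"

// handshakeTimeout picks the awaitHandshake deadline from the
// workload type.
func handshakeTimeout(workloadType string) time.Duration {
	switch workloadType {
	case "oci":
		// OCI images may need to be pulled and unpacked before the
		// agent can answer, so give them more headroom.
		return 30 * time.Second
	case "wasm", "v8":
		return 5 * time.Second
	default: // native elf binaries and anything unrecognized
		return 10 * time.Second
	}
}
```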
In the initial implementation of function support, the functions have no access to side-effects (I/O). This doesn't matter for the native elf binaries because they are free to create sockets and perform all kinds of activities that functions can't.
To make functions practical and useful, we want to give them access to a suite of "host services", similar to the functionality sets provided in "cloud function SDKs", e.g. AWS SDK, Azure, Google, CloudFlare, etc.
This issue covers implementing host services for JavaScript functions and hopefully scoping out the work it will take to make similar enhancements to WebAssembly workloads. Implementing host services involves a few core pieces:
The following is a list of services to be exposed to functions.
Each function-type workload will have access to a single key-value bucket. By convention, this bucket will be named `{namespace}-{workload}`, where `namespace` is the namespace in which the workload is deployed (defaults to `default`) and `workload` is the name of the workload. For example, if a workload called `echofunction` is deployed in the `default` namespace, it will be given access to the `default-echofunction` key-value store.
The key-value store will only be created on demand, and so functions that never make KV requests will not consume unnecessary resources.
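The naming convention above reduces to a small pure function (the helper name is hypothetical; the convention itself is from this issue):

```go
package main

import "fmt"

// hostKVBucketName applies the {namespace}-{workload} convention,
// with the namespace defaulting to "default" when unset.
func hostKVBucketName(namespace, workload string) string {
	if namespace == "" {
		namespace = "default"
	}
	return fmt.Sprintf("%s-%s", namespace, workload)
}
```

With nats.go, the on-demand creation would then presumably be a lazy `js.CreateKeyValue(&nats.KeyValueConfig{Bucket: hostKVBucketName(ns, wl)})` on the first KV request, so idle functions never allocate a bucket.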
Operations available:
- `Get`
- `Set`
- `Delete`
- `Keys` (this can fail beyond some upper limit and/or require some form of paging)

History or revision control is not available in host key-value services.
Object stores are also dynamically allocated on demand and follow the same naming convention as the key-value stores. Therefore a workload named `echofunction` deployed in the `default` namespace will be provisioned an object store `default-echofunction`. Just like key-value buckets, the Nex node configuration can contain limits and constraints for buckets, or people can even pre-create these buckets, in which case Nex will just reuse the existing resource.
If the developer wants multi-level or hierarchical functionality, they can implement that by using their own tokenization / prefix scheme.
Operations available:
- `PutChunk`
- `GetChunk`
- `Delete`
- `Keys` (this can fail beyond some upper limit and/or require some form of paging)

There is no implied revision list or history available. Delete here should be considered an operation that cannot be undone.
Functions have access to basic NATS messaging, using the connection proxied on behalf of the function by the host.
Operations available:
- `Publish`
- `Request`
- `RequestMany`

Functions cannot create subscriptions, nor can they interact with any other low-level aspects of the messaging system. Remember that functions "subscribe" by virtue of their configured triggers and not by executing code.
I feel like we might want to be able to provide a simple HTTP client proxy so that operators don't need to allow CNI configurations to let the v8 engines inside firecracker make outbound HTTP calls.
Operations available:
- `Request` (payload contains method, headers, etc.)

Functions will need access to a specific NATS server, but in real-world deployments we likely will not want them to be communicating with the Nex control plane (e.g. the topic space in which workload start/stop and node interaction commands exist). Rather, we want the functions to use a different NATS server. In other words, you don't want functions to be able to shut down workloads (unless you explicitly want that).
As such, Nex will examine the environment variables in the work request, and if a suite of `NATS_xxx` variables such as `NATS_URL` exist, it will use those to create a connection on behalf of the workload. Whenever host services requests are received from the agent on behalf of the workload, those requests will be satisfied using the proxy connection.
Note that if a start request for a function-type workload doesn't include NATS connection parameters, then the host will use the default (control plane) connection and will emit a warning in its log. We can add a "strict mode" to node configuration that will instead reject any request to deploy a function that doesn't have explicit connection values. This would be something we'd want to enable in production and leave off during the development cycle.
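The selection logic described above (explicit `NATS_URL` wins; otherwise fall back with a warning, or reject outright in strict mode) could be sketched like this. Function and variable names are illustrative, not the nex API.

```go
package main

import "errors"

var errStrictMode = errors.New("strict mode: function deploy request missing explicit NATS connection values")

// natsTargetURL returns the URL to use for the workload's proxied
// connection. warn is true when the node fell back to the control
// plane connection and should log a warning.
func natsTargetURL(env map[string]string, controlPlaneURL string, strict bool) (url string, warn bool, err error) {
	if u, ok := env["NATS_URL"]; ok && u != "" {
		// The deploy request supplied explicit connection parameters.
		return u, false, nil
	}
	if strict {
		// Production posture: never silently share the control plane.
		return "", false, errStrictMode
	}
	// Development convenience: reuse the control plane connection.
	return controlPlaneURL, true, nil
}
```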
This kind of separation is essential to be able to securely support multi-tenant workloads.
We are using the forthcoming WASI component model as the set of contracts between guest and host. We'll re-evaluate that in the coming months as, hopefully, the spec and tooling for components improve.
Allows for Nex to be built/released for non-Linux operating systems
The primary use case for this is people who want to remotely control/operate Nex nodes from machines that don't meet the node execution requirements but can still act as control clients. This could be either Windows or macOS.
Ideally, if you run `nex node ...` on non-Linux, you'd get an error message rather than the current behavior, which I believe fails to compile on macOS.
No response
Host services is the set of functionality provided to Nex workloads as a built-in convenience. For example, JavaScript functions get access to a key-value bucket, an object store, basic pub/sub, and an HTTP client.
This would extract host services client code into its own "package"
This would let code in long-running services utilize host services if they wanted to, giving people more built-in goodness when they use Nex deployments.
No response
In addition to trigger subjects, we should allow developers to specify the name of a pull consumer which will trigger the function each time that consumer pulls a message. For now, we can wrap the execution in an ack so if the function works, we ack, if the function fails, we nack. We can always add support for manual ack/nack to functions in the future if we want.
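The ack-on-success / nak-on-failure wrapping could look like this. Everything here is an illustrative sketch: `message` stands in for a `*nats.Msg` fetched from the named pull consumer, whose `Ack`/`Nak` methods the real code would call.

```go
package main

import "errors"

// message abstracts the parts of a pulled NATS message the trigger
// loop needs.
type message struct {
	data []byte
	ack  func() error
	nak  func() error
}

// errExecution is a sample failure a function handler might return.
var errExecution = errors.New("function execution failed")

// runTrigger executes the function against one pulled message: ack
// when it succeeds, nak when it fails so the server redelivers it.
func runTrigger(msg message, fn func([]byte) error) error {
	if err := fn(msg.data); err != nil {
		_ = msg.nak()
		return err
	}
	return msg.ack()
}
```

Manual ack/nak from inside the function would layer on top of this later by skipping the automatic calls.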
The current UI implementation is stubs and placeholders. The following features need to be "live" for the UI to be considered MVP:
These were all built and completed in the Elixir Phoenix prototype in under a week so hopefully the Go version won't take much longer.
I want to be able to use `nex ui` to make it easier and more comfortable to perform some common nex tasks like monitoring logs and events, discovering workloads, and starting (devrun) workloads.
No response
Support the ability to "probe" a nexus (cluster of nex nodes) for a given namespace. So instead of pinging and getting a reply from all machines, instead get a reply from each workload in a given namespace.
The ping and ping filtering system could use a bit of a refactor anyway, especially to accommodate another filter like this.
Probe for "everything running in this namespace".
No response
Run `nex update` and it will automatically update the `nex` binary.
For people who don't like having to remember what toolchain they used to install a given binary (was it brew? apt? nix? asdf? curl?), it just makes life so much easier if we can do `nex update` and have it update to the latest version. Note that this applies only to replacing the binary on disk and doesn't affect a running process.
No response
This PR moved the responsibility of validating workloads to the agent. This means validation now happens asynchronously.
nex-node needs to handle workload validations asynchronously and propagate failures to clients.
When a request comes in to stop a workload, we should send a message to the agent telling it to gracefully terminate its workload prior to invoking `stopVMM` on the firecracker machine.
In one of our simplest use cases, `echoservice`, if you use `devrun` to launch it and then use `devrun` again, the running workload is stopped and the new one is started. This is useful in the developer iteration loop, but if the workload isn't given a chance to gracefully shut down (explicitly drain and disconnect), then the NATS server to which `echoservice` connects will have a zombie subscription until the NATS server's idle client detection timer expires.
NOTE - The current workaround is to set the NATS server's client/subscription timeout to something very small so that developers don't experience the black-hole subscription. Once this change is implemented, a developer should be able to run `devrun` twice in a row with little to no noticeable overlap in subscriptions for `echoservice`.
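The proposed stop sequence could be sketched as follows. The function parameters are hypothetical stand-ins: `requestShutdown` models the NATS request to the agent and `stopVMM` models the firecracker SDK's `StopVMM` call; neither is the actual nex code.

```go
package main

import "time"

// stopWorkload gives the agent a bounded window to drain and
// disconnect its workload before the VM is destroyed, avoiding the
// zombie-subscription window described above. If the agent never
// confirms, the VM still stops, matching today's behavior as the
// fallback.
func stopWorkload(requestShutdown func() error, stopVMM func() error, grace time.Duration) error {
	done := make(chan error, 1)
	go func() { done <- requestShutdown() }()
	select {
	case <-done:
		// Agent confirmed (or failed); either way, proceed.
	case <-time.After(grace):
		// Agent unresponsive within the grace period; proceed anyway.
	}
	return stopVMM()
}

// seconds is a small helper so callers don't need the time package.
func seconds(n int) time.Duration { return time.Duration(n) * time.Second }
```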
This was reported by a user in the NATS slack
No response