
receptor's Introduction

Receptor


Receptor is an overlay network intended to ease the distribution of work across a large and dispersed collection of workers. Receptor nodes establish peer-to-peer connections with each other via existing networks. Once connected, the Receptor mesh provides datagram (UDP-like) and stream (TCP-like) capabilities to applications, as well as robust unit-of-work handling with resiliency against transient network failures.

See the readthedocs page for Receptor at:

https://ansible.readthedocs.io/projects/receptor/

Terminology and Concepts

  • Receptor: The Receptor application taken as a whole, that typically runs as a daemon.
  • Receptorctl: A user-facing command line used to interact with Receptor, typically over a Unix domain socket.
  • Netceptor: The networking part of Receptor. Usable as a Go library.
  • Workceptor: The unit-of-work handling of Receptor, which makes use of Netceptor. Also usable as a Go library.
  • Node: A single running instance of Receptor.
  • Node ID: An arbitrary string identifying a single node, analogous to an IP address.
  • Service: An up-to-8-character string identifying an endpoint on a Receptor node that can receive messages. Analogous to a port number in TCP or UDP.
  • Backend: A type of connection that Receptor nodes can pass traffic over. Current backends include TCP, UDP and websockets.
  • Control Service: A built-in service that usually runs under the name control. Used to report status and to launch and monitor work.

How to Get It

The easiest way to check out Receptor is to run it as a container. Images are kept on the Quay registry. To use this, run:

[docker|podman] pull quay.io/ansible/receptor
[docker|podman] run -d -v /path/to/receptor.conf:/etc/receptor/receptor.conf:Z receptor

Use as a Go library

This code can be imported and used from Go programs. The main libraries are:

  • Netceptor: the networking layer of Receptor (see Terminology above)
  • Workceptor: the unit-of-work layer, built on top of Netceptor

See the example/ directory for examples of using these libraries from Go.
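
For orientation, here is a minimal, hedged sketch of the dial side. It uses only names that appear elsewhere on this page (the netceptor.Netceptor type, its DialContext method, and the control service); constructing the Netceptor instance and attaching a backend is omitted, so treat it as an illustration rather than working documentation, and check the package godoc for the real API. The import path is taken from stack traces later on this page and may differ in current releases.

package example

import (
	"bufio"
	"context"

	"github.com/project-receptor/receptor/pkg/netceptor"
)

// dialControl opens a stream over an already-running mesh to the control
// service on a remote node and returns its greeting line.
func dialControl(ctx context.Context, nc *netceptor.Netceptor, node string) (string, error) {
	conn, err := nc.DialContext(ctx, node, "control", nil)
	if err != nil {
		return "", err
	}
	defer conn.Close()
	return bufio.NewReader(conn).ReadString('\n')
}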

Use as a command-line tool

The receptor command runs a Receptor node with access to all included backends and services. See receptor --help for details.

The command line is organized into entities which take parameters, like: receptor --entity1 param1=value1 param2=value1 --entity2 param1=value2 param2=value2. In this case we are configuring two things, entity1 and entity2, each of which takes two parameters. Distinct entities are marked with a double dash, and bare parameters attach to the immediately preceding entity.

Receptor can also take its configuration from a file in YAML format. The allowed directives are the same as on the command line, with a top-level list of entities and each entity receiving zero or more parameters as a dict. The above command in YAML format would look like this:

---
- entity1:
    param1: value1
    param2: value1
- entity2:
    param1: value2
    param2: value2

Python Receptor and the 0.6 versions

As of June 25th, this repo is the Go implementation of Receptor. If you are looking for the older Python version of Receptor, including any 0.6.x version, it is now located at https://github.com/project-receptor/python-receptor.

receptor's People

Contributors

aaronh88, amolgautam25, beeankha, chrismeyersfsu, dependabot[bot], dsavineau, eqrx, fosterseth, ghjm, john-westcott-iv, kdelee, knw257, kokes, kurokobo, matoval, oranod, resolutecoder, rooftopcellist, ryanpetrello, samccann, shanemcd, shyiko, simaishi, spredzy, squidboylan, sunidhi-gaonkar1, thenets, therealhaoliu, thom-at-redhat, willthames


receptor's Issues

Implement Receptor traceroute

If along with #13 we also arrange for a reject message to be sent when HopsToLive reaches zero, then combined with #149, we will have the elements needed to implement a traceroute function for Receptor.
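
For illustration, a hedged sketch of the resulting traceroute loop. The probe and reply plumbing are left as stand-in function parameters, since they depend on how #13 and #149 are implemented; only the HopsToLive name is taken from this issue.

package traceroute

import "fmt"

// Trace probes target with an increasing HopsToLive, printing the node that
// rejected each probe, until a probe survives long enough to reach the target.
// probe and wait are hypothetical stand-ins for the ping/reject machinery.
func Trace(target string, maxHops int,
	probe func(target string, hopsToLive int),
	wait func() (from string, reachedTarget bool)) {
	for ttl := 1; ttl <= maxHops; ttl++ {
		probe(target, ttl)
		from, done := wait()
		fmt.Printf("%2d  %s\n", ttl, from)
		if done {
			return
		}
	}
}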

Allow Listeners to set a cost per dialer

In the current implementation of connection costs, a listener sets a cost and a dialer must agree with it in order to connect. This restricts the flexibility of graph costs, because all dialers into a single listener must have the same cost. It's not hard to imagine a network in which you would want different dialers into the same listener to have different costs (for example, if they have different latencies to the same node).

There are reasons to have some negotiation of cost, for example it gives the listener some control over preventing misbehaving dialers from assigning a very cheap cost and everything getting routed through them.

A proposed solution is for the listener to have a map of minimum costs per dialer; this gives some control to the listener while allowing the dialers to set their own costs, as sketched below.
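
A hedged sketch of that idea; the type and method names here are illustrative only, not existing Receptor code.

package costs

import "fmt"

// ListenerCosts holds a default floor plus optional per-dialer floors.
type ListenerCosts struct {
	DefaultMinCost float64
	NodeMinCost    map[string]float64 // per-dialer minimum costs
}

// Accept returns the cost to use for a dialer, or an error if the dialer
// proposed something below the listener's configured floor.
func (lc ListenerCosts) Accept(dialerID string, proposed float64) (float64, error) {
	floor := lc.DefaultMinCost
	if c, ok := lc.NodeMinCost[dialerID]; ok {
		floor = c
	}
	if proposed < floor {
		return 0, fmt.Errorf("dialer %s proposed cost %.2f, below the minimum %.2f", dialerID, proposed, floor)
	}
	return proposed, nil
}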

Tests

As it stands now, the Go Receptor test frameworks are as follows:

  • Unit testing: None
  • Functional testing: None
  • Integration testing: None
  • Performance testing: None

This should change.

Fix Netceptor Shutdown()

The Netceptor Shutdown function currently hangs. With non-blocking sends to the shutdown channels it doesn't hang, but it also doesn't fully clean up all of the resources, particularly TCP ports. I think this is because one of our shutdown signals shuts down a TCP session, but we don't send a signal to close the listening port.

See #26 for some initial thoughts about this.

Documentation

There should be documentation for at least the following use cases:

  • End users wanting to use this as a CLI
  • Go developers wanting to use this as a library
  • Python developers wanting to write worker plugins
  • Python developers wanting to write controller apps

Receptorctl release doesn't actually release multiple units

$ receptorctl work release `receptorctl work list -q`
Released: ('OsSEqzKV', 'XUFsuGUz', 'j0Wg4qey', 'kcKoFZl9', 'qgGTCqFp')
$ receptorctl work release `receptorctl work list -q`
Released: ('XUFsuGUz', 'j0Wg4qey', 'kcKoFZl9', 'qgGTCqFp')
$ receptorctl work release `receptorctl work list -q`
Released: ('j0Wg4qey', 'kcKoFZl9', 'qgGTCqFp')

It claims to have released multiple units, but it only actually releases the first one.

This is probably happening with receptorctl cancel as well.

Generate x509 certs on the fly in our tests

Currently we have some certs for testing stored in the repo. It would make the test suite more flexible if we could generate them on the fly; I didn't do that initially because I couldn't decide how to represent it in our YAML mesh definitions. What might make sense is to add a generatecerts field to the YAML mesh definition, then store the related certs/tls config in the Node object (certs can technically be per backend, but per-node seems like a realistic usage).
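
As a starting point, a hedged sketch of generating a throwaway self-signed certificate at test time with the standard library. A real implementation would likely generate a CA and sign per-node certificates from it; the subject, lifetime, key type, and serial number here are arbitrary placeholder choices.

package certs

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"math/big"
	"time"
)

// selfSignedCert generates a short-lived, self-signed certificate for a node,
// suitable only for tests.
func selfSignedCert(nodeID string) (tls.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return tls.Certificate{}, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1), // fixed serial is fine for a throwaway test cert
		Subject:      pkix.Name{CommonName: nodeID},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(24 * time.Hour),
		DNSNames:     []string{nodeID},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return tls.Certificate{}, err
	}
	return tls.Certificate{Certificate: [][]byte{der}, PrivateKey: key}, nil
}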

receptor -h doesn't show param types for maps

receptor -h

e.g.

--work-python
        ..
        config=<>: Plugin-specific configuration settings

and

--tcp-listener
      ..
      nodecost=<>: Connection cost (weight) for each node

These have types map[string]interface{} and map[string]float64, respectively.
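
For illustration, a hedged sketch of deriving a more useful placeholder from the parameter's reflect type, so the help output could show something like config=<string=value> instead of config=<>. The function name and output format are made up for this example.

package help

import (
	"fmt"
	"reflect"
)

// mapPlaceholder renders a help-text placeholder describing the key and
// value types of a map-typed parameter.
// e.g. mapPlaceholder(map[string]float64{}) returns "<string=float64>".
func mapPlaceholder(v interface{}) string {
	t := reflect.TypeOf(v)
	if t == nil || t.Kind() != reflect.Map {
		return "<>"
	}
	return fmt.Sprintf("<%s=%s>", t.Key(), t.Elem())
}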

Hash not found

After repeatedly disconnecting and reconnecting nodes, you get the following error:

ERROR 2020/07/14 11:53:48 Error translating data to message struct: hash not found

This indicates that we received a data message from a node for which we've never received a routing message, so we can't translate the address hash to a node ID string.

This might just be normal for a reconnecting node, in which case we might want to avoid reporting it as an error.

Authentication for back-ends and services

Currently, back-ends and services can be set up with mutual TLS, including client certificates. This may be insufficient for multi-tenant/security-conscious environments, and at the same time may be too heavyweight for lighter environments. We should consider whether there should be other authentication mechanisms, like perhaps PSK.

Remote cancel and release in Workceptor

Right now the Workceptor work cancel and work release commands are not aware of remote jobs. Cancelling or releasing a job created by work submit will only cancel or release the local monitor associated with the job - it will keep running, and its files will continue to exist, on the remote node. The cancel and release commands should be extended to operate on the remote node. (One issue with this is defining what the behavior should be if the remote node is unreachable.)

Duplicate nodes can exist

Right now there is nothing preventing two nodes from joining the network with the same node ID, leading to who-knows-what routing confusion and destabilization of the network.

At first I thought we could do admission control so that a duplicate node ID gets rejected the same as if it wasn't in the allowed-connections list. This turns out to be difficult. First of all, in the current design it's surprisingly hard to tell if a given routing update is really a new node, or an old node reconnecting after an outage.

Second, even if it could be done reliably, admissions control by a neighbor is not actually good enough. If two same-named nodes join the network far apart from each other, and are accepted by their neighbors (who have, as yet, no reason to reject them), we still wind up with duplicate nodes.

Then I thought we could do duplicate node detection after the fact, by having nodes look for a situation where the same node ID exists at two different places on the network. But there are problems with this as well. First, how do you know if seeing a distant route to the same node ID means there's a duplicate, vs. the "real" node just having made a second connection to a distant peer? Second, even if you do detect a duplicate, how do you know which one should shut down?

So, this turns out to be a difficult problem that will take some thought.

Add test to make sure TLS is really using TLS

Add a test, perhaps using openssl s_client -connect, to verify that the TCP and websockets backends are really using TLS.

Ideally, also add a similar test for Receptor-level TLS (one Receptor user talking to another via netceptor.Conn), either by adding a TCP proxy and using openssl per above, or by using the control socket connect.
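
For the backend-level case, the same check can be approximated in Go with crypto/tls instead of shelling out to openssl s_client; a hedged sketch, where the address is a placeholder and the assertions are minimal:

package tests

import (
	"crypto/tls"
	"testing"
)

// TestTCPBackendSpeaksTLS dials a TLS-configured TCP backend with a plain
// TLS client and fails if no handshake completes.
func TestTCPBackendSpeaksTLS(t *testing.T) {
	conn, err := tls.Dial("tcp", "localhost:2222", &tls.Config{
		InsecureSkipVerify: true, // we only care that a TLS handshake happens
	})
	if err != nil {
		t.Fatalf("TLS handshake failed: %v", err)
	}
	defer conn.Close()
	if len(conn.ConnectionState().PeerCertificates) == 0 {
		t.Fatal("no peer certificate presented")
	}
}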

Jobs stuck in pending even when remote node comes back online

Steps to reproduce,

  1. start local control node (foo.yml)
  2. start remote node (bar.yml) and allow connections to establish
  3. kill bar node
  4. on foo:control, work submit bar 100cat
  5. job will start in pending state as expected
  6. bring bar back up
  7. work remains in pending and never starts
  8. If you run step 4 again, a new job will start and run successfully.

It does work properly if you skip steps 2 and 3. If you just start foo, then submit the job, then bring up bar (for the first time), the work will start running.

Note,

work submit calls connectToRemote(), and I've noticed a hang occurs in netceptor/conn.go at:

qc, err := quic.Dial(pc, rAddr, s.nodeID, tls, cfg)

foo.yml

---
- node:
    id: foo
    allowedpeers: bar,fish

- control-service:
    service: control
    filename: /tmp/foo.sock

- tcp-listener:
    port: 2222

bar.yml

---
- node:
    id: bar

- tcp-peer:
    address: localhost:2222
    redial: true

- control-service:
    service: control
    filename: /tmp/bar.sock

- work-command:
    service: 100cat
    command: bash
    params: "-c \"seq 100 | /home/sfoster/sockceptor/sleepcat.sh\""

Review library licenses & status

We should make sure the licenses for all libraries we're using are compatible with an Apache licensed project, and also make sure we aren't using anything that looks like it's unmaintained or abandoned.

Write receptorctl

We should add a receptorctl command-line tool that interacts with Receptor control sockets through Unix sockets and maybe also through TCP. This should have command line options that mirror the socket options - status, ping, work submit, work results, etc.

I am thinking this should be written in Python, and be importable as a Python library. This would provide a replacement for the controller interface in Python. I don't think we can or should try to make this have the same API that Python Receptor did.

Fix concurrent read and writes on listenerRegistry

When testing the ping command using a control socket I get the following error:

fatal error: concurrent map read and map write

goroutine 3364 [running]:
runtime.throw(0xb985a7, 0x21)
        /usr/local/go/src/runtime/panic.go:1116 +0x72 fp=0xc001a134a0 sp=0xc001a13470 pc=0x4364a2
runtime.mapaccess2_faststr(0xabbe60, 0xc0001ab3e0, 0xb8985c, 0x7, 0x1146360, 0x0)
        /usr/local/go/src/runtime/map_faststr.go:116 +0x47c fp=0xc001a13510 sp=0xc001a134a0 pc=0x414bbc
github.com/project-receptor/receptor/pkg/netceptor.(*Netceptor).handleMessageData(0xc0000fe900, 0xc001b00de0, 0xc001a135c8, 0x5)
        /home/squid/work/project-receptor/receptor/pkg/netceptor/netceptor.go:805 +0x128 fp=0xc001a13568 sp=0xc001a13510 pc=0x8d6558
github.com/project-receptor/receptor/pkg/netceptor.(*Netceptor).sendMessage(0xc0000fe900, 0xc0003244b8, 0x8, 0xc000029998, 0x5, 0xb8985c, 0x7, 0xc001396c00, 0x1c, 0x5ac, ...)
        /home/squid/work/project-receptor/receptor/pkg/netceptor/netceptor.go:623 +0x2f2 fp=0xc001a13628 sp=0xc001a13568 pc=0x8d49a2
github.com/project-receptor/receptor/pkg/netceptor.(*PacketConn).WriteTo(0xc001b00000, 0xc001396c00, 0x1c, 0x5ac, 0xc7c220, 0xc001872bc0, 0x1119760, 0xb06f00, 0xc000dff0c0)
        /home/squid/work/project-receptor/receptor/pkg/netceptor/packetconn.go:87 +0xb1 fp=0xc001a13698 sp=0xc001a13628 pc=0x8d9701
github.com/lucas-clemente/quic-go.(*conn).Write(0xc0015fe980, 0xc001396c00, 0x1c, 0x5ac, 0x1, 0x0)
        /home/squid/go/pkg/mod/github.com/lucas-clemente/[email protected]/conn.go:27 +0x64 fp=0xc001a136f0 sp=0xc001a13698 pc=0x89de14
github.com/lucas-clemente/quic-go.(*sendQueue).Run(0xc00193af30, 0x0, 0x0)
        /home/squid/go/pkg/mod/github.com/lucas-clemente/[email protected]/send_queue.go:37 +0x1a4 fp=0xc001a137b0 sp=0xc001a136f0 pc=0x8acdb4
github.com/lucas-clemente/quic-go.(*session).run.func1(0xc0011a38c0)
        /home/squid/go/pkg/mod/github.com/lucas-clemente/[email protected]/session.go:496 +0x2f fp=0xc001a137d8 sp=0xc001a137b0 pc=0x8c817f
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1373 +0x1 fp=0xc001a137e0 sp=0xc001a137d8 pc=0x4685f1
created by github.com/lucas-clemente/quic-go.(*session).run
        /home/squid/go/pkg/mod/github.com/lucas-clemente/[email protected]/session.go:495 +0x118

This seems to be caused by a concurrent read and write on the netceptor's listenerRegistry field. I'm not sure exactly what the other function is that is writing to the map while the function above reads it, but we probably want locks in any case.

I was able to pretty consistently reproduce this using my support_cli_meshes branch and running go test ./... -test.v -parallel=1 -run=Startup
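
For illustration, protecting a shared map with a sync.RWMutex looks roughly like this; the type and field names below are stand-ins, not the actual Netceptor internals.

package registry

import "sync"

// registry is a sketch of a lock-protected service listener map.
type registry struct {
	lock      sync.RWMutex
	listeners map[string]interface{}
}

// get takes the read lock, so many readers can proceed concurrently.
func (r *registry) get(service string) (interface{}, bool) {
	r.lock.RLock()
	defer r.lock.RUnlock()
	l, ok := r.listeners[service]
	return l, ok
}

// set takes the write lock, excluding readers and other writers.
func (r *registry) set(service string, l interface{}) {
	r.lock.Lock()
	defer r.lock.Unlock()
	if r.listeners == nil {
		r.listeners = make(map[string]interface{})
	}
	r.listeners[service] = l
}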

DTLS for UDP / receptor packetconn

There is a Go DTLS implementation at github.com/pion/dtls. This provides encryption for datagram-oriented protocols like UDP and Receptor PacketConn connections. If this is easy to add to Receptor then it seems like a nice capability to have.

Ncat: Connection refused error after restarting control service

  1. start node with ./receptor -c foo.yml
  2. nc -U /tmp/foo.sock > work start 100cat
  3. sigterm node to kill it
  4. start node again with ./receptor -c foo.yml
  5. nc -U /tmp/foo.sock > error Ncat: Connection refused

Step 2 (a long-running task) seems to be necessary. If I run other commands like status or work list, or start a short-running job like work start sleepcat with a small payload, then the error in step 5 doesn't occur.

foo.yml

---
- node:
    id: foo
    allowedpeers: bar,fish

- control-service:
    service: control
    filename: /tmp/foo.sock

- tcp-listener:
    port: 2222

- work-command:
    service: 100cat
    command: bash
    params: "-c \"seq 100 | /home/sfoster/sockceptor/sleepcat.sh\""

"Connection write error" when starting work

ERROR 2020/08/07 15:29:27 Connection write error: write unix /tmp/foo.sock->@: use of closed network connection
ERROR 2020/08/07 15:29:27 Not all bytes written

These log messages show up on the control node when submitting a work start or work submit command.

However, the job seems to run fine despite the error messages.

foo.yml

---
- node:
    id: foo
    allowedpeers: bar,fish

- control-service:
    service: control
    filename: /tmp/foo.sock

- tcp-listener:
    port: 2222

- work-command:
    service: 100cat
    command: bash
    params: "-c \"seq 100 | /home/sfoster/sockceptor/sleepcat.sh\""

First I removed /tmp/receptor and /tmp/foo.sock, then ran the following:
./receptor -c foo.yml
nc -U /tmp/foo.sock > work start 100cat

The error shows up in the log at this point.

I am synced with devel.

Allow remote work to specify remote control service TLS configuration

Remote work submission and monitoring currently assumes that the remote node's control service will always be listening on the name control (see #15) and will never have TLS configured. However, we want it to be possible to configure a node to enforce client certificate authentication before allowing access to the control service.

The difficulty here is that we might need to select different client TLS configurations based on which remote node we're connecting to. Perhaps this could be a new section in the config file that takes a map of node names to TLS client configs.

Singleton enforcement in cmdline

It should be possible to mark a cmdline object as a singleton at registration time, so that the library will only ever create one object. Perhaps this object should always be created with default parameters, even if it is never referred to on the command line or in a config file.

To be able to set some parameters in a config file and others on the command line, perhaps the semantics should be that multiple invocations just add parameters to the same object. Perhaps it should be an error to attempt to set the same parameter twice. (Or perhaps it's okay to just let the last writer win.)

Exclusive objects should have priority over singletons. If there is an exclusive object, no singletons should be created.

Advertise things other than service listeners

Workers defined by work-command do not listen on a service, and so they are not advertised to the network. We should add an Advertise() method to Netceptor that adds an item to the advertisement broadcasts like ListenAndAdvertise(), but without needing a service listener. This probably also implies changing the ServiceAdvertisement struct so that a recipient can tell what kind of non-service thing is being advertised.

Kubernetes worker does not preserve state on Receptor restart

If you launch a Kubernetes job and then restart Receptor, it loses track of the job - stdout is never made available, the job is not cancellable, etc. To fix this we need to do a rewrite of Workceptor to allow storing worker-specific data in the status file, most notably the pod ID. This Workceptor rewrite is also likely needed for #78.

Add Python worker plugin back-end

We should add a back-end that allows the use of already-existing Python worker plugins.

How I imagine this would work:

  • There is a Python app (the "shim") that gets called as a Receptor worker process.
  • The shim would take a parameter that identifies a worker plugin. It would still find the entrypoint through setuptools, but all installed Python modules would not automatically be run as Receptor services. You would have to enable each one in the Receptor config file or command line.
  • The config file already has a key-value section for plugin configuration. This would need to be adapted to be able to supply the config settings that current plugins are expecting.
  • Adding to the results queue would produce Receptor stdout, probably in newline-delimited-JSON format.

hang in connectToRemote

connectToRemote() in remote_work.go hangs. I've added debug statements:

func (rw *remoteUnit) connectToRemote(ctx context.Context) (net.Conn, *bufio.Reader, error) {
	rw.statusLock.RLock()
	red, ok := rw.status.ExtraData.(*remoteExtraData)
	if !ok {
		rw.statusLock.RUnlock()
		return nil, nil, fmt.Errorf("remote ExtraData missing")
	}
	node := red.RemoteNode
	rw.statusLock.RUnlock()
	logger.Debug("entering DialContext")
	conn, err := rw.w.nc.DialContext(ctx, node, "control", nil)
	logger.Debug("exiting DialContext")
	if err != nil {
		return nil, nil, err
	}
	logger.Debug("get new reader")
	reader := bufio.NewReader(conn)
	logger.Debug("will read hello from reader")
	hello, err := reader.ReadString('\n')
	if err != nil {
		return nil, nil, err
	}
	logger.Debug("did read hello from reader")
	if !strings.Contains(hello, node) {
		return nil, nil, fmt.Errorf("while expecting node ID %s, got message: %s", node,
			strings.TrimRight(hello, "\n"))
	}
	logger.Debug("successfully connected to remote")
	return conn, reader, nil
}

Here is the log output from a test failure (note the lines marked with *):

INFO 2020/09/03 11:46:12 Known Connections:
INFO 2020/09/03 11:46:12    node1: node2(1.00) 
INFO 2020/09/03 11:46:12    node2: node1(1.00) node3(1.00) 
INFO 2020/09/03 11:46:12    node3: node2(1.00) 
INFO 2020/09/03 11:46:12 Routing Table:
INFO 2020/09/03 11:46:12    node2 via node2
INFO 2020/09/03 11:46:12    node3 via node2
* DEBUG 2020/09/03 11:46:13 attempting to connect to remote node
* DEBUG 2020/09/03 11:46:13 entering DialContext
DEBUG 2020/09/03 11:46:14 Sending service advertisement: &{node1 control 2020-09-03 11:46:14.071460719 -0400 EDT m=+5.565684937 map[]}
DEBUG 2020/09/03 11:46:14 Received service advertisement &{0xc0000bad00 false}
DEBUG 2020/09/03 11:46:16 Received service advertisement &{0xc0000bad40 false}
DEBUG 2020/09/03 11:46:18 Sending routing update. Connections: node2(1.00)
DEBUG 2020/09/03 11:46:18 Received routing update from node3 via node2
DEBUG 2020/09/03 11:46:21 Received routing update from node2 via node2
* DEBUG 2020/09/03 11:46:26 exiting DialContext
* DEBUG 2020/09/03 11:46:26 get new reader
* DEBUG 2020/09/03 11:46:26 will read hello from reader
DEBUG 2020/09/03 11:46:28 Sending routing update. Connections: node2(1.00)
DEBUG 2020/09/03 11:46:28 Received routing update from node3 via node2
DEBUG 2020/09/03 11:46:31 Received routing update from node2 via node2

As you can see, it hangs on the line hello, err := reader.ReadString('\n') (the logs continue for another 45 seconds before termination).

I think we need a way to time this out and retry connectToRemote; one possibility is sketched below.
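
A hedged sketch of one way to bound the read: put a deadline on the connection before reading the hello line, so a stalled stream surfaces as an error the caller can retry on. Whether the connection returned by DialContext honors read deadlines exactly like a plain net.Conn would need to be confirmed; the timeout value is arbitrary.

package workceptorsketch

import (
	"bufio"
	"net"
	"time"
)

// readHello reads the remote control service's hello line, but with a read
// deadline so a stalled stream returns an error instead of hanging forever.
// The caller can then retry connectToRemote.
func readHello(conn net.Conn, timeout time.Duration) (string, error) {
	if err := conn.SetReadDeadline(time.Now().Add(timeout)); err != nil {
		return "", err
	}
	defer conn.SetReadDeadline(time.Time{}) // clear the deadline afterwards
	return bufio.NewReader(conn).ReadString('\n')
}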

Fix loading a listener cost from a config file

When specifying a cost in the cost or nodecost parameter on a listener, if you specify the value without a decimal, for example 1 instead of 1.0, the config loader fails:

squid@localhost:~/work/project-receptor/receptor$ cat foo.yaml
---
- node:
    id: foo
- log-level:
    level: debug
- tcp-listener:
    port: 10000
    cost: 2
- control-service:
    filename: foo.sock
- trace
squid@localhost:~/work/project-receptor/receptor$ ./receptor --config foo.yaml
panic: reflect.Set: value of type int is not assignable to type float64

goroutine 1 [running]:
reflect.Value.assignTo(0x14b8a20, 0xc00028e510, 0x82, 0x16f12d4, 0xb, 0x1495ce0, 0x0, 0x1495ce0, 0x14b8a20, 0x1793500)
        /usr/local/go/src/reflect/value.go:2411 +0x426
reflect.Value.Set(0x1495ce0, 0xc000313568, 0x18e, 0x14b8a20, 0xc00028e510, 0x82)
        /usr/local/go/src/reflect/value.go:1539 +0xbd
github.com/project-receptor/receptor/pkg/cmdline.setValue(0xc0001c77e0, 0x14b8a20, 0xc00028e510, 0xc0001c77e0, 0x0)
        /home/squid/work/project-receptor/receptor/pkg/cmdline/cmdline.go:194 +0xec2
github.com/project-receptor/receptor/pkg/cmdline.loadConfigFromFile(0x7ffd2d0a65b6, 0x8, 0x7ffd2d0a65b6, 0x8, 0xc0001b9dc8, 0x203000, 0x40deb6)
        /home/squid/work/project-receptor/receptor/pkg/cmdline/cmdline.go:448 +0x551
github.com/project-receptor/receptor/pkg/cmdline.ParseAndRun(0xc00011c160, 0x2, 0x2)
        /home/squid/work/project-receptor/receptor/pkg/cmdline/cmdline.go:517 +0x104f
main.main()
        /home/squid/work/project-receptor/receptor/cmd/receptor.go:60 +0x33c
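
For illustration, a hedged sketch of the kind of coercion pkg/cmdline's setValue could do before calling reflect.Set; the function and its checks here are illustrative, not the existing code.

package cfgsketch

import (
	"fmt"
	"reflect"
)

// assign sets field to value, converting the value first when the types
// differ but are convertible (e.g. an int from YAML into a float64 field).
// field must come from an addressable value, e.g. reflect.ValueOf(ptr).Elem().Field(i).
// A real implementation would probably restrict which kind pairs may convert.
func assign(field reflect.Value, value interface{}) error {
	v := reflect.ValueOf(value)
	if v.Type() != field.Type() {
		if !v.Type().ConvertibleTo(field.Type()) {
			return fmt.Errorf("cannot use %s as %s", v.Type(), field.Type())
		}
		v = v.Convert(field.Type()) // e.g. int -> float64
	}
	field.Set(v)
	return nil
}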

Consider making the name of the 'control' service non-configurable

Workceptor remote job submission assumes that the remote node always has a Receptor control service and that this service is listening as 'control'. We should either make this name non-configurable, or update workceptor to take the remote service name as a parameter. I think the former is probably easier. The default should probably be that Receptor nodes always run a control service, perhaps with a command-line parameter like no-control-service to suppress it.

Network packets should have TTL

The Receptor routing algorithm is resistant to loops, but if we did somehow develop a routing loop, packets would go round it endlessly, consuming all bandwidth. We should add a mechanism for limiting the lifespan of packets in this situation.
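
A hedged sketch of what a hop-limit check at forwarding time could look like; the message struct here is hypothetical, and only the HopsToLive name is borrowed from the traceroute issue above.

package ttlsketch

import "errors"

// messageData is a stand-in for whatever per-packet routing header carries
// the hop limit.
type messageData struct {
	HopsToLive int
	// ... other routing fields
}

var errExpired = errors.New("message hop limit exceeded")

// forward decrements the hop count and reports whether the packet may be
// relayed further. At zero it should be dropped (and, per the traceroute
// issue above, a reject could be sent back to the origin).
func forward(msg *messageData) error {
	if msg.HopsToLive <= 0 {
		return errExpired
	}
	msg.HopsToLive--
	return nil
}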

Add logging and clean up debug output

Right now there are a lot of calls to debug.Printf throughout the application. These should be changed to use the Go logger, and some thought should be put into how much output, if any, should be produced when not running in debug mode.

copying to stdout breaks with very short duration jobs

- work-command:
    service: echosleepshort
    command: bash
    params:  "-c \"for i in {1..5}; do echo $i;done\""
  • Start 2 nodes, foo.yml and bar.yml
  • nc -U /tmp/foo.sock
  • work submit bar echosleepshort

job completes with success, but the following errors appear.

error on bar ERROR 2020/08/28 18:09:20 Write error in control service: stream 0 was reset with error code 499

error on foo ERROR 2020/08/28 18:09:20 Error copying to stdout file /tmp/receptor/foo/bSEyWDfT/stdout: Read on stream 0 canceled with error code 499

This breaks work results {workID}, of course, since the stdout file is never copied.

I think it's a race condition, because if I change command to echo $i; sleep 1 then everything works fine.

foo.yml

---
- node:
    id: foo
    allowedpeers: bar
- control-service:
    service: control
    filename: /tmp/foo.sock
- tcp-listener:
    port: 2222

bar.yml

---
- node:
    id: bar
    allowedpeers: foo
- tcp-peer:
    address: localhost:2222
    redial: true
- work-command:
    service: echosleepshort
    command: bash
    params:  "-c \"for i in {1..5}; do echo $i;done\""

Websockets library

The websockets back-end is currently using the golang.org/x/net/websocket library. This library's godoc page has a message "this package currently lacks some features found in alternative and more actively maintained WebSocket packages", and it lacks features we care about, most significantly proxy server support. It also looks like it's not easy to run a TLS server with this library.

The godoc page refers to two other libraries, github.com/gorilla/websocket and nhooyr.io/websocket. We should pick whichever one has the needed functionality, and switch the websockets backend to it.
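
To illustrate the capabilities we're looking for, here is a hedged sketch of a client-side dial with github.com/gorilla/websocket (one of the two candidates), showing an explicit TLS client config and proxy support. This is only an illustration of that library's API as I understand it, not the Receptor backend itself; verify against the library's docs before relying on it.

package wssketch

import (
	"crypto/tls"
	"net/http"

	"github.com/gorilla/websocket"
)

// dial opens a websocket connection with an explicit TLS client config,
// honoring proxy settings from the environment (HTTPS_PROXY, etc.).
func dial(url string, tlsCfg *tls.Config) (*websocket.Conn, error) {
	dialer := websocket.Dialer{
		Proxy:           http.ProxyFromEnvironment,
		TLSClientConfig: tlsCfg,
	}
	conn, _, err := dialer.Dial(url, nil)
	return conn, err
}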

Hard-code the name of the control service

Right now the control service needs to be called out with a command line option, and if that option is not present, there is no control service, meaning this node can't be communicated with, can't report status, etc. The default should be to always run a control service named control, even if not explicitly asked to. Perhaps there should be an option to suppress this if you really want to run a node with no services, like a high security routing-only node.

Unreachable for non-existing services

We currently have a message type of MsgTypeReject which is used when a node refuses a connection from another node. We should adapt this message type to include sending a per-connection reject when a stream connection is attempted to a service that doesn't have a listener on the remote side (similar to an ICMP Unreachable in tcp/ip). Currently we get log errors on the remote side, but the dialer just has to sit there and eventually time out.

To do this we would need to:

  • Add a data payload to the MsgTypeReject packets so we know what is being rejected
  • When we receive a MsgTypeReject (that isn't for a whole connection), look up the dialer socket that originated it, and flag it as being in an error state
  • Return the error from anything blocked on that dialer, or any subsequent reads/writes to it

Fix goroutine leaks

When using https://github.com/fortytw2/leaktest, it finds the following leak:

    leaktest.go:150: leaktest: leaked goroutine: goroutine 317 [chan receive]:
        github.com/project-receptor/receptor/pkg/backends.(*UDPListenerSession).Recv(0xc000582040, 0xc0007f87a0, 0xffb940, 0xa8, 0x0, 0x0)
                /home/squid/work/project-receptor/receptor/pkg/backends/udp.go:206 +0x4a
        github.com/project-receptor/receptor/pkg/netceptor.(*connInfo).protoReader(0xc000892050, 0xbb9e00, 0xc000582040)
                /home/squid/work/project-receptor/receptor/pkg/netceptor/netceptor.go:858 +0x90
        created by github.com/project-receptor/receptor/pkg/netceptor.(*Netceptor).runProtocol
                /home/squid/work/project-receptor/receptor/pkg/netceptor/netceptor.go:971 +0x207

This is the only thing it finds, though, so once we fix it we should be able to gate on goroutine leak testing (see the sketch below).
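
For reference, gating a test on leaks with that library could look roughly like this; the test body and timeout are placeholders.

package meshtest

import (
	"testing"
	"time"

	"github.com/fortytw2/leaktest"
)

// TestMeshStartupShutdown fails if any goroutines started during the test
// are still running shortly after it finishes.
func TestMeshStartupShutdown(t *testing.T) {
	defer leaktest.CheckTimeout(t, 30*time.Second)()
	// ... start a mesh, exercise it, shut it down ...
}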

Test receptor-level TLS

Receptor has TLS at multiple levels: backends can listen on TLS, and services on the Receptor network can also use TLS. We have tested TLS at the backend level fairly well, but currently have no testing for Receptor-level TLS, as this will also require some work on the dev side. This will consist of:

  • Testing tls client auth
    • No client cert/key (failure)
    • unsigned client cert/key (failure)
    • signed client cert/key (success)
  • Testing tls without client certs
  • Testing that a service configured to use TLS is actually using TLS, via openssl s_client (see #113)
  • Test things like work submit and connect

Remote units of work should have expiration

When issuing a unit of work to be executed on a remote node via work submit, the originator should specify an expiration time or time-to-live. If the job hasn't been able to be sent to the remote node and started by this time, it should be marked as failed.

FIPS compliance

We need to verify that the crypto libraries we are using are FIPS 140-2 compliant, and change them if they are not.

Group addresses

Python Receptor had a concept of group addresses that was never fully realized. If we still want to do this, here's one way it could work:

  • At startup, nodes are initialized with a list of groups they are members of.
  • When someone wants to connect to a group address, they call a new method, netceptor.DialGroup. This sends a broadcast packet saying "I want a connection to a member of this group."
  • On receiving this, group members reply, perhaps with some kind of preference metric (ie, their current load average or some other measure of busyness).
  • The sender chooses the first reply, or the lowest-metric reply within double the time of the first reply, or something like that.
  • A normal point-to-point connection is established to the node ID of the selected respondent.

Perhaps it would also be valuable for PacketConn to be able to send broadcast messages targeted at groups, like a sort of Receptor multicast capability.

Check for and reject negative path costs

I think in the current implementation, you could do --tcp-peer cost=-1.0 and cause the route calculation to blow up (for example, it might loop infinitely, trying to find the path that maximizes the number of times the packet traverses this path, because every traversal reduces the cost).

Negative path costs should be checked for at config load time and rejected.
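
A minimal sketch of the check at config load time; whether a zero cost should also be rejected is a separate judgment call, and this sketch treats anything non-positive as invalid.

package costcheck

import "fmt"

// validateCost rejects non-positive connection costs for the cost parameter.
func validateCost(cost float64) error {
	if cost <= 0 {
		return fmt.Errorf("connection cost must be positive, got %v", cost)
	}
	return nil
}

// validateNodeCosts applies the same check to a nodecost map.
func validateNodeCosts(costs map[string]float64) error {
	for node, cost := range costs {
		if cost <= 0 {
			return fmt.Errorf("connection cost for %s must be positive, got %v", node, cost)
		}
	}
	return nil
}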

Concurrent job limit in Workceptor

Looking at the Python Receptor code, I realized that worker plugins are run in a thread pool that limits the number of concurrent executions. As things stand now, Go Receptor's Workceptor will launch as many jobs as it's asked to, immediately. We probably want it to have some concept of holding jobs for later execution based on some type of concurrency constraint.
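
One common way to express such a constraint in Go is a buffered channel used as a semaphore; a hedged sketch, with names that are illustrative rather than Workceptor's actual structure:

package worklimit

// limiter caps the number of jobs running at once at the channel's capacity.
type limiter chan struct{}

func newLimiter(max int) limiter { return make(limiter, max) }

// run blocks until a slot is free, then executes the job.
func (l limiter) run(job func()) {
	l <- struct{}{}        // acquire a slot (blocks while max jobs are running)
	defer func() { <-l }() // release the slot when the job finishes
	job()
}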

Upstream packaging

We should write a spec file to produce an RPM, and maybe produce a DEB if we feel like it.

We should include a systemd unit file that allows multiple invocations - systemctl start receptor@config1, etc - that refer to config files like /etc/receptor/config1.conf. See OpenVPN for an example of how this works.

Make sure we have good error messages for TLS-to-non-TLS

After #110 and #112 land, make sure that in the case where a non-TLS tcp-peer connects to a TLS tcp-listener or vice versa, that the error message produced is reasonable on both sides.

"First record does not look like a TLS handshake" is barely reasonable. "Unsupported SSLv2 handshake received" is not reasonable.

Make this work with Ansible Runner

Supporting Ansible Runner may be as simple as having the right command-line parameters and stdin, running in --json mode, and passing the output as Receptor stdout. The existing --via-receptor parameter in Runner expects to import old Receptor, so this should be modified.

The current Receptor plugin in Ansible Runner is incomplete anyway. It supports passing job results to stdout, but does not call the Python callbacks on the originating side. We should make the implementation more transparent, so a Python user of Runner doesn't have to change their API at all to run jobs through Receptor.

Consider removing "work start" command from the control service

Having two closely related commands "work start" and "work submit" is confusing. Receptorctl doesn't even have a "work start" command. Now that "work submit" does the right thing when you start a job on the local node, consider just removing "work start."

Broken connection while sending stdin

If, during a work start or work submit, the connection fails with an error other than EOF, then we can't guarantee we have a complete unit of work and should mark the unit as failed. Currently, it looks like we just continue attempting to process the work with the partial data.
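
For illustration, a hedged sketch of the distinction: io.Copy returns nil on a clean EOF, so any error from the copy means the payload may be truncated and the unit should be marked failed rather than processed. The markFailed and process callbacks are stand-ins for whatever Workceptor actually does.

package stdincopy

import "io"

// copyStdin copies the submitted payload into the unit's stdin file. A clean
// EOF means the sender finished; any error means the unit may be incomplete.
func copyStdin(dst io.Writer, src io.Reader, markFailed func(error), process func()) {
	if _, err := io.Copy(dst, src); err != nil {
		markFailed(err) // incomplete unit of work; do not process partial data
		return
	}
	process()
}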
