Comments (21)
Recommendation from our discussion: on start, lpc55 sprot_serverr asserts ROT_IRQ until it has finished initializing FIFOs and SSA/SSD state. This does not fully cover the case where RoT has been reset and the SP steps on FIFO initialization, but it does cover some of that window and makes it much easier to find these failures on a logic analyzer.
from hubris.
Recommendation from Matt: after hot path receive, the RoT checks for CSn asserted. If asserted, we have a synchronization error. RoT waits for CSn deasserted, then cleans up, and queues a SYNC_ERROR message, then asserts ROT_IRQ.
from hubris.
Possibly related, although a different error message: #1451
from hubris.
Yeah, thanks for linking that!
I'm not sure if they're related; a timeout occurs if the RoT never asserts ROT_TO_SP_IRQ_L
, versus sending corrupt data over SPI.
On the other hand, if we're blaming FIFOs for framing issues, we could hypothesize
- This issue is caused by FIFO shenanigans in the RoT → SP direction, so the SP sees corrupt messages
- #1451 is caused by FIFO shenanigans in the SP → RoT direction, so the RoT sees corrupt messages (and never pulls the IRQ line low, because it can't reply)
(This is extremely hand-wavey / unjustified, e.g. I can't remember if the RoT replies with an error on a corrupt message, but just putting it out there)
from hubris.
Re "This looks like a framing issue where the SP and RoT have gotten out of sync. ..."
SPI messages are framed by the SPI bus CSn signal.
from hubris.
Note: When SP does a flush (SPI CSn deassert, 10ms delay with no SPI clocks, CSn assert), the SP and RoT communications should be re-synchronized if they ever fell out of sync.
ajs and I don't see anything wrong there yet.
from hubris.
Is there any chance that the LPC55S69 parts are programmed to operate at 48MHz instead of the required 96MHz?
I don't think that is likely. But it can cause intermittent failures. Need to double check CMPA programming.
from hubris.
I did some SysTick (1ms) resolution ringbuf logging that doesn't immediately shed any light.
humility: ring buffer drv_lpc55_sprot_server::TIMES in sprot:
humility: attached to 1fc9:0143:MMLBHACB24IDJ via CMSIS-DAP
NDX LINE GEN COUNT PAYLOAD
39 185 89 1 Now(0x34310) // top of loop
40 231 89 1 Now(0x34310) // wait_for_csn_asserted(false), notified.
41 264 89 1 Now(0x34318) // wait_for_request(): read some, ssd
42 276 89 1 Now(0x34318) // wait_for_request(): read remainder
43 287 89 1 Now(0x34318) // received SP's message
44 188 89 1 Now(0x34318) // message ok, sending to handler
45 207 89 1 Now(0x3431b) // message handled
46 209 89 1 Now(0x3431b) // ringbuf_entry!(Trace::Stats done
47 311 89 1 Now(0x3431b) // reply: begining reply
48 231 89 2 Now(0x3431b) // wait_for_csn_asserted(true), ROT_IRQ, then notified.
49 322 89 1 Now(0x3431e) // reply: Tx loop done
50 352 89 1 Now(0x3431e) // reply: FIFOs ready for next request.
51 185 89 1 Now(0x3431e) // top of loop
52 231 89 1 Now(0x3431e) // wait_for_csn_asserted
from hubris.
I was unable to reproduce on madrid on tuesday, however today I reproduced a single time on my gimletlet and rot-carrier. Reproduction here means that we get an endless stream of CrcError
responses from faux-mgs state
and the system never self corrects. Here are my raw notes from tonight's debugging session:
Reproduced on the gimletlet/rot-carrier with Keeter's recommendation
Spam faux-mgs state
requests and reset RoT via debugger
Specifically, this command was run on my helios box:
ajs@tarz:~/management-gateway-service$ while true; do target/release/faux-mgs --interface igb0 state && sleep 0.1; done
Then on my mac: rota reset
Here's the relevant RoT ringbuf:
➜ hubris git:(hubris-1507-debugging) ✗ rota ringbuf
humility: attached to 1fc9:0143:CW4N3IFK22UUX via CMSIS-DAP
humility: ring buffer drv_lpc55_sprot_server::__RINGBUF in sprot:
NDX LINE GEN COUNT PAYLOAD
27 356 66 1 Underrun
28 316 66 1 SentBytes(0x0)
29 267 66 1 ReceivedBytes(0x10)
30 91 66 1 Err(UnsupportedProtocol)
31 194 66 1 Stats(RotIoStats { rx_received: 0x15f, rx_overrun: 0x2, csn_pulses: 0x0, tx_underrun: 0x15f, rx_invalid: 0x15f, tx_incomplete: 0x0 })
0 277 67 1 ReplyLen(0xb)
1 356 67 1 Underrun
2 316 67 1 SentBytes(0x0)
3 267 67 1 ReceivedBytes(0x10)
4 91 67 1 Err(UnsupportedProtocol)
5 194 67 1 Stats(RotIoStats { rx_received: 0x160, rx_overrun: 0x2, csn_pulses: 0x0, tx_underrun: 0x160, rx_invalid: 0x160, tx_incomplete: 0x0 })
6 277 67 1 ReplyLen(0xb)
7 356 67 1 Underrun
8 316 67 1 SentBytes(0x0)
9 267 67 1 ReceivedBytes(0x10)
10 91 67 1 Err(UnsupportedProtocol)
11 194 67 1 Stats(RotIoStats { rx_received: 0x161, rx_overrun: 0x2, csn_pulses: 0x0, tx_underrun: 0x161, rx_invalid: 0x161, tx_incomplete: 0x0 })
12 277 67 1 ReplyLen(0xb)
13 356 67 1 Underrun
14 316 67 1 SentBytes(0x0)
15 267 67 1 ReceivedBytes(0x10)
16 91 67 1 Err(UnsupportedProtocol)
17 194 67 1 Stats(RotIoStats { rx_received: 0x162, rx_overrun: 0x2, csn_pulses: 0x0, tx_underrun: 0x162, rx_invalid: 0x162, tx_incomplete: 0x0 })
18 277 67 1 ReplyLen(0xb)
19 356 67 1 Underrun
20 316 67 1 SentBytes(0x0)
21 267 67 1 ReceivedBytes(0x10)
22 91 67 1 Err(UnsupportedProtocol)
23 194 67 1 Stats(RotIoStats { rx_received: 0x163, rx_overrun: 0x2, csn_pulses: 0x0, tx_underrun: 0x163, rx_invalid: 0x163, tx_incomplete: 0x0 })
24 277 67 1 ReplyLen(0xb)
25 356 67 1 Underrun
26 316 67 1 SentBytes(0x0)
What's odd is that consistently, UnsupportedProtocol
is the error on the RoT
because of an overrun, and it returns this. The first 6 bytes come back on the
SP, but the remaining 5 are 0 resulting in a CRC error. This is the same on every iteration.
It's like the RoT is stuck and not actually mutating its buffers.
Here's the SP state for the last retry. The 0x17..
are timestamps in ticks (ms).
22 596 276 1 ReqRotBootInfo
23 218 276 1 DoSendRecv(0x17a1e5)
24 346 276 1 RotIrqNotAsserted
25 262 276 1 Sent(0xa)
26 399 276 1 WaitIrqStarted(true, 0x17a1e5)
27 413 276 1 WaitIrqStopped(true, 0x17a1e7)
28 304 276 1 Received(0xb)
29 470 276 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
30 480 276 1 Error(Protocol(InvalidCrc))
31 218 276 1 DoSendRecv(0x17a1f0)
32 346 276 1 RotIrqNotAsserted
33 262 276 1 Sent(0xa)
34 399 276 1 WaitIrqStarted(true, 0x17a1f0)
35 413 276 1 WaitIrqStopped(true, 0x17a1f2)
36 304 276 1 Received(0xb)
37 470 276 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
38 480 276 1 Error(Protocol(InvalidCrc))
39 218 276 1 DoSendRecv(0x17a1fb)
40 346 276 1 RotIrqNotAsserted
41 262 276 1 Sent(0xa)
42 399 276 1 WaitIrqStarted(true, 0x17a1fb)
43 413 276 1 WaitIrqStopped(true, 0x17a1fd)
44 304 276 1 Received(0xb)
45 470 276 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
46 480 276 1 Error(Protocol(InvalidCrc))
47 490 276 1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }
The first 6 bytes indicate an error response for UnsupportedProtocol
at
protocol version 3. However, the remaining bytes aren't filled in. The last 2
bytes should be the CRC, but they are 0.
I hooked up the salae and I can see that the RoT is raising ROT_IRQ before it
sees CSn de-asserted. It looks like this happens at byte 9 of the response
from the RoT, and then 6 bytes get clocked out after. Interestingly this would
indeed make the checksums 0, which they also are on the wire and result in the
SP seeing a CRC error. However, it's still unclear to me why the RoT thinks
it's seeing an UnsupportedProtocol
. The message from the SP on the wire is correct.
What could explain both these things: early ROT_IRQ raise and invalid read is
if perhaps CSN being de-asserted is misdetected/ tracked wrong in a variable.
That's what I'll be looking into in the code next.
We also see that the RoT is returning 0 bytes on reply and detecting an underrun. It's
also receiving 16 bytes, which should only occur when the SP is waiting for a reply. So clearly
there is a persistent desynchronization, but the SP also should only wait for reply
when ROT_IRQ is asserted, which I don't see in the trace. However, it looks like
we don't de-assert ROT_IRQ at RoT startup and that could cause issues.
Going to add some more debug traces to the RoT side and try to reproduce again.
Having a hard time reproducing again with my added ringbuf traces on the RoT. So
let's do some speculating.
We know from the RoT ringbuf that the RoT receives 16 bytes while
waiting for request. It sees all 16 bytes, and so decodes them as
SprotError::UnknownProtocol
. We can assume that this is the SP trying to get
a reply to its prior messsage, as the SP always tries to read a full fifo of 16
bytes. And while waiting for a reply the SP is clocking out zeroes which would
be interpreted by the RoT
as UnknownProtocol
. However, the SP will only try
to read a reply if it sees ROT_IRQ raised. So the RoT must be asserting ROT_IRQ,
even though it's waiting for a request from the SP and should have ROT_IRQ
deasserted during that time.
We also know that the RoT when trying to reply with this error had not clocked
out more than its fifo, before CSn was deasserted, due to the misleading
Trace::SentBytes(0)
. This implies that as soon as CSn was asserted and the irq fired in reply
that CSn had also become deasserted.
We then notice in the Rot reply
code that we drain the fifo and deassert
ROT_IRQ. At this point we should go back to waiting for the next message in
wait_for_request
. But the deassert of ROT_IRQ shows up "early" at the SP and
the SP detects a CRC Error. The SP then retries to send after 5ms and the RoT sees an UnsupportedProtocol
,
raises ROT_IRQ and replies, while the SP thinks it is still sending, then the SP goes to receive and gets back partial bytes.
So maybe this is purely just a timing rhythm that kicks off when the RoT starts
at a particular point in time. Here are the hypothesized steps.
- The SP sends a status message. It has clocked out most of the bytes, and then de-asserts CSn.
- The RoT wakes up, waits for CSn to be asserted by waiting for an IRQ.
- Some time passes, and the RoT gets woken up, sees CSn is asserted, then goes into a tight loop to read the message.
- The RoT only clocks in a partial message (the last byte or so from the SP request)
- The RoT sees that CSn is already de-asserted
- The RoT tries to deserialize the reply, sees that it is invalid then calls
reply
. - By this point, the SP's retry timer has fired causing it to start clocking out the next request
- The RoT raises ROT_IRQ, then waits on an IRQ for CSn to be asserted.
- CSn is set in the interrupt register on the RoT and the RoT goes to reply.
- The RoT only clocks a few bytes out of its reply because it is actually receiving the remainder of a partial SP messaage. This occurs because the RoT sees that CSn is also deasserted via the ssd bit.
- The RoT deasserts-rot-irq and goes to wait for the next request.
- The SP gets a short reply, with a CRC error then goes to retry.
- The cycle repeats..
This is plausible because we see an overrun in the RoT logs indicating a short
read from the SP on every retry and we also saw ROT_IRQ trigger early on the
salae trace, and we know that only partial replies are coming back to the SP,
even though they always start with the right header bytes.
It seems that this is all due to the retry timeout being 5ms and us hitting
that repeatedly. So a poorly chosen timeout results in a loop. This is deeply
unsatisfying as it's very hard to prove its the truth. It also doesn't really
explain why even after the retries are up, that the system never comes back
and continues with the same behavior on the next attempt. Regardless, there are
a few things we should do:
- Raise the retry timeout
- If the RoT receives a bad message from the SP (flow error, unsupported
protocol, etc...), the RoT should NOT send a reply. This just causes the RoT
to waste time in what will result in a retry anyway. Instead the RoT should go
back to waiting for the next message and let the SP timeout. This will also have the side effect
of simplifying the retry error handling on the SP, as now only "unretryable" errors will be returned. - Initialize the ROT_IRQ to deasserted on RoT sprot startup. There is no evidence that this is an issue,
but it should be done.
Now, to help clarify, I'd like to get a reproduction with a salae trace. I'm
going to remove my RoT changes and try again, recording the whole thing from the
start of the SP spam prior to RoT reset.
from hubris.
Ok, reproduced again. Interestingly, a closer look shows that there are a few UnsupportedProtocol
reported from the SP to faux-mgs. But we are definitely stuck. This time I got the whole thing recorded on the salae. I Need to figure out where the first error after the timeout starts. I will check that out in the morning. I attached the trace:
SprotInvalidCrc.sal.zip
Here's the full gimletlet ringbuf
➜ hubris git:(hubris-1507-debugging) ✗ gimletlet ringbuf sprot
humility: attached to 0483:3754:005100383431511731343632 via ST-Link V3
humility: ring buffer drv_stm32h7_spi_server_core::__RINGBUF in sprot:
NDX LINE GEN COUNT PAYLOAD
7 430 1993 1 Tx(0x0)
8 430 1993 1 Tx(0x3)
9 430 1993 1 Tx(0x7)
10 494 1993 2 WaitISR(0xa0002)
11 464 1993 1 Rx(0x0)
12 430 1993 1 Tx(0xa9)
13 494 1993 2 WaitISR(0x90002)
14 464 1993 1 Rx(0x0)
15 430 1993 1 Tx(0x0)
16 494 1993 2 WaitISR(0x80012)
17 464 1993 1 Rx(0x0)
18 494 1993 2 WaitISR(0x70012)
19 464 1993 1 Rx(0x0)
20 494 1993 2 WaitISR(0x60012)
21 464 1993 1 Rx(0x0)
22 494 1993 2 WaitISR(0x50012)
23 464 1993 1 Rx(0x0)
24 494 1993 2 WaitISR(0x40012)
25 464 1993 1 Rx(0x0)
26 494 1993 2 WaitISR(0x30012)
27 464 1993 2 Rx(0x0)
28 494 1993 2 WaitISR(0x10012)
29 464 1993 1 Rx(0x0)
30 322 1993 1 Start(read, (0x0, 0x10))
31 430 1993 8 Tx(0x0)
32 494 1993 2 WaitISR(0x100002)
33 464 1993 1 Rx(0x3)
34 430 1993 1 Tx(0x0)
35 494 1993 2 WaitISR(0xf0002)
36 464 1993 1 Rx(0x0)
37 430 1993 1 Tx(0x0)
38 494 1993 3 WaitISR(0xe0002)
39 464 1993 1 Rx(0x0)
40 430 1993 1 Tx(0x0)
41 494 1993 1 WaitISR(0xd0002)
42 464 1993 1 Rx(0x0)
43 430 1993 1 Tx(0x0)
44 494 1993 2 WaitISR(0xc0002)
45 464 1993 1 Rx(0x3)
46 430 1993 1 Tx(0x0)
47 494 1993 1 WaitISR(0xb0002)
48 464 1993 1 Rx(0x0)
49 430 1993 1 Tx(0x0)
50 494 1993 1 WaitISR(0xb0002)
51 494 1993 2 WaitISR(0xa0002)
52 464 1993 1 Rx(0x0)
53 430 1993 1 Tx(0x0)
54 494 1993 1 WaitISR(0x90002)
55 464 1993 1 Rx(0x0)
56 430 1993 1 Tx(0x0)
57 494 1993 2 WaitISR(0x80012)
58 464 1993 1 Rx(0x0)
59 494 1993 2 WaitISR(0x70012)
60 464 1993 1 Rx(0x0)
61 494 1993 2 WaitISR(0x60012)
62 464 1993 1 Rx(0x0)
63 494 1993 2 WaitISR(0x50012)
0 464 1994 2 Rx(0x0)
1 494 1994 2 WaitISR(0x30012)
2 464 1994 1 Rx(0x0)
3 494 1994 2 WaitISR(0x20012)
4 464 1994 1 Rx(0x0)
5 494 1994 2 WaitISR(0x10012)
6 464 1994 1 Rx(0x0)
humility: ring buffer drv_stm32h7_sprot_server::__RINGBUF in sprot:
NDX LINE GEN COUNT PAYLOAD
244 399 57 1 WaitIrqStarted(true, 0x1b3d0)
245 413 57 1 WaitIrqStopped(true, 0x1b3d2)
246 304 57 1 Received(0xb)
247 470 57 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
248 480 57 1 Error(Protocol(InvalidCrc))
249 218 57 1 DoSendRecv(0x1b3db)
250 346 57 1 RotIrqNotAsserted
251 262 57 1 Sent(0xa)
252 399 57 1 WaitIrqStarted(true, 0x1b3db)
253 413 57 1 WaitIrqStopped(true, 0x1b3dd)
254 304 57 1 Received(0xb)
255 470 57 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
0 480 58 1 Error(Protocol(InvalidCrc))
1 218 58 1 DoSendRecv(0x1b3e6)
2 346 58 1 RotIrqNotAsserted
3 262 58 1 Sent(0xa)
4 399 58 1 WaitIrqStarted(true, 0x1b3e6)
5 413 58 1 WaitIrqStopped(true, 0x1b3e8)
6 304 58 1 Received(0xb)
7 470 58 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
8 480 58 1 Error(Protocol(InvalidCrc))
9 490 58 1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }
10 596 58 1 ReqRotBootInfo
11 218 58 1 DoSendRecv(0x1b456)
12 346 58 1 RotIrqNotAsserted
13 262 58 1 Sent(0xa)
14 399 58 1 WaitIrqStarted(true, 0x1b456)
15 413 58 1 WaitIrqStopped(true, 0x1b458)
16 304 58 1 Received(0xb)
17 470 58 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
18 480 58 1 Error(Protocol(InvalidCrc))
19 218 58 1 DoSendRecv(0x1b461)
20 346 58 1 RotIrqNotAsserted
21 262 58 1 Sent(0xa)
22 399 58 1 WaitIrqStarted(true, 0x1b461)
23 413 58 1 WaitIrqStopped(true, 0x1b463)
24 304 58 1 Received(0xb)
25 470 58 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
26 480 58 1 Error(Protocol(InvalidCrc))
27 218 58 1 DoSendRecv(0x1b46c)
28 346 58 1 RotIrqNotAsserted
29 262 58 1 Sent(0xa)
30 399 58 1 WaitIrqStarted(true, 0x1b46c)
31 413 58 1 WaitIrqStopped(true, 0x1b46e)
32 304 58 1 Received(0xb)
33 470 58 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
34 480 58 1 Error(Protocol(InvalidCrc))
35 490 58 1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }
36 596 58 1 ReqRotBootInfo
37 218 58 1 DoSendRecv(0x1b4dc)
38 346 58 1 RotIrqNotAsserted
39 262 58 1 Sent(0xa)
40 399 58 1 WaitIrqStarted(true, 0x1b4dc)
41 413 58 1 WaitIrqStopped(true, 0x1b4de)
42 304 58 1 Received(0xb)
43 470 58 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
44 480 58 1 Error(Protocol(InvalidCrc))
45 218 58 1 DoSendRecv(0x1b4e7)
46 346 58 1 RotIrqNotAsserted
47 262 58 1 Sent(0xa)
48 399 58 1 WaitIrqStarted(true, 0x1b4e7)
49 413 58 1 WaitIrqStopped(true, 0x1b4e9)
50 304 58 1 Received(0xb)
51 470 58 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
52 480 58 1 Error(Protocol(InvalidCrc))
53 218 58 1 DoSendRecv(0x1b4f2)
54 346 58 1 RotIrqNotAsserted
55 262 58 1 Sent(0xa)
56 399 58 1 WaitIrqStarted(true, 0x1b4f2)
57 413 58 1 WaitIrqStopped(true, 0x1b4f4)
58 304 58 1 Received(0xb)
59 470 58 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
60 480 58 1 Error(Protocol(InvalidCrc))
61 490 58 1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }
62 596 58 1 ReqRotBootInfo
63 218 58 1 DoSendRecv(0x1b562)
64 346 58 1 RotIrqNotAsserted
65 262 58 1 Sent(0xa)
66 399 58 1 WaitIrqStarted(true, 0x1b562)
67 413 58 1 WaitIrqStopped(true, 0x1b564)
68 304 58 1 Received(0xb)
69 470 58 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
70 480 58 1 Error(Protocol(InvalidCrc))
71 218 58 1 DoSendRecv(0x1b56d)
72 346 58 1 RotIrqNotAsserted
73 262 58 1 Sent(0xa)
74 399 58 1 WaitIrqStarted(true, 0x1b56d)
75 413 58 1 WaitIrqStopped(true, 0x1b56f)
76 304 58 1 Received(0xb)
77 470 58 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
78 480 58 1 Error(Protocol(InvalidCrc))
79 218 58 1 DoSendRecv(0x1b578)
80 346 58 1 RotIrqNotAsserted
81 262 58 1 Sent(0xa)
82 399 58 1 WaitIrqStarted(true, 0x1b578)
83 413 58 1 WaitIrqStopped(true, 0x1b57a)
84 304 58 1 Received(0xb)
85 470 58 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
86 480 58 1 Error(Protocol(InvalidCrc))
87 490 58 1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }
88 596 58 1 ReqRotBootInfo
89 218 58 1 DoSendRecv(0x1b5e8)
90 346 58 1 RotIrqNotAsserted
91 262 58 1 Sent(0xa)
92 399 58 1 WaitIrqStarted(true, 0x1b5e8)
93 413 58 1 WaitIrqStopped(true, 0x1b5ea)
94 304 58 1 Received(0xb)
95 470 58 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
96 480 58 1 Error(Protocol(InvalidCrc))
97 218 58 1 DoSendRecv(0x1b5f3)
98 346 58 1 RotIrqNotAsserted
99 262 58 1 Sent(0xa)
100 399 58 1 WaitIrqStarted(true, 0x1b5f3)
101 413 58 1 WaitIrqStopped(true, 0x1b5f5)
102 304 58 1 Received(0xb)
103 470 58 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
104 480 58 1 Error(Protocol(InvalidCrc))
105 218 58 1 DoSendRecv(0x1b5fe)
106 346 58 1 RotIrqNotAsserted
107 262 58 1 Sent(0xa)
108 399 58 1 WaitIrqStarted(true, 0x1b5fe)
109 413 58 1 WaitIrqStopped(true, 0x1b600)
110 304 58 1 Received(0xb)
111 470 58 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
112 480 58 1 Error(Protocol(InvalidCrc))
113 490 58 1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }
114 596 58 1 ReqRotBootInfo
115 218 58 1 DoSendRecv(0x1b66e)
116 346 58 1 RotIrqNotAsserted
117 262 58 1 Sent(0xa)
118 399 58 1 WaitIrqStarted(true, 0x1b66e)
119 413 58 1 WaitIrqStopped(true, 0x1b670)
120 304 58 1 Received(0xb)
121 470 58 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
122 480 58 1 Error(Protocol(InvalidCrc))
123 218 58 1 DoSendRecv(0x1b679)
124 346 58 1 RotIrqNotAsserted
125 262 58 1 Sent(0xa)
126 399 58 1 WaitIrqStarted(true, 0x1b679)
127 413 58 1 WaitIrqStopped(true, 0x1b67b)
128 304 58 1 Received(0xb)
129 470 58 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
130 480 58 1 Error(Protocol(InvalidCrc))
131 218 58 1 DoSendRecv(0x1b684)
132 346 58 1 RotIrqNotAsserted
133 262 58 1 Sent(0xa)
134 399 58 1 WaitIrqStarted(true, 0x1b684)
135 413 58 1 WaitIrqStopped(true, 0x1b686)
136 304 58 1 Received(0xb)
137 470 58 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
138 480 58 1 Error(Protocol(InvalidCrc))
139 490 58 1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }
140 596 58 1 ReqRotBootInfo
141 218 58 1 DoSendRecv(0x1b6f4)
142 346 58 1 RotIrqNotAsserted
143 262 58 1 Sent(0xa)
144 399 58 1 WaitIrqStarted(true, 0x1b6f4)
145 413 58 1 WaitIrqStopped(true, 0x1b6f6)
146 304 58 1 Received(0xb)
147 470 58 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
148 480 58 1 Error(Protocol(InvalidCrc))
149 218 58 1 DoSendRecv(0x1b6ff)
150 346 58 1 RotIrqNotAsserted
151 262 58 1 Sent(0xa)
152 399 58 1 WaitIrqStarted(true, 0x1b6ff)
153 413 58 1 WaitIrqStopped(true, 0x1b701)
154 304 58 1 Received(0xb)
155 470 58 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
156 480 58 1 Error(Protocol(InvalidCrc))
157 218 58 1 DoSendRecv(0x1b70a)
158 346 58 1 RotIrqNotAsserted
159 262 58 1 Sent(0xa)
160 399 58 1 WaitIrqStarted(true, 0x1b70a)
161 413 58 1 WaitIrqStopped(true, 0x1b70c)
162 304 58 1 Received(0xb)
163 470 58 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
164 480 58 1 Error(Protocol(InvalidCrc))
165 490 58 1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }
166 596 58 1 ReqRotBootInfo
167 218 58 1 DoSendRecv(0x1b77a)
168 346 58 1 RotIrqNotAsserted
169 262 58 1 Sent(0xa)
170 399 58 1 WaitIrqStarted(true, 0x1b77a)
171 413 58 1 WaitIrqStopped(true, 0x1b77c)
172 304 58 1 Received(0xb)
173 470 58 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
174 480 58 1 Error(Protocol(InvalidCrc))
175 218 58 1 DoSendRecv(0x1b785)
176 346 58 1 RotIrqNotAsserted
177 262 58 1 Sent(0xa)
178 399 58 1 WaitIrqStarted(true, 0x1b785)
179 413 58 1 WaitIrqStopped(true, 0x1b787)
180 304 58 1 Received(0xb)
181 470 58 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
182 480 58 1 Error(Protocol(InvalidCrc))
183 218 58 1 DoSendRecv(0x1b790)
184 346 58 1 RotIrqNotAsserted
185 262 58 1 Sent(0xa)
186 399 58 1 WaitIrqStarted(true, 0x1b790)
187 413 58 1 WaitIrqStopped(true, 0x1b792)
188 304 58 1 Received(0xb)
189 470 58 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
190 480 58 1 Error(Protocol(InvalidCrc))
191 490 58 1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }
192 596 58 1 ReqRotBootInfo
193 218 58 1 DoSendRecv(0x1b800)
194 346 58 1 RotIrqNotAsserted
195 262 58 1 Sent(0xa)
196 399 58 1 WaitIrqStarted(true, 0x1b800)
197 413 58 1 WaitIrqStopped(true, 0x1b802)
198 304 58 1 Received(0xb)
199 470 58 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
200 480 58 1 Error(Protocol(InvalidCrc))
201 218 58 1 DoSendRecv(0x1b80b)
202 346 58 1 RotIrqNotAsserted
203 262 58 1 Sent(0xa)
204 399 58 1 WaitIrqStarted(true, 0x1b80b)
205 413 58 1 WaitIrqStopped(true, 0x1b80d)
206 304 58 1 Received(0xb)
207 470 58 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
208 480 58 1 Error(Protocol(InvalidCrc))
209 218 58 1 DoSendRecv(0x1b816)
210 346 58 1 RotIrqNotAsserted
211 262 58 1 Sent(0xa)
212 399 58 1 WaitIrqStarted(true, 0x1b816)
213 413 58 1 WaitIrqStopped(true, 0x1b818)
214 304 58 1 Received(0xb)
215 470 58 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
216 480 58 1 Error(Protocol(InvalidCrc))
217 490 58 1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }
218 596 58 1 ReqRotBootInfo
219 218 58 1 DoSendRecv(0x1b886)
220 346 58 1 RotIrqNotAsserted
221 262 58 1 Sent(0xa)
222 399 58 1 WaitIrqStarted(true, 0x1b886)
223 413 58 1 WaitIrqStopped(true, 0x1b888)
224 304 58 1 Received(0xb)
225 470 58 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
226 480 58 1 Error(Protocol(InvalidCrc))
227 218 58 1 DoSendRecv(0x1b891)
228 346 58 1 RotIrqNotAsserted
229 262 58 1 Sent(0xa)
230 399 58 1 WaitIrqStarted(true, 0x1b891)
231 413 58 1 WaitIrqStopped(true, 0x1b893)
232 304 58 1 Received(0xb)
233 470 58 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
234 480 58 1 Error(Protocol(InvalidCrc))
235 218 58 1 DoSendRecv(0x1b89c)
236 346 58 1 RotIrqNotAsserted
237 262 58 1 Sent(0xa)
238 399 58 1 WaitIrqStarted(true, 0x1b89c)
239 413 58 1 WaitIrqStopped(true, 0x1b89e)
240 304 58 1 Received(0xb)
241 470 58 1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
242 480 58 1 Error(Protocol(InvalidCrc))
243 490 58 1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }
Here's the full RoT ringbuf:
➜ hubris git:(hubris-1507-debugging) ✗ rota ringbuf sprot
humility: attached to 1fc9:0143:CW4N3IFK22UUX via CMSIS-DAP
humility: ring buffer drv_lpc55_sprot_server::__RINGBUF in sprot:
NDX LINE GEN COUNT PAYLOAD
18 356 69 1 Underrun
19 316 69 1 SentBytes(0x0)
20 267 69 1 ReceivedBytes(0x10)
21 91 69 1 Err(UnsupportedProtocol)
22 194 69 1 Stats(RotIoStats { rx_received: 0x16f, rx_overrun: 0x1, csn_pulses: 0x0, tx_underrun: 0x16a, rx_invalid: 0x16f, tx_incomplete: 0x0 })
23 277 69 1 ReplyLen(0xb)
24 356 69 1 Underrun
25 316 69 1 SentBytes(0x0)
26 267 69 1 ReceivedBytes(0x10)
27 91 69 1 Err(UnsupportedProtocol)
28 194 69 1 Stats(RotIoStats { rx_received: 0x170, rx_overrun: 0x1, csn_pulses: 0x0, tx_underrun: 0x16b, rx_invalid: 0x170, tx_incomplete: 0x0 })
29 277 69 1 ReplyLen(0xb)
30 356 69 1 Underrun
31 316 69 1 SentBytes(0x0)
0 267 70 1 ReceivedBytes(0x10)
1 91 70 1 Err(UnsupportedProtocol)
2 194 70 1 Stats(RotIoStats { rx_received: 0x171, rx_overrun: 0x1, csn_pulses: 0x0, tx_underrun: 0x16c, rx_invalid: 0x171, tx_incomplete: 0x0 })
3 277 70 1 ReplyLen(0xb)
4 356 70 1 Underrun
5 316 70 1 SentBytes(0x0)
6 267 70 1 ReceivedBytes(0x10)
7 91 70 1 Err(UnsupportedProtocol)
8 194 70 1 Stats(RotIoStats { rx_received: 0x172, rx_overrun: 0x1, csn_pulses: 0x0, tx_underrun: 0x16d, rx_invalid: 0x172, tx_incomplete: 0x0 })
9 277 70 1 ReplyLen(0xb)
10 356 70 1 Underrun
11 316 70 1 SentBytes(0x0)
12 267 70 1 ReceivedBytes(0x10)
13 91 70 1 Err(UnsupportedProtocol)
14 194 70 1 Stats(RotIoStats { rx_received: 0x173, rx_overrun: 0x1, csn_pulses: 0x0, tx_underrun: 0x16e, rx_invalid: 0x173, tx_incomplete: 0x0 })
15 277 70 1 ReplyLen(0xb)
16 356 70 1 Underrun
17 316 70 1 SentBytes(0x0)
from hubris.
I also noticed, scrolling back in the faux-mgs buffer that the first message after the timeouts from the RoT reset that starts this cycle is RoT state: Err(Sprot(BadMessageLength)
from hubris.
Looks like the RoT is coming out of reset and issuing it's first ROT_IRQ asserted at ~95.056 s on the Salae trace. At what looks like the same time, CSn is getting de-asserted. So it seems like ROT_IRQ is firing early here for some reason. Then it ends early, presumably because it sees CSn deasserted already when replying.
![Screenshot 2023-08-31 at 3 15 21 AM](https://private-user-images.githubusercontent.com/466937/264565304-a3de5939-418d-4711-8ed9-99fa4b1515a1.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjE5MzM5MDUsIm5iZiI6MTcyMTkzMzYwNSwicGF0aCI6Ii80NjY5MzcvMjY0NTY1MzA0LWEzZGU1OTM5LTQxOGQtNDcxMS04ZWQ5LTk5ZmE0YjE1MTVhMS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzI1JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcyNVQxODUzMjVaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT0wNjNlN2YzYzUzNTAyNWMzMjM4NzliZWI3Y2E0NThhYWEwODU0NzEzY2U0NTAxZDk0MDc3NTNmNTAzY2ZkZjExJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.RW0KJ0rtLLLl6isBKzim25W9F04WLPQ3X2MJiOedR6g)
![Screenshot 2023-08-31 at 3 15 48 AM](https://private-user-images.githubusercontent.com/466937/264565442-a0677186-5ba9-40b2-b2d1-39612fa18618.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjE5MzM5MDUsIm5iZiI6MTcyMTkzMzYwNSwicGF0aCI6Ii80NjY5MzcvMjY0NTY1NDQyLWEwNjc3MTg2LTViYTktNDBiMi1iMmQxLTM5NjEyZmExODYxOC5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzI1JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcyNVQxODUzMjVaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT00NDI1NmRlNzZhN2EyYWRkMTRiNTZkNWY4YTNkYTU5MTM4ODliZGE3ZTQyYWNjYTNkZDBhNDc4ZGFjMTZhOWJiJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.j80S57Q3kvrdefAwVbKOOdw3dd7TWYP94WrYG84o2q8)
On the next status request things look saner from the ROT_IRQ being asserted POV, but it still ends early.
![Screenshot 2023-08-31 at 3 18 20 AM](https://private-user-images.githubusercontent.com/466937/264566129-daca2124-5b40-4574-8666-65f528d8f778.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjE5MzM5MDUsIm5iZiI6MTcyMTkzMzYwNSwicGF0aCI6Ii80NjY5MzcvMjY0NTY2MTI5LWRhY2EyMTI0LTViNDAtNDU3NC04NjY2LTY1ZjUyOGQ4Zjc3OC5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzI1JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcyNVQxODUzMjVaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1mZWU2MzgxZmE3YWE0MzBmZjAyMmE4YmE4Y2RhMmYzM2I4NzJkZDY5MjE3Mzg3Yjg4MjJlN2M3M2U4ZGEzYzcxJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.wo7O5wiXk8BBonER9TLDhe9D8dU1i3D0AoyVlwiRaIY)
This pattern repeats indefinitely with ROT_IRQ de-asserting early. This is leading me strongly to believe that the SSD bit in the spi status register (CSn de-asserted) is being detected in the reply
method of the lpc55-sprot-server even though the SP hasn't yet de-asserted CSn for this reply. This is almost certainly a mismanagement of the ssd bit, and not clearing it when it should be. We should probably guarantee we clear it right before we wait for the SSA (CSn asserted) interrupt.
If I had to point to one suspicious bit in the code it would be the place where we probably do the mishandling of the SSD bit:
hubris/drv/lpc55-sprot-server/src/main.rs
Lines 294 to 300 in cd5f3c4
Then right after, we go ahead and drain our TX fifo, which probably triggers the zeroes in the CRC location to show up at the SP instead of the actual CRC:
hubris/drv/lpc55-sprot-server/src/main.rs
Lines 318 to 330 in cd5f3c4
from hubris.
Note that while my prior comment explains what happens during the reply, it still doesn't really explain why the RoT is seeing a bunk request.
from hubris.
Note that while my prior comment explains what happens during the reply, it still doesn't really explain why the RoT is seeing a bunk request.
Hmm, actually it might. We also clear the receive fifo with the self.spi.drain()
. However, the SP is still sending 0s here after we de-assert ROT_IRQ. Those end up in the receive fifo, and therefore will come before the next received request. Then what will happen is that the RoT will go back to waiting for the SSA interrupt, but after the few remaining 0s clocked out from the SP, the SP will de-assert CSn causing the SSD bit on the RoT to get set. This happens after we clear SSD in reply
on the RoT. Then the next request comes in from the SP, we already have some leading 0s at the RoT and so we know it will be UndefinedProtocol
. The RoT gets the SSA interrupt, then it goes into a tight loop, sees that SSD is already set, and reads as subset of the bytes from the SP. The RoT then goes into reply()
and starts putting its reply into its TX fifo. Then the RoT asserts ROT_IRQ and waits for the SSA interrupt again. Then the SP finishes sending its request and now de-asserts CSn. When the SSA interrupt fires, SSD will again already be set, and the ROT will return a short response. This will happen perpetually, because of the mismanagement of the SSD bit.
TODO: Figure out how to not misuse the SSD bit on the RoT!
from hubris.
Awesome analysis! It sounds like this still matches with our understanding of how the SSA/SSD bits work? Those bits aren't used at all in NXP's reference driver so I have this nagging fear that they are undertested and wrongly documented.
from hubris.
Awesome analysis! It sounds like this still matches with our understanding of how the SSA/SSD bits work? Those bits aren't used at all in NXP's reference driver so I have this nagging fear that they are undertested and wrongly documented.
Thanks for taking a look @labbott! This matches my understanding of how the SSA/SSD bits work. Note that we do read them from different registers though. The SSA read is from the intstat register, which is destructive, while the SSD bit is from the status register which is non-destructive. This was a change I made while refactoring many months ago, but I don't think it's the cause of the issue or anything. AFAICT the bits act as we understand and as documented. I don't think we can necessarily blame NXP for anything this time 😆
from hubris.
Zooming in on the first trace, shows that right after a timeout SSD fires immediately after ROT_IRQ and sets us on this disastrous path.
from hubris.
The above is presumably because as the RoT comes out of reset, SSA and SSD are already set so it reads what's in its fifo and goes to reply. Then the request from the SP actually ends.
from hubris.
I wrote a script that can sometimes reproduce. I'll likely continue tweaking:
#!/usr/bin/env bash
MGS_REPO=${MGS_REPO:=~/oxide/management-gateway-service}
HUBRIS_REPO=${HUBRIS_REPO:=~/oxide/hubris}
HUMILITY_ENVIRONMENT=${HUMILITY_ENVIRONMENT:=~/Users/ajs/oxide/humility_env.json}
HENV=$HUMILITY_ENVIRONMENT
rota_reset='humility -t rot-carrier --archive-name a reset'
ROT_RESET_COMMAND=${RESET_COMMAND:="$rota_reset"}
# Number of times in a row `faux-mgs state` must fail to be a test success
MAX_FAILS=${MAX_FAILS:=100}
ready_to_reset=false
failures_in_a_row=0
# We want to wait for a few times through the loop to make sure the reset
# command completes before trying again. Otherwise, we will get an error:
#
# humility reset failed: Probe could not be created
#
# Caused by:
# 0: hidapi error: hid_error is not implemented yet
# 1: hidapi error: hid_error is not implemented yet
#
MAX_GRACE_CYCLES=20
reset_grace=$MAX_GRACE_CYCLES
while true; do
# Run our command
output=`$MGS_REPO/target/release/faux-mgs --interface en6 --log-level critical state`
echo "$output" | grep -q "RoT state: Ok"
if [[ $? -eq 0 ]]; then
failures_in_a_row=0
echo "."
else
failures_in_a_row=$((failures_in_a_row+1))
echo "+"
fi
# increment our grace period counter
reset_grace=$((reset_grace+1))
# Are we done?
if [[ failures_in_a_row -eq MAX_FAILS ]]; then
echo "Test Complete: found faulty sprot"
break
fi
# Should we reset the RoT to try again?
choice=$RANDOM
sleep=$((1/(choice % 100 + 1)))
if [[ $failures_in_a_row -eq 0 && $reset_grace -gt $MAX_GRACE_CYCLES ]]; then
# reset our grace period counter
reset_grace=0
# Flip a coin to see if we should sleep before we try to reset or after
# We right shift to create a boolean
let "choice >>= 14"
if [[ $choice -eq 0 ]]; then
HUMILITY_ENV=$HENV bash -c "sleep ${sleep} && ${ROT_RESET_COMMAND} &" &
else
HUMILITY_ENV=$HENV bash -c "${ROT_RESET_COMMAND} &" &
fi
else
sleep 0.01
fi
done
from hubris.
Further validation from a local reproduction that the CSn
pin remains asserted when we read it directly after a reply completes. I'm doing this manually after the reproduction where we are stuck in this state, and can see CSn continually asserted. This also means that we can use this pin (and @mkeeter's great suggestion) to detect the stuck state, wait for the actual CSn line to become de-asserted and then return a SYNC_ERROR
message.
humility: ring buffer drv_lpc55_sprot_server::__RINGBUF in sprot:
NDX LINE GEN COUNT PAYLOAD
23 272 43 1 ReceivedBytes(0x10)
24 91 43 1 Err(UnsupportedProtocol)
25 195 43 1 Stats(RotIoStats { rx_received: 0xab, rx_overrun: 0x1, csn_pulses: 0x0, tx_underrun: 0xaa, rx_invalid: 0xab, tx_incomplete: 0x0 })
26 282 43 1 ReplyLen(0xb)
27 207 43 1 Pins(One, One)
28 361 43 1 Underrun
29 321 43 1 SentBytes(0x0)
30 207 43 1 Pins(Zero, One)
31 272 43 1 ReceivedBytes(0x10)
0 91 44 1 Err(UnsupportedProtocol)
1 195 44 1 Stats(RotIoStats { rx_received: 0xac, rx_overrun: 0x1, csn_pulses: 0x0, tx_underrun: 0xab, rx_invalid: 0xac, tx_incomplete: 0x0 })
2 282 44 1 ReplyLen(0xb)
3 207 44 1 Pins(One, One)
4 361 44 1 Underrun
5 321 44 1 SentBytes(0x0)
6 207 44 1 Pins(Zero, One)
7 272 44 1 ReceivedBytes(0x10)
8 91 44 1 Err(UnsupportedProtocol)
9 195 44 1 Stats(RotIoStats { rx_received: 0xad, rx_overrun: 0x1, csn_pulses: 0x0, tx_underrun: 0xac, rx_invalid: 0xad, tx_incomplete: 0x0 })
10 282 44 1 ReplyLen(0xb)
11 207 44 1 Pins(One, One)
12 361 44 1 Underrun
13 321 44 1 SentBytes(0x0)
14 207 44 1 Pins(Zero, One)
15 272 44 1 ReceivedBytes(0x10)
16 91 44 1 Err(UnsupportedProtocol)
17 195 44 1 Stats(RotIoStats { rx_received: 0xae, rx_overrun: 0x1, csn_pulses: 0x0, tx_underrun: 0xad, rx_invalid: 0xae, tx_incomplete: 0x0 })
18 282 44 1 ReplyLen(0xb)
19 207 44 1 Pins(One, One)
20 361 44 1 Underrun
21 321 44 1 SentBytes(0x0)
22 207 44 1 Pins(Zero, One)
And here's a screenshot of the early ROT_IRQ de-assertion due to desync from the salae trace for this run. This happens indefinitely.
from hubris.
Adapted Andrew's script for my environment:
#!/usr/bin/env bash
set -e
START=$(date +%s)
MAX_GRACE_CYCLES=20
# Number of times in a row `faux-mgs state` must fail to be a test success
MAX_FAILS=${MAX_FAILS:=100}
#MAX_FAILS=${MAX_FAILS:=5}
LOG_LEVEL="--log-level critical"
case $USER in
ajs)
HUBRIS_REPO=${HUBRIS_REPO:=~/oxide/hubris}
export HUMILITY_ENVIRONMENT=${HUMILITY_ENVIRONMENT:=~/Users/ajs/oxide/humility_env.json}
rota_reset='humility -t rot-carrier --archive-name a reset'
INTERFACE=en6
MGS_REPO=${MGS_REPO:=~/oxide/management-gateway-service}
FAUX_MGS="${MGS_REPO}/target/release/faux-mgs --interface ${INTERFACE} ${LOG_LEVEL}"
ROT_RESET_COMMAND=${RESET_COMMAND:="$rota_reset"}
;;
stoltz)
FAUX_MGS="$(which faux-mgs) ${LOG_LEVEL}"
HUBRIS_REPO=$HOME/Oxide/src/hubris
export HUMILITY_ARCHIVE="$HUBRIS_REPO/target/rot-carrier/dist/a/build-rot-carrier-image-a.zip"
export HUMILITY_PROBE="1fc9:0143:MMLBHACB24IDJ"
rota_reset="humility reset"
ROT_RESET_COMMAND=${RESET_COMMAND:="$rota_reset"}
;;
*)
echo 'Where am I?'
exit 1
;;
esac
ready_to_reset=false
failures_in_a_row=0
# We want to wait for a few times through the loop to make sure the reset
# command completes before trying again. Otherwise, we will get an error:
#
# humility reset failed: Probe could not be created
#
# Caused by:
# 0: hidapi error: hid_error is not implemented yet
# 1: hidapi error: hid_error is not implemented yet
#
reset_grace=$MAX_GRACE_CYCLES
while true; do
# Run our command
if (${FAUX_MGS} state |& grep -q "RoT state: Ok" )
then
failures_in_a_row=0
echo -n .
else
failures_in_a_row=$((failures_in_a_row+1))
echo -n +$failures_in_a_row
fi
# increment our grace period counter
reset_grace=$((reset_grace+1))
# Are we done?
if [[ failures_in_a_row -eq MAX_FAILS ]]; then
echo "Test Complete: found faulty sprot after $(( $(date +%2) - START )) seconds."
break
fi
# Should we reset the RoT to try again?
choice=$RANDOM
sleep=$((1/(choice % 100 + 1)))
if [[ $failures_in_a_row -eq 0 && $reset_grace -gt $MAX_GRACE_CYCLES ]]; then
# reset our grace period counter
reset_grace=0
# Flip a coin to see if we should sleep before we try to reset or after
# We right shift to create a boolean
choice=$((choice >> 14))
if [[ $choice -eq 0 ]]; then
bash -c " sleep ${sleep} && ${ROT_RESET_COMMAND} &"
else
bash -c "( ${ROT_RESET_COMMAND} )&"
fi
else
sleep 0.01
fi
done
from hubris.
Related Issues (20)
- Hubris: Config Spartan7 from Aux Flash
- Hubris: Host OS QSPI Driver via FPGA
- Hubris FMC integration on grapefruit
- Hubris: DDR Proxy
- Hubris: UART Proxy Drivers/Implementation
- Ruby Dev Box thermal management
- Hubris/FPGA: Interrupt pins
- SP serial console doesn't detach stale faux-mgs clients on an idle console
- dump-agent gets stuck in send to net HOT 4
- Rack 2 powered off and left blinking power sequencers behind HOT 2
- Load Front IO VPD into `packrat`
- Can't collect an SP dump
- Hubris Work for Grapefruit HOT 2
- maybe we should have a way to release a claimed EXTI interrupt?
- I2C should make sure the controller's off before taking over its pins
- I2C should not treat a mux as having been reset if it NACKs or otherwise fails.
- I2C driver is issuing bus resets in the middle of perfectly reasonable transactions. HOT 1
- I2C is generating 600-750ns glitches, and it should feel bad about this. HOT 1
- GPIO IRQ is causing task rebuilds HOT 3
- support default features at the task / kernel level
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from hubris.