Giter Club home page Giter Club logo

Comments (21)

lzrd avatar lzrd commented on July 26, 2024 1

Recommendation from our discussion: on start, lpc55 sprot_serverr asserts ROT_IRQ until it has finished initializing FIFOs and SSA/SSD state. This does not fully cover the case where RoT has been reset and the SP steps on FIFO initialization, but it does cover some of that window and makes it much easier to find these failures on a logic analyzer.

from hubris.

lzrd avatar lzrd commented on July 26, 2024 1

Recommendation from Matt: after hot path receive, the RoT checks for CSn asserted. If asserted, we have a synchronization error. RoT waits for CSn deasserted, then cleans up, and queues a SYNC_ERROR message, then asserts ROT_IRQ.

from hubris.

jgallagher avatar jgallagher commented on July 26, 2024

Possibly related, although a different error message: #1451

from hubris.

mkeeter avatar mkeeter commented on July 26, 2024

Yeah, thanks for linking that!

I'm not sure if they're related; a timeout occurs if the RoT never asserts ROT_TO_SP_IRQ_L, versus sending corrupt data over SPI.

On the other hand, if we're blaming FIFOs for framing issues, we could hypothesize

  • This issue is caused by FIFO shenanigans in the RoT → SP direction, so the SP sees corrupt messages
  • #1451 is caused by FIFO shenanigans in the SP → RoT direction, so the RoT sees corrupt messages (and never pulls the IRQ line low, because it can't reply)

(This is extremely hand-wavey / unjustified, e.g. I can't remember if the RoT replies with an error on a corrupt message, but just putting it out there)

from hubris.

lzrd avatar lzrd commented on July 26, 2024

Re "This looks like a framing issue where the SP and RoT have gotten out of sync. ..."

SPI messages are framed by the SPI bus CSn signal.

from hubris.

lzrd avatar lzrd commented on July 26, 2024

Note: When SP does a flush (SPI CSn deassert, 10ms delay with no SPI clocks, CSn assert), the SP and RoT communications should be re-synchronized if they ever fell out of sync.

ajs and I don't see anything wrong there yet.

from hubris.

lzrd avatar lzrd commented on July 26, 2024

Is there any chance that the LPC55S69 parts are programmed to operate at 48MHz instead of the required 96MHz?
I don't think that is likely. But it can cause intermittent failures. Need to double check CMPA programming.

from hubris.

lzrd avatar lzrd commented on July 26, 2024

I did some SysTick (1ms) resolution ringbuf logging that doesn't immediately shed any light.

humility: ring buffer drv_lpc55_sprot_server::TIMES in sprot:
humility: attached to 1fc9:0143:MMLBHACB24IDJ via CMSIS-DAP
 NDX LINE      GEN    COUNT PAYLOAD

  39  185       89        1 Now(0x34310) // top of loop
  40  231       89        1 Now(0x34310) // wait_for_csn_asserted(false), notified.
  41  264       89        1 Now(0x34318) // wait_for_request(): read some, ssd
  42  276       89        1 Now(0x34318) // wait_for_request(): read remainder
  43  287       89        1 Now(0x34318) // received SP's message
  44  188       89        1 Now(0x34318) // message ok, sending to handler
  45  207       89        1 Now(0x3431b) // message handled
  46  209       89        1 Now(0x3431b) // ringbuf_entry!(Trace::Stats done
  47  311       89        1 Now(0x3431b) // reply: begining reply
  48  231       89        2 Now(0x3431b) // wait_for_csn_asserted(true), ROT_IRQ, then notified.
  49  322       89        1 Now(0x3431e) // reply: Tx loop done
  50  352       89        1 Now(0x3431e) // reply: FIFOs ready for next request.
  51  185       89        1 Now(0x3431e) // top of loop
  52  231       89        1 Now(0x3431e) // wait_for_csn_asserted

from hubris.

andrewjstone avatar andrewjstone commented on July 26, 2024

I was unable to reproduce on madrid on tuesday, however today I reproduced a single time on my gimletlet and rot-carrier. Reproduction here means that we get an endless stream of CrcError responses from faux-mgs state and the system never self corrects. Here are my raw notes from tonight's debugging session:


Reproduced on the gimletlet/rot-carrier with Keeter's recommendation

Spam faux-mgs state requests and reset RoT via debugger

Specifically, this command was run on my helios box:

ajs@tarz:~/management-gateway-service$ while true; do target/release/faux-mgs --interface igb0 state && sleep 0.1; done

Then on my mac: rota reset

Here's the relevant RoT ringbuf:

➜  hubris git:(hubris-1507-debugging) ✗ rota ringbuf
humility: attached to 1fc9:0143:CW4N3IFK22UUX via CMSIS-DAP
humility: ring buffer drv_lpc55_sprot_server::__RINGBUF in sprot:
 NDX LINE      GEN    COUNT PAYLOAD
  27  356       66        1 Underrun
  28  316       66        1 SentBytes(0x0)
  29  267       66        1 ReceivedBytes(0x10)
  30   91       66        1 Err(UnsupportedProtocol)
  31  194       66        1 Stats(RotIoStats { rx_received: 0x15f, rx_overrun: 0x2, csn_pulses: 0x0, tx_underrun: 0x15f, rx_invalid: 0x15f, tx_incomplete: 0x0 })
   0  277       67        1 ReplyLen(0xb)
   1  356       67        1 Underrun
   2  316       67        1 SentBytes(0x0)
   3  267       67        1 ReceivedBytes(0x10)
   4   91       67        1 Err(UnsupportedProtocol)
   5  194       67        1 Stats(RotIoStats { rx_received: 0x160, rx_overrun: 0x2, csn_pulses: 0x0, tx_underrun: 0x160, rx_invalid: 0x160, tx_incomplete: 0x0 })
   6  277       67        1 ReplyLen(0xb)
   7  356       67        1 Underrun
   8  316       67        1 SentBytes(0x0)
   9  267       67        1 ReceivedBytes(0x10)
  10   91       67        1 Err(UnsupportedProtocol)
  11  194       67        1 Stats(RotIoStats { rx_received: 0x161, rx_overrun: 0x2, csn_pulses: 0x0, tx_underrun: 0x161, rx_invalid: 0x161, tx_incomplete: 0x0 })
  12  277       67        1 ReplyLen(0xb)
  13  356       67        1 Underrun
  14  316       67        1 SentBytes(0x0)
  15  267       67        1 ReceivedBytes(0x10)
  16   91       67        1 Err(UnsupportedProtocol)
  17  194       67        1 Stats(RotIoStats { rx_received: 0x162, rx_overrun: 0x2, csn_pulses: 0x0, tx_underrun: 0x162, rx_invalid: 0x162, tx_incomplete: 0x0 })
  18  277       67        1 ReplyLen(0xb)
  19  356       67        1 Underrun
  20  316       67        1 SentBytes(0x0)
  21  267       67        1 ReceivedBytes(0x10)
  22   91       67        1 Err(UnsupportedProtocol)
  23  194       67        1 Stats(RotIoStats { rx_received: 0x163, rx_overrun: 0x2, csn_pulses: 0x0, tx_underrun: 0x163, rx_invalid: 0x163, tx_incomplete: 0x0 })
  24  277       67        1 ReplyLen(0xb)
  25  356       67        1 Underrun
  26  316       67        1 SentBytes(0x0)

What's odd is that consistently, UnsupportedProtocol is the error on the RoT
because of an overrun, and it returns this. The first 6 bytes come back on the
SP, but the remaining 5 are 0 resulting in a CRC error. This is the same on every iteration.

It's like the RoT is stuck and not actually mutating its buffers.

Here's the SP state for the last retry. The 0x17.. are timestamps in ticks (ms).


 22  596      276        1 ReqRotBootInfo
  23  218      276        1 DoSendRecv(0x17a1e5)
  24  346      276        1 RotIrqNotAsserted
  25  262      276        1 Sent(0xa)
  26  399      276        1 WaitIrqStarted(true, 0x17a1e5)
  27  413      276        1 WaitIrqStopped(true, 0x17a1e7)
  28  304      276        1 Received(0xb)
  29  470      276        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
  30  480      276        1 Error(Protocol(InvalidCrc))
  31  218      276        1 DoSendRecv(0x17a1f0)
  32  346      276        1 RotIrqNotAsserted
  33  262      276        1 Sent(0xa)
  34  399      276        1 WaitIrqStarted(true, 0x17a1f0)
  35  413      276        1 WaitIrqStopped(true, 0x17a1f2)
  36  304      276        1 Received(0xb)
  37  470      276        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
  38  480      276        1 Error(Protocol(InvalidCrc))
  39  218      276        1 DoSendRecv(0x17a1fb)
  40  346      276        1 RotIrqNotAsserted
  41  262      276        1 Sent(0xa)
  42  399      276        1 WaitIrqStarted(true, 0x17a1fb)
  43  413      276        1 WaitIrqStopped(true, 0x17a1fd)
  44  304      276        1 Received(0xb)
  45  470      276        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
  46  480      276        1 Error(Protocol(InvalidCrc))
  47  490      276        1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }

The first 6 bytes indicate an error response for UnsupportedProtocol at
protocol version 3. However, the remaining bytes aren't filled in. The last 2
bytes should be the CRC, but they are 0.

I hooked up the salae and I can see that the RoT is raising ROT_IRQ before it
sees CSn de-asserted. It looks like this happens at byte 9 of the response
from the RoT, and then 6 bytes get clocked out after. Interestingly this would
indeed make the checksums 0, which they also are on the wire and result in the
SP seeing a CRC error. However, it's still unclear to me why the RoT thinks
it's seeing an UnsupportedProtocol. The message from the SP on the wire is correct.

What could explain both these things: early ROT_IRQ raise and invalid read is
if perhaps CSN being de-asserted is misdetected/ tracked wrong in a variable.
That's what I'll be looking into in the code next.

We also see that the RoT is returning 0 bytes on reply and detecting an underrun. It's
also receiving 16 bytes, which should only occur when the SP is waiting for a reply. So clearly
there is a persistent desynchronization, but the SP also should only wait for reply
when ROT_IRQ is asserted, which I don't see in the trace. However, it looks like
we don't de-assert ROT_IRQ at RoT startup and that could cause issues.

Going to add some more debug traces to the RoT side and try to reproduce again.


Having a hard time reproducing again with my added ringbuf traces on the RoT. So
let's do some speculating.

We know from the RoT ringbuf that the RoT receives 16 bytes while
waiting for request. It sees all 16 bytes, and so decodes them as
SprotError::UnknownProtocol. We can assume that this is the SP trying to get
a reply to its prior messsage, as the SP always tries to read a full fifo of 16
bytes. And while waiting for a reply the SP is clocking out zeroes which would
be interpreted by the RoT as UnknownProtocol. However, the SP will only try
to read a reply if it sees ROT_IRQ raised. So the RoT must be asserting ROT_IRQ,
even though it's waiting for a request from the SP and should have ROT_IRQ
deasserted during that time.

We also know that the RoT when trying to reply with this error had not clocked
out more than its fifo, before CSn was deasserted, due to the misleading
Trace::SentBytes(0). This implies that as soon as CSn was asserted and the irq fired in reply
that CSn had also become deasserted.

We then notice in the Rot reply code that we drain the fifo and deassert
ROT_IRQ. At this point we should go back to waiting for the next message in
wait_for_request. But the deassert of ROT_IRQ shows up "early" at the SP and
the SP detects a CRC Error. The SP then retries to send after 5ms and the RoT sees an UnsupportedProtocol,
raises ROT_IRQ and replies, while the SP thinks it is still sending, then the SP goes to receive and gets back partial bytes.

So maybe this is purely just a timing rhythm that kicks off when the RoT starts
at a particular point in time. Here are the hypothesized steps.

  1. The SP sends a status message. It has clocked out most of the bytes, and then de-asserts CSn.
  2. The RoT wakes up, waits for CSn to be asserted by waiting for an IRQ.
  3. Some time passes, and the RoT gets woken up, sees CSn is asserted, then goes into a tight loop to read the message.
  4. The RoT only clocks in a partial message (the last byte or so from the SP request)
  5. The RoT sees that CSn is already de-asserted
  6. The RoT tries to deserialize the reply, sees that it is invalid then calls reply.
  7. By this point, the SP's retry timer has fired causing it to start clocking out the next request
  8. The RoT raises ROT_IRQ, then waits on an IRQ for CSn to be asserted.
  9. CSn is set in the interrupt register on the RoT and the RoT goes to reply.
  10. The RoT only clocks a few bytes out of its reply because it is actually receiving the remainder of a partial SP messaage. This occurs because the RoT sees that CSn is also deasserted via the ssd bit.
  11. The RoT deasserts-rot-irq and goes to wait for the next request.
  12. The SP gets a short reply, with a CRC error then goes to retry.
  13. The cycle repeats..

This is plausible because we see an overrun in the RoT logs indicating a short
read from the SP on every retry and we also saw ROT_IRQ trigger early on the
salae trace, and we know that only partial replies are coming back to the SP,
even though they always start with the right header bytes.

It seems that this is all due to the retry timeout being 5ms and us hitting
that repeatedly. So a poorly chosen timeout results in a loop. This is deeply
unsatisfying as it's very hard to prove its the truth. It also doesn't really
explain why even after the retries are up, that the system never comes back
and continues with the same behavior on the next attempt. Regardless, there are
a few things we should do:

  • Raise the retry timeout
  • If the RoT receives a bad message from the SP (flow error, unsupported
    protocol, etc...), the RoT should NOT send a reply. This just causes the RoT
    to waste time in what will result in a retry anyway. Instead the RoT should go
    back to waiting for the next message and let the SP timeout. This will also have the side effect
    of simplifying the retry error handling on the SP, as now only "unretryable" errors will be returned.
  • Initialize the ROT_IRQ to deasserted on RoT sprot startup. There is no evidence that this is an issue,
    but it should be done.

Now, to help clarify, I'd like to get a reproduction with a salae trace. I'm
going to remove my RoT changes and try again, recording the whole thing from the
start of the SP spam prior to RoT reset.

from hubris.

andrewjstone avatar andrewjstone commented on July 26, 2024

Ok, reproduced again. Interestingly, a closer look shows that there are a few UnsupportedProtocol reported from the SP to faux-mgs. But we are definitely stuck. This time I got the whole thing recorded on the salae. I Need to figure out where the first error after the timeout starts. I will check that out in the morning. I attached the trace:
SprotInvalidCrc.sal.zip

Here's the full gimletlet ringbuf

➜  hubris git:(hubris-1507-debugging) ✗ gimletlet ringbuf sprot
humility: attached to 0483:3754:005100383431511731343632 via ST-Link V3
humility: ring buffer drv_stm32h7_spi_server_core::__RINGBUF in sprot:
 NDX LINE      GEN    COUNT PAYLOAD
   7  430     1993        1 Tx(0x0)
   8  430     1993        1 Tx(0x3)
   9  430     1993        1 Tx(0x7)
  10  494     1993        2 WaitISR(0xa0002)
  11  464     1993        1 Rx(0x0)
  12  430     1993        1 Tx(0xa9)
  13  494     1993        2 WaitISR(0x90002)
  14  464     1993        1 Rx(0x0)
  15  430     1993        1 Tx(0x0)
  16  494     1993        2 WaitISR(0x80012)
  17  464     1993        1 Rx(0x0)
  18  494     1993        2 WaitISR(0x70012)
  19  464     1993        1 Rx(0x0)
  20  494     1993        2 WaitISR(0x60012)
  21  464     1993        1 Rx(0x0)
  22  494     1993        2 WaitISR(0x50012)
  23  464     1993        1 Rx(0x0)
  24  494     1993        2 WaitISR(0x40012)
  25  464     1993        1 Rx(0x0)
  26  494     1993        2 WaitISR(0x30012)
  27  464     1993        2 Rx(0x0)
  28  494     1993        2 WaitISR(0x10012)
  29  464     1993        1 Rx(0x0)
  30  322     1993        1 Start(read, (0x0, 0x10))
  31  430     1993        8 Tx(0x0)
  32  494     1993        2 WaitISR(0x100002)
  33  464     1993        1 Rx(0x3)
  34  430     1993        1 Tx(0x0)
  35  494     1993        2 WaitISR(0xf0002)
  36  464     1993        1 Rx(0x0)
  37  430     1993        1 Tx(0x0)
  38  494     1993        3 WaitISR(0xe0002)
  39  464     1993        1 Rx(0x0)
  40  430     1993        1 Tx(0x0)
  41  494     1993        1 WaitISR(0xd0002)
  42  464     1993        1 Rx(0x0)
  43  430     1993        1 Tx(0x0)
  44  494     1993        2 WaitISR(0xc0002)
  45  464     1993        1 Rx(0x3)
  46  430     1993        1 Tx(0x0)
  47  494     1993        1 WaitISR(0xb0002)
  48  464     1993        1 Rx(0x0)
  49  430     1993        1 Tx(0x0)
  50  494     1993        1 WaitISR(0xb0002)
  51  494     1993        2 WaitISR(0xa0002)
  52  464     1993        1 Rx(0x0)
  53  430     1993        1 Tx(0x0)
  54  494     1993        1 WaitISR(0x90002)
  55  464     1993        1 Rx(0x0)
  56  430     1993        1 Tx(0x0)
  57  494     1993        2 WaitISR(0x80012)
  58  464     1993        1 Rx(0x0)
  59  494     1993        2 WaitISR(0x70012)
  60  464     1993        1 Rx(0x0)
  61  494     1993        2 WaitISR(0x60012)
  62  464     1993        1 Rx(0x0)
  63  494     1993        2 WaitISR(0x50012)
   0  464     1994        2 Rx(0x0)
   1  494     1994        2 WaitISR(0x30012)
   2  464     1994        1 Rx(0x0)
   3  494     1994        2 WaitISR(0x20012)
   4  464     1994        1 Rx(0x0)
   5  494     1994        2 WaitISR(0x10012)
   6  464     1994        1 Rx(0x0)
humility: ring buffer drv_stm32h7_sprot_server::__RINGBUF in sprot:
 NDX LINE      GEN    COUNT PAYLOAD
 244  399       57        1 WaitIrqStarted(true, 0x1b3d0)
 245  413       57        1 WaitIrqStopped(true, 0x1b3d2)
 246  304       57        1 Received(0xb)
 247  470       57        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 248  480       57        1 Error(Protocol(InvalidCrc))
 249  218       57        1 DoSendRecv(0x1b3db)
 250  346       57        1 RotIrqNotAsserted
 251  262       57        1 Sent(0xa)
 252  399       57        1 WaitIrqStarted(true, 0x1b3db)
 253  413       57        1 WaitIrqStopped(true, 0x1b3dd)
 254  304       57        1 Received(0xb)
 255  470       57        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
   0  480       58        1 Error(Protocol(InvalidCrc))
   1  218       58        1 DoSendRecv(0x1b3e6)
   2  346       58        1 RotIrqNotAsserted
   3  262       58        1 Sent(0xa)
   4  399       58        1 WaitIrqStarted(true, 0x1b3e6)
   5  413       58        1 WaitIrqStopped(true, 0x1b3e8)
   6  304       58        1 Received(0xb)
   7  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
   8  480       58        1 Error(Protocol(InvalidCrc))
   9  490       58        1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }
  10  596       58        1 ReqRotBootInfo
  11  218       58        1 DoSendRecv(0x1b456)
  12  346       58        1 RotIrqNotAsserted
  13  262       58        1 Sent(0xa)
  14  399       58        1 WaitIrqStarted(true, 0x1b456)
  15  413       58        1 WaitIrqStopped(true, 0x1b458)
  16  304       58        1 Received(0xb)
  17  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
  18  480       58        1 Error(Protocol(InvalidCrc))
  19  218       58        1 DoSendRecv(0x1b461)
  20  346       58        1 RotIrqNotAsserted
  21  262       58        1 Sent(0xa)
  22  399       58        1 WaitIrqStarted(true, 0x1b461)
  23  413       58        1 WaitIrqStopped(true, 0x1b463)
  24  304       58        1 Received(0xb)
  25  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
  26  480       58        1 Error(Protocol(InvalidCrc))
  27  218       58        1 DoSendRecv(0x1b46c)
  28  346       58        1 RotIrqNotAsserted
  29  262       58        1 Sent(0xa)
  30  399       58        1 WaitIrqStarted(true, 0x1b46c)
  31  413       58        1 WaitIrqStopped(true, 0x1b46e)
  32  304       58        1 Received(0xb)
  33  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
  34  480       58        1 Error(Protocol(InvalidCrc))
  35  490       58        1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }
  36  596       58        1 ReqRotBootInfo
  37  218       58        1 DoSendRecv(0x1b4dc)
  38  346       58        1 RotIrqNotAsserted
  39  262       58        1 Sent(0xa)
  40  399       58        1 WaitIrqStarted(true, 0x1b4dc)
  41  413       58        1 WaitIrqStopped(true, 0x1b4de)
  42  304       58        1 Received(0xb)
  43  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
  44  480       58        1 Error(Protocol(InvalidCrc))
  45  218       58        1 DoSendRecv(0x1b4e7)
  46  346       58        1 RotIrqNotAsserted
  47  262       58        1 Sent(0xa)
  48  399       58        1 WaitIrqStarted(true, 0x1b4e7)
  49  413       58        1 WaitIrqStopped(true, 0x1b4e9)
  50  304       58        1 Received(0xb)
  51  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
  52  480       58        1 Error(Protocol(InvalidCrc))
  53  218       58        1 DoSendRecv(0x1b4f2)
  54  346       58        1 RotIrqNotAsserted
  55  262       58        1 Sent(0xa)
  56  399       58        1 WaitIrqStarted(true, 0x1b4f2)
  57  413       58        1 WaitIrqStopped(true, 0x1b4f4)
  58  304       58        1 Received(0xb)
  59  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
  60  480       58        1 Error(Protocol(InvalidCrc))
  61  490       58        1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }
  62  596       58        1 ReqRotBootInfo
  63  218       58        1 DoSendRecv(0x1b562)
  64  346       58        1 RotIrqNotAsserted
  65  262       58        1 Sent(0xa)
  66  399       58        1 WaitIrqStarted(true, 0x1b562)
  67  413       58        1 WaitIrqStopped(true, 0x1b564)
  68  304       58        1 Received(0xb)
  69  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
  70  480       58        1 Error(Protocol(InvalidCrc))
  71  218       58        1 DoSendRecv(0x1b56d)
  72  346       58        1 RotIrqNotAsserted
  73  262       58        1 Sent(0xa)
  74  399       58        1 WaitIrqStarted(true, 0x1b56d)
  75  413       58        1 WaitIrqStopped(true, 0x1b56f)
  76  304       58        1 Received(0xb)
  77  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
  78  480       58        1 Error(Protocol(InvalidCrc))
  79  218       58        1 DoSendRecv(0x1b578)
  80  346       58        1 RotIrqNotAsserted
  81  262       58        1 Sent(0xa)
  82  399       58        1 WaitIrqStarted(true, 0x1b578)
  83  413       58        1 WaitIrqStopped(true, 0x1b57a)
  84  304       58        1 Received(0xb)
  85  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
  86  480       58        1 Error(Protocol(InvalidCrc))
  87  490       58        1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }
  88  596       58        1 ReqRotBootInfo
  89  218       58        1 DoSendRecv(0x1b5e8)
  90  346       58        1 RotIrqNotAsserted
  91  262       58        1 Sent(0xa)
  92  399       58        1 WaitIrqStarted(true, 0x1b5e8)
  93  413       58        1 WaitIrqStopped(true, 0x1b5ea)
  94  304       58        1 Received(0xb)
  95  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
  96  480       58        1 Error(Protocol(InvalidCrc))
  97  218       58        1 DoSendRecv(0x1b5f3)
  98  346       58        1 RotIrqNotAsserted
  99  262       58        1 Sent(0xa)
 100  399       58        1 WaitIrqStarted(true, 0x1b5f3)
 101  413       58        1 WaitIrqStopped(true, 0x1b5f5)
 102  304       58        1 Received(0xb)
 103  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 104  480       58        1 Error(Protocol(InvalidCrc))
 105  218       58        1 DoSendRecv(0x1b5fe)
 106  346       58        1 RotIrqNotAsserted
 107  262       58        1 Sent(0xa)
 108  399       58        1 WaitIrqStarted(true, 0x1b5fe)
 109  413       58        1 WaitIrqStopped(true, 0x1b600)
 110  304       58        1 Received(0xb)
 111  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 112  480       58        1 Error(Protocol(InvalidCrc))
 113  490       58        1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }
 114  596       58        1 ReqRotBootInfo
 115  218       58        1 DoSendRecv(0x1b66e)
 116  346       58        1 RotIrqNotAsserted
 117  262       58        1 Sent(0xa)
 118  399       58        1 WaitIrqStarted(true, 0x1b66e)
 119  413       58        1 WaitIrqStopped(true, 0x1b670)
 120  304       58        1 Received(0xb)
 121  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 122  480       58        1 Error(Protocol(InvalidCrc))
 123  218       58        1 DoSendRecv(0x1b679)
 124  346       58        1 RotIrqNotAsserted
 125  262       58        1 Sent(0xa)
 126  399       58        1 WaitIrqStarted(true, 0x1b679)
 127  413       58        1 WaitIrqStopped(true, 0x1b67b)
 128  304       58        1 Received(0xb)
 129  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 130  480       58        1 Error(Protocol(InvalidCrc))
 131  218       58        1 DoSendRecv(0x1b684)
 132  346       58        1 RotIrqNotAsserted
 133  262       58        1 Sent(0xa)
 134  399       58        1 WaitIrqStarted(true, 0x1b684)
 135  413       58        1 WaitIrqStopped(true, 0x1b686)
 136  304       58        1 Received(0xb)
 137  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 138  480       58        1 Error(Protocol(InvalidCrc))
 139  490       58        1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }
 140  596       58        1 ReqRotBootInfo
 141  218       58        1 DoSendRecv(0x1b6f4)
 142  346       58        1 RotIrqNotAsserted
 143  262       58        1 Sent(0xa)
 144  399       58        1 WaitIrqStarted(true, 0x1b6f4)
 145  413       58        1 WaitIrqStopped(true, 0x1b6f6)
 146  304       58        1 Received(0xb)
 147  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 148  480       58        1 Error(Protocol(InvalidCrc))
 149  218       58        1 DoSendRecv(0x1b6ff)
 150  346       58        1 RotIrqNotAsserted
 151  262       58        1 Sent(0xa)
 152  399       58        1 WaitIrqStarted(true, 0x1b6ff)
 153  413       58        1 WaitIrqStopped(true, 0x1b701)
 154  304       58        1 Received(0xb)
 155  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 156  480       58        1 Error(Protocol(InvalidCrc))
 157  218       58        1 DoSendRecv(0x1b70a)
 158  346       58        1 RotIrqNotAsserted
 159  262       58        1 Sent(0xa)
 160  399       58        1 WaitIrqStarted(true, 0x1b70a)
 161  413       58        1 WaitIrqStopped(true, 0x1b70c)
 162  304       58        1 Received(0xb)
 163  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 164  480       58        1 Error(Protocol(InvalidCrc))
 165  490       58        1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }
 166  596       58        1 ReqRotBootInfo
 167  218       58        1 DoSendRecv(0x1b77a)
 168  346       58        1 RotIrqNotAsserted
 169  262       58        1 Sent(0xa)
 170  399       58        1 WaitIrqStarted(true, 0x1b77a)
 171  413       58        1 WaitIrqStopped(true, 0x1b77c)
 172  304       58        1 Received(0xb)
 173  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 174  480       58        1 Error(Protocol(InvalidCrc))
 175  218       58        1 DoSendRecv(0x1b785)
 176  346       58        1 RotIrqNotAsserted
 177  262       58        1 Sent(0xa)
 178  399       58        1 WaitIrqStarted(true, 0x1b785)
 179  413       58        1 WaitIrqStopped(true, 0x1b787)
 180  304       58        1 Received(0xb)
 181  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 182  480       58        1 Error(Protocol(InvalidCrc))
 183  218       58        1 DoSendRecv(0x1b790)
 184  346       58        1 RotIrqNotAsserted
 185  262       58        1 Sent(0xa)
 186  399       58        1 WaitIrqStarted(true, 0x1b790)
 187  413       58        1 WaitIrqStopped(true, 0x1b792)
 188  304       58        1 Received(0xb)
 189  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 190  480       58        1 Error(Protocol(InvalidCrc))
 191  490       58        1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }
 192  596       58        1 ReqRotBootInfo
 193  218       58        1 DoSendRecv(0x1b800)
 194  346       58        1 RotIrqNotAsserted
 195  262       58        1 Sent(0xa)
 196  399       58        1 WaitIrqStarted(true, 0x1b800)
 197  413       58        1 WaitIrqStopped(true, 0x1b802)
 198  304       58        1 Received(0xb)
 199  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 200  480       58        1 Error(Protocol(InvalidCrc))
 201  218       58        1 DoSendRecv(0x1b80b)
 202  346       58        1 RotIrqNotAsserted
 203  262       58        1 Sent(0xa)
 204  399       58        1 WaitIrqStarted(true, 0x1b80b)
 205  413       58        1 WaitIrqStopped(true, 0x1b80d)
 206  304       58        1 Received(0xb)
 207  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 208  480       58        1 Error(Protocol(InvalidCrc))
 209  218       58        1 DoSendRecv(0x1b816)
 210  346       58        1 RotIrqNotAsserted
 211  262       58        1 Sent(0xa)
 212  399       58        1 WaitIrqStarted(true, 0x1b816)
 213  413       58        1 WaitIrqStopped(true, 0x1b818)
 214  304       58        1 Received(0xb)
 215  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 216  480       58        1 Error(Protocol(InvalidCrc))
 217  490       58        1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }
 218  596       58        1 ReqRotBootInfo
 219  218       58        1 DoSendRecv(0x1b886)
 220  346       58        1 RotIrqNotAsserted
 221  262       58        1 Sent(0xa)
 222  399       58        1 WaitIrqStarted(true, 0x1b886)
 223  413       58        1 WaitIrqStopped(true, 0x1b888)
 224  304       58        1 Received(0xb)
 225  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 226  480       58        1 Error(Protocol(InvalidCrc))
 227  218       58        1 DoSendRecv(0x1b891)
 228  346       58        1 RotIrqNotAsserted
 229  262       58        1 Sent(0xa)
 230  399       58        1 WaitIrqStarted(true, 0x1b891)
 231  413       58        1 WaitIrqStopped(true, 0x1b893)
 232  304       58        1 Received(0xb)
 233  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 234  480       58        1 Error(Protocol(InvalidCrc))
 235  218       58        1 DoSendRecv(0x1b89c)
 236  346       58        1 RotIrqNotAsserted
 237  262       58        1 Sent(0xa)
 238  399       58        1 WaitIrqStarted(true, 0x1b89c)
 239  413       58        1 WaitIrqStopped(true, 0x1b89e)
 240  304       58        1 Received(0xb)
 241  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 242  480       58        1 Error(Protocol(InvalidCrc))
 243  490       58        1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }

Here's the full RoT ringbuf:

➜  hubris git:(hubris-1507-debugging) ✗ rota ringbuf sprot
humility: attached to 1fc9:0143:CW4N3IFK22UUX via CMSIS-DAP
humility: ring buffer drv_lpc55_sprot_server::__RINGBUF in sprot:
 NDX LINE      GEN    COUNT PAYLOAD
  18  356       69        1 Underrun
  19  316       69        1 SentBytes(0x0)
  20  267       69        1 ReceivedBytes(0x10)
  21   91       69        1 Err(UnsupportedProtocol)
  22  194       69        1 Stats(RotIoStats { rx_received: 0x16f, rx_overrun: 0x1, csn_pulses: 0x0, tx_underrun: 0x16a, rx_invalid: 0x16f, tx_incomplete: 0x0 })
  23  277       69        1 ReplyLen(0xb)
  24  356       69        1 Underrun
  25  316       69        1 SentBytes(0x0)
  26  267       69        1 ReceivedBytes(0x10)
  27   91       69        1 Err(UnsupportedProtocol)
  28  194       69        1 Stats(RotIoStats { rx_received: 0x170, rx_overrun: 0x1, csn_pulses: 0x0, tx_underrun: 0x16b, rx_invalid: 0x170, tx_incomplete: 0x0 })
  29  277       69        1 ReplyLen(0xb)
  30  356       69        1 Underrun
  31  316       69        1 SentBytes(0x0)
   0  267       70        1 ReceivedBytes(0x10)
   1   91       70        1 Err(UnsupportedProtocol)
   2  194       70        1 Stats(RotIoStats { rx_received: 0x171, rx_overrun: 0x1, csn_pulses: 0x0, tx_underrun: 0x16c, rx_invalid: 0x171, tx_incomplete: 0x0 })
   3  277       70        1 ReplyLen(0xb)
   4  356       70        1 Underrun
   5  316       70        1 SentBytes(0x0)
   6  267       70        1 ReceivedBytes(0x10)
   7   91       70        1 Err(UnsupportedProtocol)
   8  194       70        1 Stats(RotIoStats { rx_received: 0x172, rx_overrun: 0x1, csn_pulses: 0x0, tx_underrun: 0x16d, rx_invalid: 0x172, tx_incomplete: 0x0 })
   9  277       70        1 ReplyLen(0xb)
  10  356       70        1 Underrun
  11  316       70        1 SentBytes(0x0)
  12  267       70        1 ReceivedBytes(0x10)
  13   91       70        1 Err(UnsupportedProtocol)
  14  194       70        1 Stats(RotIoStats { rx_received: 0x173, rx_overrun: 0x1, csn_pulses: 0x0, tx_underrun: 0x16e, rx_invalid: 0x173, tx_incomplete: 0x0 })
  15  277       70        1 ReplyLen(0xb)
  16  356       70        1 Underrun
  17  316       70        1 SentBytes(0x0)

from hubris.

andrewjstone avatar andrewjstone commented on July 26, 2024

I also noticed, scrolling back in the faux-mgs buffer that the first message after the timeouts from the RoT reset that starts this cycle is RoT state: Err(Sprot(BadMessageLength)

from hubris.

andrewjstone avatar andrewjstone commented on July 26, 2024

Looks like the RoT is coming out of reset and issuing it's first ROT_IRQ asserted at ~95.056 s on the Salae trace. At what looks like the same time, CSn is getting de-asserted. So it seems like ROT_IRQ is firing early here for some reason. Then it ends early, presumably because it sees CSn deasserted already when replying.

Screenshot 2023-08-31 at 3 15 21 AM Screenshot 2023-08-31 at 3 15 48 AM

On the next status request things look saner from the ROT_IRQ being asserted POV, but it still ends early.
Screenshot 2023-08-31 at 3 17 44 AM

Screenshot 2023-08-31 at 3 18 20 AM

This pattern repeats indefinitely with ROT_IRQ de-asserting early. This is leading me strongly to believe that the SSD bit in the spi status register (CSn de-asserted) is being detected in the reply method of the lpc55-sprot-server even though the SP hasn't yet de-asserted CSn for this reply. This is almost certainly a mismanagement of the ssd bit, and not clearing it when it should be. We should probably guarantee we clear it right before we wait for the SSA (CSn asserted) interrupt.

If I had to point to one suspicious bit in the code it would be the place where we probably do the mishandling of the SSD bit:

while !self.spi.ssd() {
while self.spi.can_tx() {
let entry = get_u16(idx, tx_buf);
self.spi.send_u16(entry);
idx += 2;
}
}

Then right after, we go ahead and drain our TX fifo, which probably triggers the zeroes in the CRC location to show up at the SP instead of the actual CRC:

// Prime our write fifo, so we clock out zero bytes on the next receive
// We also empty our read fifo, since we don't bother reading bytes while writing.
self.spi.drain();
while self.spi.can_tx() {
self.spi.send_u16(0);
}
// We don't bother receiving bytes when sending. So we must clear
// the overrun error condition for the next time we wait for a reply.
self.spi.rxerr_clear();
// Now that we are ready to handle the next request, let the SP know we
// are ready.
self.deassert_rot_irq();

from hubris.

andrewjstone avatar andrewjstone commented on July 26, 2024

Note that while my prior comment explains what happens during the reply, it still doesn't really explain why the RoT is seeing a bunk request.

from hubris.

andrewjstone avatar andrewjstone commented on July 26, 2024

Note that while my prior comment explains what happens during the reply, it still doesn't really explain why the RoT is seeing a bunk request.

Hmm, actually it might. We also clear the receive fifo with the self.spi.drain(). However, the SP is still sending 0s here after we de-assert ROT_IRQ. Those end up in the receive fifo, and therefore will come before the next received request. Then what will happen is that the RoT will go back to waiting for the SSA interrupt, but after the few remaining 0s clocked out from the SP, the SP will de-assert CSn causing the SSD bit on the RoT to get set. This happens after we clear SSD in reply on the RoT. Then the next request comes in from the SP, we already have some leading 0s at the RoT and so we know it will be UndefinedProtocol. The RoT gets the SSA interrupt, then it goes into a tight loop, sees that SSD is already set, and reads as subset of the bytes from the SP. The RoT then goes into reply() and starts putting its reply into its TX fifo. Then the RoT asserts ROT_IRQ and waits for the SSA interrupt again. Then the SP finishes sending its request and now de-asserts CSn. When the SSA interrupt fires, SSD will again already be set, and the ROT will return a short response. This will happen perpetually, because of the mismanagement of the SSD bit.

TODO: Figure out how to not misuse the SSD bit on the RoT!

from hubris.

labbott avatar labbott commented on July 26, 2024

Awesome analysis! It sounds like this still matches with our understanding of how the SSA/SSD bits work? Those bits aren't used at all in NXP's reference driver so I have this nagging fear that they are undertested and wrongly documented.

from hubris.

andrewjstone avatar andrewjstone commented on July 26, 2024

Awesome analysis! It sounds like this still matches with our understanding of how the SSA/SSD bits work? Those bits aren't used at all in NXP's reference driver so I have this nagging fear that they are undertested and wrongly documented.

Thanks for taking a look @labbott! This matches my understanding of how the SSA/SSD bits work. Note that we do read them from different registers though. The SSA read is from the intstat register, which is destructive, while the SSD bit is from the status register which is non-destructive. This was a change I made while refactoring many months ago, but I don't think it's the cause of the issue or anything. AFAICT the bits act as we understand and as documented. I don't think we can necessarily blame NXP for anything this time 😆

from hubris.

andrewjstone avatar andrewjstone commented on July 26, 2024

Zooming in on the first trace, shows that right after a timeout SSD fires immediately after ROT_IRQ and sets us on this disastrous path.
Screenshot 2023-09-01 at 12 35 27 PM

from hubris.

andrewjstone avatar andrewjstone commented on July 26, 2024

The above is presumably because as the RoT comes out of reset, SSA and SSD are already set so it reads what's in its fifo and goes to reply. Then the request from the SP actually ends.

from hubris.

andrewjstone avatar andrewjstone commented on July 26, 2024

I wrote a script that can sometimes reproduce. I'll likely continue tweaking:

#!/usr/bin/env bash


MGS_REPO=${MGS_REPO:=~/oxide/management-gateway-service}
HUBRIS_REPO=${HUBRIS_REPO:=~/oxide/hubris}
HUMILITY_ENVIRONMENT=${HUMILITY_ENVIRONMENT:=~/Users/ajs/oxide/humility_env.json}

HENV=$HUMILITY_ENVIRONMENT
rota_reset='humility -t rot-carrier --archive-name a reset'
ROT_RESET_COMMAND=${RESET_COMMAND:="$rota_reset"}

# Number of times in a row `faux-mgs state` must fail to be a test success
MAX_FAILS=${MAX_FAILS:=100}


ready_to_reset=false
failures_in_a_row=0

# We want to wait for a few times through the loop to make sure the reset
# command completes before trying again. Otherwise, we will get an error:
#
#   humility reset failed: Probe could not be created
#
#  Caused by:
#      0: hidapi error: hid_error is not implemented yet
#  	1: hidapi error: hid_error is not implemented yet
#
MAX_GRACE_CYCLES=20
reset_grace=$MAX_GRACE_CYCLES

while true; do
  # Run our command
  output=`$MGS_REPO/target/release/faux-mgs --interface en6 --log-level critical state`
  echo "$output" | grep -q "RoT state: Ok"
  if [[ $? -eq 0 ]]; then
    failures_in_a_row=0
    echo "."
  else
    failures_in_a_row=$((failures_in_a_row+1))
    echo "+"
  fi

  # increment our grace period counter
  reset_grace=$((reset_grace+1))

  # Are we done?
  if [[ failures_in_a_row -eq MAX_FAILS ]]; then
    echo "Test Complete: found faulty sprot"
    break
  fi

  # Should we reset the RoT to try again?
  choice=$RANDOM
  sleep=$((1/(choice % 100 + 1)))
  if [[ $failures_in_a_row -eq 0 && $reset_grace -gt $MAX_GRACE_CYCLES ]]; then
    # reset our grace period counter
    reset_grace=0
    # Flip a coin to see if we should sleep before we try to reset or after
    # We right shift to create a boolean
    let "choice >>= 14"
    if [[ $choice -eq 0 ]]; then
      HUMILITY_ENV=$HENV bash -c "sleep ${sleep} && ${ROT_RESET_COMMAND} &" &
    else 
      HUMILITY_ENV=$HENV bash -c "${ROT_RESET_COMMAND} &" &
    fi 
  else
    sleep 0.01
  fi
done

from hubris.

andrewjstone avatar andrewjstone commented on July 26, 2024

Further validation from a local reproduction that the CSn pin remains asserted when we read it directly after a reply completes. I'm doing this manually after the reproduction where we are stuck in this state, and can see CSn continually asserted. This also means that we can use this pin (and @mkeeter's great suggestion) to detect the stuck state, wait for the actual CSn line to become de-asserted and then return a SYNC_ERROR message.

humility: ring buffer drv_lpc55_sprot_server::__RINGBUF in sprot:
 NDX LINE      GEN    COUNT PAYLOAD
  23  272       43        1 ReceivedBytes(0x10)
  24   91       43        1 Err(UnsupportedProtocol)
  25  195       43        1 Stats(RotIoStats { rx_received: 0xab, rx_overrun: 0x1, csn_pulses: 0x0, tx_underrun: 0xaa, rx_invalid: 0xab, tx_incomplete: 0x0 })
  26  282       43        1 ReplyLen(0xb)
  27  207       43        1 Pins(One, One)
  28  361       43        1 Underrun
  29  321       43        1 SentBytes(0x0)
  30  207       43        1 Pins(Zero, One)
  31  272       43        1 ReceivedBytes(0x10)
   0   91       44        1 Err(UnsupportedProtocol)
   1  195       44        1 Stats(RotIoStats { rx_received: 0xac, rx_overrun: 0x1, csn_pulses: 0x0, tx_underrun: 0xab, rx_invalid: 0xac, tx_incomplete: 0x0 })
   2  282       44        1 ReplyLen(0xb)
   3  207       44        1 Pins(One, One)
   4  361       44        1 Underrun
   5  321       44        1 SentBytes(0x0)
   6  207       44        1 Pins(Zero, One)
   7  272       44        1 ReceivedBytes(0x10)
   8   91       44        1 Err(UnsupportedProtocol)
   9  195       44        1 Stats(RotIoStats { rx_received: 0xad, rx_overrun: 0x1, csn_pulses: 0x0, tx_underrun: 0xac, rx_invalid: 0xad, tx_incomplete: 0x0 })
  10  282       44        1 ReplyLen(0xb)
  11  207       44        1 Pins(One, One)
  12  361       44        1 Underrun
  13  321       44        1 SentBytes(0x0)
  14  207       44        1 Pins(Zero, One)
  15  272       44        1 ReceivedBytes(0x10)
  16   91       44        1 Err(UnsupportedProtocol)
  17  195       44        1 Stats(RotIoStats { rx_received: 0xae, rx_overrun: 0x1, csn_pulses: 0x0, tx_underrun: 0xad, rx_invalid: 0xae, tx_incomplete: 0x0 })
  18  282       44        1 ReplyLen(0xb)
  19  207       44        1 Pins(One, One)
  20  361       44        1 Underrun
  21  321       44        1 SentBytes(0x0)
  22  207       44        1 Pins(Zero, One)

And here's a screenshot of the early ROT_IRQ de-assertion due to desync from the salae trace for this run. This happens indefinitely.
Screenshot 2023-09-07 at 2 19 03 AM

from hubris.

lzrd avatar lzrd commented on July 26, 2024

Adapted Andrew's script for my environment:

#!/usr/bin/env bash

set -e
START=$(date +%s)
MAX_GRACE_CYCLES=20
# Number of times in a row `faux-mgs state` must fail to be a test success
MAX_FAILS=${MAX_FAILS:=100}
#MAX_FAILS=${MAX_FAILS:=5}

LOG_LEVEL="--log-level critical"

case $USER in
  ajs)
    HUBRIS_REPO=${HUBRIS_REPO:=~/oxide/hubris}
    export HUMILITY_ENVIRONMENT=${HUMILITY_ENVIRONMENT:=~/Users/ajs/oxide/humility_env.json}
    rota_reset='humility -t rot-carrier --archive-name a reset'
    INTERFACE=en6
    MGS_REPO=${MGS_REPO:=~/oxide/management-gateway-service}
    FAUX_MGS="${MGS_REPO}/target/release/faux-mgs --interface ${INTERFACE} ${LOG_LEVEL}"
    ROT_RESET_COMMAND=${RESET_COMMAND:="$rota_reset"}
    ;;
  stoltz)
    FAUX_MGS="$(which faux-mgs) ${LOG_LEVEL}"

    HUBRIS_REPO=$HOME/Oxide/src/hubris
    export HUMILITY_ARCHIVE="$HUBRIS_REPO/target/rot-carrier/dist/a/build-rot-carrier-image-a.zip"
    export HUMILITY_PROBE="1fc9:0143:MMLBHACB24IDJ"
    rota_reset="humility reset"
    ROT_RESET_COMMAND=${RESET_COMMAND:="$rota_reset"}
    ;;
  *)
    echo 'Where am I?'
    exit 1
    ;;
esac

ready_to_reset=false
failures_in_a_row=0

# We want to wait for a few times through the loop to make sure the reset
# command completes before trying again. Otherwise, we will get an error:
#
#   humility reset failed: Probe could not be created
#
#  Caused by:
#      0: hidapi error: hid_error is not implemented yet
#  1: hidapi error: hid_error is not implemented yet
#
reset_grace=$MAX_GRACE_CYCLES

while true; do
  # Run our command
  if (${FAUX_MGS} state |& grep -q "RoT state: Ok" )
  then
    failures_in_a_row=0
    echo -n .
  else
    failures_in_a_row=$((failures_in_a_row+1))
    echo -n +$failures_in_a_row
  fi

  # increment our grace period counter
  reset_grace=$((reset_grace+1))

  # Are we done?
  if [[ failures_in_a_row -eq MAX_FAILS ]]; then
    echo "Test Complete: found faulty sprot after $(( $(date +%2) - START )) seconds."
    break
  fi

  # Should we reset the RoT to try again?
  choice=$RANDOM
  sleep=$((1/(choice % 100 + 1)))
  if [[ $failures_in_a_row -eq 0 && $reset_grace -gt $MAX_GRACE_CYCLES ]]; then
    # reset our grace period counter
    reset_grace=0
    # Flip a coin to see if we should sleep before we try to reset or after
    # We right shift to create a boolean
    choice=$((choice >> 14))
    if [[ $choice -eq 0 ]]; then
      bash -c " sleep ${sleep} && ${ROT_RESET_COMMAND} &"
    else
      bash -c "( ${ROT_RESET_COMMAND} )&"
    fi
  else
    sleep 0.01
  fi
done

from hubris.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.