Comments (17)

mcb30 avatar mcb30 commented on June 6, 2024 1

Based on IEEE 802.3-2008 sections 28.2.3.4.1 ("Next page encodings") and 40.5.1.2 ("1000BASE-T Auto-Negotiation page use"), the final value in the normal 1000BASE-T autonegotiation sequence should comprise an unformatted page containing the 11-bit master-slave seed value. According to section 40.5.2 ("MASTER-SLAVE configuration resolution"), this value will be used to determine which end acts as the master, but only in the case of a multiport<->multiport or single-port<->single-port connection. Since you have the single-port NIC attached to a multiport switch, this value should be irrelevant.

Unfortunately, the preceding autonegotiation page values (including the multiport bit) are not directly retrievable from the MII registers.

from ipxe.

mcb30 avatar mcb30 commented on June 6, 2024

@alkisg First thing worth checking is the "ifstat" output after a completed download, to check for reported errors.

If nothing shows up there, then a tcpdump (.pcap file) would be most useful, since it would show the relative packet timings along with any protocol oddities.

alkisg avatar alkisg commented on June 6, 2024

I inserted a shell; ifstat right before boot; here's the output of the slow boot (ipxe.pxe):

/ltsp/x86_64/vmlinuz... ok
/ltsp/ltsp.img... ok
/ltsp/x86_64/initrd.img... ok
iPXE> ifstat
net0: 3c:07:71:a2:02:e3 using rtl8168 on 0000:01:00.0 (open)
  [Link:up, TX:65229 TXE:1 RX:67185 RXE:1033]
  [TXE: 1 x "Network unreachable (http://ipxe.org/28086011)"]
  [RXE: 451 x "Operation not supported (http://ipxe.org/3c086003)"]
  [RXE: 560 x "The socket is not connected (http://ipxe.org/380f6001)"]
net1: a4:17:31:ed:f9:b1 using ar9485 on 0000:03:00.0 (closed)
  [Link:down, TX:0 TXE:0 RX:0 RXE:0]
  [Link status: Unknown (http://ipxe.org/1a086101)]

I'll upload the output of the fast boot (undionly.kpxe) next.

alkisg avatar alkisg commented on June 6, 2024

This is the output of the fast boot (BIOS PXE => undionly.kpxe). There are far fewer errors, but I wonder if the absence of net1 (=internal wifi) is significant; could it be that it's confusing iPXE even though it's not used anywhere?

/ltsp/x86_64/vmlinuz... ok
/ltsp/ltsp.img... ok
/ltsp/x86_64/initrd.img... ok
iPXE> ifstat
net0: 3c:07:71:a2:02:e3 using undionly on 0000:01:00.0 (open)
  [Link:up, TX:65228 TXE:1 RX:65301 RXE:8]
  [TXE: 1 x "Network unreachable (http://ipxe.org/28086011)"]
  [RXE: 8 x "The socket is not connected (http://ipxe.org/380f6001)"]
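
To make the difference between the two ifstat outputs concrete, here is a minimal sketch that parses the counter lines quoted above and reports the ratio of RXE to RX. The helper names are hypothetical, and the ratio is only an illustration: the thread does not say whether errored packets are also counted in RX.

```python
import re

# Counter lines copied from the two ifstat outputs in this thread.
IFSTAT_LINES = {
    "ipxe.pxe (rtl8168)": "[Link:up, TX:65229 TXE:1 RX:67185 RXE:1033]",
    "undionly.kpxe": "[Link:up, TX:65228 TXE:1 RX:65301 RXE:8]",
}


def parse_counters(line):
    """Return the TX/TXE/RX/RXE counters from an ifstat summary line."""
    pairs = re.findall(r"\b(TX|TXE|RX|RXE):(\d+)", line)
    return {name: int(value) for name, value in pairs}


def rx_error_ratio(line):
    """Ratio of RXE to RX (hypothetical metric, for comparison only)."""
    counters = parse_counters(line)
    return counters["RXE"] / counters["RX"]


if __name__ == "__main__":
    for label, line in IFSTAT_LINES.items():
        print(f"{label:20s} RXE/RX = {rx_error_ratio(line):.4%}")
```

With these inputs the native driver shows roughly 1.5% RXE relative to RX, while undionly shows about 0.01%.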

alkisg avatar alkisg commented on June 6, 2024

There's a race condition involved somewhere.
A cold boot usually makes the problem appear more often.
undionly appears to always be fast,
while ipxe.pxe, ipxe.kpxe, realtek.pxe and realtek.kpxe are sometimes fast and sometimes slow.

I'll try more in order to better understand when exactly it happens.

NiKiZe avatar NiKiZe commented on June 6, 2024

Do not use the .kpxe target for anything other than undionly.
Retest with .pxe.

You might want to compile with DEBUG=realtek.

Also compare a cold boot vs a restart from a working OS. There are many bugs in the rtl chips; some are even cheap clones from other companies, and they behave pretty badly.

alkisg avatar alkisg commented on June 6, 2024

I used make DEBUG=realtek bin/ipxe.pxe.
This is a video of the fast boot (8 secs): https://photos.app.goo.gl/kDp3bGN5K8qKRpKY7
This is a video of the slow boot (120 secs): https://photos.app.goo.gl/HNWX3dAaqEaw5ZT68
I have more videos and screenshots if something is blurry/unreadable in the ones above.

I didn't yet manage to pinpoint when exactly it happens.

  • undionly.kpxe is always fast
  • The rest are sometimes slow, sometimes fast
  • When they're fast, it takes many reboots to make them slow again
  • When they're slow, it takes many reboots to make them fast again
  • It happens less often with a direct cable connection and a bit more frequently when I use the aforementioned gigabit switch
  • I tried isolating the switch from the LAN traffic with no remarkable changes
  • I tried rebooting from Ubuntu 20.04 with no remarkable changes
  • I tried removing the netbook battery etc with no remarkable changes

Tomorrow I'll try to capture and upload a TCP dump of a smaller file, not 80MB.

mcb30 avatar mcb30 commented on June 6, 2024

@alkisg One thing definitely worth checking is the link speed. It should link-up at 1Gbps, but it is possible that something is causing this to fail and fall back to e.g. 10Mbps.

You can see the link speed in the output when built with DEBUG=realtek, or just check the link speed indicator LEDs on the switch port (not the NIC itself).

alkisg avatar alkisg commented on June 6, 2024

I'm capturing the boot process including an imgfetch of an empty (all zeroes) 100MB file, but to keep the logs shorter, I stop it after a few MB are already transferred.

This is the pcap of the slow boot (150 seconds); I'll upload the fast one in a bit:

slow.pcap.zip

In the switch the green light is on (1000 Mbps) for both the server and the client, but iftop -i enp5s0 on the server shows that the bandwidth utilization is very small, around 5 Mbps.

There's something random in the server logs: I sometimes see "flow control off" and sometimes "flow control rx/tx". I'm not sure if this matters.

[Sep29 09:05] r8169 0000:05:00.0 enp5s0: Link is Down
[ +11.899150] r8169 0000:05:00.0 enp5s0: Link is Up - 1Gbps/Full - flow control off
[  +3.294484] r8169 0000:05:00.0 enp5s0: Link is Down
[  +2.199555] r8169 0000:05:00.0 enp5s0: Link is Up - 1Gbps/Full - flow control rx/tx

alkisg avatar alkisg commented on June 6, 2024

And this is the pcap of the same boot process when it randomly happens to be fast (6 seconds to load the 100M image).
In this case, iftop -i enp5s0 shows a peak of 140 Mbps.
It was taking me a while to reproduce this using the switch, so I used a direct cable connection, which randomly happens to be fast more frequently. I hope it doesn't matter.

fast.pcap.zip

mcb30 avatar mcb30 commented on June 6, 2024

Thanks for the captures. There's nothing immediately odd visible: a steady data rate (as indicated by the I/O graph in wireshark at 100ms resolution), and no retransmissions.

In the slow case, there is consistently a high latency (~2ms) between TFTP DATA packets sent by the server and the corresponding ACK packet received by the server. There is no obvious reason why this might be happening.
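
As a rough sanity check on how much a ~2ms per-packet latency matters, here is a minimal sketch of a lockstep transfer model, in which the next DATA packet is only sent once the previous one has been acknowledged. The block size of 1432 bytes and the reading of "100M" as 100 MiB are assumptions (neither is stated in the thread), and the model ignores every delay other than the observed DATA-to-ACK latency.

```python
def lockstep_throughput_mbps(block_bytes, round_trip_s):
    """Upper bound on payload throughput with one block in flight per round trip."""
    return block_bytes * 8 / round_trip_s / 1e6


def transfer_time_s(file_bytes, block_bytes, round_trip_s):
    """Time to move a file when each block costs one full round trip."""
    blocks = -(-file_bytes // block_bytes)  # ceiling division
    return blocks * round_trip_s


BLOCK_BYTES = 1432               # assumption: negotiated TFTP block size
FILE_BYTES = 100 * 1024 * 1024   # assumption: "100M" means 100 MiB
LATENCY_S = 0.002                # ~2ms DATA-to-ACK latency from the capture

if __name__ == "__main__":
    print(f"{lockstep_throughput_mbps(BLOCK_BYTES, LATENCY_S):.1f} Mbps")
    print(f"{transfer_time_s(FILE_BYTES, BLOCK_BYTES, LATENCY_S):.0f} s")
```

Under these assumptions the model gives roughly 5.7 Mbps and about 146 seconds for the whole file, which is in the same range as the ~5 Mbps and 150 seconds reported for the slow case. That would be consistent with the per-packet latency alone accounting for the slowdown, though it does not explain where the latency comes from.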

I can only suggest a brute-force approach at this point: try swapping out components (e.g. different client NIC with everything else identical, different switch/no switch, different server NIC) until some pattern emerges.

alkisg avatar alkisg commented on June 6, 2024

Thank you Michael,

This also affects http transfers, although a bit less frequently, so I changed the issue title.
It happens over a direct cable connection as well, so it's not the switch.
I tested with a different direct cable, it's not the cable either.
I tested with tftpd-hpa as the TFTP server, it's not specific to dnsmasq.
It never happens with undionly.kpxe.

I'll also test with a different server NIC, but I believe it only happens for specific client NICs, as I've seen it in schools, where some clients were booting fast and some slowly.
I think it's related to Realtek NIC initialization, as I wasn't able to change from slow to fast or the opposite without rebooting.

Are there any delay/sleep/postpone/wait functions in the iPXE code where I could add debug prints, to see if they're called when they shouldn't be? E.g. is it possible that iPXE would treat the link as 10 Mbps (due to some bug) even though it reports 1000 Mbps?

alkisg avatar alkisg commented on June 6, 2024

I tested with an Intel server NIC (8086:107c) and I got the same issue, so I think it's only related to the Realtek client NIC.

I tested with another client (desktop PC instead of netbook) that had a very similar Realtek client NIC and the problem never happened there. Both NICs=10ec:8168, the slow one in lspci shows "rev 07" and the related BIOS message is "Realtek PCIe GBE Family Controller Series v2.44 (10/07/11)", while the fast one shows "rev 06" in lspci and "v2.56 (07/01/13)" in BIOS.
Another thing that was different in slow vs fast boots was the Realtek MII register 08, I don't know what that is:
MII 0xe2474 registers 08-0f: 4839 0300 2800 0000 0000 4007 0006 3000

I also tried forcing the negotiation speed from the server using a direct cable connection. The slow transfers only seem to happen when the link is gigabit, not in 100 or 10 Mbps links. The seconds needed to imgfetch the 100M image are:

  • 1000 Mbps randomly fast: 6 sec
  • 1000 Mbps randomly slow: 150 sec
  • 100 Mbps: 102 sec
  • 10 Mbps: 1200 sec

So when the bug is triggered, it makes the gigabit link behave like a ~66 Mbps link, slower than the 100 Mbps one but a lot faster than the 10 Mbps one. It doesn't matter if I force 1000 Mbps from the server or allow auto-negotiation, they're both randomly either fast or slow.
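
For reference, the absolute average throughput implied by these timings can be worked out directly. This is a minimal sketch that assumes "100M" means 100 MiB (the exact file size is not stated); the ~66 Mbps figure above appears to be a relative estimate based on the ratio of the 102-second and 150-second results, not an absolute rate.

```python
FILE_BITS = 100 * 1024 * 1024 * 8  # assumption: "100M" means 100 MiB

# imgfetch durations in seconds, as listed above.
MEASURED_SECONDS = {
    "1000 Mbps, fast": 6,
    "1000 Mbps, slow": 150,
    "100 Mbps": 102,
    "10 Mbps": 1200,
}


def average_mbps(seconds):
    """Average payload throughput in Mbps for the assumed file size."""
    return FILE_BITS / seconds / 1e6


if __name__ == "__main__":
    for label, seconds in MEASURED_SECONDS.items():
        print(f"{label:16s} {seconds:5d} s  {average_mbps(seconds):7.1f} Mbps")
```

Under that assumption the slow gigabit case averages about 5.6 Mbps and the fast case about 140 Mbps, in line with the iftop readings reported earlier (around 5 Mbps, and a peak of 140 Mbps).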

That's all I have for now; when I'm able to find another client that has the issue and bring it to my office, I'll post more.

mcb30 avatar mcb30 commented on June 6, 2024

Thought I'd replied to this already, but it seems to have gone missing. MII register 08 is the autonegotiation next page receive register, which should generally end up containing the final 16-bit word received by the NIC in the autonegotiation sequence. This value should depend only on the device on the other end of the link.
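
To illustrate how such a word breaks down, here is a minimal sketch that splits a 16-bit next page value into its fields, using the bit layout as I understand it from IEEE 802.3 clause 28 (NP, Ack, MP, Ack2, Toggle, then an 11-bit code field). The function name is hypothetical, and the sample value 0x4839 is the register 08 value from the one MII dump quoted earlier in the thread.

```python
def decode_next_page(word):
    """Split a 16-bit autonegotiation next page word into its fields."""
    return {
        "next_page": bool(word & 0x8000),     # NP, bit 15
        "acknowledge": bool(word & 0x4000),   # Ack, bit 14
        "message_page": bool(word & 0x2000),  # MP, bit 13 (0 = unformatted page)
        "acknowledge2": bool(word & 0x1000),  # Ack2, bit 12
        "toggle": bool(word & 0x0800),        # T, bit 11
        "code": word & 0x07FF,                # 11-bit message/unformatted code
    }


if __name__ == "__main__":
    fields = decode_next_page(0x4839)
    for name, value in fields.items():
        shown = f"0x{value:03x}" if name == "code" else value
        print(f"{name:13s} {shown}")
```

For 0x4839 this gives an unformatted page (MP clear) with no further pages to follow (NP clear) and an 11-bit code of 0x039, which would fit the description of a final unformatted page carrying the master-slave seed.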

NiKiZe avatar NiKiZe commented on June 6, 2024

@mcb30 any chance we can make it easier to collect data here?
Maybe it is an issue only when the RTL is one of master/slave?
Maybe there is some way to detect this special case? And if so try to force reneg?

alkisg avatar alkisg commented on June 6, 2024

Sorry guys, due to covid, schools are closed and I have no feedback about netbooted clients with similar issues. It may take months, but I'll post more details when I have them. If you prefer to close the issue till then, do so, and I'll reopen it when I have more data.

Testing a bit more with that one client that I have in my office, I saw I was wrong about this:

"The slow transfers only seem to happen when the link is gigabit, not in 100 or 10 Mbps links."

The slowdown does happen in 10/100 as well.

"Another thing that was different in slow vs fast boots was the Realtek MII register 08..."

Finally, I wasn't able to correlate any certain value or bit in that register with slow/fast boots. Maybe that register isn't related to the problem.

mcb30 avatar mcb30 commented on June 6, 2024

"Another thing that was different in slow vs fast boots was the Realtek MII register 08..."

"Finally, I wasn't able to correlate any certain value or bit in that register with slow/fast boots. Maybe that register isn't related to the problem."

Thanks; that probably saves a lot of potentially fruitless debugging.

I've pushed commit 8ef22d819 which might help debug further. Build with DEBUG=tftp and with PROFSTAT_CMD enabled in config/general.h, and use the profstat command after a completed TFTP transfer. This will show the times (in CPU ticks) that iPXE thinks it spent responding to TFTP data packets, and that iPXE thinks it took for the server to respond with a new data packet. We already know from the wireshark trace that the server end perceives an unexpected ~2ms latency from iPXE in the "slow" case: this debug should show us whether iPXE thinks it's the server that's taking an unexpectedly long time to respond.
