Comments (34)

Cornelicorn commented on June 23, 2024

To add a data point to the reports: We had the same issue with Mellanox Technologies MT27710 Family [ConnectX-4 Lx] and "fixed" it by adding an #undef NET_PROTO_EAPOL to our build config.
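
A minimal sketch of that kind of override, assuming the usual iPXE convention that local build overrides live in config/local/general.h and that NET_PROTO_EAPOL is enabled by default in config/general.h. Both the file location and the default are assumptions to check against your own tree.

```c
/* config/local/general.h (assumed location for local build overrides) */

/* Disable EAP over LAN so that the build contains no EAPoL support */
#undef NET_PROTO_EAPOL
```

After adding the file, rebuild the image you boot from so that the override takes effect.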

from ipxe.

NiKiZe commented on June 23, 2024

This sounds like a possible duplicate of #1048 which should be fixed in current master. Can you verify which commit you have checked out?

NiKiZe commented on June 23, 2024

Could we get confirmation if this is fixed by the merge of #1174, thanks

mcb30 commented on June 23, 2024

With ipxe.efi this does work for my test host, though we can’t use that bootloader due to issues that occur when many different nics are installed in a system.

Thank you for testing.

Your result indicates that the issue is fixed in iPXE, so I will close this issue now. If you want to continue using snponly.efi, you will need to contact your UEFI BIOS vendor to get a BIOS update that includes the equivalent fix in the BIOS-provided SNP driver.

You can also open a separate issue to cover whatever problem you are seeing that prevents you from using ipxe.efi when many different NICs are installed.

nshalman commented on June 23, 2024

The tested revert was specifically fbc3b4a
master...nshalman:ipxe:fbc3b4a104698658202c2a83217ca8722453bf49

nshalman commented on June 23, 2024

I may not have tested on the latest master. Thank you for the pointer.

mcb30 commented on June 23, 2024

I may not have tested on the latest master. Thank you for the pointer.

Based on your git bisect log, your most recent commit tested was 115707c which is older than the known fix for this issue.

nshalman commented on June 23, 2024

My test fails on the latest commit of master (98dd25a)

http://147.28.150.231:8000/ipxe.efi... ok
iPXE initialising devices...ok



iPXE 1.0.0+ -- Open Source Network Boot Firmware -- https://ipxe.org
Features: DNS HTTP HTTPS NFS TFTP VLAN EFI Menu
Welcome to iPXE Stress Test Embedded Script!
Configuring (net0 98:03:9b:89:d9:36)..................... ok
https://artifacts.platformequinix.com/images/ubuntu/22_04/fe3f18eead9ab1bf6a333294198cdb6cdf918290/image.tar.gz.................. Connection timed out (https://ipxe.org/4c116092)
flexboot_nodnic_ports_register_dev: port register_dev failed (Status = -336093320)
flexboot_nodnic_probe: flexboot_nodnic_ports_register_dev failed (Status = -336093320)
flexboot_nodnic_ports_register_dev: port register_dev failed (Status = -336093320)
flexboot_nodnic_probe: flexboot_nodnic_ports_register_dev failed (Status = -336093320)
flexboot_nodnic_ports_register_dev: port register_dev failed (Status = -336093320)
flexboot_nodnic_probe: flexboot_nodnic_ports_register_dev failed (Status = -336093320)
flexboot_nodnic_ports_register_dev: port register_dev failed (Status = -336093320)
flexboot_nodnic_probe: flexboot_nodnic_ports_register_dev failed (Status = -336093320)
flexboot_nodnic_ports_register_dev: port register_dev failed (Status = -336093320)
flexboot_nodnic_probe: flexboot_nodnic_ports_register_dev failed (Status = -336093320)
flexboot_nodnic_ports_register_dev: port register_dev failed (Status = -336093320)
flexboot_nodnic_probe: flexboot_nodnic_ports_register_dev failed (Status = -336093320)
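
For anyone cross-referencing the log above, the decimal status can be converted to hexadecimal, which is the form used in iPXE error URLs such as the https://ipxe.org/4c116092 link earlier in the log. The sketch below assumes the printed status is a negated iPXE error number; status_to_code() and print_status() are hypothetical helpers and are not part of iPXE.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Convert a negative status, as printed in decimal by the driver,
 * to the positive code that would be shown in hexadecimal.
 * status_to_code() is a hypothetical helper, not part of iPXE.
 */
static unsigned int status_to_code(int status)
{
    assert(status < 0);
    /* Negate in a wider type so that INT_MIN cannot overflow */
    return (unsigned int)(-(long long)status);
}

/* Print the code as eight hex digits, the form used in iPXE error URLs */
static void print_status(int status)
{
    printf("%08x\n", status_to_code(status));
}
```

Calling print_status(-336093320) prints 14086088.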

nshalman commented on June 23, 2024

Additional testing confirms that Mellanox CX4 cards have trouble once booted into the latest commit of iPXE (98dd25a), but the problems go away if I apply my revert commit (fbc3b4a).

What additional debugging information would be of use for tracking down the issue?

NiKiZe commented on June 23, 2024

What is the card connected to, and what do you see on the wire?

ad-sei commented on June 23, 2024

I can confirm that 8b14652 also breaks it for Mellanox ConnectX-6 Lx cards. This happens up to the latest commit.

The NICs are connected to the switch via 100GBASE-CR4 QSFP28 cables in a LAG.

tcpdump done on the switch:

Switch A:

bash-4.2# tcpdump -i vlan1101 ether host b8:3f:d2:99:f0:34 -vvv

tcpdump: listening on vlan1101, link-type EN10MB (Ethernet), capture size 262144 bytes


12:56:14.461352 b8:3f:d2:99:f0:34 (oui Unknown) > Broadcast, ethertype IPv4 (0x0800), length 431: (tos 0x0, ttl 64, id 4395, offset 0, flags [none], proto UDP (17), length 417)
    0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b8:3f:d2:99:f0:34 (oui Unknown), length 389, xid 0xb0370948, secs 4, Flags [Broadcast] (0x8000)
          Client-Ethernet-Address b8:3f:d2:99:f0:34 (oui Unknown)
          Vendor-rfc1048 Extensions
            Magic Cookie 0x63825363
            DHCP-Message Option 53, length 1: Discover
            MSZ Option 57, length 2: 1472
            ARCH Option 93, length 2: 11
            NDI Option 94, length 3: 1.3.10
            Vendor-Class Option 60, length 32: "PXEClient:Arch:00011:UNDI:003010"
            User-Class Option 77, length 4:
              instance#1: ERROR: invalid option
            Parameter-Request Option 55, length 24:
              Subnet-Mask, Default-Gateway, Domain-Name-Server, LOG
              Hostname, Domain-Name, RP, MTU
              NTP, Vendor-Option, Vendor-Class, TFTP
              BF, Option 119, Option 128, Option 129
              Option 130, Option 131, Option 132, Option 133
              Option 134, Option 135, Option 175, Option 203
            T175 Option 175, length 36: 2969895189,3004178411,50402561,385941796,16847617,17891585,654377237,16852481,17957121
            Client-ID Option 61, length 7: ether b8:3f:d2:99:f0:34
            GUID Option 97, length 17: 0.80.53.57.56.54.57.83.71.72.51.50.56.70.50.90.83
            END Option 255, length 0
12:56:14.461520 b8:3f:d2:99:f0:34 (oui Unknown) > 33:33:00:00:00:02 (oui Unknown), ethertype IPv6 (0x86dd), length 70: (hlim 255, next-header ICMPv6 (58) payload length: 16) fe80::ba3f:d2ff:fe99:f034 > ff02::2: [icmp6 sum ok] ICMP6, router solicitation, length 16
          source link-address option (1), length 8 (1): b8:3f:d2:99:f0:34
            0x0000:  b83f d299 f034
12:56:14.776110 00:1c:73:00:00:99 (oui Arista Networks) > b8:3f:d2:99:f0:34 (oui Unknown), ethertype IPv6 (0x86dd), length 118: (hlim 255, next-header ICMPv6 (58) payload length: 64) fe80::21c:73ff:fe00:99 > fe80::ba3f:d2ff:fe99:f034: [icmp6 sum ok] ICMP6, router advertisement, length 64
        hop limit 64, Flags [managed], pref medium, router lifetime 1800s, reachable time 0ms, retrans timer 1000ms
          source link-address option (1), length 8 (1): 00:1c:73:00:00:99
            0x0000:  001c 7300 0099
          mtu option (5), length 8 (1):  9100
            0x0000:  0000 0000 238c
          prefix info option (3), length 32 (4): 2a05:b540:2:22::/64, Flags [onlink], valid time 2592000s, pref. time 604800s
            0x0000:  4080 0027 8d00 0009 3a80 0000 0000 2a05
            0x0010:  b540 0002 0022 0000 0000 0000 0000
12:56:14.942300 00:1c:73:00:00:99 (oui Arista Networks) > b8:3f:d2:99:f0:34 (oui Unknown), ethertype IPv6 (0x86dd), length 118: (hlim 255, next-header ICMPv6 (58) payload length: 64) fe80::21c:73ff:fe00:99 > fe80::ba3f:d2ff:fe99:f034: [icmp6 sum ok] ICMP6, router advertisement, length 64
        hop limit 64, Flags [managed], pref medium, router lifetime 1800s, reachable time 0ms, retrans timer 1000ms
          source link-address option (1), length 8 (1): 00:1c:73:00:00:99
            0x0000:  001c 7300 0099
          mtu option (5), length 8 (1):  9100
            0x0000:  0000 0000 238c
          prefix info option (3), length 32 (4): 2a05:b540:2:22::/64, Flags [onlink], valid time 2592000s, pref. time 604800s
            0x0000:  4080 0027 8d00 0009 3a80 0000 0000 2a05
            0x0010:  b540 0002 0022 0000 0000 0000 0000

Switch B:

12:59:22.241906 b8:3f:d2:99:f0:34 (oui Unknown) > Broadcast, ethertype IPv4 (0x0800), length 431: (tos 0x0, ttl 64, id 6565, offset 0, flags [none], proto UDP (17), length 417)
    0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b8:3f:d2:99:f0:34 (oui Unknown), length 389, xid 0x23431559, secs 4, Flags [Broadcast] (0x8000)
          Client-Ethernet-Address b8:3f:d2:99:f0:34 (oui Unknown)
          Vendor-rfc1048 Extensions
            Magic Cookie 0x63825363
            DHCP-Message Option 53, length 1: Discover
            MSZ Option 57, length 2: 1472
            ARCH Option 93, length 2: 11
            NDI Option 94, length 3: 1.3.10
            Vendor-Class Option 60, length 32: "PXEClient:Arch:00011:UNDI:003010"
            User-Class Option 77, length 4:
              instance#1: ERROR: invalid option
            Parameter-Request Option 55, length 24:
              Subnet-Mask, Default-Gateway, Domain-Name-Server, LOG
              Hostname, Domain-Name, RP, MTU
              NTP, Vendor-Option, Vendor-Class, TFTP
              BF, Option 119, Option 128, Option 129
              Option 130, Option 131, Option 132, Option 133
              Option 134, Option 135, Option 175, Option 203
            T175 Option 175, length 36: 2969895189,3004178411,50402561,385941796,16847617,17891585,654377237,16852481,17957121
            Client-ID Option 61, length 7: ether b8:3f:d2:99:f0:34
            GUID Option 97, length 17: 0.80.53.57.56.54.57.83.71.72.51.50.56.70.50.90.83
            END Option 255, length 0
12:59:22.242081 b8:3f:d2:99:f0:34 (oui Unknown) > 33:33:00:00:00:02 (oui Unknown), ethertype IPv6 (0x86dd), length 70: (hlim 255, next-header ICMPv6 (58) payload length: 16) fe80::ba3f:d2ff:fe99:f034 > ff02::2: [icmp6 sum ok] ICMP6, router solicitation, length 16
          source link-address option (1), length 8 (1): b8:3f:d2:99:f0:34
            0x0000:  b83f d299 f034
12:59:22.252335 00:1c:73:00:00:99 (oui Arista Networks) > b8:3f:d2:99:f0:34 (oui Unknown), ethertype IPv6 (0x86dd), length 118: (hlim 255, next-header ICMPv6 (58) payload length: 64) fe80::21c:73ff:fe00:99 > fe80::ba3f:d2ff:fe99:f034: [icmp6 sum ok] ICMP6, router advertisement, length 64
        hop limit 64, Flags [managed], pref medium, router lifetime 1800s, reachable time 0ms, retrans timer 1000ms
          source link-address option (1), length 8 (1): 00:1c:73:00:00:99
            0x0000:  001c 7300 0099
          mtu option (5), length 8 (1):  9100
            0x0000:  0000 0000 238c
          prefix info option (3), length 32 (4): 2a05:b540:2:22::/64, Flags [onlink], valid time 2592000s, pref. time 604800s
            0x0000:  4080 0027 8d00 0009 3a80 0000 0000 2a05
            0x0010:  b540 0002 0022 0000 0000 0000 0000

hope that helps

ech68 commented on June 23, 2024

I can also verify that reverting that commit fixes the failure to boot with Mellanox CX5 NICs.

In my tests, I'm booting using snponly.efi, FWIW

ech68 commented on June 23, 2024

I tested the latest version again, as I saw some more commits related to eapol went in earlier today, but this is still broken.

It appears that the check in eapol.c under the comment "Ignore non-EAPoL devices" isn't ignoring these Mellanox cards: if I add another unconditional "return 0;" before the "Initialize structure" comment, my hosts with Mellanox boot interfaces work.

Rinaldo-lsw commented on June 23, 2024

Hello,

I work for a relatively big hosting company, and we have also noticed that iPXE has been broken on Mellanox cards for a while.

As an example, we have new HP RL300 ARM servers, and these chassis have an onboard Mellanox card:
Mellanox Technologies MT2894 Family [ConnectX-6 Lx]

This issue is not limited to this specific model; we also have 25GbE+ Mellanox cards that behave the same way.

We are still on commit cac3a584dc8acea1522669f1ed16e0979fb92252, which works for Mellanox cards.
However, anything later breaks PXE boot.

Smithx10 commented on June 23, 2024

Ran into this issue with Mellanox CX5 and CX6; rebasing fbc3b4a onto main got them booting again.

nshalman commented on June 23, 2024

@Smithx10 can you try the workaround suggested by @Cornelicorn and report back whether it helped? Tweaking a define is a much less invasive workaround than backing the code out entirely. I haven't had a chance to test it myself.

To add a data point to the reports: We had the same issue with Mellanox Technologies MT27710 Family [ConnectX-4 Lx] and "fixed" it by adding an #undef NET_PROTO_EAPOL to our build config.

stappersg commented on June 23, 2024

From the iPXE IRC channel:

21:00 < Redacted> I'm trying to boot a Mellanox ConnectX5 card and
  ran into  Configuring (net2 a0:88:c2:6b:7f:44).................. No
  configuration methods succeeded (https://ipxe.org/040ee119)
21:00 < Redacted> in both bios and uefi 
21:00 < Redacted> Is there some gotcha with these Mellanox cards ?
21:06 < stappers> https://ipxe.org/040ee119
21:14 < Redacted> @stappers  think I might be hitting
https://github.com/ipxe/ipxe/issues/1091 ?
21:24 < stappers> Keep thinking and act upon the better thoughts, at
  least try to do.
21:55 < Redacted> Interesting, rolling back to
  https://github.com/ipxe/ipxe/tree/8f1514a00450119b04b08642c55aa674bdf5a4ef
  worked, Im applying this
  https://github.com/ipxe/ipxe/commit/fbc3b4a104698658202c2a83217ca8722453bf49
  and seeing what happens
21:58  * stappers is in UTC+1 and goes sleeping
22:36 < Redacted> Yea,  just confirmed,  mellanox worked after rebasing
  that commit onto main

As a non-Mellanox hardware owner, I am with the Mellanox hardware owners: somebody else should provide a merge request.

nshalman commented on June 23, 2024

As a non-Mellanox hardware owner, I am with the Mellanox hardware owners: somebody else should provide a merge request.

I don't think my revert commit is a good solution. I believe @mcb30 is working on a better solution.
Of the short term fixes I can currently think of, changing the default for NET_PROTO_EAPOL to be undefined is one option, assuming that that workaround works.

I am going to update my description of this bug to include the suggestion that folks attempt #1091 (comment) before patching the source.

As I have said before, I haven't had time to test that myself, but it seems very likely to me that it is a much simpler workaround.

danmcd commented on June 23, 2024

Adding a comment so I can watch. I have a downstream and was planning on updating with master soon. Would very much like this fixed before I accept the merge. (And apparently I have the guilty commit in two releases of our downstream.)

danmcd commented on June 23, 2024

Could we get confirmation if this is fixed by the merge of #1174, thanks

I'm going to ask our ops team to try it out on an affected box. We have, in the interim, removed EAPOL support from Triton's downstream of iPXE, since we don't currently use it anyway. See TritonDataCenter/ipxe#25.

ech68 commented on June 23, 2024

mcb30 commented on June 23, 2024

Still failed for me when building with this latest change and eapol enabled again, booting from snponly.

On Sun, Mar 17, 2024 at 7:18 PM Christian I. Nilsson wrote: Could we get confirmation if this is fixed by the merge of #1174, thanks

If you are using snponly then there's a good chance that the underlying SNP driver provided by Mellanox has the same bug, since Mellanox uses a shared driver codebase for both iPXE and their UEFI SNP driver. There's nothing we can do about the bug being present in the underlying SNP driver.

Please try using ipxe.efi instead of snponly.efi so that the updated iPXE driver (including the fix) is used to drive the hardware instead.

ech68 commented on June 23, 2024

danmcd commented on June 23, 2024

Your result indicates that the issue is fixed in iPXE, so I will close this issue now. If you want to continue using snponly.efi, you will need to contact your UEFI BIOS vendor to get a BIOS update that includes the equivalent fix in the BIOS-provided SNP driver.

Triton Data Center needs snponly.efi (and undionly.kpxe for BIOS) as well, so our testing would likely fail too. (To that end, our downstream will continue to exclude EAPOL for now.)

mcb30 commented on June 23, 2024

The 18-byte packet will be zero-padded to 60 bytes on the wire anyway (64 bytes including the Ethernet FCS), since that is the minimum length Ethernet packet.

We could possibly work around the underlying SNP driver bug by pointlessly zero-padding the packet to 60 bytes ourselves. That would be sufficient to avoid the underlying bug in the SNP driver (assuming that it is using code identical to that fixed in commit c11734eee).
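
A minimal sketch of that padding step, written in plain C rather than against iPXE's I/O buffer API. pad_frame() and ETH_MIN_LEN are hypothetical names used for illustration and are not taken from #1177.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Minimum Ethernet frame length, excluding the 4-byte FCS */
#define ETH_MIN_LEN 60

/*
 * Zero-pad a frame held in buf (capacity cap bytes) up to ETH_MIN_LEN.
 * Returns the length to transmit, or 0 if buf is too small to pad.
 * pad_frame() is a hypothetical helper, not an iPXE function.
 */
static size_t pad_frame(unsigned char *buf, size_t len, size_t cap)
{
    assert(buf != NULL);
    if (len >= ETH_MIN_LEN)
        return len;             /* already long enough, leave untouched */
    if (cap < ETH_MIN_LEN)
        return 0;               /* no room for the padding */
    memset(buf + len, 0, ETH_MIN_LEN - len);
    return ETH_MIN_LEN;
}
```

With an 18-byte frame in a 64-byte buffer, pad_frame() zeroes bytes 18 to 59 and returns 60, so the driver underneath is never handed a frame shorter than the Ethernet minimum.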

@ech68 could you please retest snponly.efi built from #1177 ?

ech68 commented on June 23, 2024

mcb30 commented on June 23, 2024

On Mon, Mar 18, 2024 at 11:40 AM Michael Brown wrote: @ech68 could you please retest snponly.efi built from #1177?

Same failure mode unfortunately.

Thanks for testing. Does ifstat report the driver as SNP or NII when you are using snponly.efi?

ech68 commented on June 23, 2024

mcb30 commented on June 23, 2024

Thanks for testing. Does ifstat report the driver as SNP or NII when you are using snponly.efi?

NII

Thanks. I've generalised the PR to cover both SNP and NII, and force-pushed PR #1177. Could you please retest with this commit?

ech68 commented on June 23, 2024

mcb30 commented on June 23, 2024

Thanks. I've generalised the PR to cover both SNP and NII, and
force-pushed PR #1177. Could you
please retest with this commit?

with that update, it works!

Fantastic, thank you! Could you let me know your name and email for the commit log testing credit?

ech68 commented on June 23, 2024

mcb30 commented on June 23, 2024

Fantastic, thank you! Could you let me know your name and email for the commit log testing credit?

Eric Hagberg, ***@***.***

I think there may be some kind of automated censorship system at work here. 🙃

stappersg commented on June 23, 2024

Fantastic, thank you! Could you let me know your name and email for the commit log testing credit?

Eric Hagberg, @.***

I think there may be some kind of automated censorship system at work here. 🙃

There is an email address in https://github.com/ipxe/ipxe/pull/1177/commits (mcb30 AT ipxe . org). Eric, please mail that address directly to bypass the automated censorship.
