Comments (13)

davidarinzon commented on September 12, 2024

Thank you for raising this issue @wrfeewrtgaqwfrwaq

While we look into this further, can you please share more information?

  1. Have you observed this in previous versions of the ENA driver? You are comparing the in-tree base driver, which ships with the kernel, against the out-of-tree GitHub driver, and the two differ.
  2. Which zero-copy patch have you applied?
  3. Can you please share more info on which round-trip tests you've done? There are different UDP-based tools.

from amzn-drivers.

wrfeewrtgaqwfrwaq commented on September 12, 2024
  1. I didn't bisect, so I can only compare the old driver against the new one. I'll try to do some more testing today.
  2. The latest one in thread #221 (https://github.com/amzn/amzn-drivers/files/14994782/0001-linux-ena-Fix-some-bugs-in-AF-XDP-support.patch). But as I said, the patch doesn't influence the results: on the new driver, UDP is slower both with and without the patch.
  3. It's an in-house test bench that just sends a bunch of packets with a tcs timestamp; the "server" just mirrors them back and the client calculates the average round-trip time. If you can recommend a more popular test bench, I'm more than happy to run it.
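
A test bench of the kind described in point 3 could look roughly like the sketch below. It is not the in-house tool: the function names, the 90-byte payload, the packet header layout and the use of loopback are all assumptions made for illustration. The client stamps each datagram with a sequence number and a send time, the server mirrors it back unchanged, and the client derives the round-trip time from the echoed timestamp.

```python
import socket
import statistics
import struct
import threading
import time

# Hypothetical packet header: 4-byte sequence number + 8-byte send timestamp.
HEADER = struct.Struct("!Id")


def echo_server(sock, count):
    """Mirror `count` datagrams back to their sender unchanged."""
    for _ in range(count):
        data, addr = sock.recvfrom(2048)
        sock.sendto(data, addr)


def measure_rtts(server_addr, count, size=90):
    """Ping-pong `count` timestamped datagrams; return round-trip times in seconds."""
    rtts = []
    padding = bytes(max(0, size - HEADER.size))
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as client:
        client.settimeout(2.0)  # fail loudly instead of hanging on a lost packet
        for seq in range(count):
            client.sendto(HEADER.pack(seq, time.perf_counter()) + padding, server_addr)
            data, _ = client.recvfrom(2048)
            received_at = time.perf_counter()
            echoed_seq, sent_at = HEADER.unpack_from(data)
            if echoed_seq != seq:
                raise RuntimeError(f"expected packet {seq}, got {echoed_seq}")
            rtts.append(received_at - sent_at)
    return rtts


def run_loopback_demo(count=1000):
    """Run the mirror and the client in one process over 127.0.0.1."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as server:
        server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        worker = threading.Thread(target=echo_server, args=(server, count), daemon=True)
        worker.start()
        rtts = measure_rtts(server.getsockname(), count)
        worker.join(timeout=2.0)
    return rtts


if __name__ == "__main__":
    samples = run_loopback_demo()
    print(f"average RTT: {statistics.mean(samples) * 1e6:.1f} us over {len(samples)} packets")
```

Over loopback this only exercises the local stack. To involve the ENA driver, the mirror would have to run on a second instance, with `measure_rtts` pointed at that instance's address.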

davidarinzon commented on September 12, 2024

Thank you for sharing this @wrfeewrtgaqwfrwaq
I asked about the patch because it implies AF_XDP use and testing, while the issue (as I understand it) is more generic and unrelated to XDP or AF_XDP. I'm trying to reduce the number of variables here :)

wrfeewrtgaqwfrwaq commented on September 12, 2024

The plan was to try running onload on the machines, but after seeing much, much worse latency with onload than without it, I dug deeper, only to find that the latency spike is actually a driver issue.

davidarinzon commented on September 12, 2024

Thank you for the information @wrfeewrtgaqwfrwaq

Many factors can affect latency, and getting stable latency results requires some configuration. Additional information on this topic can be found at https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ena-improve-network-latency-linux.html.

It is interesting that you point out that TCP is not affected but UDP is. The distinction between TCP and UDP is made not in the ENA driver but in the upper layers of the stack.

I've tried running uperf on the same instance configuration and haven't observed any noticeable differences in TCP or UDP.

One difference between the in-tree driver in 6.8.9 and the GitHub driver is that adaptive interrupt moderation is not enabled by default in the in-tree driver (it is enabled from kernel 6.9). Having said that, this setting alone shouldn't cause an increase to the numbers that you're observing.

In the experiments that you've conducted, can you please share some percentiles?
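
Percentiles matter here because an average can hide a latency tail. The sketch below uses the nearest-rank definition, which is one of several common conventions. The sample values are made up for illustration and are not measurements from this thread.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values <= it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(samples):
    """Return the mean alongside a few percentiles of the same data."""
    return {
        "mean": statistics.mean(samples),
        "p50": percentile(samples, 50),
        "p99": percentile(samples, 99),
        "max": percentile(samples, 100),
    }


if __name__ == "__main__":
    # Hypothetical round-trip times in microseconds: 98 fast packets, 2 slow ones.
    rtts_us = [25.0] * 98 + [500.0] * 2
    for name, value in summarize(rtts_us).items():
        print(f"{name}: {value:.1f} us")
```

With this made-up data the median stays at 25 us while the 99th percentile is 500 us, a gap the mean of 34.5 us does not reveal.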

wrfeewrtgaqwfrwaq commented on September 12, 2024

I reran the experiments using uperf to unify the setup and to avoid leaking any internal information. The following configuration was used for testing:

<?xml version="1.0"?>
<profile name="netperf">
  <group nthreads="1">
        <transaction iterations="1">
            <flowop type="connect" options="remotehost=$h protocol=tcp
            wndsz=50k tcp_nodelay"/>
        </transaction>
        <transaction iterations="100000">
            <flowop type="write" options="size=90"/>
            <flowop type="read" options="size=90"/>
        </transaction>
        <transaction iterations="1">
            <flowop type="disconnect" />
        </transaction>
  </group>
</profile>

The results suggest that the performance is exactly the same, so I'll need to investigate further whether our test bench generates different traffic or is broken.

TCP         UDP         Configuration
35679 op/s  39177 op/s  in-tree driver
35679 op/s  39960 op/s  2.12.0g driver
37000 op/s  39177 op/s  num_io_queues=1
37698 op/s  43435 op/s  num_io_queues=1, sudo sysctl -w net.core.busy_read=50
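
To put the table in relative terms, the sketch below computes each configuration's change against the in-tree driver. The figures are copied from the table; the dictionary layout and function names are only illustrative.

```python
# Throughput figures copied from the table above, in operations per second.
RESULTS = {
    "in-tree driver": {"TCP": 35679, "UDP": 39177},
    "2.12.0g driver": {"TCP": 35679, "UDP": 39960},
    "num_io_queues=1": {"TCP": 37000, "UDP": 39177},
    "num_io_queues=1, busy_read=50": {"TCP": 37698, "UDP": 43435},
}


def percent_change(baseline, value):
    """Relative change of `value` against `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


def compare_to_baseline(results, baseline_name="in-tree driver"):
    """Percent change per protocol for every configuration except the baseline."""
    baseline = results[baseline_name]
    return {
        name: {proto: percent_change(baseline[proto], ops) for proto, ops in row.items()}
        for name, row in results.items()
        if name != baseline_name
    }


if __name__ == "__main__":
    for name, row in compare_to_baseline(RESULTS).items():
        print(f"{name:32s} TCP {row['TCP']:+5.1f}%  UDP {row['UDP']:+5.1f}%")
```

By this measure the 2.12.0g driver matches the in-tree driver on TCP and is about 2% higher on UDP, while enabling busy_read accounts for the largest change, roughly 5.7% on TCP and 10.9% on UDP.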

wrfeewrtgaqwfrwaq commented on September 12, 2024

You're saying that the difference between TCP and UDP is made not in the driver but in the TCP/IP stack. What happens, then, when I enable ENA Express acceleration for UDP? Where exactly does the translation happen?

davidarinzon commented on September 12, 2024

Hi @wrfeewrtgaqwfrwaq,
I was referring to the changes in how the network stack handles TCP and UDP traffic, but I am not sure that's relevant here and wouldn't be the reason for such a delta.

Can you share where and/or how ENA Express is relevant here?

In addition, you're welcome to contact me by email at [email protected].

davidarinzon commented on September 12, 2024

Hi @wrfeewrtgaqwfrwaq

Do you have any further queries?

wrfeewrtgaqwfrwaq commented on September 12, 2024

Hi @davidarinzon,
Sorry, I've been on holiday. I'm still working on bisecting the driver and differentiating the benchmarks. I'll get back with the results soon!

wrfeewrtgaqwfrwaq commented on September 12, 2024

I tried compiling all driver versions to bisect where the performance changes, but I'm stuck around 2.8.0 (going backwards from the newest release) with the following error:

/home/mpietrzak/amzn-drivers/kernel/linux/ena/ena_ethtool.c:1220:20: error: initialization of ‘void (*)(struct net_device *, struct ethtool_ringparam *, struct kernel_ethtool_ringparam *, struct netlink_ext_ack *)’ from incompatible pointer type ‘void (*)(struct net_device *, struct ethtool_ringparam *)’ [-Werror=incompatible-pointer-types]
  .get_ringparam  = ena_get_ringparam,
                    ^~~~~~~~~~~~~~~~~
compilation terminated due to -Wfatal-errors.
cc1: some warnings being treated as errors

Do you have any idea how to fix it?

In case it matters, I'm not compiling against the native system headers but against different ones, by patching BUILD_KERNEL in the Makefile.

davidarinzon commented on September 12, 2024

Hi @wrfeewrtgaqwfrwaq
Given how our build system works, I am not sure your method is going to work properly.
But, if you wish, you can contact me by email at [email protected] and we can look at the technical details and how to overcome the obstacles.
My assumption is that the issue is that the signature of this function changed in RHEL >= 8.7 and RHEL >= 9.1.
We expect the correct distribution versions when compiling.
(https://github.com/amzn/amzn-drivers/blame/master/kernel/linux/ena/kcompat.h#L914)

wrfeewrtgaqwfrwaq commented on September 12, 2024

Thanks for the tip! I set up a small EC2 instance with all the required tooling and compiled the drivers there. I ran some more tests, this time on x7i machines, and all the latency discrepancies seem to have disappeared. ena_linux_2.8.5, ena_linux_2.9.0, ena_linux_2.10.0, ena_linux_2.12.1 and the in-tree module all show latency within one sigma.
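
One way to read "within one sigma" is that each driver's mean latency lies within one standard deviation of a reference run. The sketch below checks that under this assumed interpretation. The sample values and the driver labels in the demo are hypothetical, not the measurements from these tests.

```python
import statistics


def within_one_sigma(reference, candidate):
    """True if the candidate mean is within one sample stdev of the reference mean."""
    sigma = statistics.stdev(reference)
    return abs(statistics.mean(candidate) - statistics.mean(reference)) <= sigma


if __name__ == "__main__":
    # Hypothetical latency samples in microseconds.
    in_tree = [50.0, 52.0, 48.0, 51.0, 49.0]
    runs = {
        "driver A": [50.5, 51.5, 49.5, 50.0, 51.0],
        "driver B": [55.0, 56.0, 54.0, 55.0, 57.0],
    }
    for name, samples in runs.items():
        verdict = "within" if within_one_sigma(in_tree, samples) else "outside"
        print(f"{name}: {verdict} one sigma of the in-tree run")
```

This is a coarse check. With few samples, or noisy ones, a proper significance test would be a better fit.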
