Thank you for reaching out to us.
We will conduct our benchmark test on c6in.metal internally and provide an update accordingly.
from amzn-drivers.
I apologize for overlooking that you are using a c6in.metal instance. It is built on the Nitro v4 system, which requires the memory BAR of the ENA to be mapped as write combined (WC); otherwise, performance may be degraded.
Please refer to our guide, which includes instructions for enabling WC. If this does not resolve your issue, please provide your instance ID and the test timeframe (UTC) so we can inspect our logs.
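A quick way to see whether the BAR ended up write combined is to locate the ENA's prefetchable BAR address and look it up in the kernel's PAT memtype list. This is a sketch only; the sample `lspci` line below is taken from later in this thread, and on a live instance you would feed in the real `lspci -v -s <BDF>` output.

```shell
# Sketch of a WC check. The sample line is from this thread; on the
# instance, use the actual `lspci -v -s 0000:09:00.0` output instead.
line="Memory at 21fffc000000 (64-bit, prefetchable) [size=4M]"
bar=$(echo "$line" | awk '{ print $3 }')   # third field is the BAR address
echo "BAR address: $bar"
# Then, on the instance (needs root):
#   grep "$bar" /sys/kernel/debug/x86/pat_memtype_list
# "write-combining" means WC is active; "uncached-minus" means it is not.
```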
I am pretty sure that I enabled WC, and I was using igb_uio.
The instance ID is i-046d3db433ead3d2a.
I am not clear on what timeframe I should provide. Can you please give me a bit more detail? Do you want me to send burst packets with different configurations and record the timeframe?
> Do you want me to send burst packets with different configurations and write the timeframe?

Exactly. Please provide the test start time in UTC so we can inspect EC2 internal logs during your test. In addition:
- Please share which UIO driver you use and the command sequence you used to bring it up and bind to it. Metal instances support the IOMMU, but you may need to disable it if you are working with igb_uio or a generic UIO driver, which do not support it.
- Please see the guide for verifying that the memory was mapped as WC.
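A simple way to confirm the IOMMU state is to look for `intel_iommu=on` on the kernel command line. The sketch below runs against a sample command line (assumed for illustration); on the instance, the same `grep` works against the live `/proc/cmdline`.

```shell
# Hypothetical check (sample cmdline shown; use /proc/cmdline on the
# instance): igb_uio needs the IOMMU off, i.e. no intel_iommu=on.
cmdline="BOOT_IMAGE=/vmlinuz-5.10 root=/dev/xvda1 console=ttyS0"
if echo "$cmdline" | grep -q 'intel_iommu=on'; then
  status="IOMMU enabled"
else
  status="IOMMU not enabled"
fi
echo "$status"
```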
1. Basically I followed the instructions in this link. Here are the commands that I used:

```
cd ~/dpdk_24_07/
git clone git://dpdk.org/dpdk-kmods
cd ~/dpdk_24_07/dpdk-kmods/linux/igb_uio/
make
sudo modprobe uio
sudo rmmod igb_uio
sudo insmod ./igb_uio.ko wc_enabled=1
cd ~/dpdk_24_07/dpdk/usertools/
sudo python3 dpdk-devbind.py --status
sudo python3 dpdk-devbind.py --unbind 0000:09:00.0
sudo python3 dpdk-devbind.py --bind=igb_uio 0000:09:00.0
```

I know that when using vfio-pci, we need to enable the IOMMU following the instructions in this link. When testing with igb_uio, I reverted the changes in /etc/default/grub, i.e. removed iommu=1 intel_iommu=on from the file, then ran grub2-mkconfig and rebooted, so the IOMMU should be disabled.
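The grub revert described above amounts to stripping the two IOMMU options from `GRUB_CMDLINE_LINUX`. The sketch below applies the edit to a sample line (the exact contents of `/etc/default/grub` and the grub.cfg path vary by distro, so treat both as assumptions):

```shell
# Sketch of the grub edit (sample line shown): remove the IOMMU options
# from GRUB_CMDLINE_LINUX, then regenerate grub.cfg and reboot.
grub_line='GRUB_CMDLINE_LINUX="console=ttyS0 iommu=1 intel_iommu=on"'
new_line=$(echo "$grub_line" | sed 's/ *iommu=1 *intel_iommu=on//')
echo "$new_line"
# Then: sudo grub2-mkconfig -o /boot/grub2/grub.cfg && sudo reboot
```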
2. I also verified that WC was enabled, using the guide you provided. I will post a screenshot eventually.
@shaibran Here is the WC check result on the c6in.metal instance; it looks like WC is not enabled properly.

```
[root@ip-172-31-41-20 ec2-user]# lspci -v -s 0000:09:00.0
09:00.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
        Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA)
        Physical Slot: 2
        Flags: bus master, fast devsel, latency 0, NUMA node 0
        Memory at 9d202000 (32-bit, non-prefetchable) [size=8K]
        Memory at 9d200000 (32-bit, non-prefetchable) [size=8K]
        Memory at 21fffc000000 (64-bit, prefetchable) [size=4M]
        Capabilities: [40] Power Management version 3
        Capabilities: [70] Express Endpoint, MSI 00
        Capabilities: [b0] MSI-X: Enable+ Count=132 Masked-
        Capabilities: [100] #19
        Capabilities: [150] Transaction Processing Hints
        Kernel driver in use: igb_uio
        Kernel modules: ena
[root@ip-172-31-41-20 ec2-user]# cat /sys/kernel/debug/x86/pat_memtype_list | grep 21fffc000000
uncached-minus @ 0x21fffc000000-0x21fffc400000
uncached-minus @ 0x21fffc000000-0x21fffc400000
```

I sent some packets with igb_uio using 16 TX lcores, from about Wed Aug 7 14:36:33 UTC 2024 to Wed Aug 7 14:37:33 UTC 2024, 273365929 packets in total.
@shaibran What could be the reason why WC is not enabled correctly?
Tried with DPDK 23.11.1: sent 532783050 packets using 16 TX lcores, from Wed Aug 7 14:51:40 UTC 2024 to Wed Aug 7 14:52:40 UTC 2024.
The WC state is the same as with DPDK 24.07, not enabled properly.

```
[root@ip-172-31-41-20 ec2-user]# lspci -v -s 0000:09:00.0
09:00.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
        Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA)
        Physical Slot: 2
        Flags: bus master, fast devsel, latency 0, NUMA node 0
        Memory at 9d202000 (32-bit, non-prefetchable) [size=8K]
        Memory at 9d200000 (32-bit, non-prefetchable) [size=8K]
        Memory at 21fffc000000 (64-bit, prefetchable) [size=4M]
        Capabilities: [40] Power Management version 3
        Capabilities: [70] Express Endpoint, MSI 00
        Capabilities: [b0] MSI-X: Enable+ Count=132 Masked-
        Capabilities: [100] #19
        Capabilities: [150] Transaction Processing Hints
        Kernel driver in use: igb_uio
        Kernel modules: ena
[root@ip-172-31-41-20 ec2-user]# cat /sys/kernel/debug/x86/pat_memtype_list | grep 21fffc000000
uncached-minus @ 0x21fffc000000-0x21fffc400000
uncached-minus @ 0x21fffc000000-0x21fffc400000
```
Oops, sorry, I was using the wrong command. I should use sudo insmod ./igb_uio.ko wc_activate=1 instead of wc_enabled=1. I will try again with this option.
I enabled WC properly this time.

```
[root@ip-172-31-41-20 ec2-user]# lspci -v -s 0000:09:00.0
09:00.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
        Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA)
        Physical Slot: 2
        Flags: fast devsel, NUMA node 0
        Memory at 9d202000 (32-bit, non-prefetchable) [size=8K]
        Memory at 9d200000 (32-bit, non-prefetchable) [size=8K]
        Memory at 21fffc000000 (64-bit, prefetchable) [size=4M]
        Capabilities: [40] Power Management version 3
        Capabilities: [70] Express Endpoint, MSI 00
        Capabilities: [b0] MSI-X: Enable- Count=132 Masked-
        Capabilities: [100] #19
        Capabilities: [150] Transaction Processing Hints
        Kernel driver in use: igb_uio
        Kernel modules: ena
[root@ip-172-31-41-20 ec2-user]# cat /sys/kernel/debug/x86/pat_memtype_list | grep 21fffc000000
write-combining @ 0x21fffc000000-0x21fffc400000
```

The TX packets per second improved a lot: I was able to send 17.971 Mpps with 16 TX lcores. The timeframe is from Wed Aug 7 15:16:08 UTC 2024 to Wed Aug 7 15:17:08 UTC 2024.
However, I could not reach 20 Mpps (which I believe is the limit of this AWS instance) in any case, which is disappointing. I was also able to send basically the same amount with only 8 TX lcores, meaning that increasing the number of TX lcores does not increase the total throughput for some reason.
Also, there is a big difference when I enable/disable UDP port cycling. When I sent from a fixed UDP source port to a fixed UDP destination port, the performance decreased a lot. When I sent from a continuously changing UDP source port to a changing UDP destination port, the performance was much better than with fixed ports. I was wondering why. @shaibran
Another timeframe: from Wed Aug 7 15:30:17 UTC 2024 to Wed Aug 7 15:31:17 UTC 2024. Sent with 10 TX lcores, 1095894515 packets in total, which is equivalent to 18.264 Mpps. This result is almost the same as when I used the vfio-pci module.
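As a sanity check on the rate arithmetic: the timeframe above is a 60-second window, so the packet count converts to Mpps directly.

```shell
# Rate check: packets sent over the 60-second window, converted to Mpps.
total_pkts=1095894515
duration_s=60
mpps=$(awk -v p="$total_pkts" -v d="$duration_s" 'BEGIN { printf "%.3f", p / d / 1e6 }')
echo "$mpps Mpps"   # rounds to 18.265, matching the ~18.264 figure above
```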
Can I ask what the optimized configuration is to get the maximum TX PPS? @shaibran
On the c6in.metal instance, I have 128 virtual cores on 2 NUMA nodes, and each NUMA node has one physical NIC attached. I also know there are 32 TX and 32 RX rings for each NIC.
During some experiments, I found that we could reach the limit, i.e. 20 Mpps, when using 32 TX lcores with each TX lcore sending packets from a fixed UDP port to a fixed UDP port (though the destination port for each TX ring was different).
But with 32 TX lcores, when I added UDP port cycling for each packet, i.e. incrementing the destination UDP port for each packet in each TX thread, the performance decreased a lot (to below 15 Mpps).
As documented, each EC2 instance has a maximum PPS performance based on its type and size. The metrics that AWS provides can be queried using aws ec2 describe-instance-types, and they include only bandwidth.
We tested the c6in metal instance and found that it met the Key Performance Indicators (KPIs). A review of your instance's internal logs did not show any issues.
Matching applications and algorithms to the underlying architecture is beyond the PMD scope, but we can share some best practices:
- Identify the NUMA socket to which the adapter is attached and use only the lcores on that NUMA node. Note that this varies from platform to platform and cannot be assumed to be sequential. Avoid using lcore 0, as it is the primary core used by Linux.
- Ensure the application handles pushback from the device. If the application floods the device, handling the dropped packets will consume CPU resources and lead to performance degradation.
- Track the instance allowance via xstats (e.g., pps_allowance_exceeded).
All the best,
Shai
Thank you.
- I am sure I used only the lcores on the corresponding NUMA nodes. I admit that I used lcore 0, but I am not sure it would impact the performance too much. I can try avoiding lcore 0, though.
- Can you please explain in more detail? How can I know if there is pushback from the device, and how can I avoid or handle it properly?
- Can we get xstats when I am using DPDK? If I understand correctly, I cannot get stats information using the ethtool command on DPDK-bound NICs.
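For reference, DPDK exposes these counters through its xstats API (rte_eth_xstats_get() in an application, or `show port xstats 0` in testpmd) rather than through ethtool. A sketch of filtering the allowance counters, using assumed sample output (the counter values here are hypothetical; the names come from the ENA PMD discussion above):

```shell
# Filter the allowance counters from captured xstats output. Sample lines
# are assumed for illustration; real output comes from the DPDK app.
xstats='tx_good_packets: 273365929
pps_allowance_exceeded: 1024
bw_out_allowance_exceeded: 0'
filtered=$(echo "$xstats" | awk -F': ' '/allowance_exceeded/ { print $1 "=" $2 }')
echo "$filtered"
```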
You said "each EC2 instance has a maximum PPS performance based on its type and size"; where can I find this value? If I understand correctly, I can only find bandwidth, not a PPS limit.
You also mentioned, "We tested the c6in metal instance and found that it met the Key Performance Indicators (KPI)." Can you tell me which KPIs you met and what kind of test you conducted? I am particularly interested in how you assigned lcores when using DPDK, what values you used for TX descriptor numbers, etc., and how many PPS you were able to reach.
- The metrics that AWS provides externally are those available via the aws ec2 describe-instance-types command. Our KPIs are not available for sharing.
- All statistics should be retrieved from the application bound to the network interface; please refer to the provided instructions. Developing a DPDK application falls outside the scope of PMD expertise, so we recommend consulting the open-source DPDK documentation and example applications, such as testpmd.
- As you already observed during your testing, you should utilize the instance CPU resources in order to improve PPS (multiple flows).
@shaibran I have checked the performance benchmark test results here.
In the results of Test #10, Zero packet loss test on Intel server of c6in.8xlarge AWS cloud instance, when the frame size is 500 bytes, the frame rate is 5.4 Mpps and the line rate is 44.93%. There is a note claiming that throughput is limited by the AWS instance configuration on the ENA NIC driver. I wonder what exactly this limit is: is it a limit on PPS, BPS, or something else?
Also, in Test #10 of other similar tests, why can't we achieve full line rate even when sending 1518-byte frames?
Additionally, which parameters did they (I am not sure if this test was done by an Intel or AWS engineer) use for the TRex packet generator? Were the packets TCP or UDP? What were the distributions of IP addresses and ports of the sent packets?
Thanks in advance.
This report was not crafted by AWS; please reach out to the publisher of that report for the technical details.
EC2 virtualized instance types have performance limiters across all metrics. However, these limiters are enforced at the underlying infrastructure level, not within the driver as the report suggests.
As I wrote before, the performance metrics that AWS provides externally are those available via the aws ec2 describe-instance-types command. Our KPIs are not available for sharing.