= Performance Considerations =

Ubuntu is built to balance hardware capability, performance, and security. The following options can help you get the best performance.

== 4k page size vs 64k page size ==

Ubuntu currently supports a 4k page size for all architectures except ppc64el. A 64k page size benefits certain memory-bound benchmarks, but there is a penalty: it can be wasteful when dealing with small data structures that have to be page-aligned. A 64k page size can also break compatibility with old ARMv7 binaries. The 64k page size will need to be reconsidered with the introduction of the 52-bit VA.
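
You can check the page size of the running kernel with getconf:
{{{
$ getconf PAGESIZE
4096
}}}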

== Improve performance benchmarks with 4k pages ==

There are ways to get comparable performance with a 4k page size while avoiding the penalties of 64k pages:

 * IOMMU passthrough
 * Hugetlbfs

== IOMMU Passthrough ==

Setting `iommu.passthrough=1` on the kernel command line bypasses IOMMU translation for DMA; setting it to 0 keeps IOMMU translation for DMA enabled. This has to be set at deployment time (using preseeds) or by editing the appropriate grub configuration files and rebooting the system for the change to take effect.

It has been observed on Cavium ThunderX2 that, with the kernel command line parameter `iommu.passthrough=1`, Flexible I/O Tester (fio) performance with a 4k page size was comparable to that of 64k pages.

=== Pros ===
 * Increased I/O bandwidth for all devices

=== Cons ===
 * Low-level DMA operations (e.g. RDMA) can pose a security risk
 * Only available on arm64

=== Enable IOMMU passthrough ===
 * Append `iommu.passthrough=1` to GRUB_CMDLINE_LINUX in /etc/default/grub, update the grub configuration, and reboot the system:
{{{
sudo sed -i 's/^GRUB_CMDLINE_LINUX="/GRUB_CMDLINE_LINUX="iommu.passthrough=1 /' /etc/default/grub
sudo update-grub2
sudo reboot
}}}
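
After rebooting, you can confirm that the parameter is active by checking the kernel command line:
{{{
grep -o 'iommu.passthrough=1' /proc/cmdline
}}}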

== Hugetlbfs ==

This is a runtime feature that can be enabled from userspace. It is currently supported by applications such as Java and QEMU, and by benchmarks such as the Flexible I/O Tester (fio).
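
As an illustration of the application support mentioned above, both Java and QEMU accept flags that back their memory with huge pages (the jar name and QEMU machine options below are placeholders; check each application's documentation for your version):
{{{
# Java (HotSpot): back the heap with huge pages; app.jar is a placeholder
java -XX:+UseLargePages -jar app.jar

# QEMU: back guest RAM with the hugetlbfs mount (other options omitted)
qemu-system-aarch64 -machine virt -cpu max -m 4096 -mem-path /hugetlbfs
}}}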

=== Pros ===
 * Avoids the overhead of large numbers of brk/mmap calls per task, increasing performance

=== Cons ===
 * Can waste a lot of memory
 * Needs application support

=== Enable hugetlbfs ===
 * Specify the number of hugepages (512 is used as an example here; you might need to change that depending on your use case):
{{{
sudo sysctl -w vm.nr_hugepages=512
}}}
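
This sysctl setting does not persist across reboots; to make it permanent, it can be placed in a sysctl configuration file (the file name below is only an example):
{{{
echo 'vm.nr_hugepages=512' | sudo tee /etc/sysctl.d/99-hugepages.conf
}}}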

 * Mount the hugetlbfs:
{{{
sudo mkdir /hugetlbfs
sudo mount -t hugetlbfs none /hugetlbfs
}}}
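
You can verify that the pages were reserved by checking /proc/meminfo:
{{{
grep Huge /proc/meminfo
}}}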

=== Using hugetlbfs with Fio benchmark ===
With fio you can enable mmaphuge for the iomem and mem options, backing the benchmark's buffers with a file on the hugetlbfs mount created above:
{{{
sudo touch /hugetlbfs/file
sudo fio -rw=read -blocksize=128k -iodepth=128 -buffered=0 -direct=1 -ioengine=libaio -runtime=180 -filename=/dev/nvme0n1 -name=test -time_based -group_reporting -numjobs=4 -iomem=mmaphuge -mem=mmaphuge:/hugetlbfs/file -output=output.txt
}}}

== CPU affinity ==

A single-threaded process can be bound to a specific CPU. This is helpful when the process and the PCI device it uses sit in the same NUMA node. Some applications (e.g. iperf3) can set the binding through a command-line parameter; otherwise you can use `taskset` to set it manually.

=== Pros ===
 * Memory accesses of devices and tasks stay local to one NUMA node and can be served from high-speed caches

=== Cons ===
 * Only useful on systems with multiple NUMA nodes

=== Find NUMA group for devices ===
 * Find the PCI bus and device number:
{{{
$ lspci | grep <device> | cut -d ' ' -f 1
7d:00.0
}}}
 * Find the NUMA node in /sys:
{{{
$ find /sys/devices -name numa_node | grep '7d:00.0' | xargs cat
0
}}}

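Alternatively, the numa_node attribute can be read directly from the device's sysfs entry, using the bus/device number found above (note the 0000: PCI domain prefix):
{{{
$ cat /sys/bus/pci/devices/0000:7d:00.0/numa_node
0
}}}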

=== Bind process on specific NUMA group ===
 * Check the output of lscpu to find the CPU numbers in each NUMA node:
{{{
$ lscpu | grep NUMA
NUMA node(s):        4
NUMA node0 CPU(s):   0-23
NUMA node1 CPU(s):   24-47
NUMA node2 CPU(s):   48-71
NUMA node3 CPU(s):   72-95
}}}
 * Start the process on the selected CPU, e.g. with iperf3's -A (affinity) option:
{{{
iperf3 -sD -A 0
}}}
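
Another option is numactl (from the numactl package), which can restrict both the CPUs and the memory allocations of a process to a single node; for example, the equivalent of the iperf3 invocation above:
{{{
# run iperf3 with CPUs and memory restricted to NUMA node 0
numactl --cpunodebind=0 --membind=0 iperf3 -sD
}}}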
 * Or use taskset (the argument is a CPU bitmask; mask 1 binds the process to CPU 0):
{{{
taskset -p 1 <pid>
}}}
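
taskset also accepts a CPU list with -c, which is often easier to read than a bitmask; for example, to pin an existing process to all CPUs of NUMA node 0 from the lscpu output above:
{{{
taskset -cp 0-23 <pid>
}}}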

== Force CPU max frequency ==

Most CPUs can change their frequency automatically and run at slower frequencies when idle. Use `cpufreq-set -r -g performance` (from the cpufrequtils package) to keep them at the maximum frequency and avoid the latency incurred whenever the CPU changes frequency.

=== Pros ===
 * Easy to set

=== Cons ===
 * Not supported on all architectures
 * The CPU might slow down if it overheats

=== Set CPU max frequency ===
 * Assuming a system with 72 CPUs:
{{{
for x in $(seq 0 71); do
    cpufreq-set -r -g performance -c $x
done
}}}
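
If cpufreq-set is not available, the same effect can usually be achieved through the generic cpufreq sysfs interface (assuming the driver exposes a performance governor):
{{{
# set the performance governor on every CPU via sysfs
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance | sudo tee $g > /dev/null
done
}}}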
