Running FIO with pmem engines
When we, the PMDK team, want to check performance of our library, either to see if there was any regression or if our tweaks did a good job, we run benchmarks. One of them is FIO. It helps us simulate synthetic traffic of reads and writes to a pmem device. In this blog post I will introduce this tool and explain how we commonly use it.
Preparing the environment
For starters, links to needed software:
And, of course, to execute any FIO tests using pmem engines you need PMDK installed in your system.
Tests can be run on emulated pmem or real hardware (devices
presented to system by NVDIMM Firmware Interface Table (NFIT), according to ACPI
Specification v6.0+). Either way you have to prepare device(s) before test
execution. First step is to configure a namespace (emulated pmem is delivered
already as a namespace). For this step you can use command
ndctl create-namespace. There are two main options to configure namespace in our
case, either you choose mode
devdax. Namespace configured with
fsdax is a block device with support of dax operations. It can host a
dax-capable file system (e.g. ext4 or xfs) and should be configured like:
ndctl create-namespace --type=pmem --mode=fsdax --region=X [--align=4k] mkfs.ext4 /dev/pmemX mkdir /mnt/pmem mount /dev/pmemX /mnt/pmem -o dax
In all cases regions should be created and maintained using administrative tool specific for your platform. Optional parameter align can be used to change alignment, while default for both mentioned modes is 2M.
Namespace with devdax mode is a character device and is a raw access analogue of filesystem-dax. Configuration is simpler, because there’s no need for file system preparation:
ndctl create-namespace --type=pmem --mode=devdax --region=X [--align=4k]
To sum up, when you execute the following commands:
ls /mnt/pmem ls /dev/dax\*
You should see mounted file system or dev-dax device available to use.
Installation of FIO is pretty straightforward and is described on its GitHub page. Execution is also not complicated, but it requires preparation of the actual benchmark workload (workload is a set of options describing how FIO will generate the traffic for measurements). We’ll see how workloads look for specific jobs below, but first a few general rules:
- jobs should run on a time-based manner with ramp-up time to generate traffic
before the measurements (if
time_basedparameter is not set FIO will run until requests transfer data of given size; duration of a job would be unpredictable),
- numjobs parameter describes how many threads/processes FIO will spawn to do the job. This is the place to scale your workload - you can use more or less jobs to simulate wanted traffic and/or achieve wanted consumption level of available CPU resources,
- blocksize parameter tells how big chunks of data will be transferred; this will influence bandwidth (bigger bs usually means higher BW, because there’s less requests and each request has its own overhead) and IOPS (lower bs usually means higher IOPS, because smaller requests are completed faster and therefore in a given unit of time more requests can be issued). For details on available parameters and options go to HOWTO page.
FIO is generating I/O traffic using engines specific for the job. To specify
which engine is used in a job there’s workload’s option
All of them are described in their c file and all have corresponding
examples. Few of them are related to persistent memory:
This engine reads/writes data using libpmem library. Works on a namespace
fsdax mode. Full example workload for generating traffic of
sequential reads using this engine can be found here. There are
additional comments within the jobfile to explain specific parameters.
It also uses libpmem library, but as the name suggest it is specified to work
with device-dax devices. Our full example workload DaxSeqR.fio
shows how to properly use FIO with /dev/dax. Since we don’t work on a “regular”
file we use a little trick to achieve more realistic measurements - we set a
separate space for each thread doing reads/writes using option
offset_increment=(int)GB. This way threads’ requests do not overlap each
other and results are not cached in the processor. In case FIO reads the same
part over and over (using different threads on the same space) it ends up not
reading from the device.
This engine is using libpmemblk library. Results delivered by this engine will
not show you the best performance of your hardware, only what this specific
library is capable of. While using this engine,
size of a file are given as part of
filename option, like here:
filename=/mnt/pmem6/testjob,512,1024000 #size=1024000M #bs=512
This is a bit different approach, comparing with other engines which use parameter “bs” and “size” (see commented part above). Full example workload doing traffic of sequential reads for pmemblk can be found here.
It’s the most “basic” of mentioned engines, because its purpose is just to read
from/write to a memory mapped region. It can be used with pmem, but is not
tailor-made. It generates traffic doing memcpy to/from memory region. Difference
between this engine and
libpmem is that it doesn’t use PMDK library and hence
doesn’t take advantages of functions specific for writing to persistent memory.
Again, full example working workload is available: MmapSeqR.fio
Execution and results
Command to run FIO is:
fio [options] [jobfile] ....
Since we use workloads defined in a file (as opposed to specifying parameters in command line), we execute FIO with command like:
numactl -N 0 fio --output=my_workload.json --output-format=json my_workload.fio
numactl command guarantees that processes are pinned to selected
numa node. The same can also be achieved by assigning CPU mask for FIO using
cpus_allowed. We chose JSON format to save our results in
it, for more convenient automatic parsing. Last parameter is our input file.
In the resulting file (here called my_workload.json) we look for a list of [“jobs”] and then a section with the name of our job (“jobname”). There is all the benchmark results that our simulated traffic has delivered, including i.a. bw (bandwidth averaged per second, in kiB), iops (IOPS count) and lat_ns (total latency in nanoseconds) divided into read/write sub-sections and additional data like CPU usage (in %). For details, please see “Interpreting the output” section in FIO’s HOWTO page (you can look at “normal” output, since it has similar attributes as in “json” formatted output).