PMDK man page

librpma

NAME

librpma - remote persistent memory access library

SYNOPSIS

      #include <librpma.h>
      cc ... -lrpma

DESCRIPTION

librpma is a C library to simplify accessing persistent memory (PMem) on remote hosts over Remote Direct Memory Access (RDMA).

The librpma library provides two possible schemes of operation: Remote Memory Access and Messaging. Both of them are available over a connection established between two peers. Both of these schemes can make use of PMem as well as DRAM for the sake of building efficient and scalable Remote Persistent Memory Accessing (RPMA) applications.

REMOTE MEMORY ACCESS

The librpma library implements four basic API calls dedicated to accessing remote memory:

All the above functions use the attribute flags to set the completion notification indicator:

Each of these operations is considered finished when the respective completion is generated.

DIRECT WRITE TO PMEM

Direct Write to PMem is a feature of a platform and its configuration that allows an RDMA-capable network interface to write data to the platform's PMem persistently. It may be unavailable because of, for example, caching mechanisms on the data's path. When Direct Write to PMem is unavailable, operating as if it were available may corrupt data on PMem, which is why Direct Write to PMem is not enabled by default.

On current Intel platforms, the only thing you have to do to enable Direct Write to PMem is to turn off Intel Direct Data I/O (DDIO). On some platforms, you can turn off DDIO either globally for the whole platform or for a specific PCIe Root Port. For details, please see the manual of your platform.

When you have a platform which allows Direct Write to PMem, you have to declare this is the case in your peer's configuration. The peer's configuration has to be transferred to all the peers which want to execute rpma_flush() with RPMA_FLUSH_TYPE_PERSISTENT against the platform's PMem and applied to the connection object which safeguards access to PMem.

For details on how to use these APIs please see https://github.com/pmem/rpma/tree/master/examples/05-flush-to-persistent.

CLIENT OPERATION

A client is the active side of the process of establishing a connection. The role of a peer while establishing a connection does not determine the direction of the data flow (neither via Remote Memory Access nor via Messaging). After establishing the connection both peers have the same capabilities.

The client, in order to establish a connection, has to perform the following steps:

After establishing the connection both peers can perform Remote Memory Access and/or Messaging over the connection.

The client, in order to close a connection, has to perform the following steps:

SERVER OPERATION

A server is the passive side of the process of establishing a connection. Note that after establishing the connection both peers have the same capabilities.

The server, in order to establish a connection, has to perform the following steps:

After establishing the connection both peers can perform Remote Memory Access and/or Messaging over the connection.

The server, in order to close a connection, has to perform the following steps:

When no more incoming connections are expected, the server can stop waiting for them:

MEMORY MANAGEMENT

Every piece of memory (either volatile or persistent) must be registered and its usage must be specified in order to be used in Remote Memory Access or Messaging. This can be done using the following memory management librpma functions:

A description of the registered memory region sometimes has to be transferred via network to the other side of the connection. In order to do that a network-transferable description of the provided memory region (called 'descriptor') has to be created using rpma_mr_get_descriptor(). On the other side of the connection the received descriptor should be decoded using rpma_mr_remote_from_descriptor(). It creates a remote memory region's structure that allows for Remote Memory Access.

MESSAGING

The librpma messaging API allows transferring messages (buffers of arbitrary data) between the peers. Transferring messages requires preparing buffers (memory regions) on the remote side to receive the sent data. The received data are written to those dedicated buffers and the sender does not have to have a respective remote memory region object to send a message. The memory buffers used for messaging have to be registered using rpma_mr_reg() prior to rpma_send() or rpma_recv() function call.

The librpma library implements the following messaging API:

Each of these operations is considered finished when the respective completion is generated.

COMPLETIONS

RDMA operations generate completions that notify a user that the respective operation has been completed.

The following operations are available in librpma:

All operations generate a completion on error. The operations posted with the RPMA_F_COMPLETION_ALWAYS flag also generate a completion on success. Completion codes are reused from the libibverbs library, where the IBV_WC_SUCCESS status indicates the successful completion of an operation. Completions are collected in the completion queue (CQ) (see the QUEUES, PERFORMANCE AND RESOURCE USE section for more details on queues).

The librpma library implements the following API for handling completions:

PEER

A peer is an abstraction representing an RDMA-capable device. All other RPMA objects have to be created in the context of a peer. A peer allows one to:

To create a peer, a user first has to obtain an RDMA device context for the given IPv4/IPv6 address using rpma_utils_get_ibv_context(). Then a new peer object can be created using rpma_peer_new() and deleted using rpma_peer_delete().

SYNCHRONOUS AND ASYNCHRONOUS MODES

By default, all endpoints and connections operate in the synchronous mode where:

are blocking calls. You can make those API calls non-blocking by modifying the respective file descriptors:

When you have a file descriptor, you can make it non-blocking using fcntl(2) as follows:

        int flags = fcntl(fd, F_GETFL);            /* read the current file status flags */
        fcntl(fd, F_SETFL, flags | O_NONBLOCK);    /* add O_NONBLOCK, keep the other flags */

Such a change makes the respective API call non-blocking automatically.
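The two fcntl(2) calls above can be wrapped in a helper that also checks for errors. The following is a minimal sketch and not part of librpma: set_nonblocking() and demo_read_would_block() are hypothetical names, and a pipe(2) stands in for a file descriptor obtained from the library.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Add O_NONBLOCK to the file status flags of fd.
 * Returns 0 on success, -1 on error (errno is set by fcntl(2)).
 */
int set_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL);
	if (flags == -1)
		return -1;
	if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
		return -1;
	return 0;
}

/*
 * Stand-in demonstration using a pipe instead of an RPMA file descriptor.
 * Returns 1 if reading from an empty non-blocking pipe fails immediately
 * with EAGAIN/EWOULDBLOCK, 0 if it does not, -1 on a setup error.
 */
int demo_read_would_block(void)
{
	int fds[2];
	char byte;
	int result = -1;

	if (pipe(fds) != 0)
		return -1;
	if (set_nonblocking(fds[0]) == 0) {
		ssize_t n = read(fds[0], &byte, 1);
		result = (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK));
	}
	close(fds[0]);
	close(fds[1]);
	return result;
}
```

With the flag set, a call that would otherwise wait returns immediately, so the caller has to be prepared to retry it later.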

The provided file descriptors can also be used for scalable I/O handling like epoll(7).
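As an illustration of the epoll(7) approach, the sketch below waits until any descriptor from a set becomes readable. It is Linux-specific and makes no librpma calls: wait_any_readable() is a hypothetical helper, and the descriptors passed to it are assumed to have been obtained elsewhere.

```c
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms milliseconds until any of the nfds descriptors
 * in fds becomes readable.
 * Returns the ready descriptor, -1 on timeout, -2 on error.
 */
int wait_any_readable(const int *fds, int nfds, int timeout_ms)
{
	struct epoll_event out;
	int ready;
	int epfd = epoll_create1(0);

	if (epfd == -1)
		return -2;

	/* register every descriptor for "readable" notifications */
	for (int i = 0; i < nfds; i++) {
		struct epoll_event ev = { .events = EPOLLIN, .data = { .fd = fds[i] } };
		if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &ev) == -1) {
			close(epfd);
			return -2;
		}
	}

	/* collect at most one event */
	ready = epoll_wait(epfd, &out, 1, timeout_ms);
	close(epfd);
	if (ready < 0)
		return -2;
	return ready == 0 ? -1 : out.data.fd;
}
```

For brevity the helper creates and closes the epoll instance on every call. A long-running application would normally create it once, register each new descriptor as it appears, and call epoll_wait(2) in a loop.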

Please see the example showing how to make use of RPMA file descriptors: https://github.com/pmem/rpma/tree/master/examples/06-multiple-connections

QUEUES, PERFORMANCE AND RESOURCE USE

Remote Memory Access operations, Messaging operations and their Completions consume space in queues allocated in an RDMA-capable network interface (RNIC) hardware for each of the connections. You must be aware of the existence of these queues:

You must assume SQ and RQ entries occupy space in their respective queues until:

You must also be aware that an RNIC has limited resources, so it cannot hold arbitrarily long queues for a large number of connections. If the queues do not all fit into the RNIC's resources, the RNIC starts using the platform's memory for this purpose. In this case, performance degrades because of inevitable cache misses.

Because the length of the queues has such a profound impact on the performance of an RPMA application, you can configure the length of each queue separately for each connection:

When the connection configuration object is ready it has to be used for either rpma_conn_req_new() or rpma_ep_next_conn_req() for the settings to take effect.

THREAD SAFETY

Most of the core librpma API calls are thread-safe but there are also very important exceptions mainly related to connection's configuration, establishment and tear-down. Here you can find a complete list of NOT thread-safe API calls:

Other librpma API calls are thread-safe. However, creating RPMA library resources usually involves dynamic memory allocation and destroying resources usually involves a dynamic memory release. The same resource cannot be destroyed more than once, by any thread, and a resource cannot be used after it has been destroyed. It is the user's responsibility to follow these rules. Failing to do so may result in a segmentation fault or undefined behaviour.
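One common C convention that helps to follow these rules within a single thread is a delete function that takes the address of the owning pointer and resets it to NULL. The sketch below only illustrates that convention: struct resource, resource_new() and resource_delete() are hypothetical and are not librpma types or calls.

```c
#include <stdlib.h>

/* Hypothetical resource type standing in for any library object. */
struct resource {
	int value;
};

/*
 * Allocate a resource and store it in *res_ptr.
 * Returns 0 on success, -1 on failure.
 */
int resource_new(int value, struct resource **res_ptr)
{
	struct resource *res;

	if (res_ptr == NULL)
		return -1;
	res = malloc(sizeof(*res));
	if (res == NULL)
		return -1;
	res->value = value;
	*res_ptr = res;
	return 0;
}

/*
 * Free *res_ptr and reset it to NULL, so that a repeated call through
 * the same pointer variable is a harmless no-op.
 * Returns 0 on success, -1 if res_ptr itself is NULL.
 */
int resource_delete(struct resource **res_ptr)
{
	if (res_ptr == NULL)
		return -1;
	free(*res_ptr);		/* free(NULL) is defined to do nothing */
	*res_ptr = NULL;
	return 0;
}
```

Resetting the pointer guards only the one variable passed in. Copies of the pointer held elsewhere, and threads racing to delete the same resource, still need the application's own synchronization.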

ON-DEMAND PAGING SUPPORT

On-Demand-Paging (ODP) is a technique that simplifies the memory registration process (for example, applications no longer need to pin down the underlying physical pages of the address space and track the validity of the mappings). On-Demand Paging is available if both the hardware and the kernel support it. The detailed description of ODP can be found here: https://community.mellanox.com/s/article/understanding-on-demand-paging--odp-x

State of ODP support can be checked using the rpma_utils_ibv_context_is_odp_capable() function that queries the RDMA device context's capabilities and checks if it supports On-Demand Paging.

The librpma library uses ODP automatically if it is supported. ODP support is required to register a PMem memory region mapped from File System DAX (FSDAX).

DEBUGGING AND ERROR HANDLING

A librpma function that can fail returns a negative error code on failure. Checking whether the returned value is non-negative is the only programmatic way to verify that the API call succeeded. The exact meaning of each error code is described in the manual of the respective function.
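The calling convention described above can be sketched as follows. demo_call(), demo_sequence() and DEMO_E_FAILED are hypothetical stand-ins used only for illustration. Real error codes and their meanings come from the manual of each librpma function.

```c
#include <stdio.h>

/* Hypothetical error code standing in for a library-defined negative value. */
#define DEMO_E_FAILED (-1)

/* Hypothetical API call: fails with a negative code when should_fail is non-zero. */
int demo_call(int should_fail)
{
	return should_fail ? DEMO_E_FAILED : 0;
}

/*
 * Run two calls in sequence and stop at the first failure.
 * Returns 0 on success or the negative code of the failing call.
 */
int demo_sequence(int fail_first, int fail_second)
{
	int ret = demo_call(fail_first);
	if (ret < 0) {
		fprintf(stderr, "first call failed: %d\n", ret);
		return ret;
	}

	ret = demo_call(fail_second);
	if (ret < 0) {
		fprintf(stderr, "second call failed: %d\n", ret);
		return ret;
	}

	return 0;
}
```

Passing the negative code up unchanged lets the outermost caller decide how to report the failure or recover from it.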

The librpma library implements the logging API which may give additional information in case of an error and during normal operation as well, according to the current logging threshold levels.

The function that handles all generated log messages can be set using rpma_log_set_function(). The logging function can be either the default logging function (built into the library) or a user-defined, thread-safe function. The default logging function can write messages to syslog(3) and stderr(3). The logging threshold level can be set using rpma_log_set_threshold() and retrieved using rpma_log_get_threshold().
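To show what "thread-safe" can mean for a user-defined logging function, here is a minimal sketch that serializes writers with a mutex and filters messages by a threshold. All names (demo_log(), enum demo_level, demo_threshold) and the signature itself are assumptions made for this illustration. The signature that librpma actually requires is documented in the manual of rpma_log_set_function() and has to be used instead.

```c
#include <pthread.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical severity levels; a smaller value means a more severe message. */
enum demo_level { DEMO_ERROR = 0, DEMO_WARNING = 1, DEMO_INFO = 2 };

/* One lock shared by all threads that log. */
static pthread_mutex_t demo_log_lock = PTHREAD_MUTEX_INITIALIZER;

/* Messages less severe than this threshold are dropped. */
static enum demo_level demo_threshold = DEMO_WARNING;

/*
 * Hypothetical user-defined logging function writing to stderr.
 * Returns 1 if the message was written, 0 if it was filtered out.
 */
int demo_log(enum demo_level level, const char *file, int line,
		const char *func, const char *fmt, ...)
{
	va_list args;
	int written = 0;

	/* hold the lock so messages from different threads do not interleave */
	pthread_mutex_lock(&demo_log_lock);
	if (level <= demo_threshold) {
		fprintf(stderr, "[%d] %s:%d %s(): ", (int)level, file, line, func);
		va_start(args, fmt);
		vfprintf(stderr, fmt, args);
		va_end(args);
		fputc('\n', stderr);
		written = 1;
	}
	pthread_mutex_unlock(&demo_log_lock);
	return written;
}
```

Holding the lock for the whole message keeps each message contiguous in the output, at the cost of serializing the threads that log at the same time.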

There is an example of the usage of the logging functions: https://github.com/pmem/rpma/tree/master/examples/log

EXAMPLES

See https://github.com/pmem/rpma/tree/master/examples for examples of using the librpma API.

ACKNOWLEDGEMENTS

librpma is built on top of the libibverbs and librdmacm APIs.

SEE ALSO

https://pmem.io/rpma/