librpma - remote persistent memory access library
#include <librpma.h>
cc ... -lrpma
librpma is a C library to simplify accessing persistent memory (PMem) on remote hosts over Remote Direct Memory Access (RDMA).
The librpma library provides two possible schemes of operation: Remote Memory Access and Messaging. Both of them are available over a connection established between two peers. Both of these schemes can make use of PMem as well as DRAM for the sake of building efficient and scalable Remote Persistent Memory Accessing (RPMA) applications.
The librpma library implements four basic API calls dedicated to accessing remote memory:
rpma_read() - initiates transferring data from the remote memory to the local memory,
rpma_write() - initiates transferring data from the local memory to the remote memory,
rpma_atomic_write() - works like rpma_write(), but it allows transferring 8 bytes of data (RPMA_ATOMIC_WRITE_ALIGNMENT) and storing them atomically in the remote memory (see rpma_atomic_write(3) for details and restrictions), and
rpma_flush() - initiates finalizing a transfer of data to the remote memory. Possible types of rpma_flush() operation:
RPMA_FLUSH_TYPE_PERSISTENT - flush data down to the persistent domain,
RPMA_FLUSH_TYPE_VISIBILITY - flush data deep enough to make it visible on the remote node.
All the above functions use the attribute flags to set the completion notification indicator:
RPMA_F_COMPLETION_ON_ERROR - generates the completion only on error,
RPMA_F_COMPLETION_ALWAYS - generates the completion regardless of a result of the operation.
All of these operations are considered finished when the respective completion is generated.
Direct Write to PMem is a feature of a platform and its configuration that allows an RDMA-capable network interface to write data to the platform's PMem in a persistent way. It may be impossible, e.g. because of caching mechanisms on the data's path. Operating as if Direct Write to PMem were possible when it is not may corrupt data on PMem, which is why Direct Write to PMem is not enabled by default.
On the current Intel platforms, the only thing you have to do in order to enable Direct Write to PMem is turning off Intel Direct Data I/O (DDIO). Sometimes, you can turn off DDIO either globally for the whole platform or for a specific PCIe Root Port. For details, please see the manual of your platform.
When you have a platform which allows Direct Write to PMem, you have to declare this is the case in your peer's configuration. The peer's configuration has to be transferred to all the peers which want to execute rpma_flush() with RPMA_FLUSH_TYPE_PERSISTENT against the platform's PMem and applied to the connection object which safeguards access to PMem.
rpma_peer_cfg_set_direct_write_to_pmem() - declare Direct Write to PMem support
rpma_peer_cfg_get_descriptor() - get the descriptor of the peer configuration
rpma_peer_cfg_from_descriptor() - create a peer configuration from the descriptor
rpma_conn_apply_remote_peer_cfg() - apply remote peer cfg to the connection
For details on how to use these APIs please see https://github.com/pmem/rpma/tree/main/examples/05-flush-to-persistent.
A client is the active side of the process of establishing a connection. The role of a peer during the process of establishing a connection does not determine the direction of the data flow (neither via Remote Memory Access nor via Messaging). After establishing the connection both peers have the same capabilities.
The client, in order to establish a connection, has to perform the following steps:
rpma_conn_req_new() - create a new outgoing connection request object
rpma_conn_req_connect() - initiate processing the connection request
rpma_conn_next_event() - wait for the RPMA_CONN_ESTABLISHED event
After establishing the connection both peers can perform Remote Memory Access and/or Messaging over the connection.
The client, in order to close a connection, has to perform the following steps:
rpma_conn_disconnect() - initiate disconnection
rpma_conn_next_event() - wait for the RPMA_CONN_CLOSED event
rpma_conn_delete() - delete the closed connection
A server is the passive side of the process of establishing a connection. Note that after establishing the connection both peers have the same capabilities.
The server, in order to establish a connection, has to perform the following steps:
rpma_ep_listen() - create a listening endpoint
rpma_ep_next_conn_req() - obtain an incoming connection request
rpma_conn_req_connect() - initiate connecting the connection request
rpma_conn_next_event() - wait for the RPMA_CONN_ESTABLISHED event
After establishing the connection both peers can perform Remote Memory Access and/or Messaging over the connection.
The server, in order to close a connection, has to perform the following steps:
rpma_conn_next_event() - wait for the RPMA_CONN_CLOSED event
rpma_conn_disconnect() - disconnect the connection
rpma_conn_delete() - delete the closed connection
When no more incoming connections are expected, the server can stop waiting for them:
rpma_ep_shutdown() - stop listening and delete the endpoint
Every piece of memory (either volatile or persistent) must be registered and its usage must be specified in order to be used in Remote Memory Access or Messaging. This can be done using the following memory management librpma functions:
rpma_mr_reg() which registers a memory region and creates a local memory registration object and
rpma_mr_dereg() which deregisters the memory region and deletes the local memory registration object.
A description of the registered memory region sometimes has to be transferred via network to the other side of the connection. In order to do that a network-transferable description of the provided memory region (called 'descriptor') has to be created using rpma_mr_get_descriptor(). On the other side of the connection the received descriptor should be decoded using rpma_mr_remote_from_descriptor(). It creates a remote memory region's structure that allows for Remote Memory Access.
The librpma messaging API allows transferring messages (buffers of arbitrary data) between the peers. Transferring messages requires preparing buffers (memory regions) on the remote side to receive the sent data. The received data are written to those dedicated buffers, so the sender does not need a respective remote memory region object to send a message. The memory buffers used for messaging have to be registered using rpma_mr_reg() before calling rpma_send() or rpma_recv().
The librpma library implements the following messaging API:
rpma_send() - initiates the send operation which transfers a message from the local memory to the other side of the connection,
rpma_recv() - initiates the receive operation which prepares a buffer for a message sent from the other side of the connection,
rpma_conn_req_recv() - works as rpma_recv(), but it may be used before the connection is established.
All of these operations are considered finished when the respective completion is generated.
RDMA operations generate completions that notify a user that the respective operation has been completed.
The following operations are available in librpma:
IBV_WC_RDMA_READ - RMA read operation
IBV_WC_RDMA_WRITE - RMA write operation
IBV_WC_SEND - messaging send operation
IBV_WC_RECV - messaging receive operation
IBV_WC_RECV_RDMA_WITH_IMM - messaging receive operation for RMA write operation with immediate data
All operations generate a completion on error. The operations posted with the RPMA_F_COMPLETION_ALWAYS flag also generate a completion on success. Completion codes are reused from the libibverbs library, where the IBV_WC_SUCCESS status indicates the successful completion of an operation. Completions are collected in the completion queue (CQ) (see the QUEUES, PERFORMANCE AND RESOURCE USE section for more details on queues).
The librpma library implements the following API for handling completions:
rpma_conn_get_cq() gets the connection's main CQ,
rpma_conn_get_rcq() gets the connection's receive CQ,
rpma_cq_wait() waits for an incoming completion from the specified CQ (main or receive CQ) - if it succeeds the completion can be collected using rpma_cq_get_wc(),
rpma_cq_get_wc() receives the next available completion of an already posted operation.
A peer is an abstraction representing an RDMA-capable device. All other RPMA objects have to be created in the context of a peer. A peer allows one to:
establish connections (Client Operation)
register memory regions (Memory Management)
create endpoints for listening for incoming connections (Server Operation)
At the beginning, in order to create a peer, a user has to obtain an RDMA device context by the given IPv4/IPv6 address using rpma_utils_get_ibv_context(). Then a new peer object can be created using rpma_peer_new() and deleted using rpma_peer_delete().
By default, all endpoints and connections operate in the synchronous mode where:
rpma_ep_next_conn_req(),
rpma_cq_wait() and
rpma_conn_next_event()
are blocking calls. You can make those API calls non-blocking by modifying the respective file descriptors:
rpma_ep_get_fd() - provides a file descriptor for rpma_ep_next_conn_req()
rpma_cq_get_fd() - provides a file descriptor for rpma_cq_wait()
rpma_conn_get_event_fd() - provides a file descriptor for rpma_conn_next_event()
When you have a file descriptor, you can make it non-blocking using fcntl(2) as follows:
#include <fcntl.h>
/* read the current file status flags, then add O_NONBLOCK to them */
int flags = fcntl(fd, F_GETFL);
fcntl(fd, F_SETFL, flags | O_NONBLOCK);
Such a change makes the respective API call non-blocking automatically.
The provided file descriptors can also be used for scalable I/O handling like epoll(7).
Please see the example showing how to make use of RPMA file descriptors: https://github.com/pmem/rpma/tree/main/examples/06-multiple-connections
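The descriptor handling described above can be sketched in a few lines of plain POSIX C. This is a minimal illustration, not code taken from librpma: set_nonblocking() and wait_readable() are hypothetical helper names, and poll(2) is used here for brevity where a larger application would likely use epoll(7). The fd argument is assumed to come from rpma_ep_get_fd(), rpma_cq_get_fd() or rpma_conn_get_event_fd().

```c
#include <fcntl.h>
#include <poll.h>

/*
 * Hypothetical helper (not part of librpma): switch a file descriptor
 * to non-blocking mode. Returns 0 on success, -1 on error (errno is set).
 */
static int set_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL);
	if (flags == -1)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Hypothetical helper (not part of librpma): wait up to timeout_ms
 * milliseconds for fd to become readable.
 * Returns 1 when readable, 0 on timeout, -1 on error or hang-up.
 */
static int wait_readable(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret = poll(&pfd, 1, timeout_ms);
	if (ret <= 0)
		return ret;
	return (pfd.revents & POLLIN) ? 1 : -1;
}
```

Under these assumptions, an application would call the matching librpma function (for example rpma_cq_wait() followed by rpma_cq_get_wc()) only after wait_readable() reports that the descriptor is ready.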
Remote Memory Access operations, Messaging operations and their Completions consume space in queues allocated in an RDMA-capable network interface (RNIC) hardware for each of the connections. You must be aware of the existence of these queues:
completion queue (CQ) where completions of operations are placed, either when a completion was required by a user (RPMA_F_COMPLETION_ALWAYS) or a completion with an error occurred. All Remote Memory Access operations and Messaging operations can consume CQ space.
send queue (SQ) where all Remote Memory Access operations and rpma_send() operations are placed before they are executed by RNIC.
receive queue (RQ) where rpma_recv() entries are placed before they are consumed by the rpma_send() coming from the other side of the connection.
You must assume SQ and RQ entries occupy space in their respective queues until:
a respective operation's completion is generated or
a completion of an operation, which was scheduled later, is generated.
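The occupancy rule above can be modelled with a small bookkeeping sketch. It is an illustration only: struct sq_model and its functions are hypothetical and not part of librpma, and the model assumes that completions for one queue are generated in the order in which the operations were posted.

```c
#include <stddef.h>

/* Hypothetical bookkeeping for a single send queue (not a librpma type). */
struct sq_model {
	size_t size;     /* configured SQ length */
	size_t posted;   /* number of operations posted so far */
	size_t released; /* highest sequence number known to be released */
};

/* Number of entries that must be assumed to still occupy the queue. */
static size_t sq_in_use(const struct sq_model *sq)
{
	return sq->posted - sq->released;
}

/*
 * Account for posting one operation.
 * Returns its 1-based sequence number, or 0 if the queue must be
 * assumed full and the caller should collect a completion first.
 */
static size_t sq_post(struct sq_model *sq)
{
	if (sq_in_use(sq) >= sq->size)
		return 0;
	return ++sq->posted;
}

/*
 * Account for a completion of the operation with the given sequence
 * number: it releases that entry and every entry posted before it.
 */
static void sq_complete(struct sq_model *sq, size_t seq)
{
	if (seq > sq->released && seq <= sq->posted)
		sq->released = seq;
}
```

In this model, posting a run of operations with RPMA_F_COMPLETION_ON_ERROR and only the last one with RPMA_F_COMPLETION_ALWAYS lets a single collected completion release the whole run.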
You must also be aware that an RNIC has limited resources, so it is impossible to store a very long set of queues for many possibly existing connections. If all of the queues do not fit into the RNIC's resources, it will start using the platform's memory for this purpose. In this case, the performance will be degraded because of inevitable cache misses.
Because the length of the queues has such a profound impact on the performance of an RPMA application, you can configure the length of each of the queues separately for each of the connections:
rpma_conn_cfg_set_cq_size() - set length of CQ
rpma_conn_cfg_set_sq_size() - set length of SQ
rpma_conn_cfg_set_rq_size() - set length of RQ
When the connection configuration object is ready it has to be used for either rpma_conn_req_new() or rpma_ep_next_conn_req() for the settings to take effect.
Most of the core librpma API calls are thread-safe but there are also very important exceptions mainly related to connection's configuration, establishment and tear-down. Here you can find a complete list of NOT thread-safe API calls:
rpma_conn_apply_remote_peer_cfg()
rpma_conn_cfg_get_cq_size()
rpma_conn_cfg_get_rq_size()
rpma_conn_cfg_get_sq_size()
rpma_conn_cfg_get_timeout()
rpma_conn_cfg_set_cq_size()
rpma_conn_cfg_set_rq_size()
rpma_conn_cfg_set_sq_size()
rpma_conn_cfg_set_timeout()
rpma_conn_delete()
rpma_conn_disconnect()
rpma_conn_get_private_data()
rpma_conn_next_event()
rpma_conn_req_connect()
rpma_conn_req_delete()
rpma_conn_req_get_private_data()
rpma_conn_req_new()
rpma_ep_listen()
rpma_ep_next_conn_req()
rpma_ep_shutdown()
rpma_peer_cfg_get_descriptor()
rpma_peer_cfg_get_descriptor_size()
rpma_peer_cfg_get_direct_write_to_pmem()
rpma_peer_cfg_set_direct_write_to_pmem()
rpma_utils_get_ibv_context()
Other librpma API calls are thread-safe. However, creating RPMA library resources usually involves dynamic memory allocation and destroying resources usually involves a dynamic memory release. The same resource cannot be destroyed more than once, in any thread, and a resource cannot be used after it has been destroyed. It is the user's responsibility to follow those rules; failing to do so may result in a segmentation fault or undefined behaviour.
On-Demand-Paging (ODP) is a technique that simplifies the memory registration process (for example, applications no longer need to pin down the underlying physical pages of the address space and track the validity of the mappings). On-Demand Paging is available if both the hardware and the kernel support it. The detailed description of ODP can be found here:
https://community.mellanox.com/s/article/understanding-on-demand-paging--odp-x
State of ODP support can be checked using the rpma_utils_ibv_context_is_odp_capable() function that queries the RDMA device context's capabilities and checks if it supports On-Demand Paging.
The librpma library uses ODP automatically if it is supported. ODP support is required to register a PMem memory region mapped from File System DAX (FSDAX).
If a librpma function may fail, it returns a negative error code. Checking if the returned value is non-negative is the only programmatically available way to verify if the API call succeeded. The exact meaning of all error codes is described in the manual of each function.
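The return-value convention above lends itself to a tiny checking helper. The sketch below is only an illustration: call_ok() is a hypothetical name, not a librpma function, and it relies solely on the rule that a negative value signals failure.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not part of librpma): returns 1 when ret signals
 * success (a non-negative value); otherwise reports the failed call on
 * stderr and returns 0.
 */
static int call_ok(const char *what, int ret)
{
	if (ret >= 0)
		return 1;
	fprintf(stderr, "%s failed with error code %d\n", what, ret);
	return 0;
}
```

A caller might wrap each step this way, for example `if (!call_ok("rpma_peer_new", rpma_peer_new(ibv_ctx, &peer))) goto err;`, where ibv_ctx and peer are assumed to be declared by the application.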
The librpma library implements the logging API which may give additional information in case of an error and during normal operation as well, according to the current logging threshold levels.
The function that will handle all generated log messages can be set using rpma_log_set_function(). The logging function can be either the default logging function (built into the library) or a user-defined, thread-safe function. The default logging function can write messages to syslog(3) and stderr(3). The logging threshold level can be set or retrieved using rpma_log_set_threshold() or rpma_log_get_threshold() respectively.
There is an example of the usage of the logging functions: https://github.com/pmem/rpma/tree/main/examples/log
See https://github.com/pmem/rpma/tree/main/examples for examples of using the librpma API.
librpma is built on the top of libibverbs and librdmacm APIs.
API calls marked as deprecated should be avoided, because they will be removed in a new major release.
NOTE: API calls deprecated in a 0.X release will usually be removed in the 0.(X+1) release.