pmemobjfs - The simple FUSE based on libpmemobj
How to use it
NOTE: This is just an example implementation of a file system in user space using the libpmemobj library, and it is not considered to be production quality. Please do not use this file system to store data you care about, because it may be lost.
The definition of the libpmemobj layout looks like this:
It consists of a root object and four typed OIDs. One of the typed OIDs is declared for a typedef of the uint8_t type, which binds a unique type number to this data structure. The typed OID for char is required in order to allocate a fixed-length string from the pmemobj pool. The remaining data structures are described in detail in the following chapters.
The main data structure of pmemobjfs is struct objfs_super, which plays the role of a superblock in traditional file systems:
- The root_inode field holds the inode object of the root directory, which is created when the file system layout is created.
- The block_size field holds the size of the data blocks in which file contents and directory entries are stored.
- The opened field is a tree map of opened inodes. This map is required for handling the unlink operation on opened files.
The next important data structure used by the pmemobjfs is the
struct objfs_inode which represents a file system object.
It contains basic attributes of an object:
- file type and permissions flags,
- major and minor device numbers,
- time of last status change,
- time of last modification,
- time of last access (currently this field is not updated),
- user ID of owner,
- group ID of owner,
- number of references.
The inode may represent a file, a directory or a symbolic link. It contains a separate structure for each inode type, which holds the essential information about that type of inode:
The directory-specific data contains a doubly-linked list of directory entries.
The file-specific data contains a tree map of blocks. The map key consists of the block number and the value is a PMEMoid of the data block.
The symbolic-link-specific data contains the length of the link and the link data.
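The block map described above is keyed by block number, so a byte offset has to be split into a block number and a position inside that block. The following is a minimal sketch of that arithmetic in plain C. It assumes that the block number is simply the offset divided by the block size; the struct and function names are hypothetical and do not come from the pmemobjfs sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: where a byte offset lands inside a file. */
struct block_pos {
	uint64_t block_no;  /* key to look up in the file's block map */
	size_t in_block;    /* byte offset inside that block */
};

/* Split a file offset into a block number and an offset within the block. */
static struct block_pos
offset_to_block(uint64_t offset, size_t block_size)
{
	struct block_pos pos;

	assert(block_size > 0);
	pos.block_no = offset / block_size;
	pos.in_block = (size_t)(offset % block_size);
	return pos;
}

/* Number of blocks needed to hold `size` bytes, rounded up. */
static uint64_t
blocks_needed(uint64_t size, size_t block_size)
{
	assert(block_size > 0);
	return size / block_size + (size % block_size != 0);
}
```

With a block size of 448 bytes, for example, offset 1000 falls into block 2 at byte 104, because 2 * 448 + 104 = 1000.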
struct objfs_dir_entry represents a directory entry. It contains persistent pointers to the neighbouring entries, a pointer to the corresponding inode and the entry name.
The maximum length of the name of a directory entry is determined by the block size specified when creating the file system. It is equal to
block_size - sizeof (struct objfs_dir_entry).
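To make the formula above concrete, here is a small sketch that evaluates it for a made-up entry layout. The struct below is an assumption for illustration only: it uses three 16-byte stand-ins for persistent pointers followed by the name, and the real struct objfs_dir_entry may be laid out differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a persistent pointer: 16 bytes (assumed for this sketch). */
struct oid_sketch {
	uint64_t pool_uuid_lo;
	uint64_t off;
};

/* Hypothetical entry layout; the real struct objfs_dir_entry may differ. */
struct dir_entry_sketch {
	struct oid_sketch prev;   /* previous entry in the directory list */
	struct oid_sketch next;   /* next entry in the directory list */
	struct oid_sketch inode;  /* inode this entry names */
	char name[];              /* name bytes fill the rest of the block */
};

/* Bytes left for the name when one entry occupies one block. */
static size_t
max_name_bytes(size_t block_size)
{
	assert(block_size > sizeof(struct dir_entry_sketch));
	return block_size - sizeof(struct dir_entry_sketch);
}
```

Under these assumptions the fixed part of the entry takes 48 bytes, so a 448-byte block leaves 400 bytes for the name. The sketch does not decide whether a terminating NUL byte counts towards that limit.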
All operations that modify the file system structure are performed within a transaction, which protects the pmemobjfs layout from corruption if a power failure occurs during an operation.
In this chapter I would like to describe in detail some of the most important operations performed on the file system.
NOTE: In the current implementation it is recommended to mount pmemobjfs with the -s option. In this case FUSE runs in single-threaded mode and there is no need for synchronization mechanisms.
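The all-or-nothing behaviour of a transaction can be illustrated with a tiny undo log. The sketch below is not libpmemobj and does not persist anything; it only shows the idea, under the assumption that each range is snapshotted before it is modified and restored in reverse order on abort. All names are hypothetical, and like the single-threaded mount it uses no locking.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define UNDO_MAX_RECORDS 16
#define UNDO_MAX_BYTES 64

/* One saved range: where it lives and what it contained. */
struct undo_record {
	void *addr;
	size_t len;
	unsigned char saved[UNDO_MAX_BYTES];
};

struct sketch_tx {
	struct undo_record log[UNDO_MAX_RECORDS];
	size_t count;
};

static void
sketch_tx_begin(struct sketch_tx *tx)
{
	tx->count = 0;
}

/* Save the current contents of a range before the caller modifies it. */
static void
sketch_tx_snapshot(struct sketch_tx *tx, void *addr, size_t len)
{
	struct undo_record *rec;

	assert(tx->count < UNDO_MAX_RECORDS);
	assert(len <= UNDO_MAX_BYTES);
	rec = &tx->log[tx->count++];
	rec->addr = addr;
	rec->len = len;
	memcpy(rec->saved, addr, len);
}

/* Commit: the new values stay, so the saved copies are dropped. */
static void
sketch_tx_commit(struct sketch_tx *tx)
{
	tx->count = 0;
}

/* Abort: restore every saved range, newest first. */
static void
sketch_tx_abort(struct sketch_tx *tx)
{
	while (tx->count > 0) {
		struct undo_record *rec = &tx->log[--tx->count];
		memcpy(rec->addr, rec->saved, rec->len);
	}
}
```

If two fields are snapshotted and changed and the transaction is then aborted, both fields return to their old values; after a commit a later abort has nothing left to undo.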
Creating file system layout
To create the pmemobjfs layout you can use the mkfs.pmemobjfs command:
mkfs.pmemobjfs -s <size> -b <block size> /mnt/pmem/pmemobjfs.obj
By default it creates a file system layout with the minimal size required for a pmemobj pool and with a block size equal to 512 - 64. This default is chosen to minimize the internal fragmentation of allocated blocks: in the current implementation the allocation and out-of-band headers are kept in one cache line before the allocation, so a block of 512 - 64 = 448 bytes together with that 64-byte header line occupies 512 bytes. Although the default is chosen with respect to the internal layout of the pmemobj pool, you do not need to keep this in mind when creating the file system. An arbitrary block size is valid and pmemobjfs will work properly.
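The effect of the block size on fragmentation can be sketched with a little arithmetic. The example below assumes a 64-byte header line in front of every block, as described above, and additionally assumes that the allocator hands out memory in multiples of some unit passed in by the caller. That unit is purely an assumption for illustration; the actual allocation classes of libpmemobj are not described here.

```c
#include <assert.h>
#include <stddef.h>

#define HEADER_BYTES 64  /* one cache line of headers before the block */

/* Bytes occupied by one block together with its header line. */
static size_t
block_footprint(size_t block_size)
{
	return HEADER_BYTES + block_size;
}

/*
 * Bytes lost to padding if the footprint is rounded up to a multiple
 * of `unit` (the unit is a hypothetical allocator granularity).
 */
static size_t
padding_bytes(size_t block_size, size_t unit)
{
	size_t used = block_footprint(block_size);

	assert(unit > 0);
	return (unit - used % unit) % unit;
}
```

With a hypothetical unit of 512 bytes, the default block size of 448 wastes nothing, while a block size of 512 would need 576 bytes and leave 448 bytes of padding.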
The file system layout is created within a transaction. The following listing shows the most important parts of the routine for creating the pmemobjfs layout:
At the beginning the pmemobj pool is created with the specified layout name, size and mode. Next the root object is allocated when the macro that returns the root object is called for the first time. According to the documentation we can be sure the root object is zeroed. Next the root object is initialized within a transaction. The tree map is created for opened inodes, the root inode is created and the block size is stored. Because all operations are performed within the transaction, we can be sure that the root object is either initialized completely or not at all. At the very end the pmemobj pool is closed and as a result we have a pmemobjfs file system.
Creating new directory
The following listing presents the most important operations performed when creating a new directory on a pmemobjfs file system:
After beginning a new transaction the new directory is allocated and initialized. After the inode of the new directory is created, a struct objfs_dir_entry is allocated with the specified name and associated with the newly created inode. The new directory entry is then added to the current directory's doubly-linked list of entries and the modification time is updated.
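The list manipulation described above can be sketched with ordinary pointers. This is a simplified in-memory model, not the pmemobjfs code: the real entries are linked with persistent pointers inside a transaction, and whether new entries go to the head or the tail of the list is an assumption made here. All names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

struct entry_sketch {
	struct entry_sketch *prev, *next;
	char name[32];
};

struct dir_sketch {
	struct entry_sketch *head, *tail;
	time_t mtime;  /* time of last modification */
};

/* Append a named entry; returns NULL if the name is too long or on OOM. */
static struct entry_sketch *
dir_add_entry(struct dir_sketch *dir, const char *name, time_t now)
{
	struct entry_sketch *e;

	assert(dir != NULL && name != NULL);
	if (strlen(name) >= sizeof(e->name))
		return NULL;
	e = calloc(1, sizeof(*e));
	if (e == NULL)
		return NULL;
	strcpy(e->name, name);
	e->prev = dir->tail;
	if (dir->tail != NULL)
		dir->tail->next = e;
	else
		dir->head = e;
	dir->tail = e;
	dir->mtime = now;  /* adding an entry modifies the directory */
	return e;
}

/* Release all entries of the directory. */
static void
dir_free(struct dir_sketch *dir)
{
	struct entry_sketch *e = dir->head;

	while (e != NULL) {
		struct entry_sketch *next = e->next;
		free(e);
		e = next;
	}
	dir->head = dir->tail = NULL;
}
```

After adding two entries, the first one is the head of the list, the second one is the tail, and the directory carries the timestamp of the second insertion.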
The pmemobjfs_new_dir function is presented in the following listing:
First, the new inode is allocated with the specified permissions and ownership, and the directory-specific data of the inode is initialized. Next, the current and parent directory entries are allocated and added to the newly created directory. Everything is done within a transaction. In this case the transaction is nested, because this function is called from inside another transaction. According to the libpmemobj documentation, if the outer transaction aborts, all changes made within a nested transaction are rolled back as well, so we do not need to worry about the nested transaction having been committed before the outermost one.
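The nesting rule described above can be modelled with a depth counter. The sketch below is a toy model and not the libpmemobj implementation: it tracks a single int, takes one snapshot when the outermost level begins, and treats an inner commit as nothing more than leaving one nesting level. The names are hypothetical.

```c
#include <assert.h>

struct nested_tx {
	int depth;     /* current nesting level, 0 = no transaction */
	int *target;   /* the single tracked variable (sketch only) */
	int snapshot;  /* its value when the outermost level began */
};

/* Enter a transaction; only the outermost level takes the snapshot. */
static void
ntx_begin(struct nested_tx *tx, int *target)
{
	if (tx->depth == 0) {
		tx->target = target;
		tx->snapshot = *target;
	}
	tx->depth++;
}

/* Leave one level; changes stay revocable until depth reaches 0. */
static void
ntx_commit(struct nested_tx *tx)
{
	assert(tx->depth > 0);
	tx->depth--;
}

/* Abort at any level: everything since the outermost begin is undone. */
static void
ntx_abort(struct nested_tx *tx)
{
	assert(tx->depth > 0);
	*tx->target = tx->snapshot;
	tx->depth = 0;
}
```

A change made and committed at the inner level is still undone when the outer level aborts, which is why the inner function does not need any special handling.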
Allocating file blocks
The next interesting operation is allocating the file blocks. The following listing shows how it is implemented:
The most important function either allocates a new block or returns a previously allocated one. In the latter case the previously allocated block is added to the transaction's undo log in order to track all modifications of the file. The following listing shows the implementation of this function:
The pmemobjfs_file_get_block function returns the block at a given offset, or OID_NULL if the block is missing. These two functions are used in the write and read operations, respectively, when operating on the file's data.
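The difference between the read path and the write path can be sketched as two lookups over the same block table. This is a simplified in-memory model with hypothetical names: a fixed array stands in for the tree map, NULL stands in for OID_NULL, new blocks are zero-filled by choice of this sketch, and the undo-log step for existing blocks is only marked by a comment.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAX_BLOCKS 16

struct file_sketch {
	size_t block_size;
	unsigned char *blocks[SKETCH_MAX_BLOCKS];  /* index = block number */
};

/* Read path: return the block covering `offset`, or NULL if missing. */
static unsigned char *
file_get_block(const struct file_sketch *f, uint64_t offset)
{
	uint64_t no;

	assert(f->block_size > 0);
	no = offset / f->block_size;
	return no < SKETCH_MAX_BLOCKS ? f->blocks[no] : NULL;
}

/* Write path: return the existing block or allocate a new one. */
static unsigned char *
file_get_block_for_write(struct file_sketch *f, uint64_t offset)
{
	uint64_t no;

	assert(f->block_size > 0);
	no = offset / f->block_size;
	if (no >= SKETCH_MAX_BLOCKS)
		return NULL;
	if (f->blocks[no] == NULL)
		f->blocks[no] = calloc(1, f->block_size);
	/* an existing block would be added to the undo log at this point */
	return f->blocks[no];
}

/* Release all allocated blocks. */
static void
file_free_blocks(struct file_sketch *f)
{
	for (size_t i = 0; i < SKETCH_MAX_BLOCKS; i++) {
		free(f->blocks[i]);
		f->blocks[i] = NULL;
	}
}
```

A read at an offset that was never written finds no block, while the first write to that offset allocates the block and later reads and writes return the same one.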
The unlink operation utilizes two interesting mechanisms implemented in pmemobjfs. The first one is the inode's reference counter, which is increased each time the inode is referenced by another data structure. The inode is freed when the reference counter drops to zero. The functions which operate on the inode's reference counter are
The unlink operation is really simple:
All the work is performed by the
The reference counter is decreased and the directory entry is removed from
the doubly-linked list of the current directory and freed. The inode is freed if the
reference counter becomes zero after calling the
When an opened file is unlinked, the inode is not freed immediately, because the open operation increases the inode's reference counter and adds the inode to the tree map of opened inodes:
Using these two mechanisms it is really simple to implement the unlink operation for opened files or directories, and to create hard links.
Please note that hard links are not implemented currently because of problems with the FUSE kernel module that prevent the appropriate callback function from being called.
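The interplay between the reference counter and opened files can be sketched as follows. This is a toy model with hypothetical names, not the pmemobjfs code: a boolean stands in for membership in the tree map of opened inodes, a flag stands in for actually freeing the inode, and removal of the directory entry is left out.

```c
#include <assert.h>
#include <stdbool.h>

struct inode_sketch {
	unsigned ref;  /* number of references */
	bool opened;   /* stands in for membership in the opened map */
	bool freed;    /* set when the inode would be released */
};

/* Take a reference. */
static void
inode_hold(struct inode_sketch *ino)
{
	ino->ref++;
}

/* Drop a reference; the inode goes away with the last one. */
static void
inode_put(struct inode_sketch *ino)
{
	assert(ino->ref > 0);
	if (--ino->ref == 0)
		ino->freed = true;
}

/* open: the handle holds a reference of its own. */
static void
sketch_open(struct inode_sketch *ino)
{
	inode_hold(ino);
	ino->opened = true;
}

/* release: the handle gives its reference back. */
static void
sketch_release(struct inode_sketch *ino)
{
	ino->opened = false;
	inode_put(ino);
}

/* unlink: the directory entry's reference is dropped. */
static void
sketch_unlink(struct inode_sketch *ino)
{
	inode_put(ino);
}
```

Starting from one reference held by the directory entry, open raises the counter to two, unlink lowers it to one without freeing the inode, and the final release frees it.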
pmemobjfs provides a feature for creating transactions. The current implementation is limited to a single transaction at a time for the whole file system, but the feature could be extended to multiple transactions for specified directories or files. The transaction is controlled via ioctl calls. For simplicity, three simple commands have been developed to do the required work:
- pmemobjfs.tx_begin
- pmemobjfs.tx_abort
- pmemobjfs.tx_end
Each of the above commands must be given the path to the pmemobjfs mount point or to any other directory. After the transaction begins, all modifications of files, directories or links on the file system are tracked by libpmemobj transactions, including all changes of attributes and data. They are made persistent after calling the pmemobjfs.tx_end command. All changes are visible to the user immediately, but can be rolled back simply by calling the pmemobjfs.tx_abort command. The transaction can also be aborted implicitly if an exceptional situation occurs, for example an out-of-memory error when allocating a file block.
NOTE: Aborting the transaction while another process is still working on the file system may lead to undefined behavior. For example, if a new file is created within a transaction and the transaction is aborted while another process is writing to that file, the behavior is undefined.
In this section I would like to present the results of some performance tests run with the fio utility and the following configuration file:
[job1]
ioengine=sync
runtime=60
time_based=1
filesize=128M
bs=448
rw=randrw
The block size has been chosen in order to minimize internal fragmentation on the pmemobjfs file system.
The tests were run on the Fedora 22 distribution with kernel version 4.2.0 with DAX support, on a pmem block device.
The tests were run on the following file systems:
- ext4 + dax
- fusexmp_fh + ext4 + dax
- pmemobjfs (NTB)
The pmemobjfs (NTB) is a version of pmemobjfs without tracking of file blocks (PMEMOBJFS_TRACK_BLOCKS=0). fusexmp_fh is a file system which redirects all operations to the root file system. It is available in the FUSE examples.
The results are presented in the following table:
| FS | READ BW [KB/s] | WRITE BW [KB/s] |
|---|---|---|
| ext4 + dax | 232030 | 231333 |
| fusexmp_fh + ext4 + dax | 28687 | 28602 |
The results show a large overhead from FUSE itself, but they also show that pmemobjfs performs slightly better than the fusexmp_fh example file system, which is good news for us :).
The pmemobjfs example shows how the libpmemobj API works in a real application. It can be used to run performance tests using well-known file system test suites. If you have any questions or ideas for improving pmemobjfs, please feel free to join the discussion on our Google Group.