8 27 2023
The purpose of writing this blog post is to share the work I have completed during the past 12 weeks of Google Summer of Code 2023.
Project: CRIU: Add support for memfd_secret file descriptors.
Feature implementation status: Complete
Feature implementation Pull Request: https://github.com/checkpoint-restore/criu/pull/2247
Feature implementation patch acceptance to upstream status: WIP
Complete code contributions list: https://github.com/checkpoint-restore/criu/commits?author=warusadura
As stated in the CRIU project GitHub page, "CRIU (stands for Checkpoint and Restore in Userspace) is a utility to checkpoint/restore Linux tasks. Using this tool, you can freeze a running application (or part of it) and checkpoint it to a hard drive as a collection of files. You can then use the files to restore and run the application from the point it was frozen at".
CRIU can checkpoint/restore anything from a trivial hello_world process to running Linux containers. However, the current implementation of CRIU can't checkpoint/restore a process that has one or more memfd_secret file descriptors open. So, my project was to implement this feature. Before we proceed, let's understand what a memfd_secret file descriptor is.
As stated in the man page of the memfd_secret() system call, "memfd_secret() creates an anonymous RAM‐based file and returns a file descriptor that refers to it. The file provides a way to create and access memory regions with stronger protection than usual RAM‐based files and anonymous memory mappings".
Consider the following code example, dumpee:
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SECRET "Hello World"
#define SIZE 11

/* Thin wrapper: glibc may not provide a memfd_secret() function. */
static int memfd_secret(unsigned int flags)
{
    return syscall(SYS_memfd_secret, flags);
}

/* Create a secretmem area of `size` bytes; returns NULL on any failure. */
static void *secret_init(size_t size)
{
    int fd;
    void *secretmem;

    fd = memfd_secret(0);
    if (fd < 0)
        return NULL;

    if (ftruncate(fd, size) < 0) {
        close(fd);
        return NULL;
    }

    secretmem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (secretmem == MAP_FAILED) {
        close(fd);
        return NULL; /* MAP_FAILED is not NULL, so normalize it for the caller */
    }

    /* fd is intentionally left open so that it shows up in /proc/<pid>/fd */
    return secretmem;
}

static void secret_fini(void *mem, size_t size)
{
    munmap(mem, size);
}

int main(int argc, char *argv[])
{
    void *secretmem;

    fprintf(stdout, "pid: %d\n", getpid());

    secretmem = secret_init(SIZE);
    if (!secretmem) {
        perror("not supported operation");
        return 1; /* do not fall through and write to a NULL pointer */
    }

    memcpy(secretmem, SECRET, SIZE);
    fprintf(stdout, "%d bytes of secret data stored successfully at %p\n", SIZE, secretmem);

    pause();

    secret_fini(secretmem, SIZE);
    fprintf(stdout, "secret data successfully discarded\n");

    return 0;
}
Essentially, this program writes the string constant Hello World into the secretmem area backed by the memfd_secret file descriptor and then pauses.
There is a good chance that you can't run this code and instead get a runtime error like Function not implemented. This is because the secretmem feature is off by default, and the user must explicitly enable the memfd_secret() system call at boot time by setting the kernel boot parameter secretmem.enable=1.
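To tell "secretmem is disabled" apart from other failures before running dumpee, the system call can be probed directly. The following is a minimal sketch; the helper name secretmem_supported is hypothetical and not part of dumpee or CRIU. It relies on the kernel returning ENOSYS (the errno behind the Function not implemented message) when the feature is unavailable.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of dumpee or CRIU).
 * Returns 1 if memfd_secret() works, 0 if the kernel reports it as
 * unavailable (ENOSYS), and -1 on any other error.
 */
static int secretmem_supported(void)
{
#ifdef SYS_memfd_secret
	int fd = (int)syscall(SYS_memfd_secret, 0);

	if (fd >= 0) {
		close(fd); /* we only wanted to know whether the call succeeds */
		return 1;
	}
	return errno == ENOSYS ? 0 : -1;
#else
	/* The installed headers do not know the syscall number. */
	return 0;
#endif
}
```

Calling secretmem_supported() at the start of main() and printing a hint about secretmem.enable=1 when it returns 0 would give a clearer message than the bare perror() output.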
Let's execute this program and examine its file descriptors,
./dumpee
pid: 186078
11 bytes of secret data stored successfully at 0x7ff75e817000
ls /proc/186078/fd/
0 1 2 3
Process 186078 has four file descriptors attached to it. Like any Linux process, it has the usual stdin (0), stdout (1), and stderr (2). Let's examine what file descriptor 3 is supposed to be,
stat /proc/186078/fd/3
File: /proc/186078/fd/3 -> /secretmem (deleted)
Size: 64 Blocks: 0 IO Block: 1024 symbolic link
Device: 0,22 Inode: 1000267 Links: 1
Access: (0700/lrwx------) Uid: ( 1000/ dhanuka) Gid: ( 1000/ dhanuka)
Access: 2023-08-26 17:26:51.819442448 +0530
Modify: 2023-08-26 17:26:49.807432713 +0530
Change: 2023-08-26 17:26:49.807432713 +0530
Birth: -
As shown, file descriptor 3 can be identified as a memfd_secret file descriptor (pointing to secretmem). Now that we have a process, 186078, that holds a memfd_secret file descriptor, let's try to checkpoint it with the latest CRIU release,
git checkout criu-dev
./criu/criu --version
Version: 3.18
sudo ./criu/criu dump -D dumpdir/ --shell-job -t 186078
Error (criu/proc_parse.c:467): Unknown shit 100600 (/secretmem (deleted))
Error (criu/proc_parse.c:694): Can't open 186078's mapfile link 7ff75e817000: No such device or address
Error (criu/cr-dump.c:1558): Collect mappings (pid: 186078) failed with -1
Error (criu/cr-dump.c:2093): Dumping FAILED.
As shown, the checkpoint operation fails. This is because, during a checkpoint operation, CRIU (among other things) attempts to dump/save all the file descriptors attached to the process. In this case, one of them is a memfd_secret file descriptor. Since CRIU doesn't (yet) understand how to parse a memfd_secret file descriptor, the checkpoint operation fails with the subtle Unknown shit error.
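The error message above includes the link target /secretmem (deleted), which is the same string stat printed for file descriptor 3. As a rough illustration of what "understanding" such a descriptor could start with, here is a minimal sketch that recognizes a secretmem link target by name. The helper is_secretmem_link is hypothetical; this is not how CRIU or the pull request implements detection, only an assumption-laden example of the idea.

```c
#include <string.h>

/*
 * Hypothetical helper, not CRIU code: returns 1 when a /proc link
 * target such as "/secretmem (deleted)" names a secretmem file,
 * and 0 otherwise.
 */
static int is_secretmem_link(const char *target)
{
	static const char name[] = "/secretmem";
	size_t len = sizeof(name) - 1;

	if (strncmp(target, name, len) != 0)
		return 0;

	/* Accept an exact match or the " (deleted)" suffix seen in the output above. */
	return target[len] == '\0' || strcmp(target + len, " (deleted)") == 0;
}
```

A caller would typically obtain the target with readlink() on /proc/<pid>/fd/<n> and pass the NUL-terminated result to this function.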
Now for the exciting part: let's try to checkpoint process 186078 using CRIU with my feature implementation in place,
git checkout memfd-secret
sudo ./criu/criu dump -D dumpdir/ -v4 --shell-job -t 186078
...........
(00.225313) Writing image inventory (version 1)
(00.225403) Running post-dump scripts
(00.225408) Unfreezing tasks into 2
(00.225411) Unseizing 186078 into 2
(00.225565) Writing stats
(00.225619) Dumping finished successfully
As shown, the checkpoint operation succeeds. After a successful checkpoint operation, it's recommended to check the dumpdir directory to verify how the checkpoint operation went,
ls dumpdir
core-186078.img inventory.img pages-1.img timens-0.img
fdinfo-2.img memfd-secret.img pages-2.img tty-info.img
files.img mm-186078.img pstree.img
fs-186078.img pagemap-186078.img seccomp.img
ids-186078.img pagemap-secretmem-1000258.img stats-dump
dumpdir/pages-1.img is a file of interest. Let's examine it,
hexdump -C dumpdir/pages-1.img
00000000 48 65 6c 6c 6f 20 57 6f 72 6c 64 00 00 00 00 00 |Hello World.....|
00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00001000
Evidently, we have our secretmem content, Hello World, in dumpdir/pages-1.img. So, the checkpoint operation is a success. Now let's try to restore this checkpointed process,
sudo ./criu/criu restore -D dumpdir/ -v4 --shell-job
...........
(00.041186) Running pre-resume scripts
(00.041202) Restore finished successfully. Tasks resumed.
(00.041209) Writing stats
(00.041342) Running post-resume scripts
As shown, the restore operation succeeds as well. Let's look for this restored process and examine its file descriptors to verify the accuracy of the restore operation,
pgrep dumpee
186078
ls /proc/186078/fd
0 1 2 3
Evidently, the restored process, 186078, has exactly the same number of file descriptors attached to it. Let's examine file descriptor 3 to see whether it's a memfd_secret file descriptor or not.
stat /proc/186078/fd/3
File: /proc/186078/fd/3 -> /secretmem (deleted)
Size: 64 Blocks: 0 IO Block: 1024 symbolic link
Device: 0,22 Inode: 1037589 Links: 1
Access: (0700/lrwx------) Uid: ( 1000/ dhanuka) Gid: ( 1000/ dhanuka)
Access: 2023-08-26 18:07:25.787215250 +0530
Modify: 2023-08-26 18:07:24.019204721 +0530
Change: 2023-08-26 18:07:24.019204721 +0530
Birth: -
Bingo! File descriptor 3 is indeed a memfd_secret file descriptor! However, the verification of the restored process is not over yet. We have one last thing to check: whether the original secretmem content, Hello World, is in the restored process's secretmem area pointed to by file descriptor 3. To do that, let's perform a second checkpoint (to a new directory) of the restored process 186078 and examine the content of the pages-1.img file,
pgrep dumpee
186078
sudo criu dump -D dumpdir2/ -v4 --shell-job -t 186078
hexdump -C dumpdir2/pages-1.img
00000000 48 65 6c 6c 6f 20 57 6f 72 6c 64 00 00 00 00 00 |Hello World.....|
00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00001000
Evidently, the restored process not only has a memfd_secret file descriptor attached to it, but also holds the original secretmem content: Hello World. So, the feature implementation is complete and it works.
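The manual hexdump comparison used in both checkpoints could also be automated. Below is a minimal sketch, assuming the page has already been read into a buffer (for example from pages-1.img, whose hexdump ends at offset 0x1000, i.e. 4096 bytes). The helper page_holds_secret is hypothetical and only mirrors what the hexdump output shows: the secret at the start of the page, followed by zero bytes.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical checker (not part of CRIU): returns 1 when `page`
 * starts with the `len` bytes of `secret` and every remaining byte
 * up to `page_size` is zero; returns 0 otherwise.
 */
static int page_holds_secret(const unsigned char *page, size_t page_size,
			     const void *secret, size_t len)
{
	size_t i;

	if (len > page_size || memcmp(page, secret, len) != 0)
		return 0;

	/* The rest of the page must be zero padding. */
	for (i = len; i < page_size; i++)
		if (page[i] != 0)
			return 0;

	return 1;
}
```

Running such a check against the page images from dumpdir/ and dumpdir2/ would confirm the same thing the two hexdump listings show by eye.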
I had a phenomenal experience working on this project under the mentorship of Alexander Mikhalitsyn and Mike Rapoport. Initially, I knew nothing about CRIU or secretmem. I had to learn everything on the fly. It wasn't easy, but I had help from my mentors, and CRIU is an extremely well-engineered project. The feature implementation was an iterative process: all I had to do was follow and fix one error at a time. Along the way, I learned how to read complex code, from weird-looking macros to complex function implementations, and how to debug and fix tricky errors. Overall, this was an enormous opportunity for me, and I'm truly grateful to my mentors for their support and guidance.
Stay tuned for my next blog post.