
GSoC 23: Contributing to Checkpoint and Restore in Userspace (CRIU) - Final report

8 27 2023

This post shares the work I completed during the 12 weeks of Google Summer of Code 2023.

Summary

Detailed Report

As stated in the CRIU project GitHub page, "CRIU (stands for Checkpoint and Restore in Userspace) is a utility to checkpoint/restore Linux tasks. Using this tool, you can freeze a running application (or part of it) and checkpoint it to a hard drive as a collection of files. You can then use the files to restore and run the application from the point it was frozen at".

CRIU can checkpoint/restore anything from a trivial hello_world process to running Linux containers. However, the current implementation of CRIU can't checkpoint/restore a process that has one or more memfd_secret file descriptors open. My project was to implement this feature. Before we proceed, let's look at what a memfd_secret file descriptor is.

As stated in the man page of memfd_secret() system call, "memfd_secret() creates an anonymous RAM‐based file and returns a file descriptor that refers to it. The file provides a way to create and access memory regions with stronger protection than usual RAM‐based files and anonymous memory mappings".

Consider the following example program, dumpee,

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SECRET "Hello World"
#define SIZE 11

static int memfd_secret(unsigned int flags)
{
    return syscall(SYS_memfd_secret, flags);
}

static void *secret_init(size_t size)
{
    int fd;
    void *secretmem = NULL;

    fd = memfd_secret(0);
    if (fd < 0)
        return secretmem;

    if (ftruncate(fd, size) < 0) {
        close(fd);
        return secretmem;
    }

    secretmem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (secretmem == MAP_FAILED) {
        close(fd);
        /* Report every failure as NULL so the caller needs a single check. */
        return NULL;
    }

    return secretmem;
}

static void secret_fini(void *mem, size_t size)
{
    munmap(mem, size);
}

int main(int argc, char *argv[])
{
    void *secretmem;

    fprintf(stdout, "pid: %d\n", getpid());

    secretmem = secret_init(SIZE);
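    /* Stop on failure instead of copying into an invalid mapping. */
    if (!secretmem || secretmem == MAP_FAILED) {
        perror("not supported operation");
        return 1;
    }

    memcpy(secretmem, SECRET, SIZE);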

    fprintf(stdout, "%d bytes of secret data stored successfully at %p\n", SIZE, secretmem);

    pause();

    secret_fini(secretmem, SIZE);
    fprintf(stdout, "secret data successfully discarded\n");

    return 0;
}

Essentially, this program writes the string constant Hello World into the secretmem area backed by a memfd_secret file descriptor and then pauses.

There is a good chance that this code fails at runtime with an error like Function not implemented. This is because the secretmem feature is off by default; to enable the memfd_secret() system call, the user must explicitly set the kernel boot parameter secretmem.enable=1 at boot time.
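A program can tell this situation apart from other failures by checking errno after the call. The sketch below is a minimal illustration, assuming that ENOSYS (the errno behind Function not implemented) is what the kernel reports when secretmem is disabled or unsupported. The names secretmem_status, secretmem_classify, and secretmem_probe are hypothetical helpers; they are not part of dumpee or CRIU.

```c
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical result type for this sketch. */
enum secretmem_status {
    SECRETMEM_OK,       /* memfd_secret() returned a descriptor */
    SECRETMEM_DISABLED, /* ENOSYS: assumed to mean disabled or unsupported */
    SECRETMEM_ERROR     /* any other failure */
};

/* Map the return value and errno of memfd_secret() to a status. */
static enum secretmem_status secretmem_classify(int fd, int err)
{
    if (fd >= 0)
        return SECRETMEM_OK;
    if (err == ENOSYS)
        return SECRETMEM_DISABLED;
    return SECRETMEM_ERROR;
}

/* Try the system call once and close the descriptor again on success. */
static enum secretmem_status secretmem_probe(void)
{
#ifdef SYS_memfd_secret
    int fd = (int)syscall(SYS_memfd_secret, 0);
    int err = errno; /* save errno before close() can change it */

    if (fd >= 0)
        close(fd);
    return secretmem_classify(fd, err);
#else
    /* The installed headers do not know the syscall number at all. */
    return SECRETMEM_DISABLED;
#endif
}
```

Calling secretmem_probe() before secret_init() would let dumpee print a hint about secretmem.enable=1 when the result is SECRETMEM_DISABLED, rather than a generic error.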

Let's execute this program and examine its file descriptors,

./dumpee
pid: 186078
11 bytes of secret data stored successfully at 0x7ff75e817000
ls /proc/186078/fd/
0  1  2  3

Process 186078 has four file descriptors attached to it. Like any Linux process, it has the usual stdin: 0, stdout: 1, and stderr: 2. Let's examine what file descriptor 3 is,

stat /proc/186078/fd/3
  File: /proc/186078/fd/3 -> /secretmem (deleted)
  Size: 64          Blocks: 0          IO Block: 1024   symbolic link
Device: 0,22    Inode: 1000267     Links: 1
Access: (0700/lrwx------)  Uid: ( 1000/ dhanuka)   Gid: ( 1000/ dhanuka)
Access: 2023-08-26 17:26:51.819442448 +0530
Modify: 2023-08-26 17:26:49.807432713 +0530
Change: 2023-08-26 17:26:49.807432713 +0530
 Birth: -

As shown, file descriptor 3 can be identified as a memfd_secret file descriptor (pointing to secretmem). Now that we have a process, 186078, that holds a memfd_secret file descriptor, let's try to checkpoint it with the latest CRIU release,

git checkout criu-dev

./criu/criu --version
Version: 3.18

sudo ./criu/criu dump -D dumpdir/ --shell-job -t 186078
Error (criu/proc_parse.c:467): Unknown shit 100600 (/secretmem (deleted))
Error (criu/proc_parse.c:694): Can't open 186078's mapfile link 7ff75e817000: No such device or address
Error (criu/cr-dump.c:1558): Collect mappings (pid: 186078) failed with -1
Error (criu/cr-dump.c:2093): Dumping FAILED.

As shown, the checkpoint operation fails. This is because, during a checkpoint operation, CRIU attempts (among other things) to dump/save all the file descriptors attached to the process. In this case, one of them is a memfd_secret file descriptor. Since CRIU doesn't (yet) understand how to parse a memfd_secret file descriptor, the checkpoint operation fails with the subtle Unknown shit error.
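To make the idea of recognizing such a descriptor concrete, here is a minimal sketch that inspects the /proc fd link, the same information the stat output above displays. It is only an illustration and not how CRIU does it: the helpers link_is_secretmem and fd_is_secretmem are hypothetical, and the prefix match on /secretmem is a heuristic that a regular file with that path would also satisfy.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Link target prefix seen in the stat output: "/secretmem (deleted)". */
#define SECRETMEM_LINK_PREFIX "/secretmem"

/* Return 1 if a /proc/<pid>/fd/<n> link target looks like secretmem. */
static int link_is_secretmem(const char *target)
{
    return strncmp(target, SECRETMEM_LINK_PREFIX,
                   strlen(SECRETMEM_LINK_PREFIX)) == 0;
}

/*
 * Return 1 if fd of the calling process looks like a memfd_secret
 * descriptor, 0 if it does not, and -1 if the link cannot be read.
 */
static int fd_is_secretmem(int fd)
{
    char path[64];
    char target[256];
    ssize_t len;

    snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);

    /* readlink() does not NUL-terminate, so leave room and do it here. */
    len = readlink(path, target, sizeof(target) - 1);
    if (len < 0)
        return -1;
    target[len] = '\0';

    return link_is_secretmem(target);
}
```

In dumpee, fd_is_secretmem(3) would be expected to return 1 once secret_init() has succeeded, while descriptors 0 to 2 would return 0.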

Now for the exciting part: let's try to checkpoint process 186078 using CRIU with my feature implementation in place,

git checkout memfd-secret

sudo ./criu/criu dump -D dumpdir/ -v4 --shell-job -t 186078

...........
(00.225313) Writing image inventory (version 1)
(00.225403) Running post-dump scripts
(00.225408) Unfreezing tasks into 2
(00.225411)     Unseizing 186078 into 2
(00.225565) Writing stats
(00.225619) Dumping finished successfully

As shown, the checkpoint operation succeeds. After a successful checkpoint, it's a good idea to inspect the dumpdir directory to verify what was written,

ls dumpdir
core-186078.img  inventory.img                  pages-1.img  timens-0.img
fdinfo-2.img     memfd-secret.img               pages-2.img  tty-info.img
files.img        mm-186078.img                  pstree.img
fs-186078.img    pagemap-186078.img             seccomp.img
ids-186078.img   pagemap-secretmem-1000258.img  stats-dump

dumpdir/pages-1.img is a file of interest. Let's examine it,

hexdump -C dumpdir/pages-1.img 
00000000  48 65 6c 6c 6f 20 57 6f  72 6c 64 00 00 00 00 00  |Hello World.....|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000

Evidently, our secretmem content, Hello World, is in dumpdir/pages-1.img, so the checkpoint operation is a success. Now let's try to restore this checkpointed process,

sudo ./criu/criu restore -D dumpdir/ -v4 --shell-job

...........
(00.041186) Running pre-resume scripts
(00.041202) Restore finished successfully. Tasks resumed.
(00.041209) Writing stats
(00.041342) Running post-resume scripts

As shown, the restore operation succeeds as well. Let's look for this restored process and examine its file descriptors to verify the accuracy of the restore operation,

pgrep dumpee
186078

ls /proc/186078/fd
0  1  2  3

Evidently, the restored process 186078 has exactly the same number of file descriptors attached to it. Let's examine file descriptor 3 to see whether it's a memfd_secret file descriptor or not.

stat /proc/186078/fd/3
  File: /proc/186078/fd/3 -> /secretmem (deleted)
  Size: 64          Blocks: 0          IO Block: 1024   symbolic link
Device: 0,22    Inode: 1037589     Links: 1
Access: (0700/lrwx------)  Uid: ( 1000/ dhanuka)   Gid: ( 1000/ dhanuka)
Access: 2023-08-26 18:07:25.787215250 +0530
Modify: 2023-08-26 18:07:24.019204721 +0530
Change: 2023-08-26 18:07:24.019204721 +0530
 Birth: -

Bingo! File descriptor 3 is indeed memfd_secret! However, the verification of the restored process is not over yet. One last thing to check is whether the original secretmem content, Hello World, is present in the restored process's secretmem area referred to by file descriptor 3. To do that, let's perform a second checkpoint (to a new directory) of the restored process 186078 and examine the content of its pages-1.img file,

pgrep dumpee
186078

sudo criu dump -D dumpdir2/ -v4 --shell-job -t 186078

hexdump -C dumpdir2/pages-1.img 
00000000  48 65 6c 6c 6f 20 57 6f  72 6c 64 00 00 00 00 00  |Hello World.....|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000

Evidently, the restored process not only has a memfd_secret file descriptor attached to it but also holds the original secretmem content: Hello World. So, the feature implementation is complete and it works.
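The two hexdump checks above can also be scripted. The following is a minimal sketch, assuming (as in the dumps shown here) that the secret page is the first page stored in pages-1.img; in general the offset would have to be taken from the accompanying pagemap image. The helper file_starts_with is hypothetical and not part of CRIU.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if the file at path begins with the first len bytes of
 * expected, 0 if it does not, and -1 if the file cannot be opened or
 * len exceeds the small buffer used by this sketch.
 */
static int file_starts_with(const char *path, const void *expected, size_t len)
{
    unsigned char buf[256];
    size_t got;
    FILE *f;

    if (len > sizeof(buf))
        return -1;

    f = fopen(path, "rb");
    if (!f)
        return -1;

    got = fread(buf, 1, len, f);
    fclose(f);

    /* A file shorter than len cannot contain the expected bytes. */
    if (got != len)
        return 0;

    return memcmp(buf, expected, len) == 0;
}
```

For the walkthrough above, the check would be file_starts_with("dumpdir2/pages-1.img", "Hello World", 11), run with enough privileges to read the image files.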

Closing thoughts

I had a phenomenal experience working on this project under the mentorship of Alexander Mikhalitsyn and Mike Rapoport. Initially, I knew nothing about CRIU or secretmem, and I had to learn everything on the fly. It wasn't easy, but I had help from my mentors, and CRIU is an extremely well-engineered project. The feature implementation was an iterative process: all I had to do was follow and fix one error at a time. Along the way, I learned how to read complex code, from weird-looking macros to complex function implementations, and how to debug and fix tricky errors. Overall, this was an enormous opportunity for me, and I'm truly grateful to my mentors for their support and guidance.

Stay tuned for my next blog post.