.. SPDX-License-Identifier: GPL-2.0

============
SandBox Mode
============

Introduction
============

The primary goal of SandBox Mode (SBM) is to reduce the impact of potential
memory safety bugs in kernel code by decomposing the kernel. The SBM API
makes it possible to run each component inside an isolated execution
environment. In particular, memory areas used as input and/or output are
isolated from the rest of the kernel and surrounded by guard pages. Without
arch hooks, this common base provides *weak isolation*.

On architectures which implement the necessary arch hooks, SandBox Mode
leverages hardware paging facilities and CPU privilege levels to enforce the
use of only these predefined memory areas. With arch support, SBM can also
recover from protection violations. This means that SBM forcibly terminates
the sandbox and returns an error code (e.g. ``-EFAULT``) to the caller, so
execution can continue. Such an implementation provides *strong isolation*.

A target function in a sandbox communicates with the rest of the kernel
through a caller-defined interface, comprising read-only buffers (input),
read-write buffers (output) and the return value. The caller can explicitly
share other data with the sandbox, but doing so may reduce isolation strength.

Protection of sensitive kernel data is currently out of scope. SandBox Mode
is meant to run kernel code which would otherwise have full access to all
system resources. SBM makes it possible to impose a scoped access control
policy on which resources are available to the sandbox. That said, protection
of sensitive data is foreseen as a future goal, which is why the API is
designed to control not only memory writes but also memory reads.

The expected use case for SandBox Mode is parsing data from untrusted sources,
especially if the parsing cannot be reasonably done by a user mode helper.

Keep in mind that a sandbox doesn't guarantee that the output data is correct.
The result may be corrupt (e.g.
as a result of an exploited bug) and, where applicable, it should be sanitized
before further use.

Using SandBox Mode
==================

SandBox Mode is an optional feature, enabled with ``CONFIG_SANDBOX_MODE``.
However, the SBM API is always defined regardless of the kernel configuration.
It will call a function with the best available isolation, which is:

* *strong isolation* if both ``CONFIG_SANDBOX_MODE`` and
  ``CONFIG_ARCH_HAVE_SBM`` are set,
* *weak isolation* if ``CONFIG_SANDBOX_MODE`` is set, but
  ``CONFIG_ARCH_HAVE_SBM`` is unset,
* *no isolation* if ``CONFIG_SANDBOX_MODE`` is unset.

Code which cannot safely run with no isolation should depend on the relevant
config option(s).

The API can be used like this:

.. code-block:: c

    #include <linux/sbm.h>

    /* Function to be executed in a sandbox. */
    static SBM_DEFINE_FUNC(my_func, const struct my_input *, in,
                           struct my_output *, out)
    {
            /* Read from in, write to out. */
            return 0;
    }

    int caller(...)
    {
            /* Declare a SBM instance. */
            struct sbm sbm;
            /* Result of the sandboxed call. */
            int err;

            /* Initialize SBM instance. */
            sbm_init(&sbm);

            /* Execute my_func() using the SBM instance. */
            err = sbm_call(&sbm, my_func,
                           SBM_COPY_IN(&sbm, input, in_size),
                           SBM_COPY_OUT(&sbm, output, out_size));

            /* Clean up. */
            sbm_destroy(&sbm);

            return err;
    }

The return type of a sandbox mode function is always ``int``. The return value
is zero on success and negative on error. That's because the SBM helpers
return an error code (such as ``-ENOMEM``) if the call cannot be performed.

If sbm_call() returns an error, you can use sbm_error() to decide whether the
error was returned by the target function or because sandbox mode was aborted
(or failed to run entirely).

Public API
----------

.. kernel-doc:: include/linux/sbm.h
   :identifiers: sbm sbm_init sbm_destroy sbm_exec sbm_error
                 SBM_COPY_IN SBM_COPY_OUT SBM_COPY_INOUT
                 SBM_DEFINE_CALL SBM_DEFINE_THUNK SBM_DEFINE_FUNC
                 sbm_call

Arch Hooks
----------

These hooks must be implemented to select HAVE_ARCH_SBM.

.. kernel-doc:: include/linux/sbm.h
   :identifiers: arch_sbm_init arch_sbm_destroy arch_sbm_exec
                 arch_sbm_map_readonly arch_sbm_map_writable

X86_64 Implementation
=====================

The x86_64 implementation provides strong isolation and recovery from CPU
exceptions.

Sandbox mode runs in protection ring 3 (same as user mode). This means that:

* sandbox code cannot execute privileged CPU instructions,
* memory accesses are treated as user accesses.

The thread stack is readable in sandbox mode, because an on-stack data
structure is used by call helpers and thunks to pass target function
arguments. However, it is not writable, and sandbox code runs on its own
stack. The thread stack is not used by interrupt handlers either. Non-IST
interrupt handlers run on a separate sandbox exception stack.

The interrupt entry path modifies the saved pt_regs to make it appear as
coming from kernel mode. The CR3 register is then switched to kernel mode.
The interrupt exit path is modified to restore actual pt_regs and switch the
CR3 register back to its sandbox mode value, overriding CR3 changes for page
table isolation.

Support for paravirtualized kernels is not (yet) provided.

Current Limitations
===================

This section lists known limitations of the current SBM implementation, which
are planned to be removed in the future.

Stack
-----

There is no generic kernel API to run a function on an alternate stack, so SBM
runs on the normal kernel stack by default. The kernel already offers
self-protection against stack overflows and underflows as well as against
overwriting on-stack data outside the current frame, but violations are
usually fatal.

This limitation can be solved for specific targets. Arch hooks can set up a
separate stack and recover from stack frame overruns.

Inherent Limitations
====================

This section lists limitations which are inherent to the concept.


Explicit Code
-------------

The main idea behind SandBox Mode is decomposition of one big program (the
Linux kernel) into multiple smaller programs that can be sandboxed. There is
no known way to automate this task for an existing code base in C.

Given the performance impact of running code in a sandbox, this limitation may
be perceived as a benefit. It is expected that sandbox mode is introduced only
knowingly and only where safety is more important than performance.

Complex Data
------------

Although data structures are not serialized and deserialized between kernel
mode and sandbox mode, all directly and indirectly referenced data structures
must be explicitly mapped into the sandbox, which requires some manual effort.

Copying of input/output buffers also incurs some runtime overhead. This
overhead can be reduced by sharing data directly with the sandbox, but the
resulting isolation is weaker, so it may or may not be acceptable, depending
on the overall safety requirements.

Page Granularity
----------------

Since paging is used to enforce memory safety, page size is the smallest unit.
Objects mapped into the sandbox must be aligned to a page boundary, and buffer
overflows may not be detected if they fit into the same page.

On the other hand, even though such writes are not detected, they do not
corrupt kernel data, because only the output buffer is copied back to kernel
mode, and the (corrupted) rest of the page is ignored.

Transitions
-----------

Transitions between kernel mode and sandbox mode are synchronous. That is,
whenever entering or leaving sandbox mode, the currently running CPU executes
the instructions necessary to save/restore its kernel-mode state. The API is
generic enough to allow asynchronous transitions, e.g. to pass data to another
CPU which is already running in sandbox mode. However, to see the benefits, a
hypothetical implementation would require far-reaching changes in the kernel
scheduler. This is (currently) out of scope.
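
As a closing illustration of the page-granularity limitation described
earlier, the following user-space sketch computes how many bytes beyond the
end of a buffer still fall inside its last page, that is, how far an overflow
could run without reaching a guard page. It is not part of the SBM API. The
4 KiB page size, the helper names ``pages_needed()`` and
``undetected_slack()``, and the assumption that the buffer starts exactly at a
page boundary are all hypothetical and chosen only for this example.

.. code-block:: c

    #include <stddef.h>
    #include <stdio.h>

    /* Assumption for this sketch: 4 KiB pages. */
    #define EXAMPLE_PAGE_SIZE 4096UL

    /* Pages needed for len bytes that start at a page boundary. */
    static size_t pages_needed(size_t len)
    {
            return (len + EXAMPLE_PAGE_SIZE - 1) / EXAMPLE_PAGE_SIZE;
    }

    /* Bytes past the buffer end that still lie inside its last page. */
    static size_t undetected_slack(size_t len)
    {
            return pages_needed(len) * EXAMPLE_PAGE_SIZE - len;
    }

    int main(void)
    {
            const size_t sizes[] = { 100, 4096, 5000 };
            size_t i;

            for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
                    printf("%zu-byte buffer: %zu page(s), %zu byte(s) of slack\n",
                           sizes[i], pages_needed(sizes[i]),
                           undetected_slack(sizes[i]));
            return 0;
    }

Under these assumptions, a 100-byte buffer leaves 3996 bytes in which an
overflow would go unnoticed, a 4096-byte buffer leaves none, and a 5000-byte
buffer spans two pages with 3192 bytes of slack. Buffers whose size is a
multiple of the page size are therefore the only ones for which every overflow
reaches the guard page immediately.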