PendSV Allows Simple and Efficient Context Switch on Cortex-M

There are many ways to implement context switching on a Cortex-M CPU. A popular method uses the PendSV exception, as demonstrated by FreeRTOS and discussed in the book The Definitive Guide to Arm Cortex-M3 and Cortex-M4 Processors. In this post, we explore the benefits of using PendSV to implement context switching.

Note: In Arm terminology, both interrupts generated by the CPU (e.g., SysTick, SVC, PendSV) and those generated by peripherals (e.g., TIM, USART, DMA) are collectively called exceptions. Additionally, in this post, a task refers to a thread of execution.

Easier Concurrency Reasoning

Implementing context switching with PendSV lets us centralize all context-switch code in the PendSV handler. All other exceptions, including external interrupts, SysTick, and SVC, never perform a context switch themselves; when one is needed, they mark PendSV as pending and return. This design simplifies the code by ensuring that no context switch occurs while handling any exception other than PendSV. In other words, when a task is interrupted by an exception, the same task resumes execution after the exception is serviced, as long as the exception is not PendSV.
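To make the "defer to PendSV" idea concrete, here is a minimal sketch of a SysTick handler that requests a context switch without performing one. PendSV is pended by writing the PENDSVSET bit (bit 28) of the ICSR register at 0xE000ED04. The TARGET_CORTEX_M macro, the mock register used for host builds, and the ticks_until_wakeup counter are assumptions made for illustration, not part of any real RTOS.

```c
#include <stdint.h>

/* On a real target, ICSR lives at 0xE000ED04 in the System Control Block.
 * TARGET_CORTEX_M is a hypothetical build flag; without it, a plain variable
 * stands in for the register so the sketch can be compiled on a host. */
#ifdef TARGET_CORTEX_M
#define ICSR (*(volatile uint32_t *)0xE000ED04u)
#else
static volatile uint32_t mock_icsr;
#define ICSR mock_icsr
#endif

#define ICSR_PENDSVSET (1u << 28) /* writing 1 marks PendSV as pending */

/* Hypothetical scheduler state: a sleeping task becomes ready in 2 ticks. */
static uint32_t ticks_until_wakeup = 2;

/* Request a context switch; the switch itself happens later, in PendSV. */
static inline void request_context_switch(void)
{
    ICSR = ICSR_PENDSVSET; /* zero bits written to ICSR have no effect */
}

/* The tick handler only does bookkeeping and, if needed, pends PendSV.
 * It always returns to the task it interrupted. */
void SysTick_Handler(void)
{
    if (ticks_until_wakeup > 0 && --ticks_until_wakeup == 0) {
        request_context_switch();
    }
}
```

Any other handler that makes a task ready could call request_context_switch() in the same way, so the register save and restore logic stays in one place.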

In contrast, concurrency reasoning becomes more challenging when multiple exception handlers can perform a context switch. It becomes harder to prevent race conditions because there are more code paths where the running task can be preempted and switched out. Similarly, reasoning about reentrancy gets more complicated. Even worse, exceptions can nest, requiring extra logic to handle cases where a context switch occurs in a nested handler or where a context switch is interrupted by a nested handler that may trigger another context switch.

Better Performance

Restricting context switches to PendSV can lead to better overall system performance. This may seem counterintuitive, since every other exception must chain a PendSV to perform a context switch, but the approach reduces the register-saving overhead of all other exception handlers.

On Cortex-M, performing a context switch requires saving all 16 core registers (r0-r12, sp, lr, and pc) plus the program status register xpsr. However, if we assume the exception handler will always return to the interrupted task, we only need to preserve r0-r3, r12, lr, pc, and xpsr, a total of 8 registers.

The reason for not needing to save r4-r11 is that these registers are callee-saved. If the invoked handler function needs these registers for computation, it will contain instructions to preserve them before use. Subsequently, the handler function will restore their previous values before exception return.

When the floating point unit (FPU) is enabled, the difference becomes even more significant. For a context switch, registers s0-s31 and fpscr must be saved. However, a handler that is guaranteed to return to the interrupted task only needs to save s0-s15 and fpscr, a further difference of 16 registers.
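The register counts above can be pictured as stack frames. The sketch below uses hypothetical struct names of my own; the field order follows the frames that Cortex-M hardware pushes on exception entry, and the "extra" structs are what a context-switching handler would additionally have to save in software.

```c
#include <stdint.h>

/* Basic frame: what every exception entry preserves (8 words). */
struct basic_frame {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
};

/* Extended frame: used when the interrupted code has an active
 * floating point context (26 words). */
struct extended_frame {
    struct basic_frame core;
    uint32_t s[16];    /* s0-s15 */
    uint32_t fpscr;
    uint32_t reserved; /* padding word that keeps the frame 8-byte aligned */
};

/* Extra state that only a context switch has to save. */
struct switch_extra_core {
    uint32_t r4_r11[8];   /* callee-saved core registers r4-r11 */
};
struct switch_extra_fp {
    uint32_t s16_s31[16]; /* callee-saved FPU registers s16-s31 */
};
```

The stack pointer itself is not part of any frame; a context switch typically stores it in the task's control block. Counting it, a full switch with the FPU in use covers 8 + 8 + 1 core registers plus 32 + 1 floating point registers, which is where the figure of 50 comes from.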

Since exceptions other than PendSV are likely to be predominant, reducing the register preservation and restoration overhead for the majority of cases yields better overall system performance.

Even Better Performance with Cortex-M Hardware Features

Three features of Cortex-M further improve performance when we centralize context switch logic into PendSV:

Hardware register stacking
Optimized exception tail chaining
Lazy floating point register stacking

The first two features together reduce the overhead of chaining a PendSV after another exception. On exception entry, Cortex-M hardware preserves r0-r3, r12, lr, pc, and xpsr on the stack, and on exception return it restores them; these steps are called register stacking and unstacking. Optimized tail chaining occurs when the CPU notices another pending exception right before or during exception return. In this case, the CPU proceeds directly to the pending exception’s handler function, skipping the redundant unstacking and the immediate restacking that would otherwise follow. This optimization also applies to s0-s15 and fpscr when the FPU is enabled.
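Tail chaining into PendSV only happens if PendSV waits for the current handler to finish instead of preempting it. The post does not cover priority configuration, but a common way to get this behavior (FreeRTOS does something similar) is to give PendSV the lowest priority. Here is a minimal sketch, assuming the PendSV priority field in bits 23:16 of SHPR3 at 0xE000ED20; TARGET_CORTEX_M and the mock register are illustrative assumptions.

```c
#include <stdint.h>

/* SHPR3 holds the priorities of PendSV (bits 23:16) and SysTick (bits 31:24).
 * TARGET_CORTEX_M is a hypothetical build flag; without it, a plain variable
 * stands in for the register so the sketch can be compiled on a host. */
#ifdef TARGET_CORTEX_M
#define SHPR3 (*(volatile uint32_t *)0xE000ED20u)
#else
static volatile uint32_t mock_shpr3;
#define SHPR3 mock_shpr3
#endif

#define SHPR3_PENDSV_MASK (0xFFu << 16)

/* Larger values mean lower priority, so setting all bits of the field
 * selects the lowest priority the device implements. */
void pendsv_set_lowest_priority(void)
{
    SHPR3 |= SHPR3_PENDSV_MASK;
}
```

With this in place, a PendSV pended from inside another handler stays pending until that handler (and any other active handler) returns, at which point the CPU can tail-chain into it.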

The third feature, lazy stacking of floating point registers, postpones the stacking of floating point registers until the execution of the first floating point instruction in an exception handler. If no floating point instruction is executed in the handler, the floating point registers are not stacked, and the exception return only needs to restore r0-r3, r12, lr, pc, and xpsr.
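Lazy stacking is controlled by two bits in the FPCCR register at 0xE000EF34: ASPEN (bit 31) enables automatic preservation of floating point state on exception entry, and LSPEN (bit 30) makes that preservation lazy. As far as I know, both bits are already set out of reset on Cortex-M4, so the sketch below mainly shows how to check or restore the setting. TARGET_CORTEX_M, the mock register, and the function names are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* TARGET_CORTEX_M is a hypothetical build flag; without it, a plain variable
 * stands in for FPCCR so the sketch can be compiled on a host. */
#ifdef TARGET_CORTEX_M
#define FPCCR (*(volatile uint32_t *)0xE000EF34u)
#else
static volatile uint32_t mock_fpccr;
#define FPCCR mock_fpccr
#endif

#define FPCCR_ASPEN (1u << 31) /* automatic FP state preservation */
#define FPCCR_LSPEN (1u << 30) /* lazy FP state preservation */

/* Turn on automatic and lazy preservation of s0-s15 and fpscr. */
void fpu_enable_lazy_stacking(void)
{
    FPCCR |= FPCCR_ASPEN | FPCCR_LSPEN;
}

/* Report whether both bits are currently set. */
bool fpu_lazy_stacking_enabled(void)
{
    const uint32_t both = FPCCR_ASPEN | FPCCR_LSPEN;
    return (FPCCR & both) == both;
}
```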

Conclusion

An exception handler only needs to preserve and restore 8 registers (r0-r3, r12, lr, pc, and xpsr) if it always returns to the interrupted task and executes no floating point instructions. In contrast, on a system with the FPU enabled, a handler that may perform a context switch must preserve and restore 50 registers (r0-r12, sp, lr, pc, xpsr, s0-s31, and fpscr). Thus, allowing only PendSV to perform context switches significantly speeds up all other exception handlers. Although other handlers must chain a PendSV to switch context, this overhead is mitigated by Cortex-M’s optimized exception tail chaining. Centralizing context switch logic in PendSV also simplifies concurrency reasoning.