A disappearing Service Processor

Laura Abbott

Overview

This article details Oxide Computer's debugging journey to find why their Service Processor (SP) would intermittently disappear from the network on their next-generation Cosmo sled. The root cause turned out to be mismatched memory attributes between the ARM Cortex-M7's Memory Protection Unit (MPU) configuration for unprivileged tasks and the default memory map used by the privileged kernel, causing the CPU to issue incorrectly-sized bus accesses to an FPGA over the Flexible Memory Controller (FMC) bus.

What You'll Learn

1. How mismatched ARM MPU memory attributes can cause hard-to-diagnose bus hangs on Cortex-M processors
2. Why the default ARM memory map assigns different caching attributes to different address ranges and how this affects peripheral access
3. How to use ARM vector catch resets to debug a hung CPU that cannot be halted via SWD
4. Why store buffer interactions between privilege levels can cause unexpected memory accesses on the FMC bus
5. How to systematically narrow down hardware/software bugs when network-based debugging is unavailable

Prerequisites & Requirements

  • Understanding of ARM Cortex-M architecture including MPU, memory maps, and privilege levels
  • Familiarity with embedded systems concepts such as RTOS task scheduling, stack sizing, and bus protocols
  • Basic understanding of CPU caching, store buffers, and memory ordering
  • Experience with SWD debug probes for ARM microcontrollers (optional)
  • Familiarity with FPGA interfaces and parallel bus protocols (optional)

Key Questions Answered

What causes an ARM Cortex-M7 Service Processor to become unresponsive on the network?
In this case, the CPU hung because two views of the same FMC addresses disagreed. The MPU configuration used by unprivileged tasks mapped the FMC region as Device Memory, while the default memory map used by the privileged kernel treated it as Normal Cached. The mismatch let the CPU issue incorrectly sized bus accesses to an FPGA over the FMC bus: the FPGA expected 32-bit accesses, but 8-bit or 16-bit accesses from cache writebacks caused the bus to hang indefinitely.
How do mismatched ARM MPU memory attributes cause bus hangs?
When a task running in unprivileged mode issues a store to an FMC address mapped as Device Memory, the store enters the processor's store buffer. If an interrupt switches the CPU to privileged mode before that store has drained, the default memory map applies, the same address is now Normal Cached, and the store hits the cache instead. The cache may later write the data back using a different access size than the peripheral expects, which violates the bus protocol and leaves the CPU waiting for a bus acknowledgement that never comes.
How can you debug a hung ARM CPU that won't respond to SWD debug halt?
ARM CPUs support vector catch, which can be configured to halt the CPU on reset before it executes the first instruction. By triggering a reset with vector catch enabled, the team was able to unstick the hung CPU without trampling the contents of RAM. The live register state and program counter were lost, but the rest of the Hubris state in RAM was preserved across the reset, allowing inspection of task states and memory contents.
What is the fix for FMC bus hangs caused by mismatched ARM memory attributes?
The fix was changing the FMC base address to a section of the ARM default memory map that has matching attributes (Device Memory, no caching), so both privileged and unprivileged accesses use the same memory type. The STM32H7 FMC supports changing its base address to appear in this correctly-attributed section, likely specifically to avoid this class of problem. No instances of the issue occurred after this fix was merged.
How does Hubris OS handle task isolation and what are its stack overflow risks?
Hubris uses the ARM Memory Protection Unit (MPU) to provide isolation between tasks and enforce privilege levels. Unprivileged tasks run with MPU-enforced memory regions, while the kernel uses the default memory map. While writing Hubris in Rust eliminates buffer overflows, stack overflows remain a risk because Hubris requires manual stack sizing for tasks. The emit-stack-sizes compiler feature helps detect undersized stacks, but edge cases still occur.
What symptoms indicate a Service Processor has become unresponsive in an Oxide rack?
Key symptoms include: the SP stops broadcasting on the management network, no increases in network data counters from the SP, fans spinning at a constant elevated rate (indicating the fan controller fell back to emergency full power mode since the SP handles thermal control), and the host AMD CPU remaining alive (proving the system still had power). The issue was only reproducible inside a rack, not on a standalone sled.
How did the measured boot reset change help reproduce the disappearing SP bug?
The measured boot work required the SP to reset itself multiple times in a row at first bootup for security properties related to the Root of Trust (RoT) hash verification. This change dramatically improved reproduction rate from potentially 24+ hours down to approximately 10-20 minutes, because the repeated resets increased the probability of hitting the race condition where a store buffer entry crosses the privilege boundary with mismatched memory attributes.
Why can't a CPU hung on a bus access be halted via debug probe?
When a CPU is waiting for a bus acknowledgement that never arrives because a peripheral is not responding, a halt request over SWD may never take effect. The core is stuck in a hardware-level wait state, below anything the normal debug halt mechanism can interrupt. Vector catch on reset provides an alternative: the reset abandons the stuck bus transaction, and the CPU halts at a known state before running any code.
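
To make the attribute mismatch in the answers above concrete, here is a minimal Rust sketch that looks up the memory type the ARMv7-M default memory map assigns to an address and flags MPU regions that disagree with it. The MpuRegion struct and mismatched_regions helper are hypothetical names for illustration, not Hubris APIs, and the example addresses mentioned afterwards (0x6000_0000 and 0xC000_0000) are assumptions picked from the architectural map rather than values quoted from the article.

```rust
/// Memory types distinguished by the ARMv7-M architecture.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MemType {
    Normal,
    Device,
    StronglyOrdered,
}

/// Memory type assigned by the ARMv7-M default memory map.
/// The map is split into eight 512 MiB regions selected by the
/// top three address bits.
pub fn default_map_type(addr: u32) -> MemType {
    match addr >> 29 {
        0..=1 => MemType::Normal, // 0x0000_0000 Code, 0x2000_0000 SRAM
        2 => MemType::Device,     // 0x4000_0000 Peripheral
        3..=4 => MemType::Normal, // 0x6000_0000 and 0x8000_0000 external RAM
        5..=6 => MemType::Device, // 0xA000_0000 and 0xC000_0000 external device
        _ => {
            // 0xE000_0000 System: the first 1 MiB is the Private
            // Peripheral Bus (Strongly-ordered), the rest is Device.
            if addr < 0xE010_0000 {
                MemType::StronglyOrdered
            } else {
                MemType::Device
            }
        }
    }
}

/// Hypothetical description of one MPU region as a task sees it.
/// Assumption: the region does not cross a 512 MiB boundary, so
/// checking its base address is enough.
pub struct MpuRegion {
    pub base: u32,
    pub size: u32,
    pub mem_type: MemType,
}

/// Returns the regions whose task-visible memory type differs from the
/// type that code relying on the default memory map would see.
pub fn mismatched_regions(regions: &[MpuRegion]) -> Vec<&MpuRegion> {
    regions
        .iter()
        .filter(|r| default_map_type(r.base) != r.mem_type)
        .collect()
}
```

Under these assumptions, a Device region declared at 0x6000_0000 is flagged because the default map treats that range as Normal memory, while the same declaration at 0xC000_0000 agrees with the default map and passes.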

Key Statistics & Figures

Kernel stack margin
512 bytes
Considered relatively large, making kernel stack overflow an unlikely cause of the issue
Original reproduction time
24+ hours
Time needed to reproduce the SP disappearing issue before the measured boot change
Improved reproduction time
10-20 minutes
After the measured boot change that caused multiple SP resets at bootup

Technologies & Tools

Operating System
Hubris
Oxide's custom embedded operating system running on the Service Processor with task-based architecture
Programming Language
Rust
Language used to write Hubris, eliminating bug classes such as buffer overflows
Microcontroller
STM32H7
ARM Cortex-M7 based Service Processor MCU with Flexible Memory Controller
CPU Architecture
ARM Cortex-M7
Processor core used in the Service Processor with MPU and cache support
Hardware
FPGA
Controls system components like host flash, connected via parallel bus through the FMC
Debug Interface
SWD
Serial Wire Debug protocol used for manufacturing debug and emergency diagnostics
Hardware
ARM MPU
Memory Protection Unit providing task isolation and privilege enforcement in Hubris
Hardware Interface
FMC
STM32H7 Flexible Memory Controller providing parallel bus interface to the FPGA

Key Actionable Insights

1
Always ensure memory attributes are consistent across all privilege levels when using the ARM MPU. When the MPU maps a peripheral region as Device Memory for unprivileged tasks, the privileged kernel's default memory map must also treat that same address range as Device Memory. Mismatched attributes can cause the CPU to issue unexpected access sizes through cache writebacks.
Section A3.5.7 of the ARMv7-M Architecture Reference Manual explicitly warns about mismatched memory attributes and states that 'preservation of the size of accesses' is lost. ARM strongly recommends that software not use mismatched attributes for aliases of the same location.
2
Use observable indicators (like LED blinking) as a debugging fallback when network-based debugging is unavailable. Converting a chassis LED from always-on to blinking provided the team with physical evidence of whether the SP was still making progress, even without network access.
This is particularly valuable in embedded systems deployed in racks or other environments where physical access is limited and network connectivity is the primary debugging channel.
3
When debugging intermittent hardware/software interaction bugs, focus on finding changes that improve reproduction rate before attempting root cause analysis. The team's measured boot change accidentally improved reproduction from 24+ hours to 10-20 minutes, which was critical to enabling the many experiments needed to isolate the root cause.
A higher reproduction rate allows for rapid hypothesis testing through systematic elimination of potential causes, even when individual experiments may not be immediately fruitful.
4
Be aware that modern CPUs can perform memory accesses that are invisible to the programmer, particularly through cache writebacks and store buffer drains. The running program counter may have no relation to the address being accessed on the bus, making traditional debugging approaches like examining the program counter misleading.
This is especially relevant in systems mixing cached and uncached memory regions, or when privilege level transitions occur (such as interrupt entry) that change the effective memory attributes.
5
When using an FPGA connected via a parallel bus interface like the STM32H7 FMC, place the FMC base address in an address region where the default ARM memory map assigns Device Memory attributes. The STM32H7 specifically supports remapping the FMC base address to a correctly-attributed region to avoid cache-related bus access issues.
This avoids the class of bugs where kernel-mode code (or hardware mechanisms like cache writebacks) inadvertently accesses the FPGA through a Normal Cached memory mapping, potentially issuing wrong-sized bus transactions.
6
Use ARM vector catch as a diagnostic tool when standard debug halt fails on a hung CPU. Configure the CPU to halt on reset before executing the first instruction, then trigger a reset. This preserves RAM state while unsticking the processor, allowing post-mortem analysis of OS task states and memory contents even after a bus hang.
This technique is especially useful when the CPU is stuck waiting for a bus acknowledgement and cannot respond to normal SWD halt requests. The register state is lost but RAM contents remain intact.
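
As a rough illustration of the vector catch technique in the last insight, the sketch below lists the three debug-side register writes typically involved on an ARMv7-M core: enable halting debug, arm the reset vector catch, and request a system reset. The register addresses and bit positions come from the ARMv7-M architecture; the DebugAccess trait and reset_and_catch function are hypothetical stand-ins for whatever probe software is in use, and the article does not say which tool or reset source the team actually used. On a wedged part the reset request through AIRCR may not take effect, in which case asserting the hardware reset line after arming the catch is the usual alternative.

```rust
/// Debug Halting Control and Status Register.
pub const DHCSR: u32 = 0xE000_EDF0;
/// Debug Exception and Monitor Control Register.
pub const DEMCR: u32 = 0xE000_EDFC;
/// Application Interrupt and Reset Control Register.
pub const AIRCR: u32 = 0xE000_ED0C;

const DHCSR_DBGKEY: u32 = 0xA05F << 16; // key required for DHCSR writes
const DHCSR_C_DEBUGEN: u32 = 1 << 0; // enable halting debug
const DEMCR_VC_CORERESET: u32 = 1 << 0; // halt on the reset vector
const AIRCR_VECTKEY: u32 = 0x05FA << 16; // key required for AIRCR writes
const AIRCR_SYSRESETREQ: u32 = 1 << 2; // request a system reset

/// Hypothetical 32-bit memory access through a debug probe.
pub trait DebugAccess {
    fn read32(&mut self, addr: u32) -> u32;
    fn write32(&mut self, addr: u32, value: u32);
}

/// Arms the reset vector catch and then requests a reset, so the core
/// halts before executing its first instruction while RAM is left alone.
pub fn reset_and_catch<D: DebugAccess>(probe: &mut D) {
    // Vector catch only halts the core when halting debug is enabled.
    probe.write32(DHCSR, DHCSR_DBGKEY | DHCSR_C_DEBUGEN);

    // Read-modify-write so other DEMCR settings (trace enable and so on)
    // are preserved.
    let demcr = probe.read32(DEMCR);
    probe.write32(DEMCR, demcr | DEMCR_VC_CORERESET);

    // Request the reset; the core should stop at the reset vector.
    probe.write32(AIRCR, AIRCR_VECTKEY | AIRCR_SYSRESETREQ);
}
```

After the core halts, RAM can be read through the same probe to inspect task state, which matches the post-mortem approach described above.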

Common Pitfalls

1
Using mismatched memory attributes for the same physical address across different ARM privilege levels. When the MPU maps a peripheral as Device Memory for unprivileged tasks but the default memory map treats the same address as Normal Cached for the kernel, a store still sitting in the store buffer can be handled as a cacheable access after a privilege transition, resulting in wrong-sized bus accesses that hang the CPU.
The ARMv7-M architecture manual explicitly warns against this in section A3.5.7, but the practical implications are subtle and difficult to diagnose. Always verify that all address aliases use consistent memory attributes.
2
Assuming the program counter indicates the source of a bus hang. Modern CPUs with caches and store buffers can issue memory accesses to addresses completely unrelated to the currently executing instruction. When a cache writeback causes a bus hang, the program counter points to whatever code was running when the writeback happened, not to the code that originally stored the data.
This led the debugging team astray initially, as vector catch dumps showed tasks that weren't accessing the FMC. The actual source was cache writebacks of earlier stores that crossed a privilege boundary.
3
Placing FMC-connected peripherals at base addresses where the ARM default memory map assigns Normal Cached attributes. Even if your task-level MPU configuration correctly maps the peripheral as Device Memory, the kernel and hardware cache mechanisms may access the same address through the default map with different attributes, causing subtle and intermittent failures.
The STM32H7 specifically supports remapping FMC base addresses to correctly-attributed regions, suggesting this is a known class of issue that the silicon vendor anticipated.
4
Prematurely concluding a hardware timing fix has resolved an intermittent issue. The team initially found and fixed FPGA timing constraints that weren't being met, and assumed the inconsistent vector catch dumps were due to caching effects. The real root cause (mismatched memory attributes) persisted and resurfaced weeks later when a different code change increased the reproduction rate.
When debugging intermittent issues, ensure the root cause is fully understood and the fix has a clear causal explanation. Multiple issues can coexist, and fixing one may mask another.
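
The last pitfall can be put in rough numbers. The sketch below assumes, purely for illustration, that failures arrive at random with an exponentially distributed time to failure, and it treats the article's reproduction times (24 hours, and roughly 15 minutes as the midpoint of 10-20 minutes) as mean values. Neither assumption comes from the article, and both function names are hypothetical.

```rust
/// Probability that a bug which is still present produces no failure
/// during a soak of `soak_minutes`, assuming an exponentially
/// distributed time to failure with the given mean.
pub fn chance_bug_stays_hidden(mean_minutes_to_failure: f64, soak_minutes: f64) -> f64 {
    (-soak_minutes / mean_minutes_to_failure).exp()
}

/// Soak time needed so that a bug which is still present would stay
/// hidden with probability at most `max_hidden_chance`.
pub fn soak_minutes_for(mean_minutes_to_failure: f64, max_hidden_chance: f64) -> f64 {
    -mean_minutes_to_failure * max_hidden_chance.ln()
}
```

Under these assumptions, a clean 24-hour run against a bug that takes 24 hours on average to appear still leaves about a 37% chance that the bug is present, which is one way a partial fix can look like a complete one. With a mean of 15 minutes, a single clean hour brings that chance below 2%.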

Related Concepts

ARM Memory Protection Unit (MPU)
ARM Default Memory Map
Cache Coherency and Store Buffers
Device Memory vs Normal Cached Memory Types
Embedded RTOS Task Scheduling and Priority
Stack Overflow Detection in Embedded Systems
SWD Debugging and Vector Catch
FPGA Parallel Bus Interfaces
Measured Boot and Root of Trust
Hardware/Software Co-debugging
Flexible Memory Controller (FMC)
ARMv7-M Architecture