A disappearing Service Processor
Overview
This article details Oxide Computer's debugging journey to find why their Service Processor (SP) would intermittently disappear from the network on their next-generation Cosmo sled. The root cause turned out to be mismatched memory attributes between the ARM Cortex-M7's Memory Protection Unit (MPU) configuration for unprivileged tasks and the default memory map used by the privileged kernel, causing the CPU to issue incorrectly-sized bus accesses to an FPGA over the Flexible Memory Controller (FMC) bus.
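The mismatch described above can be modeled in a few lines. The following is a minimal sketch, not code from Hubris or from the article: it encodes the ARMv7-M default memory map by 512 MB block and checks whether a task's MPU region resolves to a different memory type than privileged code would see when it falls back to that default map. `TaskRegion` and `is_mismatched` are hypothetical names. The example addresses in the usage note (`0x6000_0000` as the default FMC NOR/PSRAM base and `0xC000_0000` as the remapped base) are assumptions recalled from the STM32H7 reference manual and should be verified against it.

```rust
/// Memory types an access can resolve to (simplified).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MemType {
    /// Normal memory: may be cached, so accesses can be merged or resized.
    Normal,
    /// Device memory: the size and number of accesses are preserved.
    Device,
    /// Strongly-ordered memory (the Private Peripheral Bus).
    StronglyOrdered,
}

/// Memory type assigned by the ARMv7-M default memory map.
/// The 4 GB address space is divided into eight 512 MB blocks.
pub fn default_map_type(addr: u32) -> MemType {
    match addr >> 29 {
        0 | 1 => MemType::Normal, // 0x0000_0000: Code, 0x2000_0000: SRAM
        2 => MemType::Device,     // 0x4000_0000: Peripheral
        3 | 4 => MemType::Normal, // 0x6000_0000, 0x8000_0000: external RAM
        5 | 6 => MemType::Device, // 0xA000_0000, 0xC000_0000: external device
        // 0xE000_0000: PPB is Strongly-ordered, the vendor area above it is Device.
        _ => {
            if addr < 0xE010_0000 {
                MemType::StronglyOrdered
            } else {
                MemType::Device
            }
        }
    }
}

/// Hypothetical description of one MPU region granted to an unprivileged task.
pub struct TaskRegion {
    pub base: u32,
    pub size: u32,
    pub mem_type: MemType,
}

/// Returns true if privileged code that falls back to the default map would
/// see a different memory type than the task does at either end of the region.
pub fn is_mismatched(region: &TaskRegion) -> bool {
    let last = region.base.saturating_add(region.size.saturating_sub(1));
    default_map_type(region.base) != region.mem_type
        || default_map_type(last) != region.mem_type
}
```

Under these assumptions, a Device region at `0x6000_0000` is reported as mismatched because the default map treats that block as Normal memory, while the same region placed at `0xC000_0000` is not.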
What You'll Learn
- How mismatched ARM MPU memory attributes can cause hard-to-diagnose bus hangs on Cortex-M processors
- Why the default ARM memory map assigns different caching attributes to different address ranges and how this affects peripheral access
- How to use ARM vector catch resets to debug a hung CPU that cannot be halted via SWD
- Why store buffer interactions between privilege levels can cause unexpected memory accesses on the FMC bus
- How to systematically narrow down hardware/software bugs when network-based debugging is unavailable
Prerequisites & Requirements
- Understanding of ARM Cortex-M architecture including MPU, memory maps, and privilege levels
- Familiarity with embedded systems concepts such as RTOS task scheduling, stack sizing, and bus protocols
- Basic understanding of CPU caching, store buffers, and memory ordering
- Experience with SWD debug probes for ARM microcontrollers (optional)
- Familiarity with FPGA interfaces and parallel bus protocols (optional)
Key Questions Answered
- What causes an ARM Cortex-M7 Service Processor to become unresponsive on the network?
- How do mismatched ARM MPU memory attributes cause bus hangs?
- How can you debug a hung ARM CPU that won't respond to SWD debug halt?
- What is the fix for FMC bus hangs caused by mismatched ARM memory attributes?
- How does Hubris OS handle task isolation and what are its stack overflow risks?
- What symptoms indicate a Service Processor has become unresponsive in an Oxide rack?
- How did the measured boot reset change help reproduce the disappearing SP bug?
- Why can't a CPU hung on a bus access be halted via debug probe?
Key Actionable Insights
1. Always ensure memory attributes are consistent across all privilege levels when using the ARM MPU. When the MPU maps a peripheral region as Device memory for unprivileged tasks, the privileged kernel's default memory map must also treat that same address range as Device memory. Mismatched attributes can cause the CPU to issue unexpected access sizes through cache writebacks. Section A3.5.7 of the ARMv7-M Architecture Reference Manual explicitly warns that with mismatched memory attributes, 'preservation of the size of accesses' is lost. ARM strongly recommends that software not use mismatched attributes for aliases of the same location.
2. Use observable indicators (like a blinking LED) as a debugging fallback when network-based debugging is unavailable. Converting a chassis LED from always-on to blinking gave the team physical evidence of whether the SP was still making progress, even without network access. This is particularly valuable in embedded systems deployed in racks or other environments where physical access is limited and network connectivity is the primary debugging channel.
3. When debugging intermittent hardware/software interaction bugs, focus on finding changes that improve the reproduction rate before attempting root cause analysis. The team's measured boot change accidentally shortened the time to reproduce from 24+ hours to 10-20 minutes, which was critical to enabling the many experiments needed to isolate the root cause. A higher reproduction rate allows rapid hypothesis testing through systematic elimination of potential causes, even when individual experiments are not immediately fruitful.
4. Be aware that modern CPUs can perform memory accesses that are invisible to the programmer, particularly through cache writebacks and store buffer drains. The program counter may have no relation to the address being accessed on the bus, which makes traditional debugging approaches such as examining the program counter misleading. This is especially relevant in systems that mix cached and uncached memory regions, or when privilege level transitions (such as interrupt entry) change the effective memory attributes.
5. When using an FPGA connected via a parallel bus interface like the STM32H7 FMC, place the FMC base address in an address region where the default ARM memory map assigns Device memory attributes. The STM32H7 specifically supports remapping the FMC base address to a correctly-attributed region to avoid cache-related bus access issues. This avoids the class of bugs where kernel-mode code (or hardware mechanisms like cache writebacks) inadvertently accesses the FPGA through a Normal cached memory mapping, potentially issuing wrong-sized bus transactions.
6. Use ARM vector catch as a diagnostic tool when standard debug halt fails on a hung CPU. Configure the CPU to halt on reset before executing the first instruction, then trigger a reset. This unsticks the processor while preserving RAM state, allowing post-mortem analysis of OS task states and memory contents even after a bus hang. The technique is especially useful when the CPU is stuck waiting for a bus acknowledgement and cannot respond to normal SWD halt requests. The register state is lost, but RAM contents remain intact.
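The vector catch technique in the last insight comes down to a few debug register writes. The sketch below is an illustration under stated assumptions, not the procedure the team used: it builds the list of writes a debug probe could issue on an ARMv7-M core to enable halting debug, arm the reset vector catch, and request a system reset. The register addresses and bit positions are the architectural ARMv7-M ones. `RegWrite` and `vector_catch_reset_sequence` are hypothetical names. Requesting the reset through `AIRCR` is only one option; on a core hung on a bus access, the reset may instead need to come from the reset pin.

```rust
/// Debug Halting Control and Status Register.
pub const DHCSR: u32 = 0xE000_EDF0;
/// Debug Exception and Monitor Control Register.
pub const DEMCR: u32 = 0xE000_EDFC;
/// Application Interrupt and Reset Control Register.
pub const AIRCR: u32 = 0xE000_ED0C;

/// Key that must be in the upper half of every DHCSR write.
pub const DHCSR_DBGKEY: u32 = 0xA05F << 16;
/// Enables halting debug.
pub const DHCSR_C_DEBUGEN: u32 = 1 << 0;
/// Halts the core on reset, before the first instruction executes.
pub const DEMCR_VC_CORERESET: u32 = 1 << 0;
/// Key that must be in the upper half of every AIRCR write.
pub const AIRCR_VECTKEY: u32 = 0x05FA << 16;
/// Requests a system reset.
pub const AIRCR_SYSRESETREQ: u32 = 1 << 2;

/// One 32-bit write for the probe to perform (hypothetical helper type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RegWrite {
    pub addr: u32,
    pub value: u32,
}

/// Builds the writes for a vector catch reset.
/// `demcr_current` is the DEMCR value read back from the target, so that
/// bits other than VC_CORERESET are left unchanged.
pub fn vector_catch_reset_sequence(demcr_current: u32) -> [RegWrite; 3] {
    [
        // 1. Enable halting debug so the vector catch can take effect.
        RegWrite { addr: DHCSR, value: DHCSR_DBGKEY | DHCSR_C_DEBUGEN },
        // 2. Arm the catch on the reset vector.
        RegWrite { addr: DEMCR, value: demcr_current | DEMCR_VC_CORERESET },
        // 3. Request the reset. This overwrites PRIGROUP, which does not
        //    matter here because the core resets immediately afterwards.
        RegWrite { addr: AIRCR, value: AIRCR_VECTKEY | AIRCR_SYSRESETREQ },
    ]
}
```

After these writes the core should come out of reset already halted, at which point the probe can read RAM to inspect task state. As the insight notes, the pre-reset register contents are gone.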