Posted on May 12, 2025 by Gabe Parmer
I recently went to CPSWeek, a “multi-conference” broadly focused on building high-confidence and trustworthy computational systems that control the physical world. Think autonomous vehicles, robots, the smart grid, airplanes, and good-ol’ simple embedded systems. The CPSWeek conferences include RTAS, ICCPS, HSCC, IPSN, and IoTDI.
I’ll cover a lot of ground in this post. Feel free to skip ahead.
Most CPSWeek papers are concerned with understanding how to create trustworthy CPS systems. “Trustworthy” here is a load-bearing term, and differs in definition across conferences. For example, in RTAS, it often means that we want to have predictable timing properties of our systems, so that we can, with limited hardware, accurately control the physical system.
There was an interesting debate that focused on the motion:
This House contends that the inherent complexity of modern engineering challenges renders exhaustive mathematical analysis overkill, and that an iterative, adaptive design approach should be prioritized—even for life-critical systems.
The most important word in the motion is “prioritized”. Without that, it is easy to vacillate between both sides.
The arguments for and against were complex and reasonable. My take-aways (which might not map well to the arguments) follow.
Side “prioritize progress”:
The best representation of the implications for the community came from Anthony Rowe. He showed a sequence of (AI-generated) images with a T-rex representing the CPS community and a meteor representing ML: (1) the meteor speeds toward the world of CPS dinosaurs, leaving only so much time until they are wiped out, (2) a proposal for what the community should do, with the T-rex hopping on, riding, and hugging the ML meteor, and (3) the final pane showing the T-rex juicing the ML meteor for all it’s worth.
Side “formal methods”:
Summary. Given that Neural Network approaches have unambiguously won in most domains, this argument should be taken seriously. That said, it should be weighed against safety concerns. In the end, society will define the risk thresholds that will likely determine if the formal methods side has legs. History shows that we have a very low threshold for airplane crashes and nuclear power incidents. However, as CPSes impact our day-to-day lives in vehicles, will the threshold change? We’re certainly OK with some level of car accidents.
For the community, I don’t see a world in which people don’t submit NN work. And there will be reviewers receptive to that. The risk is that the conferences largely receive low-quality NN work, and lose relevance due to that – what differentiates CPS NN work from the rest? There will also be quite a few reviewers receptive to traditional methodologies. If formal methods end up having no place in modern system design, this is an existential risk.
I share this without much skin in the game. I build systems, and will use whatever interesting applications help evaluate them. I deeply appreciate both formal methods and ML approaches.
Debate outcome. The formal methods side “won” the debate, as they converted more people to their side over its course. I’m sure each of you might read into that as positive or negative, as an affirmation of formal methods’ momentum, or as a harbinger of the community’s demise. Time will tell.
RTAS is one of the top three or four real-time and embedded conferences and is the one most focused on system implementation. As such, it is often the most interesting to me. I spent most of my CPSWeek time at RTAS.
RTAS is focused relatively broadly on systems that explicitly consider and are designed around latency properties. This includes real-time systems (in which we often want to ensure that computations complete by a deadline), but also other latency-sensitive systems such as edge infrastructure. For the past few years, the Call for Papers (CfP) has had a somewhat vague definition of what work is in scope for the conference (mainly, is embedded work that doesn’t explicitly consider latency in scope?), but I believe it will be broadened to a definition that admits embedded work. I consider the past few years a regression, and I hope we are returning to something approximating the older CfP wording.
The RTAS program had a number of interesting papers. I’ll start with our papers, which are quite interesting to me 😉.
Esma presented our work, a collaboration with Björn Brandenburg, on SPR: Shielded Processor Reservations with Bounded Management Overhead. SPR identifies a number of attacks on the reservation systems that are supposed to provide temporal isolation on modern systems, including the core mechanisms in Linux’s cgroups and SCHED_DEADLINE, Xen, and seL4. We observe that these systems tend to have strong theoretical properties, but that the implementations are susceptible to attacks. These mechanisms use budgets to track the rates of thread execution, decreasing the budget as a thread executes. When the budget is expended, the thread is suspended until a replenishment of the budget. Replenishment processing requires timer interrupts and scheduler logic to process each replenishment, and two of our three attacks focus on this: they force that processing to occur at the worst possible times, causing the high-priority task to be massively delayed in its processing. This is harmful in, for example, autonomous vehicles, where the processing of sensor input or pedestrian detection can be arbitrarily delayed.
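To make the mechanism concrete, below is a minimal sketch (plain C, with entirely hypothetical names and a fixed-size replenishment queue; not SPR’s or any of these kernels’ actual code) of how budget-based reservations are typically accounted: execution drains the budget, and consumed budget is queued as a replenishment that a later timer interrupt must process.

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t cycles_t;

struct replenishment {
	cycles_t time;   /* absolute time at which budget is restored */
	cycles_t amount; /* how much budget to restore */
};

struct reservation {
	cycles_t budget;             /* remaining budget in the current period */
	cycles_t period;             /* replenishment delay after consumption */
	struct replenishment rq[32]; /* pending replenishments (bounded here) */
	unsigned rq_head, rq_tail;
};

/* Charge execution time to the reservation; returns false when the
 * thread must be suspended because its budget is exhausted. */
bool
reservation_consume(struct reservation *r, cycles_t now, cycles_t exec)
{
	cycles_t used = exec < r->budget ? exec : r->budget;

	r->budget -= used;
	/* Queue a replenishment of the consumed budget one period later. */
	r->rq[r->rq_tail++ % 32] = (struct replenishment){
		.time = now + r->period, .amount = used
	};

	return r->budget > 0;
}

/* Timer-interrupt path: process every replenishment that has come due.
 * The loop's cost grows with the number of pending replenishments,
 * which is exactly the processing that the attacks described next target. */
void
reservation_replenish(struct reservation *r, cycles_t now)
{
	while (r->rq_head != r->rq_tail && r->rq[r->rq_head % 32].time <= now) {
		r->budget += r->rq[r->rq_head % 32].amount;
		r->rq_head++;
	}
}
```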
We previously published the thundering herd attack on seL4’s reservation mechanism: we essentially force the scheduler to process many attacker threads’ replenishments at exactly the time a higher-priority thread should execute. Additionally, we introduce another attack that causes a cascade of many timer interrupts to process attacker-thread replenishments during higher-priority thread execution. Last, we show that higher-priority threads can cause lower-priority threads to make only stunted progress within their reservations by constantly preempting them and causing cache interference.
SPR prevents all of these attacks by
Together, SPR hopefully represents the final say on how to enable reservations that not only provide rate-limiting security properties, but are themselves efficient and safe. Esma implemented and evaluated SPR in our Composite scheduler component (in Composite, schedulers are in user-level protection domains). We’ll pull it into the Composite mainline branch after some cleanup.
This was Esma’s first academic presentation and she did a great job!
Wenyuan presented our work on Janus: OS Support for a Secure, Fast Control-Plane. A lot of previous work has focused on decoupling the control- and data-plane, and optimizing the heck out of the data-plane (see Arrakis and Ix, for example). But in systems that require the dense deployment of multiple tenants on shared hardware (e.g. at the edge), the control-plane deserves love too. Janus enables
The core idea is best captured with a picture:
We achieve this using x86’s Memory Protection Keys (MPK), which provide instructions for the user-level switching of protection domains. MPK has been used in quite a few systems (e.g. Erim, Hodor, Donkeys, Endokernel, Underbridge, \(\mu\)switch, etc…), but Janus provides a number of unique contributions as the only system to:
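For readers unfamiliar with the primitive, here is a rough sketch of what an MPK-based protection-domain switch looks like; the PKRU values and domain names are made up for illustration, and this is not Janus’s actual switch path (which, like the systems above, must also prevent untrusted code from simply writing PKRU itself).

```c
#include <stdint.h>

/* wrpkru loads the thread-local PKRU register from eax; ecx and edx
 * must be zero.  PKRU holds an access-disable and a write-disable bit
 * for each of the 16 protection keys. */
static inline void
pkru_write(uint32_t pkru)
{
	asm volatile("wrpkru" : : "a"(pkru), "c"(0), "d"(0) : "memory");
}

/* Hypothetical PKRU values: a zero bit-pair grants access to a key, so
 * each domain enables only the keys tagging the memory it may touch. */
#define PKRU_DOMAIN_SCHED 0xfffffff0u /* keys 0-1 accessible */
#define PKRU_DOMAIN_APP   0xffffff0fu /* keys 2-3 accessible */

/* Enter the user-level scheduler's domain without a kernel trap. */
static inline void
enter_sched_domain(void)
{
	pkru_write(PKRU_DOMAIN_SCHED);
}
```

The appeal of the primitive is that updating PKRU is an unprivileged register write, so crossing protection domains avoids a kernel trap entirely.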
Results. The results are particularly strong.
First, we show that we can implement L4-style IPC (synchronous rendezvous between threads) as a custom control policy in the (user-level) scheduler component that is faster than IPC in seL4. This is a surprising result, as our policy requires IPC to the scheduler, and the scheduler dispatching between threads. Naively, if we implement fast (L4) IPC using both IPC and dispatch logic in Composite, it should be slower, but in Janus it is not!
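As a purely conceptual illustration of that structure (hypothetical code, not Janus’s implementation), the policy boils down to the client blocking itself on the server and the scheduler dispatching the server directly:

```c
#include <stdio.h>

/* Hypothetical thread state for the sketch. */
enum state { RUNNING, BLOCKED };
struct thread { const char *name; enum state st; };

static struct thread *current;

/* Stand-in for the scheduler's dispatch: in Janus this would be a
 * user-level context/domain switch; here it just records who runs. */
static void
dispatch(struct thread *t)
{
	t->st   = RUNNING;
	current = t;
}

/* L4-style synchronous call expressed as scheduler policy: block the
 * caller on the callee, then switch straight to the callee. */
static void
ipc_call(struct thread *client, struct thread *server)
{
	client->st = BLOCKED;
	dispatch(server);
}

int
main(void)
{
	struct thread client = { "client", RUNNING }, server = { "server", BLOCKED };

	current = &client;
	ipc_call(&client, &server);
	printf("now running: %s\n", current->name);

	return 0;
}
```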
Second, we show that Janus can transparently increase the performance of complex systems. Benchmarks on the security-centric, multi-protection-domain Patina RTOS show that we can achieve up to 3x faster performance than Composite, and 6.5x faster than comparable operations in Linux.
Finally, providing \(\mu\)-second service to memcached, we show that we can get 5x throughput increases, and multiple-order-of-magnitude 99p latency decreases, by combining our MPK-based fast control-plane with custom policy. These are performance increases on the level of Shinjuku & Shenango, while enabling strong isolation. The evaluation includes multi-tenant scenarios similar to Splinter.
It takes a village. This work has been quite the journey. It started around four years ago, and was performed by a large (for our lab) collection of students.
This included porting the system to x86_64 (yes, we were a dinosaur before), updating DPDK to work with the system and defining a many-core usage of DPDK that scales effectively, and porting memcached to the system while enabling multi-tenant client protection domains to harness the service. Much of this work also contributed to and enabled his byways research. All of these individuals are top-class researchers, but are also spectacular low-level system hackers.
Wenyuan and Xinyu are looking for jobs, so please reach out if this sounds interesting. Wenyuan’s looking in the US, and Xinyu’s looking in China.
A sampling of other interesting papers (I have a strong bias toward implementation work, sorry if I didn’t sample your work!!!):
If you presented any of these works, or have links to your papers, let me know and I’ll update this list.
More papers were presented by professors than at any conference I can remember. This was a sad reflection of the challenges lead students face in getting a visa to present their own work. It is sad from multiple perspectives:
I’ll just say that I hope these issues are resolved sooner than later.
This multi-conference has an interesting history of recent updates, and potential future changes:
There have been many years of relative stability in the conference line-up, yet massive changes this year, and potentially more in the future. The emphasis from everyone I talked to was on merging conferences, not “killing” them. But for me this is a distinction without a difference. In the end, we’re taking multiple conference venues that each accept publications, and merging them into a single conference that accepts fewer. So I’m going to argue that we’re simply killing conferences.
And I think this is brilliant. Academics get “credit” and positive reinforcement for creating conferences. It shows “leadership”. I’d argue strongly that outside of massively expanding fields (e.g. current ML), we should not be creating conferences. Why?
Each conference needs a set of papers that are competitively selected via peer review. That means each conference requires a program committee of volunteer researchers who are willing to spend time reviewing submissions (often four-per-paper). There are only so many volunteer hours out there, so this eats into the global pool.
Much worse, there’s only a finite number of papers generated each year, and only a small fraction of them are genuinely “strong” (say, at most 25% of those submitted to a conference). When we create conferences, we’re often just providing a venue for the papers that aren’t as strong to be published. This might (in some countries) help satisfy grant and departmental “bean counting”, but I’d argue it is massively hurtful to the community. When a large amount of work that is not of a high standard is published in a community, that community suffers: it devalues the average publication, and makes it impossible for those outside the community to understand where to find strong, relevant research.
So while I don’t know the background behind killing IPSN and IoTDI (and potentially pruning out one of HSCC/ICCPS), I applaud the community for doing so. When submission rates and the number of quality submissions fall below what can sustain a strong program, a conference will only drag down the reputation of the community (note: I don’t know if that’s what happened here).
While I was at CPSWeek, I learned that USENIX ATC was also axed. “Great, another conference killed”, right? No. USENIX ATC (henceforth USENIX) has a long history of strong publications as “the hacker’s conference”. Attendance at USENIX had, it seems, been falling off since 2020, despite strong submission numbers and strong conference output. This is a strange case where people didn’t want to go to the conference, but the output was quite strong.
I believe this is a canary in the metaphorical coal-mine of the academic conference world. Do we need to update our view of conferences?
I haven’t been to a huge number of conferences since Covid, and I’d forgotten how much they simply don’t feel like the right way to do science anymore. I very much enjoy:
These are quite valuable. Things I do not enjoy:
I’m not convinced that we’re landing on the good side of these trade-offs. At the core, I don’t find a synchronous approach to be necessary for effective CS research dissemination. That said, I’m hesitant to take this argument too seriously, as it would hurt student integration into the community.
I believe that peer-review is valuable (though in many domains it is buckling under the pressure of thousands of submissions). But we can maintain peer review without having a physical conference. We can have conferences that look a little more like journals. Our time might be better used by promoting our own work online, and creating online communities for doing so effectively.