We’ve all heard the warning: “Every extra function call adds overhead. Inline it!” In Rust async code, that worry is almost always false.

Consider an event handler where one match arm has grown well past a comfortable size:

async fn handle_event(&self, event: Event) -> Result<()> {
    match event.kind {
        EventKind::Suspend => {
            // … > 20 lines of app behavior …
        }
        // … other arms …
    }
}

The author had hot context in his head when he wrote it, and that bias shows: he justifies the sprawl and expects teammates, contributors, and his future self to accept its cost in lost readability and maintainability, all to preserve a hot context that offers no concrete win once it goes cold.

Since the Suspend arm has grown beyond a handful of lines, someone proposes extracting it. You get a clean call site and the logic in a named function:

async fn handle_event(&self, event: Event) -> Result<()> {
    match event.kind {
        EventKind::Suspend => self.handle_suspend(event).await,
        // … other arms …
    }
}

async fn handle_suspend(&self, event: Event) -> Result<()> {
    // … > 20 lines of app behavior …
}

Then someone on the team raises an eyebrow. “Isn’t that an extra function call? Indirection has a cost.” Another member quickly nods.

They’re not wrong in principle. But are they right in practice?

What’s the compiler’s opinion on this? Let’s think through it carefully and measure it.

First, the technically correct abstract concern: extracting code into an async function means:

  • Parameter passing per the platform’s calling convention (ABI).
  • Constructing and storing the callee’s future (its own little state machine).
  • Pinning the future (needed because async state machines may borrow across .await points).
  • Pushing and popping a stack frame for the call itself.
  • Extra runtime indirection through wakers, scheduling queues, and the executor.

These are all things that exist. The question is whether they matter relative to everything else going on.

In our snippet, the Suspend arm does exactly what handle_suspend does; the only question is the boundary. And the match is already a branch: a match on an enum compiles to a jump table or a chain of comparisons, so you’ve paid for that either way. Adding a function call inside one arm adds roughly one more jump and some frame setup, which is noise compared to whatever actual work Suspend is supposed to do.

When you await an async function, the compiler doesn’t allocate a new object on the heap or introduce dynamic dispatch (you’d have to Box the future explicitly, which you rarely do). Instead, it merges the callee’s state into the parent future’s state machine. That can make the abstraction genuinely free: the callee’s future gets flattened into your async fn’s state machine, and the extra await point you’re worried about may compile down to the same state transitions the inline version would have produced anyway.
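You can observe that flattening indirectly by looking at future sizes: the parent’s future physically contains the callee’s state. A minimal sketch, with illustrative function names:

```rust
use std::mem::size_of_val;

// Hypothetical child/parent pair; the names are illustrative.
async fn child(x: u64) -> u64 {
    x.wrapping_mul(3)
}

async fn parent(x: u64) -> u64 {
    // Awaiting `child` embeds its state machine inside `parent`'s future.
    child(x).await + 1
}

fn main() {
    // No Box, no dyn: both are plain values whose sizes are known at
    // compile time. The parent's future is at least as large as the
    // child's because it contains it.
    println!("child:  {} bytes", size_of_val(&child(1)));
    println!("parent: {} bytes", size_of_val(&parent(1)));
    assert!(size_of_val(&parent(1)) >= size_of_val(&child(1)));
}
```

No heap allocation, no vtable: the composition happens entirely at compile time.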

Are you letting the wrong details dominate code design?

What you are actually trying to get your system to achieve is what influences execution the most. If the Suspend path involves I/O, a lock, or any kind of allocation, you’re measuring nanosecond noise against microsecond signal. A function call boundary is on the order of a few cycles; a syscall is thousands. They don’t belong in the same conversation.

In release mode, with optimizations enabled, the compiler will often inline small extracted functions automatically. The two versions — inline and extracted — can produce identical assembly.

You can check this yourself.

/// Work done entirely inside this function (no extra call).
#[no_mangle]
fn do_the_work_inlined() -> u64 {
    let mut acc = 0u64;
    for i in 1..=10 {
        acc = acc.wrapping_mul(i).wrapping_add(12345);
    }
    acc
}

/// Same work, but behind an extracted helper (one extra function call).
#[no_mangle]
fn do_the_work_extracted() -> u64 {
    do_some_work()
}

#[no_mangle]
fn do_some_work() -> u64 {
    let mut acc = 0u64;
    for i in 1..=10 {
        acc = acc.wrapping_mul(i).wrapping_add(12345);
    }
    acc
}

Emit assembly locally:

cargo rustc --release -- --emit asm

And look at the output (it will be an asm .s file in target/release/deps). If the assembly of do_the_work_inlined and do_the_work_extracted is the same, the debate is over before it started.

Folks are arguing about a distinction that doesn’t survive compilation.

Now the thesis is clear: Indirection is not the enemy of performance — it’s a misconception that makes engineers chase invisible gains.

For completeness, let’s get a sense of scale and see what numbers we get experimentally.

cargo bench

Criterion will give you mean execution times and confidence intervals. If the error bars overlap, the difference isn’t real, and the team is letting noise dominate the discussion.
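Criterion is the right tool for real numbers. As a dependency-free illustration of what’s being compared, here is a sketch that reuses the functions from the assembly experiment with a hand-rolled timer; it has none of Criterion’s statistical rigor, which is exactly why overlapping results from it prove nothing:

```rust
use std::hint::black_box;
use std::time::Instant;

fn do_the_work_inlined() -> u64 {
    let mut acc = 0u64;
    for i in 1..=10 {
        acc = acc.wrapping_mul(i).wrapping_add(12345);
    }
    acc
}

fn do_the_work_extracted() -> u64 {
    do_some_work()
}

fn do_some_work() -> u64 {
    let mut acc = 0u64;
    for i in 1..=10 {
        acc = acc.wrapping_mul(i).wrapping_add(12345);
    }
    acc
}

fn time_it(f: fn() -> u64) -> std::time::Duration {
    let start = Instant::now();
    for _ in 0..1_000_000 {
        // black_box stops the optimizer from deleting the "unused" work.
        black_box(f());
    }
    start.elapsed()
}

fn main() {
    // Same result either way; without proper statistics, any timing
    // difference between the two is indistinguishable from noise.
    assert_eq!(do_the_work_inlined(), do_the_work_extracted());
    println!("inlined:   {:?}", time_it(do_the_work_inlined));
    println!("extracted: {:?}", time_it(do_the_work_extracted));
}
```

In a real benchmark, put this comparison in a benches/ file behind Criterion so you get means and confidence intervals instead of single samples.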

And if you ever do need to take a performance regression seriously here, remember that microbenchmarks can mislead. For a more honest picture, run your actual application under a profiler:

  • valgrind --tool=callgrind
  • perf record / perf report, or flamegraph
  • dtrace or Instruments (Time Profiler) on macOS

Look at where time is actually being spent. If handle_suspend doesn’t appear as a hotspot — or appears but at a fraction of a percent — the conversation is over.

When does indirection actually matter?

There are cases where call overhead is worth thinking about:

  • Tight inner loops — a function called millions of times per second, where cache pressure and branch prediction genuinely matter.
  • dyn Trait — dynamic dispatch is a real indirection: a vtable lookup that the compiler cannot inline through.
  • Explicit indirection in performance-critical paths — same idea: the compiler loses visibility and can’t optimize across the boundary.

These are all cases where you’re inside a CPU-intensive, blocking piece of work that, if you’re not careful, could starve every other task.
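The dyn Trait case is the easiest to see in a sketch (names are illustrative):

```rust
trait Handler {
    fn handle(&self, x: u64) -> u64;
}

struct Doubler;

impl Handler for Doubler {
    fn handle(&self, x: u64) -> u64 {
        x * 2
    }
}

// Monomorphized per concrete type: the compiler sees the exact callee
// and can inline `Doubler::handle` away entirely.
fn static_dispatch<H: Handler>(h: &H, x: u64) -> u64 {
    h.handle(x)
}

// Dispatched through a vtable at runtime: an opaque indirect call the
// optimizer generally cannot inline through.
fn dynamic_dispatch(h: &dyn Handler, x: u64) -> u64 {
    h.handle(x)
}

fn main() {
    assert_eq!(static_dispatch(&Doubler, 21), 42);
    assert_eq!(dynamic_dispatch(&Doubler, 21), 42);
    println!("both dispatch paths agree");
}
```

Same behavior, different visibility to the optimizer; only the second is a real indirection cost, and even that only matters in a hot path.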

But none of these apply to a match arm in an async function that handles discrete events. If your Suspend arm ran ten million times per second in a tight loop, you’d have other things to worry about first.

Be honest: are you really building a systems-level program where micro-optimizations have weight, or an application where system-level behavior dominates? Sometimes the problem we’re optimizing for isn’t even stable yet.

Then what about the cost that isn’t paid at runtime?

A cost that doesn’t show up in a profiler is not any less real. These human-centric costs compound over time and wildly outweigh a few nanoseconds of execution.

Every developer who opens that function pays a comprehension tax.

Cognitive load isn’t free. Every code review takes longer. Every future change carries a higher risk of introducing a bug because the context is harder to hold in your head. These costs compound over months in a way that a few nanoseconds of function call overhead never will.

And what about the next lines of code that will be added there? Will it make understanding the rules and system behavior ramifications easier or harder in that Suspend arm?

Rust’s design philosophy is explicit about this: write clean abstractions, trust the optimizer, and reach for #[inline] or #[inline(always)] only when you have measured a real problem and have no better option.
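When you do reach for that last-resort knob, it looks like this. The scenario in the comment is hypothetical; the point is that the attribute comes after the measurement, not before:

```rust
// Hypothetical: profiling showed this tiny helper dominating a hot loop,
// and the optimizer wasn't inlining it across a crate boundary.
#[inline]
pub fn hot_helper(x: u64) -> u64 {
    x.wrapping_mul(31).wrapping_add(7)
}

fn main() {
    // Behavior is unchanged; #[inline] is only a hint to the optimizer.
    assert_eq!(hot_helper(1), 38);
    println!("{}", hot_helper(1));
}
```

Note that #[inline] is a hint, not a command; #[inline(always)] is stronger and correspondingly easier to misuse.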

This is what good engineering is about.

In practice, the extra call costs nothing in release builds, and any residual difference is usually statistically insignificant. The real cost of removing indirection is the lost clarity, testability, and developer productivity, paid for a handful of nanoseconds and justified by transient personal convenience.

So next time someone asks whether to inline or extract, answer with the numbers from your release benchmark and remind them that performance wins are measured in seconds of I/O, not in a single call. Keep your code readable; let the compiler handle the rest.

Besides, is your system reaction to that event even unit testable? How do you test it clearly without one method that encapsulates that behavior?
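That may be the strongest argument for extraction. A sketch of what it buys you — the App/Event types here are hypothetical stand-ins, and the handler is shown synchronous to keep the example dependency-free (in real code it would be async and use your runtime’s test macro):

```rust
#[derive(Debug, PartialEq)]
enum EventKind {
    Suspend,
    // … other kinds …
}

struct Event {
    kind: EventKind,
}

#[derive(Default)]
struct App {
    suspended: bool,
}

impl App {
    // With the behavior extracted, the Suspend reaction has a direct,
    // testable entry point instead of living inside a giant match.
    fn handle_suspend(&mut self, event: Event) -> Result<(), String> {
        debug_assert_eq!(event.kind, EventKind::Suspend);
        self.suspended = true;
        Ok(())
    }
}

fn main() {
    let mut app = App::default();
    let event = Event { kind: EventKind::Suspend };
    assert!(app.handle_suspend(event).is_ok());
    assert!(app.suspended);
    println!("suspend handled");
}
```

Without the extracted method, testing this behavior means driving the whole handle_event loop and asserting on side effects from a distance.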

Extract the darn function. Name it well. Give it meaning. Do that, and you now have a place for a comment about the rules that apply when your system needs to react to that case. The most valuable comments are those that help you understand system behavior fast.

And if you’re worried, then measure; if a profiler ever tells you that a specific call is a bottleneck, you’ll have the data to justify the tradeoff. Until then, optimize for the people reading the code. That includes you when you need to own how it works. And by the way, the AI agents will benefit from that as much or even more.

Maintainability and understandability only show up when you’re deliberate about them. Extracting meaning into well-named functions is how you practice that. Code aesthetics are a feature and they affect team and agentic coding performance, just not the kind you measure in the runtime.

And be warned: some will resist this and surrender to the convenience of their current mental context, betting they’ll “remember” how they did it. Time will make that bet age badly. It’s 2026 — other AI agents are already in execution loops, disciplined to code better than that.

And when executing coding becomes radically cheap, how do you think your system behavior is driven? What’s the weight of understanding system behavior quickly in this new coding economy?

I don’t know about you, but you’ll never see me sacrificing meaning on the altar of “winning one less indirection.”