Performance Optimization in Rust
A comprehensive guide to writing high-performance Rust code, including profiling tools, common bottlenecks, and optimization techniques.
Table of Contents
- Zero-Cost Abstractions
- Profiling Tools
- Memory Optimization
- Avoiding Allocations
- Iterator Optimization
- Concurrency and Parallelism
- Common Bottlenecks
- Compiler Optimizations
- Benchmarking
Zero-Cost Abstractions
What Are Zero-Cost Abstractions?
Rust's abstractions are designed to have no runtime overhead: code written with high-level abstractions compiles down to essentially the same machine code as the low-level version you would write by hand.
// High-level iterator code
let sum: i32 = (0..1000)
.filter(|x| x % 2 == 0)
.map(|x| x * 2)
.sum();
// Compiles to essentially the same machine code as:
let mut sum = 0;
for i in 0..1000 {
if i % 2 == 0 {
sum += i * 2;
}
}
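The equivalence of the results is easy to check at runtime (confirming the generated machine code itself takes a tool such as Compiler Explorer); a small runnable sketch:

```rust
fn main() {
    // Iterator-chain version
    let iter_sum: i32 = (0..1000).filter(|x| x % 2 == 0).map(|x| x * 2).sum();

    // Hand-written loop version
    let mut loop_sum = 0;
    for i in 0..1000 {
        if i % 2 == 0 {
            loop_sum += i * 2;
        }
    }

    // Same result; in release builds the two compile to similar code
    assert_eq!(iter_sum, loop_sum);
    assert_eq!(iter_sum, 499_000);
}
```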
Generics (Monomorphization)
Generic code is specialized at compile time (monomorphized), so there is no virtual-dispatch overhead.
// Generic function
fn add<T: std::ops::Add<Output = T>>(a: T, b: T) -> T {
a + b
}
// At compile time, generates separate functions:
// fn add_i32(a: i32, b: i32) -> i32 { a + b }
// fn add_f64(a: f64, b: f64) -> f64 { a + b }
let x = add(5, 10); // Uses add_i32
let y = add(5.0, 10.0); // Uses add_f64
Smart Pointers
// Box has the same performance as a raw pointer
let boxed = Box::new(42);
let value = *boxed; // Zero-cost dereference
// Rc/Arc only pay for reference counting when cloned or dropped
use std::rc::Rc;
let rc = Rc::new(42);
let rc2 = Rc::clone(&rc); // Increments the ref count (small cost)
let value = *rc; // Dereference is zero-cost
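The counting cost is easy to observe directly with `Rc::strong_count`; a small sketch:

```rust
use std::rc::Rc;

fn main() {
    let rc = Rc::new(42);
    assert_eq!(Rc::strong_count(&rc), 1);

    // Cloning only bumps the reference count; the 42 itself is not copied.
    let rc2 = Rc::clone(&rc);
    assert_eq!(Rc::strong_count(&rc), 2);

    // Dropping decrements the count.
    drop(rc2);
    assert_eq!(Rc::strong_count(&rc), 1);

    // Dereferencing is a plain pointer read.
    assert_eq!(*rc, 42);
}
```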
Profiling Tools
cargo-flamegraph
Installation:
cargo install flamegraph
Usage:
# Profile your application
cargo flamegraph
# Profile specific binary
cargo flamegraph --bin my_binary
# Profile tests
cargo flamegraph --test my_test
# With specific arguments
cargo flamegraph --bin my_binary -- arg1 arg2
Output: Generates flamegraph.svg showing time spent in each function.
perf (Linux)
# Record performance data (for readable stack traces, enable debug
# symbols in release builds: set `debug = true` under [profile.release])
cargo build --release
perf record --call-graph=dwarf ./target/release/my_binary
# View report
perf report
# Generate flame graph
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
Instruments (macOS)
# Build in release mode with debug symbols
# (set `debug = true` under [profile.release] in Cargo.toml)
cargo build --release --bin my_binary
# Launch with Instruments
instruments -t "Time Profiler" ./target/release/my_binary
# On recent Xcode versions the `instruments` CLI is deprecated; use xctrace instead:
xcrun xctrace record --template "Time Profiler" --launch -- ./target/release/my_binary
cargo-binutils (cargo profdata)
# Install (provides `cargo profdata`, a wrapper around llvm-profdata)
rustup component add llvm-tools-preview
cargo install cargo-binutils
# Merge raw profile data (e.g. from PGO or coverage runs)
cargo profdata -- merge -sparse default.profraw -o default.profdata
valgrind (Memory profiling)
# Check for memory leaks
valgrind --leak-check=full ./target/debug/my_binary
# Profile cache usage
valgrind --tool=cachegrind ./target/release/my_binary
# Profile heap usage
valgrind --tool=massif ./target/release/my_binary
heaptrack (Heap profiling)
# Record heap allocations
heaptrack ./target/release/my_binary
# Analyze results
heaptrack_gui heaptrack.my_binary.12345.gz
Memory Optimization
Use Appropriate Data Structures
// Bad: Vec for small, fixed-size arrays
let coords: Vec<f64> = vec![x, y, z]; // Heap allocation
// Good: Array for fixed size
let coords: [f64; 3] = [x, y, z]; // Stack allocation
// Bad: String for small strings
let status: String = String::from("OK"); // Heap allocation
// Good: &str for literals
let status: &str = "OK"; // No allocation
// Consider: smallvec for small vectors
use smallvec::{SmallVec, smallvec};
let small: SmallVec<[i32; 4]> = smallvec![1, 2, 3]; // Stack if ≤4 items
Pre-allocate Capacity
// Bad: Growing vector causes reallocations
let mut vec = Vec::new();
for i in 0..1000 {
vec.push(i); // Multiple reallocations
}
// Good: Pre-allocate
let mut vec = Vec::with_capacity(1000);
for i in 0..1000 {
vec.push(i); // No reallocations
}
// For HashMap
let mut map = HashMap::with_capacity(100);
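The difference can be observed by watching `Vec::capacity` while pushing; a small sketch (the exact growth sequence is an implementation detail, but reallocations definitely occur when growing from empty):

```rust
fn main() {
    // Growing from empty: the capacity jumps several times,
    // each jump corresponding to a reallocation plus a copy.
    let mut grown = Vec::new();
    let mut reallocs = 0;
    let mut last_cap = grown.capacity();
    for i in 0..1000 {
        grown.push(i);
        if grown.capacity() != last_cap {
            reallocs += 1;
            last_cap = grown.capacity();
        }
    }
    assert!(reallocs > 1);

    // Pre-allocated: the capacity never changes, so no reallocations.
    let mut pre = Vec::with_capacity(1000);
    let cap = pre.capacity();
    for i in 0..1000 {
        pre.push(i);
    }
    assert_eq!(pre.capacity(), cap);
}
```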
Reuse Allocations
// Bad: Allocating in loop
for _ in 0..1000 {
let mut buffer = Vec::new();
read_data(&mut buffer);
process(&buffer);
} // Buffer dropped and reallocated each iteration
// Good: Reuse buffer
let mut buffer = Vec::new();
for _ in 0..1000 {
buffer.clear(); // Keep capacity
read_data(&mut buffer);
process(&buffer);
}
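That `clear()` keeps the allocation can be verified directly through `Vec::capacity`:

```rust
fn main() {
    let mut buffer: Vec<u8> = Vec::new();
    buffer.extend_from_slice(&[0u8; 4096]);
    let cap = buffer.capacity();

    // clear() drops the elements but keeps the allocation...
    buffer.clear();
    assert_eq!(buffer.len(), 0);
    assert_eq!(buffer.capacity(), cap);

    // ...so refilling reuses the same heap block instead of allocating.
    buffer.extend_from_slice(&[1u8; 4096]);
    assert_eq!(buffer.capacity(), cap);
}
```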
Use Compact Data Structures
// Note: with Rust's default layout (`repr(Rust)`), the compiler is free
// to reorder fields and usually removes this padding for you. With
// #[repr(C)] fields stay in declaration order, so ordering matters:
// Less efficient: declaration order forces padding for alignment
#[repr(C)]
struct Large {
    a: u8,  // 1 byte + 7 bytes padding (to align `b`)
    b: u64, // 8 bytes
    c: u8,  // 1 byte + 7 bytes padding (to align `d`)
    d: u64, // 8 bytes
} // Total: 32 bytes
// More efficient: group fields by size, largest first
#[repr(C)]
struct Compact {
    b: u64, // 8 bytes
    d: u64, // 8 bytes
    a: u8,  // 1 byte
    c: u8,  // 1 byte + 6 bytes tail padding
} // Total: 24 bytes
// Check sizes
assert_eq!(std::mem::size_of::<Large>(), 32);
assert_eq!(std::mem::size_of::<Compact>(), 24);
Use References Instead of Cloning
// Bad: takes ownership and clones unnecessarily
fn process_names(names: Vec<String>) {
    for name in names.iter() {
        let upper = name.clone().to_uppercase(); // clone is wasted: to_uppercase already returns a new String
        println!("{}", upper);
    }
}
// Good: Work with references
fn process_names(names: &[String]) {
for name in names {
let upper = name.to_uppercase(); // to_uppercase creates new String
println!("{}", upper);
}
}
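When a function usually returns its input unchanged, `std::borrow::Cow` lets you skip the allocation in the common case and only allocate when a modification is actually needed. A minimal sketch (the `sanitize` function is a hypothetical example, not from the code above):

```rust
use std::borrow::Cow;

// Returns the borrowed input when nothing needs to change;
// allocates a new String only when it actually modifies the text.
fn sanitize(input: &str) -> Cow<'_, str> {
    if input.contains(' ') {
        Cow::Owned(input.replace(' ', "_"))
    } else {
        Cow::Borrowed(input)
    }
}

fn main() {
    // No spaces: no allocation, the input is borrowed as-is.
    assert!(matches!(sanitize("ok"), Cow::Borrowed(_)));

    // Spaces: allocates one modified String.
    assert_eq!(sanitize("not ok"), "not_ok");
}
```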