Chapter 31: Performance Engineering
PRE-ALPHA WARNING: This is a pre-alpha version of The Sigma Book. Content may be incomplete, inaccurate, or subject to change. Do not use as a source of truth. For authoritative information, consult the official repositories:
- sigmastate-interpreter — Reference Scala implementation
- sigma-rust — Rust implementation
- ergo — Ergo node
Prerequisites
- Chapter 12 for evaluation model fundamentals that define hot paths
- Chapter 7 for serialization patterns to optimize
- Chapter 13 for understanding cost accounting overhead
Learning Objectives
By the end of this chapter, you will be able to:
- Identify performance-critical paths in script interpretation
- Apply Zig's comptime for zero-cost abstractions and type dispatch
- Design data structures using Struct-of-Arrays (SoA) for cache efficiency
- Use arena allocators to batch allocations and reduce overhead
- Implement SIMD and vectorization for throughput-critical operations
- Profile and benchmark interpreter components systematically
Performance Architecture
Script interpretation requires processing thousands of transactions per block [1][2]:
Performance Critical Paths
══════════════════════════════════════════════════════════════════
┌─────────────────────────────────────────────────────────────────┐
│ Transaction Flow │
│ │
│ Block (1000+ txs) │
│ │ │
│ ├── Tx 1: 3 inputs × (deserialize + evaluate + verify) │
│ ├── Tx 2: 1 input × (deserialize + evaluate + verify) │
│ ├── Tx 3: 5 inputs × (deserialize + evaluate + verify) │
│ └── ... │
│ │
│ Hot paths (per input): │
│ • Deserialization: ~50-200 opcode parses │
│ • Evaluation: ~100-500 operations │
│ • Proof verification: 1-10 EC operations │
└─────────────────────────────────────────────────────────────────┘
Performance Targets:
Deserialization: < 100 µs per script
Evaluation: < 500 µs per script
Verification: < 2 ms per input
Total per block: < 5 seconds
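The three stages compose into a single per-input routine. A minimal sketch of that hot path; deserialize, evaluate, and verifyProof are hypothetical stand-ins for the components built in earlier chapters:
/// Per-input hot path (hypothetical signatures, for illustration)
fn verifyInput(script_bytes: []const u8, proof: []const u8) !bool {
    const tree = try deserialize(script_bytes); // ~50-200 opcode parses
    const sigma_prop = try evaluate(tree); // ~100-500 operations
    return verifyProof(sigma_prop, proof); // 1-10 EC operations
}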
Comptime Optimization
Zig's comptime enables zero-cost abstractions [3][4]:
/// Compile-time type dispatch eliminates runtime branching
fn evalOperation(comptime op: OpCode, args: []const Value) !Value {
return switch (op) {
.plus => evalPlus(args),
.minus => evalMinus(args),
.multiply => evalMultiply(args),
// All branches resolved at compile time
inline else => |o| evalGeneric(o, args),
};
}
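A runtime opcode can still reach these specialized handlers: an inline else prong instantiates one compile-time-known call per enum value, leaving only the jump table at runtime. A minimal bridge over the evalOperation sketch above:
/// Dispatch a runtime opcode to the comptime-specialized handlers
fn evalDynamic(op: OpCode, args: []const Value) !Value {
    return switch (op) {
        // Each prong captures a comptime-known opcode
        inline else => |o| evalOperation(o, args),
    };
}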
/// Comptime-generated lookup tables
const op_costs = blk: {
var costs: [256]u32 = undefined;
for (0..256) |i| {
costs[i] = computeCost(@enumFromInt(i));
}
break :blk costs;
};
/// Zero-cost field access via comptime offset calculation
fn getField(comptime T: type, comptime field: []const u8, ptr: *const T) *const @TypeOf(@field(@as(T, undefined), field)) {
    // @as(T, undefined) is only type-checked inside @TypeOf, so this
    // works even when T has no default field values
    const offset = @offsetOf(T, field);
    const byte_ptr: [*]const u8 = @ptrCast(ptr);
    return @ptrCast(@alignCast(byte_ptr + offset));
}
Data-Oriented Design
Structure data for cache efficiency. The Array-of-Structs to Struct-of-Arrays transformation is a semantics-preserving isomorphism: Array[n](A × B) ≅ Array[n](A) × Array[n](B). Both represent the same data with identical behavior, but different memory layouts yield dramatically different cache performance:
/// Bad: Array of Structs (AoS) - poor cache locality for iteration
const ValueAoS = struct {
tag: ValueTag, // 1 byte
padding: [7]u8, // 7 bytes padding
data: [8]u8, // 8 bytes payload
}; // 16 bytes per value, only 9 used
/// Good: Struct of Arrays (SoA) - excellent cache locality
const ValueStore = struct {
tags: []ValueTag, // Packed tags
data: [][8]u8, // Packed payloads
len: usize,
/// Iterate tags without touching payload
pub fn countType(self: *const ValueStore, target: ValueTag) usize {
var count: usize = 0;
for (self.tags) |tag| {
count += @intFromBool(tag == target);
}
return count;
}
/// Access specific value
pub fn get(self: *const ValueStore, idx: usize) Value {
return Value.decode(self.tags[idx], self.data[idx]);
}
};
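Zig's standard library ships this transformation as a generic container: std.MultiArrayList stores each field in its own packed array. A minimal sketch, with an illustrative Slot type:
const std = @import("std");

const Slot = struct {
    tag: u8,
    data: [8]u8,
};

test "SoA layout via std.MultiArrayList" {
    var list = std.MultiArrayList(Slot){};
    defer list.deinit(std.testing.allocator);
    try list.append(std.testing.allocator, .{ .tag = 1, .data = [_]u8{0} ** 8 });
    // items(.tag) is a densely packed []u8: tag-only scans never
    // pull payload bytes into cache
    const tags = list.items(.tag);
    try std.testing.expectEqual(@as(u8, 1), tags[0]);
}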
Memory Layout Analysis
Cache Line Utilization
══════════════════════════════════════════════════════════════════
Array of Structs (AoS):
┌──────────────────────────────────────────────────────────────────┐
│ Cache Line (64 bytes) │
├──────────────────────────────────────────────────────────────────┤
│ Value[0] │ Value[1] │ Value[2] │ Value[3] │ │
│ 16 bytes │ 16 bytes │ 16 bytes │ 16 bytes │ │
│ T+P+D │ T+P+D │ T+P+D │ T+P+D │ Wasted │
└──────────────────────────────────────────────────────────────────┘
Tag iteration: 6.25% cache utilization (1 useful byte per 16 fetched)
Struct of Arrays (SoA):
┌──────────────────────────────────────────────────────────────────┐
│ Tags Cache Line (64 bytes) │
├──────────────────────────────────────────────────────────────────┤
│ T[0] T[1] T[2] ... T[63] │
│ 64 tags in single cache line │
└──────────────────────────────────────────────────────────────────┘
Tag iteration: 100% cache utilization (64 values per fetch)
Speedup: up to 16x fewer cache-line fetches for tag-only scans
Arena Allocators
Batching temporary allocations in an arena amortizes per-allocation overhead and collapses many frees into one:
const ArenaAllocator = std.heap.ArenaAllocator;
/// Evaluation context with arena for temporary allocations
const EvalContext = struct {
arena: ArenaAllocator,
constants: []const Constant,
env: Environment,
pub fn init(backing: Allocator) EvalContext {
return .{
.arena = ArenaAllocator.init(backing),
.constants = &[_]Constant{},
.env = Environment.init(),
};
}
/// All temporary allocations use arena
pub fn allocTemp(self: *EvalContext, comptime T: type, n: usize) ![]T {
return self.arena.allocator().alloc(T, n);
}
/// Single deallocation frees all temps
pub fn reset(self: *EvalContext) void {
_ = self.arena.reset(.retain_capacity);
}
pub fn deinit(self: *EvalContext) void {
self.arena.deinit();
}
};
/// Usage: batch evaluation without per-operation allocations
fn evaluateScript(tree: *const ErgoTree, allocator: Allocator) !Value {
    var ctx = EvalContext.init(allocator);
    defer ctx.deinit(); // frees every temporary in one shot
    var result: Value = undefined;
    for (tree.ops) |op| {
        result = try evalOp(op, &ctx);
    }
    return result;
}
Loop Optimization
Efficient iteration patterns:
/// Unrolled loop for fixed-size operations
fn hashBlock(data: *const [64]u8, state: *[8]u32) void {
    // Process the 64-byte block as 16 big-endian words,
    // four per fully unrolled iteration
    comptime var i: usize = 0;
    inline while (i < 64) : (i += 16) {
        const w0 = std.mem.readInt(u32, data[i..][0..4], .big);
        const w1 = std.mem.readInt(u32, data[i + 4 ..][0..4], .big);
        const w2 = std.mem.readInt(u32, data[i + 8 ..][0..4], .big);
        const w3 = std.mem.readInt(u32, data[i + 12 ..][0..4], .big);
        round(state, w0);
        round(state, w1);
        round(state, w2);
        round(state, w3);
    }
}
/// Vectorized collection operations
fn sumValues(values: []const i64) i64 {
const Vec = @Vector(4, i64);
var sum_vec: Vec = @splat(0);
var i: usize = 0;
while (i + 4 <= values.len) : (i += 4) {
const chunk: Vec = values[i..][0..4].*;
sum_vec += chunk;
}
// Reduce vector to scalar
var sum = @reduce(.Add, sum_vec);
// Handle remainder
while (i < values.len) : (i += 1) {
sum += values[i];
}
return sum;
}
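A quick test that the scalar remainder loop is exercised; length 7 is deliberately not a multiple of the vector width:
test "sumValues handles lengths that are not a multiple of 4" {
    const xs = [_]i64{ 1, 2, 3, 4, 5, 6, 7 };
    try std.testing.expectEqual(@as(i64, 28), sumValues(&xs));
}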
Memoization
Cache expensive computations [5][6]:
/// Generic memoization with comptime key type
fn Memoized(comptime K: type, comptime V: type) type {
    return struct {
        cache: std.AutoHashMap(K, V),
        const Self = @This();
        pub fn init(allocator: Allocator) Self {
            return .{ .cache = std.AutoHashMap(K, V).init(allocator) };
        }
        pub fn deinit(self: *Self) void {
            self.cache.deinit();
        }
        pub fn getOrCompute(
            self: *Self,
            key: K,
            compute: *const fn (K) V,
        ) !V {
            // Propagate OOM instead of hiding it behind catch unreachable
            const result = try self.cache.getOrPut(key);
            if (!result.found_existing) {
                result.value_ptr.* = compute(key);
            }
            return result.value_ptr.*;
        }
        pub fn reset(self: *Self) void {
            self.cache.clearRetainingCapacity();
        }
    };
}
/// Type method resolution memoization
const MethodCache = Memoized(struct { type_code: u8, method_id: u8 }, *const Method);
var method_cache: MethodCache = undefined;
fn resolveMethod(type_code: u8, method_id: u8) !*const Method {
    return method_cache.getOrCompute(
        .{ .type_code = type_code, .method_id = method_id },
        computeMethod,
    );
}
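Usage under test; the square function stands in for any pure, expensive computation:
test "Memoized returns cached values on repeat lookups" {
    var memo = Memoized(u32, u64).init(std.testing.allocator);
    defer memo.deinit();
    const square = struct {
        fn f(x: u32) u64 {
            return @as(u64, x) * x;
        }
    }.f;
    try std.testing.expectEqual(@as(u64, 144), try memo.getOrCompute(12, square));
    // Second lookup is a cache hit; compute is not invoked again
    try std.testing.expectEqual(@as(u64, 144), try memo.getOrCompute(12, square));
}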
String Interning
Avoid repeated string allocations:
const StringInterner = struct {
    table: std.StringHashMap([]const u8),
    arena: ArenaAllocator,
    pub fn init(allocator: Allocator) StringInterner {
        return .{
            .table = std.StringHashMap([]const u8).init(allocator),
            .arena = ArenaAllocator.init(allocator),
        };
    }
    pub fn deinit(self: *StringInterner) void {
        self.table.deinit();
        self.arena.deinit();
    }
    /// Return interned string (pointer stable for the interner's lifetime)
    pub fn intern(self: *StringInterner, str: []const u8) ![]const u8 {
        if (self.table.get(str)) |existing| {
            return existing;
        }
        // Allocate permanent copy
        const copy = try self.arena.allocator().dupe(u8, str);
        try self.table.put(copy, copy);
        return copy;
    }
};
// Variable names are interned for fast comparison
fn lookupVar(env: *const Environment, name: []const u8) !?Value {
    // global_interner is assumed to be initialized at startup
    const interned = try global_interner.intern(name);
    return env.bindings.get(interned);
}
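The payoff is pointer identity: interning equal strings twice returns the same slice, so comparison degrades to a pointer check:
test "interned strings are pointer-identical" {
    var interner = StringInterner.init(std.testing.allocator);
    defer interner.deinit();
    const a = try interner.intern("height");
    const b = try interner.intern("height");
    try std.testing.expect(a.ptr == b.ptr); // equality without memcmp
}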
SIMD for Crypto
Vectorized elliptic curve operations:
const builtin = @import("builtin");

/// SIMD-accelerated field multiplication (mod p)
fn mulModP(a: *const [4]u64, b: *const [4]u64) [4]u64 {
    // Architecture selection is resolved at compile time; only one
    // branch survives into the binary
    if (comptime builtin.cpu.arch.isX86()) {
        return mulModP_avx2(a, b);
    } else if (comptime builtin.cpu.arch.isAARCH64()) {
        return mulModP_neon(a, b);
    } else {
        return mulModP_scalar(a, b);
    }
}
fn mulModP_avx2(a: *const [4]u64, b: *const [4]u64) [4]u64 {
    // AVX2 implementation using 256-bit vectors
    const va: @Vector(4, u64) = a.*;
    const vb: @Vector(4, u64) = b.*;
    // Schoolbook multiplication with vector operations
    // ... (optimized implementation elided)
    _ = va;
    _ = vb;
    const result: [4]u64 = undefined;
    return result;
}
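Architecture alone does not guarantee AVX2. Gating on the build target's feature set is more precise; a sketch using std.Target.x86.featureSetHas:
const has_avx2 = builtin.cpu.arch == .x86_64 and
    std.Target.x86.featureSetHas(builtin.cpu.features, .avx2);

fn mulModPChecked(a: *const [4]u64, b: *const [4]u64) [4]u64 {
    if (comptime has_avx2) {
        return mulModP_avx2(a, b);
    }
    return mulModP_scalar(a, b);
}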
Profiling and Benchmarking
Built-in profiling support:
const Timer = struct {
start: i128,
pub fn init() Timer {
return .{ .start = std.time.nanoTimestamp() };
}
pub fn elapsed(self: *const Timer) u64 {
const now = std.time.nanoTimestamp();
return @intCast(now - self.start);
}
};
/// Benchmark harness
fn benchmark(
comptime name: []const u8,
comptime iterations: usize,
comptime warmup: usize,
func: *const fn () void,
) void {
// Warmup
for (0..warmup) |_| {
func();
}
// Measure
const timer = Timer.init();
for (0..iterations) |_| {
func();
}
const total_ns = timer.elapsed();
const ns_per_op = total_ns / iterations;
const ops_per_sec = @as(f64, 1_000_000_000) / @as(f64, @floatFromInt(ns_per_op));
std.debug.print("{s}: {} ns/op ({d:.0} ops/sec)\n", .{
name,
ns_per_op,
ops_per_sec,
});
}
// Usage
test "benchmark deserialization" {
benchmark("deserialize_script", 10000, 1000, struct {
fn run() void {
_ = deserialize(test_script);
}
}.run);
}
Memory Profiling
Track allocations in debug builds:
const DebugAllocator = struct {
backing: Allocator,
total_allocated: usize = 0,
total_freed: usize = 0,
allocation_count: usize = 0,
pub fn allocator(self: *DebugAllocator) Allocator {
return .{
.ptr = self,
.vtable = &.{
.alloc = alloc,
.resize = resize,
.free = free,
},
};
}
fn alloc(ctx: *anyopaque, len: usize, ptr_align: u8, ret_addr: usize) ?[*]u8 {
const self: *DebugAllocator = @ptrCast(@alignCast(ctx));
self.total_allocated += len;
self.allocation_count += 1;
return self.backing.rawAlloc(len, ptr_align, ret_addr);
}
// ... other methods
pub fn report(self: *const DebugAllocator) void {
std.debug.print("Allocations: {}\n", .{self.allocation_count});
std.debug.print("Total allocated: {} bytes\n", .{self.total_allocated});
std.debug.print("Total freed: {} bytes\n", .{self.total_freed});
std.debug.print("Leaked: {} bytes\n", .{self.total_allocated - self.total_freed});
}
};
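For leak detection alone, the standard library's GeneralPurposeAllocator already reports leaks on deinit in debug builds; a custom wrapper is only needed for finer-grained statistics:
test "leak detection via GeneralPurposeAllocator" {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer std.debug.assert(gpa.deinit() == .ok); // trips if anything leaked
    const allocator = gpa.allocator();
    const buf = try allocator.alloc(u8, 64);
    defer allocator.free(buf);
}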
Performance Patterns
Optimization Decision Tree
══════════════════════════════════════════════════════════════════
Is operation in hot path?
│
├── NO → Optimize for clarity, not speed
│
└── YES → Profile first, then:
│
├── CPU-bound?
│ ├── Use comptime for dispatch
│ ├── Unroll small loops
│ ├── Use SIMD where applicable
│ └── Inline critical functions
│
├── Memory-bound?
│ ├── Use SoA layout
│ ├── Pool/arena allocators
│ ├── Reduce allocations
│ └── Prefetch data
│
└── Allocation-bound?
├── Arena allocators
├── Object pools
├── String interning
└── Stack allocation where safe
Performance Checklist
When writing performance-critical code:
// ✓ Use comptime for type-level decisions
const Handler = comptime getHandler(op);
// ✓ Pre-compute lookup tables
const costs = comptime computeCostTable();
// ✓ Use SoA for iterated data
const Store = struct { tags: []Tag, values: []Value };
// ✓ Arena allocators for batch operations
var arena = ArenaAllocator.init(allocator);
defer arena.deinit();
// ✓ Inline hot functions
pub inline fn addCost(self: *CostAccum, cost: u32) !void
// ✓ Avoid allocations in tight loops
for (items) |item| {
// Process without allocation
}
// ✓ Use vectors for parallel data
const Vec4 = @Vector(4, u64);
// ✓ Profile before optimizing
// std.debug.print("elapsed: {} ns\n", .{timer.elapsed()});
Summary
- Comptime enables zero-cost abstractions and compile-time dispatch
- Data-oriented design (SoA) improves cache efficiency for field-wise scans (up to 16x fewer cache-line fetches)
- Arena allocators batch deallocations for throughput
- Loop unrolling and SIMD accelerate hot paths
- Memoization caches expensive computations
- String interning reduces allocation pressure
- Profile before optimizing: measure, don't guess
Next: Chapter 32: v6 Protocol Features
References
1. Scala: perf-style-guide.md (HOTSPOT patterns)
2. Rust: Performance-oriented design throughout sigma-rust crates
3. Scala: MemoizedFunc.scala
4. Rust: Memoization patterns in ergotree-interpreter
5. Scala: CErgoTreeEvaluator.scala (fixedCostOp)
6. Rust: eval.rs (cost tracking)