Performance
Performance characteristics, benchmarks against PUC-Rio Lua 5.1.1, and optimization history.
Goal: PUC-Rio Parity
The target is matching PUC-Rio Lua 5.1.1 (compiled with -O2) on the
official test suite. PUC-Rio Lua is written in C and serves as the
reference baseline for a Lua 5.1 implementation.
Benchmark Environment
| Property | Value |
|---|---|
| CPU | AMD Ryzen 7 8840U w/ Radeon 780M Graphics |
| OS | Fedora Linux 43 (kernel 6.18) |
| Rust | Edition 2024, --release profile |
| PUC-Rio | Lua 5.1.1, compiled with gcc -O2 -DLUA_USE_LINUX |
| Runs | 10 per test, median reported |
| Date | 2026-02-23 |
Per-Test Results (ms, median of 10 runs)
Tests from the PUC-Rio test suite run
individually. main.lua and big.lua are excluded: main.lua tests
CLI features via os.execute (environment-dependent), and big.lua
requires a coroutine wrapper set by all.lua.
| Test | PUC-Rio | rilua | Ratio |
|---|---|---|---|
| gc.lua | 70 | 85 | 1.21x |
| db.lua | 16 | 30 | 1.88x |
| calls.lua | 7 | 9 | 1.29x |
| strings.lua | 3 | 3 | 1.00x |
| literals.lua | 3 | 3 | 1.00x |
| attrib.lua | 4 | 4 | 1.00x |
| locals.lua | 4 | 6 | 1.50x |
| constructs.lua | 252 | 583 | 2.31x |
| code.lua | 2 | 2 | 1.00x |
| nextvar.lua | 13 | 28 | 2.15x |
| pm.lua | 11 | 11 | 1.00x |
| api.lua | 3 | 3 | 1.00x |
| events.lua | 3 | 3 | 1.00x |
| vararg.lua | 2 | 2 | 1.00x |
| closure.lua | 5 | 8 | 1.60x |
| errors.lua | 135 | 148 | 1.10x |
| math.lua | 5 | 6 | 1.20x |
| sort.lua | 55 | 98 | 1.78x |
| verybig.lua | 115 | 217 | 1.89x |
| files.lua | 12 | 13 | 1.08x |
| Sum | 720 | 1262 | 1.75x |
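The median and Ratio columns can be recomputed from raw timings with a few lines of Rust. This is only a sketch: `median_ms` and `ratio` are hypothetical helper names, and averaging the two middle samples for an even run count is an assumption about how the benchmark script picks the median.

```rust
/// Median of wall-clock samples in milliseconds.
/// Panics on an empty slice. For an even count it averages the two middle
/// samples (an assumption; the benchmark script may choose differently).
pub fn median_ms(samples: &mut [u64]) -> u64 {
    samples.sort_unstable();
    let n = samples.len();
    if n % 2 == 1 {
        samples[n / 2]
    } else {
        (samples[n / 2 - 1] + samples[n / 2]) / 2
    }
}

/// Slowdown factor as shown in the Ratio column: rilua time / PUC-Rio time.
pub fn ratio(rilua_ms: u64, puc_rio_ms: u64) -> f64 {
    rilua_ms as f64 / puc_rio_ms as f64
}
```

For the Sum row, `ratio(1262, 720)` formatted with two decimals gives 1.75.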
Interpretation
rilua is 1.75x slower than PUC-Rio Lua overall. Most tests are within 1.0-1.5x. Five tests account for the majority of the gap (505 of the 542 ms difference):
- constructs.lua (2.31x, +331ms): heavy control-flow constructs, deeply nested loops and conditionals. This test stresses the VM dispatch loop.
- nextvar.lua (2.15x, +15ms): table iteration (next, pairs), global table manipulation. Stresses table hash traversal.
- verybig.lua (1.89x, +102ms): large function compilation and execution with many locals and upvalues.
- db.lua (1.88x, +14ms): debug library operations, getinfo, getlocal, hook management.
- sort.lua (1.78x, +43ms): table.sort with comparison callbacks. Function call overhead per comparison.
Tests at or near parity (1.0-1.1x): strings.lua, literals.lua,
attrib.lua, code.lua, pm.lua, api.lua, events.lua,
vararg.lua, errors.lua, files.lua.
Combined Runner
bench-all.lua runs all 20 standalone tests sequentially in a single
interpreter session (like all.lua but without main.lua/big.lua
and without the dump/undump dofile override).
| Runner | PUC-Rio | rilua | Ratio |
|---|---|---|---|
| bench-all.lua | 792 | 1529 | 1.93x |
The gap to PUC-Rio is wider in the combined runner than in the sum of individual tests (1.93x vs 1.75x). Running all tests in a single interpreter session accumulates more live objects across test boundaries, increasing GC work per cycle.
Reproducing
Build both interpreters and run the benchmark script:
# Build PUC-Rio Lua 5.1.1
cd lua-5.1.1 && make linux && cd ..
# Build rilua
cargo build --release
# Run benchmarks (default: 10 runs per test)
./scripts/benchmark-tests.sh [runs]
Optimization History
Starting from ~15.4s on the full suite, four optimization phases reduced runtime to ~2.6s (83% total reduction).
Phase 1: Lexer and Parser (~7% improvement)
- Keyword lookup: match dispatch replacing binary search on sorted array
- Parser advance: mem::replace replacing Token::clone
- Lexer: fast-path byte-slice scanning for common characters
- GC traverse: zero-allocation indexed access for tables and closures
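The first two items can be illustrated with a small sketch. The `Token` variants, the `keyword` function, and the `Parser` struct below are hypothetical stand-ins, not rilua's actual types; the point is only the shape of the technique: a `match` on the byte slice for keywords, and `mem::replace` to move the current token out instead of cloning it.

```rust
use std::mem;

// Hypothetical token type; a real Lua lexer has many more variants.
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Name(String),
    And,
    End,
    While,
    Eof,
}

/// Keyword lookup as a `match` on the raw bytes, instead of a binary
/// search over a sorted keyword array.
pub fn keyword(bytes: &[u8]) -> Option<Token> {
    match bytes {
        b"and" => Some(Token::And),
        b"end" => Some(Token::End),
        b"while" => Some(Token::While),
        _ => None,
    }
}

pub struct Parser {
    current: Token,
    pending: Vec<Token>, // stand-in for the lexer, stored in reverse order
}

impl Parser {
    pub fn new(mut tokens: Vec<Token>) -> Self {
        tokens.reverse();
        let current = tokens.pop().unwrap_or(Token::Eof);
        Parser { current, pending: tokens }
    }

    /// Moves the current token out and installs the next one. No heap data
    /// owned by the token (such as the String in `Name`) is cloned.
    pub fn advance(&mut self) -> Token {
        let next = self.pending.pop().unwrap_or(Token::Eof);
        mem::replace(&mut self.current, next)
    }
}
```

Because `advance` returns the old token by value, callers that need the identifier string take ownership of it directly rather than copying it.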
Phase 2: Constant Pool (~68% reduction)
- Hash-based constant pool deduplication replacing O(n) linear scan
- Mirrors PUC-Rio's addk approach using luaH_set on fs->h
- ConstantKey enum: Num(u64) / Bool(bool) / Str(Vec<u8>)
- 15.4s -> 4.9s
Phase 3: GC and VM Inlining (~12% reduction)
- #[inline] on hot GC arena and collector methods
- sweep_partial: direct assignment replacing mem::replace on dead path
- GCSWEEPMAX: 40 -> 80 to amortize dispatch overhead
- traverse_thread: indexed access replacing Vec clone allocation
- CallInfo.is_lua cache: eliminates arena lookups in traceback
- 4.9s -> 4.3s
Phase 4: SoA Sweep Layout (~46% reduction)
- Parallel Vec<u8> color array (Structure-of-Arrays layout)
- Sweep reads 1 byte per slot instead of loading full Entry<T> (~72 bytes for tables)
- Iterator-based sweep: eliminates per-access bounds checks
- 4.9s -> 2.6s (10-run median)
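The layout change can be sketched as below. This is a simplified illustration, not rilua's arena: the `Arena` type, the color constants, and the absence of a free list are all assumptions. It shows the two ideas named above, a separate one-byte-per-slot color array and an iterator-driven sweep that avoids indexing.

```rust
pub const WHITE: u8 = 0; // unmarked: dead at sweep time
pub const BLACK: u8 = 1; // marked: survives this cycle
pub const FREE: u8 = 2; // slot is vacant

pub struct Arena<T> {
    colors: Vec<u8>,        // one byte per slot, scanned linearly by sweep
    values: Vec<Option<T>>, // payloads, written only when a slot dies
}

impl<T> Arena<T> {
    pub fn new() -> Self {
        Arena { colors: Vec::new(), values: Vec::new() }
    }

    pub fn alloc(&mut self, value: T) -> usize {
        self.colors.push(WHITE);
        self.values.push(Some(value));
        self.colors.len() - 1
    }

    pub fn mark(&mut self, idx: usize) {
        self.colors[idx] = BLACK;
    }

    /// Frees every WHITE slot and resets BLACK slots to WHITE for the next
    /// cycle. Zipped iterators replace `self.colors[i]` style indexing, so
    /// there is no per-access bounds check. Returns the number freed.
    pub fn sweep(&mut self) -> usize {
        let mut freed = 0;
        for (color, value) in self.colors.iter_mut().zip(self.values.iter_mut()) {
            match *color {
                WHITE => {
                    *value = None;
                    *color = FREE;
                    freed += 1;
                }
                BLACK => *color = WHITE,
                _ => {}
            }
        }
        freed
    }

    pub fn get(&self, idx: usize) -> Option<&T> {
        self.values.get(idx)?.as_ref()
    }
}
```

For a surviving slot the sweep touches only its color byte, so the payload's cache lines stay untouched unless the slot is actually freed.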
Profiling
Requirements
- Linux with perf installed (linux-tools-common or equivalent)
- cargo-flamegraph: cargo install flamegraph
Generating Flamegraphs
Build with debug symbols in release mode. If they are not already
enabled, set debug = true under [profile.release] in Cargo.toml:
# Profile a specific test file
cargo flamegraph -- -e "dofile('lua-5.1-tests/constructs.lua')"
# Profile the full test suite
cd lua-5.1-tests
RILUA_TEST_LIB=1 cargo flamegraph -- all.lua
Flamegraph SVGs are interactive. Open them in a browser to click-zoom into specific call stacks and search for function names.
Generated flamegraphs go in flamegraphs/ (gitignored).
Using perf Directly
cargo build --release
perf record -g --call-graph dwarf target/release/rilua lua-5.1-tests/constructs.lua
perf report
Benchmarks
Criterion Microbenchmarks
benches/interpreter.rs contains criterion benchmarks covering:
- State creation: empty, base libs, full stdlib
- Compilation: minimal, loops, functions, tables
- VM execution: arithmetic loops, fibonacci, string concat, tables, closures, metatable dispatch
- GC: full collect, allocation churn, incremental stepping
- String interning: unique strings, dedup hits
- Table operations: integer keys, string keys, mixed Lua ops
- End-to-end: compile+run, coroutine cycles
Run with:
cargo bench
Results go to target/criterion/. Use --save-baseline and
--baseline flags to compare across changes.
PUC-Rio Full Suite Benchmark
The primary wall-clock benchmark:
cargo build --release
./scripts/bench-puc-rio.sh [binary] [runs]
Arguments:
- binary: path to rilua binary (default: target/release/rilua)
- runs: number of runs (default: 5)
Output: min, median, and max times. Prints median to stdout.
Regression Gate
scripts/perf-gate.sh compares the current build against the stored
baseline with a configurable threshold (default 5%).
./scripts/perf-gate.sh [baseline_ms] [threshold_pct]
If no arguments are given, reads .perf-baseline and uses 5%.
The script:
- Builds release
- Runs bench-puc-rio.sh with 5 iterations
- Compares median against baseline + baseline * threshold / 100
- Exits 0 (pass) or 1 (regression detected)
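The comparison step amounts to a small piece of arithmetic, sketched here in Rust. The function names are hypothetical, and both the integer arithmetic and the choice to treat a median exactly at the limit as a pass are assumptions; the shell script may round or compare differently.

```rust
/// Slowest median (ms) that still passes:
/// baseline + baseline * threshold / 100.
pub fn gate_limit_ms(baseline_ms: u64, threshold_pct: u64) -> u64 {
    baseline_ms + baseline_ms * threshold_pct / 100
}

/// Exit code of the gate: 0 for pass, 1 for a detected regression.
/// A median equal to the limit counts as a pass in this sketch.
pub fn gate_exit_code(median_ms: u64, baseline_ms: u64, threshold_pct: u64) -> i32 {
    if median_ms <= gate_limit_ms(baseline_ms, threshold_pct) {
        0
    } else {
        1
    }
}
```

With a 2600 ms baseline and the default 5% threshold, the limit is 2730 ms.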
After a confirmed improvement, update the baseline:
./scripts/bench-puc-rio.sh > .perf-baseline
Optimization Priorities
Based on the per-test benchmarks, these areas offer the largest potential gains, ordered by impact:
1. VM Dispatch (constructs.lua: 2.31x, +331ms)
constructs.lua is the heaviest test and the largest absolute gap.
It exercises the main execute() loop with deeply nested control flow.
- Instruction dispatch: the match-based dispatch in execute() is the hot path. Layout optimization, opcode reordering to improve branch prediction, and reducing per-instruction overhead would have the highest impact.
- FORPREP/FORLOOP specialization: integer-only fast path for numeric for loops when bounds are integers.
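One possible shape for the integer-only fast path is sketched below. It is not rilua code: `as_int_loop` and `count_iterations` are hypothetical names, the 2^53 bound is an assumption chosen so that every intermediate value is exactly representable in both `f64` and `i64`, and a zero step simply yields no iterations here, which simplifies Lua 5.1's actual behavior.

```rust
/// Returns the loop parameters as integers when start, limit, and step are
/// all integral and small enough to be exact in f64; otherwise None.
pub fn as_int_loop(start: f64, limit: f64, step: f64) -> Option<(i64, i64, i64)> {
    const MAX_SAFE: f64 = 9_007_199_254_740_992.0; // 2^53
    // NaN and infinities fail the fract() test and fall back to floats.
    let ok = |x: f64| x.fract() == 0.0 && x.abs() < MAX_SAFE;
    if ok(start) && ok(limit) && ok(step) && step != 0.0 {
        Some((start as i64, limit as i64, step as i64))
    } else {
        None
    }
}

/// Counts iterations of `for i = start, limit, step`, using integer
/// arithmetic when possible and float arithmetic otherwise.
pub fn count_iterations(start: f64, limit: f64, step: f64) -> u64 {
    if let Some((mut i, lim, st)) = as_int_loop(start, limit, step) {
        // Fast path: integer compare and add per iteration.
        let mut n = 0;
        while (st > 0 && i <= lim) || (st < 0 && i >= lim) {
            n += 1;
            i += st;
        }
        n
    } else {
        // General path: the same loop on f64 values.
        let mut i = start;
        let mut n = 0;
        while (step > 0.0 && i <= limit) || (step < 0.0 && i >= limit) {
            n += 1;
            i += step;
        }
        n
    }
}
```

In a VM the check would run once in FORPREP, so FORLOOP only pays for the integer compare and add on each iteration.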
2. Table Operations (nextvar.lua: 2.15x, sort.lua: 1.78x)
- Hash traversal: next() and pairs() iteration speed. nextvar.lua hammers these.
- Comparison callback overhead: sort.lua calls a Lua comparison function per element pair. Reducing function call setup/teardown cost would help.
3. Compilation (verybig.lua: 1.89x, +102ms)
- AST allocation: heap-allocated AST nodes dropped after compilation. A pool or arena built from Vec-based storage could reduce allocation pressure.
- Constant folding: limited constant folding during compilation could reduce VM work for arithmetic-heavy code.
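Both ideas can be shown together in a small sketch. The `Ast`, `Expr`, and `ExprId` types are hypothetical and cover only a tiny expression subset; nodes live in one `Vec` and refer to each other by index, so building the tree costs amortized pushes rather than one heap allocation per node, and the whole tree is freed in one step.

```rust
/// Index of a node inside `Ast::nodes`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ExprId(u32);

// Hypothetical expression subset, just enough to show folding.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Expr {
    Num(f64),
    Local(u16),
    Add(ExprId, ExprId),
    Mul(ExprId, ExprId),
}

#[derive(Default)]
pub struct Ast {
    nodes: Vec<Expr>, // arena: all nodes in one contiguous allocation
}

impl Ast {
    pub fn push(&mut self, e: Expr) -> ExprId {
        self.nodes.push(e);
        ExprId((self.nodes.len() - 1) as u32)
    }

    pub fn get(&self, id: ExprId) -> Expr {
        self.nodes[id.0 as usize]
    }

    /// Limited constant folding: returns the value of the expression when
    /// every leaf is a numeric literal, and None as soon as a local appears.
    pub fn fold(&self, id: ExprId) -> Option<f64> {
        match self.get(id) {
            Expr::Num(n) => Some(n),
            Expr::Local(_) => None,
            Expr::Add(a, b) => Some(self.fold(a)? + self.fold(b)?),
            Expr::Mul(a, b) => Some(self.fold(a)? * self.fold(b)?),
        }
    }
}
```

A code generator could call `fold` before emitting an arithmetic instruction and emit a single constant load when it returns `Some`.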
4. GC Under Sustained Load (bench-all.lua: 1.93x)
The combined runner is 10% slower relative to PUC-Rio than the sum of individual tests (1.93x vs 1.75x). This indicates GC overhead grows disproportionately with accumulated state. Incremental GC tuning and sweep efficiency under high object counts are the targets here.
5. Lower-Priority Opportunities
- String concatenation: batching consecutive CONCAT operations to reduce intermediate allocations.
- Generational GC: nursery for young objects, tenured for survivors. Would reduce per-cycle work for allocation-heavy programs.
- Hash function: alternative hash functions could reduce collision rates for specific workloads.
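As an illustration of the string concatenation item, batching means computing the total length of all operands first and allocating the result once, instead of building an intermediate string for each pair. The `concat_all` function below is a hypothetical sketch that works on plain byte slices rather than on rilua's string objects.

```rust
/// Concatenates all parts with a single allocation sized up front.
pub fn concat_all(parts: &[&[u8]]) -> Vec<u8> {
    let total: usize = parts.iter().map(|p| p.len()).sum();
    let mut out = Vec::with_capacity(total);
    for p in parts {
        out.extend_from_slice(p);
    }
    out
}
```

For an expression such as `a .. b .. c .. d`, this produces one buffer rather than three successively larger intermediates.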