How To Benchmark and Optimize Unsterwerx Performance
Unsterwerx compresses enterprise document collections by 86% or more while keeping every operation auditable. But how fast is it on your hardware, with your data? The benchmark command answers that question with hard numbers: throughput per stage, storage breakdown, and trust chain verification. In this tutorial you will run benchmarks, isolate specific pipeline stages, export results for CI, compare runs against a baseline, and tune configuration parameters to match your workload.
Prerequisites
- Unsterwerx v0.5.4 or newer installed and available on your PATH. See Installation if you have not set it up yet.
- An existing data directory with ingested and normalized documents. If you are starting from scratch, work through How To Ingest and Normalize Enterprise Documents first.
- Basic familiarity with the terminal and JSON output.
Step 1 - Run a Full Benchmark
The benchmark command measures every stage of the TCA pipeline against your existing data. It operates on a temporary copy of your database, so your production data is never modified.
Run a full benchmark with the default 3 runs:
unsterwerx benchmark
Unsterwerx will execute each pipeline stage three times and average the results. The output is a single table covering timing, throughput, and storage:
Run 1/3 (in-place copy)...
Run 2/3 (in-place copy)...
Run 3/3 (in-place copy)...
Unsterwerx Benchmark (3 runs averaged)
==============================================================
Dataset: 1,131 docs / 2.1 GB
Stage Time Throughput Notes
--------------------------------------------------------
Similarity 564ms 1535.4 docs/s 866 docs -> 117 pairs (99.97% reduction)
Search (vs grep) 494ms 8.1 queries/s 314 hits (grep baseline skipped)
Classify 5.9s 191.6 docs/s 800 rule(s) active
Storage:
Original size: 2.1 GB
Universal Data: 74 MB (96.5% compaction)
DB + indexes: 208 MB
Diff artifacts: 4 MB
Total footprint: 285 MB (86.5% reduction)
Trust Chain: 991 events, integrity OK
Wall clock: 22.9s
==============================================================
There is a lot here, so let's break it down section by section.
Dataset. 1,131 documents totaling 2.1 GB of original file data. This is the raw material the pipeline processes.
Stage timings. Each row shows one pipeline stage with its wall-clock time, throughput, and a notes column with stage-specific details. Similarity analysis processed 866 canonical documents at 1,535 docs/s and found 117 candidate pairs. Search benchmarked full-text queries at 8.1 queries/s across 314 hits. Classification evaluated 800 active rules at 191.6 docs/s.
Storage. This is the headline metric. The Universal Data Set (the normalized canonical form of all ingested documents) compacts 2.1 GB down to 74 MB, or 96.5% compaction. Add the database and its indexes (208 MB) and the diff artifacts (4 MB), and the total footprint is 285 MB, an 86.5% reduction from the original files.
Trust Chain. Every operation Unsterwerx performs is logged to an append-only audit chain. The benchmark verifies the integrity of all 991 events. "integrity OK" means no gaps, no tampering, no broken hashes.
Note: The benchmark runs against a temporary copy of your database. Your production data, audit log, and indexes are untouched.
Step 2 - Benchmark Specific Stages
A full benchmark covers all stages. When you are investigating a specific bottleneck, narrow the scope with --stages and control the run count with --runs.
Benchmark just the similarity, search, and classify stages over 3 runs:
unsterwerx benchmark --stages similarity,search,classify --runs 3
Run 1/3 (in-place copy)...
Run 2/3 (in-place copy)...
Run 3/3 (in-place copy)...
Unsterwerx Benchmark (3 runs averaged)
==============================================================
Dataset: 1,131 docs / 2.1 GB
Stage Time Throughput Notes
--------------------------------------------------------
Similarity 564ms 1535.4 docs/s 866 docs -> 117 pairs (99.97% reduction)
Search (vs grep) 494ms 8.1 queries/s 314 hits (grep baseline skipped)
Classify 5.9s 191.6 docs/s 800 rule(s) active
Storage:
Original size: 2.1 GB
Universal Data: 74 MB (96.5% compaction)
DB + indexes: 208 MB
Diff artifacts: 4 MB
Total footprint: 285 MB (86.5% reduction)
Trust Chain: 991 events, integrity OK
Wall clock: 22.9s
==============================================================
The storage section always appears because it is a property of the dataset, not a benchmarked stage. The stage timings table only includes the stages you requested.
Available stage names are: ingest, canonical, similarity, diff, classify, archive, search, reconstruct. The aliases normalize, parse, and extract map to canonical; denormalize maps to reconstruct.
Increasing --runs improves statistical reliability at the cost of wall-clock time. For quick spot-checks, --runs 1 is fine. For numbers you want to report or compare, use 5 or more.
Step 3 - Export Benchmarks as JSON for CI
The table format is readable for humans. For CI pipelines, monitoring dashboards, or automated regression detection, export as JSON.
Run a benchmark with JSON output:
unsterwerx benchmark --json --runs 5
{
"dataset_docs": 1131,
"dataset_bytes": 2218177562,
"runs": 5,
"stages": [
{
"name": "Search (vs grep)",
"duration_secs": 0.49346166680000003,
"throughput_value": 8.105999450654796,
"throughput_unit": "queries/s",
"notes": "314 hits (grep baseline skipped)"
}
],
"storage": {
"original_bytes": 2218177562,
"canonical_bytes": 77332291,
"db_bytes": 217681920,
"diff_bytes": 4191439
},
"trust_chain_events": 991,
"trust_chain_ok": true,
"wall_clock_secs": 5.021894792,
"peak_rss_kb": null
}
Every field is a raw number: no formatted strings, no human-friendly units. dataset_bytes is the exact byte count (2,218,177,562 bytes ≈ 2.1 GB). canonical_bytes is the Universal Data Set size (77,332,291 bytes ≈ 74 MB). duration_secs is the averaged wall-clock time for that stage across all runs.
The trust_chain_ok boolean is the field your CI pipeline should assert on. If it is ever false, something has corrupted the audit chain and you need to investigate immediately.
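A minimal CI gate on that field could look like the following sketch. Here report.json is a stand-in for a saved --json report; jq's -e flag makes the exit status track the boolean, so the branch follows it directly:

```shell
# Sketch of a CI gate on trust chain integrity. Assumes a report
# saved from `unsterwerx benchmark --json`; report.json is a stand-in.
cat > report.json <<'EOF'
{"trust_chain_events": 991, "trust_chain_ok": true}
EOF

# jq -e exits non-zero when the expression is false or null.
if jq -e '.trust_chain_ok' report.json > /dev/null; then
  echo "Trust chain OK ($(jq '.trust_chain_events' report.json) events)"
else
  echo "Trust chain integrity FAILED" >&2
  exit 1
fi
# prints "Trust chain OK (991 events)"
```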
Save the output to a file for later comparison:
unsterwerx benchmark --json --runs 5 > benchmark-$(date +%Y%m%d).json
You can pipe to jq to extract specific values in a script:
unsterwerx benchmark --json | jq '.storage.canonical_bytes'
77332291
Or build a CI gate around compaction ratio:
result=$(unsterwerx benchmark --json)
original=$(echo "$result" | jq '.storage.original_bytes')
canonical=$(echo "$result" | jq '.storage.canonical_bytes')
ratio=$(echo "scale=4; 1 - ($canonical / $original)" | bc)
echo "Compaction: ${ratio}"
Step 4 - Compare Runs Against a Baseline
When you upgrade Unsterwerx, change configuration, or add documents, you want to know whether performance got better or worse. The --baseline flag compares the current run against a previously saved JSON report.
First, save a baseline:
unsterwerx benchmark --json --runs 5 > baseline.json
Later, after making changes, compare against it:
unsterwerx benchmark --baseline baseline.json --runs 5
The output table will include delta columns showing the difference from the baseline for each stage and storage metric. Regressions are immediately visible.
This workflow fits naturally into release testing. Before deploying a new version of Unsterwerx, run a benchmark against the previous version's baseline. If any stage regresses beyond your tolerance, investigate before proceeding.
Note: The baseline file must be a JSON report produced by unsterwerx benchmark --json. The comparison is stage-by-stage, so the stages in both runs should match for meaningful deltas.
Step 5 - Tune Configuration for Your Workload
Unsterwerx stores its configuration in TOML format in the data directory. The config command lets you inspect and modify settings without editing files by hand.
Start by reviewing the similarity section, which controls the MinHash/LSH parameters for near-duplicate detection:
unsterwerx config get similarity
[similarity]
lsh_bands = 32
lsh_rows = 4
num_hashes = 128
shingle_k = 3
threshold = 0.3
Five parameters govern similarity analysis. The one you are most likely to adjust is threshold, the Jaccard similarity cutoff. At the default of 0.3, Unsterwerx flags two documents as similar when the Jaccard similarity of their shingle sets is at least 0.3, roughly a 30% overlap. This is deliberately permissive, to catch loose variants and paraphrased duplicates.
Check the current threshold value:
unsterwerx config get similarity.threshold
similarity.threshold = 0.3
If your dataset produces too many false-positive similarity pairs, raise the threshold. A value of 0.5 means documents must share 50% of their content shingles to be flagged as similar:
unsterwerx config set similarity.threshold 0.5
similarity.threshold = 0.5
Verify the change took effect:
unsterwerx config get similarity.threshold
similarity.threshold = 0.5
Now re-run the similarity benchmark to measure the impact:
unsterwerx benchmark --stages similarity --runs 3
A higher threshold will typically produce fewer candidate pairs and faster execution, at the cost of missing looser near-duplicates. A lower threshold catches more candidates but may surface noise. The right value depends on your documents and your tolerance for false positives.
Here are the other tunable parameters worth knowing:
| Parameter | Default | Effect |
|---|---|---|
| similarity.shingle_k | 3 | Tokens per shingle. Lower values catch fine-grained similarity; higher values require more exact overlap. |
| similarity.num_hashes | 128 | MinHash signature size. More hashes improve accuracy but increase memory and compute. Must equal lsh_bands * lsh_rows. |
| similarity.lsh_bands | 32 | LSH bands. More bands increase recall (find more pairs) at the cost of precision. |
| similarity.lsh_rows | 4 | Rows per band. More rows increase precision (fewer false positives) at the cost of recall. |
| storage.zstd_level | 3 | Zstandard compression level for diff payloads. Range 1-22. Higher levels compress better but are slower. |
| storage.journal_mode | wal | SQLite journal mode. WAL is strongly recommended for read concurrency. |
Warning: The constraint num_hashes = lsh_bands * lsh_rows must hold. If you change lsh_bands or lsh_rows, update num_hashes to match, or the similarity engine will reject the configuration.
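A small pre-flight check keeps you honest when editing these three values together. The numbers below are one valid alternative layout (16 bands of 8 rows), shown purely as an example, not a recommendation:

```shell
# Verify num_hashes = lsh_bands * lsh_rows before applying a change.
lsh_bands=16
lsh_rows=8
num_hashes=128

if [ $((lsh_bands * lsh_rows)) -eq "$num_hashes" ]; then
  echo "OK: ${lsh_bands} bands x ${lsh_rows} rows = ${num_hashes} hashes"
else
  echo "Invalid: ${lsh_bands} * ${lsh_rows} != ${num_hashes}" >&2
  exit 1
fi
# prints "OK: 16 bands x 8 rows = 128 hashes"
```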
After any configuration change, re-run benchmark --stages similarity to measure the effect. Tune, benchmark, compare. That is the cycle.
Conclusion
You have measured Unsterwerx pipeline performance end-to-end, isolated individual stages for targeted profiling, exported benchmark data as JSON for CI integration, compared runs against a baseline to catch regressions, and tuned the similarity engine for your specific workload.
The headline numbers from this dataset tell the story: 2.1 GB of enterprise documents compressed to a 285 MB total footprint (86.5% reduction), with the Universal Data Set itself at just 74 MB (96.5% compaction). Similarity analysis runs at 1,535 docs/s. Full-text search handles 8.1 queries/s. And the trust chain verification confirms all 991 audit events are intact.
These numbers are your baseline. Save them. Compare against them after every upgrade, every configuration change, every major data ingest.
This is the final tutorial in the series. If you are new to Unsterwerx, start from the beginning with How To Ingest and Normalize Enterprise Documents and work through each tutorial in order. For deeper reference on any command, see the Commands section. For a complete list of all configuration parameters, see the Configuration Reference.