Unsterwerx

How To Benchmark and Optimize Unsterwerx Performance

Unsterwerx compresses enterprise document collections by 86% or more while keeping every operation auditable. But how fast is it on your hardware, with your data? The benchmark command answers that question with hard numbers: throughput per stage, storage breakdown, and trust chain verification. In this tutorial you will run benchmarks, isolate specific pipeline stages, export results for CI, compare runs against a baseline, and tune configuration parameters to match your workload.

Prerequisites

Step 1 - Run a Full Benchmark

The benchmark command measures every stage of the TCA pipeline against your existing data. It operates on a temporary copy of your database, so your production data is never modified.

Run a full benchmark with the default 3 runs:

bash
unsterwerx benchmark

Unsterwerx will execute each pipeline stage three times and average the results. The output is a single table covering timing, throughput, and storage:

text
Run 1/3 (in-place copy)...
Run 2/3 (in-place copy)...
Run 3/3 (in-place copy)...

Unsterwerx Benchmark (3 runs averaged)
==============================================================
  Dataset:           1,131 docs / 2.1 GB

  Stage                Time       Throughput       Notes
  --------------------------------------------------------
  Similarity           564ms      1535.4 docs/s    866 docs -> 117 pairs (99.97% reduction)
  Search (vs grep)     494ms      8.1 queries/s    314 hits (grep baseline skipped)
  Classify             5.9s       191.6 docs/s     800 rule(s) active

  Storage:
    Original size:   2.1 GB
    Universal Data:     74 MB   (96.5% compaction)
    DB + indexes:      208 MB
    Diff artifacts:      4 MB
    Total footprint:   285 MB   (86.5% reduction)

  Trust Chain:       991 events, integrity OK
  Wall clock:        22.9s
==============================================================

There is a lot here, so let's break it down section by section.

Dataset. 1,131 documents totaling 2.1 GB of original file data. This is the raw material the pipeline processes.

Stage timings. Each row shows one pipeline stage with its wall-clock time, throughput, and a notes column with stage-specific details. Similarity analysis processed 866 canonical documents at 1,535 docs/s and found 117 candidate pairs. Search benchmarked full-text queries at 8.1 queries/s across 314 hits. Classification evaluated 800 active rules at 191.6 docs/s.

Storage. This is the headline metric. The Universal Data Set (the normalized canonical form of all ingested documents) compacts 2.1 GB down to 74 MB. That is 96.5% compaction. Add the database indexes (208 MB) and diff artifacts (4 MB), and the total footprint is 285 MB. An 86.5% reduction from the original files.
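These percentages can be reproduced from the raw byte counts in the JSON report shown in Step 3. A quick sketch of the arithmetic (note the CLI appears to display sizes in binary megabytes, i.e. MiB):

```python
# Byte counts from the JSON benchmark report (Step 3).
original = 2_218_177_562   # original_bytes (2.1 GB)
canonical = 77_332_291     # canonical_bytes (Universal Data Set, ~74 MB)
db = 217_681_920           # db_bytes (DB + indexes)
diff = 4_191_439           # diff_bytes

compaction = 1 - canonical / original   # Universal Data Set compaction
footprint = canonical + db + diff       # total on-disk footprint
reduction = 1 - footprint / original    # overall reduction vs. original files

print(f"Compaction: {compaction:.1%}")                  # 96.5%
print(f"Footprint:  {footprint / 2**20:.0f} MB "        # 285 MB
      f"(reduction {reduction:.1%})")                   # 86.5%
```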

Trust Chain. Every operation Unsterwerx performs is logged to an append-only audit chain. The benchmark verifies the integrity of all 991 events. "integrity OK" means no gaps, no tampering, no broken hashes.

Note: The benchmark runs against a temporary copy of your database. Your production data, audit log, and indexes are untouched.

Step 2 - Benchmark Specific Stages

A full benchmark covers all stages. When you are investigating a specific bottleneck, narrow the scope with --stages and control the run count with --runs.

Benchmark just the similarity, search, and classify stages over 3 runs:

bash
unsterwerx benchmark --stages similarity,search,classify --runs 3
text
Run 1/3 (in-place copy)...
Run 2/3 (in-place copy)...
Run 3/3 (in-place copy)...

Unsterwerx Benchmark (3 runs averaged)
==============================================================
  Dataset:           1,131 docs / 2.1 GB

  Stage                Time       Throughput       Notes
  --------------------------------------------------------
  Similarity           564ms      1535.4 docs/s    866 docs -> 117 pairs (99.97% reduction)
  Search (vs grep)     494ms      8.1 queries/s    314 hits (grep baseline skipped)
  Classify             5.9s       191.6 docs/s     800 rule(s) active

  Storage:
    Original size:   2.1 GB
    Universal Data:     74 MB   (96.5% compaction)
    DB + indexes:      208 MB
    Diff artifacts:      4 MB
    Total footprint:   285 MB   (86.5% reduction)

  Trust Chain:       991 events, integrity OK
  Wall clock:        22.9s
==============================================================

The storage section always appears because it is a property of the dataset, not a benchmarked stage. The stage timings table only includes the stages you requested.

Available stage names are: ingest, canonical, similarity, diff, classify, archive, search, reconstruct. The aliases normalize, parse, and extract map to canonical; denormalize maps to reconstruct.

Increasing --runs improves statistical reliability at the cost of wall-clock time. For quick spot-checks, --runs 1 is fine. For numbers you want to report or compare, use 5 or more.
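Why more runs help: the averaged figure is only as trustworthy as the spread between runs. A quick sketch with hypothetical per-run timings (not Unsterwerx output) showing mean and relative spread:

```python
import statistics

# Hypothetical per-run durations, in seconds, for one stage.
runs = [0.564, 0.571, 0.558, 0.566, 0.561]

mean = statistics.mean(runs)
spread = statistics.stdev(runs) / mean  # relative standard deviation

print(f"mean={mean:.3f}s, spread={spread:.1%}")
```

A spread under a few percent suggests the average is stable; a large spread means you should add runs (or quiet down the machine) before trusting the number.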

Step 3 - Export Benchmarks as JSON for CI

The table format is readable for humans. For CI pipelines, monitoring dashboards, or automated regression detection, export as JSON.

Run a benchmark with JSON output:

bash
unsterwerx benchmark --json --runs 5
json
{
  "dataset_docs": 1131,
  "dataset_bytes": 2218177562,
  "runs": 5,
  "stages": [
    {
      "name": "Search (vs grep)",
      "duration_secs": 0.49346166680000003,
      "throughput_value": 8.105999450654796,
      "throughput_unit": "queries/s",
      "notes": "314 hits (grep baseline skipped)"
    }
  ],
  "storage": {
    "original_bytes": 2218177562,
    "canonical_bytes": 77332291,
    "db_bytes": 217681920,
    "diff_bytes": 4191439
  },
  "trust_chain_events": 991,
  "trust_chain_ok": true,
  "wall_clock_secs": 5.021894792,
  "peak_rss_kb": null
}

Every field is a raw number: no formatted strings, no human-friendly units. dataset_bytes is the exact byte count (2,218,177,562 bytes = 2.1 GB). canonical_bytes is the Universal Data Set size (77,332,291 bytes, about 74 MB). duration_secs is the wall-clock time for that stage, averaged across all runs.

The trust_chain_ok boolean is the field your CI pipeline should assert on. If it ever returns false, something has corrupted the audit chain and you need to investigate immediately.
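A minimal CI gate in Python (field names come from the JSON report above; the 90% compaction floor is an example value you would choose for your own pipeline):

```python
import json

def check_report(report: dict, min_compaction: float = 0.90) -> list[str]:
    """Return a list of failure messages for a benchmark JSON report."""
    failures = []
    if not report["trust_chain_ok"]:
        failures.append("trust chain integrity check failed")
    storage = report["storage"]
    compaction = 1 - storage["canonical_bytes"] / storage["original_bytes"]
    if compaction < min_compaction:
        failures.append(f"compaction {compaction:.1%} below {min_compaction:.0%}")
    return failures

# Example, using values from the report above.
report = json.loads("""{
  "trust_chain_ok": true,
  "storage": {"original_bytes": 2218177562, "canonical_bytes": 77332291}
}""")
print(check_report(report))  # [] means the gate passes
```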

Save the output to a file for later comparison:

bash
unsterwerx benchmark --json --runs 5 > benchmark-$(date +%Y%m%d).json

You can pipe to jq to extract specific values in a script:

bash
unsterwerx benchmark --json | jq '.storage.canonical_bytes'
text
77332291

Or build a CI gate around compaction ratio:

bash
# Requires jq and bc on the PATH.
result=$(unsterwerx benchmark --json)
original=$(echo "$result" | jq '.storage.original_bytes')
canonical=$(echo "$result" | jq '.storage.canonical_bytes')
ratio=$(echo "scale=4; 1 - ($canonical / $original)" | bc)
echo "Compaction: ${ratio}"

Step 4 - Compare Runs Against a Baseline

When you upgrade Unsterwerx, change configuration, or add documents, you want to know whether performance got better or worse. The --baseline flag compares the current run against a previously saved JSON report.

First, save a baseline:

bash
unsterwerx benchmark --json --runs 5 > baseline.json

Later, after making changes, compare against it:

bash
unsterwerx benchmark --baseline baseline.json --runs 5

The output table will include delta columns showing the difference from the baseline for each stage and storage metric. Regressions are immediately visible.

This workflow fits naturally into release testing. Before deploying a new version of Unsterwerx, run a benchmark against the previous version's baseline. If any stage regresses beyond your tolerance, investigate before proceeding.

Note: The baseline file must be a JSON report produced by unsterwerx benchmark --json. The comparison is stage-by-stage, so the stages in both runs should match for meaningful deltas.
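If you want the same comparison in a script rather than in the table output, the JSON reports are easy to diff yourself. A sketch that matches stages by their name field and reports the percent change in duration (sample durations below are illustrative):

```python
def stage_deltas(baseline: dict, current: dict) -> dict[str, float]:
    """Percent change in duration per stage; positive means slower than baseline."""
    base = {s["name"]: s["duration_secs"] for s in baseline["stages"]}
    deltas = {}
    for s in current["stages"]:
        if s["name"] in base:
            deltas[s["name"]] = (s["duration_secs"] / base[s["name"]] - 1) * 100
    return deltas

# Illustrative reports: Classify went from 5.9s to 6.49s (~10% slower).
baseline = {"stages": [{"name": "Classify", "duration_secs": 5.9}]}
current = {"stages": [{"name": "Classify", "duration_secs": 6.49}]}
print(stage_deltas(baseline, current))
```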

Step 5 - Tune Configuration for Your Workload

Unsterwerx stores its configuration in TOML format in the data directory. The config command lets you inspect and modify settings without editing files by hand.

Start by reviewing the similarity section, which controls the MinHash/LSH parameters for near-duplicate detection:

bash
unsterwerx config get similarity
text
[similarity]
lsh_bands = 32
lsh_rows = 4
num_hashes = 128
shingle_k = 3
threshold = 0.3

Five parameters govern similarity analysis. The one you are most likely to adjust is threshold, the Jaccard similarity cutoff. At 0.3 (the default), Unsterwerx considers two documents similar if they share 30% of their content shingles. This is deliberately permissive to catch loose variants and paraphrased duplicates.
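To make "share 30% of their content shingles" concrete, here is a toy Jaccard computation over 3-token shingles. This is plain Python for illustration only; the real engine estimates this value with MinHash signatures rather than exact set comparison:

```python
def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """k-token shingles of a document (mirrors shingle_k = 3)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: shared shingles over all shingles."""
    return len(a & b) / len(a | b) if a | b else 0.0

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over the lazy dog"

sim = jaccard(shingles(doc1), shingles(doc2))
print(f"{sim:.2f}")  # 0.40: flagged at threshold 0.3, not at 0.5
```

One changed word perturbs every shingle that overlaps it, which is why a single-word edit drops the similarity from 1.0 to 0.40 here.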

Check the current threshold value:

bash
unsterwerx config get similarity.threshold
text
similarity.threshold = 0.3

If your dataset produces too many false-positive similarity pairs, raise the threshold. A value of 0.5 means documents must share 50% of their content shingles to be flagged as similar:

bash
unsterwerx config set similarity.threshold 0.5
text
similarity.threshold = 0.5

Verify the change took effect:

bash
unsterwerx config get similarity.threshold
text
similarity.threshold = 0.5

Now re-run the similarity benchmark to measure the impact:

bash
unsterwerx benchmark --stages similarity --runs 3

A higher threshold will typically produce fewer candidate pairs and faster execution, at the cost of missing looser near-duplicates. A lower threshold catches more candidates but may surface noise. The right value depends on your documents and your tolerance for false positives.

Here are the other tunable parameters worth knowing:

| Parameter | Default | Effect |
| --- | --- | --- |
| similarity.shingle_k | 3 | Tokens per shingle. Lower values catch fine-grained similarity; higher values require more exact overlap. |
| similarity.num_hashes | 128 | MinHash signature size. More hashes improve accuracy but increase memory and compute. Must equal lsh_bands * lsh_rows. |
| similarity.lsh_bands | 32 | LSH bands. More bands increase recall (find more pairs) at the cost of precision. |
| similarity.lsh_rows | 4 | Rows per band. More rows increase precision (fewer false positives) at the cost of recall. |
| storage.zstd_level | 3 | Zstandard compression level for diff payloads. Range 1-22. Higher compresses better but is slower. |
| storage.journal_mode | wal | SQLite journal mode. WAL is strongly recommended for read concurrency. |

Warning: The constraint num_hashes = lsh_bands * lsh_rows must hold. If you change lsh_bands or lsh_rows, update num_hashes to match, or the similarity engine will reject the configuration.
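The bands/rows split controls how likely a pair with true similarity s is to become an LSH candidate. Under the standard MinHash/LSH model (an approximation of the usual technique, not necessarily this engine's exact behavior), that probability is 1 - (1 - s^rows)^bands:

```python
def candidate_probability(s: float, bands: int = 32, rows: int = 4) -> float:
    """Standard LSH S-curve: chance that a pair with Jaccard similarity s
    collides in at least one of the bands."""
    return 1 - (1 - s ** rows) ** bands

# With the defaults (32 bands x 4 rows), the curve rises steeply
# around s = (1/bands)**(1/rows), roughly 0.42.
for s in (0.3, 0.42, 0.5):
    print(f"s={s:.2f} -> {candidate_probability(s):.2f}")
```

This is why lsh_bands and lsh_rows trade recall against precision: more bands shift the curve left (more candidates), more rows per band shift it right (fewer, cleaner candidates).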

After any configuration change, re-run benchmark --stages similarity to measure the effect. Tune, benchmark, compare. That is the cycle.

Conclusion

You have measured Unsterwerx pipeline performance end-to-end, isolated individual stages for targeted profiling, exported benchmark data as JSON for CI integration, compared runs against a baseline to catch regressions, and tuned the similarity engine for your specific workload.

The headline numbers from this dataset tell the story: 2.1 GB of enterprise documents compressed to a 285 MB total footprint (86.5% reduction), with the Universal Data Set itself at just 74 MB (96.5% compaction). Similarity analysis runs at 1,535 docs/s. Full-text search handles 8.1 queries/s. And the trust chain verification confirms all 991 audit events are intact.

These numbers are your baseline. Save them. Compare against them after every upgrade, every configuration change, every major data ingest.

This is the final tutorial in the series. If you are new to Unsterwerx, start from the beginning with How To Ingest and Normalize Enterprise Documents and work through each tutorial in order. For deeper reference on any command, see the Commands section. For a complete list of all configuration parameters, see the Configuration Reference.