benchmark
Benchmarks the Unsterwerx TCA pipeline stages with timing, throughput, and storage metrics. Supports two modes: in-place (benchmarks existing data) and fresh (ingests from a source directory).
Usage
bash
unsterwerx benchmark [OPTIONS] [SOURCE]
Arguments
| Argument | Required | Description |
|---|---|---|
SOURCE | No | Source directory for fresh mode. Omit for in-place mode |
Options
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
--runs | -n | integer | 3 | Number of benchmark runs to average |
--stages | -s | string | all | Comma-separated stages: ingest, canonical, similarity, diff, classify, archive, search, reconstruct. Aliases: normalize/parse/extract map to canonical, denormalize maps to reconstruct |
--format | string | table | Output format: table or json | |
--json | flag | Shortcut for --format json | ||
--baseline | path | Path to previous JSON report for delta comparison |
Examples
In-place benchmark (existing data)
bash
unsterwerx benchmark --stages canonical,similarity
Run 1/3 (in-place copy)...
Run 2/3 (in-place copy)...
Run 3/3 (in-place copy)...
Unsterwerx Benchmark (3 runs averaged)
==============================================================
Dataset: 2,074 docs / 2.7 GB
Stage Time Throughput Notes
--------------------------------------------------------
Normalize (NACs) 1m14s 36.9 MB/s CSV: 574ms, DOCX: 8.1s, PDF: 56.6s
Similarity 3.9s 462.5 docs/s 1,806 docs -> 373 pairs
Storage:
Original size: 2.7 GB
Universal Data: 94 MB (96.6% compaction)
DB + indexes: 234 MB
Diff artifacts: 4 MB
Total footprint: 332 MB (87.9% reduction)
Trust Chain: 148 events, integrity OK
Peak RSS: 5909 MB
Wall clock: 4m3s
==============================================================
JSON output
bash
unsterwerx benchmark --format json --stages canonical
{
"dataset_docs": 2074,
"dataset_bytes": 2877788871,
"runs": 3,
"stages": [ ... ],
"storage": {
"original_bytes": 2877788871,
"canonical_bytes": 98261419,
"db_bytes": 245743616,
"diff_bytes": 4208599
},
"trust_chain_events": 148,
"trust_chain_ok": true,
"wall_clock_secs": 256.27,
"peak_rss_kb": 4609244
}
Compare against baseline
bash
unsterwerx benchmark --format json --stages canonical > baseline.json
unsterwerx benchmark --baseline baseline.json --stages canonical
Notes
- In-place mode (no
SOURCEargument): operates on a temporary copy of the existing database. Theingeststage is silently skipped in this mode since documents are already registered. - Fresh mode (with
SOURCEargument): creates a temporary data directory, ingests from scratch, and runs all requested stages. - The
searchstage benchmarks full-text search performance. Prerequisites:ingest,canonical. - The
reconstructstage benchmarks point-in-time document reconstruction. Prerequisites:ingest,canonical. - The archive stage always runs in dry-run mode during benchmarks to prevent moving original files.
- Peak RSS is read from
/proc/self/status(VmHWM). - Use
--baselinewith a previous JSON report to see performance deltas between runs.