Unsterwerx

Quick Start

This guide walks through the core Unsterwerx workflow in a few minutes: ingest, status checks, duplicate detection, pair scoring, search, classification, and audit verification.

1. Ingest Documents

Point Unsterwerx at a directory containing your documents:

bash
unsterwerx ingest /path/to/documents

Use --dry-run to preview what would be ingested without writing to the database:

bash
unsterwerx ingest --dry-run /path/to/documents
Dry Run: would ingest 2873 files

Filter by extension or file size. For example, -e pdf limits the run to PDF files:

bash
unsterwerx ingest --dry-run -e pdf /path/to/documents
Dry Run: would ingest 1184 files
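
When scripting around a dry run, it can help to pull the file count out of the summary line. The sketch below assumes the summary keeps the exact "Dry Run: would ingest N files" format shown above; dry_run_count is a hypothetical helper, not part of Unsterwerx.

```shell
# Hypothetical helper: extract the file count from a dry-run summary line.
# Assumes the "Dry Run: would ingest N files" format shown in this guide.
dry_run_count() {
  sed -n 's/^Dry Run: would ingest \([0-9][0-9]*\) files$/\1/p'
}

# Demonstrated on the sample output above, so no unsterwerx call is needed.
count=$(printf '%s\n' 'Dry Run: would ingest 1184 files' | dry_run_count)
echo "$count"   # prints 1184
```

In practice the helper would sit at the end of a pipeline that starts with the dry-run command, assuming the summary is written to standard output.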

2. Check Status

See how many documents have been ingested and how many are indexed:

bash
unsterwerx status
Unsterwerx Status
══════════════════════════════════════════
  Data directory:  /home/user/.unsterwerx
  Total documents:     2074
  Total size:        2.7 GB
  Indexed (FTS5):      1807
  Audit events:         148
══════════════════════════════════════════
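
The gap between total and indexed documents is a quick health signal. As a rough sketch, the hypothetical helper below reads status output on standard input and prints that gap; it assumes the "Total documents:" and "Indexed (FTS5):" labels appear exactly as in the sample above.

```shell
# Hypothetical helper: print how many documents are not yet in the index.
# Assumes the field labels shown in the sample status output.
unindexed_count() {
  awk -F: '
    /Total documents/  { total = $2 + 0 }    # "+ 0" coerces the padded field to a number
    /Indexed \(FTS5\)/ { indexed = $2 + 0 }
    END { print total - indexed }
  '
}

# Demonstrated on two lines from the sample output above.
printf '%s\n' '  Total documents:     2074' '  Indexed (FTS5):      1807' | unindexed_count   # prints 267
```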

3. Find Duplicates

Run similarity analysis to detect exact and near-duplicate documents:

bash
unsterwerx similarity
Similarity Analysis
══════════════════════════════════
  Documents processed:     1806
  Candidate pairs:          371
  Exact duplicates:          97
  Threshold:               0.30
══════════════════════════════════
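
For intuition about the threshold, here is a minimal sketch of Jaccard similarity, the measure that step 4 contrasts its scores against. It compares the word sets of two strings; whitespace tokenization and the jaccard helper name are assumptions of this sketch, and Unsterwerx may tokenize or shingle documents differently.

```shell
# Jaccard similarity of two word sets: size of intersection / size of union.
# Hypothetical helper; splitting on whitespace is an assumption of this sketch.
jaccard() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    na = split(a, wa, " "); nb = split(b, wb, " ")
    for (i = 1; i <= na; i++) seta[wa[i]] = 1
    for (i = 1; i <= nb; i++) setb[wb[i]] = 1
    for (w in seta) { union++; if (w in setb) inter++ }   # words in A, shared or not
    for (w in setb) if (!(w in seta)) union++             # words only in B
    if (union == 0) { print "0.00"; exit }                # two empty inputs
    printf "%.2f\n", inter / union
  }'
}

jaccard "home policy packet" "home policy renewal notice"   # prints 0.40
```

Two of the five distinct words are shared, so the score is 0.40, which would clear the 0.30 threshold shown in the sample output.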

4. Score Document Pairs and Build Vectors

Build Bayesian Business Intelligence scores that go beyond Jaccard similarity, then build knowledge vectors from the scored pairs:

bash
unsterwerx knowledge build --evaluate
unsterwerx knowledge vectors build
Building semantic features...
  Corpus: 1807 docs, 2939590 unique terms (IDF snapshot #1)

Training Bayesian model...
  Bootstrap labels: 318 positive, 636 negative
  Model trained: run #1, P(dup)=0.301, P(unrel)=0.699

Scoring candidates...
Candidates scored: 371

Evaluation:
  Post-train consistency: 100.0%
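
The P(dup) and P(unrel) values in the training output read like class priors. As a generic illustration of how Bayes' rule combines a prior with evidence (not a description of the actual Unsterwerx model), the sketch below computes a posterior from a prior and two likelihoods. The posterior_dup helper and the likelihood values 0.60 and 0.10 are hypothetical.

```shell
# Bayes' rule for two classes:
#   P(dup | x) = P(dup) * P(x | dup) / (P(dup) * P(x | dup) + P(unrel) * P(x | unrel))
# Arguments: prior P(dup), likelihood P(x | dup), likelihood P(x | unrel).
# Hypothetical helper; P(unrel) is taken as 1 - P(dup).
posterior_dup() {
  awk -v p="$1" -v ld="$2" -v lu="$3" 'BEGIN {
    printf "%.3f\n", (p * ld) / (p * ld + (1 - p) * lu)
  }'
}

# Prior from the sample output, likelihoods invented for illustration.
posterior_dup 0.301 0.60 0.10   # prints 0.721
```

With evidence six times more likely under the duplicate class, a 0.301 prior rises to roughly 0.72.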

Improve results by labeling document pairs as feedback:

bash
unsterwerx knowledge labels add --label duplicate_or_same_concept <DOC_A> <DOC_B>
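
To label several pairs at once, the command above can be wrapped in a loop. This is a sketch under assumptions: the input format (two document IDs per line), the label_pairs helper, and the UNSTERWERX override variable are all hypothetical; only the labels add invocation itself comes from this guide.

```shell
# Command to run; the UNSTERWERX variable is a hypothetical override for testing.
UNSTERWERX=${UNSTERWERX:-unsterwerx}

# Hypothetical helper: read "DOC_A DOC_B" pairs from stdin, one pair per line,
# and label each pair with the label value shown above.
label_pairs() {
  while read -r doc_a doc_b; do
    # Skip blank or incomplete lines rather than passing empty arguments.
    [ -n "$doc_a" ] && [ -n "$doc_b" ] || continue
    # Redirect stdin so the command cannot consume the remaining pairs.
    "$UNSTERWERX" knowledge labels add --label duplicate_or_same_concept \
      "$doc_a" "$doc_b" < /dev/null
  done
}
```

Feeding the helper a reviewed list, for example with label_pairs < pairs.txt, keeps the labeling step repeatable.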

This Business Intelligence pass scores candidate pairs already staged in the Universal Data Module, then clusters them into knowledge vectors for higher-level review.

Preview Business Intelligence dedup candidates inside vectors:

bash
unsterwerx knowledge dedup scan --threshold 0.8

Apply dedup only after reviewing the plan, then rebuild vectors in the Universal Data Module:

bash
unsterwerx knowledge dedup apply --confirm
unsterwerx knowledge vectors build
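
The review-then-apply sequence can be made harder to skip by wrapping it in a small function. The sketch below uses only the three commands shown above; the dedup_reviewed name, the interactive prompt, and the UNSTERWERX override variable are assumptions of this example.

```shell
# Command to run; the UNSTERWERX variable is a hypothetical override for testing.
UNSTERWERX=${UNSTERWERX:-unsterwerx}

# Hypothetical wrapper: show the scan, ask for confirmation, then apply and rebuild.
dedup_reviewed() {
  "$UNSTERWERX" knowledge dedup scan --threshold 0.8 || return 1
  printf 'Apply this dedup plan? [y/N] '
  read -r answer
  case "$answer" in
    y|Y) ;;                                            # proceed
    *) echo "Aborted; nothing applied."; return 0 ;;   # default is to do nothing
  esac
  # Rebuild vectors only if the apply step succeeded.
  "$UNSTERWERX" knowledge dedup apply --confirm &&
    "$UNSTERWERX" knowledge vectors build
}
```

Defaulting to "no" means an accidental Enter keypress leaves the database untouched.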

5. Search Content

Run a full-text search across all canonical document content:

bash
unsterwerx search "policy"
Search Results (5 matches)
══════════════════════════════════════════════════════════════
  1. Homeowners Policy Packet [d3d2da43]
     HOMEOWNERS POLICY PACKET  IMPORTANT MESSAGES...

  2. DODI Standards [455d5bb1]
     Establishing Policy in DoDIs...
══════════════════════════════════════════════════════════════
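
Each result title ends with a bracketed value that looks like a short document ID. As a sketch, the hypothetical helper below pulls those values out of search output; it assumes eight lowercase hex characters in square brackets at the end of the title line, as in the sample. Whether other commands accept these short IDs is not stated in this guide.

```shell
# Hypothetical helper: print the bracketed short IDs from search results.
# Assumes the "[xxxxxxxx]" suffix (eight hex characters) shown in the sample output.
result_ids() {
  sed -n 's/.*\[\([0-9a-f]\{8\}\)]$/\1/p'
}

# Demonstrated on lines from the sample output above.
printf '%s\n' \
  '  1. Homeowners Policy Packet [d3d2da43]' \
  '     HOMEOWNERS POLICY PACKET  IMPORTANT MESSAGES...' \
  '  2. DODI Standards [455d5bb1]' | result_ids
```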

6. Classify Documents

Apply classification rules, then view the results for a single document with --show:

bash
unsterwerx classify
unsterwerx classify --show a1b2c3d4
Classifications for a1b2c3d4...
══════════════════════════════════════════
  cv              (62%) via rule 'seed-cv' at 2026-02-25
══════════════════════════════════════════

7. View Audit Trail

Every operation is recorded in an append-only, hash-chained audit log. Use --verify to check the integrity of the chain:

bash
unsterwerx audit --verify
Verifying audit hash chain...
Chain verified: 142 events, integrity OK
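
To illustrate why a hash chain makes tampering detectable, here is a minimal sketch of the general technique. It is not the Unsterwerx log format: SHA-256 via sha256sum, one event per line, the all-zero starting hash, and the helper names are all assumptions of this example.

```shell
# Each entry's hash covers the previous hash plus the event text, so editing
# any earlier event changes every later hash.
chain_hashes() {
  prev=0000000000000000000000000000000000000000000000000000000000000000
  while IFS= read -r event; do
    prev=$(printf '%s%s' "$prev" "$event" | sha256sum | cut -d' ' -f1)
    printf '%s  %s\n' "$prev" "$event"
  done
}

# Print only the final (head) hash of a chain read from stdin.
chain_head() {
  chain_hashes | tail -n 1 | cut -d' ' -f1
}

# Changing the first event changes the head hash of the whole chain.
original=$(printf 'ingest a.pdf\nclassify a.pdf\nsearch policy\n' | chain_head)
tampered=$(printf 'ingest b.pdf\nclassify a.pdf\nsearch policy\n' | chain_head)
if [ "$original" != "$tampered" ]; then
  echo "tampering detected"
fi
```

A verifier only needs to recompute the chain from the first event and compare the result with the stored head hash.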

Next Steps