Unsterwerx

End-to-End Workflow

This guide walks through the full Unsterwerx pipeline from ingestion to archival.

Step 1: Ingest Documents

Start by scanning a directory tree for documents:

bash
unsterwerx ingest /path/to/documents

This registers every supported file (PDF, DOCX, XLSX, PPTX, TXT, CSV, Markdown, SQL) in the database together with its SHA-256 content hash. Duplicate files are skipped automatically.

Tip: Use --dry-run first to preview what will be ingested without modifying the database.
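To illustrate how content hashing lets an ingest pass skip duplicates, here is a minimal Python sketch. It is not the Unsterwerx implementation: the extension set, the in-memory `registry` dictionary, and the function names are assumptions made for illustration only.

```python
import hashlib
from pathlib import Path

# Assumed extension list, derived from the supported formats named above.
SUPPORTED = {".pdf", ".docx", ".xlsx", ".pptx", ".txt", ".csv", ".md", ".sql"}


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file's bytes in chunks so large files are never loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ingest(root: Path, registry: dict[str, Path]) -> list[Path]:
    """Register unseen files by content hash; return paths skipped as duplicates."""
    skipped = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in SUPPORTED:
            continue
        key = sha256_of(path)
        if key in registry:
            skipped.append(path)   # same bytes already registered
        else:
            registry[key] = path   # hypothetical stand-in for the database
    return skipped
```

Because the key is the hash of the file's bytes, two copies with different names or locations collapse to a single entry, while any byte-level change yields a new one.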

Step 2: Run Similarity Analysis

Find duplicate and near-duplicate documents:

bash
unsterwerx similarity

This step:

  1. Extracts canonical markdown from all unprocessed documents (the NAC pipeline)
  2. Generates MinHash signatures from text shingles
  3. Applies LSH banding to find candidate pairs
  4. Computes Jaccard similarity scores for all candidate pairs
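The last three steps can be sketched in a few lines of Python. This is an illustrative sketch only: the shingle size, the number of hash functions, the band count, and the use of seeded BLAKE2b hashes are all assumptions, not Unsterwerx's actual parameters.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text: str, k: int = 3) -> set[str]:
    """Word-level k-shingles of the lowercased text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def _hash(seed: int, shingle: str) -> int:
    """One member of a seeded hash family (assumed: 64-bit BLAKE2b)."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(sh: set[str], num_perm: int = 64) -> list[int]:
    """Signature: the minimum hash value per seeded hash function."""
    return [min(_hash(seed, s) for s in sh) for seed in range(num_perm)]


def lsh_candidates(sigs: dict[str, list[int]], bands: int = 16) -> set[tuple[str, str]]:
    """Documents sharing at least one identical band become a candidate pair."""
    if not sigs:
        return set()
    rows = len(next(iter(sigs.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in sigs.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity, computed only for candidate pairs."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Banding is what keeps the exact Jaccard computation affordable: only pairs that collide in at least one band are scored, instead of every pair in the corpus.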

Step 3: Build Knowledge Scores

Score similarity candidates using Bayesian Business Intelligence:

bash
unsterwerx knowledge build --evaluate

This builds TF-IDF semantic features from the Universal Data Set and trains a Naive Bayes model on bootstrap labels. It then computes a posterior probability for each candidate pair stored in the Universal Data Module. The --evaluate flag reports model accuracy metrics.
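To make the posterior computation concrete, the following Python sketch trains a small Bernoulli Naive Bayes model with Laplace smoothing on binary pair features. It is a simplification under stated assumptions: the real feature set is TF-IDF based, and the binary features, the label encoding (1 = duplicate, 0 = unrelated), and the function names here are hypothetical.

```python
import math


def train_bernoulli_nb(rows, labels, alpha=1.0):
    """Fit class priors and per-feature probabilities with Laplace smoothing.

    rows: list of tuples of 0/1 features (hypothetical pair features)
    labels: list of 0/1 class labels (assumed: 1 = duplicate)
    """
    n_features = len(rows[0])
    model = {}
    for cls in (0, 1):
        members = [r for r, y in zip(rows, labels) if y == cls]
        prior = (len(members) + alpha) / (len(rows) + 2 * alpha)
        probs = [
            (sum(r[j] for r in members) + alpha) / (len(members) + 2 * alpha)
            for j in range(n_features)
        ]
        model[cls] = (prior, probs)
    return model


def posterior_duplicate(model, features):
    """Return P(class = 1 | features) via Bayes' rule, computed in log space."""
    log_scores = {}
    for cls, (prior, probs) in model.items():
        score = math.log(prior)
        for x, p in zip(features, probs):
            score += math.log(p if x else 1.0 - p)
        log_scores[cls] = score
    top = max(log_scores.values())
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    return weights[1] / (weights[0] + weights[1])
```

For example, training on rows `[(1, 1), (1, 0), (0, 0), (0, 1)]` with labels `[1, 1, 0, 0]` gives a posterior of 0.75 for the pair `(1, 1)`, because only the first feature separates the two classes.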

To improve scoring with human feedback:

bash
# Review top-scored pairs and provide feedback
unsterwerx knowledge labels add --label duplicate_or_same_concept <DOC_A> <DOC_B>
unsterwerx knowledge labels add --label unrelated <DOC_A> <DOC_B>

# Rebuild: model automatically retrains with new feedback
unsterwerx knowledge build --evaluate

See the Knowledge Scoring Guide for detailed tuning advice.

Step 4: Build Knowledge Vectors

Cluster related documents into knowledge vectors in the Universal Data Module so you can inspect groups instead of isolated pairs:

bash
unsterwerx knowledge vectors build
unsterwerx knowledge vectors list --limit 20
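The guide does not specify how documents are grouped into vectors. One plausible reading, shown here purely as an assumption, is to treat scored pairs as graph edges and take connected components above a score threshold. The function name, the tuple layout, and the 0.8 threshold are hypothetical.

```python
def cluster_pairs(pairs, threshold=0.8):
    """Group documents into connected components using union-find.

    pairs: iterable of (doc_a, doc_b, score) tuples (hypothetical layout).
    Only edges with score >= threshold join two documents.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, score in pairs:
        root_a, root_b = find(a), find(b)  # also registers both documents
        if score >= threshold and root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for doc in list(parent):
        groups.setdefault(find(doc), set()).add(doc)
    return sorted(sorted(g) for g in groups.values())
```

With this approach, a chain of pairwise matches (A with B, B with C) yields one group of three, which is what makes groups easier to review than isolated pairs.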

Step 5: Review and Apply BI Dedup

Use the vector graph and the Bayesian posteriors to apply Business Intelligence hierarchy rules and identify redundant versions:

bash
# Preview first
unsterwerx knowledge dedup scan --threshold 0.8

# Execute after review
unsterwerx knowledge dedup apply --confirm

# Refresh vector membership and edges
unsterwerx knowledge vectors build

The knowledge dedup apply command marks removed documents as deduplicated in the Shared Sandbox, merges their provenance onto the kept document, and, when canonical content is available, stores rollback diffs in the Universal Data Module.

Step 6: Compute Diffs

Compare similar document pairs to see exactly what changed:

bash
unsterwerx diff --all

This computes structural diffs for all candidate pairs identified by similarity analysis. To view the diff for a specific pair:

bash
unsterwerx diff --doc-a <ID> --doc-b <ID>
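As a rough picture of what a diff between two canonical markdown documents looks like, the sketch below uses Python's standard `difflib`. It is a line-based unified diff, which is a simpler stand-in for the structural diffs Unsterwerx computes. The function name and default labels are assumptions.

```python
import difflib


def markdown_diff(old: str, new: str,
                  old_name: str = "doc-a", new_name: str = "doc-b") -> list[str]:
    """Return a unified diff between two markdown texts, one entry per line."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile=old_name,
            tofile=new_name,
            lineterm="",  # inputs have no trailing newlines, so add none
        )
    )


if __name__ == "__main__":
    before = "# Policy\n\nRetention is 5 years.\nOwner: Legal"
    after = "# Policy\n\nRetention is 7 years.\nOwner: Legal"
    print("\n".join(markdown_diff(before, after)))
```

Lines prefixed with `-` appear only in the first document and lines prefixed with `+` only in the second; unchanged context lines carry a leading space.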

Step 7: Search Content

Search across all canonical document content:

bash
unsterwerx search "cybersecurity"

Full-text search is powered by SQLite FTS5 and returns results ranked by relevance with content snippets.
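For readers unfamiliar with FTS5, this Python sketch shows the general pattern of a ranked full-text query with snippets, using the standard `sqlite3` module. The table name, column names, and helper functions are assumptions and do not reflect the actual Unsterwerx schema. It requires an SQLite build with FTS5 enabled.

```python
import sqlite3


def build_index(docs):
    """Create an in-memory FTS5 index from (doc_id, body) tuples."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: doc_id is stored but not tokenized.
    conn.execute("CREATE VIRTUAL TABLE canonical_fts USING fts5(doc_id UNINDEXED, body)")
    conn.executemany("INSERT INTO canonical_fts (doc_id, body) VALUES (?, ?)", docs)
    return conn


def search(conn, query, limit=10):
    """Return (doc_id, excerpt) rows, best match first."""
    sql = """
        SELECT doc_id,
               snippet(canonical_fts, 1, '[', ']', '...', 8) AS excerpt
        FROM canonical_fts
        WHERE canonical_fts MATCH ?
        ORDER BY rank
        LIMIT ?
    """
    return conn.execute(sql, (query, limit)).fetchall()


if __name__ == "__main__":
    index = build_index([
        ("d1", "Annual cybersecurity policy review for all staff"),
        ("d2", "Quarterly sales figures by region"),
    ])
    for doc_id, excerpt in search(index, "cybersecurity"):
        print(doc_id, excerpt)
```

In FTS5, `ORDER BY rank` sorts by the built-in BM25 relevance score by default, and `snippet()` returns a short excerpt with the matched terms wrapped in the given markers.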

Step 8: Set Up Classification Rules

Define rules to automatically classify documents by type:

bash
unsterwerx rules add \
    --name "my-contracts" \
    --class contract \
    --filename-pattern "(?i)(contract|agreement)" \
    --content-pattern "(?i)(hereby\s+agree|terms\s+and\s+conditions)" \
    --match-all

View active rules:

bash
unsterwerx rules list
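The two patterns in the rule above can be tested outside the tool before you register them. The Python sketch below assumes that --match-all means every pattern must match and that without it any single match is enough. That reading, like the `Rule` class itself, is an assumption rather than documented behavior, and Python's regex dialect may differ in detail from the one Unsterwerx uses.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    """Hypothetical in-memory mirror of a classification rule."""
    name: str
    doc_class: str
    filename_pattern: str
    content_pattern: str
    match_all: bool = False

    def matches(self, filename: str, content: str) -> bool:
        hits = [
            re.search(self.filename_pattern, filename) is not None,
            re.search(self.content_pattern, content) is not None,
        ]
        # Assumed semantics: all patterns with match_all, otherwise any.
        return all(hits) if self.match_all else any(hits)


rule = Rule(
    name="my-contracts",
    doc_class="contract",
    filename_pattern=r"(?i)(contract|agreement)",
    content_pattern=r"(?i)(hereby\s+agree|terms\s+and\s+conditions)",
    match_all=True,
)
```

Here `rule.matches("Service_Agreement.pdf", "The parties hereby agree to the following")` is true, while the same filename with unrelated content is rejected because both patterns are required.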

Step 9: Classify Documents

Apply classification rules to all documents:

bash
unsterwerx classify

View classification results for a specific document:

bash
unsterwerx classify --show <DOCUMENT_ID>

Step 10: Set Up Retention Policies

Define retention policies per document class:

bash
unsterwerx rules policy \
    --name "contract-retention" \
    --class contract \
    --retention-years 7 \
    --immutable \
    --action move

Step 11: Archive

Apply retention policies to move or delete documents past their retention period:

bash
unsterwerx archive --dry-run    # Preview first
unsterwerx archive              # Execute
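The retention check itself amounts to simple date arithmetic, sketched below in Python. Which date Unsterwerx uses as the start of the retention period is not stated in this guide, so the `start` parameter and both function names are assumptions.

```python
from datetime import date


def retention_expiry(start: date, years: int) -> date:
    """Date on which a document's retention period ends."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        # start is 29 February and the target year is not a leap year
        return start.replace(year=start.year + years, day=28)


def is_due(start: date, years: int, today: date) -> bool:
    """True once the retention period has fully elapsed."""
    return today >= retention_expiry(start, years)
```

Under a seven-year policy, a document whose period started on 1 March 2017 becomes eligible on 1 March 2024 and not a day earlier.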

Step 12: Reconstruct

Export any document from the canonical store:

bash
unsterwerx reconstruct <DOCUMENT_ID> -o output.md
unsterwerx reconstruct <DOCUMENT_ID> -o output.pdf -f pdf

Step 13: Verify Audit Trail

Confirm the integrity of all operations:

bash
unsterwerx audit --verify
Verifying audit hash chain...
Chain verified: 142 events, integrity OK
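A hash chain makes tampering detectable because each event's hash covers the hash of the event before it. The Python sketch below shows the general technique. The event layout, the all-zero genesis value, and the JSON serialization are assumptions, not the Unsterwerx audit format.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash preceding the first event


def event_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous hash together with a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_event(chain: list, payload: dict) -> None:
    """Append an event linked to the current tail of the chain."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev_hash": prev,
                  "hash": event_hash(prev, payload)})


def verify_chain(chain: list) -> bool:
    """Recompute every link; any edited, removed, or reordered event fails."""
    prev = GENESIS
    for event in chain:
        if event["prev_hash"] != prev:
            return False
        if event["hash"] != event_hash(prev, event["payload"]):
            return False
        prev = event["hash"]
    return True
```

Changing any stored payload invalidates that event's hash, and because later events embed it, the break cannot be repaired without rewriting every subsequent link.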

Monitoring

Check overall system status at any time:

bash
unsterwerx status --detailed

Run benchmarks to measure pipeline performance:

bash
unsterwerx benchmark --stages canonical,similarity