End-to-End Workflow

This guide walks through the full Unsterwerx pipeline from ingestion to archival.

Step 1: Ingest Documents

Start by scanning a directory tree for documents:

```bash
unsterwerx ingest /path/to/documents
```

This registers all supported files (PDF, DOCX, XLSX, PPTX, TXT, CSV, Markdown, SQL) in the database with their SHA-256 content hashes. Duplicate files are automatically skipped.
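The idea behind skipping duplicates by content hash can be sketched as follows. This is an illustration only, not Unsterwerx's actual code: the function names `sha256_of` and `register_unique` and the in-memory `registry` dictionary are hypothetical stand-ins for whatever the database layer does.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 16) -> str:
    """Hash a file's bytes in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register_unique(paths, registry=None):
    """Register each path under its content hash (hypothetical helper).

    Returns (registry, skipped): registry maps hash -> first path seen,
    skipped lists later paths whose content was already registered.
    """
    registry = {} if registry is None else registry
    skipped = []
    for path in paths:
        path = Path(path)
        key = sha256_of(path)
        if key in registry:
            skipped.append(path)  # same bytes as an earlier file
        else:
            registry[key] = path
    return registry, skipped
```

Because the key is derived from file contents rather than file names, two copies of the same document under different names collapse to one entry.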

Tip: Use --dry-run first to preview what will be ingested without modifying the database.

Step 2: Run Similarity Analysis

Find duplicate and near-duplicate documents:

```bash
unsterwerx similarity
```

This step:

1. Extracts canonical markdown from all unprocessed documents (the NAC pipeline)
2. Generates MinHash signatures from text shingles
3. Applies LSH banding to find candidate pairs
4. Computes Jaccard similarity scores for all candidate pairs
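To make the shingle, MinHash, LSH, and Jaccard stages concrete, here is a minimal sketch of the general technique. The parameters (3-word shingles, 64 hash functions, 16 bands) and all function names are assumptions chosen for illustration; the values Unsterwerx actually uses are not stated in this guide.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Split text into overlapping k-word shingles (k=3 is an assumed value)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash(shingle_set, num_hashes=64):
    """One minimum per salted hash function; similar sets share many minima."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def lsh_candidates(signatures, bands=16):
    """Documents whose signatures agree on every row of any band become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a, b):
    """Exact Jaccard similarity, computed only for the candidate pairs."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```

Banding is what keeps the step tractable: only documents that collide in at least one band are compared exactly, so most unrelated pairs are never scored.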

Step 3: Build Knowledge Scores

Score similarity candidates using Bayesian Business Intelligence:

```bash
unsterwerx knowledge build --evaluate
```

This builds TF-IDF semantic features from the Universal Data Set, trains a Naive Bayes model on bootstrap labels, and computes posterior probabilities for each candidate pair stored in the Universal Data Module. The --evaluate flag shows model accuracy metrics.
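As a rough illustration of how TF-IDF features and a Naive Bayes posterior fit together, the sketch below computes a TF-IDF cosine similarity for a pair and feeds a binary feature into a Laplace-smoothed Bernoulli Naive Bayes model. The feature name `high_cosine`, the smoothing scheme, and every function name are assumptions; the real feature set and bootstrap labeling logic are not described in this guide.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: {doc_id: text}. Returns {doc_id: {term: tf-idf weight}}."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    doc_freq = Counter(t for toks in tokenized.values() for t in set(toks))
    n = len(docs)
    return {
        d: {
            t: (count / len(toks)) * (math.log((1 + n) / (1 + doc_freq[t])) + 1)
            for t, count in Counter(toks).items()
        }
        for d, toks in tokenized.items()
    }


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def train_bernoulli_nb(examples):
    """examples: list of (features: dict[str, bool], label: str)."""
    label_counts = Counter(label for _, label in examples)
    true_counts = {label: Counter() for label in label_counts}
    for features, label in examples:
        for name, value in features.items():
            true_counts[label][name] += bool(value)
    return label_counts, true_counts


def posterior(model, features):
    """P(label | features) via Bayes' rule with Laplace smoothing."""
    label_counts, true_counts = model
    total = sum(label_counts.values())
    log_scores = {}
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        for name, value in features.items():
            p_true = (true_counts[label][name] + 1) / (count + 2)
            score += math.log(p_true if value else 1 - p_true)
        log_scores[label] = score
    peak = max(log_scores.values())  # subtract the max for numerical stability
    weights = {label: math.exp(s - peak) for label, s in log_scores.items()}
    z = sum(weights.values())
    return {label: w / z for label, w in weights.items()}
```

In this sketch, a pair's feature dictionary would be derived from the TF-IDF vectors (for example, `{"high_cosine": cosine(u, v) > 0.8}`, with the cutoff again an assumed value), and the posterior for the duplicate label plays the role of the score.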

To improve scoring with human feedback:

```bash
# Review top-scored pairs and provide feedback
unsterwerx knowledge labels add --label duplicate_or_same_concept <DOC_A> <DOC_B>
unsterwerx knowledge labels add --label unrelated <DOC_A> <DOC_B>

# Rebuild: model automatically retrains with new feedback
unsterwerx knowledge build --evaluate
```

See the Knowledge Scoring Guide for detailed tuning advice.

Step 4: Build Knowledge Vectors

Cluster related documents into knowledge vectors in the Universal Data Module so you can inspect groups instead of isolated pairs:

```bash
unsterwerx knowledge vectors build
unsterwerx knowledge vectors list --limit 20
```
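This guide does not say how pairs are grouped into vectors. One common way to turn scored pairs into groups is to treat each pair above a threshold as a graph edge and take connected components, sketched here with union-find. The approach, the `0.8` threshold, and the function name `cluster_pairs` are all assumptions for illustration.

```python
def cluster_pairs(scored_pairs, threshold=0.8):
    """Group documents linked by pairs scoring at or above the threshold.

    scored_pairs: iterable of (doc_a, doc_b, score).
    Returns a sorted list of sorted document groups.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for doc_a, doc_b, score in scored_pairs:
        root_a, root_b = find(doc_a), find(doc_b)  # registers both documents
        if score >= threshold:
            parent[root_a] = root_b

    groups = {}
    for doc in parent:
        groups.setdefault(find(doc), set()).add(doc)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])
```

With this grouping, documents A and C end up together when A matches B and B matches C, even if A and C were never a candidate pair themselves.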

Step 5: Review and Apply BI Dedup

Use the vector graph together with the Bayesian posteriors to apply Business Intelligence hierarchy rules and identify redundant versions:

```bash
# Preview first
unsterwerx knowledge dedup scan --threshold 0.8

# Execute after review
unsterwerx knowledge dedup apply --confirm

# Refresh vector membership and edges
unsterwerx knowledge vectors build
```

The knowledge dedup apply command marks removed documents as deduplicated in the Shared Sandbox, merges their provenance onto the kept document, and stores rollback diffs in the Universal Data Module when canonical content is available.

Step 6: Compute Diffs

Compare similar document pairs to see exactly what changed:

```bash
unsterwerx diff --all
```

This computes structural diffs for all candidate pairs identified by similarity analysis. View specific diffs with:

```bash
unsterwerx diff --doc-a <ID> --doc-b <ID>
```
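For a feel of what a diff between two canonical markdown versions looks like, here is a small sketch using Python's standard `difflib`. This is a plain line-based unified diff, not the structural diff Unsterwerx computes, and `markdown_diff` is a hypothetical helper name.

```python
import difflib


def markdown_diff(old, new, old_name="doc_a", new_name="doc_b"):
    """Return unified-diff lines between two markdown strings (line-based)."""
    return list(difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=old_name,
        tofile=new_name,
        lineterm="",
    ))


if __name__ == "__main__":
    before = "# Agreement\nTerm: 12 months\n"
    after = "# Agreement\nTerm: 24 months\n"
    print("\n".join(markdown_diff(before, after)))
```

Identical inputs produce an empty list, which makes it easy to tell exact copies apart from near-duplicates.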

Step 7: Search Content

Search across all canonical document content:

```bash
unsterwerx search "cybersecurity"
```

Full-text search is powered by SQLite FTS5 and returns results ranked by relevance with content snippets.
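The following sketch shows how an FTS5 index returns ranked results with snippets, using Python's built-in `sqlite3` module. The table name `doc_fts`, its columns, and the helper names are assumptions, not the Unsterwerx schema, and the example requires a SQLite build with FTS5 enabled.

```python
import sqlite3


def build_index(docs):
    """docs: {doc_id: text}. Builds an in-memory FTS5 index (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE doc_fts USING fts5(doc_id UNINDEXED, body)")
    conn.executemany(
        "INSERT INTO doc_fts (doc_id, body) VALUES (?, ?)",
        list(docs.items()),
    )
    return conn


def search(conn, query, limit=10):
    """Return (doc_id, snippet) rows, best match first.

    snippet() wraps matched terms in [ ] and trims the text to about 8 tokens;
    ORDER BY rank uses FTS5's built-in relevance ordering.
    """
    return conn.execute(
        "SELECT doc_id, snippet(doc_fts, 1, '[', ']', '...', 8) "
        "FROM doc_fts WHERE doc_fts MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()
```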

Step 8: Set Up Classification Rules

Define rules to automatically classify documents by type:

```bash
unsterwerx rules add \
    --name "my-contracts" \
    --class contract \
    --filename-pattern "(?i)(contract|agreement)" \
    --content-pattern "(?i)(hereby\s+agree|terms\s+and\s+conditions)" \
    --match-all
```
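A rule like the one above can be modeled as a pair of regular expressions plus a flag. The sketch below assumes that `--match-all` means every configured pattern must match and that, without it, any single match is enough; that reading is an assumption, as is the `Rule` class itself. The two patterns happen to be valid in Python's `re` module, but the CLI's regex dialect may differ.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    """Hypothetical in-memory model of a classification rule."""
    name: str
    doc_class: str
    filename_pattern: Optional[str] = None
    content_pattern: Optional[str] = None
    match_all: bool = False

    def matches(self, filename: str, content: str) -> bool:
        checks = []
        if self.filename_pattern:
            checks.append(re.search(self.filename_pattern, filename) is not None)
        if self.content_pattern:
            checks.append(re.search(self.content_pattern, content) is not None)
        if not checks:
            return False  # a rule with no patterns never matches
        # Assumed semantics: match_all requires every pattern, otherwise any one.
        return all(checks) if self.match_all else any(checks)


contracts = Rule(
    name="my-contracts",
    doc_class="contract",
    filename_pattern=r"(?i)(contract|agreement)",
    content_pattern=r"(?i)(hereby\s+agree|terms\s+and\s+conditions)",
    match_all=True,
)
```

Under these assumptions, `contracts.matches("Service_Agreement.pdf", "The parties hereby agree")` is true, while a file with a matching name but unrelated content is not classified.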

View active rules:

```bash
unsterwerx rules list
```

Step 9: Classify Documents

Apply classification rules to all documents:

```bash
unsterwerx classify
```

View classification results for a specific document:

```bash
unsterwerx classify --show <DOCUMENT_ID>
```

Step 10: Set Up Retention Policies

Define retention policies per document class:

```bash
unsterwerx rules policy \
    --name "contract-retention" \
    --class contract \
    --retention-years 7 \
    --immutable \
    --action move
```

Step 11: Archive

Apply retention policies to move or delete documents past their retention period:

```bash
unsterwerx archive --dry-run    # Preview first
unsterwerx archive              # Execute
```
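The check behind "past their retention period" is a date comparison. The sketch below assumes the clock starts at a document's creation date, which this guide does not confirm; the function names are hypothetical.

```python
from datetime import date


def retention_expiry(created: date, retention_years: int) -> date:
    """Date on which the retention period ends (assumed to start at creation)."""
    try:
        return created.replace(year=created.year + retention_years)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap target year; roll to Mar 1.
        return created.replace(year=created.year + retention_years, month=3, day=1)


def is_expired(created: date, retention_years: int, today: date) -> bool:
    """True once today has reached the expiry date."""
    return today >= retention_expiry(created, retention_years)
```

For a 7-year contract policy, a document created on 2015-01-01 becomes eligible on 2022-01-01 under these assumptions.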

Step 12: Reconstruct

Export any document from the canonical store:

```bash
unsterwerx reconstruct <DOCUMENT_ID> -o output.md
unsterwerx reconstruct <DOCUMENT_ID> -o output.pdf -f pdf
```

Step 13: Verify Audit Trail

Confirm the integrity of all operations:

```bash
unsterwerx audit --verify
```

```
Verifying audit hash chain...
Chain verified: 142 events, integrity OK
```
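A hash chain makes tampering detectable because each event's hash covers the previous event's hash. The sketch below shows the general technique with SHA-256; the event layout, the all-zero genesis value, and the function names are assumptions and do not describe Unsterwerx's actual audit format.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder for "no previous event"


def event_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous hash together with a canonical JSON form of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_event(chain: list, payload: dict) -> None:
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev_hash": prev, "hash": event_hash(prev, payload)})


def verify_chain(chain: list):
    """Return (True, event_count) or (False, index_of_first_bad_event)."""
    prev = GENESIS
    for index, event in enumerate(chain):
        if event["prev_hash"] != prev or event["hash"] != event_hash(prev, event["payload"]):
            return False, index
        prev = event["hash"]
    return True, len(chain)
```

Editing any past event changes its hash, which breaks the link stored in every later event, so verification fails at the first altered entry.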

Monitoring

Check overall system status at any time:

```bash
unsterwerx status --detailed
```

Run benchmarks to measure pipeline performance:

```bash
unsterwerx benchmark --stages canonical,similarity
```