End-to-End Workflow
This guide walks through the full Unsterwerx pipeline from ingestion to archival.
Step 1: Ingest Documents
Start by scanning a directory tree for documents:
unsterwerx ingest /path/to/documents
This registers all supported files (PDF, DOCX, XLSX, PPTX, TXT, CSV, Markdown, SQL) in the database with their SHA-256 content hashes. Duplicate files are automatically skipped.
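As a rough illustration of content-hash deduplication, the sketch below hashes each file with SHA-256 and skips any file whose digest has already been seen. It is not the Unsterwerx implementation: the function names `sha256_of_file` and `register_files` and the in-memory `seen` dictionary (standing in for the database) are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register_files(root: Path, seen: dict[str, Path]) -> list[Path]:
    """Record unseen content in `seen` (hash -> first path); return skipped duplicates.

    `seen` is a hypothetical stand-in for the document database.
    """
    skipped = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        content_hash = sha256_of_file(path)
        if content_hash in seen:
            skipped.append(path)  # same bytes already registered
        else:
            seen[content_hash] = path
    return skipped
```

Because the key is the hash of the file contents, two files with different names but identical bytes count as one document.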
Use --dry-run first to preview what will be ingested without modifying the database.
Step 2: Run Similarity Analysis
Find duplicate and near-duplicate documents:
unsterwerx similarity
This step:
- Extracts canonical markdown from all unprocessed documents (the NAC pipeline)
- Generates MinHash signatures from text shingles
- Applies LSH banding to find candidate pairs
- Computes Jaccard similarity scores for all candidate pairs
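The last three steps above can be sketched in a few lines of Python. This is a minimal illustration of MinHash, LSH banding, and Jaccard scoring in general, not the Unsterwerx code: the shingle size, the number of hash functions, the band count, and every function name below are assumptions chosen for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text: str, k: int = 3) -> set[str]:
    """Word-level k-shingles of the text (k=3 is an assumed default)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed: int, shingle: str) -> int:
    """Seeded 64-bit hash; each seed acts as one MinHash function."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def minhash_signature(shingle_set: set[str], num_hashes: int = 64) -> tuple[int, ...]:
    """For each hash function, keep the minimum hash over all shingles."""
    return tuple(
        min((_hash(seed, s) for s in shingle_set), default=0)
        for seed in range(num_hashes)
    )


def lsh_candidates(signatures: dict[str, tuple[int, ...]], bands: int = 16) -> set[tuple[str, str]]:
    """Documents sharing an identical band of signature rows become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(list)
    for doc_id, signature in signatures.items():
        for band in range(bands):
            key = (band, signature[band * rows:(band + 1) * rows])
            buckets[key].append(doc_id)
    pairs = set()
    for doc_ids in buckets.values():
        pairs.update(combinations(sorted(doc_ids), 2))
    return pairs


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity, computed only for candidate pairs."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Banding avoids comparing every document with every other one. Only pairs that collide in at least one band reach the exact Jaccard computation.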
Step 3: Build Knowledge Scores
Score similarity candidates using Bayesian Business Intelligence:
unsterwerx knowledge build --evaluate
This builds TF-IDF semantic features from the Universal Data Set, trains a Naive Bayes model on bootstrap labels, and computes posterior probabilities for each candidate pair stored in the Universal Data Module. The --evaluate flag shows model accuracy metrics.
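To make the posterior step concrete, here is a small multinomial Naive Bayes sketch with add-one smoothing. It is a simplification under stated assumptions: each candidate pair is represented by a hypothetical list of shared terms rather than real TF-IDF features, the training data is made up, and the function names are invented for the example. Only the two label names come from the commands in this guide.

```python
import math
from collections import Counter


def train_naive_bayes(examples: list[tuple[list[str], str]]):
    """Count classes and per-class tokens from (tokens, label) examples."""
    class_counts = Counter(label for _, label in examples)
    token_counts = {label: Counter() for label in class_counts}
    vocabulary = set()
    for tokens, label in examples:
        token_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, token_counts, vocabulary


def posterior(model, tokens: list[str]) -> dict[str, float]:
    """Return P(label | tokens) for every label, using add-one smoothing."""
    class_counts, token_counts, vocabulary = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, count in class_counts.items():
        denominator = sum(token_counts[label].values()) + len(vocabulary)
        score = math.log(count / total)  # log prior
        for token in tokens:
            if token in vocabulary:  # tokens never seen in training are ignored
                score += math.log((token_counts[label][token] + 1) / denominator)
        log_scores[label] = score
    # Normalise in log space to avoid underflow on long token lists.
    peak = max(log_scores.values())
    exponentials = {label: math.exp(s - peak) for label, s in log_scores.items()}
    norm = sum(exponentials.values())
    return {label: value / norm for label, value in exponentials.items()}


# Hypothetical bootstrap labels: shared terms of a pair -> label.
training = [
    (["invoice", "q3", "total"], "duplicate_or_same_concept"),
    (["invoice", "total"], "duplicate_or_same_concept"),
    (["the", "and"], "unrelated"),
    (["the"], "unrelated"),
]
model = train_naive_bayes(training)
scores = posterior(model, ["invoice", "total"])
```

Adding labeled pairs changes the counts, which is why rebuilding after feedback shifts the posteriors.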
To improve scoring with human feedback:
# Review top-scored pairs and provide feedback
unsterwerx knowledge labels add --label duplicate_or_same_concept <DOC_A> <DOC_B>
unsterwerx knowledge labels add --label unrelated <DOC_A> <DOC_B>
# Rebuild: model automatically retrains with new feedback
unsterwerx knowledge build --evaluate
See the Knowledge Scoring Guide for detailed tuning advice.
Step 4: Build Knowledge Vectors
Cluster related documents into knowledge vectors in the Universal Data Module so you can inspect groups instead of isolated pairs:
unsterwerx knowledge vectors build
unsterwerx knowledge vectors list --limit 20
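This guide does not spell out how pairs are grouped, so the following is only one plausible way to turn scored pairs into groups: treat every pair at or above a threshold as an edge and take connected components with union-find. The function name `cluster_pairs`, the threshold, and the input format are assumptions for illustration.

```python
def cluster_pairs(pairs: list[tuple[str, str, float]], threshold: float = 0.8) -> list[set[str]]:
    """Group documents linked by pairs scoring at or above `threshold`.

    Assumed approach: connected components via union-find.
    """
    parent: dict[str, str] = {}

    def find(node: str) -> str:
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for doc_a, doc_b, score in pairs:
        root_a, root_b = find(doc_a), find(doc_b)  # registers both documents
        if score >= threshold:
            parent[root_a] = root_b  # merge the two groups

    groups: dict[str, set[str]] = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted(groups.values(), key=lambda group: sorted(group))
```

With this approach, documents A and C land in the same group when A matches B and B matches C, even if A and C were never a candidate pair themselves.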
Step 5: Review and Apply BI Dedup
Use the vector graph together with the Bayesian posterior scores to apply Business Intelligence hierarchy rules and identify redundant versions:
# Preview first
unsterwerx knowledge dedup scan --threshold 0.8
# Execute after review
unsterwerx knowledge dedup apply --confirm
# Refresh vector membership and edges
unsterwerx knowledge vectors build
The knowledge dedup apply command marks removed documents as deduplicated in the Shared Sandbox, merges their provenance onto the kept document, and stores rollback diffs in the Universal Data Module when canonical content is available.
Step 6: Compute Diffs
Compare similar document pairs to see exactly what changed:
unsterwerx diff --all
This computes structural diffs for all candidate pairs identified by similarity analysis. View specific diffs with:
unsterwerx diff --doc-a <ID> --doc-b <ID>
Step 7: Search Content
Search across all canonical document content:
unsterwerx search "cybersecurity"
Full-text search is powered by SQLite FTS5 and returns results ranked by relevance with content snippets.
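For readers unfamiliar with FTS5, the sketch below shows ranked search with snippets using Python's built-in `sqlite3` module. The table name `docs`, its columns, and the helper functions are assumptions for this example and say nothing about the actual Unsterwerx schema. It also assumes the local SQLite build includes FTS5.

```python
import sqlite3


def build_index(documents: dict[str, str]) -> sqlite3.Connection:
    """Create an in-memory FTS5 table (hypothetical schema) and load documents."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE docs USING fts5(doc_id UNINDEXED, body)")
    conn.executemany(
        "INSERT INTO docs (doc_id, body) VALUES (?, ?)", documents.items()
    )
    return conn


def search(conn: sqlite3.Connection, query: str, limit: int = 10) -> list[tuple[str, str]]:
    """Return (doc_id, snippet) rows, best match first."""
    return conn.execute(
        # snippet() marks matches in column 1 (body); ORDER BY rank sorts by relevance.
        "SELECT doc_id, snippet(docs, 1, '[', ']', '...', 8) "
        "FROM docs WHERE docs MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()
```

In FTS5, `rank` defaults to a BM25 score, so ordering by it puts the most relevant rows first.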
Step 8: Set Up Classification Rules
Define rules to automatically classify documents by type:
unsterwerx rules add \
--name "my-contracts" \
--class contract \
--filename-pattern "(?i)(contract|agreement)" \
--content-pattern "(?i)(hereby\s+agree|terms\s+and\s+conditions)" \
--match-all
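The sketch below shows how a rule like this might be evaluated. It assumes that --match-all means both patterns must match and that, without it, either pattern is enough. That reading, along with the `Rule` class and its field names, is an assumption for illustration and is not taken from the Unsterwerx source.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    """Hypothetical in-memory form of a classification rule."""
    name: str
    doc_class: str
    filename_pattern: str
    content_pattern: str
    match_all: bool = False

    def matches(self, filename: str, content: str) -> bool:
        hits = [
            re.search(self.filename_pattern, filename) is not None,
            re.search(self.content_pattern, content) is not None,
        ]
        # Assumed semantics: match_all requires both patterns, otherwise either.
        return all(hits) if self.match_all else any(hits)


rule = Rule(
    name="my-contracts",
    doc_class="contract",
    filename_pattern=r"(?i)(contract|agreement)",
    content_pattern=r"(?i)(hereby\s+agree|terms\s+and\s+conditions)",
    match_all=True,
)
```

The `(?i)` prefix makes each pattern case-insensitive, so `Service_AGREEMENT.pdf` and `service_agreement.pdf` are treated alike.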
View active rules:
unsterwerx rules list
Step 9: Classify Documents
Apply classification rules to all documents:
unsterwerx classify
View classification results for a specific document:
unsterwerx classify --show <DOCUMENT_ID>
Step 10: Set Up Retention Policies
Define retention policies per document class:
unsterwerx rules policy \
--name "contract-retention" \
--class contract \
--retention-years 7 \
--immutable \
--action move
Step 11: Archive
Apply retention policies to move or delete documents past their retention period:
unsterwerx archive --dry-run # Preview first
unsterwerx archive # Execute
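The "past their retention period" check comes down to date arithmetic. A minimal sketch follows, assuming retention is counted in whole years from a document's creation date. Which timestamp Unsterwerx actually uses is not stated here, and the function names are hypothetical.

```python
from datetime import date


def retention_expiry(created: date, retention_years: int) -> date:
    """Date on which the retention period ends (assumed to start at creation)."""
    target_year = created.year + retention_years
    try:
        return created.replace(year=target_year)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; fall back to the 28th.
        return created.replace(year=target_year, day=28)


def is_due(created: date, retention_years: int, today: date) -> bool:
    """True once the retention period has fully elapsed."""
    return today >= retention_expiry(created, retention_years)
```

Under these assumptions, a contract created on 1 January 2015 with a 7-year policy becomes eligible on 1 January 2022 and not before.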
Step 12: Reconstruct
Export any document from the canonical store:
unsterwerx reconstruct <DOCUMENT_ID> -o output.md
unsterwerx reconstruct <DOCUMENT_ID> -o output.pdf -f pdf
Step 13: Verify Audit Trail
Confirm the integrity of all operations:
unsterwerx audit --verify
Verifying audit hash chain...
Chain verified: 142 events, integrity OK
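A hash chain of this kind can be sketched as follows: each event stores the hash of the previous event, so altering any earlier record invalidates every hash after it. The event layout, the genesis value, and the function names are assumptions for illustration. The real audit log format may differ.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash preceding the first event


def event_hash(prev_hash: str, payload: dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON form of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_event(chain: list[dict], payload: dict) -> None:
    """Append an event linked to the current tail of the chain."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({
        "payload": payload,
        "prev_hash": prev_hash,
        "hash": event_hash(prev_hash, payload),
    })


def verify_chain(chain: list[dict]) -> bool:
    """Recompute every hash in order; any mismatch means tampering or corruption."""
    prev_hash = GENESIS
    for event in chain:
        if event["prev_hash"] != prev_hash:
            return False
        if event["hash"] != event_hash(prev_hash, event["payload"]):
            return False
        prev_hash = event["hash"]
    return True
```

Serializing the payload with sorted keys keeps the hash stable regardless of the order in which fields were written.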
Monitoring
Check overall system status at any time:
unsterwerx status --detailed
Run benchmarks to measure pipeline performance:
unsterwerx benchmark --stages canonical,similarity