Unsterwerx
Unsterwerx is a document-domain implementation of the Trusted Client-Centric Application Architecture (US Patent US9069626B2). It ingests common document formats into a local Shared Sandbox, normalizes them into a Universal Data Set, finds duplicates and near-duplicates, computes structural diffs, and supports temporal reconstruction under Business Intelligence and User Intelligence policy control.
Features
- Ingest thousands of documents from any directory tree
- Detect exact and near-duplicate documents via MinHash + LSH
- Extract searchable canonical markdown from every supported format
- Diff structural changes between similar document versions
- Search the entire corpus with full-text search (SQLite FTS5)
- Classify documents with regex-based rules and retention policies
- Import from external sources: ChatGPT, Notion, Obsidian, Telegram
- Reconstruct documents from the canonical store as markdown or PDF
- Cluster and compact related content in the Universal Data Module using Bayesian Business Intelligence scoring, knowledge vectors, and BI dedup
- Audit every operation with an append-only hash-chained log
- Benchmark the full pipeline with detailed performance metrics
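
To make the "MinHash + LSH" feature above more concrete, here is a minimal Python sketch of the general technique. It is not taken from Unsterwerx itself: the function names (`shingles`, `minhash_signature`, `candidate_pairs`), the 3-word shingle size, and the 64-hash / 16-band parameters are all assumptions chosen for illustration.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 64  # signature length (assumed value, for illustration only)
BANDS = 16       # LSH bands; each band covers NUM_HASHES // BANDS rows


def shingles(text, k=3):
    """Return the set of k-word shingles in a text (lowercased)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed, item):
    """64-bit hash of item; the seed selects one member of a hash family."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items):
    """For each seeded hash function, keep the minimum hash over the set."""
    return [
        min((_seeded_hash(seed, item) for item in items), default=0)
        for seed in range(NUM_HASHES)
    ]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures):
    """LSH banding: documents sharing any identical band become candidates."""
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for band in range(BANDS):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for doc_ids in buckets.values():
        pairs.update(combinations(sorted(doc_ids), 2))
    return pairs


docs = {
    "a": "the quarterly report covers data architecture and retention policy",
    "b": "an unrelated memo about office parking and lunch schedules today",
    "c": "the quarterly report covers data architecture and retention policy",
}
sigs = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
pairs = candidate_pairs(sigs)
```

Banding trades recall for speed: only documents that collide in at least one band are compared in full, so most dissimilar pairs are never examined. Candidates would then typically be confirmed with `estimated_jaccard` against a similarity threshold.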
Quick Start
```bash
curl -fsSL https://unsterwerx.run/install.sh | sh
unsterwerx ingest /path/to/documents
unsterwerx similarity
unsterwerx search "data architecture"
unsterwerx status --detailed
```
Commands
| Command | Description |
|---|---|
| ingest | Ingest files from a source directory |
| status | Show system and document status |
| reindex | Rebuild full-text search index (FTS5) |
| similarity | Run similarity analysis on ingested documents |
| diff | Compute diffs between similar document pairs |
| search | Search canonical document content |
| reconstruct | Reconstruct a document from canonical store |
| classify | Classify documents using rules |
| archive | Archive documents per retention policies |
| audit | View and verify audit log |
| rules | Manage classification rules |
| knowledge | Bayesian scoring, vector graphs, BI dedup |
| import | Import data from external sources |
| jobs | Manage background ingest and import jobs |
| config | Manage configuration |
| benchmark | Benchmark the TCA pipeline |
| upgrade | Check for and install the latest release |
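
The `audit` command views and verifies an append-only hash-chained log. As a rough illustration of how such a chain can be built and checked, here is a minimal Python sketch. The entry layout, the SHA-256 choice, and the names `append_entry` and `verify_chain` are assumptions for this example, not Unsterwerx's actual log format or API.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash preceding the first entry


def entry_hash(prev_hash, payload):
    """Hash the previous entry's hash together with a canonical payload encoding."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(log, payload):
    """Append a payload, linking it to the hash of the current last entry."""
    prev = log[-1]["hash"] if log else GENESIS
    entry = {"prev": prev, "payload": payload, "hash": entry_hash(prev, payload)}
    log.append(entry)
    return entry


def verify_chain(log):
    """Return the index of the first inconsistent entry, or None if the chain is intact."""
    prev = GENESIS
    for index, entry in enumerate(log):
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return index
        prev = entry["hash"]
    return None


log = []
append_entry(log, {"op": "ingest", "path": "/path/to/documents"})
append_entry(log, {"op": "similarity"})
append_entry(log, {"op": "search", "query": "data architecture"})
intact = verify_chain(log)

# Tampering with an earlier payload is detected at that entry.
log[1]["payload"]["op"] = "archive"
broken_at = verify_chain(log)
```

Because each hash covers the previous entry's hash, altering or removing any entry invalidates every later link unless the whole tail is recomputed, which is what makes tampering detectable during verification.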