# Architecture
Unsterwerx is a single Rust binary organized into 18 logical modules. Documents flow through a pipeline from ingestion to archival, with every operation recorded in the audit trail.
## Pipeline
```
ingest → parse → canonical → similarity → knowledge → diff → classify → archive
                     ↓                        ↑
               search (FTS5)            user feedback
                     ↓
                reconstruct
```
- Ingest: scans directories, computes SHA-256 hashes, registers documents in the database, and deduplicates by content hash
- Parse: extracts raw text from PDF, DOCX, XLSX, PPTX, TXT, and CSV using format-specific parsers (NACs: Normalization, Abstraction, Compaction)
- Canonical: transforms parsed text into structural elements and stores canonical markdown in content-addressable storage (CAS)
- Similarity: generates MinHash signatures from text shingles, applies LSH banding, and computes Jaccard similarity scores
- Knowledge: builds TF-IDF semantic features, trains a Naive Bayes model on bootstrap labels and user feedback, then scores pairs with the posterior P(duplicate | features)
- Diff: computes structural diffs between similar document pairs using LCS alignment and stores compressed diff payloads in CAS
- Classify: applies regex-based classification rules to assign document classes with weighted confidence
- Archive: enforces retention policies by moving or deleting documents per class-specific rules
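The similarity stage above can be sketched as follows. This is an illustrative toy, not the project's implementation: the real shingle size, number of hash functions, and band layout live in `similarity/`, and `DefaultHasher` stands in for whatever hash family the crate actually uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Split text into overlapping word shingles of length `k`.
fn shingles(text: &str, k: usize) -> Vec<String> {
    let words: Vec<&str> = text.split_whitespace().collect();
    if words.len() < k {
        return vec![words.join(" ")];
    }
    words.windows(k).map(|w| w.join(" ")).collect()
}

/// MinHash signature: for each of `n` seeded hash functions, keep the
/// minimum hash value observed over all shingles.
fn minhash(shingles: &[String], n: u64) -> Vec<u64> {
    (0..n)
        .map(|seed| {
            shingles
                .iter()
                .map(|s| {
                    let mut h = DefaultHasher::new();
                    seed.hash(&mut h);
                    s.hash(&mut h);
                    h.finish()
                })
                .min()
                .unwrap_or(u64::MAX)
        })
        .collect()
}

/// LSH banding: split the signature into fixed-size bands and hash each
/// band; documents sharing any band key become candidate pairs.
fn band_keys(sig: &[u64], band_size: usize) -> Vec<u64> {
    sig.chunks(band_size)
        .map(|band| {
            let mut h = DefaultHasher::new();
            band.hash(&mut h);
            h.finish()
        })
        .collect()
}

/// Estimated Jaccard similarity: fraction of signature slots that agree.
fn jaccard_estimate(a: &[u64], b: &[u64]) -> f64 {
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    matches as f64 / a.len() as f64
}

fn main() {
    let a = minhash(&shingles("the quick brown fox jumps over the lazy dog", 3), 128);
    let b = minhash(&shingles("the quick brown fox leaps over the lazy dog", 3), 128);
    println!("estimated Jaccard: {:.2}", jaccard_estimate(&a, &b));
}
```

The banding step is what makes the stage sub-quadratic: only pairs that collide in at least one band bucket are scored with the full Jaccard estimate.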
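The knowledge stage's posterior scoring can likewise be sketched in miniature. The feature layout here is hypothetical (boolean features with learned per-class likelihoods); the real model in `bayes/` also handles training, smoothing, and evaluation, all omitted here.

```rust
/// Minimal Naive Bayes posterior for the duplicate / non-duplicate decision.
/// `p_f_dup[i]` and `p_f_non[i]` are the learned likelihoods P(f_i | dup)
/// and P(f_i | non-dup); `prior_dup` is P(duplicate).
fn posterior_duplicate(features: &[bool], p_f_dup: &[f64], p_f_non: &[f64], prior_dup: f64) -> f64 {
    // Work in log space to avoid underflow when many features multiply.
    let mut log_dup = prior_dup.ln();
    let mut log_non = (1.0 - prior_dup).ln();
    for (i, &present) in features.iter().enumerate() {
        let (pd, pn) = if present {
            (p_f_dup[i], p_f_non[i])
        } else {
            (1.0 - p_f_dup[i], 1.0 - p_f_non[i])
        };
        log_dup += pd.ln();
        log_non += pn.ln();
    }
    // Normalize: P(dup | features) = e^log_dup / (e^log_dup + e^log_non),
    // shifted by the max exponent for numerical stability.
    let m = log_dup.max(log_non);
    let (ed, en) = ((log_dup - m).exp(), (log_non - m).exp());
    ed / (ed + en)
}

fn main() {
    // Two features that strongly indicate a duplicate push the posterior up.
    let p = posterior_duplicate(&[true, true], &[0.9, 0.9], &[0.1, 0.1], 0.5);
    println!("P(duplicate | features) = {:.3}", p);
}
```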
## Module Map
| Module | Purpose |
|---|---|
| core/ | Error types, config, document types |
| storage/ | SQLite DB, migrations, CAS filesystem |
| audit/ | Append-only hash-chain audit log |
| ingest/ | File scanning, SHA-256 hashing, registration |
| parse/ | PDF, DOCX, XLSX, PPTX, TXT, CSV text extraction |
| similarity/ | MinHash signatures + LSH banding |
| canonical/ | Markdown extraction, FTS5 search indexing |
| diff/ | Content + structural diffing, temporal tracking |
| bayes/ | Naive Bayes model training, inference, evaluation |
| semantic/ | TF-IDF features, corpus IDF computation |
| rules/ | Classification rules + retention policies |
| temporal/ | Version timeline + point-in-time resolution |
| reconstruct/ | Tera templates, markdown/PDF output |
| archive/ | Move/delete per retention policy |
| cli/ | 17 subcommands (plus 1 hidden worker) |
| knowledge/ | Bayesian classification, vector graphs, BI deduplication |
| benchmark.rs | Pipeline performance measurement and regression detection |
| import/ | Multi-source adapters (local, ChatGPT, Notion, Obsidian, Telegram) |
## Key Design Decisions
- Single binary crate: no workspace, no microservices. Modules provide logical separation.
- SQLite + WAL: all metadata lives in a single database file with write-ahead logging for concurrent reads.
- Content-addressable storage: canonical markdown and diff payloads are stored by SHA-256 hash prefix, enabling storage-level deduplication.
- Streaming hashing: files are hashed with 8 KB streaming buffers to avoid loading them fully into memory.
- Append-only audit: every mutation is logged with hash chaining, and the chain can be verified at any time.
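The content-addressable layout can be illustrated with a tiny path helper. The two-level `ab/cd/` fanout below is an assumption for illustration; the actual layout is defined in `storage/`.

```rust
/// Map a SHA-256 hex digest to a storage path, fanned out by hash prefix
/// so no single directory accumulates millions of files. Because the path
/// is derived purely from content, writing identical content twice lands
/// on the same path: deduplication falls out of the addressing scheme.
fn cas_path(hex_digest: &str) -> String {
    format!("{}/{}/{}", &hex_digest[0..2], &hex_digest[2..4], hex_digest)
}

fn main() {
    println!("{}", cas_path("abcdef1234567890"));
}
```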
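The streaming-hash pattern looks roughly like this. `DefaultHasher` stands in for SHA-256 (which would come from an external crate such as `sha2`); the 8 KB buffering loop is the point of the sketch.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::io::{self, Read};

/// Hash a reader in 8 KB chunks so large files are never loaded fully
/// into memory. Works for any `Read` source: files, sockets, buffers.
fn hash_stream<R: Read>(mut reader: R) -> io::Result<u64> {
    let mut hasher = DefaultHasher::new();
    let mut buf = [0u8; 8192];
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // EOF
        }
        hasher.write(&buf[..n]);
    }
    Ok(hasher.finish())
}

fn main() -> io::Result<()> {
    let digest = hash_stream(&b"example document bytes"[..])?;
    println!("digest: {:016x}", digest);
    Ok(())
}
```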
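The hash-chaining idea behind the audit log can be sketched as follows: each entry's hash covers the previous entry's hash plus the new payload, so altering any entry breaks every later link. Again `DefaultHasher` stands in for the real digest, and the entry shape is simplified for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Append-only audit chain sketch: (payload, chained hash) pairs.
struct AuditLog {
    entries: Vec<(String, u64)>,
}

impl AuditLog {
    fn new() -> Self {
        AuditLog { entries: Vec::new() }
    }

    /// Hash the previous link together with the new payload.
    fn link(prev: u64, payload: &str) -> u64 {
        let mut h = DefaultHasher::new();
        h.write(&prev.to_le_bytes());
        h.write(payload.as_bytes());
        h.finish()
    }

    fn append(&mut self, payload: &str) {
        let prev = self.entries.last().map(|e| e.1).unwrap_or(0);
        let hash = Self::link(prev, payload);
        self.entries.push((payload.to_string(), hash));
    }

    /// Recompute every link from the start; false if any entry was altered.
    fn verify(&self) -> bool {
        let mut prev = 0u64;
        for (payload, hash) in &self.entries {
            if Self::link(prev, payload) != *hash {
                return false;
            }
            prev = *hash;
        }
        true
    }
}

fn main() {
    let mut log = AuditLog::new();
    log.append("ingest doc-1");
    log.append("parse doc-1");
    println!("chain valid: {}", log.verify());
}
```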