# Architecture
Unsterwerx is a single Rust binary organized into 18 logical modules. Documents flow through a pipeline from ingestion to archival, with every operation recorded in the audit trail.
## Pipeline
```
ingest → parse → canonical → similarity → knowledge → diff → classify → archive
                     ↓                        ↑
               search (FTS5)            user feedback
                     ↓
                reconstruct
```
- Ingest: scans directories, computes SHA-256 hashes, registers documents in the database, and deduplicates by content hash
- Parse: extracts raw text from PDF, DOCX, XLSX, PPTX, TXT, and CSV using format-specific parsers (NACs: Normalization, Abstraction, Compaction)
- Canonical: transforms parsed text into structural elements and stores canonical markdown in content-addressable storage (CAS)
- Similarity: generates MinHash signatures from text shingles, applies LSH banding, and computes Jaccard similarity scores
- Knowledge: builds TF-IDF semantic features, trains a Naive Bayes model on bootstrap labels and user feedback, then scores pairs with the posterior P(duplicate | features)
- Diff: computes structural diffs between similar document pairs using LCS alignment and stores compressed diff payloads in CAS
- Classify: applies regex-based classification rules to assign document classes with weighted confidence
- Archive: enforces retention policies by moving or deleting documents per class-specific rules
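The similarity stage above can be sketched as follows. This is an illustrative toy, not the project's implementation: the real shingle size, number of hash functions, and band layout live in `similarity/`, and `DefaultHasher` stands in for whatever hash family the crate actually uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Split text into overlapping word shingles of length `k`.
fn shingles(text: &str, k: usize) -> Vec<String> {
    let words: Vec<&str> = text.split_whitespace().collect();
    if words.len() < k {
        return vec![words.join(" ")];
    }
    words.windows(k).map(|w| w.join(" ")).collect()
}

/// MinHash signature: for each of `n` seeded hash functions, keep the
/// minimum hash value observed over all shingles.
fn minhash(shingles: &[String], n: u64) -> Vec<u64> {
    (0..n)
        .map(|seed| {
            shingles
                .iter()
                .map(|s| {
                    let mut h = DefaultHasher::new();
                    seed.hash(&mut h);
                    s.hash(&mut h);
                    h.finish()
                })
                .min()
                .unwrap_or(u64::MAX)
        })
        .collect()
}

/// LSH banding: split the signature into fixed-size bands and hash each
/// band; documents sharing any band key become candidate pairs.
fn band_keys(sig: &[u64], band_size: usize) -> Vec<u64> {
    sig.chunks(band_size)
        .map(|band| {
            let mut h = DefaultHasher::new();
            band.hash(&mut h);
            h.finish()
        })
        .collect()
}

/// Estimated Jaccard similarity: fraction of signature slots that agree.
fn jaccard_estimate(a: &[u64], b: &[u64]) -> f64 {
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    matches as f64 / a.len() as f64
}

fn main() {
    let a = minhash(&shingles("the quick brown fox jumps over the lazy dog", 3), 128);
    let b = minhash(&shingles("the quick brown fox leaps over the lazy dog", 3), 128);
    println!("estimated Jaccard: {:.2}", jaccard_estimate(&a, &b));
}
```

The banding step is what makes the stage sub-quadratic: only pairs that collide in at least one band bucket are scored with the full Jaccard estimate.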
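The knowledge stage's posterior scoring can likewise be sketched in miniature. The feature layout here is hypothetical (boolean features with learned per-class likelihoods); the real model in `bayes/` also handles training, smoothing, and evaluation, all omitted here.

```rust
/// Minimal Naive Bayes posterior for the duplicate / non-duplicate decision.
/// `p_f_dup[i]` and `p_f_non[i]` are the learned likelihoods P(f_i | dup)
/// and P(f_i | non-dup); `prior_dup` is P(duplicate).
fn posterior_duplicate(features: &[bool], p_f_dup: &[f64], p_f_non: &[f64], prior_dup: f64) -> f64 {
    // Work in log space to avoid underflow when many features multiply.
    let mut log_dup = prior_dup.ln();
    let mut log_non = (1.0 - prior_dup).ln();
    for (i, &present) in features.iter().enumerate() {
        let (pd, pn) = if present {
            (p_f_dup[i], p_f_non[i])
        } else {
            (1.0 - p_f_dup[i], 1.0 - p_f_non[i])
        };
        log_dup += pd.ln();
        log_non += pn.ln();
    }
    // Normalize: P(dup | features) = e^log_dup / (e^log_dup + e^log_non),
    // shifted by the max exponent for numerical stability.
    let m = log_dup.max(log_non);
    let (ed, en) = ((log_dup - m).exp(), (log_non - m).exp());
    ed / (ed + en)
}

fn main() {
    // Two features that strongly indicate a duplicate push the posterior up.
    let p = posterior_duplicate(&[true, true], &[0.9, 0.9], &[0.1, 0.1], 0.5);
    println!("P(duplicate | features) = {:.3}", p);
}
```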
## Module Map
| Module | Purpose |
|---|---|
| core/ | Error types, config, document types |
| storage/ | SQLite DB, migrations, CAS filesystem |
| audit/ | Append-only hash-chain audit log |
| ingest/ | File scanning, SHA-256 hashing, registration |
| parse/ | PDF, DOCX, XLSX, PPTX, TXT, CSV text extraction |
| similarity/ | MinHash signatures + LSH banding |
| canonical/ | Markdown extraction, FTS5 search indexing |
| diff/ | Content + structural diffing, temporal tracking |
| bayes/ | Naive Bayes model training, inference, evaluation |
| semantic/ | TF-IDF features, corpus IDF computation |
| rules/ | Classification rules + retention policies |
| temporal/ | Version timeline + point-in-time resolution |
| reconstruct/ | Tera templates, markdown/PDF output |
| archive/ | Move/delete per retention policy |
| cli/ | 17 subcommands (plus 1 hidden worker) |
| knowledge/ | Bayesian classification, vector graphs, BI deduplication |
| benchmark.rs | Pipeline performance measurement and regression detection |
| import/ | Multi-source adapters (local, ChatGPT, Notion, Obsidian, Telegram) |
## Key Design Decisions
- Single binary crate: no workspace, no microservices. Modules provide logical separation.
- SQLite + WAL: all metadata lives in a single database file with write-ahead logging for concurrent reads.
- Content-addressable storage: canonical markdown and diff payloads are stored by SHA-256 hash prefix, enabling storage-level deduplication.
- Streaming hashing: files are hashed with 8 KB streaming buffers to avoid loading them fully into memory.
- Append-only audit: every mutation is logged with hash chaining, and the chain can be verified at any time.
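The content-addressable layout can be illustrated with a tiny path helper. The two-level `ab/cd/` fanout below is an assumption for illustration; the actual layout is defined in `storage/`.

```rust
/// Map a SHA-256 hex digest to a storage path, fanned out by hash prefix
/// so no single directory accumulates millions of files. Because the path
/// is derived purely from content, writing identical content twice lands
/// on the same path: deduplication falls out of the addressing scheme.
fn cas_path(hex_digest: &str) -> String {
    format!("{}/{}/{}", &hex_digest[0..2], &hex_digest[2..4], hex_digest)
}

fn main() {
    println!("{}", cas_path("abcdef1234567890"));
}
```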
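The streaming-hash pattern looks roughly like this. `DefaultHasher` stands in for SHA-256 (which would come from an external crate such as `sha2`); the 8 KB buffering loop is the point of the sketch.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::io::{self, Read};

/// Hash a reader in 8 KB chunks so large files are never loaded fully
/// into memory. Works for any `Read` source: files, sockets, buffers.
fn hash_stream<R: Read>(mut reader: R) -> io::Result<u64> {
    let mut hasher = DefaultHasher::new();
    let mut buf = [0u8; 8192];
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // EOF
        }
        hasher.write(&buf[..n]);
    }
    Ok(hasher.finish())
}

fn main() -> io::Result<()> {
    let digest = hash_stream(&b"example document bytes"[..])?;
    println!("digest: {:016x}", digest);
    Ok(())
}
```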
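The hash-chaining idea behind the audit log can be sketched as follows: each entry's hash covers the previous entry's hash plus the new payload, so altering any entry breaks every later link. Again `DefaultHasher` stands in for the real digest, and the entry shape is simplified for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Append-only audit chain sketch: (payload, chained hash) pairs.
struct AuditLog {
    entries: Vec<(String, u64)>,
}

impl AuditLog {
    fn new() -> Self {
        AuditLog { entries: Vec::new() }
    }

    /// Hash the previous link together with the new payload.
    fn link(prev: u64, payload: &str) -> u64 {
        let mut h = DefaultHasher::new();
        h.write(&prev.to_le_bytes());
        h.write(payload.as_bytes());
        h.finish()
    }

    fn append(&mut self, payload: &str) {
        let prev = self.entries.last().map(|e| e.1).unwrap_or(0);
        let hash = Self::link(prev, payload);
        self.entries.push((payload.to_string(), hash));
    }

    /// Recompute every link from the start; false if any entry was altered.
    fn verify(&self) -> bool {
        let mut prev = 0u64;
        for (payload, hash) in &self.entries {
            if Self::link(prev, payload) != *hash {
                return false;
            }
            prev = *hash;
        }
        true
    }
}

fn main() {
    let mut log = AuditLog::new();
    log.append("ingest doc-1");
    log.append("parse doc-1");
    println!("chain valid: {}", log.verify());
}
```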