Unsterwerx

Architecture

Unsterwerx is a single Rust binary organized into 18 logical modules. Documents flow through a pipeline from ingestion to archival, with every operation recorded in the audit trail.

Pipeline

ingest → parse → canonical → similarity → knowledge → diff → classify → archive
                     ↓                        ↑
                  search (FTS5)         user feedback
                     ↓
                 reconstruct

  1. Ingest: scans directories, computes SHA-256 hashes, registers documents in the database, and deduplicates by content hash
  2. Parse: extracts raw text from PDF, DOCX, XLSX, PPTX, TXT, and CSV using format-specific parsers (NACs: Normalization, Abstraction, Compaction)
  3. Canonical: transforms parsed text into structural elements and stores canonical markdown in content-addressable storage (CAS)
  4. Similarity: generates MinHash signatures from text shingles, applies LSH banding, and computes Jaccard similarity scores
  5. Knowledge: builds TF-IDF semantic features, trains a Naive Bayes model on bootstrap labels and user feedback, then scores pairs with posterior P(duplicate | features)
  6. Diff: computes structural diffs between similar document pairs using LCS alignment and stores compressed diff payloads in CAS
  7. Classify: applies regex-based classification rules to assign document classes with weighted confidence
  8. Archive: enforces retention policies by moving or deleting documents per class-specific rules
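
The similarity stage (step 4) can be illustrated with a minimal sketch, using only the Rust standard library. Everything named here is an assumption made for illustration and not taken from the Unsterwerx source: the function names, word-level shingles, the signature length, and the use of `DefaultHasher` with a seed as a stand-in for a family of hash functions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Split text into overlapping word shingles of length `k` (hypothetical helper).
fn shingles(text: &str, k: usize) -> HashSet<String> {
    assert!(k > 0, "shingle length must be positive");
    let words: Vec<&str> = text.split_whitespace().collect();
    if words.is_empty() {
        return HashSet::new();
    }
    if words.len() < k {
        // Texts shorter than k words yield a single shingle.
        return HashSet::from([words.join(" ")]);
    }
    words.windows(k).map(|w| w.join(" ")).collect()
}

/// Hash a shingle under a seed; each seed plays the role of one hash function.
/// Assumption: DefaultHasher is used only to keep the sketch dependency-free.
fn seeded_hash(seed: u64, shingle: &str) -> u64 {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h);
    shingle.hash(&mut h);
    h.finish()
}

/// MinHash signature: for each seed, the minimum hash over all shingles.
fn minhash_signature(set: &HashSet<String>, num_hashes: usize) -> Vec<u64> {
    (0..num_hashes as u64)
        .map(|seed| {
            set.iter()
                .map(|s| seeded_hash(seed, s))
                .min()
                .unwrap_or(u64::MAX) // empty set: sentinel value
        })
        .collect()
}

/// Estimated Jaccard similarity: fraction of signature positions that agree.
fn estimate_jaccard(a: &[u64], b: &[u64]) -> f64 {
    assert_eq!(a.len(), b.len(), "signatures must have equal length");
    if a.is_empty() {
        return 0.0;
    }
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    matches as f64 / a.len() as f64
}

/// Exact Jaccard similarity on the shingle sets, for comparison.
fn exact_jaccard(a: &HashSet<String>, b: &HashSet<String>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 0.0;
    }
    a.intersection(b).count() as f64 / union as f64
}

/// LSH banding: split the signature into bands of `rows` values and hash each
/// band. Two documents become a candidate pair if any band key matches.
fn lsh_band_keys(sig: &[u64], rows: usize) -> Vec<u64> {
    assert!(rows > 0, "rows per band must be positive");
    sig.chunks(rows)
        .enumerate()
        .map(|(band_index, band)| {
            let mut h = DefaultHasher::new();
            band_index.hash(&mut h); // keep bands at different positions distinct
            band.hash(&mut h);
            h.finish()
        })
        .collect()
}
```

In this sketch, more rows per band make candidate matching stricter, while more bands make it more permissive. Only candidate pairs would then need a Jaccard score, which is the usual reason to pair MinHash with LSH.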

Module Map

Module         Purpose
core/          Error types, config, document types
storage/       SQLite DB, migrations, CAS filesystem
audit/         Append-only hash-chain audit log
ingest/        File scanning, SHA-256 hashing, registration
parse/         PDF, DOCX, XLSX, PPTX, TXT, CSV text extraction
similarity/    MinHash signatures + LSH banding
canonical/     Markdown extraction, FTS5 search indexing
diff/          Content + structural diffing, temporal tracking
bayes/         Naive Bayes model training, inference, evaluation
semantic/      TF-IDF features, corpus IDF computation
rules/         Classification rules + retention policies
temporal/      Version timeline + point-in-time resolution
reconstruct/   Tera templates, markdown/PDF output
archive/       Move/delete per retention policy
cli/           17 subcommands (plus 1 hidden worker)
knowledge/     Bayesian classification, vector graphs, BI deduplication
benchmark.rs   Pipeline performance measurement and regression detection
import/        Multi-source adapters (local, ChatGPT, Notion, Obsidian, Telegram)
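
The append-only hash-chain audit log in `audit/` can be pictured with a minimal sketch like the one below. The `AuditEntry` and `AuditLog` types, their fields, and the in-memory storage are hypothetical, not the project's actual schema. The sketch also uses the standard library's `DefaultHasher` purely to stay dependency-free; that hasher is not cryptographic, and a real audit chain would presumably use a cryptographic hash such as SHA-256.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// One audit record (hypothetical layout). Each entry commits to the hash of
/// the previous entry, so editing any record breaks every later link.
#[derive(Debug, Clone)]
struct AuditEntry {
    seq: u64,
    operation: String,
    prev_hash: u64,
    hash: u64,
}

/// Hash of an entry's contents plus the previous entry's hash.
/// Assumption: DefaultHasher is a non-cryptographic stand-in for illustration.
fn entry_hash(seq: u64, operation: &str, prev_hash: u64) -> u64 {
    let mut h = DefaultHasher::new();
    seq.hash(&mut h);
    operation.hash(&mut h);
    prev_hash.hash(&mut h);
    h.finish()
}

/// In-memory append-only log (hypothetical; a real log would be persisted).
#[derive(Default)]
struct AuditLog {
    entries: Vec<AuditEntry>,
}

impl AuditLog {
    /// Append a record, chaining it to the last entry (or to 0 for the first).
    fn append(&mut self, operation: &str) {
        let seq = self.entries.len() as u64;
        let prev_hash = self.entries.last().map_or(0, |e| e.hash);
        let hash = entry_hash(seq, operation, prev_hash);
        self.entries.push(AuditEntry {
            seq,
            operation: operation.to_string(),
            prev_hash,
            hash,
        });
    }

    /// Walk the chain from the start and recompute every link.
    fn verify(&self) -> bool {
        let mut expected_prev = 0u64;
        for (i, e) in self.entries.iter().enumerate() {
            let recomputed = entry_hash(e.seq, &e.operation, e.prev_hash);
            if e.seq != i as u64 || e.prev_hash != expected_prev || e.hash != recomputed {
                return false;
            }
            expected_prev = e.hash;
        }
        true
    }
}
```

Because each entry's hash covers the previous hash, verification only needs a single pass from the first record. Any edit, reorder, or deletion in the middle of the log surfaces as a mismatch at or after the point of tampering.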

Key Design Decisions