Unsterwerx

Data Storage

All Unsterwerx data is stored locally in the data directory (default ~/.unsterwerx/). No cloud services or external databases are required.

Directory Layout

~/.unsterwerx/
├── unsterwerx.db          # SQLite database (WAL mode)
├── canonical/             # CAS markdown files (SHA-256 prefix dirs)
│   ├── 0a/
│   │   └── 0a1b2c3d...   # canonical markdown content
│   └── ff/
│       └── ff9e8d7c...
├── diffs/                 # CAS diff payloads (zstd compressed)
├── archive/               # Archived original documents
└── templates/             # User Tera templates for reconstruction

SQLite Database

The database runs in WAL (Write-Ahead Logging) mode, so readers are not blocked while a write is in progress. Key tables:

| Table | Purpose |
| --- | --- |
| documents | Document registry with hash, status, metadata |
| canonical_records | Links documents to CAS markdown content |
| canonical_fts | FTS5 full-text search index |
| similarity_signatures | MinHash signatures per document |
| similarity_candidates | Similar document pairs with Jaccard scores |
| diff_records | Diff metadata and CAS references |
| classification_rules | Regex-based classification patterns |
| document_classifications | Classification results per document |
| retention_policies | Retention rules per document class |
| source_hierarchy_rules | Trust weight rules by source class |
| knowledge_sources | Registered import source adapters |
| import_batches | Import batch tracking |
| import_items | Individual import item records |
| document_provenance | Source linkage for imported documents |
| audit_events | Append-only hash-chained audit log |
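To illustrate what WAL mode means for concurrent reads, here is a minimal sketch using Python's standard sqlite3 module against a throwaway database. It does not touch a real unsterwerx.db, and the two-column documents table, the helper names open_wal and demo_concurrent_read, and the sample values are assumptions made for the demonstration, not the actual schema.

```python
import os
import sqlite3
import tempfile


def open_wal(path):
    """Open a SQLite database in autocommit mode and switch it to WAL."""
    conn = sqlite3.connect(path, isolation_level=None)
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    return conn, mode


def demo_concurrent_read():
    """Return (journal mode, rows seen mid-write, rows seen after commit)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        writer, mode = open_wal(path)
        # Hypothetical, simplified table; the real schema is not shown here.
        writer.execute(
            "CREATE TABLE documents (hash TEXT PRIMARY KEY, status TEXT)"
        )
        reader = sqlite3.connect(path, isolation_level=None)

        # The writer opens a transaction and leaves it uncommitted.
        writer.execute("BEGIN IMMEDIATE")
        writer.execute(
            "INSERT INTO documents VALUES ('0a1b2c3d', 'canonical')"
        )
        # The reader is not blocked and sees the last committed snapshot.
        during = reader.execute("SELECT COUNT(*) FROM documents").fetchone()[0]

        writer.execute("COMMIT")
        after = reader.execute("SELECT COUNT(*) FROM documents").fetchone()[0]

        reader.close()
        writer.close()
        return mode, during, after
```

Under these assumptions, demo_concurrent_read() returns ("wal", 0, 1): the reader sees no rows while the insert is still uncommitted and one row once it is committed.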

Content-Addressable Storage (CAS)

Canonical markdown and diff payloads are stored in a content-addressable filesystem layout. Each file is named by its SHA-256 hash and placed in one of 256 prefix directories (00/ through ff/), matching the first two hex characters of the hash.

Diff payloads are additionally compressed with zstd (level 3 by default).
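The path scheme described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the Unsterwerx implementation: it assumes the file name is the full lowercase hex digest of the stored bytes, and the helper names cas_path and cas_put are hypothetical.

```python
import hashlib
from pathlib import Path


def cas_path(root, content: bytes) -> Path:
    """Map content to <root>/<first two hex chars>/<full SHA-256 hex digest>."""
    digest = hashlib.sha256(content).hexdigest()
    return Path(root) / digest[:2] / digest


def cas_put(root, content: bytes) -> Path:
    """Write content at its content-derived path, skipping existing files."""
    path = cas_path(root, content)
    if not path.exists():
        # Identical content always maps to the same path, so it is stored once.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(content)
    return path
```

Because the path is derived only from the content, writing the same bytes twice yields the same file, and a stored file can be checked by re-hashing it and comparing the result with its name.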

Document Lifecycle

Documents progress through these statuses:

| Status | Meaning |
| --- | --- |
| canonical | Text extracted, canonical markdown stored in CAS |
| classified | Classification rules applied, document class assigned |
| error | Parse or extraction failed (corrupt file, invalid format) |
| image_only | Scanned PDF with no extractable text |
| unsupported | File format has no parser (e.g., .doc, .ppt, .xls legacy formats) |
| deduplicated | Removed from the active set by knowledge dedup |
| dismissed | Marked unrecoverable by the user |

Overriding the Data Directory

Set a custom data directory with --data-dir or the UNSTERWERX_DATA environment variable:

```bash
unsterwerx --data-dir /path/to/data status
UNSTERWERX_DATA=/path/to/data unsterwerx status
```
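For scripts that need to locate the same directory, the lookup might be mirrored as in the following sketch. The precedence shown (flag first, then environment variable, then the default) is an assumption, since this page does not say which setting wins when both are given, and resolve_data_dir is a hypothetical helper rather than part of Unsterwerx.

```python
import os
from pathlib import Path


def resolve_data_dir(cli_value=None, env=None) -> Path:
    """Pick the data directory: flag, then UNSTERWERX_DATA, then the default.

    The ordering is an assumed precedence, not documented behavior.
    """
    env = os.environ if env is None else env
    if cli_value:
        return Path(cli_value)
    if env.get("UNSTERWERX_DATA"):
        return Path(env["UNSTERWERX_DATA"])
    return Path.home() / ".unsterwerx"
```

Passing env explicitly keeps the function easy to exercise in isolation, for example resolve_data_dir(None, {"UNSTERWERX_DATA": "/path/to/data"}).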