Unsterwerx

Data Storage

All Unsterwerx data is stored locally in the data directory (default ~/.unsterwerx/). No cloud services or external databases are required.

Directory Layout

~/.unsterwerx/
├── unsterwerx.db          # SQLite database (WAL mode)
├── canonical/             # CAS markdown files (SHA-256 prefix dirs)
│   ├── 0a/
│   │   └── 0a1b2c3d...   # canonical markdown content
│   └── ff/
│       └── ff9e8d7c...
├── diffs/                 # CAS diff payloads (zstd compressed)
├── archive/               # Archived original documents
└── templates/             # User Tera templates for reconstruction

SQLite Database

The database runs in WAL (write-ahead logging) mode, so readers are not blocked while a write is in progress. Key tables:

Table                     Purpose
documents                 Document registry with hash, status, metadata
canonical_records         Links documents to CAS markdown content
canonical_fts             FTS5 full-text search index
similarity_signatures     MinHash signatures per document
similarity_candidates     Similar document pairs with Jaccard scores
diff_records              Diff metadata and CAS references
classification_rules      Regex-based classification patterns
document_classifications  Classification results per document
retention_policies        Retention rules per document class
source_hierarchy_rules    Trust weight rules by source class
knowledge_sources         Registered import source adapters
import_batches            Import batch tracking
import_items              Individual import item records
document_provenance       Source linkage for imported documents
audit_events              Append-only hash-chained audit log
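
To look at the database without getting in the way of a running Unsterwerx process, one option is a read-only connection. The sketch below uses Python's standard sqlite3 module; the helper names are illustrative and not part of Unsterwerx, and only the table names listed above are taken from this page (column layouts are not documented here, so the sketch does not query them).

```python
import sqlite3
from pathlib import Path


def open_readonly(db_path):
    """Open a SQLite file read-only via a file: URI (helper name is illustrative)."""
    uri = Path(db_path).expanduser().resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def journal_mode(conn):
    """Return the journal mode reported by SQLite, e.g. 'wal'."""
    return conn.execute("PRAGMA journal_mode").fetchone()[0]


def list_tables(conn):
    """Return the names of all tables in the database, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    # Assumes the default data directory described at the top of this page.
    conn = open_readonly("~/.unsterwerx/unsterwerx.db")
    print(journal_mode(conn))
    print(list_tables(conn))
    conn.close()
```

Because the connection is opened with mode=ro, any accidental write attempt fails instead of modifying the live database.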

Content-Addressable Storage (CAS)

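Canonical markdown and diff payloads are stored in a CAS filesystem. Files are named by their SHA-256 hash and organized into 256 prefix directories (00/ through ff/), keyed by the first two hex characters of the hash. Because a file's name is derived from its content, identical content maps to the same path and is stored only once, and a file can be checked for corruption by rehashing it and comparing the result with its name.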

Diff payloads are additionally compressed with zstd (level 3 by default).
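
The path scheme can be sketched in a few lines of Python. This is a minimal illustration, not Unsterwerx's implementation: it assumes the hash is taken over the raw bytes of the stored content and that the file name is the full 64-character hex digest. The function names are hypothetical. zstd compression of diff payloads is left out because it needs a third-party package.

```python
import hashlib
import tempfile
from pathlib import Path


def cas_path(root: Path, content: bytes) -> Path:
    """Map content to <root>/<first two hex chars>/<full hex digest>."""
    digest = hashlib.sha256(content).hexdigest()
    return root / digest[:2] / digest


def cas_put(root: Path, content: bytes) -> Path:
    """Store content under its hash; storing the same bytes twice is a no-op."""
    path = cas_path(root, content)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temporary name first, then rename, so readers never
        # observe a partially written file.
        tmp = path.with_suffix(".tmp")
        tmp.write_bytes(content)
        tmp.replace(path)
    return path


def cas_verify(path: Path) -> bool:
    """Rehash a stored file and compare the digest with its file name."""
    return hashlib.sha256(path.read_bytes()).hexdigest() == path.name


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp()) / "canonical"
    stored = cas_put(root, b"# Example document\n")
    print(stored.relative_to(root), cas_verify(stored))
```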

Document Lifecycle

Documents progress through these statuses:

Status        Meaning
canonical     Text extracted, canonical markdown stored in CAS
classified    Classification rules applied, document class assigned
error         Parse or extraction failed (corrupt file, invalid format)
image_only    Scanned PDF with no extractable text
unsupported   File format has no parser (e.g., .doc, .ppt, .xls legacy formats)
deduplicated  Removed from the active set by knowledge dedup
dismissed     Marked unrecoverable by the user
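
For scripts that post-process status values (for example, values exported from the documents table), the statuses above can be modeled as an enumeration so that typos and unexpected values fail loudly. This is a sketch only; the class and function names are hypothetical, and the status strings are the ones listed on this page.

```python
from collections import Counter
from enum import Enum


class DocumentStatus(str, Enum):
    """Status strings as listed in the table above."""
    CANONICAL = "canonical"
    CLASSIFIED = "classified"
    ERROR = "error"
    IMAGE_ONLY = "image_only"
    UNSUPPORTED = "unsupported"
    DEDUPLICATED = "deduplicated"
    DISMISSED = "dismissed"


def tally(statuses):
    """Count documents per status; raises ValueError on an unknown status."""
    return Counter(DocumentStatus(value) for value in statuses)


if __name__ == "__main__":
    counts = tally(["canonical", "classified", "error", "canonical"])
    for status, count in counts.items():
        print(f"{status.value}: {count}")
```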

Overriding the Data Directory

Set a custom data directory with --data-dir or the UNSTERWERX_DATA environment variable:

bash
unsterwerx --data-dir /path/to/data status
UNSTERWERX_DATA=/path/to/data unsterwerx status
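
A wrapper script that needs to locate the same directory could resolve it as sketched below. The precedence shown (command-line value first, then UNSTERWERX_DATA, then the default) is an assumption based on common CLI conventions; this page does not state which one wins when both are set. The function name is hypothetical.

```python
import os
from pathlib import Path

DEFAULT_DATA_DIR = "~/.unsterwerx"


def resolve_data_dir(cli_value=None, env=None) -> Path:
    """Pick the data directory: CLI value, then environment, then default.

    The ordering is an assumption, not documented Unsterwerx behavior.
    Empty strings are treated the same as unset values.
    """
    env = os.environ if env is None else env
    raw = cli_value or env.get("UNSTERWERX_DATA") or DEFAULT_DATA_DIR
    return Path(raw).expanduser()


if __name__ == "__main__":
    print(resolve_data_dir())
    print(resolve_data_dir("/path/to/data"))
```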

Mounted Storage

Mounted source folders are valid ingest inputs. Unsterwerx can scan and normalize documents from NAS, SMB, NFS, sshfs, Google Drive, and similar mounted document stores.

The live Shared Sandbox has stricter requirements than an ingest source. SQLite locking, CAS writes, and atomic config updates need filesystem behavior that many mounts only partly provide. When storage.data_dir_mode = "auto", Unsterwerx runs directly on the data directory if it sits on a local filesystem, and switches to mirror mode if the data directory is mounted or its filesystem type is unknown.
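
The "auto" decision can be summarized as a small function. This is a sketch of the rule as described here, not Unsterwerx's code: the fs_kind labels ("local", "mounted", "unknown"), the plan labels ("direct", "mirror"), and all names are hypothetical, and only the two data_dir_mode values that appear on this page ("auto" and "mirror") are handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StoragePlan:
    mode: str                      # "direct" or "mirror" (hypothetical labels)
    runtime_dir: Optional[str]     # where live state is kept
    publish_target: Optional[str]  # where snapshots are published, if anywhere


def plan_storage(data_dir, data_dir_mode, fs_kind, runtime_dir=None):
    """Choose between running directly on data_dir and mirroring to it.

    fs_kind is "local", "mounted", or "unknown"; how the filesystem is
    classified is outside the scope of this sketch.
    """
    if data_dir_mode not in ("auto", "mirror"):
        raise ValueError(f"mode not covered by this sketch: {data_dir_mode!r}")
    if data_dir_mode == "mirror" or fs_kind != "local":
        # Mounted or unknown filesystems get a local runtime directory;
        # None here means "left to the tool's default", which is not documented.
        return StoragePlan("mirror", runtime_dir, data_dir)
    return StoragePlan("direct", data_dir, None)


if __name__ == "__main__":
    print(plan_storage("/home/alex/.unsterwerx", "auto", "local"))
    print(plan_storage("/Volumes/Archive/unsterwerx", "auto", "mounted"))
```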

Mirror mode keeps live runtime state in a local directory and treats the requested mounted data directory as the publish target. After successful mutating commands, Unsterwerx publishes a SQLite snapshot and storage artifacts back to that target.
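
How Unsterwerx produces the snapshot is not described here. As an illustration of the general idea, the sketch below takes a consistent copy of a live SQLite database with SQLite's online backup API (exposed in Python as Connection.backup), writes it under a temporary name in the target directory, and renames it into place. The function name and the ".publishing" suffix are hypothetical.

```python
import os
import sqlite3
from pathlib import Path


def publish_snapshot(live_db: Path, target_dir: Path) -> Path:
    """Copy a live SQLite database into target_dir as a consistent snapshot."""
    target_dir.mkdir(parents=True, exist_ok=True)
    final = target_dir / live_db.name
    tmp = target_dir / (live_db.name + ".publishing")
    if tmp.exists():
        tmp.unlink()  # discard leftovers from an interrupted publish

    src = sqlite3.connect(live_db)
    dst = sqlite3.connect(tmp)
    try:
        # The backup API copies a consistent view of the source, including
        # committed changes that still sit in the write-ahead log.
        src.backup(dst)
    finally:
        dst.close()
        src.close()

    # Swap the finished snapshot into place in one step.
    os.replace(tmp, final)
    return final
```

Writing to a temporary name and renaming keeps a half-copied file from ever appearing under the final name, although rename guarantees on network mounts vary by filesystem.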

Inspect the active storage plan:

bash
unsterwerx --data-dir /Volumes/Archive/unsterwerx storage status

Retry a publish:

bash
unsterwerx --data-dir /Volumes/Archive/unsterwerx storage publish

Configure an explicit local runtime mirror:

toml
[storage]
data_dir_mode = "mirror"
runtime_dir = "/Users/alex/.unsterwerx/runtime/archive"