Data Storage
All Unsterwerx data is stored locally in the data directory (default ~/.unsterwerx/). No cloud services or external databases are required.
Directory Layout
~/.unsterwerx/
├── unsterwerx.db # SQLite database (WAL mode)
├── canonical/ # CAS markdown files (SHA-256 prefix dirs)
│ ├── 0a/
│ │ └── 0a1b2c3d... # canonical markdown content
│ └── ff/
│ └── ff9e8d7c...
├── diffs/ # CAS diff payloads (zstd compressed)
├── archive/ # Archived original documents
└── templates/ # User Tera templates for reconstruction
SQLite Database
The database uses WAL (Write-Ahead Logging) mode, which lets readers keep working while a write is in progress. Key tables:
| Table | Purpose |
|---|---|
| documents | Document registry with hash, status, metadata |
| canonical_records | Links documents to CAS markdown content |
| canonical_fts | FTS5 full-text search index |
| similarity_signatures | MinHash signatures per document |
| similarity_candidates | Similar document pairs with Jaccard scores |
| diff_records | Diff metadata and CAS references |
| classification_rules | Regex-based classification patterns |
| document_classifications | Classification results per document |
| retention_policies | Retention rules per document class |
| source_hierarchy_rules | Trust weight rules by source class |
| knowledge_sources | Registered import source adapters |
| import_batches | Import batch tracking |
| import_items | Individual import item records |
| document_provenance | Source linkage for imported documents |
| audit_events | Append-only hash-chained audit log |
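To look at the database without risking a write, one option is to open it read-only from a script. The following is a minimal sketch using Python's standard sqlite3 module; the helper names `open_readonly` and `describe` are illustrative and not part of Unsterwerx.

```python
import sqlite3
from pathlib import Path


def open_readonly(db_path):
    """Open a SQLite file read-only so inspection cannot modify it."""
    # as_uri() percent-encodes the path; mode=ro is a standard SQLite URI option.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def describe(conn):
    """Return the journal mode and the sorted table names of a database."""
    journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    return journal_mode, tables
```

Calling `describe(open_readonly(path))` on a WAL database should report `wal` as the journal mode. Note that FTS5 indexes also create internal shadow tables, so the list may contain more names than the table above.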
Content-Addressable Storage (CAS)
Canonical markdown and diff payloads are stored in a CAS filesystem. Files are named by their SHA-256 hash and organized into 256 prefix directories (00/ through ff/). This provides:
- Automatic deduplication: identical content is stored once
- Integrity verification: the filename is the content hash
- Efficient lookups: O(1) by hash
Diff payloads are additionally compressed with zstd (level 3 by default).
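The naming scheme can be sketched in a few lines. This example assumes the hash is taken over the raw bytes of the stored content and written as lowercase hex, with the first two hex characters used as the prefix directory; the function names `cas_path` and `verify` are hypothetical.

```python
import hashlib
from pathlib import Path


def cas_path(root: Path, content: bytes) -> Path:
    """Return the path where `content` would live under a CAS root."""
    digest = hashlib.sha256(content).hexdigest()  # 64 lowercase hex characters
    # The first two hex characters select one of the 256 prefix directories.
    return root / digest[:2] / digest


def verify(path: Path) -> bool:
    """Check that a file's name equals the SHA-256 of its bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest() == path.name
```

Because the path is derived only from the content, storing identical content twice resolves to the same file. Whether diff payloads are hashed before or after zstd compression is not stated here, so `verify` as written is only a reasonable check for canonical markdown under the assumption above.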
Document Lifecycle
Documents progress through these statuses:
| Status | Meaning |
|---|---|
| canonical | Text extracted, canonical markdown stored in CAS |
| classified | Classification rules applied, document class assigned |
| error | Parse or extraction failed (corrupt file, invalid format) |
| image_only | Scanned PDF with no extractable text |
| unsupported | File format has no parser (e.g., .doc, .ppt, .xls legacy formats) |
| deduplicated | Removed from the active set by knowledge dedup |
| dismissed | Marked unrecoverable by the user |
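A quick way to see how a collection is distributed across these statuses is to count rows per status. The sketch below assumes the documents table exposes the status in a column named `status`; that column name is a guess, so check the actual schema before relying on it.

```python
import sqlite3

# Statuses listed in the table above.
KNOWN_STATUSES = (
    "canonical", "classified", "error", "image_only",
    "unsupported", "deduplicated", "dismissed",
)


def count_by_status(conn):
    """Return a dict mapping each status to its document count."""
    counts = {status: 0 for status in KNOWN_STATUSES}
    # Assumed schema: documents(status TEXT, ...).
    query = "SELECT status, COUNT(*) FROM documents GROUP BY status"
    for status, n in conn.execute(query):
        counts[status] = n  # statuses not listed above are kept as well
    return counts
```

Statuses with no documents are reported as zero, which makes it easier to spot, for example, a batch where everything ended up in `error`.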
Overriding the Data Directory
Set a custom data directory with --data-dir or the UNSTERWERX_DATA environment variable:
unsterwerx --data-dir /path/to/data status
UNSTERWERX_DATA=/path/to/data unsterwerx status
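Wrapper scripts that need to find the same directory can mirror this lookup. The sketch below assumes the flag takes precedence over the environment variable, which in turn overrides the default; that ordering is a common convention but is not confirmed here, and `resolve_data_dir` is a hypothetical helper.

```python
import os
from pathlib import Path

DEFAULT_DATA_DIR = Path("~/.unsterwerx")


def resolve_data_dir(cli_value=None, env=None):
    """Pick the data directory from the flag, the environment, or the default."""
    env = os.environ if env is None else env
    # Assumed precedence: --data-dir, then UNSTERWERX_DATA, then the default.
    chosen = cli_value or env.get("UNSTERWERX_DATA") or DEFAULT_DATA_DIR
    return Path(chosen).expanduser()
```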
Mounted Storage
Mounted source folders are valid ingest inputs. Unsterwerx can scan and normalize documents from NAS, SMB, NFS, sshfs, Google Drive, and similar mounted document stores.
The live Shared Sandbox has stricter requirements. SQLite locking, CAS writes, and atomic config updates need filesystem behavior that many mounts only partly provide. With storage.data_dir_mode = "auto", Unsterwerx runs directly on the data directory when it is on a local filesystem and switches to mirror mode when the data directory is on a mounted or unrecognized filesystem.
Mirror mode keeps live runtime state in a local directory and treats the requested mounted data directory as the publish target. After successful mutating commands, Unsterwerx publishes a SQLite snapshot and storage artifacts back to that target.
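The idea of publishing a snapshot can be illustrated with SQLite's online backup API, which copies a consistent image of a database even while it is in use. This is a sketch of the general technique, not Unsterwerx's actual publish code; `publish_snapshot` and the `.tmp` suffix are assumptions made for the example.

```python
import os
import sqlite3
from pathlib import Path


def publish_snapshot(runtime_db: Path, target_dir: Path) -> Path:
    """Copy a consistent snapshot of runtime_db into target_dir."""
    target_dir.mkdir(parents=True, exist_ok=True)
    final = target_dir / runtime_db.name
    tmp = target_dir / (runtime_db.name + ".tmp")
    if tmp.exists():
        tmp.unlink()  # discard a leftover from an interrupted publish
    src = sqlite3.connect(runtime_db)
    dst = sqlite3.connect(tmp)
    try:
        src.backup(dst)  # SQLite online backup: a consistent page-level copy
    finally:
        dst.close()
        src.close()
    # Swap the finished copy into place; atomic on local POSIX filesystems,
    # while network mounts may offer weaker guarantees.
    os.replace(tmp, final)
    return final
```

Writing to a temporary name and renaming afterwards means a reader of the target directory sees either the previous snapshot or the new one, never a half-written file, as long as the mount honors rename semantics.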
Inspect the active storage plan:
unsterwerx --data-dir /Volumes/Archive/unsterwerx storage status
Retry a publish:
unsterwerx --data-dir /Volumes/Archive/unsterwerx storage publish
Configure an explicit local runtime mirror:
[storage]
data_dir_mode = "mirror"
runtime_dir = "/Users/alex/.unsterwerx/runtime/archive"