Data Storage

All Unsterwerx data is stored locally in the data directory (default ~/.unsterwerx/). No cloud services or external databases are required.

Directory Layout

~/.unsterwerx/
├── unsterwerx.db          # SQLite database (WAL mode)
├── canonical/             # CAS markdown files (SHA-256 prefix dirs)
│   ├── 0a/
│   │   └── 0a1b2c3d...   # canonical markdown content
│   └── ff/
│       └── ff9e8d7c...
├── diffs/                 # CAS diff payloads (zstd compressed)
├── archive/               # Archived original documents
└── templates/             # User Tera templates for reconstruction

SQLite Database

The database uses WAL (Write-Ahead Logging) mode, so readers are not blocked while a write is in progress. Key tables:

Table	Purpose
`documents`	Document registry with hash, status, metadata
`canonical_records`	Links documents to CAS markdown content
`canonical_fts`	FTS5 full-text search index
`similarity_signatures`	MinHash signatures per document
`similarity_candidates`	Similar document pairs with Jaccard scores
`diff_records`	Diff metadata and CAS references
`classification_rules`	Regex-based classification patterns
`document_classifications`	Classification results per document
`retention_policies`	Retention rules per document class
`source_hierarchy_rules`	Trust weight rules by source class
`knowledge_sources`	Registered import source adapters
`import_batches`	Import batch tracking
`import_items`	Individual import item records
`document_provenance`	Source linkage for imported documents
`audit_events`	Append-only hash-chained audit log
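
As a rough illustration of what WAL mode provides, the following Python sketch opens a throwaway SQLite database in WAL mode and reads from a second connection while the first still holds an uncommitted write. It uses only the standard-library sqlite3 module. The two-column documents table and the helper names open_wal and demo are assumptions made for the example, not the actual Unsterwerx schema or code.

```python
import os
import sqlite3
import tempfile


def open_wal(db_path: str) -> sqlite3.Connection:
    """Open a SQLite database and switch it to WAL journaling."""
    conn = sqlite3.connect(db_path)
    # The pragma returns the journal mode that is actually in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode.lower() != "wal":
        conn.close()
        raise RuntimeError(f"WAL mode not available, got {mode!r}")
    return conn


def demo() -> int:
    """Return the row count a reader sees while a writer has a pending insert."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        writer = open_wal(path)
        # Hypothetical, simplified table; the real schema is not shown here.
        writer.execute("CREATE TABLE documents (hash TEXT PRIMARY KEY, status TEXT)")
        writer.execute("INSERT INTO documents VALUES ('0a1b', 'canonical')")
        writer.commit()

        # Start a second insert but do not commit it yet.
        writer.execute("INSERT INTO documents VALUES ('ff9e', 'classified')")

        # A separate connection can still read the last committed state.
        reader = sqlite3.connect(path)
        count = reader.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
        reader.close()

        writer.rollback()
        writer.close()
        return count
```

Calling demo() shows the reader seeing only the committed row while the second insert is still pending.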

Content-Addressable Storage (CAS)

Canonical markdown and diff payloads are stored in a CAS filesystem. Files are named by their SHA-256 hash and organized into 256 prefix directories (00/ through ff/). This provides:

Automatic deduplication: identical content is stored once
Integrity verification: the filename is the content hash
Efficient lookups: the storage path is computed directly from the hash, with no index or directory scan

Diff payloads are additionally compressed with zstd (level 3 by default).
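
The CAS layout described above can be sketched in a few lines of Python. This is a minimal illustration, not the Unsterwerx implementation: it assumes the file name is the full hex SHA-256 digest and the prefix directory is its first two hex characters, and the function names cas_path, cas_put, and cas_verify are hypothetical. Compression of diff payloads is left out.

```python
import hashlib
from pathlib import Path


def cas_path(root: Path, content: bytes) -> Path:
    """Derive the storage path from the content alone."""
    digest = hashlib.sha256(content).hexdigest()
    # Assumed layout: <root>/<first two hex chars>/<full digest>
    return root / digest[:2] / digest


def cas_put(root: Path, content: bytes) -> Path:
    """Store content once; identical content maps to the same path."""
    path = cas_path(root, content)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temporary name, then rename, so readers never
        # observe a partially written file.
        tmp = path.with_suffix(".tmp")
        tmp.write_bytes(content)
        tmp.replace(path)
    return path


def cas_verify(path: Path) -> bool:
    """Re-hash the file and compare against its own file name."""
    return hashlib.sha256(path.read_bytes()).hexdigest() == path.name
```

Storing the same bytes twice returns the same path and leaves a single file on disk, and cas_verify returns False as soon as a stored file no longer matches its name.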

Document Lifecycle

Each document carries one of these statuses:

Status	Meaning
`canonical`	Text extracted, canonical markdown stored in CAS
`classified`	Classification rules applied, document class assigned
`error`	Parse or extraction failed (corrupt file, invalid format)
`image_only`	Scanned PDF with no extractable text
`unsupported`	File format has no parser (e.g., legacy `.doc`, `.ppt`, and `.xls` files)
`deduplicated`	Removed from the active set by knowledge dedup
`dismissed`	Marked unrecoverable by the user

Overriding the Data Directory

Set a custom data directory with --data-dir or the UNSTERWERX_DATA environment variable:

bash

unsterwerx --data-dir /path/to/data status
UNSTERWERX_DATA=/path/to/data unsterwerx status
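
For readers scripting around the CLI, the lookup order can be sketched in Python as follows. This is an assumption-laden sketch: the text above does not say which setting wins when both are given, so the example follows the common convention that an explicit flag overrides the environment variable, which in turn overrides the default. The function name resolve_data_dir is hypothetical.

```python
from pathlib import Path
from typing import Mapping, Optional


def resolve_data_dir(flag: Optional[str], env: Mapping[str, str]) -> Path:
    """Pick the data directory from a flag value, the environment, or the default."""
    # Assumed precedence: --data-dir, then UNSTERWERX_DATA, then ~/.unsterwerx
    if flag:
        return Path(flag).expanduser()
    from_env = env.get("UNSTERWERX_DATA")
    if from_env:
        return Path(from_env).expanduser()
    return Path.home() / ".unsterwerx"
```

In a real script, env would typically be os.environ and flag the parsed value of --data-dir, or None when the flag is absent.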

Mounted Storage

Mounted source folders are valid ingest inputs. Unsterwerx can scan and normalize documents from NAS, SMB, NFS, sshfs, Google Drive, and similar mounted document stores.

The live Shared Sandbox has stricter requirements. SQLite locking, CAS writes, and atomic config updates need filesystem behavior that many mounts only partly provide. When storage.data_dir_mode = "auto", Unsterwerx runs directly on local filesystems and switches mounted or unknown data directories to mirror mode.

Mirror mode keeps live runtime state in a local directory and treats the requested mounted data directory as the publish target. After successful mutating commands, Unsterwerx publishes a SQLite snapshot and storage artifacts back to that target.
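
The decision described above can be modeled as a small pure function. The Python sketch below is only an illustration of the stated rule: the names StoragePlan and plan_storage are hypothetical, how Unsterwerx detects a local filesystem is not covered here (the caller passes that in, with None meaning unknown), and only the two mode values that appear in this document are handled.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class StoragePlan:
    runtime_dir: Path               # where live SQLite and CAS state lives
    publish_target: Optional[Path]  # mounted directory to publish to, if mirroring


def plan_storage(
    data_dir: Path,
    mode: str,
    is_local: Optional[bool],
    runtime_dir: Path,
) -> StoragePlan:
    """Choose between running directly on data_dir and mirroring to it."""
    if mode == "auto":
        # Mounted (False) and unknown (None) data directories are mirrored.
        mirror = is_local is not True
    elif mode == "mirror":
        mirror = True
    else:
        # Other mode values, if any exist, are outside this sketch.
        raise ValueError(f"unhandled data_dir_mode: {mode!r}")

    if mirror:
        return StoragePlan(runtime_dir=runtime_dir, publish_target=data_dir)
    return StoragePlan(runtime_dir=data_dir, publish_target=None)
```

With mode "auto" and a local data directory, the plan has no publish target; with a mounted or unknown directory, the requested data directory becomes the publish target and the local runtime directory holds the live state.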

Inspect the active storage plan:

bash

unsterwerx --data-dir /Volumes/Archive/unsterwerx storage status

Retry a publish:

bash

unsterwerx --data-dir /Volumes/Archive/unsterwerx storage publish

Configure an explicit local runtime mirror:

toml

[storage]
data_dir_mode = "mirror"
runtime_dir = "/Users/alex/.unsterwerx/runtime/archive"