Data Storage

All Unsterwerx data is stored locally in the data directory (default ~/.unsterwerx/). No cloud services or external databases are required.

Directory Layout

~/.unsterwerx/
├── unsterwerx.db          # SQLite database (WAL mode)
├── canonical/             # CAS markdown files (SHA-256 prefix dirs)
│   ├── 0a/
│   │   └── 0a1b2c3d...   # canonical markdown content
│   └── ff/
│       └── ff9e8d7c...
├── diffs/                 # CAS diff payloads (zstd compressed)
├── archive/               # Archived original documents
└── templates/             # User Tera templates for reconstruction

SQLite Database

The database runs in WAL (Write-Ahead Logging) mode, which lets readers keep working while a write is in progress. Key tables:

Table	Purpose
`documents`	Document registry with hash, status, metadata
`canonical_records`	Links documents to CAS markdown content
`canonical_fts`	FTS5 full-text search index
`similarity_signatures`	MinHash signatures per document
`similarity_candidates`	Similar document pairs with Jaccard scores
`diff_records`	Diff metadata and CAS references
`classification_rules`	Regex-based classification patterns
`document_classifications`	Classification results per document
`retention_policies`	Retention rules per document class
`source_hierarchy_rules`	Trust weight rules by source class
`knowledge_sources`	Registered import source adapters
`import_batches`	Import batch tracking
`import_items`	Individual import item records
`document_provenance`	Source linkage for imported documents
`audit_events`	Append-only hash-chained audit log
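
The WAL setting described above can be illustrated with Python's standard `sqlite3` module. This is a minimal sketch against a throwaway database file, not Unsterwerx's own startup code; the helper names `open_wal` and `list_tables` and the two-column `documents` table are hypothetical and exist only for the demo.

```python
import os
import sqlite3
import tempfile


def open_wal(db_path: str) -> tuple[sqlite3.Connection, str]:
    """Open a SQLite file and switch it to WAL journaling."""
    conn = sqlite3.connect(db_path)
    # The pragma returns the journal mode that is actually in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    return conn, mode


def list_tables(conn: sqlite3.Connection) -> list[str]:
    """Return the names of all tables in the database, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]


# Demo on a temporary file; the real schema is NOT reproduced here.
with tempfile.TemporaryDirectory() as tmp:
    conn, mode = open_wal(os.path.join(tmp, "demo.db"))
    # Hypothetical stand-in table, only so that list_tables has something to show.
    conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, status TEXT)")
    conn.commit()
    tables = list_tables(conn)
    conn.close()
```

WAL mode is a persistent property of the database file, so a reader that opens the same file later does not need to set it again.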

Content-Addressable Storage (CAS)

Canonical markdown and diff payloads are stored in a CAS filesystem. Files are named by their SHA-256 hash and organized into 256 prefix directories (00/ through ff/). This provides:

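Automatic deduplication: identical content hashes to the same name, so it is stored once
Integrity verification: the filename is the content hash, so re-hashing a file detects corruption
Efficient lookups: the storage path is derived directly from the hash, with no index or directory scan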

Diff payloads are additionally compressed with zstd (level 3 by default).
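
The naming scheme above can be sketched in a few lines of Python. This is an illustration under assumptions, not the project's implementation: it assumes the filename is the full hex digest of the stored bytes and the prefix directory is its first two hex characters, which matches the layout shown earlier. The function names `cas_path`, `cas_put`, and `cas_verify` are hypothetical, and zstd compression of diffs is left out.

```python
import hashlib
import tempfile
from pathlib import Path


def cas_path(root: Path, content: bytes) -> Path:
    """Derive the storage path from the SHA-256 of the content."""
    digest = hashlib.sha256(content).hexdigest()
    # ASSUMPTION: prefix directory = first two hex characters of the digest.
    return root / digest[:2] / digest


def cas_put(root: Path, content: bytes) -> Path:
    """Store content once; a second put of the same bytes is a no-op."""
    path = cas_path(root, content)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(content)
    return path


def cas_verify(path: Path) -> bool:
    """Re-hash the file and compare the result with its own filename."""
    return hashlib.sha256(path.read_bytes()).hexdigest() == path.name


# Demo in a temporary directory rather than a real data directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "canonical"
    first = cas_put(root, b"# Title\n")
    second = cas_put(root, b"# Title\n")  # identical content
    same_path = first == second
    verified = cas_verify(first)
    stored_files = sum(1 for p in root.rglob("*") if p.is_file())
```

Because the path is a pure function of the content, deduplication needs no bookkeeping: storing the same bytes twice resolves to the same file.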

Document Lifecycle

Each document carries one of these statuses:

Status	Meaning
`canonical`	Text extracted, canonical markdown stored in CAS
`classified`	Classification rules applied, document class assigned
`error`	Parse or extraction failed (corrupt file, invalid format)
`image_only`	Scanned PDF with no extractable text
`unsupported`	File format has no parser (e.g., `.doc`, `.ppt`, `.xls` legacy formats)
`deduplicated`	Removed from the active set by knowledge dedup
`dismissed`	Marked unrecoverable by the user
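
For scripts that consume these values, the statuses above could be modeled as an enumeration so that typos fail loudly. This is only a sketch: the status strings come from the table, while the `DocumentStatus` class and the `summarize` helper are hypothetical names, not part of Unsterwerx.

```python
from collections import Counter
from enum import Enum
from typing import Iterable


class DocumentStatus(str, Enum):
    """Status values as listed in the table above."""
    CANONICAL = "canonical"
    CLASSIFIED = "classified"
    ERROR = "error"
    IMAGE_ONLY = "image_only"
    UNSUPPORTED = "unsupported"
    DEDUPLICATED = "deduplicated"
    DISMISSED = "dismissed"


def summarize(statuses: Iterable[str]) -> Counter:
    """Count documents per status; an unknown string raises ValueError."""
    return Counter(DocumentStatus(s) for s in statuses)


# Example input; in practice the strings would come from the documents table.
counts = summarize(["canonical", "error", "canonical", "image_only"])
```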

Overriding the Data Directory

Set a custom data directory with --data-dir or the UNSTERWERX_DATA environment variable:

unsterwerx --data-dir /path/to/data status
UNSTERWERX_DATA=/path/to/data unsterwerx status
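
A wrapper script that needs to locate the same directory could resolve it as in the sketch below. The precedence shown here (flag first, then environment variable, then default) is an assumption, since the text above does not say which wins when both are set. `resolve_data_dir` is a hypothetical helper, not an Unsterwerx API.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_DATA_DIR = "~/.unsterwerx"


def resolve_data_dir(
    cli_value: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> Path:
    """Pick the data directory from the flag, the environment, or the default."""
    # ASSUMPTION: --data-dir beats UNSTERWERX_DATA, which beats the default.
    chosen = cli_value or env.get("UNSTERWERX_DATA") or DEFAULT_DATA_DIR
    return Path(chosen).expanduser()


# The three cases: flag given, only the variable set, neither set.
from_flag = resolve_data_dir("/path/to/data", {"UNSTERWERX_DATA": "/other"})
from_env = resolve_data_dir(None, {"UNSTERWERX_DATA": "/other"})
from_default = resolve_data_dir(None, {})
```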