Data Storage
All Unsterwerx data is stored locally in the data directory (default ~/.unsterwerx/). No cloud services or external databases are required.
Directory Layout
~/.unsterwerx/
├── unsterwerx.db        # SQLite database (WAL mode)
├── canonical/           # CAS markdown files (SHA-256 prefix dirs)
│   ├── 0a/
│   │   └── 0a1b2c3d...  # canonical markdown content
│   └── ff/
│       └── ff9e8d7c...
├── diffs/               # CAS diff payloads (zstd compressed)
├── archive/             # Archived original documents
└── templates/           # User Tera templates for reconstruction
SQLite Database
The database uses WAL (Write-Ahead Logging) mode, which lets readers proceed concurrently with a writer. Key tables:
| Table | Purpose |
|---|---|
| `documents` | Document registry with hash, status, metadata |
| `canonical_records` | Links documents to CAS markdown content |
| `canonical_fts` | FTS5 full-text search index |
| `similarity_signatures` | MinHash signatures per document |
| `similarity_candidates` | Similar document pairs with Jaccard scores |
| `diff_records` | Diff metadata and CAS references |
| `classification_rules` | Regex-based classification patterns |
| `document_classifications` | Classification results per document |
| `retention_policies` | Retention rules per document class |
| `source_hierarchy_rules` | Trust weight rules by source class |
| `knowledge_sources` | Registered import source adapters |
| `import_batches` | Import batch tracking |
| `import_items` | Individual import item records |
| `document_provenance` | Source linkage for imported documents |
| `audit_events` | Append-only hash-chained audit log |
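
Because the database is an ordinary SQLite file, it can be inspected with any SQLite client. The sketch below uses Python's standard `sqlite3` module to open the file read-only, report the journal mode, and list the tables. The helper name `inspect_database` is hypothetical and not part of Unsterwerx; only the file name `unsterwerx.db` comes from the layout above.

```python
import sqlite3
from pathlib import Path


def inspect_database(db_path):
    """Return (journal_mode, sorted table names) for a SQLite file.

    Hypothetical helper for illustration; it is not part of Unsterwerx.
    """
    # mode=ro opens the file read-only so the inspection cannot alter data.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' ORDER BY name"
            )
        ]
    finally:
        conn.close()
    return journal_mode, tables
```

Called as `inspect_database(Path.home() / ".unsterwerx" / "unsterwerx.db")`, this should report `wal` as the journal mode if the database is configured as described above. FTS5 indexes such as `canonical_fts` also create internal shadow tables, so the list may contain more names than the table above.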
Content-Addressable Storage (CAS)
Canonical markdown and diff payloads are stored in a CAS filesystem. Files are named by their SHA-256 hash and organized into 256 prefix directories (00/ through ff/). This provides:
- Automatic deduplication: identical content is stored once
- Integrity verification: the filename is the content hash
- Efficient lookups: O(1) by hash
Diff payloads are additionally compressed with zstd (level 3 by default).
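
The naming scheme described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Unsterwerx's actual implementation: it assumes the file name is the full lowercase hex SHA-256 digest of the stored bytes and that the prefix directory is the first two hex characters. The function names `cas_path`, `cas_put`, and `cas_verify` are hypothetical. Whether diff payloads are hashed before or after zstd compression is not stated here, so the sketch only covers uncompressed content.

```python
import hashlib
from pathlib import Path


def cas_path(root, content):
    """Map content bytes to <root>/<first two hex chars>/<full digest>."""
    # Assumption: file name is the full lowercase hex SHA-256 digest.
    digest = hashlib.sha256(content).hexdigest()
    return Path(root) / digest[:2] / digest


def cas_put(root, content):
    """Store content once; identical content maps to the same path."""
    path = cas_path(root, content)
    if not path.exists():  # already stored: deduplicated for free
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(content)
    return path


def cas_verify(path):
    """Check integrity by re-hashing the file and comparing to its name."""
    path = Path(path)
    return hashlib.sha256(path.read_bytes()).hexdigest() == path.name
```

Storing the same bytes twice returns the same path without a second write, and `cas_verify` fails as soon as a file's content no longer matches its name. Lookup by hash needs no index because the path is computed directly from the digest.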
Document Lifecycle
Documents progress through these statuses:
| Status | Meaning |
|---|---|
| `canonical` | Text extracted, canonical markdown stored in CAS |
| `classified` | Classification rules applied, document class assigned |
| `error` | Parse or extraction failed (corrupt file, invalid format) |
| `image_only` | Scanned PDF with no extractable text |
| `unsupported` | File format has no parser (e.g., .doc, .ppt, .xls legacy formats) |
| `deduplicated` | Removed from the active set by knowledge dedup |
| `dismissed` | Marked unrecoverable by the user |
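
To see how documents are distributed across these statuses, a query along the following lines may help. It is a sketch that assumes the `documents` table has a text column named `status` holding the values listed above; the column name is an assumption, since the schema is not shown here, and `count_by_status` is a hypothetical helper.

```python
import sqlite3

# Status values taken from the lifecycle table above.
KNOWN_STATUSES = (
    "canonical",
    "classified",
    "error",
    "image_only",
    "unsupported",
    "deduplicated",
    "dismissed",
)


def count_by_status(conn):
    """Return {status: document count}, including statuses with zero rows.

    Assumes a `documents.status` column; adjust to the real schema.
    """
    counts = dict.fromkeys(KNOWN_STATUSES, 0)
    query = "SELECT status, COUNT(*) FROM documents GROUP BY status"
    for status, n in conn.execute(query):
        counts[status] = n  # unknown statuses are kept rather than dropped
    return counts
```

Pre-filling the dictionary with every known status makes missing categories show up as explicit zeros, which is easier to scan than an absent key.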
Overriding the Data Directory
Set a custom data directory with --data-dir or the UNSTERWERX_DATA environment variable:
```bash
unsterwerx --data-dir /path/to/data status
UNSTERWERX_DATA=/path/to/data unsterwerx status
```
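
A script that reads the data directory alongside the CLI needs to resolve the same location. The sketch below shows one plausible resolution order: explicit value first, then `UNSTERWERX_DATA`, then the default `~/.unsterwerx/`. The precedence of the flag over the environment variable is an assumption (it is the common CLI convention but is not stated above), and `resolve_data_dir` is a hypothetical helper.

```python
import os
from pathlib import Path


def resolve_data_dir(cli_value=None, environ=None):
    """Pick the data directory: explicit value, then env var, then default.

    Assumption: an explicit --data-dir value wins over UNSTERWERX_DATA.
    """
    environ = os.environ if environ is None else environ
    if cli_value:
        return Path(cli_value).expanduser()
    env_value = environ.get("UNSTERWERX_DATA")
    if env_value:
        return Path(env_value).expanduser()
    return Path.home() / ".unsterwerx"
```

Passing `environ` explicitly keeps the function easy to test without touching the real process environment.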