Configuration Reference
Unsterwerx stores its configuration in TOML format in the data directory. To view the current configuration, run `unsterwerx config show`.
[general]
| Key | Type | Default | Description |
|---|---|---|---|
| general.data_dir | string | ~/.unsterwerx | Data directory path |
[ingest]
| Key | Type | Default | Description |
|---|---|---|---|
| ingest.extensions | string[] | ["pdf", "docx", "xlsx", "pptx", "doc", "xls", "ppt", "txt", "csv", "rtf"] | File extensions to process during ingestion |
| ingest.max_file_size | integer | 524288000 (500 MB) | Maximum scan/discovery file size in bytes |
| ingest.max_size_file | integer | 104857600 (100 MB) | Maximum parse-stage file size in bytes for in-memory parsers |
| ingest.skip_hidden | boolean | true | Skip hidden files (starting with .) |
| ingest.follow_symlinks | boolean | false | Follow symbolic links during directory traversal |
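As a rough illustration of how these settings might combine during discovery, the Python sketch below filters candidate paths by extension, hidden-file status, and size. The function name `should_scan` and the case-insensitive extension match are assumptions made for illustration, not Unsterwerx's actual implementation; only the default values come from the table above.

```python
from pathlib import PurePosixPath

# Defaults copied from the [ingest] table.
EXTENSIONS = {"pdf", "docx", "xlsx", "pptx", "doc", "xls", "ppt", "txt", "csv", "rtf"}
MAX_FILE_SIZE = 524_288_000  # scan/discovery limit in bytes (500 MB)
SKIP_HIDDEN = True


def should_scan(path: str, size_bytes: int) -> bool:
    """Hypothetical discovery filter; not the tool's real code."""
    p = PurePosixPath(path)
    # A file is considered hidden when its name starts with a dot.
    if SKIP_HIDDEN and p.name.startswith("."):
        return False
    # Compare the extension without its dot, case-insensitively (assumption).
    if p.suffix.lstrip(".").lower() not in EXTENSIONS:
        return False
    # Files larger than the discovery limit are skipped.
    return size_bytes <= MAX_FILE_SIZE


for candidate, size in [("docs/report.pdf", 2048), ("docs/.draft.docx", 2048), ("tool.exe", 2048)]:
    print(candidate, should_scan(candidate, size))
```

The parse-stage limit would be a second, smaller check applied later, only to files handed to in-memory parsers.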
[similarity]
| Key | Type | Default | Description |
|---|---|---|---|
| similarity.shingle_k | integer | 3 | Shingle size (number of tokens per shingle) |
| similarity.num_hashes | integer | 128 | Number of MinHash hash functions |
| similarity.lsh_bands | integer | 32 | Number of LSH bands |
| similarity.lsh_rows | integer | 4 | Number of rows per LSH band |
| similarity.threshold | float | 0.3 | Jaccard similarity threshold |
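The band and row settings interact. In standard MinHash LSH banding, a signature of `b * r` hashes is split into `b` bands of `r` rows, and a pair with Jaccard similarity `s` becomes a candidate with probability `1 - (1 - s^r)^b`. The sketch below evaluates that textbook formula for the defaults. The function names are illustrative, and whether Unsterwerx applies exactly this banding scheme internally is an assumption.

```python
def candidate_probability(s: float, bands: int, rows: int) -> float:
    """Probability that a pair with Jaccard similarity s shares at least one LSH band."""
    return 1.0 - (1.0 - s ** rows) ** bands


def check_lsh_shape(num_hashes: int, bands: int, rows: int) -> None:
    """Raise if the signature length does not split exactly into bands of rows."""
    if num_hashes != bands * rows:
        raise ValueError(
            f"num_hashes={num_hashes} but lsh_bands * lsh_rows = {bands * rows}"
        )


check_lsh_shape(128, 32, 4)  # the documented defaults satisfy the constraint
for s in (0.2, 0.3, 0.5, 0.8):
    print(f"s={s:.1f} -> {candidate_probability(s, 32, 4):.3f}")
```

For 32 bands of 4 rows, the curve rises most steeply near `(1/32)^(1/4)`, roughly 0.42, so pairs well above that similarity are very likely to be compared, while pairs well below it are mostly filtered out before the Jaccard threshold is applied.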
[storage]
| Key | Type | Default | Description |
|---|---|---|---|
| storage.journal_mode | string | "wal" | SQLite journal mode (wal recommended) |
| storage.busy_timeout_ms | integer | 5000 | SQLite busy timeout in milliseconds |
| storage.zstd_level | integer | 3 | Zstandard compression level for diff payloads |
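The first two settings correspond to standard SQLite pragmas. The sketch below shows what applying them looks like with Python's built-in `sqlite3` module. It is an illustration of the pragmas themselves, not of how Unsterwerx opens its database, and the file name `example.db` is hypothetical.

```python
import os
import sqlite3
import tempfile

JOURNAL_MODE = "wal"    # storage.journal_mode
BUSY_TIMEOUT_MS = 5000  # storage.busy_timeout_ms


def open_db(path: str) -> sqlite3.Connection:
    """Open a SQLite database and apply the two pragmas (illustrative only)."""
    # PRAGMA arguments cannot be bound as parameters, so validate before formatting.
    if JOURNAL_MODE not in {"wal", "delete", "truncate", "persist", "memory", "off"}:
        raise ValueError(f"unsupported journal mode: {JOURNAL_MODE}")
    conn = sqlite3.connect(path)
    conn.execute(f"PRAGMA journal_mode = {JOURNAL_MODE}")
    conn.execute(f"PRAGMA busy_timeout = {int(BUSY_TIMEOUT_MS)}")
    return conn


with tempfile.TemporaryDirectory() as tmp:
    conn = open_db(os.path.join(tmp, "example.db"))  # hypothetical file name
    # Read the pragmas back to confirm what SQLite actually applied.
    mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
    timeout = conn.execute("PRAGMA busy_timeout").fetchone()[0]
    conn.close()
    print(mode, timeout)
```

WAL mode needs a file-backed database; an in-memory database reports `memory` as its journal mode instead.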
[knowledge]
| Key | Type | Default | Description |
|---|---|---|---|
| knowledge.feature_version | integer | 1 | Feature version. Bump it to force full recomputation of semantic features |
| knowledge.temporal_scale_secs | float | 86400.0 | Scale for temporal proximity in seconds (86400 = 24 hours) |
| knowledge.feedback_weight | float | 3.0 | Weight multiplier for user feedback labels in Bayesian training |
| knowledge.negative_ratio | float | 2.0 | Maximum negative samples as ratio of positive count |
| knowledge.min_bootstrap_confidence | float | 0.5 | Minimum confidence threshold for bootstrap labels |
| knowledge.bootstrap_threshold | float | 0.7 | Jaccard threshold for bootstrap positive seed labels |
| knowledge.dedup_threshold | float | 0.8 | Default posterior threshold for knowledge dedup scan/apply |
| knowledge.vectors.threshold | float | 0.5 | Posterior threshold for clustering documents into knowledge vectors |
| knowledge.vectors.min_vector_size | integer | 2 | Minimum cluster size required to persist a vector |
| knowledge.vectors.edge_threshold | float | 0.3 | Posterior threshold for inter-vector edges |
| knowledge.vectors.max_vector_size | integer | 50 | Warning threshold for unusually large vectors |
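To make two of these settings concrete, here is a small sketch of how `negative_ratio` and `feedback_weight` could shape a training set. The helper names, the `"user"` and `"bootstrap"` source tags, and the baseline weight of 1.0 for bootstrap labels are all assumptions for illustration; the actual training code may differ.

```python
FEEDBACK_WEIGHT = 3.0  # knowledge.feedback_weight
NEGATIVE_RATIO = 2.0   # knowledge.negative_ratio


def max_negatives(num_positives: int, ratio: float = NEGATIVE_RATIO) -> int:
    """Upper bound on negative samples for a given number of positive samples."""
    return int(num_positives * ratio)


def sample_weight(source: str, feedback_weight: float = FEEDBACK_WEIGHT) -> float:
    """Weight for one training label; user feedback is scaled, other labels are not."""
    return feedback_weight if source == "user" else 1.0


print(max_negatives(10))           # cap on negatives for 10 positives
print(sample_weight("user"))       # weight of a human-provided label
print(sample_weight("bootstrap"))  # weight of an automated label
```

Under these assumptions, one user label counts as much as three bootstrap labels at the default weight.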
Setting Values
# Get a value
unsterwerx config get similarity.threshold
0.3
# Set a value
unsterwerx config set similarity.threshold 0.5
# View all settings
unsterwerx config show
Notes
- `num_hashes` must equal `lsh_bands * lsh_rows`. The defaults (128 = 32 × 4) are balanced for accuracy and performance.
- Lower `shingle_k` values catch more fine-grained similarity but increase false positives. Higher values require more exact text matches.
- WAL journal mode is strongly recommended for concurrent read access. Switching to `delete` mode may cause lock contention.
- Zstd compression level 3 provides a good balance of compression ratio and speed. Higher levels (up to 22) compress better but are significantly slower.
- Changing any `knowledge.*` parameter that affects training (all except `feature_version`) automatically invalidates the model on the next knowledge build, triggering a retrain. The system tracks this via a SHA-256 config hash.
- The `feedback_weight` controls how much human labels influence the model relative to bootstrap labels. Values above 1.0 give user feedback more influence than automated labels.
- Increasing `negative_ratio` produces more negative training samples per positive, which can improve precision at the cost of recall.
- The `knowledge.dedup_threshold` only changes the default BI dedup cutoff. It does not retrain the Bayesian model.
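The config-hash mechanism mentioned above can be pictured as hashing only the training-relevant values, so that edits to other keys leave the hash unchanged. The sketch below is one way to do that with `hashlib`. The choice of keys in `TRAINING_KEYS` and the JSON canonicalization are assumptions for illustration, since the exact inputs to Unsterwerx's hash are not documented here.

```python
import hashlib
import json

# Hypothetical subset of training-relevant keys; the real set is an assumption.
TRAINING_KEYS = (
    "temporal_scale_secs",
    "feedback_weight",
    "negative_ratio",
    "min_bootstrap_confidence",
    "bootstrap_threshold",
)


def config_hash(knowledge: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the training-relevant values."""
    relevant = {key: knowledge[key] for key in TRAINING_KEYS}
    payload = json.dumps(relevant, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


defaults = {
    "feature_version": 1,
    "temporal_scale_secs": 86400.0,
    "feedback_weight": 3.0,
    "negative_ratio": 2.0,
    "min_bootstrap_confidence": 0.5,
    "bootstrap_threshold": 0.7,
    "dedup_threshold": 0.8,
}

baseline = config_hash(defaults)
# Changing a hashed key alters the digest, which would signal a retrain.
retrain_needed = config_hash({**defaults, "feedback_weight": 4.0}) != baseline
# Changing a key outside the hashed set leaves the digest untouched.
dedup_only = config_hash({**defaults, "dedup_threshold": 0.9}) != baseline
print(retrain_needed, dedup_only)
```

Sorting the keys before hashing keeps the digest stable regardless of the order in which settings appear in the file.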