Configuration Reference

Unsterwerx configuration is stored in TOML format in the data directory. View the current configuration with `unsterwerx config show`.

[general]

Key	Type	Default	Description
`general.data_dir`	string	`~/.unsterwerx`	Data directory path

[ingest]

Key	Type	Default	Description
`ingest.extensions`	string[]	`["pdf", "docx", "xlsx", "pptx", "doc", "xls", "ppt", "txt", "csv", "rtf"]`	File extensions to process during ingestion
`ingest.max_file_size`	integer	`524288000` (500 MB)	Maximum scan/discovery file size in bytes
`ingest.max_size_file`	integer	`104857600` (100 MB)	Maximum parse-stage file size in bytes for in-memory parsers
`ingest.skip_hidden`	boolean	`true`	Skip hidden files (starting with `.`)
`ingest.follow_symlinks`	boolean	`false`	Follow symbolic links during directory traversal
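
The ingest settings above can be read as a two-stage filter: a file must first pass discovery (extension, hidden-file, and scan-size checks) and is then subject to the smaller parse-stage limit. The following is a minimal sketch under stated assumptions, not Unsterwerx's implementation: `classify` is a hypothetical helper, extension matching is assumed to be case-insensitive, only the file name (not parent directories) is checked for a leading dot, and the label `discover-only` is a placeholder because this reference does not say what happens to files between the two limits.

```python
from pathlib import PurePosixPath

# Defaults copied from the table above.
EXTENSIONS = {"pdf", "docx", "xlsx", "pptx", "doc", "xls", "ppt", "txt", "csv", "rtf"}
MAX_SCAN_SIZE = 500 * 1024 * 1024   # 524288000 bytes, scan/discovery limit
MAX_PARSE_SIZE = 100 * 1024 * 1024  # 104857600 bytes, parse-stage limit
SKIP_HIDDEN = True


def classify(path: str, size_bytes: int) -> str:
    """Return 'skip', 'discover-only', or 'parse' for one candidate file.

    Hypothetical helper; the category names are illustrative only.
    """
    p = PurePosixPath(path)
    # Hidden files are those whose name starts with a dot.
    if SKIP_HIDDEN and p.name.startswith("."):
        return "skip"
    # Assumption: extensions are compared without the dot, ignoring case.
    if p.suffix.lstrip(".").lower() not in EXTENSIONS:
        return "skip"
    # Files above the scan limit are never discovered.
    if size_bytes > MAX_SCAN_SIZE:
        return "skip"
    # Files above the parse limit are too large for in-memory parsers.
    if size_bytes > MAX_PARSE_SIZE:
        return "discover-only"
    return "parse"


print(classify("docs/report.pdf", 1024))
print(classify("docs/archive.pdf", 200 * 1024 * 1024))
```

The byte constants are written as products so the relationship between the integer defaults and their human-readable sizes stays visible.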

[similarity]

Key	Type	Default	Description
`similarity.shingle_k`	integer	`3`	Shingle size (number of tokens per shingle)
`similarity.num_hashes`	integer	`128`	Number of MinHash hash functions
`similarity.lsh_bands`	integer	`32`	Number of LSH bands
`similarity.lsh_rows`	integer	`4`	Number of rows per LSH band
`similarity.threshold`	float	`0.3`	Jaccard similarity threshold

[storage]

Key	Type	Default	Description
`storage.journal_mode`	string	`"wal"`	SQLite journal mode (`wal` recommended)
`storage.busy_timeout_ms`	integer	`5000`	SQLite busy timeout in milliseconds
`storage.zstd_level`	integer	`3`	Zstandard compression level for diff payloads

[knowledge]

Key	Type	Default	Description
`knowledge.feature_version`	integer	`1`	Feature version. Bump it to force full recomputation of semantic features
`knowledge.temporal_scale_secs`	float	`86400.0`	Scale for temporal proximity in seconds (86400 = 24 hours)
`knowledge.feedback_weight`	float	`3.0`	Weight multiplier for user feedback labels in Bayesian training
`knowledge.negative_ratio`	float	`2.0`	Maximum negative samples as ratio of positive count
`knowledge.min_bootstrap_confidence`	float	`0.5`	Minimum confidence threshold for bootstrap labels
`knowledge.bootstrap_threshold`	float	`0.7`	Jaccard threshold for bootstrap positive seed labels
`knowledge.dedup_threshold`	float	`0.8`	Default posterior threshold for `knowledge dedup scan/apply`
`knowledge.vectors.threshold`	float	`0.5`	Posterior threshold for clustering documents into knowledge vectors
`knowledge.vectors.min_vector_size`	integer	`2`	Minimum cluster size required to persist a vector
`knowledge.vectors.edge_threshold`	float	`0.3`	Posterior threshold for inter-vector edges
`knowledge.vectors.max_vector_size`	integer	`50`	Warning threshold for unusually large vectors
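
To illustrate how the `knowledge.vectors.*` settings might interact, the sketch below groups documents by linking every pair whose posterior meets `threshold`, drops groups smaller than `min_vector_size`, and flags groups larger than `max_vector_size`. This is an assumption-heavy illustration: this reference does not describe the clustering algorithm, so connected components (via union-find) is a stand-in, and `cluster_vectors` and its input format are hypothetical.

```python
from collections import defaultdict


def cluster_vectors(pairs, threshold=0.5, min_vector_size=2, max_vector_size=50):
    """Group documents into clusters from (doc_a, doc_b, posterior) triples.

    Returns (vectors, oversized): clusters that would be persisted, and the
    subset of those that exceed the warning threshold.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, posterior in pairs:
        root_a, root_b = find(a), find(b)
        # Assumption: a pair is linked when its posterior is >= threshold.
        if posterior >= threshold:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for doc in list(parent):
        groups[find(doc)].add(doc)

    vectors = [g for g in groups.values() if len(g) >= min_vector_size]
    oversized = [g for g in vectors if len(g) > max_vector_size]
    return vectors, oversized


pairs = [("a", "b", 0.9), ("b", "c", 0.6), ("c", "d", 0.2), ("e", "f", 0.4)]
vectors, oversized = cluster_vectors(pairs)
print(vectors, oversized)
```

In this toy input only `a`, `b`, and `c` end up in a persisted cluster; `d`, `e`, and `f` stay as singletons and fall below the default minimum size of 2.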

```bash
# Get a value
unsterwerx config get similarity.threshold
# Output: 0.3

# Set a value
unsterwerx config set similarity.threshold 0.5

# View all settings
unsterwerx config show
```

`similarity.num_hashes` must equal `similarity.lsh_bands` × `similarity.lsh_rows`. The defaults (128 = 32 × 4) balance accuracy and performance.
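
The constraint and its effect on candidate selection can be checked numerically. The sketch below assumes the standard MinHash banding scheme, in which two documents with Jaccard similarity s share at least one identical band with probability 1 − (1 − s^r)^b for b bands of r rows; whether Unsterwerx follows this scheme exactly is an assumption, and the function names are illustrative.

```python
def check_lsh(num_hashes: int, bands: int, rows: int) -> None:
    """Raise if the signature length does not split evenly into bands x rows."""
    if num_hashes != bands * rows:
        raise ValueError(
            f"num_hashes ({num_hashes}) must equal lsh_bands * lsh_rows "
            f"({bands} * {rows} = {bands * rows})"
        )


def candidate_probability(s: float, bands: int, rows: int) -> float:
    """Probability that a pair with Jaccard similarity s collides in some band."""
    return 1.0 - (1.0 - s ** rows) ** bands


def approx_threshold(bands: int, rows: int) -> float:
    """Similarity near which the candidate probability rises most steeply."""
    return (1.0 / bands) ** (1.0 / rows)


check_lsh(128, 32, 4)  # the defaults pass
for s in (0.3, 0.5, 0.8):
    print(s, round(candidate_probability(s, 32, 4), 3))
print(round(approx_threshold(32, 4), 3))
```

With the defaults, the steep part of the curve sits at a similarity of roughly 0.42, so pairs near the 0.3 threshold become LSH candidates only about 23% of the time under this model. Raising `lsh_bands` (and `num_hashes` with it) shifts the curve toward lower similarities.
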
Lower `shingle_k` values catch finer-grained similarity but increase false positives. Higher values require longer runs of identical tokens before two documents register as similar.
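
A small worked example makes the effect of the shingle size concrete. The sketch assumes whitespace tokenization with lowercasing, which may differ from the tokenizer Unsterwerx uses; `shingles` and `jaccard` are illustrative helpers.

```python
def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-token shingles of a text (whitespace tokens, lowercased)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


doc_a = "the quick brown fox jumps"
doc_b = "the quick brown fox sleeps"

for k in (1, 2, 3):
    print(k, round(jaccard(shingles(doc_a, k), shingles(doc_b, k)), 3))
```

For these two sentences the similarity drops from about 0.667 at k = 1 to 0.5 at k = 3, because a single differing word invalidates every shingle that contains it.
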
WAL journal mode is strongly recommended for concurrent read access. Switching to `delete` mode may cause lock contention.
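
For readers who want to see what these two storage settings mean at the SQLite level, the sketch below applies them through the `journal_mode` and `busy_timeout` pragmas using Python's `sqlite3` module. It is assumed, not confirmed by this reference, that Unsterwerx applies the settings through these pragmas; `open_database` is a hypothetical helper.

```python
import os
import sqlite3
import tempfile

ALLOWED_JOURNAL_MODES = {"delete", "truncate", "persist", "memory", "wal", "off"}


def open_database(path: str, journal_mode: str = "wal", busy_timeout_ms: int = 5000):
    """Open a SQLite database and apply journal mode and busy timeout.

    Returns the connection and the journal mode SQLite reports as in effect.
    """
    if journal_mode.lower() not in ALLOWED_JOURNAL_MODES:
        raise ValueError(f"unsupported journal mode: {journal_mode!r}")
    conn = sqlite3.connect(path)
    # Wait up to busy_timeout_ms for a lock instead of failing immediately.
    conn.execute(f"PRAGMA busy_timeout = {int(busy_timeout_ms)}")
    # The pragma returns the resulting mode, which can differ from the request
    # (for example on in-memory databases).
    effective = conn.execute(f"PRAGMA journal_mode = {journal_mode}").fetchone()[0]
    return conn, effective


with tempfile.TemporaryDirectory() as tmp:
    conn, mode = open_database(os.path.join(tmp, "example.db"))
    print(mode)
    print(conn.execute("PRAGMA busy_timeout").fetchone()[0])
    conn.close()
```

Checking the returned mode is worthwhile because SQLite silently keeps the previous mode when a switch is not possible.
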
Zstd compression level 3 provides a good balance of compression ratio and speed. Higher levels (up to 22) compress better but are significantly slower.
Changing any `knowledge.*` parameter that affects training (all except `feature_version`) automatically invalidates the model on the next `knowledge build`, triggering a retrain. The system tracks this via a SHA-256 config hash.
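
The invalidation mechanism can be pictured as hashing a canonical serialization of the training-relevant settings and comparing it with the hash stored alongside the model. The sketch below is an assumption about the general approach, not the actual format: the JSON serialization, the choice of keys, and the name `training_config_hash` are all hypothetical. It excludes `feature_version` as stated above and `dedup_threshold` per the note on `knowledge.dedup_threshold` in this reference.

```python
import hashlib
import json

# Assumption: keys that do not feed the training hash.
NON_TRAINING_KEYS = {"feature_version", "dedup_threshold"}


def training_config_hash(knowledge: dict) -> str:
    """Return a SHA-256 hex digest over the training-relevant knowledge settings."""
    relevant = {k: v for k, v in knowledge.items() if k not in NON_TRAINING_KEYS}
    # Sorted keys and fixed separators make the serialization deterministic.
    canonical = json.dumps(relevant, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


defaults = {
    "feature_version": 1,
    "temporal_scale_secs": 86400.0,
    "feedback_weight": 3.0,
    "negative_ratio": 2.0,
    "min_bootstrap_confidence": 0.5,
    "bootstrap_threshold": 0.7,
    "dedup_threshold": 0.8,
}

stored = training_config_hash(defaults)
changed = training_config_hash({**defaults, "feedback_weight": 5.0})
print(stored != changed)  # a differing hash would mark the model as stale
```
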
`knowledge.feedback_weight` controls how strongly human labels influence the model relative to bootstrap labels. Values above 1.0 give user feedback more influence than automated labels.
Increasing `knowledge.negative_ratio` raises the cap on negative training samples per positive, which can improve precision at the cost of recall.
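
The arithmetic behind `feedback_weight` and `negative_ratio` can be sketched as follows. The assumptions are that bootstrap labels carry a weight of 1.0, that the negative cap is the positive count times the ratio rounded down, and that weights act as simple multipliers on label counts; how the Bayesian trainer actually consumes them is not described in this reference, and both function names are hypothetical.

```python
import math


def cap_negatives(num_positive: int, num_negative_available: int,
                  negative_ratio: float = 2.0) -> int:
    """Number of negative samples to keep, capped at negative_ratio x positives."""
    limit = math.floor(num_positive * negative_ratio)
    return min(num_negative_available, limit)


def label_weight(source: str, feedback_weight: float = 3.0) -> float:
    """Weight of one label: user feedback is scaled, bootstrap labels count once."""
    return feedback_weight if source == "feedback" else 1.0


# 10 positives with the default ratio allow at most 20 negatives.
print(cap_negatives(10, 100))

# Two bootstrap "duplicate" labels versus one user "not duplicate" label.
positive_mass = 2 * label_weight("bootstrap")
negative_mass = 1 * label_weight("feedback")
print(positive_mass, negative_mass)
```

Under these assumptions a single user label outweighs two conflicting bootstrap labels at the default weight of 3.0.
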
`knowledge.dedup_threshold` changes only the default BI dedup cutoff. It does not retrain the Bayesian model.