knowledge
Builds and manages Bayesian knowledge scores for document pairs, clusters related documents into knowledge vectors, and applies Business Intelligence dedup within those vectors. It uses a Naive Bayes model trained on bootstrap labels derived from similarity candidates plus user feedback to compute posterior probabilities for document relatedness inside the Shared Sandbox.
This implements the Business Intelligence layer of the TCA (Trusted Client-Centric Application Architecture), persisting pair scores, vector graph state, and dedup actions in the Universal Data Module while operating on content normalized into the Universal Data Set.
Subcommands
build
Builds semantic features from the Universal Data Set, trains or reuses a Bayesian model, and scores all similarity candidate pairs stored in the Universal Data Module.
unsterwerx knowledge build [OPTIONS]
Options
| Option | Type | Default | Description |
|---|---|---|---|
| --retrain | flag | | Force model retrain even if inputs are unchanged |
| --evaluate | flag | | Print evaluation metrics after scoring |
| --top | integer | 20 | Number of top-scored pairs to display |
Pipeline
- Preflight: verifies that similarity candidates and canonical records exist
- Semantic features: computes TF-IDF corpus statistics and per-pair feature vectors
- Label generation: creates bootstrap labels from similarity candidates
- Training: trains a Laplace-smoothed Naive Bayes model with weighted labels
- Scoring: computes the posterior P(duplicate | features) for each candidate pair
- Evaluation: runs an optional consistency check plus user-feedback precision and recall
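The training and scoring steps can be illustrated with a small sketch. It is not the tool's implementation: it assumes features lie in [0, 1] and are discretized into equal-width bins (the source does not say how continuous features are modeled), and the names `BINS`, `train`, and `posterior_duplicate` are hypothetical. It shows how label weights enter the counts and how Laplace smoothing keeps unseen bins from producing zero probabilities.

```python
import math
from collections import defaultdict

BINS = 4  # assumption: each feature in [0, 1] is discretized into 4 equal-width bins
CLASSES = ("duplicate", "unrelated")

def bin_of(value: float) -> int:
    """Map a feature value in [0, 1] to a bin index in [0, BINS - 1]."""
    return min(int(value * BINS), BINS - 1)

def train(examples):
    """examples: iterable of (features dict, label, weight). Returns weighted counts."""
    class_weight = defaultdict(float)
    counts = defaultdict(float)  # (label, feature name, bin) -> weighted count
    for features, label, weight in examples:
        class_weight[label] += weight
        for name, value in features.items():
            counts[(label, name, bin_of(value))] += weight
    return dict(class_weight), dict(counts)

def posterior_duplicate(model, features) -> float:
    """P(duplicate | features) with add-one (Laplace) smoothing."""
    class_weight, counts = model
    total = sum(class_weight.values())
    log_scores = {}
    for label in CLASSES:
        weight = class_weight.get(label, 0.0)
        # Smoothed class prior, then one smoothed likelihood term per feature.
        log_p = math.log((weight + 1.0) / (total + len(CLASSES)))
        for name, value in features.items():
            seen = counts.get((label, name, bin_of(value)), 0.0)
            log_p += math.log((seen + 1.0) / (weight + BINS))
        log_scores[label] = log_p
    # Normalize in log space for numerical stability.
    peak = max(log_scores.values())
    scaled = {label: math.exp(score - peak) for label, score in log_scores.items()}
    return scaled["duplicate"] / sum(scaled.values())

# Illustrative training data; the weight 3.0 mimics a user feedback label.
examples = [
    ({"jaccard": 0.95, "cosine": 0.98}, "duplicate", 1.0),
    ({"jaccard": 0.90, "cosine": 0.96}, "duplicate", 3.0),
    ({"jaccard": 0.10, "cosine": 0.20}, "unrelated", 1.0),
    ({"jaccard": 0.05, "cosine": 0.15}, "unrelated", 1.0),
]
model = train(examples)
high = posterior_duplicate(model, {"jaccard": 0.97, "cosine": 0.99})
low = posterior_duplicate(model, {"jaccard": 0.08, "cosine": 0.10})
```

In this toy data set, `high` lands well above the 0.9 band and `low` well below 0.5, which matches the intuition that weighted counts pull the posterior toward the class whose bins the pair falls into.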
Automatic Invalidation
The model automatically retrains when any of the following change:
- Config changes: modifications to knowledge.bootstrap_threshold, knowledge.feedback_weight, knowledge.negative_ratio, knowledge.min_bootstrap_confidence, or knowledge.temporal_scale_secs are detected via config hash comparison
- New labels: any new label events since the last training run, tracked by event ID rather than timestamp for deterministic invalidation
- Feature version: bumping knowledge.feature_version in config forces full recomputation
- New IDF snapshot: corpus changes that produce a new IDF snapshot trigger retraining
Use --retrain to force a rebuild regardless of invalidation state.
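The four invalidation triggers can be pictured as a single comparison between the state recorded at the last training run and the current state. The sketch below is only an illustration: the hashing scheme (SHA-256 over canonical JSON), the record field names, and all config values except the documented feedback_weight default of 3.0 are assumptions, not the tool's actual storage format.

```python
import hashlib
import json

# The five keys the text lists as inputs to the config hash.
HASHED_KEYS = (
    "bootstrap_threshold",
    "feedback_weight",
    "negative_ratio",
    "min_bootstrap_confidence",
    "temporal_scale_secs",
)

def config_hash(knowledge_cfg: dict) -> str:
    """Hash only the training-relevant keys, in a canonical key order."""
    subset = {key: knowledge_cfg.get(key) for key in HASHED_KEYS}
    canonical = json.dumps(subset, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def needs_retrain(last_run: dict, current: dict) -> bool:
    """True if any of the four documented triggers fired since the last run."""
    return (
        current["config_hash"] != last_run["config_hash"]
        or current["max_label_event_id"] > last_run["max_label_event_id"]
        or current["feature_version"] != last_run["feature_version"]
        or current["idf_snapshot_id"] != last_run["idf_snapshot_id"]
    )

# Hypothetical values for illustration only.
cfg = {
    "bootstrap_threshold": 0.8,
    "feedback_weight": 3.0,
    "negative_ratio": 2.0,
    "min_bootstrap_confidence": 0.5,
    "temporal_scale_secs": 86400,
}
last_run = {
    "config_hash": config_hash(cfg),
    "max_label_event_id": 41,
    "feature_version": 1,
    "idf_snapshot_id": 1,
}
unchanged = dict(last_run)
new_label = {**last_run, "max_label_event_id": 42}
new_config = {**last_run, "config_hash": config_hash({**cfg, "feedback_weight": 5.0})}
```

Comparing a monotonically increasing event ID, rather than a timestamp, avoids clock skew and same-second ties, which is the determinism the text refers to.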
Example
unsterwerx knowledge build --evaluate --top 5
Preflight checks...
All prerequisites met.
Building semantic features...
Corpus: 1807 docs, 2939590 unique terms (IDF snapshot #1)
Training Bayesian model...
Bootstrap labels: 318 positive, 636 negative
Model trained: run #4, P(dup)=0.302, P(unrel)=0.698
Scoring candidates...
Timing: Semantic: 1.5s | Scoring: 0.1s | Total: 2.5s
Candidates scored: 371
Evaluation:
Post-train consistency: 100.0%
User feedback labels: 2
Feedback precision: 100.0%
Feedback recall: 100.0%
Feedback F1: 100.0%
Unscored feedback: 1 (pairs not in similarity candidates)
Top 5 pairs by posterior:
Doc A Doc B Posterior Jaccard Cosine
--------------------------------------------------------------------------------------------------------------
0b1c8023-6a4b-49ac-822d-5e2840ff7d38 f34e4f01-f7a8-44f2-aeae-d02630feb5c9 1.000 1.000 0.988
ab2fae57-21e1-44c6-b14f-70d465d951ab d77633f0-1236-454d-9e3e-05d49bf4b4e2 1.000 1.000 1.000
4e425c33-835b-4621-8151-caf7c73d734c 636b1037-ffc6-4324-8c34-41d5a67d5f48 1.000 0.805 0.962
576ef4bb-946b-46f9-81ed-863939069d0e 5b7978db-e46a-460c-85e6-47e867be31f5 1.000 1.000 1.000
002f63f9-66dd-4c47-9317-9770fd7b78bb 7a197541-eec1-4a41-a22c-61b926c09587 1.000 0.914 0.964
labels
Manage training labels for document pairs.
labels add
Add a user feedback label for a document pair.
unsterwerx knowledge labels add --label <LABEL> <DOC_A> <DOC_B>
| Argument/Option | Type | Description |
|---|---|---|
| DOC_A | string | First document ID |
| DOC_B | string | Second document ID |
| --label | string | duplicate_or_same_concept or unrelated |
User feedback labels take precedence over bootstrap labels during training (weighted by knowledge.feedback_weight, default 3.0).
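The precedence rule can be sketched as a merge keyed on the unordered document pair. This is a minimal illustration under stated assumptions: the tuple layouts, the function names, and the choice to use bootstrap confidence as the bootstrap weight are hypothetical; only the default weight of 3.0 and the source names come from the text.

```python
FEEDBACK_WEIGHT = 3.0  # knowledge.feedback_weight default stated in the text

def pair_key(doc_a: str, doc_b: str) -> tuple:
    """Order-independent key so (A, B) and (B, A) refer to the same pair."""
    return tuple(sorted((doc_a, doc_b)))

def merge_labels(bootstrap, feedback, feedback_weight=FEEDBACK_WEIGHT):
    """Build one training label per pair; user feedback replaces bootstrap labels.

    bootstrap: iterable of (doc_a, doc_b, label, confidence)
    feedback:  iterable of (doc_a, doc_b, label)
    Returns {pair_key: (label, weight, source)}.
    """
    merged = {}
    for doc_a, doc_b, label, confidence in bootstrap:
        # Assumption: a bootstrap label is weighted by its confidence.
        merged[pair_key(doc_a, doc_b)] = (label, confidence, "bootstrap")
    for doc_a, doc_b, label in feedback:
        # Applied second, so feedback overrides any bootstrap label for the pair.
        merged[pair_key(doc_a, doc_b)] = (label, feedback_weight, "user_feedback")
    return merged

bootstrap = [
    ("doc-1", "doc-2", "duplicate_or_same_concept", 1.0),
    ("doc-3", "doc-4", "unrelated", 1.0),
]
feedback = [("doc-2", "doc-1", "unrelated")]  # reversed order, same pair
labels = merge_labels(bootstrap, feedback)
```

After the merge, the pair doc-1 / doc-2 carries the user's `unrelated` label at weight 3.0, while doc-3 / doc-4 keeps its bootstrap label.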
If the labeled pair is not a similarity candidate, it is scored ad-hoc using the current model and the result is persisted:
unsterwerx knowledge labels add --label unrelated \
0b1c8023-6a4b-49ac-822d-5e2840ff7d38 \
6a4f7b82-ed2b-4c86-95c8-352aa082a17a
Label added: 0b1c8023-6a4 / 6a4f7b82-ed2 → unrelated
Note: pair is not a similarity candidate; scored ad hoc (posterior=0.000)
labels list
List all existing labels (bootstrap and user feedback).
unsterwerx knowledge labels list
Use --label to filter by the classification value (duplicate_or_same_concept, unrelated) and --source or --label-source to filter by how the label was created (user_feedback, bootstrap_near_duplicate, bootstrap_same_concept). The legacy --label-type flag is still accepted as a hidden compatibility alias.
unsterwerx knowledge labels list --label duplicate_or_same_concept
unsterwerx knowledge labels list --source user_feedback
Doc A Doc B Label Source Conf Created
----------------------------------------------------------------------------------------------------
5a188909-533 d888ccdc-f89 unrelated bootstrap_near_duplicate 1.00 2026-03-13 03:11:41
59d26f82-642 d82cd9af-5fd unrelated bootstrap_near_duplicate 1.00 2026-03-13 03:11:41
...
vectors
Build and query the knowledge vector graph in the Universal Data Module.
unsterwerx knowledge vectors <COMMAND>
vectors build
Cluster documents into vectors using Bayesian posterior scores derived from Business Intelligence scoring.
unsterwerx knowledge vectors build [--threshold <FLOAT>] [--min-vector-size <INT>] [--edge-threshold <FLOAT>] [--model-id <ID>] [--dry-run]
Use this after knowledge build to create or refresh the vector graph stored in the Universal Data Module.
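The source does not describe the clustering algorithm itself. One plausible reading of --threshold and --min-vector-size is connected components over pairs whose posterior meets the threshold, with small components discarded. The sketch below implements that reading with union-find. The function name and the default values 0.9 and 2 are assumptions chosen for illustration, not documented defaults.

```python
def build_vectors(scored_pairs, threshold=0.9, min_vector_size=2):
    """Group documents into vectors via connected components.

    scored_pairs: iterable of (doc_a, doc_b, posterior).
    Returns a list of sorted member lists, ordered by first member.
    """
    parent = {}

    def find(doc):
        parent.setdefault(doc, doc)
        while parent[doc] != doc:
            parent[doc] = parent[parent[doc]]  # path halving
            doc = parent[doc]
        return doc

    for doc_a, doc_b, posterior in scored_pairs:
        if posterior >= threshold:
            root_a, root_b = find(doc_a), find(doc_b)
            if root_a != root_b:
                parent[root_b] = root_a  # union the two components

    groups = {}
    for doc in list(parent):
        groups.setdefault(find(doc), set()).add(doc)
    vectors = [sorted(m) for m in groups.values() if len(m) >= min_vector_size]
    return sorted(vectors, key=lambda members: members[0])

pairs = [
    ("a", "b", 0.95),
    ("b", "c", 0.92),
    ("c", "d", 0.40),  # below threshold, so "d" joins no vector
    ("e", "f", 0.99),
]
vectors = build_vectors(pairs)
large_only = build_vectors(pairs, min_vector_size=3)
```

Under this reading, raising the threshold splits vectors apart and raising the minimum size drops small ones, which is worth previewing with --dry-run before persisting.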
vectors list / show / search / traverse
unsterwerx knowledge vectors list [--limit <INT>] [--min-confidence <FLOAT>]
unsterwerx knowledge vectors show <VECTOR_ID_OR_PREFIX>
unsterwerx knowledge vectors search <QUERY> [--limit <INT>]
unsterwerx knowledge vectors traverse <VECTOR_ID_OR_PREFIX> [--depth <INT>]
- list shows existing vectors with size and confidence summaries
- show prints vector members, their filenames, and connected edges
- search runs full-text search (FTS) over vector member content
- traverse walks the inter-vector graph outward from one vector
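The effect of --depth on traverse can be pictured as a depth-limited breadth-first walk. The sketch below assumes inter-vector edges are undirected and that depth counts hops from the starting vector. Both are assumptions, as are the function name and the edge representation.

```python
from collections import deque

def traverse(edges, start, depth=2):
    """Return {vector_id: hop distance} for vectors within `depth` hops of `start`."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    seen = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if seen[current] == depth:
            continue  # do not expand past the depth limit
        for nxt in sorted(neighbors.get(current, ())):
            if nxt not in seen:
                seen[nxt] = seen[current] + 1
                queue.append(nxt)
    return seen

edges = [("v1", "v2"), ("v2", "v3"), ("v3", "v4")]
reached = traverse(edges, "v1", depth=2)
```

With these edges, a depth of 2 from v1 reaches v2 and v3 but stops before v4.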
dedup
Scan vectors for redundant members and optionally collapse them inside the Shared Sandbox.
dedup scan
unsterwerx knowledge dedup scan [--threshold <FLOAT>] [--vector <ID_OR_PREFIX>] [--keep <DOC_ID>] [--model-id <ID>]
scan is read-only. It uses vector membership plus Bayesian posterior to choose a kept anchor and propose which lower-priority documents can be removed from the active Universal Data Set.
dedup apply
unsterwerx knowledge dedup apply [--threshold <FLOAT>] [--vector <ID_OR_PREFIX>] [--keep <DOC_ID>] [--model-id <ID>] [--dry-run] [--confirm]
apply computes rollback diffs in the Universal Data Module when canonical content is available, merges provenance into the kept document, marks removed documents as deduplicated inside the Shared Sandbox, audits each removal, and prints a reminder to rebuild the vector graph.
dedup list
unsterwerx knowledge dedup list [--limit <INT>]
Lists previously applied dedup rules with the kept document, removed documents, and the recorded timestamp. Use this to review past dedup decisions before considering rollback.
dedup show
unsterwerx knowledge dedup show <RULE_ID_OR_PREFIX>
Shows details of a specific dedup rule including the kept document, removed documents, provenance merge results, and rollback diff availability.
dedup rollback
unsterwerx knowledge dedup rollback <RULE_ID_OR_PREFIX>
Reverses a previously applied dedup rule. Restores removed documents from deduplicated status back to their prior state, using stored rollback diffs to reconstruct canonical content. Provenance changes are not reversed; the kept document retains merged provenance.
Understanding the Output
Posterior Score
The posterior probability P(duplicate | features) ranges from 0.0 to 1.0:
- >= 0.9: very likely duplicate or the same concept
- 0.5 to 0.9: possible relationship, worth reviewing
- < 0.5: likely unrelated
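When post-processing scores in a script, the three bands above translate directly into a small helper. The function name is hypothetical; the cutoffs are the ones listed.

```python
def posterior_band(posterior: float) -> str:
    """Map a posterior P(duplicate | features) to the interpretation bands above."""
    if not 0.0 <= posterior <= 1.0:
        raise ValueError("posterior must be in [0.0, 1.0]")
    if posterior >= 0.9:
        return "very likely duplicate or the same concept"
    if posterior >= 0.5:
        return "possible relationship, worth reviewing"
    return "likely unrelated"
```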
Features
Each pair is scored on six semantic features:
| Feature | Description |
|---|---|
| jaccard | Jaccard similarity from MinHash/LSH |
| cosine | TF-IDF cosine similarity |
| title_overlap | Normalized title/filename overlap |
| structural_overlap | Structural element similarity (headings, lists, tables) |
| temporal_proximity | How close the documents are in time (scaled by temporal_scale_secs) |
| source_weight_delta | Difference in provenance source weights |
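To make the first two features concrete, the sketch below computes them on tokenized text. It is a simplification: it uses exact set-based Jaccard, whereas the tool derives its value from MinHash/LSH (an estimate of the same quantity), and it uses raw term counts times IDF as the TF-IDF weighting, which is an assumption. The function names and the `idf` table are hypothetical.

```python
import math
from collections import Counter

def jaccard(tokens_a, tokens_b) -> float:
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| over token sets."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def tfidf_cosine(tokens_a, tokens_b, idf) -> float:
    """Cosine similarity between TF-IDF vectors (term count * IDF weight)."""
    vec_a = {t: c * idf.get(t, 0.0) for t, c in Counter(tokens_a).items()}
    vec_b = {t: c * idf.get(t, 0.0) for t, c in Counter(tokens_b).items()}
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Illustrative IDF weights; a real table would come from corpus statistics.
idf = {"invoice": 2.0, "march": 1.5, "draft": 1.0, "final": 1.0}
doc_a = ["invoice", "march", "draft"]
doc_b = ["invoice", "march", "final"]
j = jaccard(doc_a, doc_b)            # 2 shared tokens out of 4 distinct
c = tfidf_cosine(doc_a, doc_b, idf)  # rare shared terms dominate the score
```

Here the cosine comes out higher than the Jaccard value because the two shared terms carry the largest IDF weights, which is why the two features can disagree in the top-pairs table.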
Evaluation Metrics
- Post-train consistency: sanity check that the model reproduces its training signal (should be at least 95%)
- Feedback precision: of pairs predicted as duplicates, what fraction are actually duplicates based on user feedback
- Feedback recall: of actual duplicates based on user feedback, what fraction were predicted correctly
- Feedback F1: harmonic mean of precision and recall
- Unscored feedback: user feedback labels on pairs that are not similarity candidates; these were used in training but have no scored prediction to evaluate against
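The feedback metrics follow the standard definitions and can be reproduced from labels and posteriors. In the sketch below, the decision cutoff of 0.5 is an assumption (the source does not state where a posterior counts as a predicted duplicate), and the function and argument names are hypothetical.

```python
def feedback_metrics(feedback, posteriors, cutoff=0.5):
    """Compute precision, recall, F1, and the unscored count.

    feedback:   {pair: "duplicate_or_same_concept" | "unrelated"}
    posteriors: {pair: P(duplicate | features)} for scored pairs only
    """
    tp = fp = fn = unscored = 0
    for pair, label in feedback.items():
        if pair not in posteriors:
            unscored += 1  # labeled, but no scored prediction to compare against
            continue
        predicted_dup = posteriors[pair] >= cutoff
        actual_dup = label == "duplicate_or_same_concept"
        if predicted_dup and actual_dup:
            tp += 1
        elif predicted_dup:
            fp += 1
        elif actual_dup:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "unscored": unscored}

feedback = {
    ("a", "b"): "duplicate_or_same_concept",  # predicted duplicate: true positive
    ("c", "d"): "duplicate_or_same_concept",  # predicted unrelated: false negative
    ("e", "f"): "unrelated",                  # predicted duplicate: false positive
    ("g", "h"): "duplicate_or_same_concept",  # never scored
}
posteriors = {("a", "b"): 0.95, ("c", "d"): 0.30, ("e", "f"): 0.70}
metrics = feedback_metrics(feedback, posteriors)
```

With only a handful of feedback labels, as in the example run above, these percentages move in large steps, so they are best read as a sanity check rather than a stable benchmark.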
Notes
- Run unsterwerx similarity first to generate similarity candidates before building knowledge scores.
- The model is automatically retrained when config changes or new labels are added.
- Use --retrain to force a fresh model if you suspect stale results.
- Bootstrap labels are regenerated each run; they seed the model until enough user feedback accumulates.
- User feedback always overrides bootstrap labels for the same document pair.
- Each knowledge build run is recorded in the audit log.
- Deduplicated and archived documents are excluded from later scoring, vector rebuilds, and vector search.
- After knowledge dedup apply, run unsterwerx knowledge vectors build to refresh vector membership and edges.