knowledge

Builds and manages Bayesian knowledge scores for document pairs, clusters related documents into knowledge vectors, and applies Business Intelligence dedup within those vectors. A Naive Bayes model, trained on bootstrap labels derived from similarity candidates and on user feedback, computes the posterior probability that two documents inside the Shared Sandbox are related.

This implements the Business Intelligence layer of the TCA (Trusted Client-Centric Application Architecture), persisting pair scores, vector graph state, and dedup actions in the Universal Data Module while operating on content normalized into the Universal Data Set.

Subcommands

build

Builds semantic features from the Universal Data Set, trains or reuses a Bayesian model, and scores all similarity candidate pairs stored in the Universal Data Module.

bash

unsterwerx knowledge build [OPTIONS]

Options

Option	Type	Default	Description
`--retrain`	flag		Force model retrain even if inputs are unchanged
`--evaluate`	flag		Print evaluation metrics after scoring
`--top`	integer	20	Number of top-scored pairs to display

Pipeline

Preflight: verifies that similarity candidates and canonical records exist
Semantic features: computes TF-IDF corpus statistics and per-pair feature vectors
Label generation: creates bootstrap labels from similarity candidates
Training: trains a Laplace-smoothed Naive Bayes model with weighted labels
Scoring: computes posterior P(duplicate | features) for each candidate pair
Evaluation: when --evaluate is set, runs a post-train consistency check and reports precision, recall, and F1 against user feedback
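The training and scoring steps above can be sketched as follows. This is a minimal illustration, not the tool's implementation: it assumes feature values lie in [0, 1] and are discretized into equal-width bins so that Laplace smoothing applies to bin counts. The bin count, the function names, and the short class names are all hypothetical.

```python
import math
from collections import defaultdict

CLASSES = ("duplicate", "unrelated")
N_BINS = 4  # assumption: features in [0, 1] are split into equal-width bins


def to_bin(value, n_bins=N_BINS):
    """Map a feature value in [0, 1] to a bin index."""
    clamped = min(max(value, 0.0), 1.0)
    return min(int(clamped * n_bins), n_bins - 1)


def train(examples, alpha=1.0):
    """examples: iterable of (features: dict, label: str, weight: float)."""
    class_weight = defaultdict(float)
    bin_weight = defaultdict(float)  # (label, feature, bin) -> summed weight
    feature_names = set()
    for features, label, weight in examples:
        class_weight[label] += weight
        for name, value in features.items():
            feature_names.add(name)
            bin_weight[(label, name, to_bin(value))] += weight
    total = sum(class_weight.values())
    prior = {
        c: (class_weight[c] + alpha) / (total + alpha * len(CLASSES))
        for c in CLASSES
    }
    return {
        "prior": prior,
        "class_weight": dict(class_weight),
        "bin_weight": dict(bin_weight),
        "features": sorted(feature_names),
        "alpha": alpha,
    }


def posterior_duplicate(model, features):
    """Return P(duplicate | features), accumulating in log space."""
    log_scores = {}
    for c in CLASSES:
        score = math.log(model["prior"][c])
        denom = model["class_weight"].get(c, 0.0) + model["alpha"] * N_BINS
        for name in model["features"]:
            if name not in features:
                continue
            key = (c, name, to_bin(features[name]))
            count = model["bin_weight"].get(key, 0.0)
            score += math.log((count + model["alpha"]) / denom)
        log_scores[c] = score
    top = max(log_scores.values())  # subtract the max for numerical stability
    exp = {c: math.exp(s - top) for c, s in log_scores.items()}
    return exp["duplicate"] / sum(exp.values())
```

Because each label carries a weight, a feedback label moves the class and bin counts further than a bootstrap label does.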

Automatic Invalidation

The model automatically retrains when any of the following change:

Config changes: modifications to knowledge.bootstrap_threshold, knowledge.feedback_weight, knowledge.negative_ratio, knowledge.min_bootstrap_confidence, or knowledge.temporal_scale_secs are detected via config hash comparison
New labels: any new label events since the last training run, tracked by event ID rather than timestamp for deterministic invalidation
Feature version: bumping knowledge.feature_version in config forces full recomputation
New IDF snapshot: corpus changes that produce a new IDF snapshot trigger retraining

Use --retrain to force a rebuild regardless of invalidation state.
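The invalidation rules above amount to comparing the state recorded at the last training run with the current state. The sketch below shows one way that comparison could look. The state field names (config_hash, max_label_event_id, feature_version, idf_snapshot_id) and the choice of SHA-256 are assumptions for illustration, not the tool's actual schema.

```python
import hashlib
import json

# Config keys the section lists as affecting the trained model.
TRACKED_KEYS = (
    "bootstrap_threshold",
    "feedback_weight",
    "negative_ratio",
    "min_bootstrap_confidence",
    "temporal_scale_secs",
)


def config_hash(knowledge_config):
    """Hash only the tracked keys, serialized in a stable order."""
    tracked = {k: knowledge_config.get(k) for k in TRACKED_KEYS}
    payload = json.dumps(tracked, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def needs_retrain(last_run, current, force=False):
    """Return the reasons to retrain; an empty list means reuse the model."""
    reasons = []
    if force:
        reasons.append("forced")  # mirrors --retrain
    if last_run is None:
        return reasons + ["no previous run"]
    if last_run["config_hash"] != current["config_hash"]:
        reasons.append("config changed")
    # Event IDs only grow, so this comparison is deterministic,
    # unlike a timestamp comparison.
    if current["max_label_event_id"] > last_run["max_label_event_id"]:
        reasons.append("new labels")
    if current["feature_version"] != last_run["feature_version"]:
        reasons.append("feature version changed")
    if current["idf_snapshot_id"] != last_run["idf_snapshot_id"]:
        reasons.append("new IDF snapshot")
    return reasons
```

Hashing only the tracked keys means edits to unrelated config values do not trigger a retrain.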

Example

bash

unsterwerx knowledge build --evaluate --top 5

Preflight checks...
  All prerequisites met.

Building semantic features...
  Corpus: 1807 docs, 2939590 unique terms (IDF snapshot #1)

Training Bayesian model...
  Bootstrap labels: 318 positive, 636 negative
  Model trained: run #4, P(dup)=0.302, P(unrel)=0.698

Scoring candidates...

Timing: Semantic: 1.5s | Scoring: 0.1s | Total: 2.5s
Candidates scored: 371

Evaluation:
  Post-train consistency: 100.0%
  User feedback labels: 2
  Feedback precision: 100.0%
  Feedback recall:    100.0%
  Feedback F1:        100.0%
  Unscored feedback: 1 (pairs not in similarity candidates)

Top 5 pairs by posterior:
  Doc A                                  Doc B                                   Posterior    Jaccard     Cosine
  --------------------------------------------------------------------------------------------------------------
  0b1c8023-6a4b-49ac-822d-5e2840ff7d38   f34e4f01-f7a8-44f2-aeae-d02630feb5c9       1.000      1.000      0.988
  ab2fae57-21e1-44c6-b14f-70d465d951ab   d77633f0-1236-454d-9e3e-05d49bf4b4e2       1.000      1.000      1.000
  4e425c33-835b-4621-8151-caf7c73d734c   636b1037-ffc6-4324-8c34-41d5a67d5f48       1.000      0.805      0.962
  576ef4bb-946b-46f9-81ed-863939069d0e   5b7978db-e46a-460c-85e6-47e867be31f5       1.000      1.000      1.000
  002f63f9-66dd-4c47-9317-9770fd7b78bb   7a197541-eec1-4a41-a22c-61b926c09587       1.000      0.914      0.964

labels

Manage training labels for document pairs.

labels add

Add a user feedback label for a document pair.

bash

unsterwerx knowledge labels add --label <LABEL> <DOC_A> <DOC_B>

Argument/Option	Type	Description
`DOC_A`	string	First document ID
`DOC_B`	string	Second document ID
`--label`	string	`duplicate_or_same_concept` or `unrelated`

User feedback labels take precedence over bootstrap labels during training (weighted by knowledge.feedback_weight, default 3.0).
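A rough sketch of how that precedence and weighting could combine into one training set. It assumes pairs are unordered and that a bootstrap label's weight is its confidence value. Both are illustrative assumptions, and merge_labels is a hypothetical name.

```python
def merge_labels(bootstrap, feedback, feedback_weight=3.0):
    """Combine both label sources into {pair: (label, weight)}.

    bootstrap: iterable of (doc_a, doc_b, label, confidence)
    feedback:  iterable of (doc_a, doc_b, label)
    """
    merged = {}
    for doc_a, doc_b, label, confidence in bootstrap:
        pair = tuple(sorted((doc_a, doc_b)))  # treat (a, b) and (b, a) alike
        merged[pair] = (label, confidence)
    for doc_a, doc_b, label in feedback:
        pair = tuple(sorted((doc_a, doc_b)))
        # Feedback replaces any bootstrap label for the same pair.
        merged[pair] = (label, feedback_weight)
    return merged
```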

If the labeled pair is not a similarity candidate, it is scored ad hoc using the current model and the result is persisted:

bash

unsterwerx knowledge labels add --label unrelated \
  0b1c8023-6a4b-49ac-822d-5e2840ff7d38 \
  6a4f7b82-ed2b-4c86-95c8-352aa082a17a

Label added: 0b1c8023-6a4 / 6a4f7b82-ed2 → unrelated
Note: pair is not a similarity candidate; scored ad hoc (posterior=0.000)

labels list

List all existing labels (bootstrap and user feedback).

bash

unsterwerx knowledge labels list

Use --label to filter by the classification value (duplicate_or_same_concept, unrelated) and --source or --label-source to filter by how the label was created (user_feedback, bootstrap_near_duplicate, bootstrap_same_concept). The legacy --label-type flag is still accepted as a hidden compatibility alias.

bash

unsterwerx knowledge labels list --label duplicate_or_same_concept
unsterwerx knowledge labels list --source user_feedback

Doc A          Doc B          Label                        Source                       Conf Created
----------------------------------------------------------------------------------------------------
5a188909-533   d888ccdc-f89   unrelated                    bootstrap_near_duplicate    1.00 2026-03-13 03:11:41
59d26f82-642   d82cd9af-5fd   unrelated                    bootstrap_near_duplicate    1.00 2026-03-13 03:11:41
...

vectors

Build and query the knowledge vector graph in the Universal Data Module.

bash

unsterwerx knowledge vectors <COMMAND>

vectors build

Cluster documents into vectors using Bayesian posterior scores derived from Business Intelligence scoring.

bash

unsterwerx knowledge vectors build [--threshold <FLOAT>] [--min-vector-size <INT>] [--edge-threshold <FLOAT>] [--model-id <ID>] [--dry-run]

Use this after knowledge build to create or refresh the vector graph stored in the Universal Data Module.
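This page does not state which clustering algorithm vectors build uses. As one plausible reading, the sketch below groups documents into connected components over pairs whose posterior meets the threshold, using union-find. The default values are placeholders, and --edge-threshold and --model-id are not modeled.

```python
def build_vectors(scored_pairs, threshold=0.9, min_vector_size=2):
    """Group documents into connected components over high-posterior pairs.

    scored_pairs: iterable of (doc_a, doc_b, posterior)
    Returns a list of sorted member lists, ordered by first member.
    """
    parent = {}

    def find(doc):
        parent.setdefault(doc, doc)
        while parent[doc] != doc:
            parent[doc] = parent[parent[doc]]  # path halving
            doc = parent[doc]
        return doc

    for doc_a, doc_b, posterior in scored_pairs:
        if posterior >= threshold:
            root_a, root_b = find(doc_a), find(doc_b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two components

    groups = {}
    for doc in list(parent):
        groups.setdefault(find(doc), set()).add(doc)
    vectors = [
        sorted(members)
        for members in groups.values()
        if len(members) >= min_vector_size
    ]
    return sorted(vectors, key=lambda members: members[0])
```

With connected components, two documents can share a vector without a direct high-posterior pair between them, as long as a chain of such pairs links them.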

vectors list / show / search / traverse

bash

unsterwerx knowledge vectors list [--limit <INT>] [--min-confidence <FLOAT>]
unsterwerx knowledge vectors show <VECTOR_ID_OR_PREFIX>
unsterwerx knowledge vectors search <QUERY> [--limit <INT>]
unsterwerx knowledge vectors traverse <VECTOR_ID_OR_PREFIX> [--depth <INT>]

list shows existing vectors with size and confidence summaries
show prints vector members, their filenames, and connected edges
search runs FTS over vector member content
traverse walks the inter-vector graph outward from one vector
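A depth-limited walk such as traverse is commonly done breadth-first. The sketch below assumes the inter-vector graph is available as an adjacency mapping; the function name and the data shape are illustrative only.

```python
from collections import deque


def traverse(edges, start, depth=2):
    """Breadth-first walk; returns {vector_id: hops_from_start}.

    edges: dict mapping a vector ID to the IDs it is connected to.
    """
    seen = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if seen[current] == depth:
            continue  # do not expand past the depth limit
        for neighbor in edges.get(current, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[current] + 1
                queue.append(neighbor)
    return seen
```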

dedup

Scan vectors for redundant members and optionally collapse them inside the Shared Sandbox.

dedup scan

bash

unsterwerx knowledge dedup scan [--threshold <FLOAT>] [--vector <ID_OR_PREFIX>] [--keep <DOC_ID>] [--model-id <ID>]

scan is read-only. It uses vector membership plus Bayesian posterior to choose a kept anchor and propose which lower-priority documents can be removed from the active Universal Data Set.
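The selection logic can be pictured roughly as below. How the tool ranks documents is not described here, so the priority mapping is a hypothetical input, as are the function name and the default threshold. The function only returns a proposal and changes nothing, matching the read-only behavior of scan.

```python
def plan_dedup(members, posterior, priority, threshold=0.9, keep=None):
    """Return (anchor, removals) for one vector without modifying anything.

    members:   list of document IDs in the vector
    posterior: dict mapping frozenset({doc_a, doc_b}) to P(duplicate | features)
    priority:  dict mapping document ID to a rank score (higher is kept first)
    keep:      optional document ID to force as the anchor (like --keep)
    """
    if keep is not None and keep not in members:
        raise ValueError("keep must be a member of the vector")
    if keep is not None:
        anchor = keep
    else:
        # Ties on priority are broken by document ID for determinism.
        anchor = max(members, key=lambda d: (priority.get(d, 0.0), d))
    removals = [
        doc
        for doc in members
        if doc != anchor
        and posterior.get(frozenset((anchor, doc)), 0.0) >= threshold
    ]
    return anchor, sorted(removals)
```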

dedup apply

bash

unsterwerx knowledge dedup apply [--threshold <FLOAT>] [--vector <ID_OR_PREFIX>] [--keep <DOC_ID>] [--model-id <ID>] [--dry-run] [--confirm]

apply computes rollback diffs in the Universal Data Module when canonical content is available, merges provenance into the kept document, marks removed documents as deduplicated inside the Shared Sandbox, audits each removal, and prints a reminder to rebuild the vector graph.

dedup list

bash

unsterwerx knowledge dedup list [--limit <INT>]

Lists previously applied dedup rules with the kept document, removed documents, and the recorded timestamp. Use this to review past dedup decisions before considering rollback.

dedup show

bash

unsterwerx knowledge dedup show <RULE_ID_OR_PREFIX>

Shows details of a specific dedup rule including the kept document, removed documents, provenance merge results, and rollback diff availability.

dedup rollback

bash

unsterwerx knowledge dedup rollback <RULE_ID_OR_PREFIX>

Reverses a previously applied dedup rule. Restores removed documents from deduplicated status back to their prior state, using stored rollback diffs to reconstruct canonical content. Provenance changes are not reversed; the kept document retains merged provenance.

Understanding the Output

Posterior Score

The posterior probability P(duplicate | features) ranges from 0.0 to 1.0:

>= 0.9: very likely duplicate or the same concept
>= 0.5 and < 0.9: possible relationship, worth reviewing
< 0.5: likely unrelated
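For scripts that post-process scores, the bands above can be expressed as a small helper. The function name and the returned strings are illustrative, not part of the CLI.

```python
def interpret_posterior(posterior):
    """Map P(duplicate | features) to the review bands described above."""
    if not 0.0 <= posterior <= 1.0:
        raise ValueError("posterior must be between 0.0 and 1.0")
    if posterior >= 0.9:
        return "very likely duplicate or same concept"
    if posterior >= 0.5:
        return "possible relationship, worth reviewing"
    return "likely unrelated"
```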

Features

Each pair is scored on six semantic features:

Feature	Description
`jaccard`	Jaccard similarity from MinHash/LSH
`cosine`	TF-IDF cosine similarity
`title_overlap`	Normalized title/filename overlap
`structural_overlap`	Structural element similarity (headings, lists, tables)
`temporal_proximity`	How close the documents are in time (scaled by `temporal_scale_secs`)
`source_weight_delta`	Difference in provenance source weights
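To make three of these features concrete, here is a simplified sketch. It computes exact Jaccard on token sets rather than the MinHash/LSH estimate the tool uses, takes the IDF table as a plain dict, and models temporal proximity as exponential decay. The decay form is an assumption, since only the scale parameter is documented.

```python
import math
from collections import Counter


def jaccard(tokens_a, tokens_b):
    """Exact Jaccard similarity of two token collections."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def tfidf_cosine(tokens_a, tokens_b, idf):
    """Cosine similarity of TF-IDF vectors; idf maps term -> IDF weight."""

    def weights(tokens):
        counts = Counter(tokens)
        return {term: tf * idf.get(term, 0.0) for term, tf in counts.items()}

    wa, wb = weights(tokens_a), weights(tokens_b)
    dot = sum(w * wb.get(term, 0.0) for term, w in wa.items())
    norm_a = math.sqrt(sum(w * w for w in wa.values()))
    norm_b = math.sqrt(sum(w * w for w in wb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def temporal_proximity(ts_a, ts_b, temporal_scale_secs):
    """1.0 for identical timestamps, decaying toward 0.0 as the gap grows.

    Assumption: exponential decay with temporal_scale_secs as the scale.
    """
    return math.exp(-abs(ts_a - ts_b) / temporal_scale_secs)
```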

Evaluation Metrics

Post-train consistency: sanity check that the model reproduces its training signal (should be at least 95%)
Feedback precision: of pairs predicted as duplicates, what fraction are actually duplicates based on user feedback
Feedback recall: of actual duplicates based on user feedback, what fraction were predicted correctly
Feedback F1: harmonic mean of precision and recall
Unscored feedback: feedback labels on pairs that are not similarity candidates; they are used in training but have no scored prediction to evaluate against
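The feedback metrics can be reproduced with a few lines. This sketch assumes a pair counts as a predicted duplicate when its posterior is at least 0.5; that decision threshold is an assumption, and the function name is hypothetical.

```python
def feedback_metrics(feedback, posteriors, threshold=0.5):
    """Compare user feedback with scored predictions.

    feedback:   dict mapping pair -> label string
    posteriors: dict mapping pair -> P(duplicate | features)
    """
    tp = fp = fn = unscored = 0
    for pair, label in feedback.items():
        if pair not in posteriors:
            unscored += 1  # labeled pair with no scored prediction
            continue
        predicted_dup = posteriors[pair] >= threshold
        actual_dup = label == "duplicate_or_same_concept"
        if predicted_dup and actual_dup:
            tp += 1
        elif predicted_dup:
            fp += 1
        elif actual_dup:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "unscored": unscored,
    }
```

With only a handful of feedback labels, these metrics are coarse: a single mislabeled or misclassified pair can move precision or recall by tens of points.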

Notes

Run unsterwerx similarity first to generate similarity candidates before building knowledge scores.
The model is automatically retrained when tracked config values change, new labels are added, knowledge.feature_version is bumped, or a new IDF snapshot is produced.
Use --retrain to force a fresh model if you suspect stale results.
Bootstrap labels are regenerated each run; they seed the model until enough user feedback accumulates.
User feedback always overrides bootstrap labels for the same document pair.
Each knowledge build run is recorded in the audit log.
Deduplicated and archived documents are excluded from later scoring, vector rebuilds, and vector search.
After knowledge dedup apply, run unsterwerx knowledge vectors build to refresh vector membership and edges.