Unsterwerx

knowledge

Builds and manages Bayesian knowledge scores for document pairs, clusters related documents into knowledge vectors, and applies Business Intelligence dedup within those vectors. It uses a Naive Bayes model trained on bootstrap labels derived from similarity candidates plus user feedback to compute posterior probabilities for document relatedness inside the Shared Sandbox.

This command implements the Business Intelligence layer of the TCA (Trusted Client-Centric Application Architecture). It persists pair scores, vector graph state, and dedup actions in the Universal Data Module while operating on content normalized into the Universal Data Set.

Subcommands

build

Builds semantic features from the Universal Data Set, trains or reuses a Bayesian model, and scores all similarity candidate pairs stored in the Universal Data Module.

```bash
unsterwerx knowledge build [OPTIONS]
```

Options

| Option | Type | Default | Description |
|---|---|---|---|
| --retrain | flag | | Force model retrain even if inputs are unchanged |
| --evaluate | flag | | Print evaluation metrics after scoring |
| --top | integer | 20 | Number of top-scored pairs to display |

Pipeline

  1. Preflight: verifies that similarity candidates and canonical records exist
  2. Semantic features: computes TF-IDF corpus statistics and per-pair feature vectors
  3. Label generation: creates bootstrap labels from similarity candidates
  4. Training: trains a Laplace-smoothed Naive Bayes model with weighted labels
  5. Scoring: computes posterior P(duplicate | features) for each candidate pair
  6. Evaluation: runs an optional consistency check plus user-feedback precision and recall
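Steps 4 and 5 can be illustrated with a small sketch. This is not the tool's implementation: it assumes each feature value in [0, 1] is discretized into a fixed number of bins so that add-one (Laplace) smoothing applies to weighted bin counts. The names `N_BINS`, `to_bin`, `train`, and `posterior` are hypothetical.

```python
import math
from collections import defaultdict

N_BINS = 5  # assumption: features in [0, 1] are discretized into 5 bins
CLASSES = ("duplicate", "unrelated")

def to_bin(value):
    """Map a feature value in [0, 1] to a bin index in [0, N_BINS - 1]."""
    return min(int(value * N_BINS), N_BINS - 1)

def train(examples, alpha=1.0):
    """examples: list of (features, label, weight). Returns (priors, likelihood)."""
    class_weight = defaultdict(float)
    counts = defaultdict(float)  # (label, feature_index, bin) -> weighted count
    n_features = len(examples[0][0])
    for features, label, weight in examples:
        class_weight[label] += weight
        for i, value in enumerate(features):
            counts[(label, i, to_bin(value))] += weight
    total = sum(class_weight.values())
    # Laplace smoothing: add alpha to every count so no probability is zero.
    priors = {c: (class_weight[c] + alpha) / (total + alpha * len(CLASSES))
              for c in CLASSES}
    likelihood = {
        (c, i, b): (counts[(c, i, b)] + alpha) / (class_weight[c] + alpha * N_BINS)
        for c in CLASSES for i in range(n_features) for b in range(N_BINS)
    }
    return priors, likelihood

def posterior(features, priors, likelihood):
    """Return P(duplicate | features), computed in log space for stability."""
    log_scores = {}
    for c in CLASSES:
        score = math.log(priors[c])
        for i, value in enumerate(features):
            score += math.log(likelihood[(c, i, to_bin(value))])
        log_scores[c] = score
    top = max(log_scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    return exp_scores["duplicate"] / sum(exp_scores.values())
```

The weight on each example is what lets a heavier label count more than a lighter one; the smoothing term keeps unseen bins from driving a posterior to exactly zero.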

Automatic Invalidation

The model automatically retrains when any of the following change:

Use --retrain to force a rebuild regardless of invalidation state.

Example

```bash
unsterwerx knowledge build --evaluate --top 5
Preflight checks...
  All prerequisites met.

Building semantic features...
  Corpus: 1807 docs, 2939590 unique terms (IDF snapshot #1)

Training Bayesian model...
  Bootstrap labels: 318 positive, 636 negative
  Model trained: run #4, P(dup)=0.302, P(unrel)=0.698

Scoring candidates...

Timing: Semantic: 1.5s | Scoring: 0.1s | Total: 2.5s
Candidates scored: 371

Evaluation:
  Post-train consistency: 100.0%
  User feedback labels: 2
  Feedback precision: 100.0%
  Feedback recall:    100.0%
  Feedback F1:        100.0%
  Unscored feedback: 1 (pairs not in similarity candidates)

Top 5 pairs by posterior:
  Doc A                                  Doc B                                   Posterior    Jaccard     Cosine
  --------------------------------------------------------------------------------------------------------------
  0b1c8023-6a4b-49ac-822d-5e2840ff7d38   f34e4f01-f7a8-44f2-aeae-d02630feb5c9       1.000      1.000      0.988
  ab2fae57-21e1-44c6-b14f-70d465d951ab   d77633f0-1236-454d-9e3e-05d49bf4b4e2       1.000      1.000      1.000
  4e425c33-835b-4621-8151-caf7c73d734c   636b1037-ffc6-4324-8c34-41d5a67d5f48       1.000      0.805      0.962
  576ef4bb-946b-46f9-81ed-863939069d0e   5b7978db-e46a-460c-85e6-47e867be31f5       1.000      1.000      1.000
  002f63f9-66dd-4c47-9317-9770fd7b78bb   7a197541-eec1-4a41-a22c-61b926c09587       1.000      0.914      0.964
```
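The feedback precision, recall, and F1 lines in the evaluation output can be read with the following sketch. It assumes the standard definitions of those metrics and a decision threshold of 0.5 on the posterior; the function name `feedback_metrics`, the threshold, and the data layout are hypothetical, not taken from the tool.

```python
def feedback_metrics(posteriors, feedback, threshold=0.5):
    """posteriors: {pair: P(duplicate)}; feedback: {pair: True if labeled duplicate}."""
    tp = fp = fn = unscored = 0
    for pair, is_duplicate in feedback.items():
        if pair not in posteriors:
            unscored += 1  # labeled pair with no score to compare against
            continue
        predicted = posteriors[pair] >= threshold
        if predicted and is_duplicate:
            tp += 1
        elif predicted and not is_duplicate:
            fp += 1
        elif not predicted and is_duplicate:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1,
            "unscored": unscored}
```

Pairs that carry a feedback label but no score are counted separately rather than treated as errors, which matches the way the output reports unscored feedback on its own line.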

labels

Manage training labels for document pairs.

labels add

Add a user feedback label for a document pair.

```bash
unsterwerx knowledge labels add --label <LABEL> <DOC_A> <DOC_B>
```

| Argument/Option | Type | Description |
|---|---|---|
| DOC_A | string | First document ID |
| DOC_B | string | Second document ID |
| --label | string | duplicate_or_same_concept or unrelated |

User feedback labels take precedence over bootstrap labels during training (weighted by knowledge.feedback_weight, default 3.0).
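One way to picture this precedence rule is a merge step that runs before training. The sketch below is an assumption about how such a merge could work, not the tool's code: bootstrap labels are given a weight of 1.0 (an assumption), pairs are treated as unordered, and `merge_labels` is a hypothetical name. Only the 3.0 default comes from the text above.

```python
FEEDBACK_WEIGHT = 3.0  # documented default of knowledge.feedback_weight

def merge_labels(bootstrap, feedback, feedback_weight=FEEDBACK_WEIGHT):
    """Return {pair: (label, weight)}; user feedback overrides bootstrap labels.

    bootstrap and feedback map (doc_a, doc_b) tuples to a label string.
    """
    # Assumption: bootstrap labels carry unit weight.
    merged = {frozenset(pair): (label, 1.0) for pair, label in bootstrap.items()}
    for pair, label in feedback.items():
        # frozenset makes (a, b) and (b, a) refer to the same pair.
        merged[frozenset(pair)] = (label, feedback_weight)
    return merged
```

With this layout, a pair labeled by both sources appears once in the training set, carrying the user's label at the higher weight.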

If the labeled pair is not a similarity candidate, it is scored ad hoc using the current model, and the result is persisted:

```bash
unsterwerx knowledge labels add --label unrelated \
  0b1c8023-6a4b-49ac-822d-5e2840ff7d38 \
  6a4f7b82-ed2b-4c86-95c8-352aa082a17a
Label added: 0b1c8023-6a4 / 6a4f7b82-ed2 → unrelated
Note: pair is not a similarity candidate; scored ad hoc (posterior=0.000)
```

labels list

List all existing labels (bootstrap and user feedback).

```bash
unsterwerx knowledge labels list
```

Use --label to filter by the classification value (duplicate_or_same_concept, unrelated) and --source or --label-source to filter by how the label was created (user_feedback, bootstrap_near_duplicate, bootstrap_same_concept). The legacy --label-type flag is still accepted as a hidden compatibility alias.

```bash
unsterwerx knowledge labels list --label duplicate_or_same_concept
unsterwerx knowledge labels list --source user_feedback
Doc A          Doc B          Label                        Source                       Conf Created
----------------------------------------------------------------------------------------------------
5a188909-533   d888ccdc-f89   unrelated                    bootstrap_near_duplicate    1.00 2026-03-13 03:11:41
59d26f82-642   d82cd9af-5fd   unrelated                    bootstrap_near_duplicate    1.00 2026-03-13 03:11:41
...
```

vectors

Build and query the knowledge vector graph in the Universal Data Module.

```bash
unsterwerx knowledge vectors <COMMAND>
```

vectors build

Cluster documents into vectors using Bayesian posterior scores derived from Business Intelligence scoring.

```bash
unsterwerx knowledge vectors build [--threshold <FLOAT>] [--min-vector-size <INT>] [--edge-threshold <FLOAT>] [--model-id <ID>] [--dry-run]
```

Use this after knowledge build to create or refresh the vector graph stored in the Universal Data Module.
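As a rough illustration of posterior-based clustering, the sketch below links two documents whenever their posterior meets a threshold and then takes connected components, dropping components smaller than a minimum size. Connected components via union-find is an assumption made for illustration; the actual clustering method, the default values shown, and the name `build_vectors` are not taken from the tool.

```python
def build_vectors(scored_pairs, threshold=0.8, min_vector_size=2):
    """scored_pairs: iterable of (doc_a, doc_b, posterior).

    Returns a sorted list of vectors, each a sorted list of document IDs.
    """
    parent = {}

    def find(doc):
        # Union-find lookup with path halving.
        parent.setdefault(doc, doc)
        while parent[doc] != doc:
            parent[doc] = parent[parent[doc]]
            doc = parent[doc]
        return doc

    for doc_a, doc_b, score in scored_pairs:
        if score >= threshold:  # only strong pairs create an edge
            parent[find(doc_a)] = find(doc_b)

    groups = {}
    for doc in list(parent):
        groups.setdefault(find(doc), []).append(doc)
    return sorted(sorted(members) for members in groups.values()
                  if len(members) >= min_vector_size)
```

For example, with pairs `("a", "b", 0.95)`, `("b", "c", 0.9)`, and `("d", "e", 0.4)`, the first two pairs chain into one vector of three documents, while the weak third pair produces no vector.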

vectors list / show / search / traverse

```bash
unsterwerx knowledge vectors list [--limit <INT>] [--min-confidence <FLOAT>]
unsterwerx knowledge vectors show <VECTOR_ID_OR_PREFIX>
unsterwerx knowledge vectors search <QUERY> [--limit <INT>]
unsterwerx knowledge vectors traverse <VECTOR_ID_OR_PREFIX> [--depth <INT>]
```

dedup

Scan vectors for redundant members and optionally collapse them inside the Shared Sandbox.

dedup scan

```bash
unsterwerx knowledge dedup scan [--threshold <FLOAT>] [--vector <ID_OR_PREFIX>] [--keep <DOC_ID>] [--model-id <ID>]
```

scan is read-only. It uses vector membership plus Bayesian posterior to choose a kept anchor and propose which lower-priority documents can be removed from the active Universal Data Set.
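The anchor-and-proposal idea can be sketched as follows. The selection rule here is an assumption for illustration only: when no document is pinned, the anchor is the member with the highest summed posterior to the other members, and a member is proposed for removal when its posterior against the anchor meets the threshold. The name `scan_vector`, the default threshold, and the priority rule are hypothetical.

```python
def scan_vector(members, posteriors, threshold=0.9, keep=None):
    """Propose a kept anchor and removable members for one vector.

    members: list of document IDs in the vector.
    posteriors: {frozenset((doc_a, doc_b)): P(duplicate)}.
    keep: optional document ID to pin as the anchor.
    Returns a proposal only; nothing is modified.
    """
    def score(doc_a, doc_b):
        return posteriors.get(frozenset((doc_a, doc_b)), 0.0)

    if keep is not None:
        anchor = keep
    else:
        # Assumption: prefer the member most strongly tied to the rest.
        anchor = max(sorted(members),
                     key=lambda d: sum(score(d, o) for o in members if o != d))
    removals = sorted(d for d in members
                      if d != anchor and score(anchor, d) >= threshold)
    return {"keep": anchor, "remove": removals}
```

Because the function only returns a proposal, it mirrors the read-only nature of a scan: the caller decides whether to act on it.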

dedup apply

```bash
unsterwerx knowledge dedup apply [--threshold <FLOAT>] [--vector <ID_OR_PREFIX>] [--keep <DOC_ID>] [--model-id <ID>] [--dry-run] [--confirm]
```

apply computes rollback diffs in the Universal Data Module when canonical content is available, merges provenance into the kept document, marks removed documents as deduplicated inside the Shared Sandbox, audits each removal, and prints a reminder to rebuild the vector graph.

dedup list

```bash
unsterwerx knowledge dedup list [--limit <INT>]
```

Lists previously applied dedup rules with the kept document, removed documents, and the recorded timestamp. Use this to review past dedup decisions before considering rollback.

dedup show

```bash
unsterwerx knowledge dedup show <RULE_ID_OR_PREFIX>
```

Shows details of a specific dedup rule including the kept document, removed documents, provenance merge results, and rollback diff availability.

dedup rollback

```bash
unsterwerx knowledge dedup rollback <RULE_ID_OR_PREFIX>
```

Reverses a previously applied dedup rule. Restores removed documents from deduplicated status back to their prior state, using stored rollback diffs to reconstruct canonical content. Provenance changes are not reversed; the kept document retains merged provenance.
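To show how a stored diff can be enough to rebuild removed content, here is a minimal sketch using Python's standard `difflib`. The storage format is an assumption: the tool's actual diff representation is not described here, and `make_rollback_diff` and `restore_removed` are hypothetical names.

```python
import difflib

def make_rollback_diff(kept_lines, removed_lines):
    """Build a line-level delta from the kept document to the removed one."""
    return list(difflib.ndiff(kept_lines, removed_lines))

def restore_removed(delta):
    """Rebuild the removed document's lines from a stored delta."""
    # difflib.restore(delta, 2) yields the second sequence given to ndiff.
    return list(difflib.restore(delta, 2))
```

A delta of this kind holds both sides of the comparison, so the removed document can be recovered without keeping a full second copy alongside the kept one.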

Understanding the Output

Posterior Score

The posterior probability P(duplicate | features) ranges from 0.0 to 1.0:

Features

Each pair is scored on six semantic features:

| Feature | Description |
|---|---|
| jaccard | Jaccard similarity from MinHash/LSH |
| cosine | TF-IDF cosine similarity |
| title_overlap | Normalized title/filename overlap |
| structural_overlap | Structural element similarity (headings, lists, tables) |
| temporal_proximity | How close the documents are in time (scaled by temporal_scale_secs) |
| source_weight_delta | Difference in provenance source weights |
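The first two features can be illustrated with a short sketch. It computes exact Jaccard similarity over token sets (the quantity that MinHash estimates) and TF-IDF cosine similarity with a smoothed IDF. The tokenization, the IDF formula, and the function names are assumptions chosen for illustration, not the tool's definitions.

```python
import math
from collections import Counter

def jaccard(tokens_a, tokens_b):
    """Exact Jaccard similarity of two token collections."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def idf_table(corpus):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    return {term: math.log((1 + n_docs) / (1 + count)) + 1.0
            for term, count in doc_freq.items()}

def tfidf_cosine(tokens_a, tokens_b, idf):
    """Cosine similarity between the TF-IDF vectors of two token lists."""
    vec_a = {t: c * idf.get(t, 0.0) for t, c in Counter(tokens_a).items()}
    vec_b = {t: c * idf.get(t, 0.0) for t, c in Counter(tokens_b).items()}
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm = (math.sqrt(sum(w * w for w in vec_a.values()))
            * math.sqrt(sum(w * w for w in vec_b.values())))
    return dot / norm if norm else 0.0
```

Jaccard looks only at which terms are shared, while TF-IDF cosine also weighs how often and how distinctively each term is used, which is why the two values can differ for the same pair.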

Evaluation Metrics

Notes