# Knowledge Scoring Guide
Knowledge scoring is Unsterwerx's Bayesian Business Intelligence layer. It goes beyond simple Jaccard similarity and computes a multi-feature posterior probability that two documents are semantically related. The system learns from automated signals and human feedback.
## Prerequisites

Before running knowledge scoring, you need:

- Ingested documents: at least two documents with canonical text
- Similarity candidates: run `unsterwerx similarity` to generate MinHash/LSH candidate pairs
- Canonical records: documents must have extracted canonical markdown
## Quick Start

```shell
# 1. Ingest your documents
unsterwerx ingest ~/documents

# 2. Generate similarity candidates
unsterwerx similarity

# 3. Build knowledge scores
unsterwerx knowledge build --evaluate
```
## How It Works

### Feature Engineering
For each similarity candidate pair, six features are computed:
| Feature | Source | What It Measures |
|---|---|---|
| Jaccard | MinHash/LSH | Token-level overlap (shingle-based) |
| Cosine | TF-IDF | Semantic similarity weighted by term importance |
| Title overlap | File metadata | Filename/title similarity |
| Structural overlap | Canonical records | Structural element similarity (headings, lists, tables) |
| Temporal proximity | Provenance timestamps | Time-based closeness (configurable scale) |
| Source weight delta | Import provenance | Difference in source trust weights |
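The last two features can be sketched in a few lines. This is an illustrative implementation, not the shipped one: the exponential-decay form for temporal proximity is an assumption (the doc only says the scale is configurable), and the function names are hypothetical.

```python
import math

def temporal_proximity(ts_a: float, ts_b: float, scale_secs: float = 86400.0) -> float:
    """Map the timestamp gap between two documents to (0, 1].

    Assumed functional form: exp(-|dt| / scale), so identical timestamps
    score 1.0 and the score decays smoothly as the gap grows. The real
    feature may use a different curve; only the scale default (1 day)
    comes from the documented config.
    """
    return math.exp(-abs(ts_a - ts_b) / scale_secs)

def source_weight_delta(weight_a: float, weight_b: float) -> float:
    """Absolute difference between the two documents' source trust weights."""
    return abs(weight_a - weight_b)
```

With the default one-day scale, documents written a day apart score about 0.37, and a week apart about 0.001, so temporal proximity mostly separates "same working session" from "unrelated eras" of the corpus.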
### Bootstrap Labels

When no user feedback exists, the model bootstraps itself:

- Positives: similarity candidates with Jaccard above `knowledge.bootstrap_threshold` (default: 0.7) are labeled as `duplicate_or_same_concept`, with confidence proportional to the Jaccard score and inverse diff ratio
- Negatives: random cross-source document pairs are labeled as `unrelated` (count = positives × `knowledge.negative_ratio`)
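The bootstrap procedure might look like the following sketch. The helper name and the `(doc_a, doc_b, jaccard)` candidate shape are hypothetical; only the thresholds and the cross-source negative sampling mirror the documented behavior, and the confidence rule is simplified (the real one also factors in the diff ratio).

```python
import random

def bootstrap_labels(candidates, documents, threshold=0.7, negative_ratio=2.0,
                     min_confidence=0.5, seed=0):
    """Derive training labels when no human feedback exists.

    `candidates` is a list of (doc_a, doc_b, jaccard) similarity pairs;
    `documents` maps document id -> source name.
    """
    positives = []
    for a, b, jaccard in candidates:
        if jaccard >= threshold:
            # Simplified: confidence grows with Jaccard, clamped to the floor.
            confidence = max(min_confidence, jaccard)
            positives.append((a, b, "duplicate_or_same_concept", confidence))

    # Negatives: random cross-source pairs, count = positives * negative_ratio.
    rng = random.Random(seed)
    ids = list(documents)
    negatives = []
    wanted = int(len(positives) * negative_ratio)
    while len(negatives) < wanted:
        a, b = rng.sample(ids, 2)
        if documents[a] != documents[b]:
            negatives.append((a, b, "unrelated", 1.0))
    return positives + negatives
```

Sampling negatives across sources is what makes them cheap ground truth: two documents imported from unrelated sources are very unlikely to be duplicates, so no human needs to check them.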
### Naive Bayes Training

Features are discretized into bins and a Laplace-smoothed Naive Bayes classifier is trained:

- Each feature is binned independently (e.g., Jaccard bins: 0.0–0.2, 0.2–0.5, 0.5–0.8, 0.8–1.0)
- User feedback labels are weighted higher (`knowledge.feedback_weight`, default: 3.0) than bootstrap labels
- The model outputs class priors P(duplicate) and P(unrelated), plus conditional probabilities per (feature, bin) combination
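A minimal sketch of this training step, under stated assumptions: all features share the Jaccard bin edges from the example above (the real model may bin each feature differently), and the function names are illustrative.

```python
from collections import Counter, defaultdict

# Shared bin edges, matching the Jaccard example: 0-0.2, 0.2-0.5, 0.5-0.8, 0.8-1.
BINS = [0.2, 0.5, 0.8]

def to_bin(value: float) -> int:
    """Discretize a [0, 1] feature value into a bin index 0..3."""
    return sum(value >= edge for edge in BINS)

def train_nb(examples, alpha=1.0):
    """Weighted, Laplace-smoothed Naive Bayes over binned features.

    `examples` is a list of (features, label, weight); feedback rows
    would carry weight 3.0 and bootstrap rows weight 1.0.
    """
    class_weight = Counter()
    cond = defaultdict(Counter)  # (label, feature_idx) -> weighted bin counts
    for features, label, weight in examples:
        class_weight[label] += weight
        for i, value in enumerate(features):
            cond[(label, i)][to_bin(value)] += weight

    total = sum(class_weight.values())
    priors = {c: w / total for c, w in class_weight.items()}

    def cond_prob(label, feature_idx, bin_idx):
        # Laplace smoothing: every bin gets a pseudo-count of alpha,
        # so unseen (feature, bin) combinations never get probability 0.
        counts = cond[(label, feature_idx)]
        n_bins = len(BINS) + 1
        return (counts[bin_idx] + alpha) / (sum(counts.values()) + alpha * n_bins)

    return priors, cond_prob
```

The weighting is what lets three bootstrap labels be overruled by a single human label: a feedback row contributes 3.0 to every count a bootstrap row contributes 1.0 to.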
### Scoring

For each candidate pair, the model computes:

P(duplicate | features) = P(features | duplicate) × P(duplicate) / P(features)

The computation runs in log space, which keeps it numerically stable and avoids underflow when multiplying many small probabilities.
## Improving Results with Feedback

The model improves with human feedback. Use `knowledge labels add` to provide ground truth:
```shell
# Mark a pair as definitely duplicates
unsterwerx knowledge labels add --label duplicate_or_same_concept \
    0b1c8023-6a4b-49ac-822d-5e2840ff7d38 \
    f34e4f01-f7a8-44f2-aeae-d02630feb5c9

# Mark a pair as definitely unrelated
unsterwerx knowledge labels add --label unrelated \
    0b1c8023-6a4b-49ac-822d-5e2840ff7d38 \
    6a4f7b82-ed2b-4c86-95c8-352aa082a17a
```
User feedback overrides bootstrap labels for the same pair. The model automatically retrains on the next `knowledge build` when new feedback is detected.
## Ad-hoc Scoring

When you label a pair that is not a similarity candidate (e.g., two documents from different clusters), the system automatically scores it using the current model and persists the result:

```text
Label added: 0b1c8023-6a4 / 6a4f7b82-ed2 → unrelated
Note: pair is not a similarity candidate; scored ad hoc (posterior=0.000)
```
This ensures all labeled pairs have scores, even those outside the similarity candidate set.
## Model Invalidation
The model tracks what it was trained on and automatically retrains when conditions change:
| Trigger | Mechanism | Description |
|---|---|---|
| Config change | SHA-256 config hash | Any change to `knowledge.*` training parameters |
| New labels | Event ID tracking | New label events (bootstrap or feedback) tracked by ID, not timestamp |
| Feature version | Version number | Bumping `knowledge.feature_version` forces recomputation |
| New IDF snapshot | IDF ID tracking | Corpus changes that produce a new TF-IDF snapshot |
This replaces timestamp-based invalidation, which could miss same-second label writes.
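The config-hash trigger is straightforward to illustrate. A sketch, assuming the parameters are hashed as a canonically serialized mapping (the actual serialization and key set are implementation details not specified here):

```python
import hashlib
import json

def knowledge_config_hash(config: dict) -> str:
    """SHA-256 over the knowledge.* training parameters.

    Serializing with sorted keys makes the digest independent of dict
    ordering; any changed value yields a different digest, which is the
    documented trigger for retraining.
    """
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def needs_retrain(stored: dict, current: dict) -> bool:
    """True when the config the model was trained on no longer matches."""
    return knowledge_config_hash(stored) != knowledge_config_hash(current)
```

Unlike a timestamp comparison, a content hash cannot miss a change that happens within the same clock tick, which is exactly the failure mode described above.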
### Forcing Retrain

```shell
# Force retrain regardless of invalidation state
unsterwerx knowledge build --retrain
```
## Configuration

All knowledge scoring parameters are in the `[knowledge]` section of `config.toml`:

| Key | Type | Default | Description |
|---|---|---|---|
| `feature_version` | integer | 1 | Bump to force full feature recomputation |
| `temporal_scale_secs` | float | 86400.0 | Scale for temporal proximity (seconds); 86400 = 1 day |
| `feedback_weight` | float | 3.0 | Weight multiplier for user feedback labels in training |
| `negative_ratio` | float | 2.0 | Ratio of negative to positive bootstrap samples |
| `min_bootstrap_confidence` | float | 0.5 | Minimum confidence for bootstrap labels |
| `bootstrap_threshold` | float | 0.7 | Jaccard threshold for bootstrap positive labels |
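Written out as a config fragment with the documented defaults, the section might look like this (the surrounding file layout is whatever your `config.toml` already uses):

```toml
[knowledge]
feature_version = 1
temporal_scale_secs = 86400.0   # 1 day
feedback_weight = 3.0
negative_ratio = 2.0
min_bootstrap_confidence = 0.5
bootstrap_threshold = 0.7
```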
## Tuning Tips

- Low precision (too many false positives): increase `bootstrap_threshold` to raise the bar for positive labels
- Low recall (missing duplicates): decrease `bootstrap_threshold` or add more user feedback for edge cases
- Feedback not affecting scores: check that `feedback_weight` is > 1.0 (it amplifies feedback labels relative to bootstrap)
- Too many negatives: decrease `negative_ratio` if negative samples overwhelm the positives
## Evaluation Metrics

Run `knowledge build --evaluate` to see:

```text
Evaluation:
  Post-train consistency: 100.0%
  User feedback labels: 2
  Feedback precision: 100.0%
  Feedback recall: 100.0%
  Feedback F1: 100.0%
  Unscored feedback: 1 (pairs not in similarity candidates)
```
- Post-train consistency should be ≥95%. If lower, the model may not be converging properly.
- Feedback precision/recall/F1 are the metrics that matter most. They measure model agreement with human labels.
- Unscored feedback shows how many labeled pairs were not in the similarity candidate set. These are still used for training but were scored ad hoc, or not at all if no model existed at label time.
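For reference, the feedback metrics reduce to standard precision/recall/F1 computed over the user-labeled pairs the model scored. A sketch with an assumed input shape (the tool's internal representation may differ):

```python
def feedback_metrics(pairs):
    """Precision, recall, and F1 of model predictions vs. human labels.

    `pairs` is a list of (predicted_duplicate, labeled_duplicate)
    booleans, one entry per user-labeled pair the model scored.
    """
    tp = sum(p and t for p, t in pairs)          # agreed: duplicate
    fp = sum(p and not t for p, t in pairs)      # model said duplicate, human said no
    fn = sum(not p and t for p, t in pairs)      # model missed a human duplicate
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```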
## Workflow Integration

Knowledge scoring fits into the standard Unsterwerx pipeline:

```text
ingest → similarity → knowledge build → classify → archive
                             ↑
                   knowledge labels add
                     (human feedback)
```
The knowledge scores can be used alongside classification and retention policies to make more informed archival decisions.