Unsterwerx

Knowledge Scoring Guide

Knowledge scoring is Unsterwerx's Bayesian Business Intelligence layer. It goes beyond simple Jaccard similarity and computes a multi-feature posterior probability that two documents are semantically related. The system learns from automated signals and human feedback.

Prerequisites

Before running knowledge scoring, you need:

  1. Ingested documents: at least two documents with canonical text
  2. Similarity candidates: run unsterwerx similarity to generate MinHash/LSH candidate pairs
  3. Canonical records: documents must have extracted canonical markdown

Quick Start

```bash
# 1. Ingest your documents
unsterwerx ingest ~/documents

# 2. Generate similarity candidates
unsterwerx similarity

# 3. Build knowledge scores
unsterwerx knowledge build --evaluate
```

How It Works

Feature Engineering

For each similarity candidate pair, six features are computed:

| Feature | Source | What It Measures |
|---|---|---|
| Jaccard | MinHash/LSH | Token-level overlap (shingle-based) |
| Cosine | TF-IDF | Semantic similarity weighted by term importance |
| Title overlap | File metadata | Filename/title similarity |
| Structural overlap | Canonical records | Structural element similarity (headings, lists, tables) |
| Temporal proximity | Provenance timestamps | Time-based closeness (configurable scale) |
| Source weight delta | Import provenance | Difference in source trust weights |
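The last two features can be illustrated with a short sketch. Note the exponential-decay form for temporal proximity is an assumption; the table above only specifies "time-based closeness" with a configurable scale:

```python
import math

def temporal_proximity(ts_a: float, ts_b: float, scale_secs: float = 86400.0) -> float:
    # Decay the timestamp gap into (0, 1]; 1.0 means identical timestamps.
    # Exponential decay is an illustrative assumption, not the documented formula.
    return math.exp(-abs(ts_a - ts_b) / scale_secs)

def source_weight_delta(weight_a: float, weight_b: float) -> float:
    # Absolute difference between the two sources' trust weights.
    return abs(weight_a - weight_b)
```

With the default scale of one day, two documents ingested 24 hours apart score about 0.37, while same-moment documents score 1.0.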

Bootstrap Labels

When no user feedback exists, the model bootstraps its own training labels: candidate pairs whose Jaccard similarity meets the bootstrap_threshold become positives, and negatives are drawn at negative_ratio times the positive count.
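A minimal sketch of the bootstrap step, using the default bootstrap_threshold (0.7) and negative_ratio (2.0) from the configuration table below. Sampling the lowest-similarity pairs as negatives is an assumption for illustration:

```python
def bootstrap_labels(pairs, threshold=0.7, negative_ratio=2.0):
    # pairs: dicts with at least a "jaccard" feature, sorted here by similarity.
    ranked = sorted(pairs, key=lambda p: p["jaccard"], reverse=True)
    positives = [p for p in ranked if p["jaccard"] >= threshold]
    n_neg = int(len(positives) * negative_ratio)
    # Assumed strategy: take the least-similar pairs as negatives.
    negatives = ranked[::-1][:n_neg]
    return [(p, 1) for p in positives], [(p, 0) for p in negatives]
```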

Naive Bayes Training

Features are discretized into bins, and a Laplace-smoothed Naive Bayes classifier is trained on the labeled pairs.
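The training step can be sketched as follows. The bin count (5) and smoothing constant (alpha=1.0) are assumptions; the docs state only that features are binned and Laplace smoothing is applied:

```python
import math
from collections import defaultdict

def discretize(value: float, n_bins: int = 5) -> int:
    # Map a feature in [0, 1] to one of n_bins equal-width bins.
    return min(int(value * n_bins), n_bins - 1)

def train_naive_bayes(examples, n_bins: int = 5, alpha: float = 1.0):
    # examples: list of (features, label) with features as floats in [0, 1]
    # and label 1 = duplicate, 0 = not duplicate.
    counts = {0: defaultdict(lambda: defaultdict(float)),
              1: defaultdict(lambda: defaultdict(float))}
    class_totals = {0: 0, 1: 0}
    for features, label in examples:
        class_totals[label] += 1
        for i, value in enumerate(features):
            counts[label][i][discretize(value, n_bins)] += 1

    def log_likelihood(label, feature_idx, value):
        # Laplace (add-alpha) smoothing keeps unseen bins from zeroing out.
        bin_count = counts[label][feature_idx][discretize(value, n_bins)]
        return math.log((bin_count + alpha) /
                        (class_totals[label] + alpha * n_bins))

    return log_likelihood, class_totals
```

In the real system, feedback_weight presumably scales the counts contributed by user-labeled pairs, since the configuration describes it as a weight multiplier for feedback labels in training.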

Scoring

For each candidate pair, the model computes:

P(duplicate | features) = P(features | duplicate) × P(duplicate) / P(features)

The computation is carried out in log space, which keeps it numerically stable and avoids underflow when many small likelihoods are multiplied.
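A self-contained sketch of the log-space normalization. It takes per-feature log-likelihoods for each class and combines them with a log-sum-exp step; the uniform 0.5 prior is an illustrative assumption:

```python
import math

def posterior_duplicate(log_like_dup, log_like_not, prior_dup=0.5):
    # log_like_dup / log_like_not: lists of log P(feature_i | class).
    log_dup = math.log(prior_dup) + sum(log_like_dup)
    log_not = math.log(1.0 - prior_dup) + sum(log_like_not)
    # Log-sum-exp: subtract the max before exponentiating to avoid underflow.
    m = max(log_dup, log_not)
    log_evidence = m + math.log(math.exp(log_dup - m) + math.exp(log_not - m))
    return math.exp(log_dup - log_evidence)
```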

Improving Results with Feedback

The model improves with human feedback. Use knowledge labels add to provide ground truth:

```bash
# Mark a pair as definitely duplicates
unsterwerx knowledge labels add --label duplicate_or_same_concept \
  0b1c8023-6a4b-49ac-822d-5e2840ff7d38 \
  f34e4f01-f7a8-44f2-aeae-d02630feb5c9

# Mark a pair as definitely unrelated
unsterwerx knowledge labels add --label unrelated \
  0b1c8023-6a4b-49ac-822d-5e2840ff7d38 \
  6a4f7b82-ed2b-4c86-95c8-352aa082a17a
```

User feedback overrides bootstrap labels for the same pair. The model automatically retrains on the next knowledge build when new feedback is detected.

Ad-hoc Scoring

When you label a pair that is not a similarity candidate (e.g., two documents from different clusters), the system automatically scores it using the current model and persists the result:

```
Label added: 0b1c8023-6a4 / 6a4f7b82-ed2 → unrelated
Note: pair is not a similarity candidate; scored ad hoc (posterior=0.000)
```

This ensures all labeled pairs have scores, even those outside the similarity candidate set.

Model Invalidation

The model tracks what it was trained on and automatically retrains when conditions change:

| Trigger | Mechanism | Description |
|---|---|---|
| Config change | SHA-256 config hash | Any change to knowledge.* training parameters |
| New labels | Event ID tracking | New label events (bootstrap or feedback), tracked by ID, not timestamp |
| Feature version | Version number | Bumping knowledge.feature_version forces recomputation |
| New IDF snapshot | IDF ID tracking | Corpus changes that produce a new TF-IDF snapshot |

This replaces timestamp-based invalidation, which could miss same-second label writes.
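The config-change trigger can be sketched with a content hash. Serializing the knowledge.* parameters with sorted keys makes the hash independent of key order; the exact serialization Unsterwerx uses is an assumption:

```python
import hashlib
import json

def config_hash(knowledge_cfg: dict) -> str:
    # Deterministic SHA-256 over the knowledge.* training parameters.
    blob = json.dumps(knowledge_cfg, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

def needs_retrain(stored_hash: str, current_cfg: dict) -> bool:
    # Any parameter change produces a different hash, invalidating the model.
    return config_hash(current_cfg) != stored_hash
```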

Forcing Retrain

```bash
# Force retrain regardless of invalidation state
unsterwerx knowledge build --retrain
```

Configuration

All knowledge scoring parameters are in the [knowledge] section of config.toml:

| Key | Type | Default | Description |
|---|---|---|---|
| feature_version | integer | 1 | Bump to force full feature recomputation |
| temporal_scale_secs | float | 86400.0 | Scale for temporal proximity (seconds); 86400 = 1 day |
| feedback_weight | float | 3.0 | Weight multiplier for user feedback labels in training |
| negative_ratio | float | 2.0 | Ratio of negative to positive bootstrap samples |
| min_bootstrap_confidence | float | 0.5 | Minimum confidence for bootstrap labels |
| bootstrap_threshold | float | 0.7 | Jaccard threshold for bootstrap positive labels |
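Assuming standard TOML syntax, a [knowledge] section spelling out the defaults above would look like:

```toml
[knowledge]
feature_version = 1
temporal_scale_secs = 86400.0   # 1 day
feedback_weight = 3.0
negative_ratio = 2.0
min_bootstrap_confidence = 0.5
bootstrap_threshold = 0.7
```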

Tuning Tips

Evaluation Metrics

Run knowledge build --evaluate to see:

```
Evaluation:
  Post-train consistency: 100.0%
  User feedback labels: 2
  Feedback precision: 100.0%
  Feedback recall:    100.0%
  Feedback F1:        100.0%
  Unscored feedback: 1 (pairs not in similarity candidates)
```
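The feedback precision, recall, and F1 figures follow the standard definitions, computed against user-provided labels (how Unsterwerx thresholds posteriors into binary predictions is an assumption):

```python
def feedback_metrics(predictions, truths):
    # predictions/truths: 1 = duplicate, 0 = unrelated.
    tp = sum(1 for p, t in zip(predictions, truths) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(predictions, truths) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(predictions, truths) if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

A model that agrees with every feedback label, as in the output above, scores 100% on all three.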

Workflow Integration

Knowledge scoring fits into the standard Unsterwerx pipeline:

```
ingest → similarity → knowledge build → classify → archive
                              ↑
                     knowledge labels add
                       (human feedback)
```

The knowledge scores can be used alongside classification and retention policies to make more informed archival decisions.