# Knowledge Scoring Guide
Knowledge scoring is Unsterwerx's Bayesian Business Intelligence layer. It goes beyond simple Jaccard similarity and computes a multi-feature posterior probability that two documents are semantically related. The system learns from automated signals and human feedback.
## Prerequisites

Before running knowledge scoring, you need:

- Ingested documents: at least two documents with canonical text
- Similarity candidates: run `unsterwerx similarity` to generate MinHash/LSH candidate pairs
- Canonical records: documents must have extracted canonical markdown
## Quick Start

```shell
# 1. Ingest your documents
unsterwerx ingest ~/documents

# 2. Generate similarity candidates
unsterwerx similarity

# 3. Build knowledge scores
unsterwerx knowledge build --evaluate
```
## How It Works

### Feature Engineering
For each similarity candidate pair, six features are computed:
| Feature | Source | What It Measures |
|---|---|---|
| Jaccard | MinHash/LSH | Token-level overlap (shingle-based) |
| Cosine | TF-IDF | Semantic similarity weighted by term importance |
| Title overlap | File metadata | Filename/title similarity |
| Structural overlap | Canonical records | Structural element similarity (headings, lists, tables) |
| Temporal proximity | Provenance timestamps | Time-based closeness (configurable scale) |
| Source weight delta | Import provenance | Difference in source trust weights |
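The last two features can be sketched in a few lines. This is an illustrative implementation, not the shipped one: the exponential-decay form for temporal proximity is an assumption (the doc only says the scale is configurable), and the function names are hypothetical.

```python
import math

def temporal_proximity(ts_a: float, ts_b: float, scale_secs: float = 86400.0) -> float:
    """Map the timestamp gap between two documents to (0, 1].

    Assumed functional form: exp(-|dt| / scale), so identical timestamps
    score 1.0 and the score decays smoothly as the gap grows. The real
    feature may use a different curve; only the scale default (1 day)
    comes from the documented config.
    """
    return math.exp(-abs(ts_a - ts_b) / scale_secs)

def source_weight_delta(weight_a: float, weight_b: float) -> float:
    """Absolute difference between the two documents' source trust weights."""
    return abs(weight_a - weight_b)
```

With the default one-day scale, documents written a day apart score about 0.37, and a week apart about 0.001, so temporal proximity mostly separates "same working session" from "unrelated eras" of the corpus.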
### Bootstrap Labels

When no user feedback exists, the model bootstraps itself:

- Positives: similarity candidates with Jaccard above `knowledge.bootstrap_threshold` (default: 0.7) are labeled as `duplicate_or_same_concept`, with confidence proportional to the Jaccard score and inverse diff ratio
- Negatives: random cross-source document pairs are labeled as `unrelated` (count = positives × `knowledge.negative_ratio`)
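The bootstrap procedure might look like the following sketch. The helper name and the `(doc_a, doc_b, jaccard)` candidate shape are hypothetical; only the thresholds and the cross-source negative sampling mirror the documented behavior, and the confidence rule is simplified (the real one also factors in the diff ratio).

```python
import random

def bootstrap_labels(candidates, documents, threshold=0.7, negative_ratio=2.0,
                     min_confidence=0.5, seed=0):
    """Derive training labels when no human feedback exists.

    `candidates` is a list of (doc_a, doc_b, jaccard) similarity pairs;
    `documents` maps document id -> source name.
    """
    positives = []
    for a, b, jaccard in candidates:
        if jaccard >= threshold:
            # Simplified: confidence grows with Jaccard, clamped to the floor.
            confidence = max(min_confidence, jaccard)
            positives.append((a, b, "duplicate_or_same_concept", confidence))

    # Negatives: random cross-source pairs, count = positives * negative_ratio.
    rng = random.Random(seed)
    ids = list(documents)
    negatives = []
    wanted = int(len(positives) * negative_ratio)
    while len(negatives) < wanted:
        a, b = rng.sample(ids, 2)
        if documents[a] != documents[b]:
            negatives.append((a, b, "unrelated", 1.0))
    return positives + negatives
```

Sampling negatives across sources is what makes them cheap ground truth: two documents imported from unrelated sources are very unlikely to be duplicates, so no human needs to check them.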
### Naive Bayes Training

Features are discretized into bins and a Laplace-smoothed Naive Bayes classifier is trained:

- Each feature is binned independently (e.g., Jaccard bins: 0.0–0.2, 0.2–0.5, 0.5–0.8, 0.8–1.0)
- User feedback labels are weighted higher (`knowledge.feedback_weight`, default: 3.0) than bootstrap labels
- The model outputs class priors P(duplicate) and P(unrelated), plus conditional probabilities per (feature, bin) combination
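A minimal sketch of this training step, under stated assumptions: all features share the Jaccard bin edges from the example above (the real model may bin each feature differently), and the function names are illustrative.

```python
from collections import Counter, defaultdict

# Shared bin edges, matching the Jaccard example: 0-0.2, 0.2-0.5, 0.5-0.8, 0.8-1.
BINS = [0.2, 0.5, 0.8]

def to_bin(value: float) -> int:
    """Discretize a [0, 1] feature value into a bin index 0..3."""
    return sum(value >= edge for edge in BINS)

def train_nb(examples, alpha=1.0):
    """Weighted, Laplace-smoothed Naive Bayes over binned features.

    `examples` is a list of (features, label, weight); feedback rows
    would carry weight 3.0 and bootstrap rows weight 1.0.
    """
    class_weight = Counter()
    cond = defaultdict(Counter)  # (label, feature_idx) -> weighted bin counts
    for features, label, weight in examples:
        class_weight[label] += weight
        for i, value in enumerate(features):
            cond[(label, i)][to_bin(value)] += weight

    total = sum(class_weight.values())
    priors = {c: w / total for c, w in class_weight.items()}

    def cond_prob(label, feature_idx, bin_idx):
        # Laplace smoothing: every bin gets a pseudo-count of alpha,
        # so unseen (feature, bin) combinations never get probability 0.
        counts = cond[(label, feature_idx)]
        n_bins = len(BINS) + 1
        return (counts[bin_idx] + alpha) / (sum(counts.values()) + alpha * n_bins)

    return priors, cond_prob
```

The weighting is what lets three bootstrap labels be overruled by a single human label: a feedback row contributes 3.0 to every count a bootstrap row contributes 1.0 to.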
### Scoring

For each candidate pair, the model computes:

P(duplicate | features) = P(features | duplicate) × P(duplicate) / P(features)

The computation runs in log space, which keeps it numerically stable and avoids underflow when multiplying many small probabilities.
## Improving Results with Feedback

The model improves with human feedback. Use `knowledge labels add` to provide ground truth:
```shell
# Mark a pair as definitely duplicates
unsterwerx knowledge labels add --label duplicate_or_same_concept \
    0b1c8023-6a4b-49ac-822d-5e2840ff7d38 \
    f34e4f01-f7a8-44f2-aeae-d02630feb5c9

# Mark a pair as definitely unrelated
unsterwerx knowledge labels add --label unrelated \
    0b1c8023-6a4b-49ac-822d-5e2840ff7d38 \
    6a4f7b82-ed2b-4c86-95c8-352aa082a17a
```
User feedback overrides bootstrap labels for the same pair. The model automatically retrains on the next `knowledge build` when new feedback is detected.
## Ad-hoc Scoring

When you label a pair that is not a similarity candidate (e.g., two documents from different clusters), the system automatically scores it using the current model and persists the result:

```text
Label added: 0b1c8023-6a4 / 6a4f7b82-ed2 → unrelated
Note: pair is not a similarity candidate; scored ad hoc (posterior=0.000)
```
This ensures all labeled pairs have scores, even those outside the similarity candidate set.
## Model Invalidation
The model tracks what it was trained on and automatically retrains when conditions change:
| Trigger | Mechanism | Description |
|---|---|---|
| Config change | SHA-256 config hash | Any change to `knowledge.*` training parameters |
| New labels | Event ID tracking | New label events (bootstrap or feedback) tracked by ID, not timestamp |
| Feature version | Version number | Bumping `knowledge.feature_version` forces recomputation |
| New IDF snapshot | IDF ID tracking | Corpus changes that produce a new TF-IDF snapshot |
This replaces timestamp-based invalidation, which could miss same-second label writes.
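The config-hash trigger is straightforward to illustrate. A sketch, assuming the parameters are hashed as a canonically serialized mapping (the actual serialization and key set are implementation details not specified here):

```python
import hashlib
import json

def knowledge_config_hash(config: dict) -> str:
    """SHA-256 over the knowledge.* training parameters.

    Serializing with sorted keys makes the digest independent of dict
    ordering; any changed value yields a different digest, which is the
    documented trigger for retraining.
    """
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def needs_retrain(stored: dict, current: dict) -> bool:
    """True when the config the model was trained on no longer matches."""
    return knowledge_config_hash(stored) != knowledge_config_hash(current)
```

Unlike a timestamp comparison, a content hash cannot miss a change that happens within the same clock tick, which is exactly the failure mode described above.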
### Forcing Retrain

```shell
# Force retrain regardless of invalidation state
unsterwerx knowledge build --retrain
```
## Configuration

All knowledge scoring parameters are in the `[knowledge]` section of `config.toml`:

| Key | Type | Default | Description |
|---|---|---|---|
| `feature_version` | integer | 1 | Bump to force full feature recomputation |
| `temporal_scale_secs` | float | 86400.0 | Scale for temporal proximity (seconds); 86400 = 1 day |
| `feedback_weight` | float | 3.0 | Weight multiplier for user feedback labels in training |
| `negative_ratio` | float | 2.0 | Ratio of negative to positive bootstrap samples |
| `min_bootstrap_confidence` | float | 0.5 | Minimum confidence for bootstrap labels |
| `bootstrap_threshold` | float | 0.7 | Jaccard threshold for bootstrap positive labels |
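Written out as a config fragment with the documented defaults, the section might look like this (the surrounding file layout is whatever your `config.toml` already uses):

```toml
[knowledge]
feature_version = 1
temporal_scale_secs = 86400.0   # 1 day
feedback_weight = 3.0
negative_ratio = 2.0
min_bootstrap_confidence = 0.5
bootstrap_threshold = 0.7
```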
## Tuning Tips

- Low precision (too many false positives): increase `bootstrap_threshold` to raise the bar for positive labels
- Low recall (missing duplicates): decrease `bootstrap_threshold` or add more user feedback for edge cases
- Feedback not affecting scores: check that `feedback_weight` is > 1.0 (it amplifies feedback labels relative to bootstrap)
- Too many negatives: decrease `negative_ratio` if negative samples overwhelm the positives
## Evaluation Metrics

Run `knowledge build --evaluate` to see:

```text
Evaluation:
  Post-train consistency: 100.0%
  User feedback labels: 2
  Feedback precision: 100.0%
  Feedback recall: 100.0%
  Feedback F1: 100.0%
  Unscored feedback: 1 (pairs not in similarity candidates)
```
- Post-train consistency should be ≥95%. If lower, the model may not be converging properly.
- Feedback precision/recall/F1 are the metrics that matter most. They measure model agreement with human labels.
- Unscored feedback shows how many labeled pairs were not in the similarity candidate set. These are still used for training but were scored ad hoc, or not at all if no model existed at label time.
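For reference, the feedback metrics reduce to standard precision/recall/F1 computed over the user-labeled pairs the model scored. A sketch with an assumed input shape (the tool's internal representation may differ):

```python
def feedback_metrics(pairs):
    """Precision, recall, and F1 of model predictions vs. human labels.

    `pairs` is a list of (predicted_duplicate, labeled_duplicate)
    booleans, one entry per user-labeled pair the model scored.
    """
    tp = sum(p and t for p, t in pairs)          # agreed: duplicate
    fp = sum(p and not t for p, t in pairs)      # model said duplicate, human said no
    fn = sum(not p and t for p, t in pairs)      # model missed a human duplicate
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```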
## Workflow Integration

Knowledge scoring fits into the standard Unsterwerx pipeline:

```text
ingest → similarity → knowledge build → classify → archive
                             ↑
                   knowledge labels add
                     (human feedback)
```
The knowledge scores can be used alongside classification and retention policies to make more informed archival decisions.