Changelog

All notable changes to Unsterwerx are documented here. This project adheres to Semantic Versioning.

Unreleased

0.4.9 - 2026-03-31

Added

Content-based file-type resolution for ingest and import using magic-byte sniffing, first-chunk reuse from streaming SHA-256 hashing, PDF buffer cleanup, and optional pdftotext fallback for recoverable parser failures.
Managed ingest/import job execution with foreground and background run records, hidden worker replay, resumable runs, persistent diagnostics, jobs control/status commands, deterministic log files, and shared JSON envelopes across the operator surface.
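
As a rough illustration of the magic-byte sniffing mentioned above, the sketch below scans a sniff window for the `%PDF-` marker rather than requiring it at offset 0. The function name `find_pdf_header` is hypothetical and is not taken from the Unsterwerx source; the real resolver may apply additional checks.

```rust
/// Returns the byte offset of the `%PDF-` marker inside `window`, if any.
/// Hypothetical helper: a non-zero offset means the header is preceded by
/// padding or junk bytes.
fn find_pdf_header(window: &[u8]) -> Option<usize> {
    // `windows(5)` yields nothing when the input is shorter than the marker,
    // so short inputs simply return None.
    window.windows(5).position(|w| w == b"%PDF-")
}
```

A caller could treat `Some(0)` as a well-formed PDF, `Some(n)` with `n > 0` as a candidate for recovery, and `None` as "not a PDF by content".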

Changed

Retry and duplicate-heal paths now re-sniff file-backed documents from content, persist corrected file_type values, and retry previously unsupported documents when they now resolve to a parseable format.
Operator output for ingest, import run, and jobs * now uses a shared JSON envelope and stable status counters while keeping human-readable summaries aligned with the same DTOs.
Canonical worker concurrency now honors persisted ingest.worker_threads settings for background and resumed runs.

Fixed

Mislabeled plain-text, HTML, offset-header PDF, and junk-padded PDF files now route through lightweight recovery before being rejected.
Text magic-byte detection now preserves UTF-8 recognition when the 1024-byte sniff window ends mid-codepoint, avoiding fallback to stale extension-based dispatch.
jobs stop now interrupts ingest --retry-errors work instead of waiting for the full retry batch to finish.
config set now rejects invalid nullable TOML types up front via round-trip validation.
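
The mid-codepoint case in the UTF-8 sniffing fix above can be sketched with the standard library alone. This is a minimal example, assuming a hypothetical `looks_like_utf8` helper and a `truncated` flag that says whether the window was cut short of the full file; it is not the project's actual detection code.

```rust
/// Returns true when `window` is valid UTF-8, or when it is valid UTF-8
/// followed only by an incomplete multi-byte sequence that was cut off by
/// the end of a truncated sniff window. (Hypothetical helper.)
fn looks_like_utf8(window: &[u8], truncated: bool) -> bool {
    match std::str::from_utf8(window) {
        Ok(_) => true,
        // `error_len()` is `None` only when the input ended in the middle of
        // a code point; `Some(_)` means a genuinely invalid byte sequence.
        Err(e) => truncated && e.error_len().is_none(),
    }
}
```

The `truncated` flag matters because an incomplete trailing sequence is only forgivable when more bytes exist beyond the window; at the true end of a file it is still an encoding error.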

0.4.8 - 2026-03-30

Added

Root README.md with build/install guidance, first-use commands, documentation authority chain, and patent-mapping pointer.
Docs-site patent mapping page with navigation entries linking the patent relationship document from the concepts section.
Regression coverage for knowledge build no-labels path in both human-readable and --json modes.

Changed

Patent-positioning copy in the vision doc and docs-site landing/concepts pages now frames Unsterwerx as a document-domain implementation of TCA concepts, with broader cross-application coverage described as roadmap scope.
Documentation authority chain now calls out tmp/ as gitignored local material rather than part of the authoritative docs set.
Internal module layout split large SQL parsing, knowledge CLI, Bayes model, dedup, and import staging files into domain-aligned subdirectories while preserving public import paths.
Docs-site prose refined for lower AI-tell scores.

Fixed

knowledge build with no resolved training labels now prints candidate counts and valid next-step commands instead of a terse threshold error.
Dedup apply now rejects stale plans after vector rebuilds, keeps per-document database mutations atomic, and returns a non-zero exit status from apply or rollback on partial errors.
Bayesian training and retrain detection now use a consistent active-label cutoff, preventing missed concurrent labels and false retrains.
Import staging reuses precomputed content hashes during idempotency and file registration checks, avoiding redundant hashing.

0.4.7 - 2026-03-27

Added

unsterwerx import backfill to repair missing provenance rows without manual SQL.
First-class Markdown ingest support with structured parsing across headings, fenced code blocks, frontmatter, and tabular content.
A command-matrix acceptance test that exercises the full CLI surface against a bundled demo dataset.
Pipeline run provenance across similarity, knowledge scoring, vector builds, and dedup scans.

Changed

knowledge labels list now uses explicit --label and --source filters, with --label-source exposed as a visible alias and --label-type kept as a hidden compatibility alias.

Fixed

Bayesian bootstrap label insertion no longer crashes knowledge build with a SQL parameter mismatch.
Knowledge preflight now points to a real provenance repair command, and parse-failed imports keep their provenance so one bad file does not block the rest of the corpus.
archive --dry-run, status --help, status errors, structural heading diffs, benchmark JSON parity, and operator --json output now stay consistent in release workflows.
Local upgrade verification retries transient ETXTBSY failures, and the knowledge pipeline warns when downstream stages consume stale similarity lineage.

0.4.6 - 2026-03-26

Added

Offline upgrade support with unsterwerx upgrade --from-file, optional checksum verification, adjacent SHA256SUMS discovery, and deferred audit replay.
Published release checksums and installer-side integrity verification for tarball installs.

Changed

Upgrade endpoint resolution now honors UNSTERWERX_UPGRADE_URL across the CLI, installer, and release script.

Fixed

Local upgrade verification now rejects extracted binaries that fail their --version probe and treats invalid default config loading as best-effort during upgrade.
unsterwerx upgrade --check now reports concrete network failures and exits with status 2 when the update server is unreachable.

0.4.5 - 2026-03-25

Added

SQL file ingest, parsing, canonical markdown, full-text indexing, and classification support.
A production panic-path regression guard for src/**/*.rs and runtime validation for invalid similarity configurations.

Changed

File-backed canonical staging now preserves parser-derived titles and word counts, and the docs site was refreshed with clearer prose and current examples.

Fixed

Panic-prone knowledge, import, rules, similarity, benchmark, and CLI status paths now fail safely.
MinHash decoding, empty vector representative election, source-document prefix resolution, ingest retry recovery, and inactive-document filtering now hold up across the pipeline.

0.4.4 - 2026-03-24

Added

Retention anchor and signing timestamp details in unsterwerx status --document output for direct verification of document lifecycle metadata.

Changed

Skipped re-import self-healing now reuses a single document_id lookup and a shared SystemTime to RFC 3339 conversion path.

Fixed

Local filesystem imports now persist filesystem creation and modification timestamps into provenance, preventing retention anchor fallback to ingest time.
Unchanged re-imports now refresh provenance timestamps and filesystem audit metadata when source timestamps change without a content change.

0.4.3 - 2026-03-22

Added

Scoped governance preview and control commands: unsterwerx rules resolve for cascaded policy inspection and unsterwerx rules assign-scope for compare-and-set document scope assignment.
Universal parse-stage size guard (--max-size-file) applies to all formats, not just PDF.
Element::Table::continuation flag for chunked table emission with stable canonical markdown output.

Changed

XLSX NAC now uses calamine's streaming worksheet_cells_reader() + next_cell() instead of worksheet_range(), avoiding eager materialization of entire sheets into memory.
DOCX/PPTX NACs stream XML directly from ZIP entries via Reader::from_reader(BufReader) instead of loading full XML strings into memory.
DOCX/XLSX table emission flushes rows in bounded chunks (default 10,000), keeping memory proportional to chunk size rather than total row count.
TXT/CSV parsing uses BufReader instead of fs::read_to_string, preserving identical output.
Format-specific parser functions changed to pub(crate) visibility so the universal size guard cannot be bypassed by direct calls.

Fixed

Parser entrypoints keep format-specific parsers crate-internal so external callers cannot bypass parse-stage size limits.
DOCX table chunking no longer emits a spurious empty continuation chunk when row count is an exact multiple of chunk_rows.
DOCX validates chunk_rows >= 1, matching XLSX behavior.
Classification now honors scoped rules in both batch and single-document runs, preventing division and user rules from leaking into unrelated documents.
Scoped governance validation rejects global rules and policies that incorrectly include a scope_id, enforced in the rule persistence layer.
Import scope enforcement handles unchanged reimports, cross-source content-hash duplicates, and dry-run previews so scope conflicts are detected consistently.
Scoped policy resolution, legal-hold checks, and CLI previews resolve only the applicable scope chain and show the cascaded effective result.
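
The DOCX chunking fixes above (no empty trailing chunk, and `chunk_rows >= 1`) can be illustrated with a small sketch. The function `emit_chunks` and its `(is_continuation, rows)` return shape are assumptions for illustration only; the real emitter works on table elements rather than plain slices.

```rust
/// Splits `rows` into chunks of at most `chunk_rows` rows. Each chunk is
/// paired with a flag that is true for every chunk after the first.
/// (Hypothetical helper, not the project's API.)
fn emit_chunks<T: Clone>(rows: &[T], chunk_rows: usize) -> Result<Vec<(bool, Vec<T>)>, String> {
    // `slice::chunks` panics on a chunk size of zero, so validate first.
    if chunk_rows == 0 {
        return Err("chunk_rows must be >= 1".to_string());
    }
    Ok(rows
        .chunks(chunk_rows)
        .enumerate()
        // `chunks` never yields an empty slice, so an exact multiple of
        // `chunk_rows` produces no trailing continuation chunk.
        .map(|(i, chunk)| (i > 0, chunk.to_vec()))
        .collect())
}
```

With four rows and `chunk_rows = 2`, this yields exactly two chunks, the second flagged as a continuation.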

0.4.2 - 2026-03-20

Added

Classification-rule lifecycle controls: unsterwerx rules reactivate and unsterwerx rules remove --purge, plus dedicated audit actions for retire/reactivate events.

Fixed

Reconstruct template loading now surfaces template directory access and template parse errors instead of silently falling back, and tightens default report fallback to require an exact report.md match.
Classification rule retirement/deletion avoids FK violations, cleans up only rule-affected classification state, and allows classify to rebuild results for already-classified documents after rule lifecycle changes.
unsterwerx similarity honors persisted similarity.* config values when flags are omitted, restoring the intended precedence of CLI flag > config file > built-in default.
Similarity state no longer goes stale across reruns and lifecycle transitions: candidate sets are replaced atomically, similarity rows are cleaned up when documents are deduplicated or archived, and legacy similarity tables are rebuilt with ON DELETE CASCADE.
knowledge build no longer leaves stale pair scores when the latest Bayesian model is reused: knowledge_scores for the active model is cleared and rewritten in one transaction.
Signed PDF ingest/import now persists normalized signature metadata during staging, preserves original PDFs for reconstruction, self-heals pre-patch rows on re-import, and avoids duplicate signature audit events on canonical retries.
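
The precedence restored by the similarity fix above (CLI flag > config file > built-in default) reduces to a short `Option` chain. This is a sketch only; `resolve_setting` is a hypothetical name, and the actual config loader is not shown in this changelog.

```rust
/// Picks the effective value for one setting: an explicit CLI flag wins,
/// then a persisted config value, then the built-in default.
/// (Hypothetical helper.)
fn resolve_setting<T>(cli_flag: Option<T>, config_value: Option<T>, built_in_default: T) -> T {
    cli_flag.or(config_value).unwrap_or(built_in_default)
}
```

For example, `resolve_setting(None, Some(0.7), 0.5)` returns the config value `0.7` because no flag was passed.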

0.4.1 - 2026-03-16

Added

Shared document ID prefix resolution with filename fallback across CLI workflows, covering reconstruct, diff, status, and single-document classification paths.
Review and rollback ergonomics for Business Intelligence deduplication: knowledge dedup list, show, and rollback.

Changed

Local import staging carries scan statistics through the import pipeline so Shared Sandbox ingest reports empty and oversized files without rescanning the source tree.
Canonical full-text search uses strict adjacent matching for CJK terms while preserving clean display titles and snippets.

Fixed

Single-document classification now requires canonical or indexed documents before applying classification rules.
Benchmark output reports storage overhead as a positive multiplier instead of a negative reduction percentage when compaction grows data.
Fixed knowledge dedup list --limit and rollback target resolution for numeric dedup rule prefixes.

0.4.0 - 2026-03-16

Added

knowledge dedup scan and knowledge dedup apply for Business Intelligence dedup inside knowledge vectors.
knowledge.dedup_threshold config and deduplicated document status for managed post-dedup exclusion.

Changed

Downstream knowledge flows now exclude deduplicated and archived documents from scoring, clustering, and search.
Dedup provenance merge now preserves strongest weight, earliest origin timestamp, and updates the kept document retention anchor.
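
The provenance merge rule above can be sketched as follows. The `Provenance` struct and its field names are assumptions made for this example; the retention-anchor update is left out because the changelog does not say how the anchor is derived.

```rust
/// Hypothetical, simplified provenance record.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Provenance {
    weight: f64,
    /// Origin time as seconds since the Unix epoch (an assumed encoding).
    origin_ts: i64,
}

/// Merges the provenance of a removed duplicate into the kept document:
/// the strongest weight and the earliest origin timestamp survive.
fn merge_provenance(kept: Provenance, removed: Provenance) -> Provenance {
    Provenance {
        weight: kept.weight.max(removed.weight),
        origin_ts: kept.origin_ts.min(removed.origin_ts),
    }
}
```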

Fixed

Dedup apply now skips removal if rollback diff generation fails.
File-only archive accounting during dedup now reports freed bytes only when the source file is actually removed.

0.3.3 - 2026-03-16

Added

Knowledge vector graph support under unsterwerx knowledge vectors with build, list, show, search, and traverse subcommands.
Vector graph configuration for clustering threshold, edge threshold, minimum vector size, and oversize warnings.

Changed

Vector build reconciliation now preserves stable vector IDs when cluster overlap remains high.
Vector confidence now uses mean pairwise posterior within each cluster.
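
One way to read "mean pairwise posterior" is the average of the posterior over every unordered pair of cluster members. The sketch below assumes that reading, and assumes every pair has a score; `mean_pairwise_posterior` and the `posterior` lookup closure are hypothetical names.

```rust
/// Averages `posterior(i, j)` over all unordered pairs `i < j` among `n`
/// cluster members. Returns `None` for clusters with fewer than two members,
/// which have no pairs to average. (Hypothetical helper.)
fn mean_pairwise_posterior(n: usize, posterior: impl Fn(usize, usize) -> f64) -> Option<f64> {
    if n < 2 {
        return None;
    }
    let mut sum = 0.0;
    let mut pairs = 0u32;
    for i in 0..n {
        for j in (i + 1)..n {
            sum += posterior(i, j);
            pairs += 1;
        }
    }
    Some(sum / f64::from(pairs))
}
```

How the real implementation treats pairs that were never scored is not stated in this changelog, so that case is deliberately not modeled here.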

Fixed

Zero-pair vector rebuilds now clean stale graph state and report accurate dry-run deletions.
Vector prefix lookup, traversal, and search result handling were corrected to avoid capped or starved result windows.

0.3.2 - 2026-03-13

Added

Bayesian knowledge scoring under unsterwerx knowledge build.
knowledge labels add and knowledge labels list for user feedback training labels.
Automatic model invalidation on config changes, new labels, feature-version bumps, and IDF snapshot changes.

Fixed

Bayesian scoring now tracks unscored feedback and supports ad-hoc scoring for labeled non-candidate pairs.

0.3.0 - 2026-02-25

Added

Shared PDF text normalization and structural extraction pipeline so both pdf-extract and lopdf paths produce consistent output.
List detection for Unicode bullets, ASCII bullets, and ordered markers in PDF parsing.
Strict YAML frontmatter skipping for PDF reconstruction input.
Regression tests for numbered-heading vs list classification, Unicode/multibyte text safety, and frontmatter edge cases.

Changed

Improved PDF parser paragraph handling with explicit state-machine flushing on blank lines and block transitions.
Enhanced PDF renderer to support H3/H4 headings, unordered -/* bullets, ordered lists, and compact table row rendering.

Fixed

Fixed concatenated-word recovery in PowerPoint-converted PDFs by inserting missing spaces at common token boundaries.
Fixed numbered section headings like 1. Introduction being misclassified as ordered list items.
Fixed frontmatter closing-fence detection to avoid terminating on partial --- sequences in content.
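
The closing-fence fix above comes down to matching whole lines rather than substrings. The following is a minimal sketch under stated assumptions: the hypothetical `skip_frontmatter` treats a line as a fence only when it is exactly `---` (ignoring trailing whitespace), and returns the input unchanged if the frontmatter is never closed.

```rust
/// Returns the content after a leading YAML frontmatter block, or the whole
/// input when there is no complete frontmatter. (Hypothetical helper.)
fn skip_frontmatter(input: &str) -> &str {
    let mut lines = input.split_inclusive('\n');
    // Frontmatter must open on the very first line.
    let first = match lines.next() {
        Some(line) if line.trim_end() == "---" => line,
        _ => return input,
    };
    let mut offset = first.len();
    for line in lines {
        offset += line.len();
        // Only a line that is exactly `---` closes the block; `----` or
        // `a --- b` inside the frontmatter does not.
        if line.trim_end() == "---" {
            return &input[offset..];
        }
    }
    input // no closing fence found: treat everything as content
}
```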

0.2.9 - 2026-02-25

Added

PdfParseError typed enum (Encrypted, ImageOnly, ParseFailed) replacing string-based errors.
DocumentStatus::ImageOnly variant for scanned/image-only PDFs.
route_document_error() helper for routing image-only PDFs to image_only status.
MAX_PDF_BYTES (100 MB) file size guard before fs::read() in PDF parser.
8 new tests for typed error downcasting, oversize rejection, and routing logic.

Changed

Updated PDF parser to return typed PdfParseError variants via .into() instead of anyhow!() strings.

0.2.8 - 2026-02-25

Added

Cascade validation for scoped retention policies; lower scopes cannot loosen parent scope constraints.
Migration to set match_all: true on seed-cv and seed-report rules, reducing false positives.
Signed PDF handling: signature timestamp extraction and original PDF CAS storage.
Policy cascade scope columns (scope, scope_id) for retention policies.
--format auto-detection from output extension in reconstruct CLI.
10 new tests for cascade validation, signed document policy resolution, and seed rule tightening.

Fixed

Fixed benchmark archive stage FK violations by ensuring classification rules exist before classifying.
Fixed benchmark fresh-mode ingest to properly register documents before canonical extraction.

0.2.7 - 2026-02-25

Added

Scoped retention policy cascade (org > division > user) per US9069626B2 Claims 5-6.
PolicyCascadeViolation error variant for clear cascade rejection messages.
Signed PDF detection via byte-level marker search with lopdf-based timestamp extraction.
Original signed PDF CAS archival.
PDF date parser supporting D:YYYYMMDDHHmmSS with timezone normalization.
rules policy and rules policies CLI subcommands.
Read-only PDF reconstruction encryption (RC4, V=1/R=2).
20+ new tests.

Changed

Updated resolve_effective_policy() to walk scope levels with most-restrictive-wins merge.
Updated signed documents to always resolve as immutable with legal hold.
Updated archive cleaner to respect legal hold and signed-document protection.
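
A most-restrictive-wins merge across scope levels might look like the sketch below. The `Policy` struct, its fields, and the choice that "more restrictive" means longer retention and any-scope-sets-it flags are all assumptions for illustration; the actual `resolve_effective_policy()` logic is not shown in this changelog.

```rust
/// Hypothetical, simplified retention policy.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Policy {
    retention_days: u32,
    legal_hold: bool,
    immutable: bool,
}

/// Folds the policies from each scope level (for example org, division,
/// user) into one effective policy. Assumed rule: keep the longest
/// retention, and keep a hold or immutability flag if any level sets it.
fn merge_most_restrictive(levels: &[Policy]) -> Option<Policy> {
    levels.iter().copied().reduce(|acc, p| Policy {
        retention_days: acc.retention_days.max(p.retention_days),
        legal_hold: acc.legal_hold || p.legal_hold,
        immutable: acc.immutable || p.immutable,
    })
}
```

Under these assumptions a lower scope can tighten the result but never loosen what a parent scope already requires.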

Fixed

Fixed legal-hold protection for signed documents in policy resolution and archive guardrails.
Fixed benchmark stage isolation so canonical extraction timing is not absorbed into ingest stage.

0.2.6 - 2026-02-25

Added

Source hierarchy trust-rule management under rules source with list, set, remove, and resolve.
Temporal VersionGraph support and timeline diff event integration.
Read-only PDF reconstruction encryption.

Changed

Updated rule classification to compute weighted confidence scores.
Updated import staging to apply source hierarchy resolution by default.
Updated archive output to include retention_pending counts.

Fixed

Fixed benchmark stage isolation.
Fixed ingest benchmark throughput reporting.
Fixed archive retention-age checks.
Fixed legal-hold protection for signed documents.

0.2.5 - 2026-02-25

Added

Universal Import NAC framework with import command and source adapters for local, chatgpt, notion, obsidian, and telegram.
Import outcome tracking for duplicates and unsupported files.
Focused import pipeline regression tests.

Changed

Refactored ingest to run through the local import adapter path.

Fixed

Fixed duplicate content handling to avoid re-canonicalizing existing documents.
Fixed import batch lifecycle handling.
Fixed import item upsert flow.
Fixed import history output labels.

0.2.4 - 2026-02-24

Fixed

Fixed noisy search auto-indexing output.
Fixed search auto-indexing PDF noise by bypassing pdf-extract.

0.2.3 - 2026-02-24

Fixed

Fixed search returning no results after ingest by auto-running canonical extraction.
Added regression coverage for search-after-ingest workflow.

0.2.2 - 2026-02-23

Fixed

Fixed classify --document to only classify canonical/indexed documents.
Fixed benchmark --runs 0 validation.
Fixed benchmark memory reporting to read true peak RSS (VmHWM).

0.2.1 - 2026-02-23

Added

Bounded parallel canonical extraction (default 8 workers).
UNSTERWERX_CANONICAL_THREADS environment variable.

Changed

Updated benchmark to suppress noisy parser output.
Updated canonical sub-timings to report normalized shares under parallel execution.

Fixed

Fixed benchmark --format json output contamination from parser noise.
Fixed canonical extraction to treat empty parse outputs as failures.

0.2.0 - 2026-02-23

Added

benchmark command for pipeline benchmarking.
Version-aware self-upgrade via upgrade command.
--check and --force flags for upgrade scripting.
Compile-time version embedding.

Changed

Updated install.sh to use latest-version.txt as source of truth.
Updated release automation for deployment.

Fixed

Fixed destructive benchmark behavior by running archive in dry-run mode.
Fixed installer upgrade permission handling with sudo fallback.

0.1.0 - 2026-02-22

Added

Initial public release.
Commands: ingest, status, similarity, diff, search, reconstruct, classify, archive, audit, rules, config.