Changelog
All notable changes to Unsterwerx are documented here. This project adheres to Semantic Versioning.
Unreleased
0.4.9 - 2026-03-31
Added
- Content-based file-type resolution for ingest and import using magic-byte sniffing, first-chunk reuse from streaming SHA-256 hashing, PDF buffer cleanup, and optional pdftotext fallback for recoverable parser failures.
- Managed ingest/import job execution with foreground and background run records, hidden worker replay, resumable runs, persistent diagnostics, jobs control/status commands, deterministic log files, and shared JSON envelopes across the operator surface.
Changed
- Retry and duplicate-heal paths now re-sniff file-backed documents from content, persist corrected file_type values, and retry previously unsupported documents when they now resolve to a parseable format.
- Operator output for ingest, import run, and jobs * now uses a shared JSON envelope and stable status counters while keeping human-readable summaries aligned with the same DTOs.
- Canonical worker concurrency now honors persisted ingest.worker_threads settings for background and resumed runs.
Fixed
- Mislabeled plain-text, HTML, offset-header PDF, and junk-padded PDF files now route through lightweight recovery before being rejected.
- Text magic-byte detection now preserves UTF-8 recognition when the 1024-byte sniff window ends mid-codepoint, avoiding fallback to stale extension-based dispatch.
- jobs stop now interrupts ingest --retry-errors work instead of waiting for the full retry batch to finish.
- config set now rejects invalid nullable TOML types up front via round-trip validation.
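The mid-codepoint case from the text-detection fix above can be illustrated with the standard library alone. This is a minimal sketch, not the project's detector: the function name `window_is_utf8_text` is hypothetical, and it assumes that a multi-byte sequence cut off at the end of the window should still count as text. It relies on `Utf8Error::error_len()` returning `None` only when the input ends inside an incomplete sequence.

```rust
use std::str;

/// Hypothetical helper: decide whether a sniff window looks like UTF-8 text,
/// tolerating a multi-byte sequence that is cut off at the end of the window.
pub fn window_is_utf8_text(window: &[u8]) -> bool {
    match str::from_utf8(window) {
        Ok(_) => true,
        // `error_len() == None` means the bytes ran out mid-codepoint;
        // `Some(_)` means a byte sequence that can never be valid UTF-8.
        Err(err) => err.error_len().is_none(),
    }
}
```

A real detector would likely combine this with other signals (control bytes, binary signatures); the sketch only covers the truncation edge case.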
0.4.8 - 2026-03-30
Added
- Root README.md with build/install guidance, first-use commands, documentation authority chain, and patent-mapping pointer.
- Docs-site patent mapping page with navigation entries linking the patent relationship document from the concepts section.
- Regression coverage for the knowledge build no-labels path in both human-readable and --json modes.
Changed
- Patent-positioning copy in the vision doc and docs-site landing/concepts pages now frames Unsterwerx as a document-domain implementation of TCA concepts, with broader cross-application coverage described as roadmap scope.
- Documentation authority chain now calls out tmp/ as gitignored local material rather than part of the authoritative docs set.
- Internal module layout split large SQL parsing, knowledge CLI, Bayes model, dedup, and import staging files into domain-aligned subdirectories while preserving public import paths.
- Docs-site prose refined for lower AI-tell scores.
Fixed
- knowledge build with no resolved training labels now prints candidate counts and valid next-step commands instead of a terse threshold error.
- Dedup apply rejects stale plans after vector rebuilds, keeps per-document database mutations atomic, and returns non-zero from apply or rollback on partial errors.
- Bayesian training and retrain detection now use a consistent active-label cutoff, preventing missed concurrent labels and false retrains.
- Import staging reuses precomputed content hashes during idempotency and file registration checks, avoiding redundant hashing.
0.4.7 - 2026-03-27
Added
- unsterwerx import backfill to repair missing provenance rows without manual SQL.
- First-class Markdown ingest support with structured parsing across headings, fenced code blocks, frontmatter, and tabular content.
- A command-matrix acceptance test that exercises the full CLI surface against a bundled demo dataset.
- Pipeline run provenance across similarity, knowledge scoring, vector builds, and dedup scans.
Changed
- knowledge labels list now uses explicit --label and --source filters, with --label-source exposed as a visible alias and --label-type kept as a hidden compatibility alias.
Fixed
- Bayesian bootstrap label insertion no longer crashes knowledge build with a SQL parameter mismatch.
- Knowledge preflight now points to a real provenance repair command, and parse-failed imports keep their provenance so one bad file does not block the rest of the corpus.
- archive --dry-run, status --help, status errors, structural heading diffs, benchmark JSON parity, and operator --json output now stay consistent in release workflows.
- Local upgrade verification retries transient ETXTBSY failures, and the knowledge pipeline warns when downstream stages consume stale similarity lineage.
0.4.6 - 2026-03-26
Added
- Offline upgrade support with unsterwerx upgrade --from-file, optional checksum verification, adjacent SHA256SUMS discovery, and deferred audit replay.
- Published release checksums and installer-side integrity verification for tarball installs.
Changed
- Upgrade endpoint resolution now honors UNSTERWERX_UPGRADE_URL across the CLI, installer, and release script.
Fixed
- Local upgrade verification now rejects extracted binaries that fail their --version probe and treats invalid default config loading as best-effort during upgrade.
- unsterwerx upgrade --check now reports concrete network failures and exits with status 2 when the update server is unreachable.
0.4.5 - 2026-03-25
Added
- SQL file ingest, parsing, canonical markdown, full-text indexing, and classification support.
- A production panic-path regression guard for src/**/*.rs and runtime validation for invalid similarity configurations.
Changed
- File-backed canonical staging now preserves parser-derived titles and word counts, and the docs site was refreshed with clearer prose and current examples.
Fixed
- Panic-prone knowledge, import, rules, similarity, benchmark, and CLI status paths now fail safely.
- MinHash decoding, empty vector representative election, source-document prefix resolution, ingest retry recovery, and inactive-document filtering now hold up across the pipeline.
0.4.4 - 2026-03-24
Added
- Retention anchor and signing timestamp details in unsterwerx status --document output for direct verification of document lifecycle metadata.
Changed
- Skipped re-import self-healing now reuses a single document_id lookup and a shared SystemTime to RFC 3339 conversion path.
Fixed
- Local filesystem imports now persist filesystem creation and modification timestamps into provenance, preventing retention anchor fallback to ingest time.
- Unchanged re-imports now refresh provenance timestamps and filesystem audit metadata when source timestamps change without a content change.
0.4.3 - 2026-03-22
Added
- Scoped governance preview and control commands: unsterwerx rules resolve for cascaded policy inspection and unsterwerx rules assign-scope for compare-and-set document scope assignment.
- Universal parse-stage size guard (--max-size-file) applies to all formats, not just PDF.
- Element::Table::continuation flag for chunked table emission with stable canonical markdown output.
Changed
- XLSX NAC now uses calamine's streaming worksheet_cells_reader() + next_cell() instead of worksheet_range(), avoiding eager materialization of entire sheets into memory.
- DOCX/PPTX NACs stream XML directly from ZIP entries via Reader::from_reader(BufReader) instead of loading full XML strings into memory.
- DOCX/XLSX table emission flushes rows in bounded chunks (default 10,000), keeping memory proportional to chunk size rather than total row count.
- TXT/CSV parsing uses BufReader instead of fs::read_to_string, preserving identical output.
- Format-specific parser functions changed to pub(crate) visibility so the universal size guard cannot be bypassed by direct calls.
Fixed
- Parser entrypoints keep format-specific parsers crate-internal so external callers cannot bypass parse-stage size limits.
- DOCX table chunking no longer emits a spurious empty continuation chunk when row count is an exact multiple of chunk_rows.
- DOCX validates chunk_rows >= 1, matching XLSX behavior.
- Classification now honors scoped rules in both batch and single-document runs, preventing division and user rules from leaking into unrelated documents.
- Scoped governance validation rejects global rules and policies that incorrectly include a scope_id, enforced in the rule persistence layer.
- Import scope enforcement handles unchanged reimports, cross-source content-hash duplicates, and dry-run previews so scope conflicts are detected consistently.
- Scoped policy resolution, legal-hold checks, and CLI previews resolve only the applicable scope chain and show the cascaded effective result.
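The chunking behaviour described in the DOCX fixes above (no empty trailing chunk on an exact multiple, and a chunk_rows >= 1 check) can be sketched as follows. This is an illustration over an in-memory slice, not the streaming implementation; `TableChunk` and `chunk_table` are hypothetical names, and the `continuation` field only mirrors the idea of the flag named in this release.

```rust
/// Hypothetical stand-in for one emitted table chunk.
/// `continuation` is false only for the first chunk of a table.
#[derive(Debug, PartialEq)]
pub struct TableChunk<T> {
    pub continuation: bool,
    pub rows: Vec<T>,
}

/// Split `rows` into chunks of at most `chunk_rows` rows each.
pub fn chunk_table<T: Clone>(
    rows: &[T],
    chunk_rows: usize,
) -> Result<Vec<TableChunk<T>>, String> {
    // Reject a zero chunk size up front; `slice::chunks` would panic on it.
    if chunk_rows < 1 {
        return Err("chunk_rows must be >= 1".to_string());
    }
    // `chunks` never yields an empty slice, so a row count that is an exact
    // multiple of `chunk_rows` produces no trailing empty chunk.
    Ok(rows
        .chunks(chunk_rows)
        .enumerate()
        .map(|(index, part)| TableChunk {
            continuation: index > 0,
            rows: part.to_vec(),
        })
        .collect())
}
```

With four rows and `chunk_rows = 2`, the sketch yields exactly two chunks, the second marked as a continuation.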
0.4.2 - 2026-03-20
Added
- Classification-rule lifecycle controls: unsterwerx rules reactivate and unsterwerx rules remove --purge, plus dedicated audit actions for retire/reactivate events.
Fixed
- Reconstruct template loading now surfaces template directory access and template parse errors instead of silently falling back, and tightens default report fallback to require an exact report.md match.
- Classification rule retirement/deletion avoids FK violations, cleans up only rule-affected classification state, and allows classify to rebuild results for already-classified documents after rule lifecycle changes.
- unsterwerx similarity honors persisted similarity.* config values when flags are omitted, restoring the intended precedence of CLI flag > config file > built-in default.
- Stale similarity state across reruns and lifecycle transitions is resolved by replacing candidate sets atomically, cleaning similarity rows when documents are deduplicated or archived, and rebuilding legacy similarity tables with ON DELETE CASCADE.
- Stale knowledge build pair scores when the latest Bayesian model is reused are resolved by clearing and rewriting knowledge_scores for the active model in one transaction.
- Signed PDF ingest/import now persists normalized signature metadata during staging, preserves original PDFs for reconstruction, self-heals pre-patch rows on re-import, and avoids duplicate signature audit events on canonical retries.
0.4.1 - 2026-03-16
Added
- Shared document ID prefix resolution with filename fallback across CLI workflows, covering reconstruct, diff, status, and single-document classification paths.
- Review and rollback ergonomics for Business Intelligence deduplication: knowledge dedup list, show, and rollback.
Changed
- Local import staging carries scan statistics through the import pipeline so Shared Sandbox ingest reports empty and oversized files without rescanning the source tree.
- Canonical full-text search uses strict adjacent matching for CJK terms while preserving clean display titles and snippets.
Fixed
- Single-document classification requires canonical or indexed documents before applying classification rules.
- Benchmark output reports storage overhead as a positive multiplier instead of a negative reduction percentage when compaction grows data.
- knowledge dedup list --limit and rollback target resolution for numeric dedup rule prefixes.
0.4.0 - 2026-03-16
Added
- knowledge dedup scan and knowledge dedup apply for Business Intelligence dedup inside knowledge vectors.
- knowledge.dedup_threshold config and deduplicated document status for managed post-dedup exclusion.
Changed
- Downstream knowledge flows now exclude deduplicated and archived documents from scoring, clustering, and search.
- Dedup provenance merge now preserves strongest weight, earliest origin timestamp, and updates the kept document retention anchor.
Fixed
- Dedup apply now skips removal if rollback diff generation fails.
- File-only archive accounting during dedup now reports freed bytes only when the source file is actually removed.
0.3.3 - 2026-03-16
Added
- Knowledge vector graph support under unsterwerx knowledge vectors with build, list, show, search, and traverse subcommands.
- Vector graph configuration for clustering threshold, edge threshold, minimum vector size, and oversize warnings.
Changed
- Vector build reconciliation now preserves stable vector IDs when cluster overlap remains high.
- Vector confidence now uses mean pairwise posterior within each cluster.
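As a rough illustration of the confidence measure named above, the mean pairwise posterior of a cluster averages a score over every unordered pair of members. The sketch assumes members are identified by `u32` IDs and that a posterior is available for every pair; the function name, the ID type, and the choice to return `None` for clusters with fewer than two members are all assumptions.

```rust
/// Hypothetical helper: average `posterior(a, b)` over all unordered pairs
/// of cluster members. Returns `None` when the cluster has no pairs.
pub fn mean_pairwise_posterior<F>(members: &[u32], posterior: F) -> Option<f64>
where
    F: Fn(u32, u32) -> f64,
{
    let mut sum = 0.0;
    let mut pairs = 0u32;
    for (index, &a) in members.iter().enumerate() {
        // Visit each unordered pair once by only looking at later members.
        for &b in &members[index + 1..] {
            sum += posterior(a, b);
            pairs += 1;
        }
    }
    if pairs == 0 {
        None
    } else {
        Some(sum / f64::from(pairs))
    }
}
```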
Fixed
- Zero-pair vector rebuilds now clean stale graph state and report accurate dry-run deletions.
- Vector prefix lookup, traversal, and search result handling were corrected to avoid capped or starved result windows.
0.3.2 - 2026-03-13
Added
- Bayesian knowledge scoring under unsterwerx knowledge build.
- knowledge labels add and knowledge labels list for user feedback training labels.
- Automatic model invalidation on config changes, new labels, feature-version bumps, and IDF snapshot changes.
Fixed
- Bayesian scoring now tracks unscored feedback and supports ad-hoc scoring for labeled non-candidate pairs.
0.3.0 - 2026-02-25
Added
- Shared PDF text normalization and structural extraction pipeline so both pdf-extract and lopdf paths produce consistent output.
- List detection for Unicode bullets, ASCII bullets, and ordered markers in PDF parsing.
- Strict YAML frontmatter skipping for PDF reconstruction input.
- Regression tests for numbered-heading vs list classification, Unicode/multibyte text safety, and frontmatter edge cases.
Changed
- Improved PDF parser paragraph handling with explicit state-machine flushing on blank lines and block transitions.
- Enhanced PDF renderer to support H3/H4 headings, unordered -/* bullets, ordered lists, and compact table row rendering.
Fixed
- Fixed concatenated-word recovery in PowerPoint-converted PDFs by inserting missing spaces at common token boundaries.
- Fixed numbered section headings like 1. Introduction being misclassified as ordered list items.
- Fixed frontmatter closing-fence detection to avoid terminating on partial --- sequences in content.
0.2.9 - 2026-02-25
Added
- PdfParseError typed enum (Encrypted, ImageOnly, ParseFailed) replacing string-based errors.
- DocumentStatus::ImageOnly variant for scanned/image-only PDFs.
- route_document_error() helper for routing image-only PDFs to image_only status.
- MAX_PDF_BYTES (100 MB) file size guard before fs::read() in PDF parser.
- 8 new tests for typed error downcasting, oversize rejection, and routing logic.
Changed
- Updated PDF parser to return typed PdfParseError variants via .into() instead of anyhow!() strings.
0.2.8 - 2026-02-25
Added
- Cascade validation for scoped retention policies; lower scopes cannot loosen parent scope constraints.
- Migration to set match_all: true on seed-cv and seed-report rules, reducing false positives.
- Signed PDF handling: signature timestamp extraction and original PDF CAS storage.
- Policy cascade scope columns (scope, scope_id) for retention policies.
- --format auto-detection from output extension in reconstruct CLI.
- 10 new tests for cascade validation, signed document policy resolution, and seed rule tightening.
Fixed
- Fixed benchmark archive stage FK violations by ensuring classification rules exist before classifying.
- Fixed benchmark fresh-mode ingest to properly register documents before canonical extraction.
0.2.7 - 2026-02-25
Added
- Scoped retention policy cascade (org > division > user) per US9069626B2 Claims 5-6.
- PolicyCascadeViolation error variant for clear cascade rejection messages.
- Signed PDF detection via byte-level marker search with lopdf-based timestamp extraction.
- Original signed PDF CAS archival.
- PDF date parser supporting D:YYYYMMDDHHmmSS with timezone normalization.
- rules policy and rules policies CLI subcommands.
- Read-only PDF reconstruction encryption (RC4, V=1/R=2).
- 20+ new tests.
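To make the date-parser entry above concrete, here is a minimal sketch of parsing a PDF date string and normalizing it to UTC. It is not the project's parser: the function names are hypothetical, it requires every field through seconds (the PDF format also allows shorter forms), it does not check day-of-month against month length, and returning Unix seconds is an assumed output shape. The offset syntax handled is `Z`, or `+HH'mm'` / `-HH'mm'` as defined for PDF date strings.

```rust
/// Parse `len` ASCII digits starting at byte `start`.
fn digits(s: &str, start: usize, len: usize) -> Option<i64> {
    let part = s.get(start..start + len)?;
    if part.bytes().all(|b| b.is_ascii_digit()) {
        part.parse().ok()
    } else {
        None
    }
}

/// Days since 1970-01-01 for a proleptic Gregorian date.
fn days_from_civil(year: i64, month: i64, day: i64) -> i64 {
    // Treat January and February as months of the previous year.
    let y = if month <= 2 { year - 1 } else { year };
    let era = y.div_euclid(400);
    let year_of_era = y.rem_euclid(400);
    let shifted_month = (month + 9) % 12; // March = 0
    let day_of_year = (153 * shifted_month + 2) / 5 + day - 1;
    let day_of_era = year_of_era * 365 + year_of_era / 4 - year_of_era / 100 + day_of_year;
    era * 146_097 + day_of_era - 719_468
}

/// Hypothetical parser: `D:YYYYMMDDHHmmSS` plus optional offset, to Unix seconds in UTC.
pub fn pdf_date_to_unix_utc(raw: &str) -> Option<i64> {
    let s = raw.strip_prefix("D:")?;
    let (year, month, day) = (digits(s, 0, 4)?, digits(s, 4, 2)?, digits(s, 6, 2)?);
    let (hour, minute, second) = (digits(s, 8, 2)?, digits(s, 10, 2)?, digits(s, 12, 2)?);
    if !(1..=12).contains(&month) || !(1..=31).contains(&day) {
        return None;
    }
    if hour > 23 || minute > 59 || second > 59 {
        return None;
    }
    let tz = s.get(14..)?;
    // A missing offset is treated as UTC here, which is an assumption.
    let offset_secs = match tz.bytes().next() {
        None | Some(b'Z') => 0,
        Some(sign) if sign == b'+' || sign == b'-' => {
            let hours = digits(tz, 1, 2)?;
            let minutes = if tz.as_bytes().get(3) == Some(&b'\'') {
                digits(tz, 4, 2)?
            } else {
                0
            };
            let magnitude = hours * 3600 + minutes * 60;
            if sign == b'+' { magnitude } else { -magnitude }
        }
        _ => return None,
    };
    let local = days_from_civil(year, month, day) * 86_400 + hour * 3600 + minute * 60 + second;
    // Local time minus its offset from UTC gives the UTC instant.
    Some(local - offset_secs)
}
```

Under these assumptions, `D:20260331120000+02'00'` and `D:20260331100000Z` resolve to the same instant.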
Changed
- Updated resolve_effective_policy() to walk scope levels with most-restrictive-wins merge.
- Updated signed documents to always resolve as immutable with legal hold.
- Updated archive cleaner to respect legal hold and signed-document protection.
Fixed
- Fixed legal-hold protection for signed documents in policy resolution and archive guardrails.
- Fixed benchmark stage isolation so canonical extraction timing is not absorbed into ingest stage.
0.2.6 - 2026-02-25
Added
- Source hierarchy trust-rule management under rules source with list, set, remove, and resolve.
- Temporal VersionGraph support and timeline diff event integration.
- Read-only PDF reconstruction encryption.
Changed
- Updated rule classification to compute weighted confidence scores.
- Updated import staging to apply source hierarchy resolution by default.
- Updated archive output to include retention_pending counts.
Fixed
- Fixed benchmark stage isolation.
- Fixed ingest benchmark throughput reporting.
- Fixed archive retention-age checks.
- Fixed legal-hold protection for signed documents.
0.2.5 - 2026-02-25
Added
- Universal Import NAC framework with import command and source adapters for local, chatgpt, notion, obsidian, and telegram.
- Import outcome tracking for duplicates and unsupported files.
- Focused import pipeline regression tests.
Changed
- Refactored ingest to run through the local import adapter path.
Fixed
- Fixed duplicate content handling to avoid re-canonicalizing existing documents.
- Fixed import batch lifecycle handling.
- Fixed import item upsert flow.
- Fixed import history output labels.
0.2.4 - 2026-02-24
Fixed
- Fixed noisy search auto-indexing output.
- Fixed search auto-indexing PDF noise by bypassing pdf-extract.
0.2.3 - 2026-02-24
Fixed
- Fixed search returning no results after ingest by auto-running canonical extraction.
- Added regression coverage for search-after-ingest workflow.
0.2.2 - 2026-02-23
Fixed
- Fixed classify --document to only classify canonical/indexed documents.
- Fixed benchmark --runs 0 validation.
- Fixed benchmark memory reporting to read true peak RSS (VmHWM).
0.2.1 - 2026-02-23
Added
- Bounded parallel canonical extraction (default 8 workers).
- UNSTERWERX_CANONICAL_THREADS environment variable.
Changed
- Updated benchmark to suppress noisy parser output.
- Updated canonical sub-timings to report normalized shares under parallel execution.
Fixed
- Fixed benchmark --format json output contamination from parser noise.
- Fixed canonical extraction to treat empty parse outputs as failures.
0.2.0 - 2026-02-23
Added
- benchmark command for pipeline benchmarking.
- Version-aware self-upgrade via upgrade command.
- --check and --force flags for upgrade scripting.
- Compile-time version embedding.
Changed
- Updated install.sh to use latest-version.txt as source of truth.
- Updated release automation for deployment.
Fixed
- Fixed destructive benchmark behavior by running archive in dry-run mode.
- Fixed installer upgrade permission handling with sudo fallback.
0.1.0 - 2026-02-22
Added
- Initial public release.
- Commands: ingest, status, similarity, diff, search, reconstruct, classify, archive, audit, rules, config.