Unsterwerx

FAQ

Installation

unsterwerx: command not found after install?

Ensure /usr/local/bin is on your PATH:

```bash
export PATH=$PATH:/usr/local/bin
```

Add that line to your ~/.zshrc or ~/.bashrc to make it permanent.

How do I upgrade?

```bash
unsterwerx upgrade
```

Or check for updates first:

```bash
unsterwerx upgrade --check
```

Documents

What file formats are supported?

Unsterwerx can parse and extract content from:

Legacy formats (.doc, .xls, .ppt) are registered in the database but marked as unsupported since no parser handles these binary formats.

What happens to corrupt or unreadable files?

Files that fail parsing are marked with error status in the database. They are still tracked and appear in unsterwerx status, but they cannot be searched or reconstructed, and are excluded from diffing and classification. Common causes: encrypted PDFs, corrupt headers, truncated files. Image-only scanned documents get their own image_only status.

How are duplicates detected?

Unsterwerx uses a two-stage approach:

  1. Exact duplicates: detected during ingestion by SHA-256 content hash. Identical files are skipped.
  2. Near-duplicates: detected by MinHash + LSH similarity analysis. Documents with similar text are identified as candidate pairs with Jaccard similarity scores.
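The two stages above can be sketched in Python. This is a minimal illustration, not Unsterwerx's actual implementation: the three-word shingles, the 64-hash signature, the 16-band LSH layout, and every function name are assumptions chosen for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 64  # assumed signature length
BANDS = 16       # assumed LSH layout: 16 bands of 4 rows each


def content_hash(data: bytes) -> str:
    """Stage 1: SHA-256 of the raw bytes; equal hashes mean exact duplicates."""
    return hashlib.sha256(data).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Split text into overlapping k-word shingles (assumed tokenization)."""
    words = text.lower().split()
    if len(words) <= k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash(items: set) -> list:
    """Stage 2a: keep the minimum digest under each salted hash function."""
    if not items:
        raise ValueError("cannot build a signature for an empty document")
    signature = []
    for seed in range(NUM_HASHES):
        salt = seed.to_bytes(8, "big")
        signature.append(min(
            hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest()
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def candidate_pairs(signatures: dict) -> set:
    """Stage 2b: documents that share any identical band become candidate pairs."""
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for band in range(BANDS):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for doc_ids in buckets.values():
        pairs.update(combinations(sorted(doc_ids), 2))
    return pairs
```

In this sketch, LSH only narrows the search to pairs that share a band. A score for each candidate pair would then come from `estimated_jaccard`, or from an exact Jaccard computation over the shingle sets.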

What does "canonical" mean?

Canonical content is the normalized markdown representation of a document. Regardless of the original format, the canonical version preserves structural elements (headings, body text, lists, tables, code blocks, page breaks) in markdown form. That makes cross-format comparison and search possible.

Storage

Where is data stored?

By default in ~/.unsterwerx/. Override with --data-dir or the UNSTERWERX_DATA environment variable. See Data Storage.
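For scripts that need to locate the same directory, the lookup can be approximated as follows. The precedence shown (flag first, then environment variable, then default) is an assumption, since the FAQ does not say which override wins; `resolve_data_dir` is a hypothetical helper, not part of Unsterwerx.

```python
import os
from pathlib import Path


def resolve_data_dir(cli_data_dir=None, env=None) -> Path:
    """Pick the data directory: --data-dir value, then UNSTERWERX_DATA, then ~/.unsterwerx.

    The ordering between the flag and the environment variable is assumed.
    """
    env = os.environ if env is None else env
    if cli_data_dir:
        return Path(cli_data_dir).expanduser()
    env_value = env.get("UNSTERWERX_DATA")
    if env_value:
        return Path(env_value).expanduser()
    return Path.home() / ".unsterwerx"
```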

How much disk space does Unsterwerx use?

The tool achieves significant compaction. In benchmarks with a 2.7 GB dataset (2,074 documents):

Can I use a different database?

No. SQLite is the only supported backend. The database is a single file (unsterwerx.db) in the data directory with WAL mode enabled.
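Because the database is an ordinary SQLite file, its journal mode can be checked with any SQLite client. The sketch below uses Python's standard `sqlite3` module against a stand-in database in a temporary directory; it assumes nothing about the Unsterwerx schema, and `journal_mode` and `demo` are hypothetical names.

```python
import sqlite3
import tempfile
from pathlib import Path


def journal_mode(db_path: Path) -> str:
    """Return the journal mode SQLite reports for the database file."""
    conn = sqlite3.connect(str(db_path))
    try:
        return conn.execute("PRAGMA journal_mode").fetchone()[0]
    finally:
        conn.close()


def demo() -> str:
    # Stand-in database; a real check would point at unsterwerx.db in the data directory.
    with tempfile.TemporaryDirectory() as tmp:
        db_path = Path(tmp) / "unsterwerx.db"
        conn = sqlite3.connect(str(db_path))
        conn.execute("PRAGMA journal_mode=WAL")  # WAL mode persists in the file
        conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY)")
        conn.commit()
        conn.close()
        return journal_mode(db_path)
```

Note that `sqlite3.connect` creates the file if it does not exist, so confirm the path before pointing this at a real data directory.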

Performance

How long does ingestion take?

Depends on document count, total size, and how many PDFs are in the mix. Benchmarks:

How do I speed up canonical extraction?

Set the UNSTERWERX_CANONICAL_THREADS environment variable to increase parallel workers (default: 8):

```bash
export UNSTERWERX_CANONICAL_THREADS=16
unsterwerx similarity
```
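As a rough mental model, a setting like this usually caps the size of a worker pool. The Python sketch below shows one way such a variable could be read and applied; the fallback to 8 for malformed or non-positive values is an assumption, and `worker_count` and `extract_all` are hypothetical names rather than Unsterwerx internals.

```python
import os
from concurrent.futures import ThreadPoolExecutor

DEFAULT_WORKERS = 8  # default stated in the FAQ


def worker_count(env=None) -> int:
    """Read UNSTERWERX_CANONICAL_THREADS, falling back to the default."""
    env = os.environ if env is None else env
    raw = env.get("UNSTERWERX_CANONICAL_THREADS", "")
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_WORKERS  # assumed fallback for unset or malformed values
    return value if value > 0 else DEFAULT_WORKERS


def extract_all(documents, extract, env=None):
    """Run extract over documents with a bounded thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=worker_count(env)) as pool:
        return list(pool.map(extract, documents))
```

Raising the worker count helps only while spare CPU and I/O capacity remain; beyond that point, extra workers mostly add contention.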

Policies

What is the policy cascade?

Retention policies follow a hierarchy: global > organization > division > user. Each level can only tighten constraints set by the level above. See Classification Guide.
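The tighten-only rule can be illustrated with a small Python sketch. The two policy fields (`min_retention_days`, where a higher value is stricter, and `allow_delete`, where False is stricter) are invented for the example and are not Unsterwerx's actual policy schema.

```python
from dataclasses import dataclass

# Cascade order from the FAQ, broadest level first.
LEVELS = ("global", "organization", "division", "user")


@dataclass(frozen=True)
class Policy:
    min_retention_days: int  # hypothetical field: higher is stricter
    allow_delete: bool       # hypothetical field: False is stricter


def tighten(parent: Policy, child: Policy) -> Policy:
    """Combine two levels so the child can only make the result stricter."""
    return Policy(
        min_retention_days=max(parent.min_retention_days, child.min_retention_days),
        allow_delete=parent.allow_delete and child.allow_delete,
    )


def effective_policy(policies: dict) -> Policy:
    """Walk the cascade from global down to user, skipping levels with no policy."""
    effective = policies["global"]
    for level in LEVELS[1:]:
        if level in policies:
            effective = tighten(effective, policies[level])
    return effective
```

With a global policy of 365 days, a division policy of 730 days, and a user policy that requests 30 days and disables deletion, the effective result in this sketch is 730 days with deletion disabled. The user's attempt to shorten retention is ignored, while the stricter deletion setting is kept.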

What happens to signed documents?

Signed PDFs are detected automatically during ingest and receive special treatment:

What can I do about documents that failed to parse?

Use the error recovery workflow:

  1. Review: unsterwerx status errors lists all documents in error or image_only status with error details.
  2. Retry: unsterwerx ingest --retry-errors re-attempts canonical extraction for error documents. Transient failures can succeed on a second pass.
  3. Dismiss: unsterwerx status dismiss <id> --reason "..." marks a document as unrecoverable. Dismissed documents are excluded from search and downstream processing but remain in the database for audit purposes.

Trust Chain

What is the audit trail?

Every operation that modifies data is recorded in an append-only, hash-chained log. Each event links to the previous event via a cryptographic hash, forming an unbroken chain that can be verified:

```bash
unsterwerx audit --verify
Chain verified: 142 events, integrity OK
```

Can the audit trail be tampered with?

The hash chain makes tampering detectable. If any event is modified or removed, unsterwerx audit --verify reports a chain break. Out-of-order insertions are caught the same way. The audit trail cannot be cleared or reset.
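To see why a hash chain exposes tampering, consider this minimal Python sketch. It is a generic illustration and not the Unsterwerx log format: the event layout, the JSON serialization, the all-zero genesis value, and the function names are all assumptions.

```python
import hashlib
import json
from typing import Optional

GENESIS = "0" * 64  # assumed placeholder hash preceding the first event


def event_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous event's hash together with this event's payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_event(chain: list, payload: dict) -> None:
    """Append an event that links back to the current tail of the chain."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({
        "prev_hash": prev_hash,
        "payload": payload,
        "hash": event_hash(prev_hash, payload),
    })


def verify_chain(chain: list) -> Optional[int]:
    """Return the index of the first broken event, or None if the chain is intact."""
    prev_hash = GENESIS
    for index, event in enumerate(chain):
        expected = event_hash(prev_hash, event["payload"])
        if event["prev_hash"] != prev_hash or event["hash"] != expected:
            return index
        prev_hash = event["hash"]
    return None
```

Editing a payload changes its expected hash, and removing or reordering events leaves a `prev_hash` that no longer matches its predecessor, so `verify_chain` reports a break in each case. A bare chain like this one cannot detect removal of the most recent events unless the latest hash is also recorded somewhere else.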