FAQ

Installation

`unsterwerx: command not found` after install?

Ensure /usr/local/bin is on your PATH:

```bash
export PATH=$PATH:/usr/local/bin
```

Add that line to your ~/.zshrc or ~/.bashrc to make it permanent.

How do I upgrade?

```bash
unsterwerx upgrade
```

Or check for updates first:

```bash
unsterwerx upgrade --check
```

Documents

What file formats are supported?

Unsterwerx can parse and extract content from:

PDF: text extraction via pdf-extract and lopdf
DOCX: XML parsing of Word documents
XLSX: cell reading from Excel spreadsheets
PPTX: slide and notes extraction from PowerPoint
TXT: plain text files
CSV: comma-separated values
Markdown (.md): plain markdown files
SQL (.sql): SQL script files

Legacy formats (.doc, .xls, .ppt) are registered in the database but marked as unsupported since no parser handles these binary formats.

What happens to corrupt or unreadable files?

Files that fail parsing are marked with `error` status in the database. They are still tracked and appear in `unsterwerx status`, but they cannot be searched or reconstructed, and are excluded from diffing and classification. Common causes: encrypted PDFs, corrupt headers, truncated files. Image-only scanned documents get their own `image_only` status.

How are duplicates detected?

Unsterwerx uses a two-stage approach:

1. Exact duplicates: detected during ingestion by SHA-256 content hash. Identical files are skipped.
2. Near-duplicates: detected by MinHash + LSH similarity analysis. Documents with similar text are identified as candidate pairs with Jaccard similarity scores.
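The two stages above can be sketched in a few lines of Python. This is an illustrative sketch only, not Unsterwerx's actual implementation: the helper names (`content_hash`, `shingles`, `minhash_signature`, `estimated_jaccard`), the 3-word shingle size, and the 128 hash functions are all assumptions, and the LSH bucketing step that narrows down candidate pairs is left out for brevity.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Stage 1: SHA-256 digest used as the exact-duplicate key."""
    return hashlib.sha256(data).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Split text into overlapping k-word shingles (k=3 is an assumption)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items: set, num_perm: int = 128) -> list:
    """Stage 2: keep the minimum hash value under each of num_perm seeded hashes."""
    if not items:
        raise ValueError("cannot compute a MinHash signature of an empty set")
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), "big"
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a: set, b: set) -> float:
    """Reference value: |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    seen = {}  # content hash -> first file name seen with that content
    files = {"a.txt": b"quarterly report draft", "copy-of-a.txt": b"quarterly report draft"}
    for name, data in files.items():
        digest = content_hash(data)
        if digest in seen:
            print(f"{name}: exact duplicate of {seen[digest]}, skipped")
        else:
            seen[digest] = name
```

The signature comparison is what makes near-duplicate detection cheap: two fixed-length lists are compared instead of two full documents, at the cost of a small estimation error that shrinks as the number of hash functions grows.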

What does "canonical" mean?

Canonical content is the normalized markdown representation of a document. Regardless of the original format, the canonical version preserves structural elements (headings, body text, lists, tables, code blocks, page breaks) in markdown form. That makes cross-format comparison and search possible.

Storage

Where is data stored?

By default in ~/.unsterwerx/. Override with --data-dir or the UNSTERWERX_DATA environment variable. See Data Storage.

How much disk space does Unsterwerx use?

The tool achieves significant compaction. In benchmarks with a 2.7 GB dataset (2,074 documents):

Canonical content: 94 MB (96.6% compaction)
Database + indexes: 234 MB
Total footprint: 332 MB (87.9% reduction vs originals)

Can I use a different database?

No. SQLite is the only supported backend. The database is a single file (unsterwerx.db) in the data directory with WAL mode enabled.

Performance

How long does ingestion take?

Depends on document count, total size, and how many PDFs are in the mix. Benchmarks:

2,074 documents (2.7 GB) processed in ~46 seconds
PDF parsing is the slowest stage (~57 seconds for 879 PDFs)
Similarity analysis takes ~4 seconds for 1,806 documents

How do I speed up canonical extraction?

Set the UNSTERWERX_CANONICAL_THREADS environment variable to increase parallel workers (default: 8):

```bash
export UNSTERWERX_CANONICAL_THREADS=16
unsterwerx similarity
```

Policies

What is the policy cascade?

Retention policies follow a hierarchy: global > organization > division > user. Each level can only tighten constraints set by the level above. See Classification Guide.
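As a rough illustration of the cascade, the sketch below resolves an effective setting by walking the hierarchy top-down. It rests on assumptions the text above does not confirm: the constraint is modeled as a hypothetical minimum retention period in days, "tighten" is taken to mean "raise that minimum", and an attempt to loosen it is silently ignored rather than rejected. The function name `effective_retention_days` is made up for this example.

```python
# Cascade order from the FAQ: each level sits below the one before it.
LEVELS = ["global", "organization", "division", "user"]


def effective_retention_days(policies: dict) -> int:
    """Resolve the effective minimum retention (hypothetical constraint).

    `policies` maps a level name to the minimum retention it requests.
    A lower level may only tighten (raise) the value inherited from above.
    """
    effective = policies["global"]  # the top level must always be defined
    for level in LEVELS[1:]:
        requested = policies.get(level)
        if requested is None:
            continue  # level sets nothing, so it inherits unchanged
        # Assumption: a looser request is clamped to the inherited value.
        effective = max(effective, requested)
    return effective


if __name__ == "__main__":
    policies = {"global": 365, "organization": 730, "user": 30}
    # The user-level request of 30 days is looser than what it inherits,
    # so it has no effect on the result.
    print(effective_retention_days(policies))
```

The Classification Guide remains the reference for which constraints exist and how conflicts are actually reported.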

What happens to signed documents?

Signed PDFs are detected automatically during ingest and receive special treatment:

Always treated as immutable
Always placed under legal hold
Original PDF binary preserved in CAS alongside canonical markdown
Signature timestamp extracted and recorded
Cannot be archived or deleted
Reconstruction at the signing timestamp returns the original preserved PDF

What can I do about documents that failed to parse?

Use the error recovery workflow:

1. Review: `unsterwerx status errors` lists all documents in `error` or `image_only` status with error details.
2. Retry: `unsterwerx ingest --retry-errors` re-attempts canonical extraction for error documents. Transient failures can succeed on a second pass.
3. Dismiss: `unsterwerx status dismiss <id> --reason "..."` marks a document as unrecoverable. Dismissed documents are excluded from search and downstream processing but remain in the database for audit purposes.

Trust Chain

What is the audit trail?

Every operation that modifies data is recorded in an append-only, hash-chained log. Each event links to the previous event via a cryptographic hash, forming an unbroken chain that can be verified:

```bash
unsterwerx audit --verify
```

```
Chain verified: 142 events, integrity OK
```

Can the audit trail be tampered with?

The hash chain makes tampering detectable. If any event is modified or removed, `unsterwerx audit --verify` will report a chain break. Out-of-order insertions are caught the same way. The audit trail cannot be cleared or reset.
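To make the idea concrete, here is a minimal sketch of a hash-chained log in Python. It is not the on-disk format Unsterwerx uses: the event layout, the all-zero genesis value, the JSON serialization, and the function names (`append_event`, `verify_chain`) are assumptions chosen for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first event


def event_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous event's hash together with this event's payload."""
    body = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(prev_hash.encode() + body).hexdigest()


def append_event(log: list, payload: dict) -> None:
    """Append an event that links back to the current tail of the log."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev, "payload": payload, "hash": event_hash(prev, payload)})


def verify_chain(log: list):
    """Return the index of the first broken event, or None if the chain is intact."""
    prev = GENESIS
    for i, event in enumerate(log):
        links_back = event["prev"] == prev
        hash_matches = event["hash"] == event_hash(event["prev"], event["payload"])
        if not (links_back and hash_matches):
            return i
        prev = event["hash"]
    return None


if __name__ == "__main__":
    log = []
    for action in ("ingest", "classify", "archive"):
        append_event(log, {"action": action})
    print("intact:", verify_chain(log) is None)

    log[1]["payload"]["action"] = "delete"  # tamper with a middle event
    print("first broken event:", verify_chain(log))
```

Editing, removing, or reordering an event in the middle breaks either the stored hash or the back-link of a later event. One limitation of this simplified sketch: dropping events from the tail is not detected unless the latest hash is also recorded somewhere outside the log.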
