FAQ
Installation
unsterwerx: command not found after install?
Ensure /usr/local/bin is on your PATH:
export PATH=$PATH:/usr/local/bin
Add that line to your ~/.zshrc or ~/.bashrc to make it permanent.
How do I upgrade?
unsterwerx upgrade
Or check for updates first:
unsterwerx upgrade --check
Documents
What file formats are supported?
Unsterwerx can parse and extract content from:
- PDF: text extraction via pdf-extract and lopdf
- DOCX: XML parsing of Word documents
- XLSX: cell reading from Excel spreadsheets
- PPTX: slide and notes extraction from PowerPoint
- TXT: plain text files
- CSV: comma-separated values
- Markdown (.md): plain markdown files
- SQL (.sql): SQL script files
Legacy formats (.doc, .xls, .ppt) are registered in the database but marked as unsupported since no parser handles these binary formats.
What happens to corrupt or unreadable files?
Files that fail parsing are marked with error status in the database. They are still tracked and appear in unsterwerx status, but they cannot be searched or reconstructed, and are excluded from diffing and classification. Common causes: encrypted PDFs, corrupt headers, truncated files. Image-only scanned documents get their own image_only status.
How are duplicates detected?
Unsterwerx uses a two-stage approach:
- Exact duplicates: detected during ingestion by SHA-256 content hash. Identical files are skipped.
- Near-duplicates: detected by MinHash + LSH similarity analysis. Documents with similar text are identified as candidate pairs with Jaccard similarity scores.
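The two stages can be illustrated with a short Python sketch. It is not Unsterwerx's implementation: the function names, the 3-word shingle size, the 128-hash signature length, and the band counts are all assumptions chosen for illustration, and the tool's actual parameters may differ.

```python
import hashlib
import random
from collections import defaultdict
from itertools import combinations


def content_hash(data: bytes) -> str:
    """Stage 1: SHA-256 hex digest; equal digests mean byte-identical content."""
    return hashlib.sha256(data).hexdigest()


def shingles(text: str, k: int = 3) -> set[str]:
    """Split text into overlapping k-word shingles (k=3 is an arbitrary choice)."""
    words = text.lower().split()
    if len(words) <= k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def minhash_signature(items: set[str], num_perm: int = 128, seed: int = 1) -> list[int]:
    """Stage 2a: keep the minimum of each of num_perm salted 64-bit hashes."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64).to_bytes(8, "big") for _ in range(num_perm)]
    return [
        min(
            (int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(), "big")
             for s in items),
            default=0,
        )
        for salt in salts
    ]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures: dict[str, list[int]], bands: int = 32) -> set[tuple[str, str]]:
    """Stage 2b: documents sharing any identical band become a candidate pair."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(tuple(sorted(p)) for p in combinations(ids, 2))
    return pairs


# Hypothetical documents: "a" and "b" differ by a single word.
docs = {
    "a": "the quarterly report shows revenue growth across all regions in the third quarter",
    "b": "the quarterly report shows revenue growth across all regions in the fourth quarter",
}
signatures = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
for left, right in sorted(lsh_candidate_pairs(signatures, bands=64)):
    print(left, right, round(estimated_jaccard(signatures[left], signatures[right]), 2))
```

More bands with fewer rows each make the LSH step more permissive, so lower-similarity pairs surface as candidates. Fewer bands make it stricter.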
What does "canonical" mean?
Canonical content is the normalized markdown representation of a document. Regardless of the original format, the canonical version preserves structural elements (headings, body text, lists, tables, code blocks, page breaks) in markdown form. That makes cross-format comparison and search possible.
Storage
Where is data stored?
By default in ~/.unsterwerx/. Override with --data-dir or the UNSTERWERX_DATA environment variable. See Data Storage.
How much disk space does Unsterwerx use?
The tool achieves significant compaction. In benchmarks with a 2.7 GB dataset (2,074 documents):
- Canonical content: 94 MB (96.6% compaction)
- Database + indexes: 234 MB
- Total footprint: 332 MB (87.9% reduction vs originals)
Can I use a different database?
No. SQLite is the only supported backend. The database is a single file (unsterwerx.db) in the data directory with WAL mode enabled.
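For readers unfamiliar with WAL mode, the following Python sketch shows how it is enabled and inspected on a SQLite file using the standard sqlite3 module. It works on a scratch database in a temporary directory. The helper names are made up for this example, and nothing here reflects how Unsterwerx itself opens unsterwerx.db.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_wal_database(db_path: Path) -> sqlite3.Connection:
    """Open (or create) a SQLite file and switch it to write-ahead logging."""
    conn = sqlite3.connect(db_path)
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode.lower() != "wal":
        conn.close()
        raise RuntimeError(f"could not enable WAL mode, got {mode!r}")
    return conn


def journal_mode(db_path: Path) -> str:
    """Report the journal mode recorded in an existing SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute("PRAGMA journal_mode").fetchone()[0].lower()
    finally:
        conn.close()


# Demonstrate on a scratch file rather than a real data directory.
with tempfile.TemporaryDirectory() as tmp:
    scratch = Path(tmp) / "scratch.db"
    conn = open_wal_database(scratch)
    conn.execute("CREATE TABLE example (id INTEGER PRIMARY KEY)")
    conn.commit()
    conn.close()
    print(journal_mode(scratch))
```

WAL mode is stored in the database file, so later connections see the same mode without setting it again.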
Performance
How long does ingestion take?
Depends on document count, total size, and how many PDFs are in the mix. Benchmarks:
- 2,074 documents (2.7 GB) processed in ~46 seconds
- PDF parsing is the slowest stage (~57 seconds for 879 PDFs)
- Similarity analysis takes ~4 seconds for 1,806 documents
How do I speed up canonical extraction?
Set the UNSTERWERX_CANONICAL_THREADS environment variable to increase parallel workers (default: 8):
export UNSTERWERX_CANONICAL_THREADS=16
unsterwerx similarity
Policies
What is the policy cascade?
Retention policies follow a hierarchy: global > organization > division > user. Each level can only tighten constraints set by the level above. See Classification Guide.
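A "tighten only" cascade can be sketched in Python as follows. The field names (min_retention_days, allow_delete), the Policy and Override classes, and the choice to silently clamp a loosening override instead of rejecting it are all assumptions made for illustration. The actual policy schema and behavior are described in the Classification Guide.

```python
from dataclasses import dataclass
from typing import Optional

# Order taken from the hierarchy above: each level refines the one before it.
LEVELS = ("global", "organization", "division", "user")


@dataclass(frozen=True)
class Policy:
    min_retention_days: int  # hypothetical field: documents kept at least this long
    allow_delete: bool       # hypothetical field: whether deletion is permitted


@dataclass(frozen=True)
class Override:
    min_retention_days: Optional[int] = None
    allow_delete: Optional[bool] = None


def tighten(parent: Policy, override: Override) -> Policy:
    """Apply an override, keeping whichever value is stricter."""
    days = parent.min_retention_days
    if override.min_retention_days is not None:
        days = max(days, override.min_retention_days)  # retention can only grow
    allow = parent.allow_delete
    if override.allow_delete is not None:
        allow = allow and override.allow_delete        # deletion can only be revoked
    return Policy(days, allow)


def resolve(global_policy: Policy, overrides: dict[str, Override]) -> Policy:
    """Walk the hierarchy top-down and return the effective policy."""
    effective = global_policy
    for level in LEVELS[1:]:
        if level in overrides:
            effective = tighten(effective, overrides[level])
    return effective


effective = resolve(
    Policy(min_retention_days=365, allow_delete=True),
    {
        "organization": Override(min_retention_days=730),
        # The user asks for a shorter retention, which the cascade ignores.
        "user": Override(min_retention_days=30, allow_delete=False),
    },
)
print(effective)
```

In this sketch the user-level request for 30 days has no effect, because the organization already requires 730, while the stricter allow_delete=False is accepted.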
What happens to signed documents?
Signed PDFs are detected automatically during ingest and receive special treatment:
- Always treated as immutable
- Always placed under legal hold
- Original PDF binary preserved in CAS alongside canonical markdown
- Signature timestamp extracted and recorded
- Cannot be archived or deleted
- Reconstruction at the signing timestamp returns the original preserved PDF
What can I do about documents that failed to parse?
Use the error recovery workflow:
- Review: unsterwerx status errors lists all documents in error or image_only status with error details.
- Retry: unsterwerx ingest --retry-errors re-attempts canonical extraction for error documents. Transient failures can succeed on a second pass.
- Dismiss: unsterwerx status dismiss <id> --reason "..." marks a document as unrecoverable. Dismissed documents are excluded from search and downstream processing but remain in the database for audit purposes.
Trust Chain
What is the audit trail?
Every operation that modifies data is recorded in an append-only, hash-chained log. Each event links to the previous event via a cryptographic hash, forming an unbroken chain that can be verified:
unsterwerx audit --verify
Chain verified: 142 events, integrity OK
Can the audit trail be tampered with?
The hash chain makes tampering detectable. If any event is modified or removed, audit --verify will report a chain break. Out-of-order insertions are caught the same way. The audit trail cannot be cleared or reset.
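The idea behind a hash-chained log can be shown with a minimal Python sketch. The event layout (prev_hash, payload, hash), the all-zero genesis value, and the use of SHA-256 over canonical JSON are assumptions for illustration only, not the format Unsterwerx writes.

```python
import hashlib
import json
from typing import Optional

GENESIS = "0" * 64  # assumed placeholder for "no previous event"


def event_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous event's hash together with this event's payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_event(log: list[dict], payload: dict) -> None:
    """Append an event that links back to the current tail of the log."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"prev_hash": prev, "payload": payload, "hash": event_hash(prev, payload)})


def verify_chain(log: list[dict]) -> Optional[int]:
    """Return the index of the first broken event, or None if the chain is intact."""
    prev = GENESIS
    for index, event in enumerate(log):
        if event["prev_hash"] != prev or event["hash"] != event_hash(prev, event["payload"]):
            return index
        prev = event["hash"]
    return None


log: list[dict] = []
append_event(log, {"op": "ingest", "doc": 1})
append_event(log, {"op": "classify", "doc": 1})
append_event(log, {"op": "archive", "doc": 1})
print(verify_chain(log))

log[1]["payload"]["op"] = "delete"  # tamper with a recorded event
print(verify_chain(log))
```

Because each hash covers the previous one, editing, reordering, or removing an event in the middle breaks every link after it. This sketch alone would not notice events trimmed from the end of the log; detecting that requires the latest hash to be recorded somewhere outside the log.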