ingest
Scans a directory tree for supported document files, computes SHA-256 content hashes, and registers new documents in the database. Duplicate files (by content hash) are automatically skipped.
Supported formats: PDF, DOCX, XLSX, PPTX, TXT, CSV, Markdown (.md), SQL (.sql), and RTF. Legacy DOC, XLS, and PPT files are registered but marked as unsupported.
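The skip-by-content-hash behaviour can be pictured with a minimal Python sketch. It is illustrative only and is not the tool's implementation: `register_new_documents` and the in-memory `known_hashes` set (standing in for the database) are hypothetical names, and the sketch reads each file whole for brevity.

```python
import hashlib
from pathlib import Path


def register_new_documents(root, known_hashes):
    """Hash every file under `root` and register the ones not seen before.

    `known_hashes` is a set of hex digests standing in for the database
    (an assumption for this sketch); it is updated in place.
    Returns (registered, skipped_duplicates).
    """
    registered = []
    skipped = 0
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        # Whole-file read keeps the sketch short.
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in known_hashes:
            skipped += 1  # same content already registered
            continue
        known_hashes.add(digest)
        registered.append((str(path), digest))
    return registered, skipped
```

Running the function twice over the same tree registers nothing the second time, because every digest is already in `known_hashes`.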
Usage
unsterwerx ingest [OPTIONS] [SOURCE]
Arguments
| Argument | Required | Description |
|---|---|---|
| SOURCE | Usually | Source directory or file to ingest. Omit it only with --retry-errors or --resume. |
Options
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --dry-run | | flag | | Show what would be ingested without writing to the database |
| --extension | -e | string | all supported | Only process files with this extension |
| --max-size | | string | 500MB | Maximum scan/discovery file size (e.g., 100MB) |
| --max-size-file | | string | 100MB | Maximum parse-stage file size for in-memory parsers (e.g., 100MB) |
| --follow-symlinks | | flag | | Follow symbolic links during directory traversal |
| --include-hidden | | flag | | Include hidden files (starting with .) |
| --scope | | string | | Scope path for ingested documents (e.g., acme/sales/alice) |
| --retry-errors | | flag | | Re-attempt canonical extraction for documents in error or unsupported status |
| --background | | flag | | Run in background (returns immediately with job ID) |
| --foreground | | flag | | Force foreground execution (overrides background_default config) |
| --resume | | string | | Resume a stale/stopped/failed run (replays the stored execution spec) |
| --capture-metadata | | flag | config-driven | Enable rich metadata extraction for this ingest run |
| --metadata-extractor | | string | config-driven | Run only the named metadata extractor; repeat for multiple extractors |
| --json | | flag | | Output as JSON |
Examples
Preview what would be ingested
unsterwerx ingest --dry-run /path/to/documents
Dry Run
══════════════════════════════════
Files found: 2873
Already ingested: 0
Errors: 0
Candidates (new): 2873
══════════════════════════════════
Ingest only PDF files
unsterwerx ingest --dry-run -e pdf /path/to/documents
Dry Run
══════════════════════════════════
Files found: 1184
Already ingested: 0
Errors: 0
Candidates (new): 1184
══════════════════════════════════
Limit file size
unsterwerx ingest --dry-run --max-size 10MB /path/to/documents
Dry Run
══════════════════════════════════
Files found: 2816
Already ingested: 0
Errors: 0
Candidates (new): 2816
══════════════════════════════════
Raise parse-stage PDF limit
unsterwerx ingest --max-size 1GB --max-size-file 1GB /path/to/documents
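The two limits act at different stages: --max-size filters files during scan/discovery, and --max-size-file guards the in-memory parsers. The Python sketch below is one way to picture that ordering. It rests on several assumptions: `parse_size`, `classify`, and the outcome labels are hypothetical names, sizes are treated as binary multiples (1 MB = 1024 * 1024 bytes), and a file exactly at a limit is allowed. What the tool does with a file that passes the scan limit but exceeds the parse limit is not specified here, so the sketch only labels it.

```python
import re

# Assumption: binary multiples. The tool may use decimal units instead.
_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}


def parse_size(text):
    """Convert a size string such as '100MB' or '1GB' to bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*([KMG]?B)\s*", text, re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2).upper()]


def classify(size_bytes, max_size="500MB", max_size_file="100MB"):
    """Apply the scan limit first, then the parse-stage limit."""
    if size_bytes > parse_size(max_size):
        return "skipped_at_scan"
    if size_bytes > parse_size(max_size_file):
        return "over_parse_limit"
    return "parse"
```

With the default limits, a 200 MB file passes the scan stage but is over the parse-stage limit; raising both limits to 1GB, as in the command above, lets the same file through to the parser.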
Include hidden files and follow symlinks
unsterwerx ingest --dry-run --follow-symlinks --include-hidden /path/to/documents
Dry Run
══════════════════════════════════
Files found: 2879
Already ingested: 0
Errors: 0
Candidates (new): 2879
══════════════════════════════════
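The discovery filters (-e, --include-hidden, --follow-symlinks) can be approximated with a short Python sketch. It is a sketch under stated assumptions and does not show the tool's actual traversal: `discover` is a hypothetical function, hidden directories are pruned along with hidden files, and symlinked files are skipped unless symlinks are followed.

```python
import os


def discover(root, extension=None, include_hidden=False, follow_symlinks=False):
    """Return a sorted list of candidate file paths under `root`."""
    wanted = "." + extension.lower().lstrip(".") if extension else None
    found = []
    for dirpath, dirnames, filenames in os.walk(root, followlinks=follow_symlinks):
        if not include_hidden:
            # Prune in place so os.walk does not descend into hidden directories.
            dirnames[:] = [d for d in dirnames if not d.startswith(".")]
            filenames = [f for f in filenames if not f.startswith(".")]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not follow_symlinks and os.path.islink(path):
                continue  # assumption: symlinked files are ignored by default
            if wanted and not name.lower().endswith(wanted):
                continue
            found.append(path)
    return sorted(found)
```

Calling `discover(root, include_hidden=True, follow_symlinks=True)` mirrors the flag combination in the example above; each enabled option can only widen the result, which is consistent with the higher file count.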
Ingest with scope assignment
unsterwerx ingest --scope acme/sales /path/to/documents
All ingested documents are assigned to the acme/sales division scope. Scoped documents receive only policies and classification rules applicable to their scope chain.
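As a rough illustration of what a scope chain might look like, the Python sketch below assumes the chain is the scope path plus each of its ancestors. That interpretation, and the names `scope_chain` and `applies_to`, are assumptions made for this example rather than documented behaviour.

```python
def scope_chain(scope):
    """Expand 'acme/sales/alice' into ['acme', 'acme/sales', 'acme/sales/alice']."""
    parts = [p for p in scope.strip("/").split("/") if p]
    return ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def applies_to(rule_scope, document_scope):
    """Assumed rule: a policy applies when its scope lies on the document's chain."""
    return rule_scope in scope_chain(document_scope)
```

Under this reading, a policy scoped to acme would reach a document ingested with --scope acme/sales, while a policy scoped to acme/hr would not.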
Capture embedded metadata during ingest
unsterwerx ingest --capture-metadata /path/to/documents
--capture-metadata runs metadata extraction as part of the same foreground or background ingest run. Canonical document content still goes through the format-specific NAC into the Universal Data Set; metadata capture records native properties such as PDF producer, OOXML author, creation/modification timestamps, page counts, and origin software as control-plane provenance in the Shared Sandbox.
The default extractor set comes from [metadata].extractors in config.toml:
| Extractor | Applies to | Typical metadata |
|---|---|---|
| builtin_pdf | PDF | producer, creator, creation/modification dates, page count |
| builtin_ooxml | DOCX, XLSX, PPTX | author, title, subject, last modified by, application, timestamps |
| builtin_image | PNG, JPEG | image dimensions and image-format properties when image files are explicitly selected or imported |
To run only selected extractors, repeat --metadata-extractor:
unsterwerx ingest --capture-metadata \
--metadata-extractor builtin_pdf \
--metadata-extractor builtin_ooxml \
/path/to/documents
Specifying --metadata-extractor implies metadata capture, so this is equivalent:
unsterwerx ingest --metadata-extractor builtin_pdf /path/to/documents
Unknown extractor names fail before any ingest run is created, which keeps operator mistakes out of the audit trail. After capture, inspect the results with unsterwerx metadata keys, unsterwerx metadata values, unsterwerx metadata show <document-id>, or metadata-aware search filters.
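The interaction between the config defaults and the two metadata flags can be summarised in a small Python sketch. It is illustrative only: `resolve_extractors` is a hypothetical function, the `config` dict is assumed to mirror the [metadata] table in config.toml, and the error message is invented for the example.

```python
KNOWN_EXTRACTORS = {"builtin_pdf", "builtin_ooxml", "builtin_image"}


def resolve_extractors(config, capture_flag=False, extractor_flags=()):
    """Return the extractor names to run, or [] when capture is off.

    `config` stands in for the [metadata] table, for example
    {"capture_enabled": False, "extractors": ["builtin_pdf", "builtin_ooxml"]}.
    """
    # Validate first, so a typo fails before any run would be created.
    unknown = sorted(set(extractor_flags) - KNOWN_EXTRACTORS)
    if unknown:
        raise ValueError("unknown metadata extractor(s): " + ", ".join(unknown))
    if extractor_flags:
        # Naming an extractor implies capture and narrows the set.
        return list(dict.fromkeys(extractor_flags))
    if capture_flag or config.get("capture_enabled", False):
        return list(config.get("extractors", []))
    return []
```

Validating before doing anything else mirrors the fail-early behaviour described above: a misspelled extractor name raises immediately instead of producing a partially configured run.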
Backfill metadata after ingest
unsterwerx metadata extract --file-type pdf
If documents were ingested before metadata capture was enabled, do not re-ingest them just to collect metadata. Use metadata extract to run the same extractors against existing file-backed documents and materialize semantic facts for search, Business Intelligence rules, and User Intelligence policy decisions.
Retry failed documents
unsterwerx ingest --retry-errors
Re-attempts canonical extraction for documents stuck in error or unsupported status. No source path is needed since the command works from the database. Run unsterwerx status errors first to review which documents are eligible.
Full ingest
unsterwerx ingest /path/to/documents
Notes
- Files are hashed with streaming SHA-256 (8 KB buffer), so large files are never loaded fully into memory during the hash phase.
- Duplicate files (same content hash) are automatically skipped and counted as duplicates.
- --max-size controls what gets scanned; --max-size-file controls the parser in-memory guard for formats like PDF.
- If both are set and --max-size is lower, scan filtering happens first.
- Legacy formats (.doc, .xls, .ppt) are registered in the database but marked as unsupported since no parser is available.
- The --scope flag assigns all ingested documents to a scope (e.g., acme/sales). Scope assignment is one-way. Once set, it cannot be changed to a different value.
- --retry-errors cannot be combined with a source path or scan flags. It operates only on existing error and unsupported documents.
- --capture-metadata enables extraction for this run even when [metadata].capture_enabled = false.
- --metadata-extractor is repeatable and also enables capture. Valid values are builtin_pdf, builtin_ooxml, and builtin_image.
- --dry-run previews scan/discovery counts only. It does not write documents, metadata extraction rows, semantic facts, or audit events.
- --resume replays the stored execution spec for the original run. Add metadata flags to the original ingest run, not to the resume command.
- --retry-errors re-attempts canonical extraction; use unsterwerx metadata extract for metadata backfill.
- After ingestion, run unsterwerx similarity to find duplicates and near-duplicates.
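The streaming hash mentioned in the first note can be sketched in Python as follows. This is a minimal illustration of the technique (fixed-size chunked reads feeding an incremental SHA-256) and does not reproduce the tool's code; `sha256_stream` is a hypothetical name.

```python
import hashlib


def sha256_stream(path, buffer_size=8 * 1024):
    """Hash a file in 8 KB chunks so memory use stays constant."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        for chunk in iter(lambda: handle.read(buffer_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the digest is updated incrementally, the result is identical to hashing the whole file at once, while only one buffer's worth of data is held in memory at a time.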