ingest

Scans a source directory tree (or a single file) for supported document files, computes a SHA-256 content hash for each, and registers new documents in the database. Files whose content hash is already registered are skipped as duplicates.

Supported formats: PDF, DOCX, XLSX, PPTX, TXT, CSV, Markdown (.md), SQL (.sql), RTF, and legacy DOC/XLS/PPT (registered but marked as unsupported).
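The format list above can be read as a three-way classification by file extension. The following is only a minimal sketch of that reading, not the tool's internal table: the `classify` function, the `"ignored"` outcome for unlisted extensions, and the case-insensitive match are all assumptions made for illustration.

```python
from pathlib import Path

# Extensions taken from the "Supported formats" list; the grouping is illustrative.
PARSED = {".pdf", ".docx", ".xlsx", ".pptx", ".txt", ".csv", ".md", ".sql", ".rtf"}
LEGACY = {".doc", ".xls", ".ppt"}  # registered, but marked as unsupported


def classify(path: str) -> str:
    """Return 'supported', 'unsupported' (legacy, registered only), or 'ignored'."""
    ext = Path(path).suffix.lower()  # assumption: matching ignores case
    if ext in PARSED:
        return "supported"
    if ext in LEGACY:
        return "unsupported"
    return "ignored"  # assumption: unlisted extensions are not registered at all
```

For example, `classify("reports/q3.docx")` yields `"supported"`, while `classify("archive/old.doc")` yields `"unsupported"`.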

Usage

bash

unsterwerx ingest [OPTIONS] [SOURCE]

Arguments

Argument	Required	Description
`SOURCE`	Conditional	Source directory or file to ingest. Required unless `--retry-errors` or `--resume` is given.
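The rule for when `SOURCE` may be omitted can be expressed as a small validation step. This is a hedged sketch of the documented behavior only; `check_source` and its parameters are hypothetical names, and the sketch makes no claim about how `--resume` behaves when a source path is also supplied.

```python
from typing import Optional


def check_source(source: Optional[str], retry_errors: bool = False,
                 resume: Optional[str] = None) -> None:
    """Raise ValueError if the SOURCE argument is inconsistent with the flags."""
    if retry_errors and source is not None:
        # Documented: --retry-errors works from the database, not from a path.
        raise ValueError("--retry-errors cannot be combined with a source path")
    if source is None and not (retry_errors or resume):
        raise ValueError("SOURCE is required unless --retry-errors or --resume is given")
```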

Options

Option	Short	Type	Default	Description
`--dry-run`		flag		Show what would be ingested without writing to the database
`--extension`	`-e`	string	all supported	Only process files with this extension
`--max-size`		string	500MB	Maximum scan/discovery file size (e.g., `100MB`)
`--max-size-file`		string	100MB	Maximum parse-stage file size for in-memory parsers (e.g., `100MB`)
`--follow-symlinks`		flag		Follow symbolic links during directory traversal
`--include-hidden`		flag		Include hidden files (starting with `.`)
`--scope`		string		Scope path for ingested documents (e.g., `acme/sales/alice`)
`--retry-errors`		flag		Re-attempt canonical extraction for documents in `error` or `unsupported` status
`--background`		flag		Run in background (returns immediately with job ID)
`--foreground`		flag		Force foreground execution (overrides `background_default` config)
`--resume`		string		Resume a stale/stopped/failed run (replays the stored execution spec)
`--capture-metadata`		flag	config-driven	Enable rich metadata extraction for this ingest run
`--metadata-extractor`		string	config-driven	Run only the named metadata extractor; repeat for multiple extractors
`--json`		flag		Output as JSON

Examples

Preview what would be ingested

bash

unsterwerx ingest --dry-run /path/to/documents

Dry Run
══════════════════════════════════
  Files found:          2873
  Already ingested:        0
  Errors:                  0
  Candidates (new):     2873
══════════════════════════════════

Ingest only PDF files

bash

unsterwerx ingest --dry-run -e pdf /path/to/documents

Dry Run
══════════════════════════════════
  Files found:          1184
  Already ingested:        0
  Errors:                  0
  Candidates (new):     1184
══════════════════════════════════

Limit file size

bash

unsterwerx ingest --dry-run --max-size 10MB /path/to/documents

Dry Run
══════════════════════════════════
  Files found:          2816
  Already ingested:        0
  Errors:                  0
  Candidates (new):     2816
══════════════════════════════════

Raise parse-stage PDF limit

bash

unsterwerx ingest --max-size 1GB --max-size-file 1GB /path/to/documents
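To see why both limits are raised together in the command above, here is a rough model of the two stages. It is a sketch under stated assumptions: decimal units (1 MB = 1,000,000 bytes), limits treated as inclusive, and the parse-stage limit applied to every file even though the tool applies it only to in-memory parsers such as PDF. `parse_size` and `stage_for` are hypothetical helpers, not part of the CLI.

```python
import re

# Assumption: decimal units; the tool may interpret MB/GB as binary units instead.
_UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9}


def parse_size(text: str) -> int:
    """Convert a size string such as '100MB' or '1GB' to a byte count."""
    m = re.fullmatch(r"\s*(\d+)\s*(B|KB|MB|GB)\s*", text, re.IGNORECASE)
    if m is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(m.group(1)) * _UNITS[m.group(2).upper()]


def stage_for(size_bytes: int, max_size: str = "500MB",
              max_size_file: str = "100MB") -> str:
    """Report how far a file of the given size gets under the two limits."""
    if size_bytes > parse_size(max_size):
        return "skipped at scan"  # scan filtering happens first
    if size_bytes > parse_size(max_size_file):
        return "scanned, too large to parse"
    return "scanned and parsed"
```

Under the defaults, a 200 MB PDF passes the scan limit but exceeds the parse-stage guard; with both limits set to `1GB` it passes both stages.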

Include hidden files and follow symlinks

bash

unsterwerx ingest --dry-run --follow-symlinks --include-hidden /path/to/documents

Dry Run
══════════════════════════════════
  Files found:          2879
  Already ingested:        0
  Errors:                  0
  Candidates (new):     2879
══════════════════════════════════
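The effect of the two traversal switches can be approximated with Python's `os.walk`. This is an illustrative sketch rather than the tool's scanner: `walk_files` is a hypothetical name, and it assumes that hidden directories are pruned along with hidden files and that symlinked files are skipped unless links are followed.

```python
import os
from typing import Iterator


def walk_files(root: str, follow_symlinks: bool = False,
               include_hidden: bool = False) -> Iterator[str]:
    """Yield file paths under root, honoring the two traversal switches."""
    for dirpath, dirnames, filenames in os.walk(root, followlinks=follow_symlinks):
        if not include_hidden:
            # Prune in place so os.walk does not descend into hidden directories.
            dirnames[:] = [d for d in dirnames if not d.startswith(".")]
            filenames = [f for f in filenames if not f.startswith(".")]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not follow_symlinks and os.path.islink(path):
                continue  # assumption: symlinked files are skipped by default
            yield path
```

Enabling either switch can only add files, never remove them, which is consistent with the higher "Files found" count in the output above.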

Ingest with scope assignment

bash

unsterwerx ingest --scope acme/sales /path/to/documents

All ingested documents are assigned to the `acme/sales` division scope. Scoped documents receive only the policies and classification rules that apply to their scope chain.
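One way to picture a scope chain is as the list of path prefixes of the scope. The sketch below assumes exactly that interpretation, which this page does not spell out; `scope_chain` and `applies` are hypothetical helpers, and unscoped (global) rules are not modeled.

```python
from typing import List


def scope_chain(scope: str) -> List[str]:
    """Expand 'acme/sales/alice' into ['acme', 'acme/sales', 'acme/sales/alice']."""
    parts = [p for p in scope.split("/") if p]
    return ["/".join(parts[: i + 1]) for i in range(len(parts))]


def applies(rule_scope: str, document_scope: str) -> bool:
    """True when the rule's scope is the document's scope or one of its ancestors."""
    return rule_scope in scope_chain(document_scope)
```

Under this reading, a rule scoped to `acme` applies to a document in `acme/sales`, while a rule scoped to `acme/hr` does not.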

Capture embedded metadata during ingest

bash

unsterwerx ingest --capture-metadata /path/to/documents

`--capture-metadata` runs metadata extraction as part of the same foreground or background ingest run. Canonical document content still goes through the format-specific NAC into the Universal Data Set. Metadata capture separately records native properties, such as PDF producer, OOXML author, creation and modification timestamps, page counts, and origin software, as control-plane provenance in the Shared Sandbox.

The default extractor set comes from `[metadata].extractors` in `config.toml`:

Extractor	Applies to	Typical metadata
`builtin_pdf`	PDF	producer, creator, creation/modification dates, page count
`builtin_ooxml`	DOCX, XLSX, PPTX	author, title, subject, last modified by, application, timestamps
`builtin_image`	PNG, JPEG	image dimensions and image-format properties when image files are explicitly selected or imported

To run only selected extractors, repeat `--metadata-extractor`:

bash

unsterwerx ingest --capture-metadata \
  --metadata-extractor builtin_pdf \
  --metadata-extractor builtin_ooxml \
  /path/to/documents

Specifying `--metadata-extractor` implies metadata capture, so this is equivalent:

bash

unsterwerx ingest --metadata-extractor builtin_pdf /path/to/documents

Unknown extractor names fail before any ingest run is created, which keeps operator mistakes out of the audit trail. After capture, inspect the results with `unsterwerx metadata keys`, `unsterwerx metadata values`, `unsterwerx metadata show <document-id>`, or metadata-aware search filters.
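The way the metadata flags and configuration combine can be summarized in a few lines. This is a sketch of the documented rules, not the actual implementation: `resolve_extractors` and its parameters are hypothetical, with `config_enabled` standing in for `[metadata].capture_enabled` and `config_extractors` for `[metadata].extractors`.

```python
from typing import List, Sequence

# Valid extractor names as listed in this page.
KNOWN_EXTRACTORS = ("builtin_pdf", "builtin_ooxml", "builtin_image")


def resolve_extractors(requested: Sequence[str], capture_flag: bool,
                       config_enabled: bool,
                       config_extractors: Sequence[str]) -> List[str]:
    """Return the extractors to run ([] when capture is off); raise on unknown names."""
    unknown = [name for name in requested if name not in KNOWN_EXTRACTORS]
    if unknown:
        # Validation happens up front, before any ingest run would be created.
        raise ValueError(f"unknown metadata extractor(s): {', '.join(unknown)}")
    if requested:
        return list(requested)  # naming an extractor implies capture
    if capture_flag or config_enabled:
        return list(config_extractors)  # fall back to the configured default set
    return []
```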

Backfill metadata after ingest

bash

unsterwerx metadata extract --file-type pdf

If documents were ingested before metadata capture was enabled, do not re-ingest them just to collect metadata. Use `unsterwerx metadata extract` to run the same extractors against existing file-backed documents and materialize semantic facts for search, Business Intelligence rules, and User Intelligence policy decisions.

Retry failed documents

bash

unsterwerx ingest --retry-errors

Re-attempts canonical extraction for documents stuck in `error` or `unsupported` status. No source path is needed because the command works from the database. Run `unsterwerx status errors` first to review which documents are eligible.

Full ingest

bash

unsterwerx ingest /path/to/documents

Notes

Files are hashed with streaming SHA-256 (8 KB buffer), so large files are never loaded fully into memory during the hash phase.
Duplicate files (same content hash) are automatically skipped and counted as duplicates.
`--max-size` controls which files are scanned; `--max-size-file` is the in-memory guard applied at the parse stage for formats like PDF.
If both are set and `--max-size` is the lower value, scan filtering happens first, so files above `--max-size` never reach the parse stage.
Legacy formats (`.doc`, `.xls`, `.ppt`) are registered in the database but marked as `unsupported` since no parser is available.
The `--scope` flag assigns all ingested documents to a scope (e.g., `acme/sales`). Scope assignment is one-way: once set, it cannot be changed to a different value.
`--retry-errors` cannot be combined with a source path or scan flags. It operates only on existing documents in `error` or `unsupported` status.
`--capture-metadata` enables extraction for this run even when `[metadata].capture_enabled = false`.
`--metadata-extractor` is repeatable and also enables capture. Valid values are `builtin_pdf`, `builtin_ooxml`, and `builtin_image`.
`--dry-run` previews scan/discovery counts only. It does not write documents, metadata extraction rows, semantic facts, or audit events.
`--resume` replays the stored execution spec for the original run. Add metadata flags to the original ingest run, not to the resume command.
`--retry-errors` re-attempts canonical extraction only; use `unsterwerx metadata extract` for metadata backfill.
After ingestion, run `unsterwerx similarity` to find duplicates and near-duplicates.
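The streaming hash and duplicate skipping described in the notes above can be sketched with Python's standard `hashlib`. This is a minimal illustration, not the tool's code: `sha256_file` and `register_new` are hypothetical names, and the `known` dictionary stands in for the database of registered content hashes.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

BUFFER_SIZE = 8 * 1024  # 8 KB, the buffer size stated in the notes


def sha256_file(path: str) -> str:
    """Hash a file in fixed-size chunks so it is never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(BUFFER_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register_new(paths: Iterable[str], known: Dict[str, str]) -> Tuple[List[str], int]:
    """Add unseen content hashes to `known`; return (new paths, duplicate count)."""
    new: List[str] = []
    duplicates = 0
    for path in paths:
        content_hash = sha256_file(path)
        if content_hash in known:
            duplicates += 1  # same bytes already registered: skip and count
        else:
            known[content_hash] = path
            new.append(path)
    return new, duplicates
```

Because the comparison is on content rather than name or location, two byte-identical files in different folders count as one new document and one duplicate.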