Unsterwerx

ingest

Scans a directory tree (or a single file) for supported document files, computes a SHA-256 content hash for each one, and registers new documents in the database. Files whose content hash is already registered are skipped automatically.

Supported formats: PDF, DOCX, XLSX, PPTX, TXT, CSV, Markdown (.md), SQL (.sql), and RTF. Legacy DOC, XLS, and PPT files are registered but marked as unsupported.
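
The content-hash deduplication described above can be illustrated with standard tools. This is a minimal sketch, not the unsterwerx implementation: `list_unique_by_hash` is a hypothetical helper, and it assumes GNU coreutils and findutils (`sha256sum`, `sort -z`, `xargs -r`) and file names without newlines or backslashes.

```bash
# Illustration only: this is NOT how unsterwerx is implemented.
# list_unique_by_hash (hypothetical) prints one path per distinct
# SHA-256 content hash found under the given directory.
list_unique_by_hash() {
  local dir="$1"
  # Hash every regular file in a stable (sorted) order, then keep only
  # the first path seen for each hash. sha256sum prints the 64-character
  # hash, two separator characters, then the path, so the path starts
  # at column 67.
  find "$dir" -type f -print0 \
    | sort -z \
    | xargs -0 -r sha256sum \
    | awk '!seen[$1]++ { print substr($0, 67) }'
}
```

Running `list_unique_by_hash /path/to/documents` lists each distinct file once; two files with identical bytes but different names appear as a single entry, which mirrors the idea of skipping duplicates by content rather than by name.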

Usage

bash
unsterwerx ingest [OPTIONS] <SOURCE>

Arguments

| Argument | Required | Description |
|----------|----------|-------------|
| SOURCE | Yes | Source directory or file to ingest |

Options

| Option | Short | Type | Default | Description |
|--------|-------|------|---------|-------------|
| --dry-run | | flag | | Show what would be ingested without writing to the database |
| --extension | -e | string | all supported | Only process files with this extension |
| --max-size | | string | 500MB | Maximum scan/discovery file size (e.g., 100MB) |
| --max-size-file | | string | 100MB | Maximum parse-stage file size for in-memory parsers (e.g., 100MB) |
| --follow-symlinks | | flag | | Follow symbolic links during directory traversal |
| --include-hidden | | flag | | Include hidden files (starting with .) |
| --scope | | string | | Scope path for ingested documents (e.g., acme/sales/alice) |
| --retry-errors | | flag | | Re-attempt canonical extraction for documents in error status |
| --background | | flag | | Run in background (returns immediately with job ID) |
| --foreground | | flag | | Force foreground execution (overrides background_default config) |
| --resume | | string | | Resume a stale/stopped/failed run (replays the stored execution spec) |
| --json | | flag | | Output as JSON |
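
Before choosing values for --max-size and --max-size-file, it can help to see which files exceed a given threshold. The sketch below uses only `find`; `files_over_limit` is a hypothetical helper, and it assumes 1 MB means 1024 × 1024 bytes, which the documentation above does not confirm for unsterwerx.

```bash
# Hypothetical helper: list regular files under DIR that are larger
# than LIMIT_MIB mebibytes. Assumes 1 MB = 1024 * 1024 bytes, which may
# differ from how unsterwerx interprets size strings.
files_over_limit() {
  local dir="$1" limit_mib="$2"
  # find rounds sizes up when given unit suffixes such as M, so the
  # comparison is done in exact bytes (suffix c) instead.
  find "$dir" -type f -size +"$(( limit_mib * 1024 * 1024 ))"c -print | sort
}
```

For example, `files_over_limit /path/to/documents 100` lists files above the default parse-stage limit shown in the table, under the unit assumption stated above.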

Examples

Preview what would be ingested

bash
unsterwerx ingest --dry-run /path/to/documents
Dry Run
══════════════════════════════════
  Files found:          2873
  Already ingested:        0
  Errors:                  0
  Candidates (new):     2873
══════════════════════════════════

Ingest only PDF files

bash
unsterwerx ingest --dry-run -e pdf /path/to/documents
Dry Run
══════════════════════════════════
  Files found:          1184
  Already ingested:        0
  Errors:                  0
  Candidates (new):     1184
══════════════════════════════════

Limit file size

bash
unsterwerx ingest --dry-run --max-size 10MB /path/to/documents
Dry Run
══════════════════════════════════
  Files found:          2816
  Already ingested:        0
  Errors:                  0
  Candidates (new):     2816
══════════════════════════════════

Raise parse-stage PDF limit

bash
unsterwerx ingest --max-size 1GB --max-size-file 1GB /path/to/documents

Follow symlinks and include hidden files

bash
unsterwerx ingest --dry-run --follow-symlinks --include-hidden /path/to/documents
Dry Run
══════════════════════════════════
  Files found:          2879
  Already ingested:        0
  Errors:                  0
  Candidates (new):     2879
══════════════════════════════════

Ingest with scope assignment

bash
unsterwerx ingest --scope acme/sales /path/to/documents

All ingested documents are assigned to the acme/sales division scope. Scoped documents receive only policies and classification rules applicable to their scope chain.
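
To make the idea of a scope chain concrete, the sketch below expands a scope path into itself and each of its ancestors. This is one plausible reading, not confirmed by the text above: `scope_chain` is a hypothetical helper, and both the ancestor-prefix interpretation and the ordering are assumptions.

```bash
# Hypothetical helper: print a scope path's ancestors and the path
# itself, one per line, from the broadest scope to the narrowest.
# The ancestor-prefix reading of "scope chain" is an assumption.
scope_chain() {
  local rest="$1" prefix="" part
  while [ -n "$rest" ]; do
    part="${rest%%/*}"            # text before the first slash
    case "$rest" in
      */*) rest="${rest#*/}" ;;   # drop the segment just consumed
      *)   rest="" ;;             # last segment reached
    esac
    [ -n "$part" ] || continue    # ignore empty segments such as "a//b"
    prefix="${prefix:+$prefix/}$part"
    printf '%s\n' "$prefix"
  done
}
```

Under this reading, `scope_chain acme/sales/alice` prints acme, acme/sales, and acme/sales/alice on separate lines.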

Retry failed documents

bash
unsterwerx ingest --retry-errors

Re-attempts canonical extraction for documents stuck in error status. No source path is needed since the command works from the database. Run unsterwerx status errors first to review which documents are eligible.
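
The review-then-retry sequence above can be wrapped in a small shell function. This is a sketch under one assumption: that unsterwerx exits with a non-zero status on failure, which the documentation above does not state. `retry_errors_after_review` is a hypothetical name; the two commands it runs are the ones given in the text.

```bash
# Hypothetical wrapper: review errored documents, then retry them.
# Assumes unsterwerx follows the usual convention of a non-zero exit
# status on failure; its exit codes are not documented here.
retry_errors_after_review() {
  # Show which documents are currently in error status.
  unsterwerx status errors || return 1
  # Re-attempt canonical extraction; no SOURCE argument is passed.
  unsterwerx ingest --retry-errors
}
```

If the status command fails, the retry is not attempted, so a misconfigured database connection is surfaced before any documents are reprocessed.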

Full ingest

bash
unsterwerx ingest /path/to/documents

Notes