Unsterwerx

Classification Guide

Unsterwerx uses a rules-based system to classify documents by type (invoices, contracts, CVs, reports, government and legal documents) and to enforce a retention policy per class. This guide covers rule creation, policy management, the cascade model, and scope assignment.

Classification Rules

Classification rules match documents using regex patterns on the filename, the content, or both. A rule that specifies both patterns can require both of them to match (see the --match-all flag below). Each rule maps to exactly one document class.
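
The matching behavior described above can be sketched in Python. This is a minimal illustration, not the Unsterwerx implementation: the Rule dataclass and rule_matches function are hypothetical names, and the sketch assumes that a rule without match_all matches when any one of its patterns matches, which the guide implies but does not state outright.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    """Hypothetical in-memory model of a classification rule."""
    name: str
    doc_class: str
    filename_pattern: Optional[str] = None
    content_pattern: Optional[str] = None
    match_all: bool = False  # mirrors the --match-all flag


def rule_matches(rule: Rule, filename: str, content: str) -> bool:
    """Return True if the rule's patterns match the given document."""
    results = []
    if rule.filename_pattern is not None:
        results.append(re.search(rule.filename_pattern, filename) is not None)
    if rule.content_pattern is not None:
        results.append(re.search(rule.content_pattern, content) is not None)
    if not results:
        return False  # a rule with no patterns matches nothing
    # Assumption: without match_all, a single matching pattern is enough.
    return all(results) if rule.match_all else any(results)


strict = Rule("strict-legal", "legal",
              filename_pattern=r"(?i)legal",
              content_pattern=r"(?i)pursuant\s+to",
              match_all=True)
loose = Rule("loose-legal", "legal",
             filename_pattern=r"(?i)legal",
             content_pattern=r"(?i)pursuant\s+to")

# The filename matches, the content does not.
print(rule_matches(strict, "Legal-Memo.pdf", "Quarterly numbers"))  # False
print(rule_matches(loose, "Legal-Memo.pdf", "Quarterly numbers"))   # True
```

Python's re module accepts the same leading (?i) inline flag used in the examples in this guide, but the regex engine inside Unsterwerx may differ in other details.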

Seed Rules

Unsterwerx ships with six seed rules created during database initialization:

| Rule | Class | Filename Pattern | Content Pattern |
|---|---|---|---|
| seed-contract | contract | contract, agreement | hereby agree, terms and conditions |
| seed-cv | cv | cv, resume, curriculum | experience, education, skills |
| seed-government | government | government, official | public notice, gazette, decree |
| seed-invoice | invoice | invoice, faktura | total due, amount payable |
| seed-legal | legal | legal, law, regulation | pursuant to, article, jurisdiction |
| seed-report | report | report, analysis | executive summary, findings |

Creating Rules

Add rules with one or both pattern types:

bash
# Match by filename only
unsterwerx rules add --name "my-contracts" --class contract \
    --filename-pattern "(?i)contract"

# Match by content only
unsterwerx rules add --name "my-invoices" --class invoice \
    --content-pattern "(?i)(invoice\s+number|total\s+due)"

# Require both patterns to match
unsterwerx rules add --name "strict-legal" --class legal \
    --filename-pattern "(?i)legal" \
    --content-pattern "(?i)pursuant\s+to" \
    --match-all --priority 10

Priority

Rules with higher priority are evaluated first. When multiple rules match a document, every match is recorded along with its confidence score. The --match-all flag is independent of priority: it requires all patterns specified on a rule to match before that rule counts as a match.
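
As a rough sketch of the evaluation order described above, the following Python orders rules by priority and keeps every match rather than stopping at the first. The names here are hypothetical, and the sketch assumes that a larger --priority number means higher priority and that seed rules default to priority 0. Confidence scores are left out because the guide does not give their formula.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Rule:
    name: str
    doc_class: str
    priority: int
    matcher: Callable[[str], bool]  # hypothetical stand-in for the regex checks


def classify(rules: List[Rule], filename: str) -> List[Tuple[str, str]]:
    """Evaluate rules from highest to lowest priority and keep every match."""
    ordered = sorted(rules, key=lambda r: r.priority, reverse=True)
    return [(r.name, r.doc_class) for r in ordered if r.matcher(filename)]


rules = [
    Rule("seed-contract", "contract", 0, lambda f: "contract" in f.lower()),
    Rule("strict-legal", "legal", 10, lambda f: "legal" in f.lower()),
    Rule("seed-invoice", "invoice", 0, lambda f: "invoice" in f.lower()),
]

matches = classify(rules, "legal-contract-2024.pdf")
print(matches)
# [('strict-legal', 'legal'), ('seed-contract', 'contract')]
```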

Confidence Scoring

Classification confidence is computed from the strength of pattern matches across the filename and content signals. For example, a document reported as cv (62%) matched the patterns of a cv rule with 62% confidence.

Retention Policies

Retention policies control what happens to documents after their retention period expires.

Policy Fields

| Field | Description |
|---|---|
| class | Document class this policy applies to |
| retention-years | Minimum years before archival action |
| immutable | Document cannot be modified |
| legal-hold | Document is frozen for legal purposes |
| action | What happens at end-of-retention: move, delete, keep |
| scope | Policy level: global, organization, division, user |

Creating Policies

bash
unsterwerx rules policy \
    --name "contract-7yr" \
    --class contract \
    --retention-years 7 \
    --immutable \
    --action move

Policy Cascade

Retention policies follow a hierarchical cascade:

global → organization → division → user

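Each level can only tighten constraints set by the level above. For example, a division policy cannot set a shorter retention period than the organization policy above it; such a policy is rejected as a cascade violation.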

Example

bash
# Global: 7-year retention, immutable
unsterwerx rules policy --name "global-contract" --class contract \
    --retention-years 7 --immutable --action move --scope global

# Organization: tightens to 10 years (valid, stricter)
unsterwerx rules policy --name "dod-contract" --class contract \
    --retention-years 10 --immutable --action keep \
    --scope organization --scope-id "DoD"

# Division: tries 5 years (rejected, looser than org)
unsterwerx rules policy --name "div-contract" --class contract \
    --retention-years 5 --action move \
    --scope division --scope-id "Engineering"
# Error: cascade violation, retention years cannot be less than parent scope
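
The tighten-only check in the example above can be modeled in a few lines of Python. This is a sketch under stated assumptions: Policy, CascadeViolation, and validate_child are hypothetical names, and only the retention-years constraint is checked because it is the only one the guide's error message confirms. Unsterwerx may enforce further constraints that are not modeled here.

```python
from dataclasses import dataclass

# Cascade order from the guide: global -> organization -> division -> user
SCOPE_ORDER = ["global", "organization", "division", "user"]


class CascadeViolation(ValueError):
    """Raised when a child policy is looser than its parent."""


@dataclass
class Policy:
    name: str
    scope: str
    retention_years: int


def validate_child(parent: Policy, child: Policy) -> None:
    """Reject a child policy that loosens the parent's retention period."""
    if SCOPE_ORDER.index(child.scope) <= SCOPE_ORDER.index(parent.scope):
        raise ValueError("child scope must sit below the parent scope")
    if child.retention_years < parent.retention_years:
        raise CascadeViolation("retention years cannot be less than parent scope")


global_policy = Policy("global-contract", "global", 7)
org_policy = Policy("dod-contract", "organization", 10)
div_policy = Policy("div-contract", "division", 5)

validate_child(global_policy, org_policy)  # passes: 10 >= 7

try:
    validate_child(org_policy, div_policy)  # 5 < 10, so this is rejected
except CascadeViolation as err:
    print(f"Error: cascade violation, {err}")
```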

Assigning Scope to Documents

Documents start with no scope (treated as global). Assign a scope to place documents under a specific organizational boundary:

bash
unsterwerx rules assign-scope a1b2c3 --scope acme/sales

Scope assignment is one-way. Once set, it cannot be changed to a different value. You can also assign scope at ingest time with --scope:

bash
unsterwerx ingest --scope acme/engineering /path/to/eng-docs
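
The one-way nature of scope assignment can be pictured as a write-once mapping. The Python below is an illustrative model only: ScopeRegistry and ScopeAlreadyAssigned are hypothetical names, and treating a repeated assignment of the same value as a harmless no-op is an assumption, since the guide only says the scope cannot be changed to a different value.

```python
from typing import Dict, Optional


class ScopeAlreadyAssigned(Exception):
    """Raised when a document already has a different scope."""


class ScopeRegistry:
    """Hypothetical write-once store mapping document IDs to scopes."""

    def __init__(self) -> None:
        self._scopes: Dict[str, str] = {}

    def scope_of(self, doc_id: str) -> Optional[str]:
        # None means no scope has been assigned (treated as global).
        return self._scopes.get(doc_id)

    def assign(self, doc_id: str, scope: str) -> None:
        current = self._scopes.get(doc_id)
        if current is not None and current != scope:
            raise ScopeAlreadyAssigned(
                f"{doc_id} is already scoped to {current!r}"
            )
        # Assumption: re-assigning the same value is allowed and changes nothing.
        self._scopes[doc_id] = scope


registry = ScopeRegistry()
registry.assign("a1b2c3", "acme/sales")

try:
    registry.assign("a1b2c3", "acme/engineering")
except ScopeAlreadyAssigned as err:
    print(f"rejected: {err}")
```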

Inspecting Effective Policy

Use rules resolve to see the cascaded effective policy for a document or to preview what a class + scope combination produces:

bash
# Resolve for a specific document
unsterwerx rules resolve --document a1b2c3

# Preview for a class + scope
unsterwerx rules resolve --class contract --scope acme/sales
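
To make the idea of a cascaded effective policy concrete, here is a small Python sketch that resolves a retention period for a scope path. It rests on several assumptions that the guide does not confirm: that a path such as acme/sales expands into the chain global, acme, acme/sales; that policies can be looked up by that path; and that the effective value is the strictest one found along the chain. All names and the sample data are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical retention policies for one document class, keyed by scope path.
# The empty string stands for the global scope.
RETENTION_YEARS: Dict[str, int] = {
    "": 7,       # global
    "acme": 10,  # organization-level policy
}


def scope_chain(scope: Optional[str]) -> List[str]:
    """Expand 'acme/sales' into ['', 'acme', 'acme/sales']; None means global."""
    chain = [""]
    if scope:
        parts = scope.split("/")
        for i in range(1, len(parts) + 1):
            chain.append("/".join(parts[:i]))
    return chain


def effective_retention(scope: Optional[str]) -> int:
    """Walk from global to the most specific scope, keeping the strictest value."""
    years = 0
    for level in scope_chain(scope):
        if level in RETENTION_YEARS:
            years = max(years, RETENTION_YEARS[level])
    return years


print(effective_retention("acme/sales"))  # 10, inherited from the acme policy
print(effective_retention(None))          # 7, the global policy
```

Because each level can only tighten its parent, taking the maximum along the chain gives the same answer as taking the most specific policy that exists.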

Signed Document Handling

Documents detected as digitally signed receive special treatment.

Signed PDFs are detected by scanning for /Sig, /ByteRange, and /SubFilter markers in the PDF binary.
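
A marker scan of this kind might look like the following Python sketch. It is a heuristic illustration rather than the Unsterwerx detector: the function names are hypothetical, and requiring all three markers to be present is an assumption, since the guide does not say whether one marker is enough.

```python
from pathlib import Path

# Byte markers named in the guide.
SIGNATURE_MARKERS = (b"/Sig", b"/ByteRange", b"/SubFilter")


def looks_signed(pdf_bytes: bytes) -> bool:
    """Heuristic: True when every signature marker appears in the raw bytes."""
    # Assumption: all three markers must be present.
    return all(marker in pdf_bytes for marker in SIGNATURE_MARKERS)


def is_signed_pdf(path: str) -> bool:
    """Read a file from disk and apply the marker heuristic."""
    return looks_signed(Path(path).read_bytes())


signed_sample = (
    b"%PDF-1.7\n<< /Type /Sig /SubFilter /adbe.pkcs7.detached "
    b"/ByteRange [0 100 200 300] >>"
)
plain_sample = b"%PDF-1.7\n<< /Type /Page >>"

print(looks_signed(signed_sample))  # True
print(looks_signed(plain_sample))   # False
```

A raw byte scan is cheap but approximate: it says nothing about whether the signature is valid, and it can miss markers stored inside compressed object streams.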

Source Hierarchy

Trust weights assigned to knowledge sources influence how conflicting information is resolved:

bash
unsterwerx rules source list
Source Hierarchy Rules
══════════════════════════════════════════════════════════════
  [seed-aca] academic        weight=5 p=0 (active)
  [seed-ai-] ai-generated    weight=1 p=0 (active)
  [seed-cur] curated         weight=2 p=0 (active)
  [seed-gov] government      weight=3 p=0 (active)
══════════════════════════════════════════════════════════════

Higher weights indicate higher trust. Academic sources (weight 5) are prioritized over AI-generated content (weight 1).
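
One way to picture weight-based conflict resolution is to pick the claim whose source type carries the highest weight. The Python below is a sketch only: the weights are copied from the listing above, while pick_most_trusted, the sample claims, and the choice to give unknown source types a weight of 0 are all hypothetical. The guide does not describe how Unsterwerx itself applies the weights.

```python
from typing import Dict, List, Tuple

# Weights taken from the `unsterwerx rules source list` output above.
SOURCE_WEIGHTS: Dict[str, int] = {
    "academic": 5,
    "ai-generated": 1,
    "curated": 2,
    "government": 3,
}


def pick_most_trusted(claims: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Return the (source_type, value) pair with the highest trust weight."""
    # Assumption: source types missing from the table get weight 0.
    return max(claims, key=lambda claim: SOURCE_WEIGHTS.get(claim[0], 0))


claims = [
    ("ai-generated", "2019"),
    ("government", "2021"),
    ("curated", "2020"),
]

print(pick_most_trusted(claims))  # ('government', '2021')
```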