Classification Guide
Unsterwerx uses a rules-based system to classify documents by type (invoices, contracts, CVs, reports, legal and government documents) and to enforce a retention policy per class. This guide covers rule creation, policy management, the cascade model, scope assignment, signed-document handling, and the source hierarchy.
Classification Rules
Classification rules match documents using regex patterns on filenames, content, or both. A rule that specifies both pattern types can require both to match. Each rule maps to one document class.
Seed Rules
Unsterwerx ships with six seed rules created during database initialization:
| Rule | Class | Filename Pattern | Content Pattern |
|---|---|---|---|
| seed-contract | contract | contract, agreement | hereby agree, terms and conditions |
| seed-cv | cv | cv, resume, curriculum | experience, education, skills |
| seed-government | government | government, official | public notice, gazette, decree |
| seed-invoice | invoice | invoice, faktura | total due, amount payable |
| seed-legal | legal | legal, law, regulation | pursuant to, article, jurisdiction |
| seed-report | report | report, analysis | executive summary, findings |
Creating Rules
Add rules with one or both pattern types:
# Match by filename only
unsterwerx rules add --name "my-contracts" --class contract \
--filename-pattern "(?i)contract"
# Match by content only
unsterwerx rules add --name "my-invoices" --class invoice \
--content-pattern "(?i)(invoice\s+number|total\s+due)"
# Require both patterns to match
unsterwerx rules add --name "strict-legal" --class legal \
--filename-pattern "(?i)legal" \
--content-pattern "(?i)pursuant\s+to" \
--match-all --priority 10
Priority
Rules with higher priority are evaluated first. When multiple rules match a document, all matches are recorded with confidence scores. The --match-all flag requires all specified patterns to match.
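The evaluation order described above can be sketched in a few lines of Python. This is an illustrative model only, not the Unsterwerx implementation: the `Rule` dataclass, `rule_matches`, and `matching_rules` are hypothetical names, and the behaviour without `--match-all` (any one specified pattern is enough) is an assumption the guide does not confirm.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    # Hypothetical in-memory shape of a rule; field names mirror the CLI flags.
    name: str
    doc_class: str
    filename_pattern: Optional[str] = None
    content_pattern: Optional[str] = None
    match_all: bool = False
    priority: int = 0


def rule_matches(rule: Rule, filename: str, content: str) -> bool:
    """Return True if the rule's specified patterns match the document."""
    results = []
    if rule.filename_pattern is not None:
        results.append(re.search(rule.filename_pattern, filename) is not None)
    if rule.content_pattern is not None:
        results.append(re.search(rule.content_pattern, content) is not None)
    if not results:
        return False  # a rule with no patterns never matches
    # match_all requires every specified pattern; otherwise any one suffices (assumed).
    return all(results) if rule.match_all else any(results)


def matching_rules(rules: list[Rule], filename: str, content: str) -> list[Rule]:
    """Evaluate rules from highest to lowest priority and keep every match."""
    ordered = sorted(rules, key=lambda r: r.priority, reverse=True)
    return [r for r in ordered if rule_matches(r, filename, content)]


rules = [
    Rule("my-contracts", "contract", filename_pattern=r"(?i)contract"),
    Rule("strict-legal", "legal", filename_pattern=r"(?i)legal",
         content_pattern=r"(?i)pursuant\s+to", match_all=True, priority=10),
]
hits = matching_rules(rules, "Legal-Contract-2024.pdf", "Pursuant to clause 4, ...")
print([r.name for r in hits])  # ['strict-legal', 'my-contracts']
```

Both rules match this file, and the priority-10 rule is reported first. Confidence scores are left out of the sketch because the guide does not specify how they are computed.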
Confidence Scoring
Classification confidence is computed from the strength of pattern matches across filename and content signals. For example, a result of cv (62%) means the document matched the cv rule's patterns with 62% confidence.
Retention Policies
Retention policies control what happens to documents after their retention period expires.
Policy Fields
| Field | Description |
|---|---|
| class | Document class this policy applies to |
| retention-years | Minimum years before archival action |
| immutable | Document cannot be modified |
| legal-hold | Document is frozen for legal purposes |
| action | What happens at end-of-retention: move, delete, keep |
| scope | Policy level: global, organization, division, user |
Creating Policies
unsterwerx rules policy \
--name "contract-7yr" \
--class contract \
--retention-years 7 \
--immutable \
--action move
Policy Cascade
Retention policies follow a hierarchical cascade:
global → organization → division → user
Each level can only tighten constraints set by the level above. For example, a division policy cannot:
- Set a shorter retention period than its organization
- Remove immutability set at a higher level
- Release a legal hold set at a higher level
- Downgrade the severity of the end-of-retention action
Example
# Global: 7-year retention, immutable
unsterwerx rules policy --name "global-contract" --class contract \
--retention-years 7 --immutable --action move --scope global
# Organization: tightens to 10 years (valid, stricter)
unsterwerx rules policy --name "dod-contract" --class contract \
--retention-years 10 --immutable --action keep \
--scope organization --scope-id "DoD"
# Division: tries 5 years (rejected, looser than org)
unsterwerx rules policy --name "div-contract" --class contract \
--retention-years 5 --action move \
--scope division --scope-id "Engineering"
# Error: cascade violation, retention years cannot be less than parent scope
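The four cascade constraints can be modelled as a parent/child check. The following is a minimal sketch, not the tool's actual validator: `Policy` and `cascade_violations` are hypothetical names, and the strictness order `delete < move < keep` is an assumption (the guide only shows that moving from `move` to `keep` is accepted as a tightening).

```python
from dataclasses import dataclass

# Assumed strictness order for end-of-retention actions (not confirmed by the guide).
ACTION_STRICTNESS = {"delete": 0, "move": 1, "keep": 2}


@dataclass
class Policy:
    # Hypothetical in-memory shape of a policy for one document class.
    retention_years: int
    immutable: bool = False
    legal_hold: bool = False
    action: str = "move"


def cascade_violations(parent: Policy, child: Policy) -> list[str]:
    """List every way the child policy loosens its parent; empty means valid."""
    problems = []
    if child.retention_years < parent.retention_years:
        problems.append("retention years cannot be less than parent scope")
    if parent.immutable and not child.immutable:
        problems.append("immutability cannot be removed")
    if parent.legal_hold and not child.legal_hold:
        problems.append("legal hold cannot be released")
    if ACTION_STRICTNESS[child.action] < ACTION_STRICTNESS[parent.action]:
        problems.append("action cannot be downgraded")
    return problems


global_policy = Policy(retention_years=7, immutable=True, action="move")
org_policy = Policy(retention_years=10, immutable=True, action="keep")
div_policy = Policy(retention_years=5, action="move")

print(cascade_violations(global_policy, org_policy))  # []
print(cascade_violations(org_policy, div_policy))     # retention, immutability, and action problems
```

Under these assumptions the division policy from the example fails on more than retention years, since it also drops `--immutable` and downgrades `keep` to `move`. The CLI may report only the first violation it encounters.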
Assigning Scope to Documents
Documents start with no scope (treated as global). Assign a scope to place documents under a specific organizational boundary:
unsterwerx rules assign-scope a1b2c3 --scope acme/sales
Scope assignment is one-way. Once set, it cannot be changed to a different value. You can also assign scope at ingest time with --scope:
unsterwerx ingest --scope acme/engineering /path/to/eng-docs
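The one-way rule amounts to write-once semantics. A small sketch, assuming that re-assigning the same value is a harmless no-op (the guide only rules out changing to a different value); `assign_scope` and `ScopeAlreadyAssigned` are hypothetical names, not part of Unsterwerx:

```python
from typing import Optional


class ScopeAlreadyAssigned(Exception):
    """Raised when a document's scope would change to a different value."""


def assign_scope(current: Optional[str], new: str) -> str:
    """Return the scope to store, enforcing write-once semantics."""
    if current is None or current == new:
        return new  # first assignment, or a repeat of the same value (assumed allowed)
    raise ScopeAlreadyAssigned(f"scope is already {current!r}; cannot change to {new!r}")


scope = assign_scope(None, "acme/sales")   # first assignment succeeds
scope = assign_scope(scope, "acme/sales")  # same value again is accepted
try:
    assign_scope(scope, "acme/engineering")
except ScopeAlreadyAssigned as err:
    print(err)
```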
Inspecting Effective Policy
Use rules resolve to see the cascaded effective policy for a document or to preview what a class + scope combination produces:
# Resolve for a specific document
unsterwerx rules resolve --document a1b2c3
# Preview for a class + scope
unsterwerx rules resolve --class contract --scope acme/sales
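Conceptually, resolving an effective policy means folding the policies found along the scope chain and keeping the strictest value of each field. The sketch below illustrates that idea for a single document class. It is not the `rules resolve` implementation: `Policy`, `scope_chain`, and `resolve` are hypothetical names, the `org/division` path format is inferred from the examples above, and the action ordering is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

ACTION_STRICTNESS = {"delete": 0, "move": 1, "keep": 2}  # assumed ordering


@dataclass(frozen=True)
class Policy:
    # Hypothetical in-memory shape of a policy for one document class.
    retention_years: int
    immutable: bool = False
    legal_hold: bool = False
    action: str = "move"


def scope_chain(scope: Optional[str]) -> list[str]:
    """Expand 'acme/sales' into ['global', 'acme', 'acme/sales']."""
    chain = ["global"]
    if scope:
        parts = scope.split("/")
        chain += ["/".join(parts[: i + 1]) for i in range(len(parts))]
    return chain


def resolve(policies: dict[str, Policy], scope: Optional[str]) -> Optional[Policy]:
    """Fold the policies along the chain, keeping the strictest value of each field."""
    effective = None
    for key in scope_chain(scope):
        level = policies.get(key)
        if level is None:
            continue  # no policy defined at this level
        if effective is None:
            effective = level
            continue
        effective = Policy(
            retention_years=max(effective.retention_years, level.retention_years),
            immutable=effective.immutable or level.immutable,
            legal_hold=effective.legal_hold or level.legal_hold,
            action=max(effective.action, level.action, key=ACTION_STRICTNESS.__getitem__),
        )
    return effective


contract_policies = {
    "global": Policy(retention_years=7, immutable=True, action="move"),
    "acme": Policy(retention_years=10, immutable=True, action="keep"),
}
print(resolve(contract_policies, "acme/sales"))
# Policy(retention_years=10, immutable=True, legal_hold=False, action='keep')
```

A document with no scope resolves against the global policy alone, which matches the statement that unscoped documents are treated as global.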
Signed Document Handling
Documents detected as digitally signed receive special treatment:
- Always treated as immutable regardless of policy
- Always placed under legal hold
- Original PDF binary is preserved in CAS alongside canonical markdown
- Signature timestamp is extracted and recorded
Signed PDFs are detected by scanning for /Sig, /ByteRange, and /SubFilter markers in the PDF binary.
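A byte-level marker scan of this kind can be sketched as follows. This is an illustration rather than the detector Unsterwerx ships: `looks_signed` is a hypothetical name, and requiring all three markers (instead of any one) is an assumption.

```python
SIGNATURE_MARKERS = (b"/Sig", b"/ByteRange", b"/SubFilter")


def looks_signed(pdf_bytes: bytes) -> bool:
    """Return True when all three signature markers occur in the raw PDF bytes."""
    # Assumption: every marker must be present; the guide does not say all vs. any.
    return all(marker in pdf_bytes for marker in SIGNATURE_MARKERS)


signed = b"%PDF-1.7 << /Type /Sig /SubFilter /adbe.pkcs7.detached /ByteRange [0 10 20 30] >>"
unsigned = b"%PDF-1.7 << /Type /Page >>"
print(looks_signed(signed), looks_signed(unsigned))  # True False
```

A plain substring scan only sees markers stored uncompressed in the file, and it does not verify the signature itself.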
Source Hierarchy
Trust weights assigned to knowledge sources influence how conflicting information is resolved:
unsterwerx rules source list
Source Hierarchy Rules
══════════════════════════════════════════════════════════════
[seed-aca] academic weight=5 p=0 (active)
[seed-ai-] ai-generated weight=1 p=0 (active)
[seed-cur] curated weight=2 p=0 (active)
[seed-gov] government weight=3 p=0 (active)
══════════════════════════════════════════════════════════════
Higher weights indicate higher trust. Academic sources (weight 5) are prioritized over AI-generated content (weight 1).
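One simple way to picture weight-based resolution is to pick the claim whose source type carries the highest weight. The sketch below uses the seed weights from the listing above. `most_trusted` is a hypothetical helper, treating unknown source types as weight 0 is an assumption, and the actual resolution logic may take more than weight into account.

```python
# Seed weights copied from the `rules source list` output above.
SOURCE_WEIGHTS = {"academic": 5, "government": 3, "curated": 2, "ai-generated": 1}


def most_trusted(claims: list[tuple[str, str]]) -> tuple[str, str]:
    """Pick the (source_type, statement) pair whose source has the highest weight."""
    # Unknown source types fall back to weight 0 (assumed).
    return max(claims, key=lambda claim: SOURCE_WEIGHTS.get(claim[0], 0))


claims = [
    ("ai-generated", "Retention period is 5 years"),
    ("government", "Retention period is 7 years"),
]
print(most_trusted(claims))  # ('government', 'Retention period is 7 years')
```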