How To Classify Documents and Set Retention Policies with Unsterwerx
Unsterwerx can automatically classify hundreds of documents by type and enforce retention policies that control how long each class of document is kept, whether it can be modified, and what happens when the retention period expires. This tutorial walks you through the full lifecycle: creating rules, setting policies, classifying your corpus, scoping documents to organizations, and previewing archival actions.
In the patent architecture, classification rules are Business Intelligence (rules of hierarchy that govern what a document is), while retention policies are User Intelligence (rules of engagement that govern what happens to it). Together they form the governance layer of the Trusted Client-Centric Application Architecture (TCA).
Prerequisites
- Unsterwerx v0.5.4 or newer installed and configured (unsterwerx config init)
- A document corpus ingested and normalized into the Universal Data Set (unsterwerx ingest + canonical extraction complete)
- Familiarity with basic Unsterwerx commands. If you followed How To Search and Compare Documents with Unsterwerx, your corpus is ready.
Step 1 - Understand the Rule System
Before creating rules, it helps to know what you start with. Unsterwerx ships with six seed rules that are created automatically when you initialize the database. These cover common document types like contracts, CVs, invoices, legal documents, government records, and reports.
Seed rules have priority 0. Any rule you create with a higher priority will be evaluated first. When multiple rules match a document, all matches are recorded with confidence scores. A single document can belong to multiple classes.
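To make the ordering concrete, here is a minimal Python sketch of priority-based evaluation order. It is an illustration only, not Unsterwerx's implementation: the Rule class and evaluation_order function are hypothetical, the rule names are borrowed from later steps of this tutorial, and how the tool breaks ties between equal priorities is an assumption (this sketch keeps the original order).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    doc_class: str
    priority: int = 0  # seed rules sit at priority 0


def evaluation_order(rules):
    """Highest priority first; equal priorities keep their input order (assumed)."""
    return sorted(rules, key=lambda r: r.priority, reverse=True)


rules = [
    Rule("seed-contract", "contract"),
    Rule("tax-documents", "tax", priority=10),
    Rule("seed-invoice", "invoice"),
    Rule("resumes", "resume", priority=8),
]
for rule in evaluation_order(rules):
    print(rule.priority, rule.name)
```

Ordering only decides which rule is evaluated first. Because all matches are recorded, a lower-priority rule that also matches still contributes a classification.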
Rules match documents using three signal types:
- Filename patterns - regex matched against the document's original filename
- Content patterns - regex matched against the extracted text
- File type filters - match by extension (e.g., pptx)
By default, patterns use OR logic: if any pattern matches, the rule fires. The --match-all flag switches to AND logic, requiring all specified patterns to match.
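The difference between the two modes can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the tool's actual matcher: the Rule class and rule_fires function are hypothetical, patterns are applied as unanchored searches, and file type filters are left out because the tutorial does not say how they combine with the two pattern types.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    name: str
    filename_pattern: Optional[str] = None
    content_pattern: Optional[str] = None
    match_all: bool = False  # mirrors the --match-all flag


def rule_fires(rule: Rule, filename: str, text: str) -> bool:
    """Evaluate the configured patterns with OR (default) or AND (match_all)."""
    signals = []
    if rule.filename_pattern is not None:
        signals.append(re.search(rule.filename_pattern, filename) is not None)
    if rule.content_pattern is not None:
        signals.append(re.search(rule.content_pattern, text) is not None)
    if not signals:
        return False  # a rule with no patterns never fires in this sketch
    return all(signals) if rule.match_all else any(signals)


any_rule = Rule("demo-any", r"(?i)(bank|statement)", r"(?i)(balance|transaction)")
all_rule = Rule("demo-all", r"(?i)(bank|statement)", r"(?i)(balance|transaction)",
                match_all=True)

# The filename matches, the content does not.
print(rule_fires(any_rule, "bank-holiday-schedule.pdf", "Office closed Monday"))
print(rule_fires(all_rule, "bank-holiday-schedule.pdf", "Office closed Monday"))
```

The same file fires the OR rule and is rejected by the AND rule, which is the precision trade-off between the two modes.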
Step 2 - Create Classification Rules
Start by creating a rule for tax documents. This rule uses a filename pattern to catch common tax form names:
unsterwerx rules add \
--name "tax-documents" \
--class tax \
--filename-pattern "(?i)(tax|w2|1099|1095)" \
--priority 10
Rule 'tax-documents' created → class 'tax'
The (?i) makes the pattern case-insensitive. Priority 10 means this rule is evaluated before all seed rules.
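Before relying on a pattern, it can help to try it against a few sample filenames. The sketch below uses Python's re module as a stand-in, which is an assumption: Unsterwerx's regex engine may differ in details, and the tutorial does not confirm that patterns are applied as unanchored searches. The sample filenames are made up for illustration.

```python
import re

# Same pattern as the tax-documents rule above.
TAX_PATTERN = re.compile(r"(?i)(tax|w2|1099|1095)")


def matches(filename: str) -> bool:
    """Unanchored, case-insensitive search (assumed matching semantics)."""
    return TAX_PATTERN.search(filename) is not None


samples = ["2020_W2.pdf", "Form-1099-INT.pdf", "syntax-guide.md", "vacation.jpg"]
for name in samples:
    print(f"{name}: {matches(name)}")
```

Under these assumptions, syntax-guide.md also matches because "syntax" contains "tax". If that kind of false positive matters in your corpus, consider adding word boundaries or separators to the pattern.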
Next, create a rule for resumes that matches on either filename or content:
unsterwerx rules add \
--name "resumes" \
--class resume \
--filename-pattern "(?i)(resume|cv|RST)" \
--content-pattern "(?i)(experience|education|skills)" \
--priority 8
Rule 'resumes' created → class 'resume'
This rule uses the default OR mode. A document named resume.pdf matches even if its content does not contain "experience" or "education." A document named report.docx that mentions "skills" and "education" also matches.
For financial documents, you want higher precision. Use --match-all to require both filename and content patterns to match:
unsterwerx rules add \
--name "financial-reports" \
--class financial \
--filename-pattern "(?i)(bank|statement|brokerage|finance)" \
--content-pattern "(?i)(balance|account|transaction)" \
--priority 9 \
--match-all
Rule 'financial-reports' created → class 'financial'
With --match-all, a file named bank-statement.pdf must also contain words like "balance" or "transaction" to be classified as financial. This reduces false positives on files that happen to have "bank" in the name but are not financial reports.
Finally, create a rule for government documents. The rule itself is global; in Step 3 you will attach an organization-scoped retention policy to its class:
unsterwerx rules add \
--name "government-docs" \
--class government \
--filename-pattern "(?i)(DoD|DD214|LIK|CPIC|CDER)" \
--priority 7
Rule 'government-docs' created → class 'government'
Verify all your rules with rules list:
unsterwerx rules list
Classification Rules
══════════════════════════════════════════════════════════════
[bf0] tax-documents → tax (active, p=10)
filename: (?i)(tax|w2|1099|1095)
mode: match-any (OR)
[e18] financial-reports → financial (active, p=9)
filename: (?i)(bank|statement|brokerage|finance)
content: (?i)(balance|account|transaction)
mode: match-all (AND)
[c00] resumes → resume (active, p=8)
filename: (?i)(resume|cv|RST)
content: (?i)(experience|education|skills)
mode: match-any (OR)
[efe] government-docs → government (active, p=7)
filename: (?i)(DoD|DD214|LIK|CPIC|CDER)
mode: match-any (OR)
[see] seed-contract → contract (active, p=0)
filename: (?i)(contract|agreement|pogodba)
content: (?i)(hereby\s+agree|party\s+of\s+the|terms\s+and\s+conditions|effective\s+date)
mode: match-any (OR)
[see] seed-invoice → invoice (active, p=0)
filename: (?i)(invoice|faktura|račun)
content: (?i)(total\s+due|amount\s+payable|invoice\s+number|payment\s+terms)
mode: match-any (OR)
...
══════════════════════════════════════════════════════════════
Rules are displayed in priority order. Your custom rules appear first, followed by the seed rules at priority 0. The three-character prefix in brackets (e.g., [bf0]) is a short ID you can use to reference the rule.
Step 3 - Set Retention Policies
Retention policies define what happens to classified documents over time. Each policy targets a document class and specifies:
- Retention period - how many years and/or days the document must be kept
- Mutability - whether the document can be modified during retention
- Legal hold - whether the document is frozen for legal or compliance purposes
- Action - what happens after retention expires: keep, move, or delete
Create a 7-year immutable retention policy for tax documents:
unsterwerx rules policy \
--name "tax-7yr" \
--class tax \
--retention-years 7 \
--immutable \
--action keep
Policy 'tax-7yr' created for class 'tax' (scope: global)
The --immutable flag means classified tax documents cannot be modified or deleted during the retention period. The keep action means they remain in place after the 7 years expire.
Note: Omit both retention fields for indefinite retention. A zero-valued retention component means the document is eligible immediately once classified; for example, --retention-days 0 --action move archives on the next archive run.
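The retention arithmetic can be sketched as a small date calculation. This is a hedged illustration, not the tool's logic: the eligible_on function is hypothetical, and starting the retention clock at the classification date is an assumption inferred from the note above (the tutorial does not state which date the tool actually uses).

```python
from datetime import date, timedelta
from typing import Optional


def eligible_on(start: date,
                years: Optional[int] = None,
                days: Optional[int] = None) -> Optional[date]:
    """Return the first date the end-of-life action could apply, or None if indefinite."""
    if years is None and days is None:
        return None  # both fields omitted: indefinite retention
    target = start
    if years:
        try:
            target = target.replace(year=target.year + years)
        except ValueError:
            # Feb 29 start with a non-leap target year: fall back to Feb 28
            target = target.replace(year=target.year + years, day=28)
    if days:
        target += timedelta(days=days)
    return target


classified = date(2026, 4, 14)
print(eligible_on(classified, years=7))  # 7-year policy
print(eligible_on(classified, days=0))   # zero-valued retention: eligible at once
print(eligible_on(classified))           # no retention fields: indefinite
```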
For financial documents, compliance requirements are stricter. Add both --immutable and --legal-hold:
unsterwerx rules policy \
--name "financial-10yr" \
--class financial \
--retention-years 10 \
--immutable \
--legal-hold \
--action keep
Policy 'financial-10yr' created for class 'financial' (scope: global)
Note: Documents under --legal-hold cannot be modified, moved, or deleted by any operation, including the archive command and the knowledge dedup pipeline. Legal hold overrides all other actions.
For government records that belong to a specific organization, create a scoped policy with a very long retention period:
unsterwerx rules policy \
--name "gov-permanent" \
--class government \
--retention-years 100 \
--immutable \
--legal-hold \
--action keep \
--scope organization \
--scope-id govwerx
Policy 'gov-permanent' created for class 'government' (scope: organization)
Scope: organization:govwerx - policies will apply only to documents in this scope chain.
List all policies to see the full picture:
unsterwerx rules policies
Retention Policies
══════════════════════════════════════════════════════════════
financial-10yr class=financial scope=global retain=10 years IMMUTABLE action=keep [LEGAL HOLD]
gov-permanent class=government scope=organization:govwerx retain=100 years IMMUTABLE action=keep [LEGAL HOLD]
tax-7yr class=tax scope=global retain=7 years IMMUTABLE action=keep
══════════════════════════════════════════════════════════════
Notice the [LEGAL HOLD] tag on the financial and government policies. These documents are locked down.
Step 4 - Classify Your Documents
With rules in place, run classification across the entire corpus:
unsterwerx classify
Classification Summary
══════════════════════════════════
Documents classified: 668
Rules applied: 969
Errors: 0
══════════════════════════════════
668 documents matched at least one rule, with 969 total rule applications. That means many documents matched multiple rules. A tax return might match both your tax-documents rule and the seed-legal rule, for example.
Inspect a specific document's classifications with --show:
unsterwerx classify --show e6d22e4d
Classifications for e6d22e4d
══════════════════════════════════════════
tax (100%) via rule 'tax-documents' at 2026-04-14
legal (62%) via rule 'seed-legal' at 2026-04-14
══════════════════════════════════════════
This document is classified as tax with 100% confidence (its filename matched the tax rule's pattern exactly) and also as legal with 62% confidence (some of its content matched the seed-legal rule's patterns). The highest-confidence class is used when resolving retention policy.
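Selecting the governing class amounts to taking the maximum by confidence. The following Python sketch uses the two classifications shown above; the governing_class function is hypothetical, and how Unsterwerx breaks ties between equal confidences is not documented here, so the tie behavior (first entry wins) is an assumption.

```python
# (class, confidence, rule) tuples taken from the --show output above
classifications = [
    ("tax", 1.00, "tax-documents"),
    ("legal", 0.62, "seed-legal"),
]


def governing_class(classifications):
    """Return the class with the highest confidence, or None if unclassified."""
    if not classifications:
        return None
    # max() returns the first entry among equal confidences (assumed tie rule)
    return max(classifications, key=lambda c: c[1])[0]


print(governing_class(classifications))
```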
Step 5 - Assign Organizational Scope to Documents
Documents start with no scope, which means they fall under global policies. To place a document under a specific organizational boundary, use rules assign-scope:
unsterwerx rules assign-scope e6d22e4d-dbfd-40be-904c-2050e6c9d5d5 \
--scope whetsel/taxes/2020
Scope 'whetsel/taxes/2020' assigned to document 'e6d22e4d-dbfd-40be-904c-2050e6c9d5d5'
Scope paths follow a hierarchical format: organization/division/user. Here, whetsel is the organization, taxes is the division, and 2020 is the user-level scope.
Warning: Scope assignment is permanent. Once you assign a scope to a document, it cannot be changed to a different value. The same scope can be re-applied, but switching to a different scope path is blocked. Plan your organizational hierarchy before assigning scopes in bulk.
You can also assign scope at ingest time using unsterwerx ingest --scope acme/engineering /path/to/docs, which is more practical for large batches.
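The scope path format and the assign-once rule can be modeled in a few lines of Python. This is a sketch under assumptions: parse_scope and assign_scope are hypothetical helpers, and the three-level maximum is inferred from the organization/division/user format rather than confirmed by the tool.

```python
LEVELS = ("organization", "division", "user")


def parse_scope(path: str) -> dict:
    """Split 'org/division/user' into named levels; shorter paths are allowed."""
    parts = path.split("/")
    if not 1 <= len(parts) <= len(LEVELS) or any(not p for p in parts):
        raise ValueError(f"invalid scope path: {path!r}")
    return dict(zip(LEVELS, parts))


def assign_scope(current, new):
    """Allow a first assignment or an identical re-assignment; block any change."""
    if current is None or current == new:
        return new
    raise ValueError(f"scope already set to {current!r}; cannot change to {new!r}")


print(parse_scope("whetsel/taxes/2020"))
scope = assign_scope(None, "whetsel/taxes/2020")   # first assignment
scope = assign_scope(scope, "whetsel/taxes/2020")  # re-applying the same scope is fine
```

Calling assign_scope(scope, "acme/engineering") at this point raises ValueError, mirroring the permanence described in the warning above.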
Step 6 - Resolve Effective Policy for a Document
The policy cascade flows from general to specific: global > organization > division > user. Each narrower scope can only tighten constraints. A division cannot set a shorter retention period than the organization above it.
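One way to picture "tighten only" is as a merge that takes the longest retention and the strictest flags along the scope chain. The Python sketch below is an assumed model, not Unsterwerx's resolver: the Policy class and cascade function are hypothetical, indefinite retention is treated as the tightest value, and end-of-life actions are left out because the tutorial does not say how they combine.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Policy:
    scope: str
    retention_years: Optional[int]  # None means indefinite retention
    immutable: bool = False
    legal_hold: bool = False


def cascade(policies: Iterable[Policy]) -> Policy:
    """Merge policies from general to specific; narrower scopes can only tighten."""
    chain = list(policies)
    if not chain:
        raise ValueError("no policies to cascade")
    retention: Optional[int] = 0
    immutable = False
    legal_hold = False
    for p in chain:
        if retention is not None:
            # Longest retention wins; indefinite (None) beats any finite value
            retention = None if p.retention_years is None else max(retention, p.retention_years)
        immutable = immutable or p.immutable
        legal_hold = legal_hold or p.legal_hold
    return Policy("effective", retention, immutable, legal_hold)


effective = cascade([
    Policy("global", 7, immutable=True),
    Policy("division", 3),  # shorter and mutable: cannot loosen the global policy
])
print(effective)
```

In this model the division's 3-year, mutable policy has no effect: the result stays at 7 years and immutable.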
To see what policy actually applies to a specific document after cascading:
unsterwerx rules resolve --document e6d22e4d-dbfd-40be-904c-2050e6c9d5d5
Effective Policy for e6d22e4d-dbfd-40be-904c-2050e6c9d5d5
Scope: whetsel/taxes/2020
══════════════════════════════════════════════════════════════
Class: tax
Retention: 7 years
Mutable: IMMUTABLE
Legal hold: no
Action: keep
Scopes: global
Rules: tax-documents, seed-legal
══════════════════════════════════════════════════════════════
The document's highest-confidence class is tax, so the tax-7yr global policy applies. The Scopes: global line tells you there are no tighter organization or division policies narrowing things further. The Rules field lists all classification rules that matched this document.
You can also preview what the cascaded policy looks like for a class without targeting a specific document. Omit --scope for the global Business Intelligence defaults, or provide a real scope path for a scoped preview:
unsterwerx rules resolve --class government
Effective Policy for class 'government' in scope 'global'
══════════════════════════════════════════════════════════════
Input policies:
gov-global scope=global retain=7 years IMMUTABLE action=keep
Cascaded result:
Retention: 7 years
Mutable: IMMUTABLE
Legal hold: no
Action: keep
Scopes: global
══════════════════════════════════════════════════════════════
For a scoped preview, pass an organization, division, or user path:
unsterwerx rules resolve --class government --scope govwerx
Effective Policy for class 'government' in scope 'govwerx'
══════════════════════════════════════════════════════════════
Input policies:
gov-permanent scope=organization:govwerx retain=100 years IMMUTABLE action=keep [LEGAL HOLD]
Cascaded result:
Retention: 100 years
Mutable: IMMUTABLE
Legal hold: YES
Action: keep
Scopes: organization
══════════════════════════════════════════════════════════════
This preview is useful when designing your policy hierarchy. You can test how global and scoped policies interact before assigning scopes to real documents.
Step 7 - Preview Archival Actions
Before the archive command moves or deletes anything, always preview with --dry-run:
unsterwerx archive --dry-run
Archive Dry Run
══════════════════════════════════
Processed: 485
Moved: 0
Deleted: 0
Skipped: 30
Retention: 455
Errors: 0
Freed: 0 B
══════════════════════════════════
Here is what each counter means:
- Processed - total documents evaluated against their retention policies
- Moved/Deleted - documents whose retention period expired and whose policy action is move or delete
- Skipped - documents without an applicable retention policy (no matching class, or no policy for that class)
- Retention - documents still within their retention period, untouched
- Freed - disk space reclaimed (zero in a dry run since nothing actually moves)
In this case, all 455 documents with policies are still within their retention windows. The 30 skipped documents have classifications but no retention policy defined for their class. No documents have expired yet, so nothing would be moved or deleted.
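In the sample run the buckets add up: 0 moved + 0 deleted + 30 skipped + 455 in retention = 485 processed. If you script around the dry run, a quick consistency check like the Python sketch below can catch surprises. The parse_counters and buckets_add_up helpers are hypothetical, and the assumption that the buckets always sum to Processed is drawn only from this one sample output.

```python
SAMPLE = """\
Archive Dry Run
Processed: 485
Moved: 0
Deleted: 0
Skipped: 30
Retention: 455
Errors: 0
Freed: 0 B
"""


def parse_counters(text: str) -> dict:
    """Collect 'Name: <integer>' lines; non-numeric values such as '0 B' are skipped."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            counters[key.strip()] = int(value)
    return counters


def buckets_add_up(c: dict) -> bool:
    """Check Processed == Moved + Deleted + Skipped + Retention (assumed invariant)."""
    return c["Processed"] == c["Moved"] + c["Deleted"] + c["Skipped"] + c["Retention"]


counters = parse_counters(SAMPLE)
print(counters)
print(buckets_add_up(counters))
```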
Note: Documents marked as immutable or under legal hold are never archived, even if their retention period has technically expired. You must explicitly release a legal hold before archival can proceed.
Step 8 - Manage Rule Lifecycle
Rules are not permanent. You can retire, reactivate, or permanently purge them.
Retiring a rule is a soft delete. The rule stops matching documents, but its definition and classification history are preserved:
unsterwerx rules remove book-templates
Rule 'book-templates' retired. 0 classification(s) removed, 0 document(s) reset to canonical.
Rule definition preserved. This removes the rule's live classifications; re-run 'classify' after reactivation to rebuild them.
Use 'rules reactivate book-templates' to re-enable, or 'rules remove book-templates --purge' to permanently delete.
Retired rules appear in a separate section of rules list, so you always know they exist.
To bring a retired rule back:
unsterwerx rules reactivate book-templates
Rule 'book-templates' reactivated. Run 'classify' to apply to all documents (including already-classified).
After reactivation, run classify again to re-apply the rule across the corpus.
To permanently delete a rule and all its classification records, use --purge:
unsterwerx rules remove book-templates --purge
Warning: Purging is irreversible. The rule definition and every classification it ever produced are permanently deleted. Use retire (the default) unless you are certain you want to erase the rule's history entirely.
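The lifecycle described in this step can be summarized as a small state machine. The Python sketch below is a model for reasoning, not the tool's code: the state names and the apply_command function are hypothetical, and allowing a purge directly from the active state is an assumption (the tutorial only shows purging after a rule has been retired).

```python
# (current state, command) -> next state
TRANSITIONS = {
    ("active", "remove"): "retired",      # soft delete, definition preserved
    ("retired", "reactivate"): "active",  # re-run classify afterwards
    ("retired", "purge"): "purged",       # irreversible
    ("active", "purge"): "purged",        # assumed to be allowed as well
}


def apply_command(state: str, command: str) -> str:
    """Return the next lifecycle state or raise if the transition is not allowed."""
    try:
        return TRANSITIONS[(state, command)]
    except KeyError:
        raise ValueError(f"cannot {command} a rule that is {state}") from None


state = "active"
for command in ("remove", "reactivate", "remove", "purge"):
    state = apply_command(state, command)
    print(command, "->", state)
```

No transition leaves the purged state, which reflects why the default retire path is the safer choice.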
Conclusion
You now have a working classification and retention system. Your documents are classified by type through pattern-matching rules, governed by retention policies that enforce how long they are kept and whether they can be modified, and scoped to organizational boundaries that control policy cascading.
Here is what you set up:
- Classification rules with filename patterns, content patterns, and match-all AND logic
- Retention policies with immutability, legal hold, and configurable end-of-life actions
- Organizational scope assignment for hierarchical policy enforcement
- Archival preview through dry-run to verify policy behavior before it takes effect
The classification and retention system feeds directly into the deduplication pipeline. When the knowledge dedup command evaluates which documents to remove, it respects legal holds and immutability flags set by your retention policies. Documents under legal hold are never removed, even if they are exact duplicates.
To continue building on this foundation, see How To Detect and Remove Duplicate Documents with Unsterwerx, which covers the Bayesian knowledge scoring and automated deduplication workflow.