Unsterwerx

How To Classify Documents and Set Retention Policies with Unsterwerx

Unsterwerx can automatically classify hundreds of documents by type and enforce retention policies that control how long each class of document is kept, whether it can be modified, and what happens when the retention period expires. This tutorial walks you through the full lifecycle: creating rules, setting policies, classifying your corpus, scoping documents to organizations, and previewing archival actions.

In the patent architecture, classification rules are Business Intelligence (rules of hierarchy that govern what a document is), while retention policies are User Intelligence (rules of engagement that govern what happens to it). Together they form the governance layer of the Trusted Client-Centric Application Architecture (TCA).

Prerequisites

Step 1 - Understand the Rule System

Before creating rules, it helps to know what you start with. Unsterwerx ships with six seed rules that are created automatically when you initialize the database. These cover common document types like contracts, CVs, invoices, legal documents, government records, and reports.

Seed rules have priority 0. Any rule you create with a higher priority will be evaluated first. When multiple rules match a document, all matches are recorded with confidence scores. A single document can belong to multiple classes.

Rules match documents using three signal types:

By default, patterns use OR logic: if any pattern matches, the rule fires. The --match-all flag switches to AND logic, requiring all specified patterns to match.
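The evaluation order and the two match modes can be sketched in a few lines of Python. This is an illustration, not Unsterwerx's implementation: the `Rule` dataclass, the `rule_fires` and `classify` helpers, the sample rules, and the use of Python's `re.search` are all assumptions. The sketch covers only filename and content patterns and leaves out confidence scoring.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rule:
    # Hypothetical structure for illustration; not Unsterwerx's actual schema.
    name: str
    doc_class: str
    priority: int = 0
    filename_pattern: Optional[str] = None
    content_pattern: Optional[str] = None
    match_all: bool = False  # False = OR (the default), True = AND


def rule_fires(rule: Rule, filename: str, content: str) -> bool:
    """Return True if the rule matches under its OR/AND mode."""
    results = []
    if rule.filename_pattern is not None:
        results.append(re.search(rule.filename_pattern, filename) is not None)
    if rule.content_pattern is not None:
        results.append(re.search(rule.content_pattern, content) is not None)
    if not results:
        return False  # assumption: a rule with no patterns never fires
    return all(results) if rule.match_all else any(results)


def classify(rules: List[Rule], filename: str, content: str) -> List[str]:
    """Evaluate rules from highest to lowest priority and keep every match."""
    ordered = sorted(rules, key=lambda r: r.priority, reverse=True)
    return [r.doc_class for r in ordered if rule_fires(r, filename, content)]


# Made-up rules: one in OR mode, one in AND mode.
rules = [
    Rule("any-mode", "memo", 0, r"(?i)memo", r"(?i)to:\s+all\s+staff"),
    Rule("all-mode", "ledger", 5, r"(?i)ledger", r"(?i)balance", match_all=True),
]

both = classify(rules, "ledger-memo.txt", "Closing balance: 10")
none = classify(rules, "ledger.txt", "nothing relevant")
```

Here `both` holds two classes, with the higher-priority rule's class first, while `none` is empty because the AND-mode rule needs its content pattern to match as well.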

Step 2 - Create Classification Rules

Start by creating a rule for tax documents. This rule uses a filename pattern to catch common tax form names:

bash
unsterwerx rules add \
    --name "tax-documents" \
    --class tax \
    --filename-pattern "(?i)(tax|w2|1099|1095)" \
    --priority 10
text
Rule 'tax-documents' created → class 'tax'

The (?i) makes the pattern case-insensitive. Priority 10 means this rule is evaluated before all seed rules.
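Before relying on a filename pattern, it can help to try it against a few sample names. The sketch below uses Python's `re` module as a stand-in for Unsterwerx's regex engine and assumes unanchored (substring) matching, which the tutorial does not state. The sample filenames and the stricter `STRICT_PATTERN` are made up for illustration.

```python
import re

TAX_PATTERN = re.compile(r"(?i)(tax|w2|1099|1095)")
# Hypothetical stricter variant using word boundaries.
STRICT_PATTERN = re.compile(r"(?i)\b(tax|w2|1099|1095)\b")

samples = [
    "2020-W2-employer.pdf",  # matches "w2" case-insensitively
    "Form-1099-INT.pdf",     # matches "1099"
    "syntax-guide.md",       # also matches: "tax" is a substring of "syntax"
    "meeting-notes.txt",     # no match
]

loose_hits = [name for name in samples if TAX_PATTERN.search(name)]
strict_hits = [name for name in samples if STRICT_PATTERN.search(name)]
```

With substring matching, `syntax-guide.md` lands in `loose_hits`. The word-boundary variant excludes it, at the cost of missing names such as `tax_return.pdf`, because `_` counts as a word character. Whether Unsterwerx's engine supports `\b` is not confirmed by this tutorial.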

Next, create a rule for resumes that matches on either filename or content:

bash
unsterwerx rules add \
    --name "resumes" \
    --class resume \
    --filename-pattern "(?i)(resume|cv|RST)" \
    --content-pattern "(?i)(experience|education|skills)" \
    --priority 8
text
Rule 'resumes' created → class 'resume'

This rule uses the default OR mode. A document named resume.pdf matches even if its content does not contain "experience" or "education." A document named report.docx that mentions "skills" and "education" also matches.

For financial documents, you want higher precision. Use --match-all to require both filename and content patterns to match:

bash
unsterwerx rules add \
    --name "financial-reports" \
    --class financial \
    --filename-pattern "(?i)(bank|statement|brokerage|finance)" \
    --content-pattern "(?i)(balance|account|transaction)" \
    --priority 9 \
    --match-all
text
Rule 'financial-reports' created → class 'financial'

With --match-all, a file named bank-statement.pdf must also contain words like "balance" or "transaction" to be classified as financial. This reduces false positives on files that happen to have "bank" in the name but are not financial reports.

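Finally, create a rule for government documents. The rule itself is not scoped; you will attach an organization-scoped retention policy to its class in Step 3: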

bash
unsterwerx rules add \
    --name "government-docs" \
    --class government \
    --filename-pattern "(?i)(DoD|DD214|LIK|CPIC|CDER)" \
    --priority 7
text
Rule 'government-docs' created → class 'government'

Verify all your rules with rules list:

bash
unsterwerx rules list
text
Classification Rules
══════════════════════════════════════════════════════════════
  [bf0] tax-documents        → tax             (active, p=10)
        filename: (?i)(tax|w2|1099|1095)
        mode:     match-any (OR)
  [e18] financial-reports    → financial       (active, p=9)
        filename: (?i)(bank|statement|brokerage|finance)
        content:  (?i)(balance|account|transaction)
        mode:     match-all (AND)
  [c00] resumes              → resume          (active, p=8)
        filename: (?i)(resume|cv|RST)
        content:  (?i)(experience|education|skills)
        mode:     match-any (OR)
  [efe] government-docs      → government      (active, p=7)
        filename: (?i)(DoD|DD214|LIK|CPIC|CDER)
        mode:     match-any (OR)
  [see] seed-contract        → contract        (active, p=0)
        filename: (?i)(contract|agreement|pogodba)
        content:  (?i)(hereby\s+agree|party\s+of\s+the|terms\s+and\s+conditions|effective\s+date)
        mode:     match-any (OR)
  [see] seed-invoice         → invoice         (active, p=0)
        filename: (?i)(invoice|faktura|račun)
        content:  (?i)(total\s+due|amount\s+payable|invoice\s+number|payment\s+terms)
        mode:     match-any (OR)
  ...
══════════════════════════════════════════════════════════════

Rules are displayed in priority order. Your custom rules appear first, followed by the seed rules at priority 0. The three-character prefix in brackets (e.g., [bf0]) is a short ID you can use to reference the rule.

Step 3 - Set Retention Policies

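Retention policies define what happens to classified documents over time. Each policy targets a document class and specifies a retention period, whether documents are immutable, whether they are under legal hold, the action to take when retention expires, and optionally the scope the policy applies to.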

Create a 7-year immutable retention policy for tax documents:

bash
unsterwerx rules policy \
    --name "tax-7yr" \
    --class tax \
    --retention-years 7 \
    --immutable \
    --action keep
text
Policy 'tax-7yr' created for class 'tax' (scope: global)

The --immutable flag means classified tax documents cannot be modified or deleted during the retention period. The keep action means they remain in place after the 7 years expire.

Note: Omit both retention fields (--retention-years and --retention-days) for indefinite retention. A retention value of zero makes the document eligible as soon as it is classified; for example, --retention-days 0 --action move archives the document on the next archive run.
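The note above implies a simple eligibility calculation, sketched here in Python. The function name, the choice of the classification date as the start of the retention window, and the approximation of a year as 365 days are assumptions made for illustration; the tutorial does not specify how Unsterwerx computes the date.

```python
from datetime import date, timedelta
from typing import Optional


def eligible_on(start: date,
                retention_years: Optional[int] = None,
                retention_days: Optional[int] = None) -> Optional[date]:
    """Return the first date a document may be acted on, or None if indefinite."""
    if retention_years is None and retention_days is None:
        return None  # both fields omitted: retain indefinitely
    # Assumption: one year is treated as 365 days.
    total_days = 365 * (retention_years or 0) + (retention_days or 0)
    return start + timedelta(days=total_days)


classified = date(2026, 4, 14)
indefinite = eligible_on(classified)                    # None
immediate = eligible_on(classified, retention_days=0)   # same day as classification
seven_years = eligible_on(classified, retention_years=7)
```

Distinguishing `None` (field omitted) from `0` (eligible immediately) is the key design point: the two must not be conflated when parsing policy options.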

For financial documents, compliance requirements are stricter. Add both --immutable and --legal-hold:

bash
unsterwerx rules policy \
    --name "financial-10yr" \
    --class financial \
    --retention-years 10 \
    --immutable \
    --legal-hold \
    --action keep
text
Policy 'financial-10yr' created for class 'financial' (scope: global)

Note: Documents under --legal-hold cannot be modified, moved, or deleted by any operation, including the archive command and the knowledge dedup pipeline. Legal hold overrides all other actions.

For government records that belong to a specific organization, create a scoped policy with a very long retention period:

bash
unsterwerx rules policy \
    --name "gov-permanent" \
    --class government \
    --retention-years 100 \
    --immutable \
    --legal-hold \
    --action keep \
    --scope organization \
    --scope-id govwerx
text
Policy 'gov-permanent' created for class 'government' (scope: organization)
  Scope: organization:govwerx - policies will apply only to documents in this scope chain.

List all policies to see the full picture:

bash
unsterwerx rules policies
text
Retention Policies
══════════════════════════════════════════════════════════════
  financial-10yr       class=financial       scope=global               retain=10 years     IMMUTABLE action=keep [LEGAL HOLD]
  gov-permanent        class=government      scope=organization:govwerx retain=100 years    IMMUTABLE action=keep [LEGAL HOLD]
  tax-7yr              class=tax             scope=global               retain=7 years      IMMUTABLE action=keep
══════════════════════════════════════════════════════════════

Notice the [LEGAL HOLD] tag on the financial and government policies. These documents are locked down.

Step 4 - Classify Your Documents

With rules in place, run classification across the entire corpus:

bash
unsterwerx classify
text
Classification Summary
══════════════════════════════════
  Documents classified:    668
  Rules applied:           969
  Errors:                    0
══════════════════════════════════

668 documents matched at least one rule, with 969 total rule applications, an average of about 1.45 matches per document. In other words, a substantial share of documents matched more than one rule. A tax return might match both your tax-documents rule and the seed-legal rule, for example.

Inspect a specific document's classifications with --show:

bash
unsterwerx classify --show e6d22e4d
text
Classifications for e6d22e4d
══════════════════════════════════════════
  tax             (100%) via rule 'tax-documents' at 2026-04-14
  legal           (62%) via rule 'seed-legal' at 2026-04-14
══════════════════════════════════════════

This document is classified as tax with 100% confidence (its filename matched the tax rule's pattern exactly) and also as legal with 62% confidence (some of its content matched the seed-legal rule's patterns). The highest-confidence class is used when resolving retention policy.
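Selecting the class that drives policy resolution amounts to taking the maximum by confidence. A minimal sketch, assuming classifications are available as (class, confidence) pairs; the `primary_class` helper is hypothetical, and the tutorial does not say how ties are broken (this version keeps the first of equal scores).

```python
from typing import List, Tuple


def primary_class(classifications: List[Tuple[str, float]]) -> str:
    """Pick the class with the highest confidence score."""
    if not classifications:
        raise ValueError("document has no classifications")
    # max() returns the first item among equal scores.
    return max(classifications, key=lambda c: c[1])[0]


# Values taken from the --show output above.
chosen = primary_class([("tax", 1.00), ("legal", 0.62)])
```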

Step 5 - Assign Organizational Scope to Documents

Documents start with no scope, which means they fall under global policies. To place a document under a specific organizational boundary, use rules assign-scope:

bash
unsterwerx rules assign-scope e6d22e4d-dbfd-40be-904c-2050e6c9d5d5 \
    --scope whetsel/taxes/2020
text
Scope 'whetsel/taxes/2020' assigned to document 'e6d22e4d-dbfd-40be-904c-2050e6c9d5d5'

Scope paths follow a hierarchical format: organization/division/user. Here, whetsel is the organization, taxes is the division, and 2020 is the user-level scope.
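To make the hierarchy concrete, the sketch below expands a scope path into the chain of scopes it belongs to, from broadest to narrowest. The `scope_chain` helper and the idea of listing `global` as the first link are illustrative assumptions, not Unsterwerx internals.

```python
from typing import List

LEVELS = ("organization", "division", "user")


def scope_chain(path: str) -> List[str]:
    """Expand 'org/division/user' into the chain from broadest to narrowest."""
    parts = [p for p in path.split("/") if p]
    if len(parts) > len(LEVELS):
        raise ValueError(f"scope path has more than {len(LEVELS)} levels: {path!r}")
    chain = ["global"]
    for i in range(len(parts)):
        chain.append("/".join(parts[: i + 1]))
    return chain


chain = scope_chain("whetsel/taxes/2020")
# chain == ["global", "whetsel", "whetsel/taxes", "whetsel/taxes/2020"]
```

An unscoped document corresponds to an empty path, which yields only `["global"]`.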

Warning: Scope assignment is permanent. Once you assign a scope to a document, it cannot be changed to a different value. The same scope can be re-applied, but switching to a different scope path is blocked. Plan your organizational hierarchy before assigning scopes in bulk.
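The write-once behavior described in the warning can be modeled as a small guard. This is a sketch under assumptions: the in-memory `scopes` dictionary, the `assign_scope` function, and the `ScopeConflict` exception are hypothetical names used to show the rule, not Unsterwerx's storage or error types.

```python
from typing import Dict


class ScopeConflict(Exception):
    """Raised when a document already has a different scope."""


def assign_scope(scopes: Dict[str, str], doc_id: str, scope: str) -> None:
    """Assign a scope once; re-applying the same value is a no-op."""
    current = scopes.get(doc_id)
    if current is not None and current != scope:
        raise ScopeConflict(
            f"document {doc_id} is already scoped to {current!r}"
        )
    scopes[doc_id] = scope


scopes: Dict[str, str] = {}
assign_scope(scopes, "e6d22e4d", "whetsel/taxes/2020")
assign_scope(scopes, "e6d22e4d", "whetsel/taxes/2020")  # same scope: allowed
```

A later call with a different path, such as `whetsel/taxes/2021`, would raise `ScopeConflict` in this model.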

You can also assign scope at ingest time using unsterwerx ingest --scope acme/engineering /path/to/docs, which is more practical for large batches.

Step 6 - Resolve Effective Policy for a Document

The policy cascade flows from general to specific: global > organization > division > user. Each narrower scope can only tighten constraints. A division cannot set a shorter retention period than the organization above it.
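One way to read "can only tighten" is as a fold over the policies from global down to user, where retention takes the longer value and the immutable and legal-hold flags can be switched on but never off. The sketch below models that reading. The `Policy` dataclass and `cascade` function are hypothetical, treating indefinite retention as the strictest value is an assumption, and the expiry action is left out because the tutorial does not say how actions combine.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Policy:
    # Hypothetical structure for illustration.
    retention_years: Optional[int]  # None = indefinite
    immutable: bool = False
    legal_hold: bool = False


def cascade(policies: Iterable[Policy]) -> Policy:
    """Fold policies ordered global -> user, only ever tightening."""
    result: Optional[Policy] = None
    for p in policies:
        if result is None:
            result = p
            continue
        if result.retention_years is None or p.retention_years is None:
            years = None  # assumption: indefinite is the strictest retention
        else:
            years = max(result.retention_years, p.retention_years)
        result = Policy(
            retention_years=years,
            immutable=result.immutable or p.immutable,
            legal_hold=result.legal_hold or p.legal_hold,
        )
    if result is None:
        raise ValueError("no policies to cascade")
    return result


# A narrower scope asking for less retention does not loosen the result.
effective = cascade([Policy(7, immutable=True), Policy(3)])
```

In this model `effective` keeps the 7-year retention and the immutable flag from the broader policy.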

To see what policy actually applies to a specific document after cascading:

bash
unsterwerx rules resolve --document e6d22e4d-dbfd-40be-904c-2050e6c9d5d5
text
Effective Policy for e6d22e4d-dbfd-40be-904c-2050e6c9d5d5
  Scope: whetsel/taxes/2020
══════════════════════════════════════════════════════════════
  Class:      tax
  Retention:  7 years
  Mutable:    IMMUTABLE
  Legal hold: no
  Action:     keep
  Scopes:     global
  Rules:      tax-documents, seed-legal
══════════════════════════════════════════════════════════════

The document's highest-confidence class is tax, so the tax-7yr global policy applies. The Scopes: global line tells you there are no tighter organization or division policies narrowing things further. The Rules field lists all classification rules that matched this document.

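You can also preview what the cascaded policy looks like for a class without targeting a specific document. Omit --scope for the global Business Intelligence defaults, or provide a real scope path for a scoped preview. Note that the sample output below lists a global policy named gov-global, which is not one of the policies created in Step 3; your output will reflect the policies that exist in your own database: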

bash
unsterwerx rules resolve --class government
text
Effective Policy for class 'government' in scope 'global'
══════════════════════════════════════════════════════════════
  Input policies:
    gov-global           scope=global               retain=7 years      IMMUTABLE action=keep

  Cascaded result:
    Retention:  7 years
    Mutable:    IMMUTABLE
    Legal hold: no
    Action:     keep
    Scopes:     global
══════════════════════════════════════════════════════════════

For a scoped preview, pass an organization, division, or user path:

bash
unsterwerx rules resolve --class government --scope govwerx
text
Effective Policy for class 'government' in scope 'govwerx'
══════════════════════════════════════════════════════════════
  Input policies:
    gov-permanent        scope=organization:govwerx retain=100 years    IMMUTABLE action=keep [LEGAL HOLD]

  Cascaded result:
    Retention:  100 years
    Mutable:    IMMUTABLE
    Legal hold: YES
    Action:     keep
    Scopes:     organization
══════════════════════════════════════════════════════════════

This preview is useful when designing your policy hierarchy. You can test how global and scoped policies interact before assigning scopes to real documents.

Step 7 - Preview Archival Actions

Before the archive command moves or deletes anything, always preview with --dry-run:

bash
unsterwerx archive --dry-run
text
Archive Dry Run
══════════════════════════════════
  Processed:       485
  Moved:             0
  Deleted:           0
  Skipped:          30
  Retention:       455
  Errors:            0
  Freed:           0 B
══════════════════════════════════

Here is what each counter means:

In this case, all 455 documents with policies are still within their retention windows. The 30 skipped documents have classifications but no retention policy defined for their class. No documents have expired yet, so nothing would be moved or deleted.

Note: Documents marked as immutable or under legal hold are never archived, even if their retention period has technically expired. You must explicitly release a legal hold before archival can proceed.
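The dry-run counters can be thought of as buckets that each document falls into. The sketch below is one plausible mapping, inferred from the explanation above rather than taken from Unsterwerx: the `Doc` record, the `dry_run_bucket` and `tally` helpers, the `delete` action name, and the choice to count protected and `keep` documents under "retention" are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass
class Doc:
    # Hypothetical record for illustration; not Unsterwerx's data model.
    has_policy: bool
    expires: Optional[date] = None  # None = indefinite retention
    immutable: bool = False
    legal_hold: bool = False
    action: str = "keep"            # "delete" is an assumed action name


def dry_run_bucket(doc: Doc, today: date) -> str:
    """Name the dry-run counter a document would fall under (assumed mapping)."""
    if not doc.has_policy:
        return "skipped"      # classified, but no policy for its class
    if doc.legal_hold or doc.immutable:
        return "retention"    # protected documents are never archived
    if doc.expires is None or today < doc.expires:
        return "retention"    # still inside the retention window
    if doc.action == "move":
        return "moved"
    if doc.action == "delete":
        return "deleted"
    return "retention"        # action "keep": left in place after expiry


def tally(docs: Iterable[Doc], today: date) -> Counter:
    """Count documents per bucket, as a dry run would report them."""
    return Counter(dry_run_bucket(d, today) for d in docs)


today = date(2026, 4, 14)
counts = tally([
    Doc(has_policy=False),
    Doc(True, date(2033, 4, 14), immutable=True),
    Doc(True, date(2020, 1, 1), action="move"),
    Doc(True, date(2020, 1, 1), legal_hold=True, action="move"),
], today)
```

The last sample document has expired and carries a move action, yet it stays in the "retention" bucket because its legal hold takes precedence.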

Step 8 - Manage Rule Lifecycle

Rules are not permanent. You can retire, reactivate, or permanently purge them.

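Retiring a rule is a soft delete. The rule stops matching documents and its live classifications are removed, but the rule definition is preserved so it can be reactivated later. The example below retires a rule named book-templates, which is not one of the rules created earlier in this tutorial: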

bash
unsterwerx rules remove book-templates
text
Rule 'book-templates' retired. 0 classification(s) removed, 0 document(s) reset to canonical.
  Rule definition preserved. This removes the rule's live classifications; re-run 'classify' after reactivation to rebuild them.
  Use 'rules reactivate book-templates' to re-enable, or 'rules remove book-templates --purge' to permanently delete.

Retired rules appear in a separate section of rules list, so you always know they exist.

To bring a retired rule back:

bash
unsterwerx rules reactivate book-templates
text
Rule 'book-templates' reactivated. Run 'classify' to apply to all documents (including already-classified).

After reactivation, run classify again to re-apply the rule across the corpus.

To permanently delete a rule and all its classification records, use --purge:

bash
unsterwerx rules remove book-templates --purge

Warning: Purging is irreversible. The rule definition and every classification it ever produced are permanently deleted. Use retire (the default) unless you are certain you want to erase the rule's history entirely.
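The lifecycle in this step can be summarized as a small state machine. The transition table below is inferred from the commands shown above and is a sketch only: the state names and the `next_state` helper are hypothetical, and the table assumes an active rule can be purged without retiring it first.

```python
# Assumed lifecycle, inferred from the commands in this step.
TRANSITIONS = {
    ("active", "remove"): "retired",
    ("retired", "reactivate"): "active",
    ("active", "remove --purge"): "purged",   # assumption: no prior retire needed
    ("retired", "remove --purge"): "purged",
}


def next_state(state: str, command: str) -> str:
    """Return the rule's state after a command, or raise if not allowed."""
    try:
        return TRANSITIONS[(state, command)]
    except KeyError:
        raise ValueError(
            f"cannot apply {command!r} to a rule that is {state}"
        ) from None


state = next_state("active", "remove")    # "retired"
state = next_state(state, "reactivate")   # back to "active"
```

No transition leads out of `purged`, which mirrors the warning that purging is irreversible.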

Conclusion

You now have a working classification and retention system. Your documents are classified by type through pattern-matching rules, governed by retention policies that enforce how long they are kept and whether they can be modified, and scoped to organizational boundaries that control policy cascading.

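Here is what you set up: four classification rules (tax-documents, resumes, financial-reports, and government-docs), three retention policies (tax-7yr, financial-10yr, and gov-permanent), and an organizational scope (whetsel/taxes/2020) on one document.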

The classification and retention system feeds directly into the deduplication pipeline. When the knowledge dedup command evaluates which documents to remove, it respects legal holds and immutability flags set by your retention policies. Documents under legal hold are never removed, even if they are exact duplicates.

To continue building on this foundation, see How To Detect and Remove Duplicate Documents with Unsterwerx, which covers the Bayesian knowledge scoring and automated deduplication workflow.