How To Ingest and Normalize Enterprise Documents with Unsterwerx

Unsterwerx turns a messy folder tree of PDFs, Word docs, spreadsheets, and presentations into a compact, searchable, hash-verified knowledge store. In this tutorial you will initialize a data directory, ingest documents from disk, preview what will be processed, organize your library with scopes, and confirm the results. By the end you will have a working Universal Data Set ready for search, similarity analysis, and classification.

Prerequisites

You need the unsterwerx CLI installed and available on your PATH, plus a directory of documents you want to ingest. The examples below use paths under ~/documents; substitute your own.

Step 1 - Initialize the Data Directory

Before you ingest anything, Unsterwerx needs a data directory. This is the Shared Sandbox in patent terminology: a trusted local processing environment where all canonical data, indexes, and audit logs live.

Run the initialization command:

```bash
unsterwerx config init
```

```text
Config initialized: /Users/you/.unsterwerx/config.toml
```

This creates the ~/.unsterwerx directory with a default config.toml, an empty SQLite database, and the audit log. All subsequent commands read from and write to this location.

To see what the defaults look like:

```bash
unsterwerx config show
```

```toml
[ingest]
extensions = [
    "pdf",
    "docx",
    "xlsx",
    "pptx",
    "doc",
    "xls",
    "ppt",
    "txt",
    "csv",
    "rtf",
    "md",
    "markdown",
    "sql",
]
max_file_size = 524288000
max_size_file = 104857600
skip_hidden = true
follow_symlinks = false
pdf_fallback_pdftotext = true

[similarity]
shingle_k = 3
num_hashes = 128
lsh_bands = 32
lsh_rows = 4
threshold = 0.3

[storage]
journal_mode = "wal"
busy_timeout_ms = 5000
zstd_level = 3

[metadata]
capture_enabled = false
extractors = [
    "builtin_pdf",
    "builtin_ooxml",
    "builtin_image",
]
```

A few settings deserve attention. max_file_size (500 MB) caps which files are picked up during a scan; anything larger is counted as oversized and skipped. max_size_file (100 MB) is the separate in-memory guard for parsers. Storage uses WAL-mode SQLite with Zstandard level 3 compression. These defaults work well for most enterprise document sets, and you can tune any of them later with unsterwerx config set.
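The [similarity] defaults are internally consistent, and you can check them with the standard MinHash LSH math: the signature length must equal lsh_bands × lsh_rows, and a banding scheme of b bands and r rows has a characteristic collision threshold near (1/b)^(1/r). A quick sketch of that math in Python (this illustrates the general LSH formula, not Unsterwerx internals):

```python
# Check the [similarity] defaults against standard MinHash LSH math.
num_hashes = 128
bands, rows = 32, 4

# The MinHash signature must divide evenly into equal-width bands.
assert bands * rows == num_hashes

# Pairs with Jaccard similarity near (1/b)^(1/r) have roughly even odds
# of colliding in at least one band; that is the scheme's natural cutoff.
characteristic = (1 / bands) ** (1 / rows)
print(f"LSH characteristic threshold: {characteristic:.2f}")  # 0.42
```

The configured threshold of 0.3 sits below that 0.42 cutoff, so under the standard scheme pairs in the 0.3 to 0.42 range only surface when they happen to collide in a band; increasing lsh_bands (with correspondingly fewer rows) lowers the cutoff at the cost of more candidate pairs.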

Step 2 - Ingest Your First Document Directory

Point Unsterwerx at a directory and it will recursively scan for supported files, compute SHA-256 content hashes, and perform normalization on each document. Normalization is the core of the architecture: each file is routed through a format-specific NAC (Normalized Application Container) that extracts text and structure into a canonical form. The results are stored as the Universal Data Set, a normalized representation of all your ingested content.

This all happens in a single pass. Scan, hash, parse, canonicalize, index.

```bash
unsterwerx ingest ~/documents
```

```text
Ingest Summary
══════════════════════════════════
  Files discovered:     754
  Empty (skipped):        1
  Oversized (skipped):    0
  ──────────────────────────────
  Files eligible:       753
  Files ingested:       631
  Duplicates:            75
  Unsupported:           15
  Skipped:                0
  Errors:                32
    Image-only:          17   (of errors)
  Indexed (FTS5):       631
══════════════════════════════════
```

Here is how to read the summary:

- Files discovered: everything found in the recursive scan.
- Empty / Oversized (skipped): zero-byte files and files over max_file_size, excluded before hashing.
- Files eligible: discovered minus the two skipped rows.
- Files ingested: genuinely new documents added to the Universal Data Set.
- Duplicates: files whose SHA-256 content hash matched a document already in the store.
- Unsupported: files that passed the extension filter but could not be routed to a NAC.
- Errors: files that failed during parsing. The Image-only sub-count is documents with no extractable text layer, such as scanned PDFs.
- Indexed (FTS5): documents added to the full-text search index.

Note: Unsterwerx streams files through an 8 KB buffer for hashing. Large files are never loaded entirely into memory during the hash phase.
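That streaming hash phase is easy to picture. Here is a minimal Python sketch of buffered SHA-256 hashing (illustrative only; the 8 KB buffer size comes from the note above):

```python
import hashlib

def content_hash(path, buf_size=8192):
    """Hash a file in 8 KB chunks so memory use stays flat,
    no matter how large the file is."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(buf_size):
            h.update(chunk)
    return h.hexdigest()
```

Because the digest depends only on file contents, two copies of a document in different directories, under different names, hash identically. That is the mechanism behind the duplicate counts in the summaries.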

Step 3 - Preview What Will Be Ingested

If you want to see what Unsterwerx would do before committing anything, use --dry-run. Combine it with --extension to filter by file type.

To preview only PDF files in a directory:

```bash
unsterwerx ingest --dry-run --extension pdf ~/documents
```

```text
Dry Run
══════════════════════════════════
  Files discovered:      150
  Empty (skipped):         0
  Oversized (skipped):     0
  ──────────────────────────────
  Files eligible:        150
  Already ingested:        0
  Errors:                  0
  ──────────────────────────────
  Candidates (new):      150
══════════════════════════════════
```

Dry-run scans the directory and checks content hashes against the database, but writes nothing. The "Already ingested" line tells you how many files are duplicates of documents you have already processed. "Candidates (new)" is the count of files that would actually be added.

This is useful when you want to ingest a single format first, or when you need to estimate the scope of a new directory before committing.
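Conceptually, the dry-run accounting is a set-membership check of freshly computed content hashes against what the database already holds. A simplified sketch with hypothetical data (not the real implementation):

```python
def dry_run(scanned, known_hashes, extension=None):
    """scanned is a list of (path, content_hash) pairs; known_hashes
    holds the hashes of everything already in the database."""
    if extension is not None:
        scanned = [(p, h) for p, h in scanned if p.endswith("." + extension)]
    already = sum(1 for _, h in scanned if h in known_hashes)
    return {
        "eligible": len(scanned),
        "already_ingested": already,
        "candidates": len(scanned) - already,
    }

report = dry_run(
    [("a.pdf", "h1"), ("b.pdf", "h2"), ("c.docx", "h3")],
    known_hashes={"h2"},
    extension="pdf",
)
print(report)  # {'eligible': 2, 'already_ingested': 1, 'candidates': 1}
```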

Step 4 - Organize Documents with Scopes

Scopes give you a hierarchical tag system for organizing ingested documents. Think of them as a path: organization/division/user. Scopes feed into the classification and policy engines, so a document scoped to acme/gov will only receive rules and policies applicable to that branch.

To ingest a directory under a specific scope:

```bash
unsterwerx ingest --scope acme/gov ~/documents/government
```

```text
Ingest Summary
══════════════════════════════════
  Files discovered:      96
  Empty (skipped):        0
  Oversized (skipped):    0
  ──────────────────────────────
  Files eligible:        96
  Files ingested:         5
  Duplicates:            86
  Unsupported:            2
  Skipped:                0
  Errors:                 3
  Indexed (FTS5):         5
══════════════════════════════════
```

Notice the 86 duplicates. This directory contained files that were already ingested in Step 2 from a different path. Unsterwerx detected this through content hashing. The file names and locations were different, but the content was identical. Only 5 genuinely new documents were added, and those 5 were assigned the acme/gov scope.

Note: Scope assignment is one-way. Once a document has a scope, it cannot be reassigned to a different one. Choose your scope hierarchy before bulk-ingesting.
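The branch behavior described above amounts to path-prefix matching on scope segments. A sketch of the assumed semantics (scope_applies is a hypothetical helper for illustration, not part of the Unsterwerx API):

```python
def scope_applies(policy_scope, doc_scope):
    """A policy applies when its scope equals the document's scope
    or is an ancestor of it in the organization/division/user tree."""
    policy = policy_scope.split("/")
    doc = doc_scope.split("/")
    return doc[:len(policy)] == policy

assert scope_applies("acme", "acme/gov")             # ancestor applies
assert scope_applies("acme/gov", "acme/gov")         # exact match applies
assert not scope_applies("acme/retail", "acme/gov")  # sibling branch does not
```

Comparing split segments rather than raw string prefixes matters here: a policy for acme should not match a document scoped to acmecorp/gov.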

Step 5 - Check Your Document Library

The status command gives you an overview of everything in the database.

```bash
unsterwerx status
```

```text
Unsterwerx Status
══════════════════════════════════════════
  Data directory:  /Users/you/.unsterwerx
  Total documents:     1128
  Total size:        2.1 GB
  Indexed (FTS5):      1042
  Audit events:         103

  By Status:
    canonical         1042
    error               36
    image_only          25
    unsupported         25

══════════════════════════════════════════
```

Four document statuses to understand:

- canonical: parsed and normalized successfully; the document is searchable.
- error: parsing failed; the file is recorded, but no canonical text was extracted.
- image_only: the file has no extractable text layer (scanned PDFs, pure images).
- unsupported: no NAC handles this format.

Add --detailed to see the full breakdown by file type:

```bash
unsterwerx status --detailed
```

```text
Unsterwerx Status
══════════════════════════════════════════
  Data directory:  /Users/you/.unsterwerx
  Total documents:     1128
  Total size:        2.1 GB
  Indexed (FTS5):      1042
  Audit events:         103

  By Status:
    canonical         1042
    error               36
    image_only          25
    unsupported         25

  By File Type:
    csv            17
    doc            14
    docx          246
    markdown       20
    pdf           441
    ppt             4
    pptx          233
    sql            16
    txt            46
    unknown         2
    xls             5
    xlsx           84

  Similarity:
    Candidate pairs:      0
    Exact dupes:          0

  Classification:
    Active rules:         6
    Classified docs:      0
══════════════════════════════════════════
```

This dataset contains 1,128 documents totaling 2.1 GB of original data. PDFs are the most common format (441), followed by DOCX (246) and PPTX (233). The similarity and classification sections show zeros because those analysis steps have not been run yet.

Here is the storage picture. In benchmarks against this same 2.1 GB dataset, the Universal Data Set compacts to 74 MB of canonical text (96.5% compaction), with a total database footprint of 285 MB (86.5% reduction). The original files stay where they are; Unsterwerx stores only the extracted canonical representation and its indexes.
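You can reproduce those percentages from the sizes above. Note that the 2.1 GB figure is already rounded, so the last decimal can land slightly off the benchmark numbers quoted in the text:

```python
# Sizes in MB, taken from the status output and benchmark figures above.
original_mb = 2.1 * 1000   # original documents on disk (rounded)
canonical_mb = 74          # canonical text in the Universal Data Set
database_mb = 285          # total footprint, including indexes and FTS5

compaction = 100 * (1 - canonical_mb / original_mb)
reduction = 100 * (1 - database_mb / original_mb)
print(f"canonical text compaction: {compaction:.1f}%")  # 96.5%
print(f"total footprint reduction: {reduction:.1f}%")   # 86.4%
```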

Step 6 - Enable Rich Metadata Capture

By default, Unsterwerx extracts and indexes document content but skips embedded metadata (author, creation date, software, page count). To capture metadata during ingest, add the --capture-metadata flag:

```bash
unsterwerx ingest --capture-metadata ~/documents/reports
```

```text
Ingest Summary
══════════════════════════════════
  Files discovered:      29
  Empty (skipped):        0
  Oversized (skipped):    0
  ──────────────────────────────
  Files eligible:        29
  Files ingested:        28
  Duplicates:             0
  Unsupported:            1
  Skipped:                0
  Errors:                 0
  Indexed (FTS5):        28
══════════════════════════════════
```

With metadata capture enabled, Unsterwerx runs the built-in extractors (builtin_pdf, builtin_ooxml, builtin_image) alongside the canonical extraction pass. You can then query metadata with unsterwerx metadata keys and unsterwerx metadata show <document-id>.

Note: If you already ingested documents without --capture-metadata, you can extract metadata after the fact with unsterwerx metadata extract. No re-ingestion required.

Step 7 - Script with JSON Output

Every Unsterwerx command supports --json for machine-readable output. This makes it straightforward to integrate with CI/CD pipelines, monitoring scripts, or anything that consumes structured data.

```bash
unsterwerx ingest --json ~/documents/contracts
```

```json
{
  "command": "ingest",
  "version": "0.5.4",
  "timestamp": "2026-04-14T15:23:21.636681+00:00",
  "data": {
    "run_id": "7f0b52fc-a77e-4020-8235-694c1de78bab",
    "source_type": "local",
    "status": "completed",
    "counters": {
      "files_discovered": 23,
      "files_scanned": 23,
      "files_ingested": 17,
      "duplicates": 6,
      "unsupported": 0,
      "skipped": 0,
      "empty_skipped": 0,
      "oversized_skipped": 0,
      "errors": 0,
      "indexed": 17
    }
  }
}
```

Pipe it to jq to extract specific values:

```bash
unsterwerx ingest --json ~/documents/contracts | jq '.data.counters.files_ingested'
```

```text
17
```

Or check for errors in a script:

```bash
errors=$(unsterwerx ingest --json ~/documents/contracts | jq '.data.counters.errors')
if [ "$errors" -gt 0 ]; then
  echo "Ingest completed with $errors errors"
fi
```

The JSON envelope always includes command, version, and timestamp, so your scripts can verify they are parsing the expected output format.
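The same envelope check works outside the shell. A Python sketch using an abbreviated copy of the sample output above:

```python
import json

# Abbreviated envelope from the sample ingest output above.
envelope = json.loads("""
{
  "command": "ingest",
  "version": "0.5.4",
  "data": {"counters": {"files_ingested": 17, "errors": 0}}
}
""")

# Confirm this is the output we expect before trusting the counters.
assert envelope["command"] == "ingest"

counters = envelope["data"]["counters"]
if counters["errors"] > 0:
    raise SystemExit(f"ingest finished with {counters['errors']} errors")
print(f"ingested {counters['files_ingested']} documents")
```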

Conclusion

You initialized a Shared Sandbox data directory, ingested over a thousand documents across multiple formats, previewed ingestion with dry-run, organized documents with scopes, and checked the library status. Unsterwerx compressed 2.1 GB of enterprise documents into a 74 MB canonical representation, automatically detected duplicates across different directory paths, and indexed everything for full-text search.

Your Universal Data Set is ready. Next steps:

- Run full-text searches against the FTS5 index.
- Compute similarity candidate pairs to find near-duplicates beyond exact content matches.
- Apply classification rules; the status output above showed 6 active rules and no classified documents yet.