How To Ingest and Normalize Enterprise Documents with Unsterwerx
Unsterwerx turns a messy folder tree of PDFs, Word docs, spreadsheets, and presentations into a compact, searchable, hash-verified knowledge store. In this tutorial you will initialize a data directory, ingest documents from disk, preview what will be processed, organize your library with scopes, and confirm the results. By the end you will have a working Universal Data Set ready for search, similarity analysis, and classification.
Prerequisites
- Unsterwerx v0.5.4 or newer installed and available on your PATH. See Installation if you have not set it up yet.
- A directory of enterprise documents (PDF, DOCX, XLSX, PPTX, TXT, CSV, Markdown, SQL, RTF). This tutorial uses ~/documents as a placeholder; replace it with your actual path.
- Basic familiarity with the terminal.
Step 1 - Initialize the Data Directory
Before you ingest anything, Unsterwerx needs a data directory. This is the Shared Sandbox in patent terminology: a trusted local processing environment where all canonical data, indexes, and audit logs live.
Run the initialization command:
unsterwerx config init
Config initialized: /Users/you/.unsterwerx/config.toml
This creates the ~/.unsterwerx directory with a default config.toml, an empty SQLite database, and the audit log. All subsequent commands read from and write to this location.
To see what the defaults look like:
unsterwerx config show
[ingest]
extensions = [
"pdf",
"docx",
"xlsx",
"pptx",
"doc",
"xls",
"ppt",
"txt",
"csv",
"rtf",
"md",
"markdown",
"sql",
]
max_file_size = 524288000
max_size_file = 104857600
skip_hidden = true
follow_symlinks = false
pdf_fallback_pdftotext = true
[similarity]
shingle_k = 3
num_hashes = 128
lsh_bands = 32
lsh_rows = 4
threshold = 0.3
[storage]
journal_mode = "wal"
busy_timeout_ms = 5000
zstd_level = 3
[metadata]
capture_enabled = false
extractors = [
"builtin_pdf",
"builtin_ooxml",
"builtin_image",
]
A few things to note. The max_file_size (500 MB) controls which files are discovered during scans. The max_size_file (100 MB) is the in-memory guard for parsers. Storage uses WAL-mode SQLite with Zstandard level 3 compression. These defaults work well for most enterprise document sets; you can tune them later with unsterwerx config set.
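The size thresholds and similarity parameters are easy to sanity-check before tuning. A minimal Python sketch with the values copied from the config above (the relationship lsh_bands × lsh_rows = num_hashes is the standard MinHash-LSH banding constraint, assumed here rather than taken from Unsterwerx documentation):

```python
# Defaults copied from the `unsterwerx config show` output above.
max_file_size = 524_288_000   # discovery cutoff, in bytes
max_size_file = 104_857_600   # in-memory parser guard, in bytes

MB = 1024 * 1024
print(max_file_size // MB)    # 500 -> the 500 MB scan limit
print(max_size_file // MB)    # 100 -> the 100 MB parser limit

# Standard MinHash-LSH banding: bands * rows should equal the number
# of hash functions (an assumption based on the usual LSH scheme).
lsh_bands, lsh_rows, num_hashes = 32, 4, 128
assert lsh_bands * lsh_rows == num_hashes
```

Running the same arithmetic on any value you pass to unsterwerx config set is a quick way to avoid off-by-a-factor-of-1024 mistakes.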
Step 2 - Ingest Your First Document Directory
Point Unsterwerx at a directory and it will recursively scan for supported files, compute SHA-256 content hashes, and perform normalization on each document. Normalization is the core of the architecture: each file is routed through a format-specific NAC (Normalized Application Container) that extracts text and structure into a canonical form. The results are stored as the Universal Data Set, a normalized representation of all your ingested content.
This all happens in a single pass: scan, hash, parse, canonicalize, index.
unsterwerx ingest ~/documents
Ingest Summary
══════════════════════════════════
Files discovered: 754
Empty (skipped): 1
Oversized (skipped): 0
──────────────────────────────
Files eligible: 753
Files ingested: 631
Duplicates: 75
Unsupported: 15
Skipped: 0
Errors: 32
Image-only: 17 (of errors)
Indexed (FTS5): 631
══════════════════════════════════
Here is how to read the summary:
- Files discovered / eligible: 754 files were found; 1 was empty and therefore skipped, leaving 753 eligible candidates.
- Files ingested: 631 documents were successfully normalized and added to the Universal Data Set.
- Duplicates: 75 files had the same SHA-256 content hash as something already in the database. Unsterwerx skips these automatically.
- Unsupported: 15 files were in legacy formats (.doc, .xls, .ppt) that get registered but cannot be parsed yet.
- Errors: 32 files failed extraction. Of those, 17 were image-only (scanned) PDFs that require OCR. The rest were encrypted or structurally damaged.
- Indexed (FTS5): 631 documents were added to the full-text search index and are immediately searchable.
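The counters in the summary should reconcile: every eligible file lands in exactly one outcome bucket. A quick check with the numbers above (the bucket names come from the summary; the invariant itself is an assumption about how the counters relate):

```python
# Counters copied from the Ingest Summary above.
eligible = 753
ingested, duplicates, unsupported, skipped, errors = 631, 75, 15, 0, 32

# Every eligible file ends up in exactly one outcome bucket.
assert ingested + duplicates + unsupported + skipped + errors == eligible

# Image-only files are counted as a subset of errors, per the summary.
image_only = 17
assert image_only <= errors
```

If your own run does not reconcile this way, the discrepancy is worth investigating before moving on.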
Note: Unsterwerx streams files through an 8 KB buffer for hashing. Large files are never loaded entirely into memory during the hash phase.
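The streaming-hash behavior described in the note can be sketched in a few lines of Python (an illustration of the technique, not Unsterwerx's actual implementation):

```python
import hashlib

def sha256_streaming(path: str, bufsize: int = 8 * 1024) -> str:
    """Hash a file through a fixed 8 KB buffer, so even very large
    files never need to fit in memory during the hash phase."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()
```

Two files with identical bytes produce identical digests regardless of name or location, which is exactly what drives the duplicate detection you will see in later steps.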
Step 3 - Preview What Will Be Ingested
If you want to see what Unsterwerx would do before committing anything, use --dry-run. Combine it with --extension to filter by file type.
To preview only PDF files in a directory:
unsterwerx ingest --dry-run --extension pdf ~/documents
Dry Run
══════════════════════════════════
Files discovered: 150
Empty (skipped): 0
Oversized (skipped): 0
──────────────────────────────
Files eligible: 150
Already ingested: 0
Errors: 0
──────────────────────────────
Candidates (new): 150
══════════════════════════════════
Dry-run scans the directory and checks content hashes against the database, but writes nothing. The "Already ingested" line tells you how many files are duplicates of documents you have already processed. "Candidates (new)" is the count of files that would actually be added.
This is useful when you want to ingest a single format first, or when you need to estimate the scope of a new directory before committing.
Step 4 - Organize Documents with Scopes
Scopes give you a hierarchical tag system for organizing ingested documents. Think of them as a path: organization/division/user. Scopes feed into the classification and policy engines, so a document scoped to acme/gov will only receive rules and policies applicable to that branch.
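The branch-based rule matching described above behaves like a path-prefix check. A minimal sketch (scope_applies is a hypothetical helper; the real rule-resolution logic is not documented here):

```python
def scope_applies(policy_scope: str, doc_scope: str) -> bool:
    """A policy scoped to `acme` applies to `acme/gov` and deeper,
    but a policy scoped to `acme/gov` does not apply to `acme/retail`."""
    policy = policy_scope.split("/")
    doc = doc_scope.split("/")
    return doc[: len(policy)] == policy

assert scope_applies("acme", "acme/gov")
assert scope_applies("acme/gov", "acme/gov")
assert not scope_applies("acme/gov", "acme/retail")
```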
To ingest a directory under a specific scope:
unsterwerx ingest --scope acme/gov ~/documents/government
Ingest Summary
══════════════════════════════════
Files discovered: 96
Empty (skipped): 0
Oversized (skipped): 0
──────────────────────────────
Files eligible: 96
Files ingested: 5
Duplicates: 86
Unsupported: 2
Skipped: 0
Errors: 3
Indexed (FTS5): 5
══════════════════════════════════
Notice the 86 duplicates. This directory contained files that were already ingested in Step 2 from a different path. Unsterwerx detected this through content hashing. The file names and locations were different, but the content was identical. Only 5 genuinely new documents were added, and those 5 were assigned the acme/gov scope.
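Content-hash deduplication, the mechanism behind those 86 skipped files, can be illustrated in isolation (a sketch of the technique, not Unsterwerx's code):

```python
import hashlib

seen: set[str] = set()

def is_duplicate(content: bytes) -> bool:
    """Register content by SHA-256 digest; return True if the same
    bytes were seen before, regardless of filename or path."""
    digest = hashlib.sha256(content).hexdigest()
    if digest in seen:
        return True
    seen.add(digest)
    return False

# The same bytes arriving from two different paths: the second
# occurrence is flagged as a duplicate.
assert not is_duplicate(b"quarterly report 2025")
assert is_duplicate(b"quarterly report 2025")
assert not is_duplicate(b"a genuinely new document")
```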
Note: Scope assignment is one-way. Once a document has a scope, it cannot be reassigned to a different one. Choose your scope hierarchy before bulk-ingesting.
Step 5 - Check Your Document Library
The status command gives you an overview of everything in the database.
unsterwerx status
Unsterwerx Status
══════════════════════════════════════════
Data directory: /Users/you/.unsterwerx
Total documents: 1128
Total size: 2.1 GB
Indexed (FTS5): 1042
Audit events: 103
By Status:
canonical 1042
error 36
image_only 25
unsupported 25
══════════════════════════════════════════
Four document statuses to understand:
- canonical - successfully normalized and indexed. These are searchable and ready for similarity analysis.
- error - extraction failed, usually due to encryption, corruption, or malformed archive headers. Review these with unsterwerx status errors.
- image_only - scanned PDFs that contain no extractable text. They are registered but need OCR to be fully processed.
- unsupported - legacy formats (.doc, .xls, .ppt) that are tracked but cannot be parsed.
Add --detailed to see the full breakdown by file type:
unsterwerx status --detailed
Unsterwerx Status
══════════════════════════════════════════
Data directory: /Users/you/.unsterwerx
Total documents: 1128
Total size: 2.1 GB
Indexed (FTS5): 1042
Audit events: 103
By Status:
canonical 1042
error 36
image_only 25
unsupported 25
By File Type:
csv 17
doc 14
docx 246
markdown 20
pdf 441
ppt 4
pptx 233
sql 16
txt 46
unknown 2
xls 5
xlsx 84
Similarity:
Candidate pairs: 0
Exact dupes: 0
Classification:
Active rules: 6
Classified docs: 0
══════════════════════════════════════════
This dataset contains 1,128 documents totaling 2.1 GB of original data. PDFs are the most common format (441), followed by DOCX (246) and PPTX (233). The similarity and classification sections show zeros because those analysis steps have not been run yet.
Here is the storage picture. In benchmarks against this same 2.1 GB dataset, the Universal Data Set compacts to 74 MB of canonical text (96.5% compaction), with a total database footprint of 285 MB (86.5% reduction). The original files stay where they are; Unsterwerx stores only the extracted canonical representation and its indexes.
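The quoted percentages follow directly from the sizes. A quick recomputation (the 2.1 GB figure is itself rounded, so the results land within a few tenths of the quoted values; decimal megabytes are assumed):

```python
original_mb = 2.1 * 1000   # "2.1 GB", already a rounded figure
canonical_mb = 74          # canonical text store
database_mb = 285          # total database footprint

compaction = 1 - canonical_mb / original_mb   # ~96.5% compaction
reduction = 1 - database_mb / original_mb     # ~86.5% reduction

assert abs(compaction - 0.965) < 0.01
assert abs(reduction - 0.865) < 0.01
```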
Step 6 - Enable Rich Metadata Capture
By default, Unsterwerx extracts and indexes document content but skips embedded metadata (author, creation date, software, page count). To capture metadata during ingest, add the --capture-metadata flag:
unsterwerx ingest --capture-metadata ~/documents/reports
Ingest Summary
══════════════════════════════════
Files discovered: 29
Empty (skipped): 0
Oversized (skipped): 0
──────────────────────────────
Files eligible: 29
Files ingested: 28
Duplicates: 0
Unsupported: 1
Skipped: 0
Errors: 0
Indexed (FTS5): 28
══════════════════════════════════
With metadata capture enabled, Unsterwerx runs the built-in extractors (builtin_pdf, builtin_ooxml, builtin_image) alongside the canonical extraction pass. You can then query metadata with unsterwerx metadata keys and unsterwerx metadata show <document-id>.
Note: If you already ingested documents without --capture-metadata, you can extract metadata after the fact with unsterwerx metadata extract. No re-ingestion required.
Step 7 - Script with JSON Output
Every Unsterwerx command supports --json for machine-readable output. This makes it straightforward to integrate with CI/CD pipelines, monitoring scripts, or anything that consumes structured data.
unsterwerx ingest --json ~/documents/contracts
{
"command": "ingest",
"version": "0.5.4",
"timestamp": "2026-04-14T15:23:21.636681+00:00",
"data": {
"run_id": "7f0b52fc-a77e-4020-8235-694c1de78bab",
"source_type": "local",
"status": "completed",
"counters": {
"files_discovered": 23,
"files_scanned": 23,
"files_ingested": 17,
"duplicates": 6,
"unsupported": 0,
"skipped": 0,
"empty_skipped": 0,
"oversized_skipped": 0,
"errors": 0,
"indexed": 17
}
}
}
Pipe it to jq to extract specific values:
unsterwerx ingest --json ~/documents/contracts | jq '.data.counters.files_ingested'
17
Or check for errors in a script:
errors=$(unsterwerx ingest --json ~/documents/contracts | jq '.data.counters.errors')
if [ "$errors" -gt 0 ]; then
echo "Ingest completed with $errors errors"
fi
The JSON envelope always includes command, version, and timestamp, so your scripts can verify they are parsing the expected output format.
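A consumer can validate the envelope before touching the counters. A minimal sketch in Python, assuming only the fields shown in the example output above (parse_ingest_report is a hypothetical helper):

```python
import json

def parse_ingest_report(raw: str) -> dict:
    """Check the envelope fields shown in the example, then
    return the counters dictionary."""
    report = json.loads(raw)
    for field in ("command", "version", "timestamp", "data"):
        if field not in report:
            raise ValueError(f"missing envelope field: {field}")
    if report["command"] != "ingest":
        raise ValueError(f"unexpected command: {report['command']}")
    return report["data"]["counters"]

raw = '''{"command": "ingest", "version": "0.5.4",
          "timestamp": "2026-04-14T15:23:21+00:00",
          "data": {"counters": {"files_ingested": 17, "errors": 0}}}'''
counters = parse_ingest_report(raw)
assert counters["files_ingested"] == 17 and counters["errors"] == 0
```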
Conclusion
You initialized a Shared Sandbox data directory, ingested over a thousand documents across multiple formats, previewed ingestion with dry-run, organized documents with scopes, and checked the library status. Unsterwerx compressed 2.1 GB of enterprise documents into a 74 MB canonical representation, automatically detected duplicates across different directory paths, and indexed everything for full-text search.
Your Universal Data Set is ready. Next steps:
- Run unsterwerx similarity to find near-duplicate document pairs.
- Run unsterwerx search "your query" to search across all canonical content.
- Read the next tutorial, How To Search and Compare Documents with Unsterwerx, for a deep dive into search, similarity analysis, and document diffing.