How To Ingest and Normalize Enterprise Documents with Unsterwerx
Unsterwerx turns a messy folder tree of PDFs, Word docs, spreadsheets, and presentations into a compact, searchable, hash-verified knowledge store. In this tutorial you will initialize a data directory, ingest documents from disk, preview what will be processed, organize your library with scopes, and confirm the results. By the end you will have a working Universal Data Set ready for search, similarity analysis, and classification.
Prerequisites
- Unsterwerx v0.5.4 or newer installed and available on your PATH. See Installation if you have not set it up yet.
- A directory of enterprise documents (PDF, DOCX, XLSX, PPTX, TXT, CSV, Markdown, SQL, RTF). This tutorial uses ~/documents as a placeholder; replace it with your actual path.
- Basic familiarity with the terminal.
Step 1 - Initialize the Data Directory
Before you ingest anything, Unsterwerx needs a data directory. This is the Shared Sandbox in patent terminology: a trusted local processing environment where all canonical data, indexes, and audit logs live.
Run the initialization command:
unsterwerx config init
Config initialized: /Users/you/.unsterwerx/config.toml
This creates the ~/.unsterwerx directory with a default config.toml, an empty SQLite database, and the audit log. All subsequent commands read from and write to this location.
To see what the defaults look like:
unsterwerx config show
[ingest]
extensions = [
"pdf",
"docx",
"xlsx",
"pptx",
"doc",
"xls",
"ppt",
"txt",
"csv",
"rtf",
"md",
"markdown",
"sql",
]
max_file_size = 524288000
max_size_file = 104857600
skip_hidden = true
follow_symlinks = false
pdf_fallback_pdftotext = true
[similarity]
shingle_k = 3
num_hashes = 128
lsh_bands = 32
lsh_rows = 4
threshold = 0.3
[storage]
journal_mode = "wal"
busy_timeout_ms = 5000
zstd_level = 3
[metadata]
capture_enabled = false
extractors = [
"builtin_pdf",
"builtin_ooxml",
"builtin_image",
]
A few things to note. The max_file_size (500 MB) controls which files are discovered during scans. The max_size_file (100 MB) is the in-memory guard for parsers. Storage uses WAL-mode SQLite with Zstandard level 3 compression. These defaults work well for most enterprise document sets; you can tune them later with unsterwerx config set.
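The size thresholds and similarity parameters are easy to sanity-check before tuning. A minimal Python sketch with the values copied from the config above (the relationship lsh_bands × lsh_rows = num_hashes is the standard MinHash-LSH banding constraint, assumed here rather than taken from Unsterwerx documentation):

```python
# Defaults copied from the `unsterwerx config show` output above.
max_file_size = 524_288_000   # discovery cutoff, in bytes
max_size_file = 104_857_600   # in-memory parser guard, in bytes

MB = 1024 * 1024
print(max_file_size // MB)    # 500 -> the 500 MB scan limit
print(max_size_file // MB)    # 100 -> the 100 MB parser limit

# Standard MinHash-LSH banding: bands * rows should equal the number
# of hash functions (an assumption based on the usual LSH scheme).
lsh_bands, lsh_rows, num_hashes = 32, 4, 128
assert lsh_bands * lsh_rows == num_hashes
```

Running the same arithmetic on any value you pass to unsterwerx config set is a quick way to avoid off-by-a-factor-of-1024 mistakes.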
Step 2 - Ingest Your First Document Directory
Point Unsterwerx at a directory and it will recursively scan for supported files, compute SHA-256 content hashes, and perform normalization on each document. Normalization is the core of the architecture: each file is routed through a format-specific NAC (Normalized Application Container) that extracts text and structure into a canonical form. The results are stored as the Universal Data Set, a normalized representation of all your ingested content.
This all happens in a single pass: scan, hash, parse, canonicalize, index.
unsterwerx ingest ~/documents
Ingest Summary
══════════════════════════════════
Files discovered: 754
Empty (skipped): 1
Oversized (skipped): 0
──────────────────────────────
Files eligible: 753
Files ingested: 631
Duplicates: 75
Unsupported: 15
Skipped: 0
Errors: 32
Image-only: 17 (of errors)
Indexed (FTS5): 631
══════════════════════════════════
Here is how to read the summary:
- Files discovered / eligible: 754 files were found; 1 was empty and therefore skipped, leaving 753 eligible candidates.
- Files ingested: 631 documents were successfully normalized and added to the Universal Data Set.
- Duplicates: 75 files had the same SHA-256 content hash as something already in the database. Unsterwerx skips these automatically.
- Unsupported: 15 files were in legacy formats (.doc, .xls, .ppt) that get registered but cannot be parsed yet.
- Errors: 32 files failed extraction. Of those, 17 were image-only (scanned) PDFs that require OCR. The rest were encrypted or structurally damaged.
- Indexed (FTS5): 631 documents were added to the full-text search index and are immediately searchable.
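The counters in the summary should reconcile: every eligible file lands in exactly one outcome bucket. A quick check with the numbers above (the bucket names come from the summary; the invariant itself is an assumption about how the counters relate):

```python
# Counters copied from the Ingest Summary above.
eligible = 753
ingested, duplicates, unsupported, skipped, errors = 631, 75, 15, 0, 32

# Every eligible file ends up in exactly one outcome bucket.
assert ingested + duplicates + unsupported + skipped + errors == eligible

# Image-only files are counted as a subset of errors, per the summary.
image_only = 17
assert image_only <= errors
```

If your own run does not reconcile this way, the discrepancy is worth investigating before moving on.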
Note: Unsterwerx streams files through an 8 KB buffer for hashing. Large files are never loaded entirely into memory during the hash phase.
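The streaming-hash behavior described in the note can be sketched in a few lines of Python (an illustration of the technique, not Unsterwerx's actual implementation):

```python
import hashlib

def sha256_streaming(path: str, bufsize: int = 8 * 1024) -> str:
    """Hash a file through a fixed 8 KB buffer, so even very large
    files never need to fit in memory during the hash phase."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()
```

Two files with identical bytes produce identical digests regardless of name or location, which is exactly what drives the duplicate detection you will see in later steps.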
Step 3 - Preview What Will Be Ingested
If you want to see what Unsterwerx would do before committing anything, use --dry-run. Combine it with --extension to filter by file type.
To preview only PDF files in a directory:
unsterwerx ingest --dry-run --extension pdf ~/documents
Dry Run
══════════════════════════════════
Files discovered: 150
Empty (skipped): 0
Oversized (skipped): 0
──────────────────────────────
Files eligible: 150
Already ingested: 0
Errors: 0
──────────────────────────────
Candidates (new): 150
══════════════════════════════════
Dry-run scans the directory and checks content hashes against the database, but writes nothing. The "Already ingested" line tells you how many files are duplicates of documents you have already processed. "Candidates (new)" is the count of files that would actually be added.
This is useful when you want to ingest a single format first, or when you need to estimate the scope of a new directory before committing.
Step 4 - Organize Documents with Scopes
Scopes give you a hierarchical tag system for organizing ingested documents. Think of them as a path: organization/division/user. Scopes feed into the classification and policy engines, so a document scoped to acme/gov will only receive rules and policies applicable to that branch.
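The branch-based rule matching described above behaves like a path-prefix check. A minimal sketch (scope_applies is a hypothetical helper; the real rule-resolution logic is not documented here):

```python
def scope_applies(policy_scope: str, doc_scope: str) -> bool:
    """A policy scoped to `acme` applies to `acme/gov` and deeper,
    but a policy scoped to `acme/gov` does not apply to `acme/retail`."""
    policy = policy_scope.split("/")
    doc = doc_scope.split("/")
    return doc[: len(policy)] == policy

assert scope_applies("acme", "acme/gov")
assert scope_applies("acme/gov", "acme/gov")
assert not scope_applies("acme/gov", "acme/retail")
```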
To ingest a directory under a specific scope:
unsterwerx ingest --scope acme/gov ~/documents/government
Ingest Summary
══════════════════════════════════
Files discovered: 96
Empty (skipped): 0
Oversized (skipped): 0
──────────────────────────────
Files eligible: 96
Files ingested: 5
Duplicates: 86
Unsupported: 2
Skipped: 0
Errors: 3
Indexed (FTS5): 5
══════════════════════════════════
Notice the 86 duplicates. This directory contained files that were already ingested in Step 2 from a different path. Unsterwerx detected this through content hashing. The file names and locations were different, but the content was identical. Only 5 genuinely new documents were added, and those 5 were assigned the acme/gov scope.
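Content-hash deduplication, the mechanism behind those 86 skipped files, can be illustrated in isolation (a sketch of the technique, not Unsterwerx's code):

```python
import hashlib

seen: set[str] = set()

def is_duplicate(content: bytes) -> bool:
    """Register content by SHA-256 digest; return True if the same
    bytes were seen before, regardless of filename or path."""
    digest = hashlib.sha256(content).hexdigest()
    if digest in seen:
        return True
    seen.add(digest)
    return False

# The same bytes arriving from two different paths: the second
# occurrence is flagged as a duplicate.
assert not is_duplicate(b"quarterly report 2025")
assert is_duplicate(b"quarterly report 2025")
assert not is_duplicate(b"a genuinely new document")
```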
Note: Scope assignment is one-way. Once a document has a scope, it cannot be reassigned to a different one. Choose your scope hierarchy before bulk-ingesting.
Step 5 - Check Your Document Library
The status command gives you an overview of everything in the database.
unsterwerx status
Unsterwerx Status
══════════════════════════════════════════
Data directory: /Users/you/.unsterwerx
Total documents: 1128
Total size: 2.1 GB
Indexed (FTS5): 1042
Audit events: 103
By Status:
canonical 1042
error 36
image_only 25
unsupported 25
══════════════════════════════════════════
Four document statuses to understand:
- canonical - successfully normalized and indexed. These are searchable and ready for similarity analysis.
- error - extraction failed, usually due to encryption, corruption, or malformed archive headers. Review these with unsterwerx status errors.
- image_only - scanned PDFs that contain no extractable text. They are registered but need OCR to be fully processed.
- unsupported - legacy formats (.doc, .xls, .ppt) that are tracked but cannot be parsed.
Add --detailed to see the full breakdown by file type:
unsterwerx status --detailed
Unsterwerx Status
══════════════════════════════════════════
Data directory: /Users/you/.unsterwerx
Total documents: 1128
Total size: 2.1 GB
Indexed (FTS5): 1042
Audit events: 103
By Status:
canonical 1042
error 36
image_only 25
unsupported 25
By File Type:
csv 17
doc 14
docx 246
markdown 20
pdf 441
ppt 4
pptx 233
sql 16
txt 46
unknown 2
xls 5
xlsx 84
Similarity:
Candidate pairs: 0
Exact dupes: 0
Classification:
Active rules: 6
Classified docs: 0
══════════════════════════════════════════
This dataset contains 1,128 documents totaling 2.1 GB of original data. PDFs are the most common format (441), followed by DOCX (246) and PPTX (233). The similarity and classification sections show zeros because those analysis steps have not been run yet.
Here is the storage picture. In benchmarks against this same 2.1 GB dataset, the Universal Data Set compacts to 74 MB of canonical text (96.5% compaction), with a total database footprint of 285 MB (86.5% reduction). The original files stay where they are; Unsterwerx stores only the extracted canonical representation and its indexes.
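The quoted percentages follow directly from the sizes. A quick recomputation (the 2.1 GB figure is itself rounded, so the results land within a few tenths of the quoted values; decimal megabytes are assumed):

```python
original_mb = 2.1 * 1000   # "2.1 GB", already a rounded figure
canonical_mb = 74          # canonical text store
database_mb = 285          # total database footprint

compaction = 1 - canonical_mb / original_mb   # ~96.5% compaction
reduction = 1 - database_mb / original_mb     # ~86.5% reduction

assert abs(compaction - 0.965) < 0.01
assert abs(reduction - 0.865) < 0.01
```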
Step 6 - Enable Rich Metadata Capture
By default, Unsterwerx extracts and indexes document content but skips embedded metadata (author, creation date, software, page count). To capture metadata during ingest, add the --capture-metadata flag:
unsterwerx ingest --capture-metadata ~/documents/reports
Ingest Summary
══════════════════════════════════
Files discovered: 29
Empty (skipped): 0
Oversized (skipped): 0
──────────────────────────────
Files eligible: 29
Files ingested: 28
Duplicates: 0
Unsupported: 1
Skipped: 0
Errors: 0
Indexed (FTS5): 28
══════════════════════════════════
With metadata capture enabled, Unsterwerx runs the built-in extractors (builtin_pdf, builtin_ooxml, builtin_image) alongside the canonical extraction pass. You can then query metadata with unsterwerx metadata keys and unsterwerx metadata show <document-id>.
Note: If you already ingested documents without --capture-metadata, you can extract metadata after the fact with unsterwerx metadata extract. No re-ingestion required.
Step 7 - Script with JSON Output
Every Unsterwerx command supports --json for machine-readable output. This makes it straightforward to integrate with CI/CD pipelines, monitoring scripts, or anything that consumes structured data.
unsterwerx ingest --json ~/documents/contracts
{
"command": "ingest",
"version": "0.5.4",
"timestamp": "2026-04-14T15:23:21.636681+00:00",
"data": {
"run_id": "7f0b52fc-a77e-4020-8235-694c1de78bab",
"source_type": "local",
"status": "completed",
"counters": {
"files_discovered": 23,
"files_scanned": 23,
"files_ingested": 17,
"duplicates": 6,
"unsupported": 0,
"skipped": 0,
"empty_skipped": 0,
"oversized_skipped": 0,
"errors": 0,
"indexed": 17
}
}
}
Pipe it to jq to extract specific values:
unsterwerx ingest --json ~/documents/contracts | jq '.data.counters.files_ingested'
17
Or check for errors in a script:
errors=$(unsterwerx ingest --json ~/documents/contracts | jq '.data.counters.errors')
if [ "$errors" -gt 0 ]; then
echo "Ingest completed with $errors errors"
fi
The JSON envelope always includes command, version, and timestamp, so your scripts can verify they are parsing the expected output format.
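A consumer can validate the envelope before touching the counters. A minimal sketch in Python, assuming only the fields shown in the example output above (parse_ingest_report is a hypothetical helper):

```python
import json

def parse_ingest_report(raw: str) -> dict:
    """Check the envelope fields shown in the example, then
    return the counters dictionary."""
    report = json.loads(raw)
    for field in ("command", "version", "timestamp", "data"):
        if field not in report:
            raise ValueError(f"missing envelope field: {field}")
    if report["command"] != "ingest":
        raise ValueError(f"unexpected command: {report['command']}")
    return report["data"]["counters"]

raw = '''{"command": "ingest", "version": "0.5.4",
          "timestamp": "2026-04-14T15:23:21+00:00",
          "data": {"counters": {"files_ingested": 17, "errors": 0}}}'''
counters = parse_ingest_report(raw)
assert counters["files_ingested"] == 17 and counters["errors"] == 0
```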
Conclusion
You initialized a Shared Sandbox data directory, ingested over a thousand documents across multiple formats, previewed ingestion with dry-run, organized documents with scopes, and checked the library status. Unsterwerx compressed 2.1 GB of enterprise documents into a 74 MB canonical representation, automatically detected duplicates across different directory paths, and indexed everything for full-text search.
Your Universal Data Set is ready. Next steps:
- Run unsterwerx similarity to find near-duplicate document pairs.
- Run unsterwerx search "your query" to search across all canonical content.
- Read the next tutorial, How To Search and Compare Documents with Unsterwerx, for a deep dive into search, similarity analysis, and document diffing.