How To Monitor and Audit Your Document Operations with Unsterwerx
Every operation Unsterwerx performs on your documents is recorded in a cryptographic audit chain. Every ingest job tracks its own diagnostics. Every failure is surfaced, categorized, and actionable. This tutorial walks you through the full observability stack: viewing audit history, verifying data integrity, tracking individual documents, monitoring jobs, diagnosing failures, and handling errors that cannot be automatically resolved.
Prerequisites
- Unsterwerx v0.5.4 or newer installed and configured
- A data directory with at least one completed ingest job (see the Quick Start guide)
- Familiarity with the ingest workflow and the unsterwerx reconstruct operator surface
Step 1 -- View the Audit Trail
The audit log records every mutation that happens inside your Shared Sandbox -- the local processing environment where Unsterwerx operates. Ingests, classifications, deduplication events, rule updates, reconstructions: all of it goes into the chain.
Run the default audit command:
unsterwerx audit
This shows the 50 most recent events:
Audit Log (50 events)
══════════════════════════════════════════════════════════════════════
2026-04-14T15:28:08 984 [rule_update ] all → success 4353bb07cb32
2026-04-14T15:28:03 983 [knowledge_dedup ] ec5ff0f3 → success ccecd5db9551
2026-04-14T15:28:03 982 [knowledge_dedup ] b2767c56 → success 7458af9d391e
2026-04-14T15:28:03 981 [knowledge_dedup ] 4290b6b7 → success 01829667733e
2026-04-14T15:28:03 980 [knowledge_dedup ] cedbed37 → success 2f5b8f4b97a7
...
2026-04-14T15:28:03 935 [knowledge_dedup ] 4cf730d9 → success f47ebfc08189
══════════════════════════════════════════════════════════════════════
Each line shows a timestamp, sequence number, action type, target (document ID or all), result, and a truncated hash. That trailing hex string is the start of the event's hash, which links the event into the cryptographic chain -- more on that in the next step.
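If you want to post-process the text output, the line layout described above can be parsed with a regular expression. The following is a minimal sketch: the pattern is inferred from the sample lines in this tutorial, not from a published format specification, and the names AuditLine and parse_audit_line are hypothetical. For anything durable, the --json output covered later is likely the safer input.

```python
import re
from typing import NamedTuple, Optional


class AuditLine(NamedTuple):
    timestamp: str
    sequence: int
    action: str
    target: str
    result: str
    short_hash: str


# Pattern inferred from the sample output in this tutorial (an assumption).
AUDIT_LINE = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<sequence>\d+)\s+"
    r"\[(?P<action>[^\]]+)\]\s+(?P<target>\S+)\s+→\s+"
    r"(?P<result>\S+)\s+(?P<short_hash>[0-9a-f]+)$"
)


def parse_audit_line(line: str) -> Optional[AuditLine]:
    """Return the parsed fields, or None for separators and headers."""
    m = AUDIT_LINE.match(line.strip())
    if m is None:
        return None
    return AuditLine(
        timestamp=m["timestamp"],
        sequence=int(m["sequence"]),
        action=m["action"].strip(),  # the action is space-padded inside the brackets
        target=m["target"],
        result=m["result"],
        short_hash=m["short_hash"],
    )


event = parse_audit_line(
    "2026-04-14T15:28:08 984 [rule_update ] all → success 4353bb07cb32"
)
print(event)
```

Lines that do not match, such as the header and the separator rules, come back as None, so you can filter them out while iterating over the output.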
To limit output, use --limit:
unsterwerx audit --limit 10
You can also filter by action type with --action. Action types include ingest, classify, knowledge_dedup, rule_update, reconstruct, archive, document_dismiss, and many others. Check the full list in the audit command reference.
Step 2 -- Verify Cryptographic Integrity
The audit log is not just a list. It is a hash chain. Each event contains a SHA-256 hash that incorporates the previous event's hash, forming a tamper-evident sequence. If anyone modifies or deletes an event, the chain breaks.
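To make the idea concrete, here is a small, self-contained sketch of a SHA-256 hash chain. It illustrates the general technique only. How Unsterwerx actually serializes an event before hashing is not described in this tutorial, so the JSON encoding, the all-zero GENESIS value, and the function names below are assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder used as prev_hash for the first event


def event_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous hash together with a canonical encoding of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def build_chain(payloads):
    """Link each payload to its predecessor through prev_hash."""
    chain, prev = [], GENESIS
    for payload in payloads:
        current = event_hash(prev, payload)
        chain.append({"payload": payload, "prev_hash": prev, "event_hash": current})
        prev = current
    return chain


def verify_chain(chain) -> bool:
    """Walk from first to last, recomputing every hash and checking every link."""
    prev = GENESIS
    for event in chain:
        if event["prev_hash"] != prev:
            return False
        if event_hash(prev, event["payload"]) != event["event_hash"]:
            return False
        prev = event["event_hash"]
    return True


chain = build_chain([
    {"action": "classify", "target": "e6d22e4d"},
    {"action": "assign_scope", "target": "e6d22e4d"},
    {"action": "knowledge_dedup", "target": "e6d22e4d"},
])

# Modify the middle event without recomputing its hash.
tampered = [dict(e) for e in chain]
tampered[1]["payload"] = {"action": "archive", "target": "e6d22e4d"}

# Delete the middle event entirely.
with_gap = chain[:1] + chain[2:]

print(verify_chain(chain), verify_chain(tampered), verify_chain(with_gap))
```

The untouched chain verifies, while the modified and the shortened copies do not. Note that this simple sketch cannot detect removal of the most recent events, since nothing after them points back; it says nothing about how Unsterwerx handles that case.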
To verify integrity:
unsterwerx audit --verify
Verifying audit hash chain...
Chain verified: 984 events, integrity OK
This walks every event from the first to the last and confirms that each hash correctly links to its predecessor. The entire Trusted Client-Centric Application Architecture (TCA) depends on this property: you can trust that the audit trail has not been altered after the fact.
Run this periodically, especially after upgrades or any unexpected system behavior.
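One way to run the check periodically is a small wrapper you can call from a scheduler. This is a sketch under stated assumptions: the tutorial only shows the success output, so the wrapper treats the absence of the "Chain verified" line as a failure rather than relying on exit codes or on a specific failure message. The function names are hypothetical.

```python
import re
import subprocess
from typing import Optional

# Success line as shown in this tutorial.
VERIFIED = re.compile(r"Chain verified: (\d+) events, integrity OK")


def parse_verify_output(text: str) -> Optional[int]:
    """Return the verified event count, or None if the success line is absent."""
    m = VERIFIED.search(text)
    return int(m.group(1)) if m else None


def run_verify() -> int:
    """Run the verify command and raise if the success line does not appear."""
    proc = subprocess.run(
        ["unsterwerx", "audit", "--verify"],
        capture_output=True,
        text=True,
    )
    count = parse_verify_output(proc.stdout)
    if count is None:
        raise RuntimeError(f"audit chain not verified:\n{proc.stdout}{proc.stderr}")
    return count


# Demonstration on the sample output; run_verify() is not called here.
sample = "Verifying audit hash chain...\nChain verified: 984 events, integrity OK\n"
print(parse_verify_output(sample))
```

Calling run_verify() from a scheduled job and alerting on the exception gives you a basic integrity monitor. Storing the returned count between runs would also let you notice if the number of events ever goes down.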
Step 3 -- Track a Document's History
When you need to know exactly what happened to a specific document, filter the audit log by its ID using --target:
unsterwerx audit --target e6d22e4d
Audit Log (3 events)
══════════════════════════════════════════════════════════════════════
2026-04-14T15:28:03 892 [knowledge_dedup ] e6d22e4d → success 3e1f2032ae64
2026-04-14T15:27:55 804 [assign_scope ] e6d22e4d → success 73d214bb373c
2026-04-14T15:27:25 739 [classify ] e6d22e4d → success 9719ab298536
══════════════════════════════════════════════════════════════════════
This document was classified (event 739), then assigned a scope under a classification rule (event 804), then deduplicated as part of knowledge compaction (event 892). Reading bottom to top gives you the document's full lifecycle through the Universal Data Set -- the normalized canonical representation that Unsterwerx maintains.
For machine-readable output, add --json:
unsterwerx audit --json --limit 5
{
  "command": "audit",
  "version": "0.5.4",
  "timestamp": "2026-04-14T15:28:22.830349+00:00",
  "data": {
    "mode": "list",
    "filters": {
      "action": null,
      "target": null,
      "limit": 5
    },
    "events": [
      {
        "id": 984,
        "actor": "system",
        "action": "rule_update",
        "target_type": "source_hierarchy_recompute",
        "target_id": "all",
        "details": {
          "errors": 0,
          "unchanged": 1128,
          "updated": 0
        },
        "result": "success",
        "timestamp": "2026-04-14T15:28:08.163401+00:00",
        "prev_hash": "ccecd5db9551...feda6be",
        "event_hash": "4353bb07cb32...6a2a3de8"
      }
    ]
  }
}
The JSON output exposes the full hash values (abbreviated with ... in the sample above), the event details, and the prev_hash / event_hash pair that forms the chain link. This is useful for integrating audit data into external compliance or reporting systems.
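As a starting point for such an integration, the sketch below loads a report and derives two simple views: a count of events per action and a list of events whose result is not success. The embedded sample is a trimmed copy of the output shown in this tutorial, with the hashes left abbreviated as they appear there. The helper names are hypothetical, and since the tutorial does not show what a non-success result looks like, the filter simply compares against the string "success".

```python
import json
from collections import Counter

# Trimmed copy of the sample report from this tutorial.
sample = """
{
  "command": "audit",
  "version": "0.5.4",
  "data": {
    "mode": "list",
    "events": [
      {
        "id": 984,
        "actor": "system",
        "action": "rule_update",
        "target_id": "all",
        "result": "success",
        "prev_hash": "ccecd5db9551...feda6be",
        "event_hash": "4353bb07cb32...6a2a3de8"
      }
    ]
  }
}
"""


def actions_by_count(report: dict) -> Counter:
    """Count how many events of each action type the report contains."""
    return Counter(e["action"] for e in report["data"]["events"])


def non_success_events(report: dict) -> list:
    """Return the events whose result is anything other than 'success'."""
    return [e for e in report["data"]["events"] if e["result"] != "success"]


report = json.loads(sample)
print(actions_by_count(report))
print(non_success_events(report))
```

In practice you would feed the functions the parsed stdout of unsterwerx audit --json instead of the embedded sample.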
Step 4 -- Monitor Background Jobs
When you ingest documents, Unsterwerx tracks each run as a job. Use jobs list to see all recent jobs:
unsterwerx jobs list
ID Type Status Path Progress Queued At
------------------------------------------------------------------------------------------
8982ec9e import completed ...set/awesome-chatgpt-prompts 3/3 2026-04-14 16:32:04
2a24bbcc ingest completed ...der/unsterwerx/dataset/2021 29/29 2026-04-14 15:23:25
7f0b52fc ingest completed ...AnyLogic Training Materials 23/23 2026-04-14 15:23:16
28054aee ingest completed ...jder/unsterwerx/dataset/ABM - 2026-04-14 15:22:34
9096487e ingest completed ...terwerx/dataset/1. bookWERX 96/96 2026-04-14 15:22:25
3de87c17 ingest completed ...sterwerx/dataset/0. GOVWERX 316/316 2026-04-14 15:22:05
f92a82b1 ingest completed ...der/unsterwerx/dataset/2022 33/33 2026-04-14 15:21:54
87c38dc4 ingest completed ...sterwerx/dataset/Book Ideas 753/753 2026-04-14 15:21:06
d3ba33a0 ingest completed ...sterwerx/dataset/accounting 22/22 2026-04-14 15:21:03
fe536a0a ingest completed ...r/unsterwerx/dataset/2-sort 163/163 2026-04-14 15:20:47
The progress column shows items processed versus total. A completed job shows matching numbers (e.g., 316/316). Jobs can also be running, paused, stopped, failed, or stale.
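If you script against the job list, the progress cell can be split into its two numbers. The sketch below is illustrative only. The listing above also contains a job whose progress is shown as -, and because this tutorial does not say what that means, the helper returns None for it rather than guessing. The function names are hypothetical.

```python
from typing import Optional, Tuple


def parse_progress(cell: str) -> Optional[Tuple[int, int]]:
    """Parse a progress cell such as '316/316'; return None for '-' or other text."""
    done, sep, total = cell.strip().partition("/")
    if not sep or not done.isdigit() or not total.isdigit():
        return None
    return int(done), int(total)


def is_fully_processed(cell: str) -> bool:
    """True only when both numbers are present and equal."""
    progress = parse_progress(cell)
    return progress is not None and progress[0] == progress[1]


for cell in ["3/3", "316/316", "-"]:
    print(cell, parse_progress(cell), is_fully_processed(cell))
```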
To inspect a single job, use jobs status with the job ID or a unique prefix:
unsterwerx jobs status 8982
Job Details
══════════════════════════════════════════════════════════════
ID: 8982ec9e-fc81-4bb6-bf0a-4c53d768ab80
Type: import
Mode: foreground
Status: completed
Input path: /Users/frajder/unsterwerx/dataset/awesome-chatgpt-prompts
Source type: local
PID: 7872
Spec version: 1
Resume count: 0
Queued at: 2026-04-14 16:32:04
Started at: 2026-04-14 16:32:04
Heartbeat at: 2026-04-14 16:32:05
Completed at: 2026-04-14 16:32:05
Items total: 3
Items imported: 3
Items duplicate: 0
Items unsupported: 0
Items skipped: 0
Items error: 0
Import batches:
d73ba93c
══════════════════════════════════════════════════════════════
This gives you the full picture: timestamps, item counts by disposition, the worker PID, and which import batches were created. If a job has errors, you will see them in Items error.
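The job details are printed as Key: value lines, which makes them easy to fold into a dictionary for automated checks. The following is a rough sketch based only on the layout shown above; the function names are hypothetical, and the rule used to flag a job (status other than completed, or a non-zero Items error) is an assumption about what you might want to alert on.

```python
def parse_job_details(text: str) -> dict:
    """Collect 'Key: value' lines into a dict; lines without a value are skipped."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps keep their own colons.
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def needs_attention(fields: dict) -> bool:
    """Flag a job that did not complete or that recorded item errors."""
    return (
        fields.get("Status") != "completed"
        or int(fields.get("Items error", "0")) > 0
    )


# Excerpt of the job details shown in this tutorial.
sample = """\
ID: 8982ec9e-fc81-4bb6-bf0a-4c53d768ab80
Status: completed
Queued at: 2026-04-14 16:32:04
Items total: 3
Items imported: 3
Items error: 0
Import batches:
d73ba93c
"""

fields = parse_job_details(sample)
print(fields["Queued at"], needs_attention(fields))
```

The Import batches heading and the batch ID beneath it are skipped by this parser, because neither is a complete Key: value pair; extend it if you need the batch list.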
Step 5 -- Diagnose Job Failures
When a job reports errors, you need to know what failed and why. Use jobs errors to see per-file error diagnostics:
unsterwerx jobs errors 3de8
Diagnostics for job 3de87c17 (19 errors, 5 warnings)
Timestamp Level Phase Item Message
------------------------------------------------------------------------------------------
2026-04-14 15:22:06 error parse .../references/RAND_RR1600.pdf Parse returned zero elements - treating as extraction failure
2026-04-14 15:22:08 error parse ...Readiness/Blank DD 2813.pdf Parse returned zero elements - treating as extraction failure
2026-04-14 15:22:08 error parse .../Readiness/DA FORM 7655.pdf Parse returned zero elements - treating as extraction failure
2026-04-14 15:22:08 error parse ...elcome/RST policy memeo.pdf PDF appears to be image-only (scanned) - requires OCR (3 pages)
2026-04-14 15:22:12 error parse ...sses Cyberwarfare U (1).pdf PDF appears to be image-only (scanned) - requires OCR (5 pages)
...
Each diagnostic includes the processing phase where the error occurred. The phase tells you where in the NAC (Normalized Application Container) pipeline the file failed. Common phases:
- scan -- file discovery and size filtering
- parse -- format-specific parsing (PDF, DOCX, XLSX, etc.)
- canonical -- extraction into the Universal Data Set format
Most parse-phase errors fall into two categories: files that returned zero extractable elements (blank forms, protected documents) and image-only scanned PDFs that contain no text layer.
To see warnings alongside errors, use jobs logs instead:
unsterwerx jobs logs 87c3
Diagnostics for job 87c38dc4 (32 errors, 3 warnings)
Timestamp Level Phase Item Message
------------------------------------------------------------------------------------------
2026-04-14 15:21:07 error parse ...ssignment#2_template-2.pptx invalid Zip archive: Could not find EOCD
2026-04-14 15:21:07 error parse ...ssignment#3_template-1.pptx invalid Zip archive: Could not find EOCD
2026-04-14 15:21:07 error parse ...es/2. DODWERX/1144 Form.pdf PDF appears to be image-only (scanned) - requires OCR (6 pages)
2026-04-14 15:21:08 error parse ...es/2. DODWERX/2019 CSAC.pdf Parse returned zero elements - treating as extraction failure
2026-04-14 15:21:28 warning parse ...enter_osd008412-18_r....pdf Signature repair failed for duplicate
2026-04-14 15:21:28 error parse ...--U.S.-Patent-9,921,771.pdf PDF appears to be image-only (scanned) - requires OCR (57 pages)
...
This job had 32 errors and 3 warnings. The "invalid Zip archive" errors indicate corrupted OOXML files (PPTX, XLSX). The image-only errors are scanned PDFs with no text layer. Warnings flag non-fatal issues like signature repair attempts on duplicates.
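When a job produces dozens of diagnostics, grouping them by cause helps you decide what to retry and what to dismiss. The sketch below buckets messages by matching the phrases that appear in the sample output above. The category names are made up for this example, and the patterns cover only the messages shown in this tutorial, so anything else lands in other.

```python
import re
from collections import Counter

# Phrases taken from the diagnostics shown in this tutorial; the category
# names on the left are hypothetical labels.
CATEGORIES = [
    ("image_only", re.compile(r"image-only \(scanned\)")),
    ("zero_elements", re.compile(r"Parse returned zero elements")),
    ("corrupt_archive", re.compile(r"invalid Zip archive")),
]


def categorize(message: str) -> str:
    """Return the first matching category, or 'other' if none matches."""
    for name, pattern in CATEGORIES:
        if pattern.search(message):
            return name
    return "other"


messages = [
    "invalid Zip archive: Could not find EOCD",
    "invalid Zip archive: Could not find EOCD",
    "PDF appears to be image-only (scanned) - requires OCR (6 pages)",
    "Parse returned zero elements - treating as extraction failure",
    "Signature repair failed for duplicate",
    "PDF appears to be image-only (scanned) - requires OCR (57 pages)",
]

counts = Counter(categorize(m) for m in messages)
print(counts)
```

For the six sample messages this yields two corrupt archives, two image-only PDFs, one zero-element parse, and one uncategorized warning.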
Step 6 -- Handle Document Errors
After ingestion completes, some documents end up in error states. Use status errors to see all of them across every job:
unsterwerx status errors
Stranded Documents (61 total)
══════════════════════════════════════════════════════════════
d0d8c00b AIM-008.pdf [pdf] (error)
Error: Parse returned zero elements - treating as extraction failure
ae3cd159 MIT-DT_Team Assignment#3_template-1.pptx [pptx] (error)
Error: invalid Zip archive: Could not find EOCD
1af0420d Data Strategy Risk and Opportunity_SPAWAR.pdf [pdf] (image_only)
Error: PDF appears to be image-only (scanned) - requires OCR (2 pages)
48f01072 1--U.S.-Patent-9,921,771.pdf [pdf] (image_only)
Error: PDF appears to be image-only (scanned) - requires OCR (57 pages)
581b8f9b General Online Domain.xlsx [xlsx] (error)
Error: Failed to open XLSX workbook: Zip error: invalid Zip archive: Could not find EOCD
...
══════════════════════════════════════════════════════════════
Error: 36 | Image-only: 25 | Total: 61
Retry transient errors: unsterwerx ingest --retry-errors
Dismiss unrecoverable: unsterwerx status dismiss <id> --reason "..."
The output separates documents into two categories:
- error (36 documents) -- extraction failures, corrupted archives, zero-element parses. Some of these may succeed on retry if the underlying issue was transient.
- image_only (25 documents) -- scanned PDFs containing only images with no text layer. These require OCR processing, which Unsterwerx does not yet support.
The summary at the bottom gives you the exact commands for the two resolution paths.
Step 7 -- Dismiss Unrecoverable Documents
Some documents cannot be processed. Scanned PDFs without a text layer will not yield extractable content until OCR support is added. Corrupted ZIP archives are unlikely to self-repair. For these, the correct action is to dismiss them: acknowledge the limitation, record a reason, and move on.
unsterwerx status dismiss 1af0420d --reason "Image-only scanned PDF, no OCR available"
Dismissed document 1af0420d (was: image_only)
Reason: Image-only scanned PDF, no OCR available
The dismiss operation does three things:
- Transitions the document status from error or image_only to dismissed
- Records your reason in the audit trail
- Excludes the document from search results, knowledge scoring, and reconstruction
This is not deletion. The document record stays in the database. If OCR support is added in a future version, you can revisit dismissed documents.
Note: Only documents in error or image_only status can be dismissed. You cannot dismiss a document that was successfully processed.
Step 8 -- Retry Transient Failures
Before dismissing everything, try re-processing the error documents. Some failures are transient -- a file lock, a temporary memory constraint, or a parser edge case that has since been fixed by an upgrade.
unsterwerx ingest --retry-errors
Found 61 documents to retry.
Retry Summary
══════════════════════════════════
Inspected: 61
Extracted: 1
Still failed: 35
══════════════════════════════════
Use 'unsterwerx status errors' to see remaining failures.
Use 'unsterwerx status dismiss <id> --reason "..."' to acknowledge.
In this case, 1 of the 36 error documents was recovered on retry and 35 still failed. The remaining 25 documents are image-only and were not retried, since they require OCR. The retry operation re-runs the full NAC pipeline for each error document and updates its status accordingly.
After retrying, run status errors again to review what remains, then dismiss the truly unrecoverable ones.
Now verify the audit chain one final time to confirm all these operations were properly recorded:
unsterwerx audit --verify
Verifying audit hash chain...
Chain verified: 1029 events, integrity OK
The chain grew from 984 to 1029 events. Every retry, every dismiss, every status change added a new link. The chain remains intact.
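The numbers reported across Steps 6 through 8 can be reconciled with a few lines of arithmetic. This is only a consistency check on the figures printed in this tutorial; the one assumption, marked in the code, is that no document changed status other than through the retry.

```python
# Figures from the status errors summary.
total_stranded = 61
error = 36
image_only = 25
assert error + image_only == total_stranded

# Figures from the retry summary.
extracted = 1
still_failed = 35
assert error - extracted == still_failed

# Assumption: nothing else changed status, so this many documents are
# left to review or dismiss after the retry.
remaining = still_failed + image_only
print(remaining)

# Audit events added between the two verify runs.
new_events = 1029 - 984
print(new_events)
```

That leaves 60 documents to review, and 45 new audit events recorded between the first and the final verification.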
Conclusion
You now have the tools to maintain full visibility into your Unsterwerx document processing pipeline:
- Audit trail gives you a tamper-evident history of every operation performed on every document in the Universal Data Set
- Hash chain verification proves that history has not been altered
- Document targeting lets you trace any single document's lifecycle
- Job monitoring shows progress and status across all ingest runs
- Job diagnostics pinpoint exactly which files failed and in which NAC processing phase
- Error handling provides a structured workflow: review, retry, then dismiss
The combination of cryptographic auditing and structured error handling is a direct implementation of the TCA pattern: the Shared Sandbox maintains operator control and trust over all data processing.
For configuring the rules that classify documents into organizational hierarchies, continue to the rules source reference.