Unsterwerx

How To Monitor and Audit Your Document Operations with Unsterwerx

Every operation Unsterwerx performs on your documents is recorded in a cryptographic audit chain. Every ingest job tracks its own diagnostics. Every failure is surfaced, categorized, and actionable. This tutorial walks you through the full observability stack: viewing audit history, verifying data integrity, tracking individual documents, monitoring jobs, diagnosing failures, and handling errors that cannot be automatically resolved.

Prerequisites

Step 1 -- View the Audit Trail

The audit log records every mutation that happens inside your Shared Sandbox -- the local processing environment where Unsterwerx operates. Ingests, classifications, deduplication events, rule updates, reconstructions: all of it goes into the chain.

Run the default audit command:

bash
unsterwerx audit

This shows the 50 most recent events:

text
Audit Log (50 events)
══════════════════════════════════════════════════════════════════════
  2026-04-14T15:28:08 984 [rule_update         ] all → success 4353bb07cb32
  2026-04-14T15:28:03 983 [knowledge_dedup     ] ec5ff0f3 → success ccecd5db9551
  2026-04-14T15:28:03 982 [knowledge_dedup     ] b2767c56 → success 7458af9d391e
  2026-04-14T15:28:03 981 [knowledge_dedup     ] 4290b6b7 → success 01829667733e
  2026-04-14T15:28:03 980 [knowledge_dedup     ] cedbed37 → success 2f5b8f4b97a7
  ...
  2026-04-14T15:28:03 935 [knowledge_dedup     ] 4cf730d9 → success f47ebfc08189
══════════════════════════════════════════════════════════════════════

Each line shows a timestamp, sequence number, action type, target (a document ID or all), result, and a truncated hash. That trailing hex string is the first 12 characters of the event's hash, which links the event into the cryptographic chain -- more on that in the next step.
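
If you want to post-process the plain-text listing, each line can be split into those six fields. The following is a minimal Python sketch, assuming the column layout in the sample output stays stable. The AuditLine type and the parse_audit_line helper are illustrative names, not part of Unsterwerx.

```python
import re
from typing import NamedTuple, Optional


class AuditLine(NamedTuple):
    timestamp: str
    seq: int
    action: str
    target: str
    result: str
    hash_prefix: str


# Layout assumed from the sample output:
#   <timestamp> <sequence> [<action, space-padded>] <target> → <result> <hash prefix>
_LINE = re.compile(
    r"^\s*(?P<ts>\S+)\s+(?P<seq>\d+)\s+\[(?P<action>[^\]]+)\]\s+"
    r"(?P<target>\S+)\s+→\s+(?P<result>\S+)\s+(?P<hash>[0-9a-f]+)\s*$"
)


def parse_audit_line(line: str) -> Optional[AuditLine]:
    """Parse one event line; return None for headers, separators, and '...'."""
    m = _LINE.match(line)
    if m is None:
        return None
    return AuditLine(
        m["ts"], int(m["seq"]), m["action"].strip(),
        m["target"], m["result"], m["hash"],
    )


sample = "  2026-04-14T15:28:08 984 [rule_update         ] all → success 4353bb07cb32"
event = parse_audit_line(sample)
print(event)
```

Lines that do not match, such as the header, the separator rules, and the elided "..." row, come back as None and can be skipped.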

To limit output, use --limit:

bash
unsterwerx audit --limit 10

You can also filter by action type with --action. Action types include ingest, classify, knowledge_dedup, rule_update, reconstruct, archive, document_dismiss, and many others. Check the full list in the audit command reference.

Step 2 -- Verify Cryptographic Integrity

The audit log is not just a list. It is a hash chain. Each event contains a SHA-256 hash that incorporates the previous event's hash, forming a tamper-evident sequence. If anyone modifies or deletes an event, the chain breaks.

To verify integrity:

bash
unsterwerx audit --verify
text
Verifying audit hash chain...
Chain verified: 984 events, integrity OK

This walks every event from the first to the last and confirms that each hash correctly links to its predecessor. The entire Trusted Client-Centric Application Architecture (TCA) depends on this property: you can trust that the audit trail has not been altered after the fact.
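
To make the linking concrete, here is a minimal Python sketch of a generic SHA-256 hash chain. It is not Unsterwerx's actual event encoding: the JSON serialization, the all-zero genesis value, and the helper names are assumptions chosen for illustration. It shows why editing any event makes verification fail.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder used as prev_hash of the first event


def event_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous event's hash together with this event's content."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_event(chain: list, payload: dict) -> None:
    prev = chain[-1]["event_hash"] if chain else GENESIS
    chain.append(
        {"payload": payload, "prev_hash": prev, "event_hash": event_hash(prev, payload)}
    )


def verify_chain(chain: list) -> bool:
    """Walk from first to last event and check every link and every hash."""
    prev = GENESIS
    for event in chain:
        if event["prev_hash"] != prev:
            return False  # an event was removed or reordered
        if event["event_hash"] != event_hash(prev, event["payload"]):
            return False  # the event content was modified
        prev = event["event_hash"]
    return True


chain = []
for action in ("ingest", "classify", "knowledge_dedup"):
    append_event(chain, {"action": action, "target": "e6d22e4d"})

intact = verify_chain(chain)
chain[1]["payload"]["action"] = "archive"  # tamper with the middle event
tampered_ok = verify_chain(chain)
print(intact, tampered_ok)
```

Because each hash depends on its predecessor, a change to one event cannot be hidden without recomputing every later hash as well.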

Run this periodically, especially after upgrades or any unexpected system behavior.

Step 3 -- Track a Document's History

When you need to know exactly what happened to a specific document, filter the audit log by its ID using --target:

bash
unsterwerx audit --target e6d22e4d
text
Audit Log (3 events)
══════════════════════════════════════════════════════════════════════
  2026-04-14T15:28:03 892 [knowledge_dedup     ] e6d22e4d → success 3e1f2032ae64
  2026-04-14T15:27:55 804 [assign_scope        ] e6d22e4d → success 73d214bb373c
  2026-04-14T15:27:25 739 [classify            ] e6d22e4d → success 9719ab298536
══════════════════════════════════════════════════════════════════════

This document was classified (event 739), then assigned a scope under a classification rule (event 804), then deduplicated as part of knowledge compaction (event 892). The log lists the newest event first, so reading bottom to top gives you the document's full lifecycle through the Universal Data Set -- the normalized canonical representation that Unsterwerx maintains.

For machine-readable output, add --json:

bash
unsterwerx audit --json --limit 5
json
{
  "command": "audit",
  "version": "0.5.4",
  "timestamp": "2026-04-14T15:28:22.830349+00:00",
  "data": {
    "mode": "list",
    "filters": {
      "action": null,
      "target": null,
      "limit": 5
    },
    "events": [
      {
        "id": 984,
        "actor": "system",
        "action": "rule_update",
        "target_type": "source_hierarchy_recompute",
        "target_id": "all",
        "details": {
          "errors": 0,
          "unchanged": 1128,
          "updated": 0
        },
        "result": "success",
        "timestamp": "2026-04-14T15:28:08.163401+00:00",
        "prev_hash": "ccecd5db9551...feda6be",
        "event_hash": "4353bb07cb32...6a2a3de8"
      }
    ]
  }
}

The JSON output exposes the full hash values (abbreviated with ... in the sample above), the event details, and the prev_hash / event_hash pair that forms the chain link. This is useful for integrating audit data into external compliance or reporting systems.
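
As one possible integration check, the Python sketch below loads an audit --json envelope and confirms that each event's prev_hash equals the event_hash of the event with the preceding id. It assumes that ids are consecutive sequence numbers and that events sit under data.events as in the sample. The hash-98x strings are placeholders, not real hashes, and broken_links is a hypothetical helper name.

```python
import json


def broken_links(envelope: dict) -> list:
    """Return ids of events whose prev_hash does not match the preceding event."""
    events = sorted(envelope["data"]["events"], key=lambda e: e["id"])
    bad = []
    for older, newer in zip(events, events[1:]):
        # Only compare neighbours with consecutive ids; filtered output may skip events.
        if newer["id"] == older["id"] + 1 and newer["prev_hash"] != older["event_hash"]:
            bad.append(newer["id"])
    return bad


# Placeholder data shaped like the sample envelope (hash values are made up).
raw = """
{"command": "audit", "data": {"events": [
  {"id": 984, "action": "rule_update", "prev_hash": "hash-983", "event_hash": "hash-984"},
  {"id": 983, "action": "knowledge_dedup", "prev_hash": "hash-982", "event_hash": "hash-983"},
  {"id": 982, "action": "knowledge_dedup", "prev_hash": "hash-981", "event_hash": "hash-982"}
]}}
"""
envelope = json.loads(raw)
print(broken_links(envelope))
```

This only compares stored links and does not recompute any hash, so it complements audit --verify rather than replacing it.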

Step 4 -- Monitor Background Jobs

When you ingest documents, Unsterwerx tracks each run as a job. Use jobs list to see all recent jobs:

bash
unsterwerx jobs list
text
  ID        Type           Status      Path                              Progress  Queued At
  ------------------------------------------------------------------------------------------
  8982ec9e  import         completed   ...set/awesome-chatgpt-prompts         3/3  2026-04-14 16:32:04
  2a24bbcc  ingest         completed   ...der/unsterwerx/dataset/2021       29/29  2026-04-14 15:23:25
  7f0b52fc  ingest         completed   ...AnyLogic Training Materials       23/23  2026-04-14 15:23:16
  28054aee  ingest         completed   ...jder/unsterwerx/dataset/ABM           -  2026-04-14 15:22:34
  9096487e  ingest         completed   ...terwerx/dataset/1. bookWERX       96/96  2026-04-14 15:22:25
  3de87c17  ingest         completed   ...sterwerx/dataset/0. GOVWERX     316/316  2026-04-14 15:22:05
  f92a82b1  ingest         completed   ...der/unsterwerx/dataset/2022       33/33  2026-04-14 15:21:54
  87c38dc4  ingest         completed   ...sterwerx/dataset/Book Ideas     753/753  2026-04-14 15:21:06
  d3ba33a0  ingest         completed   ...sterwerx/dataset/accounting       22/22  2026-04-14 15:21:03
  fe536a0a  ingest         completed   ...r/unsterwerx/dataset/2-sort     163/163  2026-04-14 15:20:47

The progress column shows items processed versus total. A completed job shows matching numbers (e.g., 316/316). Jobs can also be running, paused, stopped, failed, or stale.
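
For scripting against the table, a row can be split from both ends, since the path in the middle may contain spaces. This is a minimal Python sketch that assumes the column order shown above; parse_job_row is a hypothetical helper, and a structured output format, if the jobs command offers one, would be a more robust input.

```python
def parse_job_row(row: str) -> dict:
    """Split one 'jobs list' row into fields, tolerating spaces in the path."""
    tokens = row.split()
    job_id, job_type, status = tokens[:3]
    queued_at = " ".join(tokens[-2:])   # date and time
    progress = tokens[-3]               # 'done/total' or '-'
    path = " ".join(tokens[3:-3])       # everything in between
    done = total = None
    if progress != "-":
        done, total = (int(n) for n in progress.split("/"))
    return {
        "id": job_id, "type": job_type, "status": status,
        "path": path, "done": done, "total": total, "queued_at": queued_at,
    }


rows = [
    "  7f0b52fc  ingest         completed   ...AnyLogic Training Materials       23/23  2026-04-14 15:23:16",
    "  28054aee  ingest         completed   ...jder/unsterwerx/dataset/ABM           -  2026-04-14 15:22:34",
]
jobs = [parse_job_row(r) for r in rows]
for job in jobs:
    print(job["id"], job["path"], job["done"], job["total"])
```

A job whose progress is "-" yields None for both counts, so callers can tell "no progress reported" apart from "0 of 0".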

To inspect a single job, use jobs status with the job ID or a unique prefix:

bash
unsterwerx jobs status 8982
text
Job Details
══════════════════════════════════════════════════════════════
  ID:              8982ec9e-fc81-4bb6-bf0a-4c53d768ab80
  Type:            import
  Mode:            foreground
  Status:          completed
  Input path:      /Users/frajder/unsterwerx/dataset/awesome-chatgpt-prompts
  Source type:      local
  PID:             7872
  Spec version:    1
  Resume count:    0

  Queued at:       2026-04-14 16:32:04
  Started at:      2026-04-14 16:32:04
  Heartbeat at:    2026-04-14 16:32:05
  Completed at:    2026-04-14 16:32:05

  Items total:            3
  Items imported:         3
  Items duplicate:        0
  Items unsupported:      0
  Items skipped:          0
  Items error:            0

  Import batches:
    d73ba93c
══════════════════════════════════════════════════════════════

This gives you the full picture: timestamps, item counts by disposition, the worker PID, and which import batches were created. If a job has errors, you will see them in Items error.
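
One quick sanity check on this output is to confirm that the per-disposition counts add up to the total. The Python sketch below parses the "Key: value" lines and computes the difference. That the five dispositions always sum to Items total is an assumption based on this one sample, and the helper names are hypothetical.

```python
DISPOSITIONS = ("imported", "duplicate", "unsupported", "skipped", "error")


def parse_job_details(text: str) -> dict:
    """Collect 'Key: value' pairs; lines without a value are ignored."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def unaccounted_items(fields: dict) -> int:
    """Items total minus the sum of all disposition counts (assumed to be 0)."""
    total = int(fields["Items total"])
    return total - sum(int(fields[f"Items {name}"]) for name in DISPOSITIONS)


sample = """\
Job Details
  ID:              8982ec9e-fc81-4bb6-bf0a-4c53d768ab80
  Status:          completed
  Queued at:       2026-04-14 16:32:04
  Items total:            3
  Items imported:         3
  Items duplicate:        0
  Items unsupported:      0
  Items skipped:          0
  Items error:            0
  Import batches:
    d73ba93c
"""
fields = parse_job_details(sample)
print(fields["Status"], unaccounted_items(fields))
```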

Step 5 -- Diagnose Job Failures

When a job reports errors, you need to know what failed and why. Use jobs errors to see per-file error diagnostics:

bash
unsterwerx jobs errors 3de8
text
Diagnostics for job 3de87c17 (19 errors, 5 warnings)
  Timestamp            Level    Phase       Item                            Message
  ------------------------------------------------------------------------------------------
  2026-04-14 15:22:06  error    parse       .../references/RAND_RR1600.pdf  Parse returned zero elements - treating as extraction failure
  2026-04-14 15:22:08  error    parse       ...Readiness/Blank DD 2813.pdf  Parse returned zero elements - treating as extraction failure
  2026-04-14 15:22:08  error    parse       .../Readiness/DA FORM 7655.pdf  Parse returned zero elements - treating as extraction failure
  2026-04-14 15:22:08  error    parse       ...elcome/RST policy memeo.pdf  PDF appears to be image-only (scanned) - requires OCR (3 pages)
  2026-04-14 15:22:12  error    parse       ...sses Cyberwarfare U (1).pdf  PDF appears to be image-only (scanned) - requires OCR (5 pages)
  ...

Each diagnostic includes the processing phase where the error occurred, which tells you where in the NAC (Normalized Application Container) pipeline the file failed. Every diagnostic shown in this tutorial comes from the parse phase.

Most parse-phase errors fall into two categories: files that returned zero extractable elements (blank forms, protected documents) and image-only scanned PDFs that contain no text layer.

To see warnings alongside errors, use jobs logs instead:

bash
unsterwerx jobs logs 87c3
text
Diagnostics for job 87c38dc4 (32 errors, 3 warnings)
  Timestamp            Level    Phase       Item                            Message
  ------------------------------------------------------------------------------------------
  2026-04-14 15:21:07  error    parse       ...ssignment#2_template-2.pptx  invalid Zip archive: Could not find EOCD
  2026-04-14 15:21:07  error    parse       ...ssignment#3_template-1.pptx  invalid Zip archive: Could not find EOCD
  2026-04-14 15:21:07  error    parse       ...es/2. DODWERX/1144 Form.pdf  PDF appears to be image-only (scanned) - requires OCR (6 pages)
  2026-04-14 15:21:08  error    parse       ...es/2. DODWERX/2019 CSAC.pdf  Parse returned zero elements - treating as extraction failure
  2026-04-14 15:21:28  warning  parse       ...enter_osd008412-18_r....pdf  Signature repair failed for duplicate
  2026-04-14 15:21:28  error    parse       ...--U.S.-Patent-9,921,771.pdf  PDF appears to be image-only (scanned) - requires OCR (57 pages)
  ...

This job had 32 errors and 3 warnings. The "invalid Zip archive" errors indicate corrupted OOXML files (PPTX, XLSX). The image-only errors are scanned PDFs with no text layer. Warnings flag non-fatal issues like signature repair attempts on duplicates.
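
When a job produces dozens of diagnostics, grouping them by message pattern makes the next action clearer. The Python sketch below buckets the message strings seen in this tutorial. The bucket names and the categorize helper are illustrative, and the substring matching assumes the message wording stays as shown.

```python
import re
from collections import Counter

_PAGES = re.compile(r"\((\d+) pages?\)")


def categorize(message: str) -> str:
    """Map a diagnostic message to a coarse bucket (bucket names are made up)."""
    if "image-only" in message:
        return "needs_ocr"
    if "zero elements" in message:
        return "empty_extraction"
    if "invalid Zip archive" in message:
        return "corrupt_archive"
    return "other"


def page_count(message: str):
    """Extract the page count from messages that end in '(N pages)', else None."""
    m = _PAGES.search(message)
    return int(m.group(1)) if m else None


messages = [
    "invalid Zip archive: Could not find EOCD",
    "invalid Zip archive: Could not find EOCD",
    "PDF appears to be image-only (scanned) - requires OCR (6 pages)",
    "Parse returned zero elements - treating as extraction failure",
    "Signature repair failed for duplicate",
    "PDF appears to be image-only (scanned) - requires OCR (57 pages)",
]
summary = Counter(categorize(m) for m in messages)
ocr_pages = sum(page_count(m) or 0 for m in messages if categorize(m) == "needs_ocr")
print(dict(summary), ocr_pages)
```

The page total gives a rough sense of how much material would need OCR to recover.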

Step 6 -- Handle Document Errors

After ingestion completes, some documents end up in error states. Use status errors to see all of them across every job:

bash
unsterwerx status errors
text
Stranded Documents (61 total)
══════════════════════════════════════════════════════════════
  d0d8c00b AIM-008.pdf [pdf] (error)
    Error: Parse returned zero elements - treating as extraction failure

  ae3cd159 MIT-DT_Team Assignment#3_template-1.pptx [pptx] (error)
    Error: invalid Zip archive: Could not find EOCD

  1af0420d Data Strategy Risk and Opportunity_SPAWAR.pdf [pdf] (image_only)
    Error: PDF appears to be image-only (scanned) - requires OCR (2 pages)

  48f01072 1--U.S.-Patent-9,921,771.pdf [pdf] (image_only)
    Error: PDF appears to be image-only (scanned) - requires OCR (57 pages)

  581b8f9b General Online Domain.xlsx [xlsx] (error)
    Error: Failed to open XLSX workbook: Zip error: invalid Zip archive: Could not find EOCD
  ...
══════════════════════════════════════════════════════════════
  Error: 36  |  Image-only: 25  |  Total: 61

  Retry transient errors:    unsterwerx ingest --retry-errors
  Dismiss unrecoverable:     unsterwerx status dismiss <id> --reason "..."

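The output separates documents into two categories: error (documents that failed during processing, such as zero-element parses and corrupted archives) and image_only (scanned PDFs with no text layer, which require OCR).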

The summary at the bottom gives you the exact commands for the two resolution paths.
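
To work with this listing programmatically, each two-line entry can be parsed into a record. The Python sketch below assumes the layout shown above (an 8-character hex ID, file name, bracketed type, and parenthesized status, followed by an indented Error line). parse_stranded is a hypothetical helper name.

```python
import re
from collections import Counter

_DOC = re.compile(
    r"^\s*(?P<id>[0-9a-f]{8}) (?P<name>.+) \[(?P<type>\w+)\] \((?P<status>\w+)\)\s*$"
)
_ERR = re.compile(r"^\s*Error: (?P<message>.+)$")


def parse_stranded(text: str) -> list:
    """Return one dict per stranded document with id, name, type, status, message."""
    docs, pending = [], None
    for line in text.splitlines():
        doc = _DOC.match(line)
        if doc:
            pending = doc.groupdict()
            docs.append(pending)
            continue
        err = _ERR.match(line)
        if err and pending is not None:
            pending["message"] = err["message"].strip()
            pending = None  # so the footer's 'Error: 36 | ...' line is not attached
    return docs


sample = """\
  d0d8c00b AIM-008.pdf [pdf] (error)
    Error: Parse returned zero elements - treating as extraction failure

  1af0420d Data Strategy Risk and Opportunity_SPAWAR.pdf [pdf] (image_only)
    Error: PDF appears to be image-only (scanned) - requires OCR (2 pages)

  48f01072 1--U.S.-Patent-9,921,771.pdf [pdf] (image_only)
    Error: PDF appears to be image-only (scanned) - requires OCR (57 pages)
  Error: 36  |  Image-only: 25  |  Total: 61
"""
docs = parse_stranded(sample)
by_status = Counter(d["status"] for d in docs)
print(dict(by_status))
```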

Step 7 -- Dismiss Unrecoverable Documents

Some documents cannot be processed. Scanned PDFs without a text layer will not yield extractable content until OCR support is added. Corrupted ZIP archives are unlikely to self-repair. For these, the correct action is to dismiss them: acknowledge the limitation, record a reason, and move on.

bash
unsterwerx status dismiss 1af0420d --reason "Image-only scanned PDF, no OCR available"
text
Dismissed document 1af0420d (was: image_only)
  Reason: Image-only scanned PDF, no OCR available

The dismiss operation does three things:

  1. Transitions the document status from error or image_only to dismissed
  2. Records your reason in the audit trail
  3. Excludes the document from search results, knowledge scoring, and reconstruction

This is not deletion. The document record stays in the database. If OCR support is added in a future version, you can revisit dismissed documents.

Note: Only documents in error or image_only status can be dismissed. You cannot dismiss a document that was successfully processed.
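
If many documents need dismissing, the commands can be generated for review rather than typed one by one. The Python sketch below builds the status dismiss command line shown above with safe shell quoting and refuses statuses other than error and image_only, mirroring the note. The dismiss_command helper and the "classified" status in the comment are illustrative assumptions.

```python
import shlex

DISMISSABLE = {"error", "image_only"}  # per the note above


def dismiss_command(doc_id: str, status: str, reason: str) -> str:
    """Build a shell-quoted 'unsterwerx status dismiss' command for one document."""
    if status not in DISMISSABLE:
        # e.g. a hypothetical 'classified' status would be rejected here
        raise ValueError(f"document {doc_id} has status {status!r} and cannot be dismissed")
    return shlex.join(
        ["unsterwerx", "status", "dismiss", doc_id, "--reason", reason]
    )


cmd = dismiss_command(
    "1af0420d", "image_only", "Image-only scanned PDF, no OCR available"
)
print(cmd)
```

Printing the commands first, instead of executing them directly, leaves room to check each reason before it is recorded in the audit trail.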

Step 8 -- Retry Transient Failures

Before dismissing everything, try re-processing the error documents. Some failures are transient -- a file lock, a temporary memory constraint, or a parser edge case that has since been fixed by an upgrade.

bash
unsterwerx ingest --retry-errors
text
Found 61 documents to retry.

Retry Summary
══════════════════════════════════
  Inspected:          61
  Extracted:           1
  Still failed:       35
══════════════════════════════════

  Use 'unsterwerx status errors' to see remaining failures.
  Use 'unsterwerx status dismiss <id> --reason "..."' to acknowledge.

In this case, 1 document was recovered on retry and 35 still failed. The remaining 25 of the 61 inspected documents are image-only; they were not re-extracted because they require OCR. The retry operation re-runs the full NAC pipeline for each error document and updates its status accordingly.
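
The numbers in the retry summary can be reconciled against the earlier status errors totals (Error: 36, Image-only: 25). The Python sketch below is just that arithmetic, using the figures from the two outputs; the variable names are illustrative.

```python
# Figures copied from the 'status errors' footer and the retry summary above.
error_docs, image_only_docs = 36, 25
inspected, extracted, still_failed = 61, 1, 35

# Every stranded document was inspected.
assert inspected == error_docs + image_only_docs

# Only the 'error' documents went through extraction again.
assert extracted + still_failed == error_docs

# Whatever is left over is the image-only set that needs OCR.
not_re_extracted = inspected - extracted - still_failed
print(not_re_extracted)  # 25
```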

After retrying, run status errors again to review what remains, then dismiss the truly unrecoverable ones.

Now verify the audit chain one final time to confirm all these operations were properly recorded:

bash
unsterwerx audit --verify
text
Verifying audit hash chain...
Chain verified: 1029 events, integrity OK

The chain grew from 984 to 1029 events. Every retry, every dismiss, every status change added a new link. The chain remains intact.

Conclusion

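You now have the tools to maintain full visibility into your Unsterwerx document processing pipeline: audit and audit --verify for the tamper-evident event history, audit --target for a single document's lifecycle, jobs list, jobs status, jobs errors, and jobs logs for job monitoring and diagnostics, and status errors, ingest --retry-errors, and status dismiss for resolving failed documents.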

The combination of cryptographic auditing and structured error handling is a direct implementation of the TCA pattern: the Shared Sandbox maintains operator control and trust over all data processing.

For configuring the rules that classify documents into organizational hierarchies, continue to the rules source reference.