canonical

Normalizes all documents in ingested status into the Universal Data Set. Each document is parsed, converted to a canonical representation, stored in the content-addressed store (CAS), and indexed in the FTS5 full-text search index.
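
To make the term "content-addressed store" concrete, here is a minimal shell sketch of the idea: content is saved under a name derived from its own hash, so storing identical content twice results in a single object. This is an illustration only. The SHA-256 digest, the two-character shard directories, and the helper names `cas_put` and `cas_get` are assumptions made for this example and are not taken from unsterwerx itself. It also assumes `sha256sum` (GNU coreutils) is available.

```bash
# Toy content-addressed store. Hypothetical layout, not unsterwerx's real one.

# cas_put STORE_DIR FILE
# Copies FILE into STORE_DIR under a path derived from its SHA-256 digest
# and prints the digest. Content that is already stored is not copied again.
cas_put() {
  local store file digest shard dest
  store=$1
  file=$2
  digest=$(sha256sum "$file" | cut -d' ' -f1) || return 1
  # Shard by the first two hex characters to keep directories small (assumed).
  shard=$(printf '%s' "$digest" | cut -c1-2)
  dest="$store/$shard/$digest"
  if [ ! -e "$dest" ]; then
    mkdir -p "$store/$shard" || return 1
    cp "$file" "$dest" || return 1
  fi
  printf '%s\n' "$digest"
}

# cas_get STORE_DIR DIGEST
# Prints the stored content for DIGEST to stdout.
cas_get() {
  local store digest shard
  store=$1
  digest=$2
  shard=$(printf '%s' "$digest" | cut -c1-2)
  cat "$store/$shard/$digest"
}
```

Because the object name depends only on the content, two files with identical bytes map to the same stored object, which is one reason re-running a content-addressed pipeline can be cheap.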

Usage

```bash
unsterwerx canonical
```

Examples

Extract pending documents

```bash
unsterwerx canonical
```

```text
Canonical Summary
══════════════════════════════════
  Processed:        12
  Extracted:        11
  Failed:            1
  Elements:        847
  Words:         31204
══════════════════════════════════
```

No pending documents (idempotent)

```bash
unsterwerx canonical
```

```text
Canonical Summary
══════════════════════════════════
  Processed:         0
  Extracted:         0
  Failed:            0
  Elements:          0
  Words:             0
══════════════════════════════════
```
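
For scripting, the summary can be read field by field. The sketch below assumes the summary is printed to standard output in the layout shown in the examples. The helper names `summary_field` and `canonical_ok` are hypothetical, and nothing here relies on the command's exit code, which this page does not document.

```bash
# summary_field NAME
# Reads a "Canonical Summary" block on stdin and prints the number that
# follows "NAME:" (for example Processed, Extracted, Failed).
summary_field() {
  awk -v name="$1:" '$1 == name { print $2; exit }'
}

# canonical_ok
# Succeeds only if the summary on stdin contains a Failed field equal to 0.
# A missing Failed field is treated as a failure.
canonical_ok() {
  local failed
  failed=$(summary_field Failed)
  [ -n "$failed" ] && [ "$failed" -eq 0 ]
}

# Possible use in a script (assumes unsterwerx is on PATH):
#   summary=$(unsterwerx canonical)
#   printf '%s\n' "$summary" | canonical_ok || echo "some documents failed" >&2
```

The same helper can confirm idempotency: after a second run, `summary_field Processed` should print `0`.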

Notes

Only documents in ingested status are processed. Documents already in canonical or later statuses are skipped.
Extraction options (max file size, PDF fallback, worker threads) are read from configuration. See unsterwerx config.
If extraction fails for a document, that document is marked as error or image_only and counted under Failed in the summary.
This command is idempotent. Once every ingested document has been processed, running it again does no work and reports zero in every summary field.
After running canonical, use unsterwerx search to query the indexed content.
To rebuild the FTS5 index from existing canonical content without re-extracting, use unsterwerx reindex.