canonical
Normalizes all documents in ingested status into the Universal Data Set. Each document is parsed, converted to a canonical representation, stored in the content-addressed store (CAS), and indexed in the FTS5 full-text search index.
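To illustrate what "content-addressed" means in the description above, here is a minimal sketch of the idea. It assumes SHA-256 digests and a two-level directory layout; neither is confirmed as what unsterwerx actually uses, and `cas_put` is a hypothetical helper, not part of the tool.

```shell
# Hypothetical helper, not part of unsterwerx: copy a file into a store
# directory under a path derived from its SHA-256 digest (assumed hash).
# Usage: cas_put <source-file> <store-dir>
cas_put() {
  cas_src="$1"
  cas_store="$2"

  # Hash the content only, so the file name does not affect the address.
  cas_digest="$(sha256sum < "$cas_src" | cut -d' ' -f1)"
  [ -n "$cas_digest" ] || return 1

  # Assumed layout: <store>/<first 2 hex chars>/<remaining hex chars>
  cas_dir="$cas_store/$(printf '%s' "$cas_digest" | cut -c1-2)"
  cas_dest="$cas_dir/$(printf '%s' "$cas_digest" | cut -c3-)"

  # Identical content maps to the same path, so a second put is a no-op.
  if [ ! -e "$cas_dest" ]; then
    mkdir -p "$cas_dir"
    cp "$cas_src" "$cas_dest"
  fi

  printf '%s\n' "$cas_digest"
}

# Demo: two files with the same bytes resolve to one stored object.
demo_dir="$(mktemp -d)"
printf 'same bytes\n' > "$demo_dir/a.txt"
printf 'same bytes\n' > "$demo_dir/b.txt"
cas_put "$demo_dir/a.txt" "$demo_dir/store"
cas_put "$demo_dir/b.txt" "$demo_dir/store"   # prints the same digest again
rm -rf "$demo_dir"
```

Because the address depends only on the bytes, storing the same canonical content twice costs nothing extra, which is one way a command like this can be safely re-run.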
Usage
bash
unsterwerx canonical
Examples
Extract pending documents
bash
unsterwerx canonical
Canonical Summary
══════════════════════════════════
Processed: 12
Extracted: 11
Failed: 1
Elements: 847
Words: 31204
══════════════════════════════════
No pending documents (idempotent)
bash
unsterwerx canonical
Canonical Summary
══════════════════════════════════
Processed: 0
Extracted: 0
Failed: 0
Elements: 0
Words: 0
══════════════════════════════════
Notes
- Only documents in ingested status are processed. Documents already in canonical or later statuses are skipped.
- Extraction options (max file size, PDF fallback, worker threads) are read from configuration. See unsterwerx config.
- If extraction fails for a document, it is marked as error or image_only and reported in the Failed count.
- This command is idempotent. Running it again after all documents are processed results in zero work.
- After running canonical, use unsterwerx search to query the indexed content.
- To rebuild the FTS5 index from existing canonical content without re-extracting, use unsterwerx reindex.
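
For scripted runs, the Failed count in the summary can be checked automatically. The sketch below assumes the summary is printed to standard output in the `Label: value` form shown in the examples; `summary_field` is a hypothetical helper, and the sample text is copied from the first example rather than produced by a live run.

```shell
# Hypothetical helper: read summary text on stdin and print the value
# of the first line that starts with "<label>:" (for example "Failed").
summary_field() {
  awk -v label="$1" '
    {
      line = $0
      sub(/^[ \t]+/, "", line)                # tolerate leading indentation
      if (!found && index(line, label ":") == 1) {
        value = substr(line, length(label) + 2)
        gsub(/[ \t\r]+/, "", value)           # strip padding around the number
        print value
        found = 1
      }
    }'
}

# Sample summary taken from the example output in this page.
sample='Canonical Summary
Processed: 12
Extracted: 11
Failed: 1
Elements: 847
Words: 31204'

failed="$(printf '%s\n' "$sample" | summary_field Failed)"
if [ "$failed" -gt 0 ]; then
  echo "canonical reported $failed failed document(s)"
fi
```

In a real pipeline the sample would be replaced by the command itself, for example `unsterwerx canonical | summary_field Failed`, again assuming the summary goes to standard output.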