metadata
Inspects rich metadata extracted from ingested documents. Extractors run during ingestion to pull native properties out of PDF, OOXML (DOCX/XLSX/PPTX), and image files. Raw properties are then normalized into semantic facts through concept rules, giving you a uniform vocabulary across file types.
Five subcommands cover different angles: per-document inspection, corpus-wide key coverage, recurring canonical values, document lookup by semantic fact, and backfill extraction for documents ingested before metadata support was added.
Usage
unsterwerx metadata <SUBCOMMAND>
Subcommands
| Subcommand | Description |
|---|---|
| show | Display metadata and semantic facts for a single document |
| keys | List distinct metadata keys across the corpus with coverage stats |
| values | Show recurring canonical values for a concept key or concept family |
| find | Return documents whose semantic facts match a canonical value or regex |
| extract | Run metadata extraction on already-ingested documents |
metadata show
Prints the raw extraction results and derived semantic facts for one document. Useful for checking what an extractor actually pulled from a file and how concept rules mapped those properties.
unsterwerx metadata show [OPTIONS] <ID>
Arguments
| Argument | Required | Description |
|---|---|---|
| ID | Yes | Document ID or unique prefix |
Options
| Option | Type | Default | Description |
|---|---|---|---|
| --extractor | string | all | Filter results to a specific extractor (e.g., builtin_pdf) |
| --json | flag | | Output as JSON |
Example
unsterwerx metadata show a1b2
Document: a1b2c3d4-5e6f-7890-abcd-ef1234567890
Extractions:
builtin_pdf (v1.0) [pdf] status=ok
pdf:Producer: LibreOffice 7.5
pdf:Creator: Writer
pdf:CreationDate: 2024-08-14T09:31:22Z
Semantic Facts:
[document_time]
document_created_at = 2024-08-14 09:31:22 (confidence: 1.00, raw: pdf:CreationDate=2024-08-14T09:31:22Z)
[origin_environment]
origin_software_name = libreoffice 7.5 (confidence: 0.90, raw: pdf:Producer=LibreOffice 7.5)
unsterwerx metadata show a1b2 --json
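The ID argument accepts a unique prefix, as in the a1b2 example above. A minimal sketch of how such a prefix could be resolved is shown here; the function name resolve_document_id and the error behaviour are assumptions for illustration, not the tool's actual implementation.

```python
def resolve_document_id(prefix, known_ids):
    """Return the single known ID that starts with prefix (hypothetical helper)."""
    matches = [doc_id for doc_id in known_ids if doc_id.startswith(prefix)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise LookupError(f"no document matches prefix {prefix!r}")
    # More than one hit: the prefix is not unique, so refuse to guess.
    raise LookupError(f"prefix {prefix!r} is ambiguous ({len(matches)} matches)")
```

In this sketch a prefix that matches several documents is rejected instead of silently picking one, which is the usual reason for requiring the prefix to be unique.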
metadata keys
Shows which metadata keys exist across the corpus and how many documents carry each one. The coverage percentage tells you how widespread a given key is, which helps when deciding whether to build concept rules against it.
unsterwerx metadata keys [OPTIONS]
Options
| Option | Type | Default | Description |
|---|---|---|---|
| --file-type | string | all | Filter by file type (e.g., pdf, docx, xlsx) |
| --extractor | string | all | Filter by extractor name |
| --json | flag | | Output as JSON |
How coverage is calculated
The denominator depends on which filters you pass:
- With --file-type: documents of that file type
- With --extractor only: documents with at least one OK extraction from that extractor
- Neither: total document count
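The denominator rules above can be sketched in a few lines of Python. The Doc class, the coverage function, and the ooxml:Application key are hypothetical names used only for illustration; the real storage model is not described in this document.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for one ingested document."""
    file_type: str
    # Maps extractor name -> set of raw keys from an OK extraction.
    ok_extractions: dict = field(default_factory=dict)


def coverage(docs, key, file_type=None, extractor=None):
    """Return (docs carrying key, denominator, percent) for one metadata key."""
    if file_type is not None:
        # --file-type given: denominator is every document of that type.
        pool = [d for d in docs if d.file_type == file_type]
    elif extractor is not None:
        # --extractor only: documents with an OK extraction from it.
        pool = [d for d in docs if extractor in d.ok_extractions]
    else:
        # No filters: total document count.
        pool = list(docs)

    def carries(doc):
        names = [extractor] if extractor is not None else list(doc.ok_extractions)
        return any(key in doc.ok_extractions.get(n, set()) for n in names)

    hits = sum(1 for d in pool if carries(d))
    percent = 100.0 * hits / len(pool) if pool else 0.0
    return hits, len(pool), percent


docs = [
    Doc("pdf", {"builtin_pdf": {"pdf:Producer", "pdf:PageCount"}}),
    Doc("pdf", {"builtin_pdf": {"pdf:PageCount"}}),
    Doc("pdf"),  # no OK extraction yet
    Doc("docx", {"builtin_ooxml": {"ooxml:Application"}}),
]

print(coverage(docs, "pdf:Producer", file_type="pdf"))          # denominator: 3 PDFs
print(coverage(docs, "pdf:Producer", extractor="builtin_pdf"))  # denominator: 2 extracted
print(coverage(docs, "pdf:Producer"))                           # denominator: 4 documents
```

The same key therefore reports a different percentage depending on the filters, so coverage figures are only comparable when they were produced with the same options.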
Example
unsterwerx metadata keys --file-type pdf
EXTRACTOR KEY DOCS COVERAGE
------------------------------------------------------------------------
builtin_pdf pdf:Producer 842 71.1%
builtin_pdf pdf:Creator 756 63.9%
builtin_pdf pdf:CreationDate 698 59.0%
builtin_pdf pdf:ModDate 691 58.4%
builtin_pdf pdf:PageCount 1184 100.0%
metadata values
Inspects the canonical values that concept rules have produced. You can scope by concept key or concept family, then narrow further by file type, extractor, or a confidence floor. The text view shows total, usable, low-confidence, and suppressed counts so you can see whether a value is broadly trustworthy or mostly noise.
unsterwerx metadata values [OPTIONS]
At least one of --concept-key or --concept-family is required.
Options
| Option | Type | Default | Description |
|---|---|---|---|
| --concept-key | string | | Inspect a specific concept key (e.g., origin_software_name) |
| --concept-family | string | | Inspect a concept family (e.g., origin_environment) |
| --min-docs | integer | 1 | Only include values found in at least this many documents |
| --file-type | string | all | Only include facts from this file type |
| --extractor | string | all | Only include facts produced by this extractor |
| --min-confidence | float | 0.0 | Drop facts below this confidence floor |
| --include-suppressed | flag | false | Keep rows whose only contributors are facet-suppressed facts |
| --json | flag | | Output as JSON |
Example
unsterwerx metadata values --concept-key origin_software_name --min-docs 5
Concept key: origin_software_name
CANONICAL VALUE TOTAL USABLE LOWCF SUPP FILE TYPES
------------------------------------------------------------------------------------------
microsoft office word 312 312 0 0 docx
libreoffice 198 194 4 0 pdf, docx
google docs 47 47 0 0 docx, pdf
adobe acrobat 31 29 2 0 pdf
unsterwerx metadata values --concept-family origin_environment
Concept family: origin_environment
CONCEPT KEY CANONICAL VALUE TOTAL USABLE LOWCF SUPP FILE TYPES
------------------------------------------------------------------------------------------------------------------------
origin_software_name microsoft office word 312 312 0 0 docx
origin_software_name libreoffice 198 194 4 0 pdf, docx
origin_software_version 16.0000 162 162 0 0 docx
origin_software_component pdfium 13 13 0 0 pdf
If any metadata-bearing documents are stale, metadata values excludes them and prints a notice pointing at unsterwerx rules metadata rebuild --all.
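One way to read the TOTAL, USABLE, LOWCF, and SUPP columns is as a per-value tally of facts by state. The sketch below is an illustration under stated assumptions: the Fact class, the 0.5 low-confidence cut-off, and counting per fact instead of per document are all hypothetical and not taken from the tool.

```python
from collections import defaultdict
from dataclasses import dataclass

LOW_CONFIDENCE_BELOW = 0.5  # assumed cut-off, not taken from the tool


@dataclass
class Fact:
    canonical_value: str
    confidence: float
    suppressed: bool = False


def tally(facts, min_confidence=0.0, include_suppressed=False):
    """Group facts by canonical value into total/usable/lowcf/supp counts."""
    rows = defaultdict(lambda: {"total": 0, "usable": 0, "lowcf": 0, "supp": 0})
    for fact in facts:
        if fact.confidence < min_confidence:
            continue  # mirrors --min-confidence
        row = rows[fact.canonical_value]
        row["total"] += 1
        if fact.suppressed:
            row["supp"] += 1
        elif fact.confidence < LOW_CONFIDENCE_BELOW:
            row["lowcf"] += 1
        else:
            row["usable"] += 1
    if not include_suppressed:
        # Drop rows whose only contributors are suppressed facts.
        rows = {v: r for v, r in rows.items() if r["supp"] < r["total"]}
    return dict(rows)
```

Under this reading, each row satisfies TOTAL = USABLE + LOWCF + SUPP, which matches the example output above (for instance 198 = 194 + 4 + 0 for libreoffice).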
metadata find
Finds documents whose semantic facts match a literal canonical value or a regex pattern. Exact-value mode canonicalizes the input and resolves metadata aliases before querying, which makes it the most direct way to answer questions such as "show me every document whose author, software, or date matches this concept."
unsterwerx metadata find [OPTIONS]
Exactly one of --value or --value-pattern is required.
Options
| Option | Type | Default | Description |
|---|---|---|---|
| --concept-key | string | required | Concept key to search within |
| --value | string | | Literal value to canonicalize and match |
| --value-pattern | regex | | Regex matched against stored canonical values |
| --file-type | string | all | Restrict matches to this file type |
| --extractor | string | all | Restrict matches to this extractor |
| --min-confidence | float | 0.0 | Drop facts below this confidence floor |
| --match-quality | string | usable | One of usable, any, or low-confidence |
| --limit | integer | 50 | Stop after this many matching documents |
| --json | flag | | Output as JSON |
Example
unsterwerx metadata find \
--concept-key origin_software_name \
--value "Microsoft Office Word"
Matches for origin_software_name = microsoft office word:
Acquisition-Plan.docx (570c62fd-...)
fact#18 origin_software_name=microsoft office word (confidence 1.00)
PM-Guidebook.docx (8e62a1fb-...)
fact#44 origin_software_name=microsoft office word (confidence 1.00)
unsterwerx metadata find \
--concept-key document_author \
--value-pattern "(?i)whetsel|freimanis"
Use --match-quality any when you want low-confidence matches included, or --match-quality low-confidence to audit only noisy facts.
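The difference between --value and --value-pattern can be sketched as follows. This is a simplified model, not the tool's code: the ALIASES table, the canonicalize function (casefold plus whitespace collapsing), and the use of an unanchored regex search are all assumptions made for the example.

```python
import re

# Hypothetical alias table: canonicalized spelling -> preferred canonical value.
ALIASES = {"ms word": "microsoft office word"}


def canonicalize(value):
    """Assumed canonical form: casefolded with whitespace collapsed."""
    return " ".join(value.casefold().split())


def find(facts, value=None, value_pattern=None):
    """facts: iterable of (document_name, stored_canonical_value) pairs."""
    if (value is None) == (value_pattern is None):
        raise ValueError("exactly one of value or value_pattern is required")
    if value is not None:
        # Exact-value mode: canonicalize the input, then follow aliases.
        target = canonicalize(value)
        target = ALIASES.get(target, target)
        return [doc for doc, stored in facts if stored == target]
    # Pattern mode: the regex runs against stored canonical values as-is.
    pattern = re.compile(value_pattern)
    return [doc for doc, stored in facts if pattern.search(stored)]
```

The practical consequence is that exact-value mode tolerates differences in case and spelling on the input side, while a pattern has to be written against the canonical form that is actually stored.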
metadata extract
Runs extractors on documents that were ingested before metadata support was available, or re-runs them when you want fresh results. Without --force, documents that already have an OK extraction are skipped.
unsterwerx metadata extract [OPTIONS]
Options
| Option | Type | Default | Description |
|---|---|---|---|
| --file-type | string | all | Only process documents of this file type |
| --document | string | | Only process a specific document (ID or prefix) |
| --force | flag | | Re-extract even if results already exist |
| --dry-run | flag | | Show what would be extracted without writing |
| --json | flag | | Output as JSON |
File types are re-detected from actual bytes during extraction, not from the database value. This guards against stale file-type records.
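Detecting a file type from bytes usually means checking the leading signature ("magic number") of the file. The sketch below uses the well-known signatures for PDF, PNG, JPEG, and ZIP; the function names and the fallback policy are assumptions for illustration, and the tool's own detection may differ.

```python
from typing import Optional

# Well-known leading byte signatures.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"PK\x03\x04", "zip"),  # DOCX, XLSX, and PPTX are ZIP containers
]

OOXML_TYPES = {"docx", "xlsx", "pptx"}


def sniff_file_type(head: bytes) -> Optional[str]:
    """Guess a file type from the first bytes of a file, or None if unknown."""
    for magic, name in SIGNATURES:
        if head.startswith(magic):
            return name
    return None


def effective_file_type(head: bytes, recorded: str) -> str:
    """Prefer the sniffed type over the recorded one (assumed policy)."""
    sniffed = sniff_file_type(head)
    if sniffed is None:
        return recorded  # nothing recognized: keep the recorded value
    if sniffed == "zip":
        # Telling DOCX, XLSX, and PPTX apart requires looking inside the
        # archive; this sketch trusts the record when it names an OOXML type.
        return recorded if recorded in OOXML_TYPES else "zip"
    return sniffed
```

With this approach a file recorded as docx whose bytes begin with %PDF- would be handled as a PDF, which is the kind of stale record the re-detection step protects against.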
Available extractors
| Extractor | Version | File Types |
|---|---|---|
| builtin_pdf | 1.0 | PDF |
| builtin_ooxml | 1.0 | DOCX, XLSX, PPTX |
| builtin_image | 1.0 | PNG, JPEG |
Example
unsterwerx metadata extract --dry-run
Metadata extraction dry run:
Total candidates: 1807
Extracted: 0
Skipped (missing file): 3
Skipped (already done): 1612
Errors: 0
unsterwerx metadata extract --file-type pdf --force
Metadata extraction complete:
Total candidates: 1184
Extracted: 1184
Skipped (missing file): 0
Skipped (already done): 0
Errors: 3
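The skip behaviour of metadata extract, together with the exclusions listed under Notes, can be modelled as a small planning function. The Candidate class, the counter names, and the order of the checks are assumptions for this sketch; in particular, whether excluded documents count as candidates is not stated in this document.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical view of one document considered for extraction."""
    source_uri: str
    file_exists: bool
    has_ok_extraction: bool


def plan_extraction(candidates, force=False):
    """Count what an extraction run would do, without doing it."""
    counts = {"candidates": 0, "to_extract": 0, "missing_file": 0, "already_done": 0}
    for c in candidates:
        if c.source_uri.startswith(("import://", "synthetic://")):
            continue  # excluded from extraction entirely
        counts["candidates"] += 1
        if not c.file_exists:
            counts["missing_file"] += 1  # skipped, counted separately
        elif c.has_ok_extraction and not force:
            counts["already_done"] += 1  # skipped unless --force is given
        else:
            counts["to_extract"] += 1
    return counts
```

Passing force=True moves every document with an existing OK extraction from the already-done bucket into the to-extract bucket, which corresponds to the second example above where nothing is skipped as already done.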
Concept Rules
Concept rules map raw metadata keys to normalized semantic facts. Each rule specifies:
- A concept family grouping (e.g., origin_environment)
- A concept key within that family (e.g., origin_software_name)
- A raw key pattern (regex matching extractor output keys)
- A normalization method: identity, trim, lower, or casefold
- A confidence score (0.0 to 1.0)
Rules are applied in priority order during extraction. When multiple rules match the same raw key, the highest-priority rule wins. Facts are rebuilt atomically on each extraction, so re-running metadata extract --force picks up any rule changes.
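A minimal sketch of how a rule with these fields might turn a raw property into a semantic fact is shown below. The ConceptRule class, the derive_fact function, the integer priority field (larger wins), and full-string regex matching are assumptions for illustration; the actual rule format is not shown in this document.

```python
import re
from dataclasses import dataclass
from typing import Optional

# The four normalization methods named above, mapped to Python string operations.
NORMALIZERS = {
    "identity": lambda s: s,
    "trim": str.strip,
    "lower": str.lower,
    "casefold": str.casefold,
}


@dataclass
class ConceptRule:
    family: str
    key: str
    raw_key_pattern: str
    normalization: str
    confidence: float
    priority: int  # assumed: a larger number means higher priority


def derive_fact(rules, raw_key, raw_value) -> Optional[dict]:
    """Apply the highest-priority rule whose pattern matches raw_key."""
    matching = [r for r in rules if re.fullmatch(r.raw_key_pattern, raw_key)]
    if not matching:
        return None  # no rule covers this raw key
    rule = max(matching, key=lambda r: r.priority)
    return {
        "concept_family": rule.family,
        "concept_key": rule.key,
        "value": NORMALIZERS[rule.normalization](raw_value),
        "confidence": rule.confidence,
        "raw": f"{raw_key}={raw_value}",
    }


rules = [
    ConceptRule("origin_environment", "origin_software_name",
                r"pdf:Producer", "lower", 0.90, priority=10),
    ConceptRule("origin_environment", "origin_software_component",
                r"pdf:.*", "identity", 0.50, priority=1),
]
print(derive_fact(rules, "pdf:Producer", "LibreOffice 7.5"))
```

Both rules match pdf:Producer here, and the higher-priority one produces the fact, mirroring the origin_software_name line in the metadata show example.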
Notes
- Import-sourced (import://) and synthetic (synthetic://) documents are excluded from extraction.
- Missing source files are skipped gracefully and counted separately.
- metadata show prefers canonical values when available, and also surfaces low-confidence and suppressed fact state.
- metadata find, search, and metadata-driven classification all use the same concept canonicalization and alias-resolution path.
- All subcommands support --json for machine-readable output through the standard envelope.
- Extraction results are stored per-extractor per-document. Running metadata extract on the same document twice (without --force) is a no-op.
- Semantic facts power downstream analysis. Once extracted, facts show up in metadata show, metadata find, metadata-aware search filters, and metadata predicates in classification rules.