metadata

Inspects rich metadata extracted from ingested documents. Extractors run during ingestion to pull native properties out of PDF, OOXML (DOCX/XLSX/PPTX), and image files. Raw properties are then normalized into semantic facts through concept rules, giving you a uniform vocabulary across file types.

Five subcommands cover different angles: per-document inspection, corpus-wide key coverage, recurring canonical values, document lookup by semantic fact, and backfill extraction for documents ingested before metadata support was added.

Usage

bash

unsterwerx metadata <SUBCOMMAND>

Subcommands

Subcommand	Description
`show`	Display metadata and semantic facts for a single document
`keys`	List distinct metadata keys across the corpus with coverage stats
`values`	Show recurring canonical values for a concept key or concept family
`find`	Return documents whose semantic facts match a canonical value or regex
`extract`	Run metadata extraction on already-ingested documents

metadata show

Prints the raw extraction results and derived semantic facts for one document. Useful for checking what an extractor actually pulled from a file and how concept rules mapped those properties.

bash

unsterwerx metadata show [OPTIONS] <ID>

Arguments

Argument	Required	Description
`ID`	Yes	Document ID or unique prefix

Options

Option	Type	Default	Description
`--extractor`	string	all	Filter results to a specific extractor (e.g., `builtin_pdf`)
`--json`	flag		Output as JSON

Example

bash

unsterwerx metadata show a1b2

Document: a1b2c3d4-5e6f-7890-abcd-ef1234567890

Extractions:
  builtin_pdf (v1.0) [pdf] status=ok
    pdf:Producer: LibreOffice 7.5
    pdf:Creator: Writer
    pdf:CreationDate: 2024-08-14T09:31:22Z

Semantic Facts:
  [document_time]
    document_created_at = 2024-08-14 09:31:22 (confidence: 1.00, raw: pdf:CreationDate=2024-08-14T09:31:22Z)
  [origin_environment]
    origin_software_name = libreoffice 7.5 (confidence: 0.90, raw: pdf:Producer=LibreOffice 7.5)

bash

unsterwerx metadata show a1b2 --json
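
The `ID` argument accepts a unique prefix, as in the `a1b2` example above. A minimal sketch of how such a lookup could behave, assuming prefixes are matched against the start of the full ID; the function name and the error messages are hypothetical and not taken from the tool:

```python
from typing import Iterable


def resolve_document_id(prefix: str, known_ids: Iterable[str]) -> str:
    """Return the single ID that starts with `prefix`, or raise ValueError."""
    matches = sorted(doc_id for doc_id in known_ids if doc_id.startswith(prefix))
    if not matches:
        raise ValueError(f"no document ID starts with {prefix!r}")
    if len(matches) > 1:
        # A prefix is only usable when it identifies exactly one document.
        raise ValueError(f"prefix {prefix!r} is ambiguous: {len(matches)} matches")
    return matches[0]


ids = [
    "a1b2c3d4-5e6f-7890-abcd-ef1234567890",  # ID from the example above
    "a1f09e77-0000-4000-8000-000000000001",  # made-up ID for illustration
]
print(resolve_document_id("a1b2", ids))
```

With these two IDs, `a1b2` resolves to a single document, while the shorter prefix `a1` would be rejected as ambiguous.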

metadata keys

Shows which metadata keys exist across the corpus and how many documents carry each one. The coverage percentage tells you how widespread a given key is, which helps when deciding whether to build concept rules against it.

bash

unsterwerx metadata keys [OPTIONS]

Options

Option	Type	Default	Description
`--file-type`	string	all	Filter by file type (e.g., `pdf`, `docx`, `xlsx`)
`--extractor`	string	all	Filter by extractor name
`--json`	flag		Output as JSON

How coverage is calculated

The denominator depends on which filters you pass:

With --file-type (with or without --extractor): documents of that file type
With --extractor only: documents with at least one OK extraction from that extractor
With neither filter: all documents in the corpus
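
The denominator rules above can be sketched as follows. This is an illustration only: the function names and the shape of the input counts are assumptions, and the one-decimal rounding simply mirrors the sample output.

```python
from typing import Optional


def coverage_denominator(
    total_docs: int,
    docs_by_file_type: dict[str, int],
    docs_with_ok_extraction: dict[str, int],
    file_type: Optional[str] = None,
    extractor: Optional[str] = None,
) -> int:
    """Pick the denominator according to which filters were passed."""
    if file_type is not None:
        # --file-type takes precedence, even if --extractor is also given.
        return docs_by_file_type.get(file_type, 0)
    if extractor is not None:
        return docs_with_ok_extraction.get(extractor, 0)
    return total_docs


def coverage_percent(docs_with_key: int, denominator: int) -> str:
    """Format coverage with one decimal place; an empty denominator yields 0.0%."""
    if denominator == 0:
        return "0.0%"
    return f"{100 * docs_with_key / denominator:.1f}%"


denominator = coverage_denominator(
    total_docs=2000,                        # made-up corpus size
    docs_by_file_type={"pdf": 1184},
    docs_with_ok_extraction={"builtin_pdf": 1184},
    file_type="pdf",
)
print(coverage_percent(842, denominator))  # 71.1%
```

With 1184 PDF documents as the denominator, 842 documents carrying a key works out to 71.1%.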

Example

bash

unsterwerx metadata keys --file-type pdf

EXTRACTOR            KEY                            DOCS     COVERAGE
------------------------------------------------------------------------
builtin_pdf          pdf:Producer                    842      71.1%
builtin_pdf          pdf:Creator                     756      63.9%
builtin_pdf          pdf:CreationDate                698      59.0%
builtin_pdf          pdf:ModDate                     691      58.4%
builtin_pdf          pdf:PageCount                  1184     100.0%

metadata values

Inspects the canonical values that concept rules have produced. You can scope by concept key or concept family, then narrow further by file type, extractor, or a confidence floor. The text view shows total, usable, low-confidence, and suppressed counts so you can see whether a value is broadly trustworthy or mostly noise.

bash

unsterwerx metadata values [OPTIONS]

At least one of --concept-key or --concept-family is required.

Options

Option	Type	Default	Description
`--concept-key`	string		Inspect a specific concept key (e.g., `origin_software_name`)
`--concept-family`	string		Inspect a concept family (e.g., `origin_environment`)
`--min-docs`	integer	1	Only include values found in at least this many documents
`--file-type`	string	all	Only include facts from this file type
`--extractor`	string	all	Only include facts produced by this extractor
`--min-confidence`	float	0.0	Drop facts below this confidence floor
`--include-suppressed`	flag	false	Keep rows whose only contributors are facet-suppressed facts
`--json`	flag		Output as JSON

Example

bash

unsterwerx metadata values --concept-key origin_software_name --min-docs 5

Concept key: origin_software_name
CANONICAL VALUE                             TOTAL USABLE  LOWCF   SUPP  FILE TYPES
------------------------------------------------------------------------------------------
microsoft office word                         312    312      0      0  docx
libreoffice                                  198    194      4      0  pdf, docx
google docs                                   47     47      0      0  docx, pdf
adobe acrobat                                 31     29      2      0  pdf

bash

unsterwerx metadata values --concept-family origin_environment

Concept family: origin_environment
CONCEPT KEY                    CANONICAL VALUE                TOTAL USABLE  LOWCF   SUPP  FILE TYPES
------------------------------------------------------------------------------------------------------------------------
origin_software_name           microsoft office word            312    312      0      0  docx
origin_software_name           libreoffice                      198    194      4      0  pdf, docx
origin_software_version        16.0000                          162    162      0      0  docx
origin_software_component      pdfium                            13     13      0      0  pdf

If any metadata-bearing documents are stale, metadata values excludes them and prints a notice pointing at unsterwerx rules metadata rebuild --all.
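
In the sample output, TOTAL equals USABLE + LOWCF + SUPP for every row. A rough sketch of how those columns could be tallied follows, assuming each fact falls into exactly one of the three states and that one fact corresponds to one document. The `Fact` type and the low-confidence cut-off are hypothetical; the real threshold is not documented here.

```python
from collections import defaultdict
from dataclasses import dataclass

LOW_CONFIDENCE_BELOW = 0.5  # hypothetical cut-off, chosen only for illustration


@dataclass(frozen=True)
class Fact:
    canonical_value: str
    confidence: float
    suppressed: bool = False


def summarize(facts, min_docs=1, min_confidence=0.0, include_suppressed=False):
    """Tally total/usable/lowcf/supp per canonical value, then apply the filters."""
    rows = defaultdict(lambda: {"total": 0, "usable": 0, "lowcf": 0, "supp": 0})
    for fact in facts:
        if fact.confidence < min_confidence:
            continue  # --min-confidence drops the fact before counting
        row = rows[fact.canonical_value]
        row["total"] += 1
        if fact.suppressed:
            row["supp"] += 1
        elif fact.confidence < LOW_CONFIDENCE_BELOW:
            row["lowcf"] += 1
        else:
            row["usable"] += 1
    kept = {}
    for value, row in rows.items():
        if row["total"] < min_docs:
            continue  # --min-docs
        if row["supp"] == row["total"] and not include_suppressed:
            continue  # only suppressed contributors: hidden unless --include-suppressed
        kept[value] = row
    return kept


facts = [
    Fact("libreoffice", 0.9),
    Fact("libreoffice", 0.9),
    Fact("libreoffice", 0.3),
    Fact("unknown", 0.9, suppressed=True),
]
print(summarize(facts))
```

A row with a high USABLE share is broadly trustworthy; a row dominated by LOWCF or SUPP is a candidate for rule tuning.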

metadata find

Finds documents whose semantic facts match a literal canonical value or a regex pattern. Exact-value mode canonicalizes the input and resolves metadata aliases before querying, which makes it the most direct way to answer questions such as "show me every document whose author, software, or date matches this value."

bash

unsterwerx metadata find [OPTIONS]

Exactly one of --value or --value-pattern is required.

Options

Option	Type	Default	Description
`--concept-key`	string	required	Concept key to search within
`--value`	string		Literal value to canonicalize and match
`--value-pattern`	regex		Regex matched against stored canonical values
`--file-type`	string	all	Restrict matches to this file type
`--extractor`	string	all	Restrict matches to this extractor
`--min-confidence`	float	0.0	Drop facts below this confidence floor
`--match-quality`	string	`usable`	One of `usable`, `any`, or `low-confidence`
`--limit`	integer	50	Stop after this many matching documents
`--json`	flag		Output as JSON

Example

bash

unsterwerx metadata find \
    --concept-key origin_software_name \
    --value "Microsoft Office Word"

text

Matches for origin_software_name = microsoft office word:
  Acquisition-Plan.docx (570c62fd-...)
    fact#18 origin_software_name=microsoft office word (confidence 1.00)
  PM-Guidebook.docx (8e62a1fb-...)
    fact#44 origin_software_name=microsoft office word (confidence 1.00)

bash

unsterwerx metadata find \
    --concept-key document_author \
    --value-pattern "(?i)whetsel|freimanis"

Use --match-quality any when you want low-confidence matches included, or --match-quality low-confidence to audit only noisy facts.
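
The two matching modes differ in what happens to the input before comparison. The sketch below illustrates that difference under several assumptions: the canonical form (whitespace collapse plus case-folding), the alias table, and the use of an unanchored regex search are all hypothetical stand-ins for whatever the tool actually does.

```python
import re
from typing import Callable, Optional

ALIASES = {"ms word": "microsoft office word"}  # hypothetical alias table


def canonicalize(value: str) -> str:
    """Assumed canonical form: collapse whitespace, case-fold, resolve aliases."""
    folded = " ".join(value.split()).casefold()
    return ALIASES.get(folded, folded)


def build_matcher(
    value: Optional[str] = None, value_pattern: Optional[str] = None
) -> Callable[[str], bool]:
    """Return a predicate over stored canonical values."""
    if (value is None) == (value_pattern is None):
        raise ValueError("exactly one of value or value_pattern is required")
    if value is not None:
        wanted = canonicalize(value)  # exact mode: canonicalize the input first
        return lambda stored: stored == wanted
    compiled = re.compile(value_pattern)  # pattern mode: input is used as written
    return lambda stored: compiled.search(stored) is not None


exact = build_matcher(value="Microsoft  Office Word")
pattern = build_matcher(value_pattern="(?i)whetsel|freimanis")
print(exact("microsoft office word"), pattern("a. freimanis"))
```

Because pattern mode runs against stored canonical values, a pattern written in lowercase, or with an inline case-insensitive flag as above, is the safer choice.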

metadata extract

Runs extractors on documents that were ingested before metadata support was available, or re-runs them when you want fresh results. Without --force, documents that already have an OK extraction are skipped.

bash

unsterwerx metadata extract [OPTIONS]

Options

Option	Type	Default	Description
`--file-type`	string	all	Only process documents of this file type
`--document`	string		Only process a specific document (ID or prefix)
`--force`	flag		Re-extract even if results already exist
`--dry-run`	flag		Show what would be extracted without writing
`--json`	flag		Output as JSON

During extraction, file types are re-detected from the file's actual bytes rather than read from the database record. This guards against stale file-type records.
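
Byte-based detection usually relies on well-known file signatures. The following is a sketch of that general technique, not the tool's own detector: the function name is hypothetical, and telling the OOXML types apart by their main folder inside the ZIP container is one common heuristic among several.

```python
import io
import zipfile
from typing import Optional


def detect_file_type(data: bytes) -> Optional[str]:
    """Guess a file type from leading signature bytes; None if unrecognized."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"PK\x03\x04"):
        # OOXML files are ZIP containers; the main part's folder tells them apart.
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                names = archive.namelist()
        except zipfile.BadZipFile:
            return None
        for folder, file_type in (("word/", "docx"), ("xl/", "xlsx"), ("ppt/", "pptx")):
            if any(name.startswith(folder) for name in names):
                return file_type
    return None


print(detect_file_type(b"%PDF-1.7\n"))
```

A file stored with a `.docx` name but containing PDF bytes would be reported as `pdf`, which is the kind of stale record this check is meant to catch.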

Available extractors

Extractor	Version	File Types
`builtin_pdf`	1.0	PDF
`builtin_ooxml`	1.0	DOCX, XLSX, PPTX
`builtin_image`	1.0	PNG, JPEG

Example

bash

unsterwerx metadata extract --dry-run

Metadata extraction dry run:
  Total candidates:   1807
  Extracted:             0
  Skipped (missing file): 3
  Skipped (already done): 1612
  Errors:                0

bash

unsterwerx metadata extract --file-type pdf --force

Metadata extraction complete:
  Total candidates:   1184
  Extracted:          1184
  Skipped (missing file): 0
  Skipped (already done):  0
  Errors:                3
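
The skip behavior can be pictured as a per-document decision. The sketch below is an assumption-laden illustration: the `Candidate` type and the bucket names are hypothetical, and the order of the checks (excluded source, then missing file, then existing result) is a guess rather than documented behavior.

```python
from dataclasses import dataclass

EXCLUDED_SCHEMES = ("import://", "synthetic://")  # excluded from extraction


@dataclass(frozen=True)
class Candidate:
    doc_id: str
    source: str
    file_exists: bool
    has_ok_extraction: bool


def plan_extraction(candidates, force=False):
    """Map each eligible document ID to 'extract' or a skip reason."""
    plan = {}
    for c in candidates:
        if c.source.startswith(EXCLUDED_SCHEMES):
            continue  # never considered for extraction
        if not c.file_exists:
            plan[c.doc_id] = "skipped_missing_file"
        elif c.has_ok_extraction and not force:
            plan[c.doc_id] = "skipped_already_done"
        else:
            plan[c.doc_id] = "extract"
    return plan


docs = [
    Candidate("d1", "/data/report.pdf", True, True),
    Candidate("d2", "/data/new.pdf", True, False),
    Candidate("d3", "/data/gone.pdf", False, False),
    Candidate("d4", "import://legacy/42", True, False),
]
print(plan_extraction(docs))
print(plan_extraction(docs, force=True))
```

With `force=True`, `d1` moves from the already-done bucket to `extract`, while the missing file and the import-sourced document are still left alone.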

Concept Rules

Concept rules map raw metadata keys to normalized semantic facts. Each rule specifies:

A concept family grouping (e.g., origin_environment)
A concept key within that family (e.g., origin_software_name)
A raw key pattern (regex matching extractor output keys)
A normalization method: identity, trim, lower, or casefold
A confidence score (0.0 to 1.0)

Rules are applied in priority order during extraction. When multiple rules match the same raw key, the highest-priority rule wins. Facts are rebuilt atomically on each extraction, so re-running metadata extract --force picks up any rule changes.
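
A small model of rule application may help make the fields concrete. It assumes that a larger `priority` number wins and that the raw key pattern must match the whole key; neither detail is confirmed here. The `ConceptRule` type, and the catch-all rule with its `misc` family, are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

NORMALIZERS = {
    "identity": lambda s: s,
    "trim": str.strip,
    "lower": str.lower,
    "casefold": str.casefold,
}


@dataclass(frozen=True)
class ConceptRule:
    family: str
    key: str
    raw_key_pattern: str
    normalization: str
    confidence: float
    priority: int  # assumption: larger number means higher priority


def apply_rules(raw_key: str, raw_value: str, rules) -> Optional[dict]:
    """Return the fact produced by the highest-priority matching rule, if any."""
    matching = [r for r in rules if re.fullmatch(r.raw_key_pattern, raw_key)]
    if not matching:
        return None
    winner = max(matching, key=lambda r: r.priority)
    return {
        "family": winner.family,
        "key": winner.key,
        "value": NORMALIZERS[winner.normalization](raw_value),
        "confidence": winner.confidence,
    }


rules = [
    ConceptRule("origin_environment", "origin_software_name", r"pdf:Producer", "lower", 0.90, 10),
    ConceptRule("misc", "raw_pdf_property", r"pdf:.*", "identity", 0.50, 1),  # made-up catch-all
]
print(apply_rules("pdf:Producer", "LibreOffice 7.5", rules))
```

Both rules match `pdf:Producer`, but only the higher-priority one produces the fact, giving `origin_software_name = libreoffice 7.5` at confidence 0.90, in line with the `metadata show` example.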

Notes

Import-sourced (import://) and synthetic (synthetic://) documents are excluded from extraction.
Missing source files are skipped gracefully and counted separately.
metadata show prefers canonical values when available, and also surfaces low-confidence and suppressed fact state.
metadata find, search, and metadata-driven classification all use the same concept canonicalization and alias-resolution path.
All subcommands support --json for machine-readable output through the standard envelope.
Extraction results are stored per extractor, per document. Once a document has an OK extraction, running metadata extract on it again without --force is a no-op.
Semantic facts power downstream analysis. Once extracted, facts show up in metadata show, metadata find, metadata-aware search filters, and metadata predicates in classification rules.