Unsterwerx

metadata

Inspects rich metadata extracted from ingested documents. Extractors run during ingestion to pull native properties out of PDF, OOXML (DOCX/XLSX/PPTX), and image files. Raw properties are then normalized into semantic facts through concept rules, giving you a uniform vocabulary across file types.

Five subcommands cover different angles: per-document inspection, corpus-wide key coverage, recurring canonical values, document lookup by semantic fact, and backfill extraction for documents ingested before metadata support was added.

Usage

bash
unsterwerx metadata <SUBCOMMAND>

Subcommands

| Subcommand | Description |
| --- | --- |
| show | Display metadata and semantic facts for a single document |
| keys | List distinct metadata keys across the corpus with coverage stats |
| values | Show recurring canonical values for a concept key or concept family |
| find | Return documents whose semantic facts match a canonical value or regex |
| extract | Run metadata extraction on already-ingested documents |

metadata show

Prints the raw extraction results and derived semantic facts for one document. Useful for checking what an extractor actually pulled from a file and how concept rules mapped those properties.

bash
unsterwerx metadata show [OPTIONS] <ID>

Arguments

| Argument | Required | Description |
| --- | --- | --- |
| ID | Yes | Document ID or unique prefix |

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| --extractor | string | all | Filter results to a specific extractor (e.g., builtin_pdf) |
| --json | flag | | Output as JSON |

Example

bash
unsterwerx metadata show a1b2
Document: a1b2c3d4-5e6f-7890-abcd-ef1234567890

Extractions:
  builtin_pdf (v1.0) [pdf] status=ok
    pdf:Producer: LibreOffice 7.5
    pdf:Creator: Writer
    pdf:CreationDate: 2024-08-14T09:31:22Z

Semantic Facts:
  [document_time]
    document_created_at = 2024-08-14 09:31:22 (confidence: 1.00, raw: pdf:CreationDate=2024-08-14T09:31:22Z)
  [origin_environment]
    origin_software_name = libreoffice 7.5 (confidence: 0.90, raw: pdf:Producer=LibreOffice 7.5)

bash
unsterwerx metadata show a1b2 --json
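The ID argument accepts a unique prefix, as in the a1b2 lookup above. The following is a minimal sketch of how that kind of prefix resolution could behave, not a description of the tool's internals. The function resolve_prefix and the second sample ID are hypothetical.

```python
def resolve_prefix(prefix: str, known_ids: list) -> str:
    """Return the single ID that starts with prefix, or raise ValueError."""
    matches = [doc_id for doc_id in known_ids if doc_id.startswith(prefix)]
    if not matches:
        raise ValueError(f"no document matches prefix {prefix!r}")
    if len(matches) > 1:
        # A prefix is only usable when it identifies exactly one document.
        raise ValueError(f"prefix {prefix!r} is ambiguous ({len(matches)} matches)")
    return matches[0]


ids = [
    "a1b2c3d4-5e6f-7890-abcd-ef1234567890",  # ID from the example above
    "a1f09e77-0000-4000-8000-000000000001",  # hypothetical second document
]
resolved = resolve_prefix("a1b2", ids)
```

In this sketch the shorter prefix a1 would be rejected as ambiguous because both sample IDs start with it.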

metadata keys

Shows which metadata keys exist across the corpus and how many documents carry each one. The coverage percentage tells you how widespread a given key is, which helps when deciding whether to build concept rules against it.

bash
unsterwerx metadata keys [OPTIONS]

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| --file-type | string | all | Filter by file type (e.g., pdf, docx, xlsx) |
| --extractor | string | all | Filter by extractor name |
| --json | flag | | Output as JSON |

How coverage is calculated

The denominator depends on which filters you pass:

Example

bash
unsterwerx metadata keys --file-type pdf
EXTRACTOR            KEY                            DOCS     COVERAGE
------------------------------------------------------------------------
builtin_pdf          pdf:Producer                    842      71.1%
builtin_pdf          pdf:Creator                     756      63.9%
builtin_pdf          pdf:CreationDate                698      59.0%
builtin_pdf          pdf:ModDate                     691      58.4%
builtin_pdf          pdf:PageCount                  1184     100.0%
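The COVERAGE column is the DOCS count divided by a denominator. In the sample above, pdf:PageCount appears in 1184 documents at 100.0%, which suggests that with --file-type pdf the denominator is the number of PDF documents. That is an inference from the sample output, not a documented rule. Under that assumption, the other rows can be reproduced with a short sketch (coverage_percent is a hypothetical helper):

```python
def coverage_percent(docs_with_key: int, denominator: int) -> float:
    """Coverage as a percentage, rounded to one decimal place."""
    if denominator <= 0:
        return 0.0
    return round(100.0 * docs_with_key / denominator, 1)


pdf_docs = 1184  # assumed denominator: the count shown at 100.0% coverage

docs_per_key = {
    "pdf:Producer": 842,
    "pdf:Creator": 756,
    "pdf:CreationDate": 698,
    "pdf:ModDate": 691,
    "pdf:PageCount": 1184,
}

coverage = {key: coverage_percent(n, pdf_docs) for key, n in docs_per_key.items()}
```

The computed values match the sample rows, for example 842 / 1184 gives 71.1%.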

metadata values

Inspects the canonical values that concept rules have produced. You can scope by concept key or concept family, then narrow further by file type, extractor, or a confidence floor. The text view shows total, usable, low-confidence, and suppressed counts so you can see whether a value is broadly trustworthy or mostly noise.

bash
unsterwerx metadata values [OPTIONS]

At least one of --concept-key or --concept-family is required.

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| --concept-key | string | | Inspect a specific concept key (e.g., origin_software_name) |
| --concept-family | string | | Inspect a concept family (e.g., origin_environment) |
| --min-docs | integer | 1 | Only include values found in at least this many documents |
| --file-type | string | all | Only include facts from this file type |
| --extractor | string | all | Only include facts produced by this extractor |
| --min-confidence | float | 0.0 | Drop facts below this confidence floor |
| --include-suppressed | flag | false | Keep rows whose only contributors are facet-suppressed facts |
| --json | flag | | Output as JSON |

Example

bash
unsterwerx metadata values --concept-key origin_software_name --min-docs 5
Concept key: origin_software_name
CANONICAL VALUE                             TOTAL USABLE  LOWCF   SUPP  FILE TYPES
------------------------------------------------------------------------------------------
microsoft office word                         312    312      0      0  docx
libreoffice                                  198    194      4      0  pdf, docx
google docs                                   47     47      0      0  docx, pdf
adobe acrobat                                 31     29      2      0  pdf

bash
unsterwerx metadata values --concept-family origin_environment
Concept family: origin_environment
CONCEPT KEY                    CANONICAL VALUE                TOTAL USABLE  LOWCF   SUPP  FILE TYPES
------------------------------------------------------------------------------------------------------------------------
origin_software_name           microsoft office word            312    312      0      0  docx
origin_software_name           libreoffice                      198    194      4      0  pdf, docx
origin_software_version        16.0000                          162    162      0      0  docx
origin_software_component      pdfium                            13     13      0      0  pdf

If any metadata-bearing documents are stale, metadata values excludes them and prints a notice pointing at unsterwerx rules metadata rebuild --all.
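In the sample rows above, TOTAL equals USABLE + LOWCF + SUPP (for example 198 = 194 + 4 + 0). Whether that identity always holds is not stated here, so treat it as an observation about the sample. The sketch below uses those rows to rank values by how much of their support is usable. ValueRow and usable_share are hypothetical names, not part of the tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValueRow:
    canonical_value: str
    total: int
    usable: int
    low_confidence: int
    suppressed: int

    def usable_share(self) -> float:
        """Fraction of contributing facts that are usable."""
        return self.usable / self.total if self.total else 0.0


# Rows copied from the origin_software_name example above.
rows = [
    ValueRow("microsoft office word", 312, 312, 0, 0),
    ValueRow("libreoffice", 198, 194, 4, 0),
    ValueRow("google docs", 47, 47, 0, 0),
    ValueRow("adobe acrobat", 31, 29, 2, 0),
]

# Values with at least one non-usable fact, lowest usable share first.
noisy = sorted((r for r in rows if r.usable < r.total), key=ValueRow.usable_share)
```

A value whose usable share is close to 1.0 is broadly trustworthy. A value dominated by low-confidence or suppressed facts is a candidate for review.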


metadata find

Finds documents whose semantic facts match a literal canonical value or a regex pattern. Exact-value mode canonicalizes the input and walks metadata aliases before querying, so it is the best operator-facing command for "show me every document whose author/software/date matches this concept."

bash
unsterwerx metadata find [OPTIONS]

Exactly one of --value or --value-pattern is required.

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| --concept-key | string | required | Concept key to search within |
| --value | string | | Literal value to canonicalize and match |
| --value-pattern | regex | | Regex matched against stored canonical values |
| --file-type | string | all | Restrict matches to this file type |
| --extractor | string | all | Restrict matches to this extractor |
| --min-confidence | float | 0.0 | Drop facts below this confidence floor |
| --match-quality | string | usable | One of usable, any, or low-confidence |
| --limit | integer | 50 | Stop after this many matching documents |
| --json | flag | | Output as JSON |

Example

bash
unsterwerx metadata find \
    --concept-key origin_software_name \
    --value "Microsoft Office Word"
text
Matches for origin_software_name = microsoft office word:
  Acquisition-Plan.docx (570c62fd-...)
    fact#18 origin_software_name=microsoft office word (confidence 1.00)
  PM-Guidebook.docx (8e62a1fb-...)
    fact#44 origin_software_name=microsoft office word (confidence 1.00)

bash
unsterwerx metadata find \
    --concept-key document_author \
    --value-pattern "(?i)whetsel|freimanis"

Use --match-quality any when you want low-confidence matches included, or --match-quality low-confidence to audit only noisy facts.
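The difference between --value and --value-pattern can be sketched as follows. This is an illustration under assumptions, not the tool's implementation. Canonicalization is modeled as lowercasing plus whitespace collapsing, which is consistent with "Microsoft Office Word" becoming microsoft office word in the example but is otherwise a guess. The alias table, the unanchored regex search, and all sample facts other than the two .docx names are hypothetical.

```python
import re

# Hypothetical alias table: alternate spelling -> canonical value.
ALIASES = {"ms word": "microsoft office word"}


def canonicalize(value: str) -> str:
    """Assumed canonical form: lowercase, single spaces, aliases resolved."""
    lowered = " ".join(value.lower().split())
    return ALIASES.get(lowered, lowered)


def find_exact(facts, value):
    """Exact-value mode: canonicalize the input, then compare for equality."""
    target = canonicalize(value)
    return [doc for doc, stored in facts if stored == target]


def find_pattern(facts, pattern):
    """Pattern mode: apply the regex to stored canonical values as-is."""
    regex = re.compile(pattern)
    return [doc for doc, stored in facts if regex.search(stored)]


software_facts = [
    ("Acquisition-Plan.docx", "microsoft office word"),
    ("PM-Guidebook.docx", "microsoft office word"),
    ("notes.pdf", "libreoffice"),  # made-up document
]
author_facts = [
    ("a.docx", "j. whetsel"),  # made-up fact values
    ("b.docx", "someone else"),
]

exact_hits = find_exact(software_facts, "Microsoft Office Word")
alias_hits = find_exact(software_facts, "MS Word")
pattern_hits = find_pattern(author_facts, "(?i)whetsel|freimanis")
```

In this sketch the exact and alias lookups return the same two documents, while the pattern lookup never touches the alias table.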


metadata extract

Runs extractors on documents that were ingested before metadata support was available, or re-runs them when you want fresh results. Without --force, documents that already have an OK extraction are skipped.

bash
unsterwerx metadata extract [OPTIONS]

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| --file-type | string | all | Only process documents of this file type |
| --document | string | | Only process a specific document (ID or prefix) |
| --force | flag | | Re-extract even if results already exist |
| --dry-run | flag | | Show what would be extracted without writing |
| --json | flag | | Output as JSON |

File types are re-detected from actual bytes during extraction, not from the database value. This guards against stale file-type records.
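How Unsterwerx detects types from bytes is not described here. As a rough illustration of the general technique, the sketch below checks well-known file signatures for the formats the extractors cover. detect_file_type is a hypothetical helper, and telling DOCX, XLSX, and PPTX apart by their top-level archive directory is a common heuristic rather than a full OOXML check.

```python
import io
import zipfile
from typing import Optional


def detect_file_type(data: bytes) -> Optional[str]:
    """Guess a file type from leading magic bytes; None if unrecognized."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"PK\x03\x04"):
        # OOXML files are ZIP archives; the main part lives under a
        # format-specific directory (word/, xl/, or ppt/).
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                names = archive.namelist()
        except zipfile.BadZipFile:
            return None
        for prefix, file_type in (("word/", "docx"), ("xl/", "xlsx"), ("ppt/", "pptx")):
            if any(name.startswith(prefix) for name in names):
                return file_type
    return None


# Build a tiny in-memory archive that looks like a DOCX to this heuristic.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<w/>")
docx_guess = detect_file_type(buffer.getvalue())
```

Checking bytes this way means a file recorded as one type in the database is still handled according to what it actually contains.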

Available extractors

| Extractor | Version | File Types |
| --- | --- | --- |
| builtin_pdf | 1.0 | PDF |
| builtin_ooxml | 1.0 | DOCX, XLSX, PPTX |
| builtin_image | 1.0 | PNG, JPEG |

Example

bash
unsterwerx metadata extract --dry-run
Metadata extraction dry run:
  Total candidates:   1807
  Extracted:             0
  Skipped (missing file): 3
  Skipped (already done): 1612
  Errors:                0

bash
unsterwerx metadata extract --file-type pdf --force
Metadata extraction complete:
  Total candidates:   1184
  Extracted:          1184
  Skipped (missing file): 0
  Skipped (already done):  0
  Errors:                3
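The skip behavior described for --force can be sketched as a small planning function. This is a simplified model, not the tool's code. The Candidate fields, the counter names, and the order of the checks (missing file first) are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    doc_id: str
    file_exists: bool
    has_ok_extraction: bool


def plan(candidates, force: bool = False) -> Counter:
    """Count what an extraction run would do with each candidate."""
    counts = Counter(total=len(candidates))
    for candidate in candidates:
        if not candidate.file_exists:
            counts["skipped_missing_file"] += 1
        elif candidate.has_ok_extraction and not force:
            # Without force, an existing OK extraction is left alone.
            counts["skipped_already_done"] += 1
        else:
            counts["to_extract"] += 1
    return counts


candidates = [
    Candidate("doc-1", file_exists=False, has_ok_extraction=False),
    Candidate("doc-2", file_exists=True, has_ok_extraction=True),
    Candidate("doc-3", file_exists=True, has_ok_extraction=False),
]

default_plan = plan(candidates)
forced_plan = plan(candidates, force=True)
```

In this model, force only changes the outcome for documents that already have an OK extraction. A missing file is skipped either way.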

Concept Rules

Concept rules map raw metadata keys to normalized semantic facts. Each rule specifies:

Rules are applied in priority order during extraction. When multiple rules match the same raw key, the highest-priority rule wins. Facts are rebuilt atomically on each extraction, so re-running metadata extract --force picks up any rule changes.
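The "highest-priority rule wins" behavior can be illustrated with a short sketch. The rule fields are not listed on this page, so the ConceptRule shape below is hypothetical, as is the convention that a larger number means higher priority. The second pdf:Producer rule is invented to show a conflict.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConceptRule:
    raw_key: str      # raw metadata key the rule matches
    concept_key: str  # semantic fact the rule produces
    priority: int     # assumption: larger number = higher priority


def winning_rule(rules, raw_key: str) -> Optional[ConceptRule]:
    """Return the highest-priority rule matching raw_key, or None."""
    matching = [rule for rule in rules if rule.raw_key == raw_key]
    return max(matching, key=lambda rule: rule.priority, default=None)


rules = [
    ConceptRule("pdf:Producer", "origin_software_name", priority=10),
    ConceptRule("pdf:Producer", "origin_software_component", priority=5),  # invented
    ConceptRule("pdf:CreationDate", "document_created_at", priority=10),
]

winner = winning_rule(rules, "pdf:Producer")
```

Here the pdf:Producer key resolves to origin_software_name because that rule carries the higher priority, and a key with no matching rule produces no fact.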

Notes