How To Extract and Query Document Metadata with Unsterwerx
Every document carries metadata that its authoring software embedded at creation time: who wrote it, when, with what tool, what version. Unsterwerx can extract this metadata from your ingested corpus, normalize it into semantic facts, and let you query it across thousands of documents at once. This gives you a bird's-eye view of your document ecosystem that no file browser can provide.
Prerequisites
Before you begin, you need:
- Unsterwerx v0.5.4 or later installed and on your PATH
- An initialized Shared Sandbox (the local processing environment where Unsterwerx stores all state) with documents already ingested. If you haven't done this yet, run unsterwerx ingest /path/to/your/documents first.
- Canonical content extracted for your documents (unsterwerx canonical or unsterwerx similarity, which runs canonicalization automatically)
Note: Metadata extraction is separate from canonical content extraction. You can run it at any point after ingestion, including months later. It reads from the original source files, not from the Universal Data Set (the normalized canonical representation).
Step 1 - Preview the Extraction with a Dry Run
Before extracting anything, find out how many documents are candidates and how many already have metadata. The --dry-run flag reports what would happen without writing to the database.
unsterwerx metadata extract --dry-run
Metadata extraction dry run:
Total candidates: 1128
Extracted: 1115
Skipped (missing file): 0
Skipped (already done): 13
Errors: 0
The output tells you that 1,128 documents are eligible for extraction, 1,115 would be processed, and 13 already have metadata from a previous run. "Missing file" counts documents whose original source files have been moved or deleted since ingestion.
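If you want to check a dry run from a script, the summary is easy to parse. The following is a minimal sketch, assuming the summary keeps the "Label: count" layout shown above; parse_extract_summary and is_balanced are hypothetical helper names, not part of Unsterwerx.

```python
import re

def parse_extract_summary(text):
    """Parse the 'Label: count' lines of an extract summary into a dict."""
    counts = {}
    for line in text.splitlines():
        # The title line ends with a bare colon, so it never matches.
        match = re.match(r"^\s*(.+?):\s*(\d+)\s*$", line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts

def is_balanced(counts):
    """True when extracted + skipped + errors equals the candidate total."""
    accounted = sum(v for k, v in counts.items() if k != "Total candidates")
    return accounted == counts["Total candidates"]

# Sample copied from the dry run above.
SAMPLE = """\
Metadata extraction dry run:
Total candidates: 1128
Extracted: 1115
Skipped (missing file): 0
Skipped (already done): 13
Errors: 0
"""

counts = parse_extract_summary(SAMPLE)
print(counts["Extracted"], is_balanced(counts))
```

The balance check only makes sense for an unfiltered run. The runs scoped with --file-type in Step 2 still report the corpus-wide candidate total, so their counts will not add up to it.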
Note: Import-sourced and synthetic documents are excluded from candidates automatically. Unsterwerx only extracts metadata from documents that have a real source file on disk.
Step 2 - Extract Metadata by File Type
You can extract metadata for all documents at once, or scope it to a specific file type. Running by file type is useful when you want to process your PDFs first and DOCX files later, or when you're testing against one format.
Extract metadata from PDF documents:
unsterwerx metadata extract --file-type pdf
Metadata extraction complete:
Total candidates: 1128
Extracted: 433
Skipped (missing file): 0
Skipped (already done): 8
Errors: 0
Then extract from DOCX documents:
unsterwerx metadata extract --file-type docx
Metadata extraction complete:
Total candidates: 1128
Extracted: 243
Skipped (missing file): 0
Skipped (already done): 3
Errors: 0
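If you process several formats in sequence, you can script the per-type runs. This is a sketch under stated assumptions: build_extract_command and extract_by_type are hypothetical helpers, it assumes --file-type and --dry-run can be combined, and it assumes unsterwerx exits with a non-zero status on failure. Only the pdf and docx values appear in this guide.

```python
import subprocess

def build_extract_command(file_type=None, dry_run=False):
    """Build the argument list for `unsterwerx metadata extract`."""
    command = ["unsterwerx", "metadata", "extract"]
    if file_type is not None:
        command += ["--file-type", file_type]
    if dry_run:
        # Assumption: --dry-run may be combined with --file-type.
        command.append("--dry-run")
    return command

def extract_by_type(file_types):
    """Run one extraction per file type, stopping at the first failure."""
    for file_type in file_types:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(build_extract_command(file_type), check=True)
```

Calling extract_by_type(["pdf", "docx"]) would reproduce the two runs above in order.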
Each file type has its own extractor. PDFs use builtin_pdf, which reads the PDF info dictionary. DOCX, XLSX, and PPTX files use builtin_ooxml, which reads Office Open XML core and app properties. The extractors are format-specific NACs (Normalized Application Containers) - adapters that know how to read native metadata from each format.
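To make the OOXML side concrete, here is a minimal sketch of reading core and app properties with the Python standard library. It is not the builtin_ooxml implementation; it only illustrates that these properties live in docProps/core.xml and docProps/app.xml inside the ZIP package. The function names are hypothetical, and the dcterms_ prefix is applied only to mimic the key names this guide shows.

```python
import zipfile
import xml.etree.ElementTree as ET

DCTERMS_NS = "http://purl.org/dc/terms/"

def property_key(tag):
    """Turn '{namespace}local' into a key, prefixing Dublin Core terms."""
    if tag.startswith("{"):
        namespace, _, local = tag[1:].partition("}")
    else:
        namespace, local = "", tag
    return f"dcterms_{local}" if namespace == DCTERMS_NS else local

def read_ooxml_properties(source):
    """Return {key: text} from docProps/core.xml and docProps/app.xml."""
    properties = {}
    with zipfile.ZipFile(source) as archive:
        names = set(archive.namelist())
        for part in ("docProps/core.xml", "docProps/app.xml"):
            if part not in names:
                continue  # either part may be absent from a package
            root = ET.fromstring(archive.read(part))
            for child in root:
                # Skip container elements that carry no text of their own.
                if child.text and child.text.strip():
                    properties[property_key(child.tag)] = child.text.strip()
    return properties
```

Passing a path to a .docx file (or an open binary file object) returns a flat dictionary such as creator, lastModifiedBy, dcterms_created, Application, and AppVersion, depending on what the authoring tool wrote.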
Behind the scenes, Unsterwerx re-detects the file type from actual bytes during extraction, not from the database record. This guards against files that were misidentified at ingestion time.
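Byte-level detection of this kind can be sketched in a few lines. The example below is an illustration, not Unsterwerx's detector: it assumes a PDF starts with the %PDF- signature and that an OOXML package is a ZIP archive whose main part sits at its conventional path. detect_file_type and OOXML_MAIN_PARTS are hypothetical names.

```python
import zipfile

# Conventional main-part paths; a stricter check would follow the
# package relationships instead of relying on these names.
OOXML_MAIN_PARTS = {
    "word/document.xml": "docx",
    "xl/workbook.xml": "xlsx",
    "ppt/presentation.xml": "pptx",
}

def detect_file_type(path):
    """Classify a file by its leading bytes and, for ZIP packages, its parts."""
    with open(path, "rb") as handle:
        header = handle.read(8)
    if header.startswith(b"%PDF-"):
        return "pdf"
    if header.startswith(b"PK\x03\x04") and zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            names = set(archive.namelist())
        for part, file_type in OOXML_MAIN_PARTS.items():
            if part in names:
                return file_type
    return None  # unknown or unsupported format
```

Because the file extension is never consulted, a PDF saved as report.docx is still classified as pdf.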
Step 3 - Discover What Metadata Exists
Now that extraction is done, you can see what metadata keys exist across the entire corpus. The metadata keys command lists every distinct key, which extractor produced it, how many documents carry it, and the coverage percentage.
unsterwerx metadata keys
EXTRACTOR KEY DOCS COVERAGE
------------------------------------------------------------------------
builtin_pdf creation_date 378 33.5%
builtin_pdf mod_date 357 31.6%
builtin_pdf producer 341 30.2%
builtin_pdf creator 301 26.7%
builtin_ooxml dcterms_created 211 18.7%
builtin_ooxml Application 208 18.4%
builtin_ooxml dcterms_modified 206 18.3%
builtin_ooxml AppVersion 205 18.2%
builtin_ooxml revision 205 18.2%
builtin_ooxml creator 202 17.9%
builtin_ooxml lastModifiedBy 197 17.5%
builtin_pdf author 181 16.0%
builtin_pdf title 160 14.2%
builtin_ooxml Company 93 8.2%
builtin_ooxml title 37 3.3%
builtin_pdf subject 30 2.7%
builtin_ooxml subject 20 1.8%
builtin_pdf keywords 15 1.3%
builtin_ooxml keywords 8 0.7%
builtin_ooxml description 3 0.3%
builtin_ooxml category 1 0.1%
A few things to notice here. Both builtin_pdf and builtin_ooxml produce a creator key, but from different raw properties (/Creator in the PDF info dictionary vs. dc:creator in the OOXML core properties). The coverage column tells you how common each key is: creation_date appears in 33.5% of all documents, while category shows up in just one.
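You can find overlapping keys like creator programmatically. The sketch below assumes the four-column row layout shown above and assumes coverage is the document count divided by the 1,128 candidates from Step 1, which matches the printed percentages. All function names are hypothetical.

```python
from collections import defaultdict

def parse_keys_table(text):
    """Yield (extractor, key, docs) from `metadata keys` output rows."""
    for line in text.splitlines():
        fields = line.split()
        # Data rows have four fields: extractor, key, count, percentage.
        if len(fields) == 4 and fields[2].isdigit() and fields[3].endswith("%"):
            yield fields[0], fields[1], int(fields[2])

def shared_keys(rows):
    """Return the keys that more than one extractor produces."""
    extractors = defaultdict(set)
    for extractor, key, _docs in rows:
        extractors[key].add(extractor)
    return sorted(key for key, names in extractors.items() if len(names) > 1)

def coverage(docs, total_documents):
    """Coverage as a percentage rounded to one decimal place."""
    return round(100 * docs / total_documents, 1)

# A few rows copied from the listing above.
SAMPLE = """\
EXTRACTOR KEY DOCS COVERAGE
builtin_pdf creation_date 378 33.5%
builtin_pdf creator 301 26.7%
builtin_ooxml creator 202 17.9%
builtin_pdf title 160 14.2%
builtin_ooxml title 37 3.3%
builtin_ooxml category 1 0.1%
"""

rows = list(parse_keys_table(SAMPLE))
print(shared_keys(rows))
print(coverage(378, 1128))
```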
This is where Unsterwerx's concept rule system earns its keep. Raw keys like producer, Application, and creator are format-specific. Concept rules normalize them into semantic facts with a shared vocabulary, so you can query across PDF and DOCX files without caring which extractor produced the data.
Step 4 - Explore Specific Metadata Values
The metadata values command lets you drill into what a specific concept key or concept family contains. Concept keys are the normalized names (like origin_software_name), and concept families group related keys (like origin_environment, which contains software name, software version, and software component).
Start with a single concept key. To find out what software created your documents:
unsterwerx metadata values --concept-key origin_software_name
Concept key: origin_software_name
VALUE DOCS FILE TYPES
--------------------------------------------------------------------------------
Microsoft Office Word 203 docx
Adobe PDF Library 15.0 46 pdf
Adobe PDF Library 11.0 19 pdf
Acrobat Distiller 11.0 (Windows) 16 pdf
Microsoft: Print To PDF 14 pdf
PDFium 13 pdf
Acrobat Distiller 6.0.1 (Windows) 12 pdf
Microsoft® Word 2016 10 pdf
Microsoft® Word for Office 365 8 pdf
libtiff / tiff2pdf - 20100615 6 pdf
Adobe PDF Library 10.0 5 pdf
Microsoft Reporting Services PDF Rendering Extension 11.0.0.0 5 pdf
Microsoft® Word 2010 5 pdf
Microsoft® Word for Microsoft 365 5 pdf
Adobe Experience Manager forms output 4 pdf
Adobe LiveCycle Designer 11.0 4 pdf
This output reveals something important about real-world document corpora: they're messy. A single organization can have documents produced by dozens of different tools, spanning decades of software versions. The 203 DOCX files all report "Microsoft Office Word" because that's what the OOXML Application property stores. The PDF side is far more varied - PDFs from Word show up as "Microsoft® Word 2016", while PDFs from Adobe's pipeline show as "Adobe PDF Library 15.0".
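Because values contain spaces, a script that reads this table has to split columns from the right. Here is a minimal sketch, assuming each row ends with a document count followed by a single file-type token, as in the output above; the helper names are hypothetical.

```python
from collections import Counter

def parse_values_table(text):
    """Yield (value, docs, file_type) from `metadata values` rows."""
    for line in text.splitlines():
        # Split the two trailing columns off the right-hand side so that
        # spaces inside the value are preserved.
        parts = line.rsplit(None, 2)
        if len(parts) == 3 and parts[1].isdigit():
            yield parts[0].strip(), int(parts[1]), parts[2]

def docs_by_file_type(rows):
    """Sum document counts per file type."""
    totals = Counter()
    for _value, docs, file_type in rows:
        totals[file_type] += docs
    return totals

# A few rows copied from the listing above.
SAMPLE = """\
Concept key: origin_software_name
VALUE DOCS FILE TYPES
Microsoft Office Word 203 docx
Adobe PDF Library 15.0 46 pdf
Microsoft® Word 2016 10 pdf
libtiff / tiff2pdf - 20100615 6 pdf
"""

totals = docs_by_file_type(parse_values_table(SAMPLE))
print(totals["docx"], totals["pdf"])
```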
To see the broader picture, query an entire concept family:
unsterwerx metadata values --concept-family origin_environment
Concept family: origin_environment
CONCEPT KEY VALUE DOCS FILE TYPES
----------------------------------------------------------------------------------------------------
origin_software_component PScript5.dll Version 5.2.2 34 pdf
origin_software_component Acrobat PDFMaker 15 for Word 19 pdf
origin_software_component Acrobat PDFMaker 11 for Word 14 pdf
origin_software_component PDFium 13 pdf
origin_software_component Microsoft® Word 2016 10 pdf
origin_software_component Microsoft® Word for Office 365 8 pdf
origin_software_component LaTeX with hyperref package 7 pdf
...
origin_software_name Microsoft Office Word 203 docx
origin_software_name Adobe PDF Library 15.0 46 pdf
origin_software_name Adobe PDF Library 11.0 19 pdf
...
origin_software_version 16.0000 162 docx
origin_software_version 15.0000 24 docx
origin_software_version 14.0000 11 docx
origin_software_version 12.0000 8 docx
The family view gives you all three dimensions at once: what software, what component, and what version. You can see that 162 DOCX files report version 16 (Office 2016/365), 24 report version 15 (Office 2013), 11 report version 14 (Office 2010), and 8 report version 12 (Office 2007). That kind of version distribution is exactly what you need for migration planning or compliance audits.
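To turn those version counts into a distribution, a short script is enough. This sketch uses the four origin_software_version rows shown above; the OFFICE_RELEASES table is an assumption added for illustration (major version 16 is shared by Office 2016 and later releases, and 12 corresponds to Office 2007), and version_distribution is a hypothetical helper.

```python
# Assumed mapping from AppVersion major number to Office release.
OFFICE_RELEASES = {
    16: "Office 2016 or later",
    15: "Office 2013",
    14: "Office 2010",
    12: "Office 2007",
}

def version_distribution(version_counts):
    """Return (release, docs, percent) rows, largest group first."""
    total = sum(version_counts.values())
    rows = []
    for version, docs in sorted(version_counts.items(), key=lambda item: -item[1]):
        major = int(float(version))  # "16.0000" -> 16
        release = OFFICE_RELEASES.get(major, "unknown")
        rows.append((release, docs, round(100 * docs / total, 1)))
    return rows

# Counts copied from the origin_software_version rows above.
COUNTS = {"16.0000": 162, "15.0000": 24, "14.0000": 11, "12.0000": 8}

for release, docs, percent in version_distribution(COUNTS):
    print(f"{release}: {docs} documents ({percent}%)")
```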
Step 5 - Inspect Individual Document Metadata
When you need the full story on a specific document, metadata show gives you both the raw extraction results and the derived semantic facts.
unsterwerx metadata show 121530ca
Document: 121530ca-84aa-4d02-86f8-222dce40ffa3
Extractions:
builtin_ooxml (v1.0) [docx] status=ok
AppVersion: 16.0000
Application: Microsoft Office Word
creator: Freimanis, Adam D
dcterms_created: 2020-06-29T11:44:00Z
dcterms_modified: 2020-06-29T12:37:00Z
lastModifiedBy: Whetsel, Robert
revision: 3
Semantic Facts:
[document_authorship]
document_author = Freimanis, Adam D (confidence: 1.0, raw: creator=Freimanis, Adam D)
document_last_editor = Whetsel, Robert (confidence: 0.9, raw: lastModifiedBy=Whetsel, Robert)
[document_time]
document_created_at = 2020-06-29T11:44:00Z (confidence: 1.0, raw: dcterms_created=2020-06-29T11:44:00Z)
document_modified_at = 2020-06-29T12:37:00Z (confidence: 1.0, raw: dcterms_modified=2020-06-29T12:37:00Z)
[origin_environment]
origin_software_name = Microsoft Office Word (confidence: 1.0, raw: Application=Microsoft Office Word)
origin_software_version = 16.0000 (confidence: 1.0, raw: AppVersion=16.0000)
You only need to provide enough of the document ID to be unique - 121530ca is enough here. The output has two sections. The top section, Extractions, shows exactly what the builtin_ooxml extractor pulled from the OOXML properties: seven raw key-value pairs. The bottom section, Semantic Facts, shows how concept rules mapped those raw properties into normalized facts organized by family.
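Prefix matching of this kind can be sketched as follows. This is an illustration of the idea, not how Unsterwerx resolves IDs internally; resolve_document_id is a hypothetical helper and the second ID in the usage note is made up.

```python
def resolve_document_id(prefix, known_ids):
    """Return the single ID that starts with prefix, or raise LookupError."""
    matches = [doc_id for doc_id in known_ids if doc_id.startswith(prefix)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise LookupError(f"no document ID starts with {prefix!r}")
    raise LookupError(f"{prefix!r} is ambiguous: {len(matches)} documents match")
```

Given the IDs 121530ca-84aa-4d02-86f8-222dce40ffa3 and a made-up 1215ffff-0000-4000-8000-000000000000, the prefix 121530ca resolves to the first one, while the shorter prefix 1215 is rejected as ambiguous.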
Look at the document_authorship family. The creator field became document_author with confidence 1.0, while lastModifiedBy became document_last_editor with confidence 0.9. The lower confidence on the editor reflects the fact that "last modified by" is less reliable than "creator" in some software pipelines. Every fact includes its raw source so you can trace exactly where the value came from.
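The mapping from raw properties to semantic facts can be modeled as a small rule table. The sketch below is a hypothetical model, not Unsterwerx's rule format: the ConceptRule class and derive_facts function are invented for illustration, while the six rules and their confidence values are read directly off the metadata show output above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConceptRule:
    extractor: str
    raw_key: str
    family: str
    concept_key: str
    confidence: float

# Rules inferred from the `metadata show` output for builtin_ooxml.
RULES = [
    ConceptRule("builtin_ooxml", "creator", "document_authorship", "document_author", 1.0),
    ConceptRule("builtin_ooxml", "lastModifiedBy", "document_authorship", "document_last_editor", 0.9),
    ConceptRule("builtin_ooxml", "dcterms_created", "document_time", "document_created_at", 1.0),
    ConceptRule("builtin_ooxml", "dcterms_modified", "document_time", "document_modified_at", 1.0),
    ConceptRule("builtin_ooxml", "Application", "origin_environment", "origin_software_name", 1.0),
    ConceptRule("builtin_ooxml", "AppVersion", "origin_environment", "origin_software_version", 1.0),
]

def derive_facts(extractor, raw):
    """Apply every matching rule and keep the raw source for traceability."""
    facts = []
    for rule in RULES:
        if rule.extractor == extractor and rule.raw_key in raw:
            facts.append({
                "family": rule.family,
                "concept_key": rule.concept_key,
                "value": raw[rule.raw_key],
                "confidence": rule.confidence,
                "raw": f"{rule.raw_key}={raw[rule.raw_key]}",
            })
    return facts
```

Raw keys without a rule, such as revision in this example, produce no fact, which is why seven raw pairs yield six semantic facts.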
Step 6 - Backfill Metadata for Previously Ingested Documents
If you ingested documents before metadata extraction was available, or if you want to pick up changes from updated concept rules, use the --force flag to re-extract everything.
unsterwerx metadata extract --force
This re-runs all extractors on all eligible documents, overwriting previous results. Semantic facts are rebuilt atomically for each document, so updated concept rules take effect immediately.
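An atomic per-document rebuild means the old facts are removed and the new ones written in a single transaction, so a failure never leaves a document half-updated. The sketch below illustrates that pattern with SQLite purely as an assumption: this guide does not say which database Unsterwerx uses, and the semantic_facts table, its columns, and rebuild_facts are all hypothetical.

```python
import sqlite3

# Hypothetical schema, for illustration only.
SCHEMA = """
CREATE TABLE semantic_facts (
    document_id TEXT NOT NULL,
    concept_key TEXT NOT NULL,
    value       TEXT NOT NULL
)
"""

def rebuild_facts(connection, document_id, facts):
    """Replace all facts for one document inside a single transaction."""
    # The connection context manager commits on success and rolls back
    # if any statement raises, so the DELETE is undone on failure.
    with connection:
        connection.execute(
            "DELETE FROM semantic_facts WHERE document_id = ?", (document_id,)
        )
        connection.executemany(
            "INSERT INTO semantic_facts (document_id, concept_key, value) "
            "VALUES (?, ?, ?)",
            [(document_id, key, value) for key, value in facts],
        )
```

If an insert fails partway through, the rollback restores the facts that existed before the call, so readers see either the old set or the new set and never a mix.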
Without --force, the extract command skips any document that already has an OK extraction result. This makes it safe to run repeatedly - new documents get processed, existing ones are left alone.
Warning: Re-extraction reads from original source files on disk. If source files have been moved or deleted since ingestion, those documents will be counted as "Skipped (missing file)" and their existing metadata will remain unchanged.
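The skip rules described in this step can be summarized as a small decision function. This is a sketch of the documented behavior, not Unsterwerx's code: extraction_action and its return labels are hypothetical, and checking for the missing file before the already-done test is an assumption about ordering.

```python
import os

def extraction_action(source_path, has_ok_extraction, force=False):
    """Decide what an extract run does with one document."""
    # A missing source file is skipped even with force, which leaves
    # any existing metadata untouched.
    if not os.path.exists(source_path):
        return "skipped_missing_file"
    # Without force, an existing OK result makes the run a no-op.
    if has_ok_extraction and not force:
        return "skipped_already_done"
    return "extract"
```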
You can also target a single document for re-extraction:
unsterwerx metadata extract --document 121530ca --force
This is useful when you're testing concept rule changes against a known document before running a full corpus re-extraction.
Conclusion
You've extracted metadata from a mixed corpus of PDF and DOCX files, discovered 21 distinct metadata keys with varying coverage, explored the software landscape that produced your documents, and inspected the full extraction and semantic fact chain for an individual document.
The concept key/family system is what makes this practical at scale. Raw metadata keys are inconsistent across formats (producer in PDF, Application in OOXML), but concept rules normalize them into a shared vocabulary. Once extracted, these semantic facts feed into downstream classification and scoring pipelines.
From here, you can use metadata alongside other Unsterwerx capabilities:
- If you haven't already, run similarity analysis to find duplicate and near-duplicate documents. See How To Detect and Remove Duplicate Documents with Unsterwerx.
- Export documents from the Universal Data Set back to usable formats with unsterwerx reconstruct.
- Build classification rules that use metadata values as matching criteria, so documents from specific software or authors are automatically tagged.