Unsterwerx

How To Extract and Query Document Metadata with Unsterwerx

Every document carries metadata that its authoring software embedded at creation time: who wrote it, when, with what tool, what version. Unsterwerx can extract this metadata from your ingested corpus, normalize it into semantic facts, and let you query it across thousands of documents at once. This gives you a bird's-eye view of your document ecosystem that no file browser can provide.

Prerequisites

Before you begin, you need:

Note: Metadata extraction is separate from canonical content extraction. You can run it at any point after ingestion, including months later. It reads from the original source files, not from the Universal Data Set (the normalized canonical representation).

Step 1 - Preview the Extraction with a Dry Run

Before extracting anything, find out how many documents are candidates and how many already have metadata. The --dry-run flag reports what would happen without writing to the database.

bash
unsterwerx metadata extract --dry-run
text
Metadata extraction dry run:
  Total candidates: 1128
  Extracted:        1115
  Skipped (missing file): 0
  Skipped (already done):  13
  Errors:           0

The output tells you that 1,128 documents are eligible for extraction, 1,115 would be processed, and 13 already have metadata from a previous run. "Missing file" counts documents whose original source files have been moved or deleted since ingestion.
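If you script around the command, the summary is easy to turn into numbers. The following is a minimal Python sketch, not part of Unsterwerx: the function names are hypothetical, and it assumes the summary keeps the "label: count" layout shown above. The accounting check further assumes that the four buckets partition the candidates, which holds for this unscoped dry run but is not something the tool documents.

```python
import re

# Sample text copied from the dry-run output above.
SAMPLE = """\
Metadata extraction dry run:
  Total candidates: 1128
  Extracted:        1115
  Skipped (missing file): 0
  Skipped (already done):  13
  Errors:           0
"""


def parse_extract_summary(text):
    """Collect every 'label: count' line into a dict of integers."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"^\s*(.+?):\s+(\d+)\s*$", line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def buckets_add_up(counts):
    """Assumption: extracted + skipped + errors equals the candidate total."""
    accounted = (
        counts["Extracted"]
        + counts["Skipped (missing file)"]
        + counts["Skipped (already done)"]
        + counts["Errors"]
    )
    return accounted == counts["Total candidates"]


counts = parse_extract_summary(SAMPLE)
print(counts["Extracted"], buckets_add_up(counts))
```

A check like this is a cheap guard before a large run: a non-zero "Skipped (missing file)" count is a sign that source files have moved since ingestion.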

Note: Import-sourced and synthetic documents are excluded from candidates automatically. Unsterwerx only extracts metadata from documents that have a real source file on disk.

Step 2 - Extract Metadata by File Type

You can extract metadata for all documents at once, or scope it to a specific file type. Running by file type is useful when you want to process your PDFs first and DOCX files later, or when you're testing against one format.

Extract metadata from PDF documents:

bash
unsterwerx metadata extract --file-type pdf
text
Metadata extraction complete:
  Total candidates: 1128
  Extracted:        433
  Skipped (missing file): 0
  Skipped (already done):  8
  Errors:           0

Then extract from DOCX documents:

bash
unsterwerx metadata extract --file-type docx
text
Metadata extraction complete:
  Total candidates: 1128
  Extracted:        243
  Skipped (missing file): 0
  Skipped (already done):  3
  Errors:           0

Each file type has its own extractor. PDFs use builtin_pdf, which reads the PDF info dictionary. DOCX, XLSX, and PPTX files use builtin_ooxml, which reads Office Open XML core and app properties. The extractors are format-specific NACs (Normalized Application Containers) - adapters that know how to read native metadata from each format.
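For orientation, an OOXML file is a ZIP package, and its core and app properties live in the docProps/core.xml and docProps/app.xml parts. The sketch below reads them with Python's standard library. It is an illustration, not the builtin_ooxml implementation: read_ooxml_properties is a hypothetical name, and prefixing Dublin Core terms with dcterms_ is an assumption made to mirror key names such as dcterms_created in this guide.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

DCTERMS_NS = "http://purl.org/dc/terms/"
PROPERTY_PARTS = ("docProps/core.xml", "docProps/app.xml")


def read_ooxml_properties(data):
    """Return the simple text properties stored in an OOXML package."""
    props = {}
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        present = set(package.namelist())
        for part in PROPERTY_PARTS:
            if part not in present:
                continue  # both parts are optional in a package
            root = ET.fromstring(package.read(part))
            for child in root:
                text = (child.text or "").strip()
                if not text:
                    continue  # skip empty or nested elements
                # ElementTree spells qualified tags as "{namespace}local".
                namespace, _, local = child.tag.rpartition("}")
                namespace = namespace.lstrip("{")
                # Assumed naming: dcterms:created -> dcterms_created.
                key = f"dcterms_{local}" if namespace == DCTERMS_NS else local
                props[key] = text
    return props
```

Passing the bytes of a .docx file returns a flat dictionary of raw properties, which is the shape of data that concept rules later normalize.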

Behind the scenes, Unsterwerx re-detects the file type from actual bytes during extraction, not from the database record. This guards against files that were misidentified at ingestion time.
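Byte-based detection of this kind can be sketched in a few lines of Python. This is an assumption about the general technique, not Unsterwerx's actual detector: detect_file_type is a hypothetical name, the PDF check only looks at the very start of the file, and OOXML flavors are told apart by the top-level folder inside the ZIP package.

```python
import io
import zipfile

# Top-level folder inside the package -> file type.
OOXML_FOLDERS = (("word/", "docx"), ("xl/", "xlsx"), ("ppt/", "pptx"))


def detect_file_type(data):
    """Guess the file type from content, ignoring any file extension."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"PK\x03\x04"):  # ZIP local file header
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as package:
                names = package.namelist()
        except zipfile.BadZipFile:
            return "unknown"  # looked like a ZIP but is truncated or corrupt
        for folder, file_type in OOXML_FOLDERS:
            if any(name.startswith(folder) for name in names):
                return file_type
        return "zip"  # a ZIP archive that is not an Office document
    return "unknown"
```

A file named report.pdf that actually contains a Word package would come back as docx here, which is the kind of mismatch that re-detection is meant to catch.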

Step 3 - Discover What Metadata Exists

Now that extraction is done, you can see what metadata keys exist across the entire corpus. The metadata keys command lists every distinct key, which extractor produced it, how many documents carry it, and the coverage percentage.

bash
unsterwerx metadata keys
text
EXTRACTOR            KEY                                DOCS   COVERAGE
------------------------------------------------------------------------
builtin_pdf          creation_date                       378      33.5%
builtin_pdf          mod_date                            357      31.6%
builtin_pdf          producer                            341      30.2%
builtin_pdf          creator                             301      26.7%
builtin_ooxml        dcterms_created                     211      18.7%
builtin_ooxml        Application                         208      18.4%
builtin_ooxml        dcterms_modified                    206      18.3%
builtin_ooxml        AppVersion                          205      18.2%
builtin_ooxml        revision                            205      18.2%
builtin_ooxml        creator                             202      17.9%
builtin_ooxml        lastModifiedBy                      197      17.5%
builtin_pdf          author                              181      16.0%
builtin_pdf          title                               160      14.2%
builtin_ooxml        Company                              93       8.2%
builtin_ooxml        title                                37       3.3%
builtin_pdf          subject                              30       2.7%
builtin_ooxml        subject                              20       1.8%
builtin_pdf          keywords                             15       1.3%
builtin_ooxml        keywords                              8       0.7%
builtin_ooxml        description                           3       0.3%
builtin_ooxml        category                              1       0.1%

A few things to notice here. Both builtin_pdf and builtin_ooxml produce a creator key, but from different raw properties (/Creator in the PDF info dictionary vs. dc:creator in OOXML core properties). The coverage column tells you how common each key is across the whole corpus: creation_date appears in 33.5% of all documents, while category shows up in just one document.
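Both observations can be checked mechanically. The Python sketch below parses a few rows copied from the table, recomputes coverage, and lists keys that more than one extractor produces. The function names are hypothetical, and using 1,128 (the candidate total from the extraction summaries) as the denominator is an inference that happens to reproduce the percentages shown.

```python
from collections import defaultdict

TOTAL_DOCS = 1128  # assumption: coverage is relative to all candidates

# A few rows copied from the metadata keys output above.
KEYS_SAMPLE = """\
builtin_pdf          creation_date                       378      33.5%
builtin_pdf          creator                             301      26.7%
builtin_ooxml        creator                             202      17.9%
builtin_ooxml        category                              1       0.1%
"""


def parse_keys_rows(text):
    """Return (extractor, key, docs, coverage) tuples for each data row."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[2].isdigit() and parts[3].endswith("%"):
            rows.append((parts[0], parts[1], int(parts[2]), float(parts[3][:-1])))
    return rows


def coverage_percent(docs, total=TOTAL_DOCS):
    """Share of the corpus carrying a key, rounded to one decimal place."""
    return round(docs / total * 100, 1)


def keys_shared_across_extractors(rows):
    """Key names that appear under more than one extractor."""
    extractors_by_key = defaultdict(set)
    for extractor, key, _docs, _coverage in rows:
        extractors_by_key[key].add(extractor)
    return sorted(k for k, found in extractors_by_key.items() if len(found) > 1)


rows = parse_keys_rows(KEYS_SAMPLE)
print(keys_shared_across_extractors(rows))
```

Shared key names are the ones to treat with care: the same word can carry different meanings depending on the extractor that produced it.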

This is where Unsterwerx's concept rule system earns its keep. Raw keys like producer, Application, and creator are format-specific. Concept rules normalize them into semantic facts with a shared vocabulary, so you can query across PDF and DOCX files without caring which extractor produced the data.

Step 4 - Explore Specific Metadata Values

The metadata values command lets you drill into what a specific concept key or concept family contains. Concept keys are the normalized names (like origin_software_name), and concept families group related keys (like origin_environment, which contains software name, software version, and software component).

Start with a single concept key. To find out what software created your documents:

bash
unsterwerx metadata values --concept-key origin_software_name
text
Concept key: origin_software_name
VALUE                                                  DOCS FILE TYPES          
--------------------------------------------------------------------------------
Microsoft Office Word                                   203 docx                
Adobe PDF Library 15.0                                   46 pdf                 
Adobe PDF Library 11.0                                   19 pdf                 
Acrobat Distiller 11.0 (Windows)                         16 pdf                 
Microsoft: Print To PDF                                  14 pdf                 
PDFium                                                   13 pdf                 
Acrobat Distiller 6.0.1 (Windows)                        12 pdf                 
Microsoft® Word 2016                                     10 pdf                 
Microsoft® Word for Office 365                            8 pdf                 
libtiff / tiff2pdf - 20100615                             6 pdf                 
Adobe PDF Library 10.0                                    5 pdf                 
Microsoft Reporting Services PDF Rendering Extension 11.0.0.0        5 pdf                 
Microsoft® Word 2010                                      5 pdf                 
Microsoft® Word for Microsoft 365                         5 pdf                 
Adobe Experience Manager forms output                     4 pdf                 
Adobe LiveCycle Designer 11.0                             4 pdf                 

This output reveals something important about real-world document corpora: they're messy. A single organization can have documents produced by dozens of different tools, spanning decades of software versions. The 203 DOCX files all report "Microsoft Office Word" because that's what the OOXML Application property stores. The PDF side is far more varied - PDFs from Word show up as "Microsoft® Word 2016", while PDFs from Adobe's pipeline show as "Adobe PDF Library 15.0".

To see the broader picture, query an entire concept family:

bash
unsterwerx metadata values --concept-family origin_environment
text
Concept family: origin_environment
CONCEPT KEY                    VALUE                                        DOCS FILE TYPES          
----------------------------------------------------------------------------------------------------
origin_software_component      PScript5.dll Version 5.2.2                     34 pdf                 
origin_software_component      Acrobat PDFMaker 15 for Word                   19 pdf                 
origin_software_component      Acrobat PDFMaker 11 for Word                   14 pdf                 
origin_software_component      PDFium                                         13 pdf                 
origin_software_component      Microsoft® Word 2016                           10 pdf                 
origin_software_component      Microsoft® Word for Office 365                  8 pdf                 
origin_software_component      LaTeX with hyperref package                     7 pdf                 
...
origin_software_name           Microsoft Office Word                         203 docx                
origin_software_name           Adobe PDF Library 15.0                         46 pdf                 
origin_software_name           Adobe PDF Library 11.0                         19 pdf                 
...
origin_software_version        16.0000                                       162 docx                
origin_software_version        15.0000                                        24 docx                
origin_software_version        14.0000                                        11 docx                
origin_software_version        12.0000                                         8 docx                

The family view gives you all three dimensions at once: what software, what component, and what version. You can see that 162 DOCX files were created with version 16 (Office 2016 and later, including Microsoft 365), 24 with version 15 (Office 2013), 11 with version 14 (Office 2010), and 8 with version 12 (Office 2007). That kind of version distribution is exactly what you need for migration planning or compliance audits.
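For reporting, it can help to translate the numeric version into a release name. The Python sketch below is an illustrative lookup, not an Unsterwerx feature: office_release is a hypothetical helper, and the table covers only the major versions that appear in the sample output.

```python
# Major AppVersion number -> Office release family.
OFFICE_RELEASES = {
    12: "Office 2007",
    14: "Office 2010",
    15: "Office 2013",
    16: "Office 2016 or later (including Microsoft 365)",
}


def office_release(app_version):
    """Map a value such as '16.0000' to a human-readable release name."""
    major = int(app_version.split(".", 1)[0])
    return OFFICE_RELEASES.get(major, f"unknown (major version {major})")


for version in ("16.0000", "15.0000", "14.0000", "12.0000"):
    print(version, "->", office_release(version))
```

Version 16 is shared by several releases, so the version number alone cannot distinguish Office 2016 from a current Microsoft 365 build.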

Step 5 - Inspect Individual Document Metadata

When you need the full story on a specific document, metadata show gives you both the raw extraction results and the derived semantic facts.

bash
unsterwerx metadata show 121530ca
text
Document: 121530ca-84aa-4d02-86f8-222dce40ffa3

Extractions:
  builtin_ooxml (v1.0) [docx] status=ok
    AppVersion: 16.0000
    Application: Microsoft Office Word
    creator: Freimanis, Adam D
    dcterms_created: 2020-06-29T11:44:00Z
    dcterms_modified: 2020-06-29T12:37:00Z
    lastModifiedBy: Whetsel, Robert
    revision: 3

Semantic Facts:
  [document_authorship]
    document_author = Freimanis, Adam D (confidence: 1.0, raw: creator=Freimanis, Adam D)
    document_last_editor = Whetsel, Robert (confidence: 0.9, raw: lastModifiedBy=Whetsel, Robert)
  [document_time]
    document_created_at = 2020-06-29T11:44:00Z (confidence: 1.0, raw: dcterms_created=2020-06-29T11:44:00Z)
    document_modified_at = 2020-06-29T12:37:00Z (confidence: 1.0, raw: dcterms_modified=2020-06-29T12:37:00Z)
  [origin_environment]
    origin_software_name = Microsoft Office Word (confidence: 1.0, raw: Application=Microsoft Office Word)
    origin_software_version = 16.0000 (confidence: 1.0, raw: AppVersion=16.0000)

You only need to provide enough of the document ID to be unique - 121530ca is enough here. The output has two sections. The top section, Extractions, shows exactly what the builtin_ooxml extractor pulled from the OOXML properties: seven raw key-value pairs. The bottom section, Semantic Facts, shows how concept rules mapped those raw properties into normalized facts organized by family.

Look at the document_authorship family. The creator field became document_author with confidence 1.0, while lastModifiedBy became document_last_editor with confidence 0.9. The lower confidence on the editor reflects the fact that "last modified by" is less reliable than "creator" in some software pipelines. Every fact includes its raw source so you can trace exactly where the value came from.
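One way to picture a concept rule is as a lookup from an extractor and raw key to a family, concept key, and confidence. The Python sketch below encodes only the mappings visible in the output above. The data structure and the names ConceptRule, RULES, and derive_facts are hypothetical; how Unsterwerx actually stores and evaluates its rules is not shown in this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConceptRule:
    family: str
    concept_key: str
    confidence: float


# (extractor, raw key) -> rule, copied from the metadata show output above.
RULES = {
    ("builtin_ooxml", "creator"): ConceptRule("document_authorship", "document_author", 1.0),
    ("builtin_ooxml", "lastModifiedBy"): ConceptRule("document_authorship", "document_last_editor", 0.9),
    ("builtin_ooxml", "dcterms_created"): ConceptRule("document_time", "document_created_at", 1.0),
    ("builtin_ooxml", "dcterms_modified"): ConceptRule("document_time", "document_modified_at", 1.0),
    ("builtin_ooxml", "Application"): ConceptRule("origin_environment", "origin_software_name", 1.0),
    ("builtin_ooxml", "AppVersion"): ConceptRule("origin_environment", "origin_software_version", 1.0),
}


def derive_facts(extractor, raw):
    """Turn raw key-value pairs into semantic facts that keep their raw source."""
    facts = []
    for raw_key, value in raw.items():
        rule = RULES.get((extractor, raw_key))
        if rule is None:
            continue  # no rule in this sketch, e.g. revision in the sample
        facts.append({
            "family": rule.family,
            "concept_key": rule.concept_key,
            "value": value,
            "confidence": rule.confidence,
            "raw": f"{raw_key}={value}",
        })
    return facts
```

Keying the lookup on the extractor as well as the raw key matters because the same raw key name can mean different things in different formats.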

Step 6 - Backfill Metadata for Previously Ingested Documents

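If you ingested documents before metadata extraction was available, a plain metadata extract run backfills them, because documents without an extraction result are always processed. To also pick up changes from updated concept rules on documents that already have metadata, use the --force flag to re-extract everything.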

bash
unsterwerx metadata extract --force

This re-runs all extractors on all eligible documents, overwriting previous results. Semantic facts are rebuilt atomically for each document, so updated concept rules take effect immediately.

Without --force, the extract command skips any document that already has an OK extraction result. This makes it safe to run repeatedly - new documents get processed, existing ones are left alone.

Warning: Re-extraction reads from original source files on disk. If source files have been moved or deleted since ingestion, those documents will be counted as "Skipped (missing file)" and their existing metadata will remain unchanged.
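The skip behavior described in this step boils down to a small decision per document. The Python sketch below restates it; plan_extraction and its return values are hypothetical names, and checking for the missing file before the "already done" test is an assumption, since the text does not say which count a document lands in when both conditions apply.

```python
def plan_extraction(file_exists, has_ok_extraction, force=False):
    """Decide what an extract run would do with one document."""
    if not file_exists:
        # Nothing to read, so any existing metadata stays as it is.
        return "skipped_missing_file"
    if has_ok_extraction and not force:
        # Idempotent default: leave finished documents alone.
        return "skipped_already_done"
    # New documents, or any readable document when force is set.
    return "extract"


print(plan_extraction(True, True))              # repeat run without force
print(plan_extraction(True, True, force=True))  # forced re-extraction
```

Seen this way, --force only changes the outcome for documents that already have an OK result and still have a readable source file.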

You can also target a single document for re-extraction:

bash
unsterwerx metadata extract --document 121530ca --force

This is useful when you're testing concept rule changes against a known document before running a full corpus re-extraction.
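The examples in this guide pass a shortened document ID (121530ca) rather than the full one. A minimal Python sketch of unique-prefix resolution follows. It illustrates the general idea only: resolve_document_id is a hypothetical name, the second ID in the list is made up for the example, and the real lookup presumably runs against the Unsterwerx database rather than an in-memory list.

```python
def resolve_document_id(prefix, known_ids):
    """Return the single ID starting with prefix, or raise LookupError."""
    wanted = prefix.lower()
    matches = [doc_id for doc_id in known_ids if doc_id.lower().startswith(wanted)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise LookupError(f"no document ID starts with {prefix!r}")
    raise LookupError(f"{prefix!r} is ambiguous: {len(matches)} documents match")


KNOWN_IDS = [
    "121530ca-84aa-4d02-86f8-222dce40ffa3",  # from the metadata show example
    "121530cb-0000-4000-8000-000000000000",  # hypothetical second document
]

print(resolve_document_id("121530ca", KNOWN_IDS))
```

Refusing ambiguous prefixes instead of picking the first match is the safer design when the command that follows overwrites data, as --force does.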

Conclusion

You've extracted metadata from a mixed corpus of PDF and DOCX files, discovered 21 extractor-specific metadata keys with varying coverage, explored the software landscape that produced your documents, and inspected the full extraction and semantic fact chain for an individual document.

The concept key/family system is what makes this practical at scale. Raw metadata keys are inconsistent across formats (producer in PDF, Application in OOXML), but concept rules normalize them into a shared vocabulary. Once extracted, these semantic facts feed into downstream classification and scoring pipelines.

From here, you can use metadata alongside other Unsterwerx capabilities: