How To Find Documents by Metadata with Unsterwerx
Every file on your disk already carries a dossier. Word stamps its version into the AppVersion property, Adobe Acrobat writes the /Producer string, your phone embeds a camera model in JPEG EXIF, and every OOXML document records the original author and the person who last touched it. Most of that information never surfaces in file explorers, and none of it is searchable with desktop tools.
Unsterwerx extracts those native properties during ingestion, normalizes them into semantic facts using concept rules, and gives you a query layer on top. This tutorial walks through the metadata find command and metadata-filtered search with regex patterns. Real enterprise scenarios show where each capability earns its keep.
Prerequisites
Before starting, you need:
- Unsterwerx v0.5.4 or later on your PATH
- A corpus that has been ingested and canonicalized (unsterwerx ingest + unsterwerx canonical)
- Metadata extracted via unsterwerx metadata extract. If you have not done this yet, see How To Extract and Query Document Metadata with Unsterwerx first.
This tutorial uses a mixed corpus of 1,128 documents: PDF files plus the OOXML formats (DOCX, XLSX, PPTX), along with a few hundred PNG and JPEG images.
Step 1 - Survey the Concept Families Available to You
Before querying, find out what semantic facts actually exist in your corpus. Unsterwerx groups related concept keys into four families out of the box:
| Concept family | Concept keys | What it tells you |
|---|---|---|
document_authorship | document_author, document_title, document_last_editor, document_subject | Who wrote it, who edited it, what it is called |
document_time | document_created_at, document_modified_at | When it was created and last saved |
origin_environment | origin_software_name, origin_software_version, origin_software_component, origin_device_make, origin_device_model | What software and hardware produced the file |
dimensions | image_width, image_height, image_bit_depth, image_color_type | Raster geometry for PNG and JPEG files |
To confirm which keys your corpus actually carries, list the concept values by family:
unsterwerx metadata values --concept-family document_authorship --min-docs 3
Concept family: document_authorship
CONCEPT KEY CANONICAL VALUE TOTAL USABLE LOWCF SUPP FILE TYPES
-----------------------------------------------------------------------------------------
document_author whetsel, robert 48 48 0 0 pdf, docx
document_author freimanis, adam d 19 19 0 0 docx
document_author department of defense cio 14 14 0 0 pdf
document_last_editor whetsel, robert 57 57 0 0 docx
document_last_editor freimanis, adam d 12 12 0 0 docx
document_title acquisition plan 7 7 0 0 docx
The USABLE column counts facts that cleared their confidence floor; LOWCF counts low-confidence facts; SUPP counts facts a facet rule has suppressed. You now know who your top authors are and roughly how many documents each one touched, without opening a single file.
Step 2 - Find Every Document by a Specific Author
The canonical use case for metadata find is employee offboarding. Someone leaves, and you need a list of everything they authored or last touched. Here is how you get it.
unsterwerx metadata find \
--concept-key document_author \
--value "Whetsel, Robert"
Matches for document_author = whetsel, robert:
CPTWHETSEL_Oct2021_RST.pdf (d9a14f2c-...)
fact#231 document_author=whetsel, robert (confidence 1.00)
NBIS_PPP_v1.5.docx (570c62fd-...)
fact#488 document_author=whetsel, robert (confidence 1.00)
DoDNET_Strategy_2020.pptx (b99b9e11-...)
fact#712 document_author=whetsel, robert (confidence 1.00)
...
48 documents returned (default limit 50)
Notice the input was "Whetsel, Robert" with a capital W and a space after the comma. Unsterwerx canonicalizes the input the same way it canonicalizes extracted facts. Capitalization and whitespace variants resolve to the same whetsel, robert key before the database is queried. You do not have to know the exact casing that an OOXML creator field stored years ago.
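The exact canonicalization rules are internal to Unsterwerx, but the behavior described above can be sketched as a simple normalization pass, assuming lowercasing plus whitespace collapsing:

```python
import re

def canonicalize(value: str) -> str:
    """Lowercase, trim, and collapse internal whitespace so casing and
    spacing variants map to one lookup key. A sketch of the behavior
    described above, not Unsterwerx's actual implementation."""
    return re.sub(r"\s+", " ", value.strip()).lower()

# All of these resolve to the same key before the database is queried:
assert canonicalize("Whetsel, Robert") == "whetsel, robert"
assert canonicalize("  WHETSEL,   Robert ") == "whetsel, robert"
```

Because the same pass is applied to extracted facts at ingestion time and to your --value argument at query time, the two always meet in the middle.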
To catch everything a person touched, not just what they authored, run the same query against document_last_editor:
unsterwerx metadata find --concept-key document_last_editor --value "Whetsel, Robert"
For the offboarding checklist, export these two result sets to JSON and feed them into your retention workflow:
unsterwerx metadata find \
--concept-key document_author \
--value "Whetsel, Robert" \
--limit 1000 \
--json > authored.json
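Downstream, the two exports can be merged into one offboarding set. The sketch below assumes the --json output parses to a list of objects carrying a document_id field; the real schema in your Unsterwerx version may name the field differently:

```python
import json

def doc_ids(entries) -> set:
    """Collect document IDs from a parsed metadata find --json export.
    Assumes a list of objects with a 'document_id' field; adjust the
    key name to match your Unsterwerx version's actual schema."""
    return {e["document_id"] for e in entries}

# Usage with the two exports from the offboarding checklist:
# authored = doc_ids(json.load(open("authored.json")))
# edited   = doc_ids(json.load(open("edited.json")))
# offboarding = authored | edited   # everything the person touched
```

The set union is the point: a document the person both authored and last edited should appear once in the retention workflow, not twice.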
Step 3 - Audit Your Software Inventory
Version migration planning is painful because nobody ever inventories what produced the files sitting on the shared drive. metadata find against origin_software_name solves that.
Suppose you need to know which DOCX files were created on Word 2010 so you can prioritize them for a compatibility check against Word 365. Query by software version inside the docx file type:
unsterwerx metadata find \
--concept-key origin_software_version \
--value "14.0000" \
--file-type docx
Matches for origin_software_version = 14.0000:
Acquisition-Plan-legacy.docx (8e62a1fb-...)
fact#91 origin_software_version=14.0000 (confidence 1.00)
2012-HR-policy.docx (c72f0901-...)
fact#124 origin_software_version=14.0000 (confidence 1.00)
...
11 documents returned
Eleven files, all produced by Word 2010 (AppVersion 14.0000). That is your migration shortlist.
For PDFs, the story is messier because Adobe's pipeline chains multiple tools. The origin_software_component concept key captures the PDF producer, which is often a library rather than an application:
unsterwerx metadata find \
--concept-key origin_software_component \
--value "Acrobat Distiller 6.0.1 (Windows)"
Matches for origin_software_component = acrobat distiller 6.0.1 (windows):
2004_DoD_Memo.pdf (44219f38-...)
fact#672 origin_software_component=acrobat distiller 6.0.1 (windows) (confidence 0.90)
...
12 documents returned
Twelve PDFs ran through a 2003-era Distiller. For a forensic or compliance audit, that is a flag: these files predate modern PDF/A compliance and may not have the digital signatures your governance policy expects.
Step 4 - Match Variations With Regex Patterns
Literal matches are precise, but vendors are inconsistent. Microsoft tooling alone has stamped documents as Microsoft Office Word, Microsoft® Word 2016, Microsoft® Word for Office 365, and even Microsoft Reporting Services PDF Rendering Extension 11.0.0.0. A literal search catches one variant at a time.
Swap --value for --value-pattern and pass a regex instead:
unsterwerx metadata find \
--concept-key origin_software_name \
--value-pattern "(?i)microsoft.*word"
Matches for origin_software_name ~ /(?i)microsoft.*word/:
Acquisition-Plan.docx (570c62fd-...)
fact#18 origin_software_name=microsoft office word (confidence 1.00)
PM-Guidebook.docx (8e62a1fb-...)
fact#44 origin_software_name=microsoft office word (confidence 1.00)
2016_Strategy_Brief.pdf (a0f0794e-...)
fact#311 origin_software_name=microsoft® word 2016 (confidence 0.90)
...
221 documents matched (first 50 shown, default limit)
The (?i) makes the match case-insensitive, .* handles the variable middle part, and the result captures every form of Microsoft Word across both OOXML and PDF producers. Regex mode is ideal when you are sweeping for an ecosystem rather than a specific version.
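You can check a candidate pattern against known producer strings before running it corpus-wide. This sketch applies the same regex with Python's re module to the canonical values shown in the tutorial's own output:

```python
import re

# The same pattern passed to --value-pattern, applied to canonical values.
pattern = re.compile(r"(?i)microsoft.*word")

values = [
    "microsoft office word",
    "microsoft® word 2016",
    "microsoft® word for office 365",
    "adobe pdf library 15.0",
]
matches = [v for v in values if pattern.search(v)]
assert matches == values[:3]   # the Adobe producer is excluded
```

Note that search semantics (match anywhere in the value) apply, which is why the pattern needs no leading .* and why anchors like ^ matter when you want prefix-only matching.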
Adobe-produced PDFs are another common sweep:
unsterwerx metadata find \
--concept-key origin_software_name \
--value-pattern "^adobe" \
--file-type pdf
A caret anchor at the start keeps the pattern tight. Results include Adobe PDF Library 15.0, Adobe Experience Manager forms output, and Adobe LiveCycle Designer 11.0, but not files where "adobe" appears as a substring deeper in the value.
Note: Regex runs in Rust after the SQL prefilter, so complex patterns do not slow down the database scan. The SQL side still applies file-type and extractor predicates first, then enforces the confidence floor.
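The two-stage design the note describes can be sketched as follows: cheap structured predicates first, the regex only on survivors. This is a behavioral illustration, not the actual query planner:

```python
import re

def find(facts, concept_key, value_pattern, file_type=None, floor=0.5):
    """Two-stage filter: structured predicates (the SQL side) narrow
    the candidate set, then the regex (the Rust side) runs only on
    rows that survive. A sketch of the described pipeline."""
    prefiltered = [
        f for f in facts
        if f["concept_key"] == concept_key
        and (file_type is None or f["file_type"] == file_type)
        and f["confidence"] >= floor
    ]
    rx = re.compile(value_pattern)
    return [f for f in prefiltered if rx.search(f["value"])]

facts = [
    {"concept_key": "origin_software_name", "file_type": "docx",
     "confidence": 1.0, "value": "microsoft office word"},
    {"concept_key": "origin_software_name", "file_type": "pdf",
     "confidence": 0.9, "value": "adobe pdf library 15.0"},
]
assert len(find(facts, "origin_software_name", "^adobe", file_type="pdf")) == 1
```

The payoff is that an expensive or backtracking-heavy pattern only ever sees the rows that already passed the structured predicates.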
Step 5 - Search by Date Ranges
Dates are where metadata outruns full-text search. Full-text cannot answer "which documents were created in Q2 2020?" because the date is in the file properties, not the content. The search command accepts --created-from, --created-to, --modified-from, and --modified-to filters that query the document_time family directly.
Find everything created during the first wave of the pandemic:
unsterwerx search \
--created-from 2020-03-01 \
--created-to 2020-05-31 \
--limit 20
Search Results (20 matches)
══════════════════════════════════════════════════════════════
1. COVID_contingency_plan.docx [c14b8f23]
· document_created_at = 2020-03-18 14:22:00
2. Q2_2020_budget_revision.xlsx [a77b1e09]
· document_created_at = 2020-04-02 09:17:00
3. WFH_policy_v1.docx [d811c52e]
· document_created_at = 2020-03-14 16:04:00
...
══════════════════════════════════════════════════════════════
This is metadata-only mode: no text query, just date filters. Results come back ordered by filename and document ID rather than text rank. Date bounds are inclusive, and date-only values expand to the full day (so 2020-05-31 includes everything through 23:59:59).
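The end-of-day expansion for date-only upper bounds can be sketched with the standard library; this mirrors the inclusive behavior described above but is not Unsterwerx's code:

```python
from datetime import datetime, timedelta

def expand_day_end(date_str: str) -> datetime:
    """Expand a date-only upper bound to the end of that day, so a
    bound like 2020-05-31 includes everything through 23:59:59.
    A sketch of the inclusive-bound behavior described above."""
    day = datetime.strptime(date_str, "%Y-%m-%d")
    return day + timedelta(hours=23, minutes=59, seconds=59)

assert expand_day_end("2020-05-31") == datetime(2020, 5, 31, 23, 59, 59)
```

Lower bounds need no expansion: a date-only --created-from already starts at midnight.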
Modified dates work the same way and answer a different question. A file created in 2018 but last modified in 2024 was likely revised under the current policy regime:
unsterwerx search \
--modified-from 2024-01-01 \
--created-to 2019-12-31 \
--file-type docx
Combining both filters applies AND logic: created before 2020, but touched again in 2024 or later. The classic "old policy document that someone quietly updated."
Step 6 - Stack Text and Metadata Filters Together
The full power of search shows up when you stack filters. This is where metadata turns into a targeted query language.
Here is a compliance scenario: find every cybersecurity document authored by a specific person, produced with Microsoft Word, created in December 2017:
unsterwerx search "cybersecurity" \
--author "Department of Defense CIO" \
--origin-software "Microsoft Office Word" \
--created-from 2017-12-01 \
--created-to 2017-12-31
Search Results (3 matches)
══════════════════════════════════════════════════════════════
1. DoD_Cybersecurity_RMF_Dec2017.docx [b8e01573]
...DOD Cybersecurity and the Risk Management Framework...
· document_author = department of defense cio
· origin_software_name = microsoft office word
· document_created_at = 2017-12-18 11:05:00
2. RMF_Policy_Update_2017.docx [bd51b3ed]
...Department of Defense Program Manager's Guidebook for
Integrating the Cybersecurity Risk Management Framework...
· document_author = department of defense cio
· origin_software_name = microsoft office word
· document_created_at = 2017-12-22 09:48:00
3. DoD_CIO_memo_cyber_2017.docx [afb068e7]
...Cybersecurity Strategy UNCLASSIFIED National Background
Investigation System...
· document_author = department of defense cio
· origin_software_name = microsoft office word
· document_created_at = 2017-12-29 15:12:00
══════════════════════════════════════════════════════════════
Three documents. Without metadata filters, the text query "cybersecurity" returned dozens of matches. Layering author and software on top of a date window cut the noise by an order of magnitude and left only the files that match the actual scenario.
Filter logic inside the same concept family is OR, across families is AND. Pass --author "Alice" --author "Bob" to find documents authored by either, and it still AND-combines with your date window or file type filter.
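That OR-within, AND-across rule can be sketched as a predicate evaluator. Representing the file-type filter as just another key here is an illustration; internally it is a separate predicate, not a concept family:

```python
def matches(doc, filters):
    """Evaluate stacked search filters against one document:
    OR among values requested for the same key, AND across keys.
    doc maps key -> set of canonical values; filters maps
    key -> list of accepted values. A behavioral sketch."""
    return all(
        any(v in doc.get(key, set()) for v in wanted)
        for key, wanted in filters.items()
    )

doc = {"document_author": {"alice"}, "file_type": {"docx"}}
# Equivalent of: --author Alice --author Bob --file-type docx
filters = {"document_author": ["alice", "bob"], "file_type": ["docx"]}
assert matches(doc, filters)
assert not matches(doc, {"document_author": ["carol"]})
```

Reading it as "any of these authors AND this file type AND this date window" is the right mental model when you compose a query.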
Step 7 - Audit Low-Confidence Facts Separately
Not every metadata fact is equally trustworthy. When the OOXML creator field is obviously a default placeholder like "admin" or "user", the concept rule assigns a lower confidence score. When a PDF's /Producer string is "unknown" or blank, the extracted fact is flagged low-confidence. By default, both find and search exclude those rows so you get clean results.
Sometimes the noise is the point. To see only the messy metadata that your concept rules flagged as untrustworthy, use --match-quality low-confidence:
unsterwerx metadata find \
--concept-key document_author \
--value "admin" \
--match-quality low-confidence
Matches for document_author = admin:
legacy_import_batch_042.docx (f19a4d28-...)
fact#1201 document_author=admin (confidence 0.30) [low-confidence]
legacy_import_batch_051.docx (21b0e411-...)
fact#1243 document_author=admin (confidence 0.30) [low-confidence]
...
18 documents returned
Eighteen documents whose "author" is literally the string admin. That is a data hygiene problem, not a person. You may want to flag these for manual re-attribution or exclude them from author-based retention policies.
The three match-quality modes give you the full triage workflow:
- usable (default): clean, trustworthy facts only
- any: every fact regardless of confidence, for completeness
- low-confidence: only the flagged rows, for auditing
You can also stack --min-confidence 0.9 as a numeric floor to be stricter than the default. A 0.9 threshold drops everything below 90% confidence, which is useful when you are about to act on results in an automated pipeline.
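Applied in a pipeline, the numeric floor is just a threshold filter over the per-fact confidence scores; a minimal sketch:

```python
def apply_floor(facts, min_confidence=0.9):
    """Drop facts below a numeric confidence floor, mirroring what
    --min-confidence 0.9 does before you act on results in an
    automated pipeline. A sketch, not Unsterwerx's code."""
    return [f for f in facts if f["confidence"] >= min_confidence]

facts = [{"value": "whetsel, robert", "confidence": 1.0},
         {"value": "admin", "confidence": 0.3}]
assert apply_floor(facts) == [facts[0]]   # the placeholder author drops out
```

The same threshold that feels too strict for exploration is exactly right when a retention action hangs on the result.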
Step 8 - Collapse Identity Variations With Metadata Aliases
One of the messiest realities of enterprise corpora is that the same person appears under half a dozen spellings. Robert Whetsel, Whetsel, Robert, rwhetsel, Robert C. Whetsel, and R. C. Whetsel are the same human, and yet each one produces a different canonical value by default.
Unsterwerx handles this through the metadata alias system. You define an alias rule that maps variant canonical values to a single target value, and both find and search resolve through the alias table automatically:
unsterwerx rules metadata alias add \
--concept-key document_author \
--from "r. c. whetsel" \
--to "whetsel, robert"
unsterwerx rules metadata alias add \
--concept-key document_author \
--from "rwhetsel" \
--to "whetsel, robert"
unsterwerx rules metadata alias add \
--concept-key document_author \
--from "robert c. whetsel" \
--to "whetsel, robert"
After a rebuild, a single search captures every spelling:
unsterwerx rules metadata rebuild --all
unsterwerx metadata find \
--concept-key document_author \
--value "Whetsel, Robert"
The result now includes documents that originally stored rwhetsel, R. C. Whetsel, and Robert C. Whetsel alongside the canonical form. For compliance work, this is the difference between finding 48 documents and finding 73.
Aliases also work for software. Map LibreOffice 7.5, libreoffice, and LibO all to libreoffice, and your version migration audit no longer misses the files some user opened on a different machine last year.
Warning: Alias rules are a controlled vocabulary. Wrong mappings silently merge identities. Before you rebuild, run unsterwerx metadata values --concept-key document_author to confirm you are collapsing real duplicates and not distinct people with similar names.
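Conceptually, resolution through the alias table is a per-concept-key lookup that falls back to the value itself. A sketch using the mappings defined above:

```python
ALIASES = {
    # concept_key -> {variant canonical value -> target canonical value}
    "document_author": {
        "r. c. whetsel": "whetsel, robert",
        "rwhetsel": "whetsel, robert",
        "robert c. whetsel": "whetsel, robert",
    },
}

def resolve(concept_key: str, canonical_value: str) -> str:
    """Follow the alias table so variant spellings collapse to one
    identity before matching. A sketch of the resolution step, not
    the engine's actual implementation."""
    return ALIASES.get(concept_key, {}).get(canonical_value, canonical_value)

assert resolve("document_author", "rwhetsel") == "whetsel, robert"
assert resolve("document_author", "whetsel, robert") == "whetsel, robert"
```

The fallback line is why unaliased values keep working untouched: an unknown spelling simply resolves to itself.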
Step 9 - Find Documents by Device (Images)
The builtin_image extractor pulls EXIF hardware fields out of PNG and JPEG files, giving you two more concept keys: origin_device_make and origin_device_model. In a corpus that includes photos, this unlocks a forensic query you cannot get anywhere else.
Find every photo shot on a specific phone model:
unsterwerx metadata find \
--concept-key origin_device_model \
--value "iPhone 12 Pro"
Matches for origin_device_model = iphone 12 pro:
IMG_4021.jpeg (80ca3812-...)
fact#1802 origin_device_model=iphone 12 pro (confidence 1.00)
site_survey_north_wall.jpeg (ba9f2d04-...)
fact#1844 origin_device_model=iphone 12 pro (confidence 1.00)
...
23 documents returned
Combined with image dimensions, you can also isolate raster files by resolution category. Query the dimensions family to find assets that are too small for print reproduction or too large for a document template:
unsterwerx metadata values --concept-family dimensions
Concept family: dimensions
CONCEPT KEY CANONICAL VALUE TOTAL USABLE LOWCF SUPP FILE TYPES
---------------------------------------------------------------------------------
image_width 1920 44 44 0 0 jpeg, png
image_width 3024 31 31 0 0 jpeg
image_width 1024 18 18 0 0 png
image_height 1080 42 42 0 0 jpeg, png
image_height 4032 31 31 0 0 jpeg
image_color_type rgb 102 102 0 0 jpeg, png
image_color_type rgba 18 18 0 0 png
image_bit_depth 8 115 115 0 0 jpeg, png
Thirty-one photos at 3024×4032, the native iPhone camera resolution. A design team could use this to separate camera originals from web-sized derivatives without opening a single file.
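Bucketing by pixel count is one way to turn the raw dimensions into categories a design team can act on. The thresholds below are illustrative, not Unsterwerx defaults:

```python
def resolution_class(width: int, height: int) -> str:
    """Bucket an image by pixel count to separate camera originals
    from web-sized derivatives. Thresholds are illustrative."""
    pixels = width * height
    if pixels >= 8_000_000:
        return "camera-original"
    if pixels >= 1_000_000:
        return "screen-resolution"
    return "thumbnail-or-web"

assert resolution_class(3024, 4032) == "camera-original"   # native iPhone frame
assert resolution_class(1920, 1080) == "screen-resolution"
```

Feed it the image_width and image_height facts from the dimensions family and you have a resolution inventory without rendering a single pixel.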
Step 10 - Build Classification Rules on Top of Metadata
Once metadata facts are extracted, they become signals for classification. A standard filename-pattern rule catches tax-2020.pdf but misses financial_report.pdf that was authored by the CFO and created on the corporate Mac. Metadata predicates close that gap.
Add a classification rule that fires when any of the four named people authored the document, regardless of filename:
unsterwerx rules add \
--name "executive-authored" \
--class executive \
--priority 15 \
--metadata-predicate '{"concept_key":"document_author","op":"regex_match","value":"(?i)^(whetsel|freimanis|barratt|sabbage)"}'
Then re-classify:
unsterwerx classify
Classification Summary
══════════════════════════════════
Documents classified: 731
Rules applied: 1,082
Errors: 0
══════════════════════════════════
Every document whose document_author fact matches one of those four names now carries an executive class, which can then drive a retention policy. Metadata predicates support both equals (with a value list) and regex_match operators, and they honor the same canonicalization and alias resolution as metadata find.
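The predicate JSON maps naturally onto a small evaluator. This is a behavioral sketch of the two supported operators, not the rule engine's actual code, and it omits the canonicalization and alias resolution the real engine applies first:

```python
import re

def predicate_matches(predicate: dict, fact: dict) -> bool:
    """Evaluate a classification metadata predicate against one fact,
    supporting the equals (value list) and regex_match operators
    described above. A sketch, not the rule engine itself."""
    if fact["concept_key"] != predicate["concept_key"]:
        return False
    if predicate["op"] == "equals":
        return fact["value"] in predicate["value"]
    if predicate["op"] == "regex_match":
        return re.search(predicate["value"], fact["value"]) is not None
    return False

rule = {"concept_key": "document_author", "op": "regex_match",
        "value": r"(?i)^(whetsel|freimanis|barratt|sabbage)"}
assert predicate_matches(rule, {"concept_key": "document_author",
                                "value": "whetsel, robert"})
```

Because the rule fires on the fact's canonical value, every alias-collapsed spelling of those four names triggers it.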
When you update alias rules or concept rules after classification, the documents whose metadata depends on them are marked stale. Both metadata find and metadata-filtered search skip stale documents and print a notice. Refresh with:
unsterwerx rules metadata rebuild --all
Conclusion
You have queried your corpus by author, by software, by version, by regex patterns over software families, by date windows, by device model, and by stacked combinations of all of the above. You have flagged low-confidence facts for audit, collapsed identity variations with aliases, and wired metadata predicates into classification rules that drive retention.
Here is what makes this practical at enterprise scale. The concept rule layer takes eighteen different raw metadata keys (PDF /Producer, OOXML Application, EXIF Model, and the rest) and maps them into a shared vocabulary of fifteen semantic keys. Your query runs against the shared vocabulary, not the raw source. A single --author "Alice Smith" filter matches whether the file is a DOCX that stored the author in dc:creator, a PDF that stored it in /Author, or a spreadsheet that stored it in the OOXML core properties.
Metadata queries answer questions that text search cannot touch:
- Which documents did a departing employee author or last edit?
- Which files were created with a software version we are sunsetting?
- Which PDFs ran through an outdated Acrobat pipeline?
- Which files were created during a specific compliance window?
- Which photos came from a specific camera or phone?
- Which authors hide behind alias spellings in our document history?
From here, three directions are worth exploring:
- Use metadata predicates in classification rules to drive retention and archival policies on author or software signals.
- Combine metadata filters with similarity analysis to find near-duplicate documents produced by different authors or different software versions.
- Export metadata query results as JSON and feed them into your governance or compliance pipeline with --json on any find or search command.