How To Search and Compare Documents with Unsterwerx
Once your documents have been ingested and normalized into the Universal Data Set, the real work begins: finding what you need and understanding how documents relate to each other. In this tutorial, you will rebuild the search index, run full-text queries across all your documents regardless of original format, detect duplicates and near-duplicates automatically, and compare specific document pairs to see exactly what changed.
Prerequisites
Before starting, make sure you have:
- Unsterwerx v0.5.4 or later installed and on your PATH
- A dataset already ingested and normalized (see How To Ingest and Normalize Enterprise Documents with Unsterwerx)
- Canonical extraction completed (unsterwerx canonical)
This tutorial uses a real dataset of 1,042 documents spanning PDF, DOCX, PPTX, and XLSX files.
Step 1 - Rebuild the Search Index
After ingesting and normalizing documents, you need to build (or rebuild) the FTS5 full-text search index. This index is what makes sub-second search possible across thousands of documents.
Run the reindex command:
unsterwerx reindex
Rebuilding FTS5 index...
Reindex Summary
══════════════════════════════════
Canonical docs: 1042
Indexed in FTS5: 1042
Missing content: 0
Total in FTS5: 1042
══════════════════════════════════
All 1,042 canonical documents are now indexed. The Missing content: 0 line confirms that every canonical record has its content available in storage.
Note: You can safely run reindex at any time. It rebuilds from scratch. If you ingest new documents later, run unsterwerx canonical followed by unsterwerx reindex to include them.
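To make the summary counters concrete, here is a minimal Python sketch of a from-scratch FTS5 rebuild using the standard-library sqlite3 module. It is an illustration only, not how Unsterwerx is implemented: the docs_fts table, the rebuild_index function, and the dictionary standing in for canonical storage are all hypothetical, and the sketch assumes your Python's SQLite build includes FTS5.

```python
import sqlite3


def rebuild_index(conn, canonical_docs):
    """Drop and recreate the FTS5 table, then index every doc that has content.

    canonical_docs maps a document ID to (title, text). A text of None stands
    for canonical content that is missing from storage (hypothetical model).
    """
    conn.execute("DROP TABLE IF EXISTS docs_fts")
    conn.execute(
        "CREATE VIRTUAL TABLE docs_fts USING fts5(doc_id UNINDEXED, title, body)"
    )
    indexed = missing = 0
    for doc_id, (title, body) in canonical_docs.items():
        if body is None:
            missing += 1  # counted, but nothing to index
            continue
        conn.execute(
            "INSERT INTO docs_fts (doc_id, title, body) VALUES (?, ?, ?)",
            (doc_id, title, body),
        )
        indexed += 1
    conn.commit()
    total = conn.execute("SELECT count(*) FROM docs_fts").fetchone()[0]
    return {
        "canonical": len(canonical_docs),
        "indexed": indexed,
        "missing": missing,
        "total": total,
    }


conn = sqlite3.connect(":memory:")
docs = {
    "77f10077": ("Small Business", "Resiliency resources for business"),
    "f190f6b4": ("Slide 1", "Project financial management"),
    "deadbeef": ("Orphan record", None),  # made-up ID with no stored content
}
summary = rebuild_index(conn, docs)
print(summary)
```

Because the table is dropped and recreated on every call, running the function twice yields the same counts, which mirrors why a rebuild-from-scratch command is safe to repeat.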
Step 2 - Search Across All Your Documents
Here is the key insight: because every document has been normalized to the Universal Data Set through its format-specific NAC (Normalized Application Container), search works identically across PDF, DOCX, PPTX, and XLSX files. You do not need to know what format a document was originally in.
Search for documents related to business planning:
unsterwerx search "business plan"
Search Results (20 matches)
══════════════════════════════════════════════════════════════
1. Small Business [77f10077]
...Resiliency resources for <b>business</b> 1. FEMA's <b>Business</b>
Continuity <b>Planning</b> Suite https://www.ready.gov/<b>business</b>-
continuity-<b>planning</b>-suite...
2. Guidance for Review and [8c9370e0]
...BEA <b>Business</b> Enterprise Architecture BMA <b>Business</b>
Mission Area BPM <b>Business</b> Process Management BPR
<b>Business</b>...
3. Slide 1 [f190f6b4]
...Project financial management includes all aspects related to
effective <b>business</b> and financial management throughout the
project life cycle. <b>Business</b> case development <b>Planning</b>,
budgeting, forecasting...
4. Department of Defense [b99b9e11]
...Figure 1 – Integrated <b>Business</b> <b>Plan</b> Framework In examining
defense <b>business</b> system portfolios...
...
══════════════════════════════════════════════════════════════
The search returns the top twenty matches from across the corpus. Each result shows:
- The document title and an 8-character ID prefix in brackets (e.g., 77f10077)
- A content snippet with <b> tags highlighting where the query matched
- Results ranked by FTS5 relevance
Notice that result #3 ("Slide 1") came from a PowerPoint file, while result #1 came from a PDF. The search does not care about the original format.
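If you want to see how ranked results and <b>-tagged snippets can come out of FTS5, the following Python sketch reproduces the general shape with the standard-library sqlite3 module. The table name, the search helper, and the sample rows are hypothetical. The porter tokenizer is also an assumption, chosen because the output above highlights "Planning" for the query term "plan"; the source does not say which tokenizer Unsterwerx uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: a stemming tokenizer, so "planning" matches the query term "plan".
conn.execute(
    "CREATE VIRTUAL TABLE docs_fts USING fts5(title, body, tokenize='porter')"
)
conn.executemany(
    "INSERT INTO docs_fts (title, body) VALUES (?, ?)",
    [
        ("Small Business",
         "Resiliency resources for business: the Business Continuity Planning Suite"),
        ("Slide 1",
         "Business case development, planning, budgeting, forecasting"),
        ("Form W-2", "Include this tax on Form 1040 or 1040-SR"),
    ],
)


def search(conn, query, limit=20):
    """Return (title, snippet) rows, best match first, capped at `limit`."""
    sql = (
        "SELECT title, snippet(docs_fts, 1, '<b>', '</b>', '...', 12) "
        "FROM docs_fts WHERE docs_fts MATCH ? ORDER BY rank LIMIT ?"
    )
    return conn.execute(sql, (query, limit)).fetchall()


results = search(conn, "business plan")
for title, snip in results:
    print(f"{title}: {snip}")
```

The query string is passed as two bare terms, which FTS5 treats as an implicit AND, so only rows containing both stems are returned. Nothing in the query refers to a file format, which is the point of searching normalized text.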
Step 3 - Limit and Refine Results
Twenty results is a good default, but sometimes you want a tighter list. Use --limit to control how many results come back.
Search for tax-related documents with a limit of 5:
unsterwerx search "tax" --limit 5
Search Results (5 matches)
══════════════════════════════════════════════════════════════
1. a. Employee's Social Security Number [e6d22e4d]
...Include this <b>tax</b> on Form 1040 or 1040-SR. See "Other
<b>Taxes</b>" in the Forms 1040 and 1040-SR instructions.
B - Uncollected Medicare <b>tax</b> on tips...
2. a. Employee's Social Security Number [85565499]
...Include this <b>tax</b> on Form 1040 or 1040-SR. See "Other
<b>Taxes</b>" in the Forms 1040 and 1040-SR instructions...
3. a. Employee's Social Security Number [d0887c57]
...Include this <b>tax</b> on Form 1040 or 1040-SR...
4. a. Employee's Social Security Number [5f23f24e]
...Include this <b>tax</b> on Form 1040 or 1040-SR...
5. Schwab One® Account of [5f7cc7c3]
Schwab One® Account of Account Number <b>TAX</b> YEAR 2020
ROBERT CARLTON WHETSEL & 2846-9540 RUTH ALISON BARRATT
JT TEN FORM 1099 COMPOSITE...
══════════════════════════════════════════════════════════════
Four W-2 forms and a 1099 composite statement. This is already useful: those four W-2 results look nearly identical, which suggests they might be duplicates or year-over-year variants. The similarity analysis in the next step will confirm that automatically.
You can also search for a person's name. Searching for "whetsel" returns 20 matches spanning military evaluations, financial statements, book chapters, contact lists, and presentation slides:
unsterwerx search "whetsel"
Search Results (20 matches)
══════════════════════════════════════════════════════════════
1. Calculations [b441bb4d]
...Board of Directors & Advisors | | BOD | | Sophie Sabbage |
| <b>Whetsel</b> | | Templer | ## Cap Table (Current) | David
Levine, Founder & CEO | ...| Robert <b>Whetsel</b>...
2. HQDA#: [cfb70357]
...CPT <b>Whetsel</b> also published a paper for IEEE; R. C.
<b>Whetsel</b> and Y. Qu, "Quantifying the impact of big data's
variety,"...
3. + [a0f0794e]
...Net Portfolio Value: $54,280.19 ## DR ROBERT C <b>WHETSEL</b>
AND ## DR RUTH A BARRATT JTWROS...
...
══════════════════════════════════════════════════════════════
A single query finds a name across spreadsheets, PDFs, PowerPoint decks, and Word documents. That is the power of searching a normalized Universal Data Set instead of searching each format separately.
Step 4 - Find Duplicate and Similar Documents
Searching finds documents by content. Similarity analysis finds documents that are related to _each other_. This is a different question: not "which documents mention X?" but "which documents are copies or near-copies of each other?"
Unsterwerx uses MinHash combined with Locality-Sensitive Hashing (LSH) for this analysis. The key advantage: instead of comparing every document pair (which would be O(n^2) and extremely slow), LSH narrows the comparison to pairs that are likely to be similar, so the analysis scales roughly linearly with collection size.
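For readers who have not met MinHash and LSH before, here is a compact Python sketch of the general technique. It is a textbook-style illustration, not Unsterwerx's code: the shingle size, the 64 hash functions, the 16 bands, and every function name are assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Split text into overlapping k-word shingles (always at least one)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash(shingle_set, num_perm=64):
    """One minimum per salted hash function; similar sets share many minima."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def candidate_pairs(signatures, bands=16):
    """LSH banding: documents sharing any full band land in the same bucket."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


texts = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy dog",
    "c": "quarterly balance sheet shows total assets and equity",
}
sigs = {doc_id: minhash(shingles(t)) for doc_id, t in texts.items()}
pairs = candidate_pairs(sigs)
print(pairs, estimate_jaccard(sigs["a"], sigs["b"]))
```

Only documents that collide in at least one bucket are ever compared, which is how the pairwise work stays far below the full n-by-n comparison for collections where most documents are unrelated.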
Run similarity with the default threshold of 0.3:
unsterwerx similarity
Running canonical extraction (if needed)...
Running similarity analysis...
Similarity Analysis
══════════════════════════════════════════
Documents processed: 1042
Candidate pairs: 451
Exact duplicates: 83
Threshold: 0.30
Run ID: 0d51c...
══════════════════════════════════════════
Top Pairs:
1.000 Identifying and Safeguardin... <-> Phishing Awareness Certific...
1.000 Kove Solution - 44 Compute ... <-> Kove Solution - 44 Compute ...
1.000 MIT SUS Takeda - April, 202... <-> MIT SUS Takeda - April^J 20...
1.000 CPTWHETSEL_Oct2021_RST.pdf <-> CPTWHETSEL_Sep2021_RST.pdf
1.000 2013-09-13_DoD_Strategy_for... <-> 2013-09-13_DoD_Strategy_for...
1.000 ACME_business_metrics_06242... <-> 1 page marketing plan.xlsx
1.000 value_prop_CDO_v3.docx <-> value_prop_CDO_final.docx
1.000 Blank RFO.PDF <-> 2020SEPT_WHETSEL_AT_RFO.pdf
1.000 PitchDeck Template - old (1... <-> PitchDeck Template - old.pptx
1.000 barratt_W2-1619016848.pdf <-> RBarratt2020W2.pdf
1.000 2015_National_Military_Stra... <-> 2015_National_Military_Stra...
...
The headline number: 83 exact duplicates found automatically across 1,042 documents. No manual review needed.
Look at what the analysis found:
- ACME_business_metrics_06242... and 1 page marketing plan.xlsx have a Jaccard score of 1.000, meaning identical content despite completely different filenames. Someone renamed the file.
- value_prop_CDO_v3.docx and value_prop_CDO_final.docx are content-identical. The "v3" _was_ the final.
- Blank RFO.PDF and 2020SEPT_WHETSEL_AT_RFO.pdf match perfectly. The blank template and a filled-out form have the same canonical text.
- barratt_W2-1619016848.pdf and RBarratt2020W2.pdf are the same W-2 saved under two names.
A Jaccard score of 1.000 means the two documents produce identical text after normalization. This works across formats: a .csv and .xlsx with the same data will score 1.000 because the NAC for each format produces the same canonical output.
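As a small worked example of what a Jaccard score measures, the Python sketch below computes the exact value over word shingles: the size of the intersection divided by the size of the union. The normalization step (lowercasing and collapsing whitespace) and the three-word shingle size are assumptions for illustration; the source does not describe Unsterwerx's exact normalization.

```python
def shingles(text, k=3):
    """Lowercase, collapse whitespace, and return the set of k-word shingles."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def jaccard(text_a, text_b):
    """|A intersect B| / |A union B| over the two shingle sets."""
    a, b = shingles(text_a), shingles(text_b)
    return len(a & b) / len(a | b)


# Same content, different layout and casing: identical after normalization.
export_one = "Net Income   -48,092.57\nTotal Equity  -23,879.05"
export_two = "net income -48,092.57 total equity -23,879.05"
# Shares only the opening words with the other two.
variant = "net income -48,092.57 total assets -23,879.05"

print(jaccard(export_one, export_two))
print(jaccard(export_one, variant))
```

The first pair scores exactly 1.0 because layout differences vanish during normalization, while the variant scores lower because only some of its shingles overlap.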
Step 5 - Tune the Similarity Threshold
The default threshold of 0.3 catches a wide net of related documents. If you only care about very close matches, raise it.
Run with a threshold of 0.8 and limit the top pairs to 10:
unsterwerx similarity --threshold 0.8 --top 10
Running canonical extraction (if needed)...
Running similarity analysis...
Similarity Analysis
══════════════════════════════════════════
Documents processed: 1042
Candidate pairs: 184
Exact duplicates: 83
Threshold: 0.80
Run ID: e878e...
══════════════════════════════════════════
Top Pairs:
1.000 RovalryMarketingMay14.docx.... <-> RovalryMarketingMay14 2.doc...
1.000 NBIS_CCEP Draft Artifact Te... <-> NBIS_CCEP Draft Artifact Te...
1.000 RCP Tech Requirements - Dec... <-> RCP Tech Requirements - Dec...
1.000 CPTWHETSEL_March2022_RST (1... <-> RST FORM Vers.20190427.pdf
1.000 Indeco technology platform ... <-> Indeco technology platform ...
1.000 CPTWHETSEL_Oct2021_RST.pdf <-> CPTWHETSEL_Sep2021_RST.pdf
1.000 MIT SUS Takeda - April, 202... <-> MIT SUS Takeda - April^J 20...
1.000 CPTWHETSEL_Oct2021_RST.pdf <-> 151 TIOG_RST Form v5.pdf
1.000 DoDNET and DISANET.pptx <-> DoDNET and DISANET (002).pptx
1.000 CPT Whetsel RPA_IndividualO... <-> CPT Whetsel RPA_IndividualO...
Candidate pairs dropped from 451 to 184. The 83 exact duplicates remain (those always score 1.0 regardless of threshold), but the "loosely similar" pairs below 0.8 are filtered out.
Here is a rough guide for threshold values:
- 0.3 (default) - Broad net. Catches documents that share some structural overlap, good for discovery.
- 0.5 - Moderate. Documents that share significant content but may have different sections.
- 0.8 - Strict. Near-identical documents with minor edits or version differences.
- 1.0 - Exact duplicates only.
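The effect of the threshold can be sketched in a few lines of Python. The scores below are made up for illustration, and the sketch assumes a pair is kept when its score is greater than or equal to the threshold; the source does not state whether the comparison is inclusive.

```python
# Hypothetical (doc_a, doc_b, jaccard_score) triples, not real Unsterwerx output.
scored_pairs = [
    ("a", "b", 1.00),
    ("c", "d", 1.00),
    ("e", "f", 0.85),
    ("g", "h", 0.55),
    ("i", "j", 0.32),
]


def filter_pairs(pairs, threshold):
    """Keep pairs whose score meets the threshold (assumed inclusive)."""
    return [p for p in pairs if p[2] >= threshold]


counts = {t: len(filter_pairs(scored_pairs, t)) for t in (0.3, 0.5, 0.8, 1.0)}
for threshold, count in counts.items():
    print(f"threshold {threshold:.1f}: {count} pairs")
```

Raising the threshold removes the loosely similar pairs first, while pairs scoring 1.0 survive every setting.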
Note: You can persist your preferred threshold with unsterwerx config set similarity.threshold 0.8 so you do not have to pass the flag every time.
Step 6 - Compare Two Documents Side-by-Side
Similarity tells you _that_ two documents are related. The diff command tells you _how_ they differ. This is where you see the actual structural changes between two document versions.
Start by diffing two exact duplicates. Use the document IDs from the similarity output:
unsterwerx diff --doc-a <doc-a-id> --doc-b <doc-b-id>
For a pair that scored 1.000:
Documents are identical.
No surprises there. Now diff a pair that is similar but not identical. Here is the output for two financial XLSX files with related but different content:
unsterwerx diff --doc-a <doc-a-id> --doc-b <doc-b-id>
@@ -8,9 +8,7 @@
Profit andLoss3
-A/PAging Detail4
-
-Expenses byVendor Summary5
+Balance Sheet4
Profit andLoss
@@ -38,61 +36,27 @@
| NET OPERATING INCOME | -48,092.57 |
| NET INCOME | $ -48,092.57 |
-A/PAging Detail
-
-All Dates
-
-| |
-| --- |
-| |
-| |
-| This report contains no data for your specified date range. |
+Balance Sheet
-Expenses byVendor Summary
-
All Dates
| | Total |
| --- | --- |
-| Amazon | 856.46 |
-| AMTRAK | 282.00 |
-| Best Buy | 572.31 |
-| BIT DEFENDER.COM | 104.98 |
-| Cafe Nola | 64.27 |
-| Chase | 1,306.38 |
-| Columbia College | 1,039.00 |
-| Comcast | 416.57 |
-| Dell Sales and Service | 108.99 |
-| Godaddy.com | 1,078.99 |
-| Harvard Bus Publishing | 242.12 |
...
+| ASSETS | |
+| Current Assets | |
+| Bank Accounts | |
+| Checking | -23,879.05 |
+| Total Bank Accounts | -23,879.05 |
+| Total Current Assets | -23,879.05 |
+| TOTAL ASSETS | $ -23,879.05 |
+| LIABILITIES AND EQUITY | |
+| Equity | |
+| Owner's Investment | 24,213.52 |
+| Net Income | -48,092.57 |
+| Total Equity | -23,879.05 |
+| TOTAL LIABILITIES AND EQUITY | $ -23,879.05 |
This is a structural diff of two QuickBooks financial exports in XLSX format. Document A contains a Profit & Loss statement, an A/P Aging Detail, and an Expenses by Vendor Summary. Document B replaces the aging and vendor reports with a Balance Sheet.
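The diff shows the exact line-by-line changes in unified diff format. Lines prefixed with - exist only in document A; lines prefixed with + exist only in document B. Context lines (no - or + prefix) appear in both. Each @@ header gives the starting line and line count of the hunk in document A, then in document B.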
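If the unified diff format is new to you, Python's standard-library difflib produces the same notation and makes a convenient sandbox. The two line lists below are simplified stand-ins for the canonical text of the two exports, not the real documents, and the doc-a and doc-b labels are placeholders.

```python
import difflib

# Simplified stand-ins for the canonical text of each document.
doc_a = [
    "Profit and Loss",
    "A/P Aging Detail",
    "Expenses by Vendor Summary",
    "| NET INCOME | $ -48,092.57 |",
]
doc_b = [
    "Profit and Loss",
    "Balance Sheet",
    "| NET INCOME | $ -48,092.57 |",
]

diff = list(
    difflib.unified_diff(doc_a, doc_b, fromfile="doc-a", tofile="doc-b", lineterm="")
)
print("\n".join(diff))
```

In the printed result, the shared first and last lines appear as context, the two reports unique to document A carry a - prefix, and the Balance Sheet line carries a + prefix.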
This is what makes the diff command powerful for financial documents: you can see that two related spreadsheets share the same net income figure ($-48,092.57) but present completely different views of the business. Without normalization, comparing an XLSX to another XLSX at the content level would require opening both in a spreadsheet application and manually scanning for differences.
Step 7 - Batch-Compare All Similar Pairs
Diffing one pair at a time is fine for investigation. When you want diffs for every candidate pair from the similarity analysis, use --all:
unsterwerx diff --all
This computes and stores diffs for all candidate pairs identified by the most recent similarity run. Each diff is compressed and stored in the local Shared Sandbox, so you only pay the computation cost once. Subsequent runs skip pairs that already have stored diffs.
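The compute-once behavior can be pictured with the following Python sketch, which stores each compressed diff under its pair key and skips pairs it has already seen. This is a hypothetical model of the idea, not the Shared Sandbox's actual storage format: the dictionary store, the diff_all function, and the choice of zlib are all assumptions.

```python
import difflib
import zlib


def diff_all(pairs, texts, store):
    """Compute and store a compressed diff for each pair not already stored.

    Returns the number of diffs computed during this call.
    """
    computed = 0
    for a, b in pairs:
        if (a, b) in store:
            continue  # already paid for on an earlier run
        lines = difflib.unified_diff(
            texts[a].splitlines(), texts[b].splitlines(),
            fromfile=a, tofile=b, lineterm="",
        )
        store[(a, b)] = zlib.compress("\n".join(lines).encode("utf-8"))
        computed += 1
    return computed


texts = {
    "doc1": "Profit and Loss\nA/P Aging Detail",
    "doc2": "Profit and Loss\nBalance Sheet",
    "doc3": "Profit and Loss\nBalance Sheet\nNotes",
}
pairs = [("doc1", "doc2"), ("doc2", "doc3")]
store = {}

first_run = diff_all(pairs, texts, store)
second_run = diff_all(pairs, texts, store)
print(first_run, second_run)
print(zlib.decompress(store[("doc1", "doc2")]).decode("utf-8"))
```

The second call does no diff work because every pair key is already present, which is the property that makes repeated batch runs cheap.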
After the batch completes, run unsterwerx diff without any flags to list all computed diffs with their change statistics:
unsterwerx diff
The output shows each pair with added/removed line counts and a change percentage, making it easy to prioritize which pairs deserve a closer look.
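To get a feel for these statistics, the Python sketch below counts added and removed lines for one pair and derives a percentage. The source does not define how Unsterwerx calculates the change percentage, so the formula here (changed lines divided by the combined line count of both documents) is only one plausible assumption, and change_stats is a hypothetical helper.

```python
import difflib


def change_stats(text_a, text_b):
    """Return (added, removed, percent_changed) for two texts.

    percent_changed uses an assumed formula:
    100 * (added + removed) / (lines in A + lines in B).
    """
    a_lines, b_lines = text_a.splitlines(), text_b.splitlines()
    matcher = difflib.SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    added = removed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed += i2 - i1  # lines present only in A
        if tag in ("replace", "insert"):
            added += j2 - j1  # lines present only in B
    total = len(a_lines) + len(b_lines)
    percent = 100.0 * (added + removed) / total if total else 0.0
    return added, removed, percent


doc_a = "Profit and Loss\nA/P Aging Detail\nExpenses by Vendor Summary\nNET INCOME"
doc_b = "Profit and Loss\nBalance Sheet\nNET INCOME"

added, removed, percent = change_stats(doc_a, doc_b)
print(f"+{added} -{removed} ({percent:.1f}% changed)")
```

Sorting pairs by a figure like this is one way to surface the pairs with the largest differences first.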
Conclusion
You have now used Unsterwerx to rebuild a full-text search index, query 1,042 documents across four formats with a single command, detect 83 exact duplicates automatically using MinHash + LSH, and compare related documents to see their exact structural differences.
The key takeaway: because all documents pass through format-specific NACs during ingestion and are normalized into a single Universal Data Set, every analysis command works uniformly across PDF, DOCX, PPTX, and XLSX. You search one index, not four.
For next steps, see How To Classify Documents and Set Retention Policies with Unsterwerx to learn how to apply Business Intelligence rules (classification and governance policies) and User Intelligence rules (retention, access, and mutability) to your document collection.