How To Search and Compare Documents with Unsterwerx

Once your documents have been ingested and normalized into the Universal Data Set, the real work begins: finding what you need and understanding how documents relate to each other. In this tutorial, you will rebuild the search index, run full-text queries across all your documents regardless of original format, detect duplicates and near-duplicates automatically, and compare specific document pairs to see exactly what changed.

Prerequisites

Before starting, make sure you have:

Unsterwerx v0.5.4 or later installed and on your PATH
A dataset already ingested and normalized (see How To Ingest and Normalize Enterprise Documents with Unsterwerx)
Canonical extraction completed (unsterwerx canonical)

This tutorial uses a real dataset of 1,042 documents spanning PDF, DOCX, PPTX, and XLSX files.

Step 1 - Rebuild the Search Index

After ingesting and normalizing documents, you need to build (or rebuild) the FTS5 full-text search index. This index is what makes sub-second search possible across thousands of documents.

Run the reindex command:

bash

unsterwerx reindex

text

Rebuilding FTS5 index...

Reindex Summary
══════════════════════════════════
  Canonical docs:       1042
  Indexed in FTS5:      1042
  Missing content:         0
  Total in FTS5:        1042
══════════════════════════════════

All 1,042 canonical documents are now indexed. The Missing content: 0 line confirms that every canonical record has its content available in storage.

Note: You can safely run reindex at any time. It rebuilds from scratch. If you ingest new documents later, run unsterwerx canonical followed by unsterwerx reindex to include them.
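To make the "rebuilds from scratch" behavior concrete, here is a minimal sketch of a full rebuild using SQLite's FTS5 extension through Python's standard sqlite3 module. The table name docs_fts, its columns, and the sample rows are hypothetical and do not describe Unsterwerx's actual schema. The sketch assumes your SQLite build includes FTS5.

```python
import sqlite3

def rebuild_index(conn, canonical_docs):
    """Drop and recreate a hypothetical FTS5 table, then index every doc with content."""
    conn.execute("DROP TABLE IF EXISTS docs_fts")
    conn.execute(
        "CREATE VIRTUAL TABLE docs_fts USING fts5(doc_id UNINDEXED, title, body)"
    )
    indexed = missing = 0
    for doc_id, title, body in canonical_docs:
        if not body:
            # A canonical record whose content is unavailable cannot be indexed.
            missing += 1
            continue
        conn.execute(
            "INSERT INTO docs_fts (doc_id, title, body) VALUES (?, ?, ?)",
            (doc_id, title, body),
        )
        indexed += 1
    conn.commit()
    total = conn.execute("SELECT count(*) FROM docs_fts").fetchone()[0]
    return {
        "canonical": len(canonical_docs),
        "indexed": indexed,
        "missing": missing,
        "total": total,
    }

# Made-up sample records: (id, title, canonical text).
docs = [
    ("doc-1", "Continuity guide", "Resiliency resources for business continuity planning"),
    ("doc-2", "Slide 1", "Business case development, planning, budgeting, forecasting"),
    ("doc-3", "Empty record", None),
]

conn = sqlite3.connect(":memory:")
first = rebuild_index(conn, docs)
second = rebuild_index(conn, docs)  # Running it again yields the same index.
print(first)
```

Because the table is dropped and recreated on every run, repeating the rebuild cannot leave stale or duplicated rows behind, which is why rerunning it is safe.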

Step 2 - Search Across All Your Documents

Here is the key insight: because every document has been normalized to the Universal Data Set through its format-specific NAC (Normalized Application Container), search works identically across PDF, DOCX, PPTX, and XLSX files. You do not need to know what format a document was originally in.

Search for documents related to business planning:

bash

unsterwerx search "business plan"

text

Search Results (20 matches)
══════════════════════════════════════════════════════════════
  1. Small Business [77f10077]
     ...Resiliency resources for <b>business</b>  1.  FEMA's <b>Business</b>
     Continuity <b>Planning</b> Suite  https://www.ready.gov/<b>business</b>-
     continuity-<b>planning</b>-suite...

  2. Guidance for Review and [8c9370e0]
     ...BEA  <b>Business</b> Enterprise Architecture  BMA  <b>Business</b>
     Mission Area  BPM  <b>Business</b> Process Management  BPR
     <b>Business</b>...

  3. Slide 1 [f190f6b4]
     ...Project financial management includes all aspects related to
     effective <b>business</b> and financial management throughout the
     project life cycle. <b>Business</b> case development <b>Planning</b>,
     budgeting, forecasting...

  4. Department of Defense [b99b9e11]
     ...Figure 1 – Integrated <b>Business</b> <b>Plan</b> Framework  In examining
     defense <b>business</b> system portfolios...
  ...
══════════════════════════════════════════════════════════════

The search returns 20 matches, which is the default result limit. Each result shows:

The document title and an 8-character ID prefix in brackets (e.g., 77f10077)
A content snippet with <b> tags highlighting where the query matched
Results ranked by FTS5 relevance

Notice that result #3 ("Slide 1") came from a PowerPoint file, while result #1 came from a PDF. The search does not care about the original format.
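If you are curious how ranked results with highlighted snippets can come out of an FTS5 index, the following sketch queries a small in-memory table directly with SQLite. Everything here is an assumption for illustration: the docs_fts table, its columns, the porter tokenizer (chosen because the output above highlights "Planning" for the query term "plan"), and the sample rows. It is not Unsterwerx's internal query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; the porter tokenizer stems "planning" and "plan" to the same term.
conn.execute(
    "CREATE VIRTUAL TABLE docs_fts USING fts5("
    "doc_id UNINDEXED, title, body, tokenize = 'porter unicode61')"
)
conn.executemany(
    "INSERT INTO docs_fts (doc_id, title, body) VALUES (?, ?, ?)",
    [
        ("doc-1", "Continuity guide", "Resiliency resources for business continuity planning"),
        ("doc-2", "Slide 1", "Business case development, planning, budgeting, forecasting"),
        ("doc-3", "Wage form", "Include this tax on Form 1040"),
    ],
)

def search(conn, query, limit=20):
    """Return (doc_id, title, snippet) rows, best match first."""
    sql = """
        SELECT doc_id, title,
               snippet(docs_fts, 2, '<b>', '</b>', '...', 16)
        FROM docs_fts
        WHERE docs_fts MATCH ?
        ORDER BY rank
        LIMIT ?
    """
    return conn.execute(sql, (query, limit)).fetchall()

results = search(conn, "business plan")
for doc_id, title, snip in results:
    print(f"{title} [{doc_id}]\n   {snip}")

top_one = search(conn, "business plan", limit=1)
```

The rows carry no information about the original file format, so a slide deck and a PDF are matched and ranked by the same query.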

Step 3 - Limit and Refine Results

Twenty results is a good default, but sometimes you want a tighter list. Use --limit to control how many results come back.

Search for tax-related documents with a limit of 5:

bash

unsterwerx search "tax" --limit 5

text

Search Results (5 matches)
══════════════════════════════════════════════════════════════
  1. a. Employee's Social Security Number [e6d22e4d]
     ...Include this <b>tax</b> on Form 1040 or 1040-SR. See "Other
     <b>Taxes</b>" in the Forms 1040 and 1040-SR instructions.
     B - Uncollected Medicare <b>tax</b> on tips...

  2. a. Employee's Social Security Number [85565499]
     ...Include this <b>tax</b> on Form 1040 or 1040-SR. See "Other
     <b>Taxes</b>" in the Forms 1040 and 1040-SR instructions...

  3. a. Employee's Social Security Number [d0887c57]
     ...Include this <b>tax</b> on Form 1040 or 1040-SR...

  4. a. Employee's Social Security Number [5f23f24e]
     ...Include this <b>tax</b> on Form 1040 or 1040-SR...

  5. Schwab One® Account of [5f7cc7c3]
     Schwab One® Account of  Account Number  <b>TAX</b> YEAR 2020
     ROBERT CARLTON WHETSEL &  2846-9540  RUTH ALISON BARRATT
     JT TEN  FORM 1099 COMPOSITE...
══════════════════════════════════════════════════════════════

Four W-2 forms and a 1099 composite statement. This is already useful: those four W-2 results look nearly identical, which suggests they might be duplicates or year-over-year variants. The similarity analysis in the next step will confirm that automatically.

You can also search for a person's name. Searching for "whetsel" returns 20 matches spanning military evaluations, financial statements, book chapters, contact lists, and presentation slides:

bash

unsterwerx search "whetsel"

text

Search Results (20 matches)
══════════════════════════════════════════════════════════════
  1. Calculations [b441bb4d]
     ...Board of Directors & Advisors | | BOD | | Sophie Sabbage |
     | <b>Whetsel</b> | | Templer |  ## Cap Table (Current)  | David
     Levine, Founder & CEO | ...| Robert <b>Whetsel</b>...

  2. HQDA#: [cfb70357]
     ...CPT <b>Whetsel</b> also published a paper for IEEE; R. C.
     <b>Whetsel</b> and Y. Qu, "Quantifying the impact of big data's
     variety,"...

  3. + [a0f0794e]
     ...Net Portfolio Value: $54,280.19  ## DR ROBERT C <b>WHETSEL</b>
     AND  ## DR RUTH A BARRATT JTWROS...
  ...
══════════════════════════════════════════════════════════════

A single query finds a name across spreadsheets, PDFs, PowerPoint decks, and Word documents. That is the power of searching a normalized Universal Data Set instead of searching each format separately.

Step 4 - Find Duplicate and Similar Documents

Searching finds documents by content. Similarity analysis finds documents that are related to _each other_. This is a different question: not "which documents mention X?" but "which documents are copies or near-copies of each other?"

Unsterwerx uses MinHash combined with Locality-Sensitive Hashing (LSH) for this analysis. Comparing every document pair directly takes O(n^2) comparisons, which becomes very slow as the collection grows. LSH instead groups documents whose MinHash signatures collide and scores only those candidate pairs, so the analysis scales close to linearly with collection size.
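The general technique can be sketched in a few lines of Python. This is an illustrative implementation of MinHash with LSH banding, not Unsterwerx's code: the shingle size, the number of hash functions, the band layout, and the file names are all assumptions.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 64               # signature length (assumed)
BANDS = 16                    # LSH bands (assumed)
ROWS = NUM_HASHES // BANDS    # signature rows per band

def shingles(text, k=3):
    """Set of k-word shingles from lowercased, whitespace-normalized text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash(shingle_set):
    """One minimum per seeded hash function; similar sets share many minima."""
    signature = []
    for seed in range(NUM_HASHES):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature

def candidate_pairs(signatures):
    """Documents that agree on every row of at least one band become candidates."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for band in range(BANDS):
            key = (band, tuple(sig[band * ROWS:(band + 1) * ROWS]))
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

def estimate(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# Hypothetical documents: two copies under different names, one unrelated file.
docs = {
    "report_v3.docx": "Quarterly revenue grew while operating costs stayed flat across all regions",
    "report_final.docx": "quarterly revenue grew while operating costs stayed  flat across all regions",
    "recipe.pdf": "Whisk the eggs then fold in flour sugar and melted butter",
}
signatures = {name: minhash(shingles(text)) for name, text in docs.items()}
pairs = candidate_pairs(signatures)
for a, b in sorted(pairs):
    print(f"{estimate(signatures[a], signatures[b]):.3f}  {a} <-> {b}")
```

Only documents that land in the same bucket are ever scored against each other, so the unrelated file is never compared with the two reports.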

Run similarity with the default threshold of 0.3:

bash

unsterwerx similarity

text

Running canonical extraction (if needed)...
Running similarity analysis...

Similarity Analysis
══════════════════════════════════════════
  Documents processed:     1042
  Candidate pairs:          451
  Exact duplicates:          83
  Threshold:               0.30
  Run ID:              0d51c...
══════════════════════════════════════════

  Top Pairs:
    1.000  Identifying and Safeguardin... <-> Phishing Awareness Certific...
    1.000  Kove Solution - 44 Compute ... <-> Kove Solution - 44 Compute ...
    1.000  MIT SUS Takeda - April, 202... <-> MIT SUS Takeda - April^J 20...
    1.000  CPTWHETSEL_Oct2021_RST.pdf <-> CPTWHETSEL_Sep2021_RST.pdf
    1.000  2013-09-13_DoD_Strategy_for... <-> 2013-09-13_DoD_Strategy_for...
    1.000  ACME_business_metrics_06242... <-> 1 page marketing plan.xlsx
    1.000  value_prop_CDO_v3.docx <-> value_prop_CDO_final.docx
    1.000  Blank RFO.PDF <-> 2020SEPT_WHETSEL_AT_RFO.pdf
    1.000  PitchDeck Template - old (1... <-> PitchDeck Template - old.pptx
    1.000  barratt_W2-1619016848.pdf <-> RBarratt2020W2.pdf
    1.000  2015_National_Military_Stra... <-> 2015_National_Military_Stra...
    ...

The headline number: 83 exact duplicates found automatically across 1,042 documents. No manual review needed.

Look at what the analysis found:

ACME_business_metrics_06242... and 1 page marketing plan.xlsx have a Jaccard score of 1.000, meaning identical content despite completely different filenames. Most likely, someone copied the file and renamed it.
value_prop_CDO_v3.docx and value_prop_CDO_final.docx are content-identical. The "v3" _was_ the final.
Blank RFO.PDF and 2020SEPT_WHETSEL_AT_RFO.pdf match perfectly. The blank template and a filled-out form have the same canonical text, which implies that the filled-in values did not make it into the extracted text.
barratt_W2-1619016848.pdf and RBarratt2020W2.pdf are the same W-2 saved under two names.

A Jaccard score of 1.000 means the two documents produce identical text after normalization. This works across formats: a .csv and .xlsx with the same data will score 1.000 because the NAC for each format produces the same canonical output.
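For reference, these are the standard definitions behind the score. The notation is introduced here for illustration: A and B are the sets of features (for example, word shingles) extracted from two documents, and h_1, ..., h_k are k independent random hash functions. How Unsterwerx builds those sets and how many hash functions it uses is not specified in this tutorial.

```latex
% Jaccard similarity of two feature sets
J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad 0 \le J(A, B) \le 1

% MinHash property: for a random hash function h,
\Pr\Big[\min_{a \in A} h(a) = \min_{b \in B} h(b)\Big] = J(A, B)

% Estimate from k hash functions h_1, \dots, h_k
\hat{J}(A, B) = \frac{1}{k} \sum_{i=1}^{k}
  \mathbf{1}\Big[\min_{a \in A} h_i(a) = \min_{b \in B} h_i(b)\Big]
```

When A and B are the same set, every term in the sum equals 1, so the estimate is exactly 1.000 no matter what the files are called.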

Step 5 - Tune the Similarity Threshold

The default threshold of 0.3 casts a wide net and surfaces loosely related documents. If you only care about very close matches, raise it.

Run with a threshold of 0.8 and limit the top pairs to 10:

bash

unsterwerx similarity --threshold 0.8 --top 10

text

Running canonical extraction (if needed)...
Running similarity analysis...

Similarity Analysis
══════════════════════════════════════════
  Documents processed:     1042
  Candidate pairs:          184
  Exact duplicates:          83
  Threshold:               0.80
  Run ID:              e878e...
══════════════════════════════════════════

  Top Pairs:
    1.000  RovalryMarketingMay14.docx.... <-> RovalryMarketingMay14 2.doc...
    1.000  NBIS_CCEP Draft Artifact Te... <-> NBIS_CCEP Draft Artifact Te...
    1.000  RCP Tech Requirements - Dec... <-> RCP Tech Requirements - Dec...
    1.000  CPTWHETSEL_March2022_RST (1... <-> RST FORM Vers.20190427.pdf
    1.000  Indeco technology platform ... <-> Indeco technology platform ...
    1.000  CPTWHETSEL_Oct2021_RST.pdf <-> CPTWHETSEL_Sep2021_RST.pdf
    1.000  MIT SUS Takeda - April, 202... <-> MIT SUS Takeda - April^J 20...
    1.000  CPTWHETSEL_Oct2021_RST.pdf <-> 151 TIOG_RST Form v5.pdf
    1.000  DoDNET and DISANET.pptx <-> DoDNET and DISANET (002).pptx
    1.000  CPT Whetsel RPA_IndividualO... <-> CPT Whetsel RPA_IndividualO...

Candidate pairs dropped from 451 to 184. The 83 exact duplicates remain (those always score 1.0 regardless of threshold), but the "loosely similar" pairs below 0.8 are filtered out.

Here is a rough guide for threshold values:

0.3 (default) - Broad net. Catches documents that share some structural overlap, good for discovery.
0.5 - Moderate. Documents that share significant content but may have different sections.
0.8 - Strict. Near-identical documents with minor edits or version differences.
1.0 - Exact duplicates only.
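The effect of the threshold and the top-pairs limit can be pictured as a filter over scored pairs. The sketch below uses made-up pair names and scores, and it assumes a pair is kept when its score is greater than or equal to the threshold; the tutorial does not state whether the real comparison is inclusive.

```python
# Hypothetical scored pairs: (document A, document B, similarity score).
scored_pairs = [
    ("plan_v3.docx", "plan_final.docx", 1.00),
    ("budget_2021.xlsx", "budget_2022.xlsx", 0.85),
    ("deck_old.pptx", "deck_new.pptx", 0.55),
    ("memo.pdf", "notes.pdf", 0.32),
]

def filter_pairs(pairs, threshold=0.3, top=None):
    """Keep pairs scoring at or above the threshold, highest score first."""
    kept = sorted(
        (p for p in pairs if p[2] >= threshold),
        key=lambda p: p[2],
        reverse=True,
    )
    return kept if top is None else kept[:top]

for threshold in (0.3, 0.5, 0.8, 1.0):
    kept = filter_pairs(scored_pairs, threshold)
    exact = sum(1 for p in kept if p[2] == 1.0)
    print(f"threshold {threshold:.2f}: {len(kept)} pairs, {exact} exact")
```

Raising the threshold only removes pairs from the bottom of the list, so the exact-duplicate count stays the same at every setting.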

Note: You can persist your preferred threshold with unsterwerx config set similarity.threshold 0.8 so you do not have to pass the flag every time.

Step 6 - Compare Two Documents Side-by-Side

Similarity tells you _that_ two documents are related. The diff command tells you _how_ they differ. This is where you see the actual structural changes between two document versions.

Start by diffing two exact duplicates. Use the document IDs from the similarity output:

bash

unsterwerx diff --doc-a <doc-a-id> --doc-b <doc-b-id>

For a pair that scored 1.000:

text

Documents are identical.

No surprises there. Now diff a pair that is similar but not identical. Here is the output for two financial XLSX files with related but different content:

bash

unsterwerx diff --doc-a <doc-a-id> --doc-b <doc-b-id>

text

@@ -8,9 +8,7 @@

 Profit andLoss3

-A/PAging Detail4
-
-Expenses byVendor Summary5
+Balance Sheet4

 Profit andLoss

@@ -38,61 +36,27 @@
 | NET OPERATING INCOME | -48,092.57 |
 | NET INCOME | $ -48,092.57 |

-A/PAging Detail
-
-All Dates
-
-|  |
-| --- |
-|  |
-|  |
-| This report contains no data for your specified date range. |
+Balance Sheet

-Expenses byVendor Summary
-
 All Dates

 |  | Total |
 | --- | --- |
-| Amazon | 856.46 |
-| AMTRAK | 282.00 |
-| Best Buy | 572.31 |
-| BIT DEFENDER.COM | 104.98 |
-| Cafe Nola | 64.27 |
-| Chase | 1,306.38 |
-| Columbia College | 1,039.00 |
-| Comcast | 416.57 |
-| Dell Sales and Service | 108.99 |
-| Godaddy.com | 1,078.99 |
-| Harvard Bus Publishing | 242.12 |
 ...
+| ASSETS |  |
+| Current Assets |  |
+| Bank Accounts |  |
+| Checking | -23,879.05 |
+| Total Bank Accounts | -23,879.05 |
+| Total Current Assets | -23,879.05 |
+| TOTAL ASSETS | $ -23,879.05 |
+| LIABILITIES AND EQUITY |  |
+| Equity |  |
+| Owner's Investment | 24,213.52 |
+| Net Income | -48,092.57 |
+| Total Equity | -23,879.05 |
+| TOTAL LIABILITIES AND EQUITY | $ -23,879.05 |

This is a structural diff of two QuickBooks financial exports in XLSX format. Document A contains a Profit & Loss statement, an A/P Aging Detail, and an Expenses by Vendor Summary. Document B replaces the aging and vendor reports with a Balance Sheet.

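The diff shows the exact line-by-line changes in unified diff format. Each @@ header marks the start of a changed region and gives its line ranges in document A and document B. Lines prefixed with - exist only in document A; lines prefixed with + exist only in document B. Context lines, which are prefixed with a single space, appear in both.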
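You can reproduce the same output format with Python's standard difflib module. This sketch compares two short, simplified line lists loosely modeled on the reports above; it illustrates the unified diff format only and is not how Unsterwerx computes its diffs.

```python
import difflib

# Simplified stand-ins for the canonical text of two documents.
doc_a = ["Profit and Loss", "", "A/P Aging Detail", "", "All Dates"]
doc_b = ["Profit and Loss", "", "Balance Sheet", "", "All Dates"]

diff = list(
    difflib.unified_diff(doc_a, doc_b, fromfile="doc-a", tofile="doc-b", lineterm="")
)
print("\n".join(diff))
```

The printed result starts with the two file header lines, followed by an @@ hunk header, the unchanged context lines with a leading space, one - line for the report that exists only in doc_a, and one + line for the report that exists only in doc_b.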

This is what makes the diff command powerful for financial documents: you can see that two related spreadsheets share the same net income figure ($-48,092.57) but present completely different views of the business. Without normalization, comparing an XLSX to another XLSX at the content level would require opening both in a spreadsheet application and manually scanning for differences.

Step 7 - Batch-Compare All Similar Pairs

Diffing one pair at a time is fine for investigation. When you want diffs for every candidate pair from the similarity analysis, use --all:

bash

unsterwerx diff --all

This computes and stores diffs for all candidate pairs identified by the most recent similarity run. Each diff is compressed and stored in the local Shared Sandbox, so you only pay the computation cost once. Subsequent runs skip pairs that already have stored diffs.
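The compute-once behavior follows a common caching pattern, sketched below. The in-memory dictionary standing in for the Shared Sandbox, the use of zlib for compression, and the document IDs are all assumptions made for illustration.

```python
import difflib
import zlib

def diff_all(pairs, texts, store):
    """Compute and store a compressed diff for each pair not already in the store."""
    computed = 0
    for a, b in pairs:
        if (a, b) in store:
            continue  # Already stored by an earlier run; skip the work.
        diff = "\n".join(
            difflib.unified_diff(
                texts[a].splitlines(),
                texts[b].splitlines(),
                fromfile=a,
                tofile=b,
                lineterm="",
            )
        )
        store[(a, b)] = zlib.compress(diff.encode("utf-8"))
        computed += 1
    return computed

# Hypothetical canonical texts keyed by made-up document IDs.
texts = {
    "doc-a": "Profit and Loss\nA/P Aging Detail\nAll Dates",
    "doc-b": "Profit and Loss\nBalance Sheet\nAll Dates",
    "doc-c": "Profit and Loss\nBalance Sheet\nAll Dates\nNotes",
}
pairs = [("doc-a", "doc-b"), ("doc-b", "doc-c")]
store = {}

first_run = diff_all(pairs, texts, store)
second_run = diff_all(pairs, texts, store)
print(first_run, second_run)
restored = zlib.decompress(store[("doc-a", "doc-b")]).decode("utf-8")
```

The second call does no diffing at all because every pair is already present in the store.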

After the batch completes, run unsterwerx diff without any flags to list all computed diffs with their change statistics:

bash

unsterwerx diff

The output shows each pair with added/removed line counts and a change percentage, making it easy to prioritize which pairs deserve a closer look.
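One plausible way to derive such statistics is shown below. The definition of the change percentage used here (changed lines divided by the combined line count of both documents) is an assumption; the tutorial does not define how Unsterwerx calculates it.

```python
import difflib

def change_stats(lines_a, lines_b):
    """Count removed and added lines and a change percentage (assumed definition)."""
    added = removed = 0
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed += i2 - i1
        if tag in ("replace", "insert"):
            added += j2 - j1
    total_lines = max(1, len(lines_a) + len(lines_b))
    percent = 100.0 * (added + removed) / total_lines
    return added, removed, percent

doc_a = ["Profit and Loss", "A/P Aging Detail", "All Dates", "NET INCOME"]
doc_b = ["Profit and Loss", "Balance Sheet", "All Dates", "NET INCOME"]

stats = change_stats(doc_a, doc_b)
same = change_stats(doc_a, doc_a)
print(stats, same)
```

Sorting pairs by this percentage puts the most heavily changed documents first, which is one way to decide where to look next.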

Conclusion

You have now used Unsterwerx to rebuild a full-text search index, query 1,042 documents across four formats with a single command, detect 83 exact duplicates automatically using MinHash + LSH, and compare related documents to see their exact structural differences.

The key takeaway: because all documents pass through format-specific NACs during ingestion and are normalized into a single Universal Data Set, every analysis command works uniformly across PDF, DOCX, PPTX, and XLSX. You search one index, not four.

For next steps, see How To Classify Documents and Set Retention Policies with Unsterwerx to learn how to apply Business Intelligence rules (classification and governance policies) and User Intelligence rules (retention, access, and mutability) to your document collection.