How To Detect and Remove Duplicate Documents with Unsterwerx
Duplicate documents are a tax on every document corpus. They inflate storage, skew search results, and make governance harder. Unsterwerx provides a full deduplication pipeline that uses Bayesian probability scoring to find duplicates, groups them into knowledge vectors, and removes the redundant copies while preserving a single authoritative anchor for each group. Every removal is auditable and reversible.
In this article, you will train a knowledge model on a 1,042-document corpus, teach it with feedback labels, cluster documents into knowledge vectors, and apply deduplication to remove 179 duplicate documents - a 17% corpus reduction - with full rollback capability.
Prerequisites
Before you begin, you need:
- Unsterwerx v0.5.4 or later installed (installation guide)
- A corpus already ingested and indexed. If you have not done this yet, follow the Quick Start guide.
- Similarity scores already computed. Run unsterwerx similarity build if you have not yet. The knowledge model builds on top of similarity data.
- Familiarity with classification and retention policies is helpful but not required. See How To Classify Documents and Set Retention Policies with Unsterwerx for that workflow.
Step 1 - Build the Knowledge Model
The knowledge model uses a Naive Bayes classifier to score document pairs. It takes the similarity data your corpus already has (Jaccard and cosine scores from locality-sensitive hashing) and computes a posterior probability that each pair represents a true duplicate.
The model starts with bootstrap labels derived from your similarity data - it does not need manual input to produce initial results.
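Conceptually, scoring a pair with a Naive Bayes model means multiplying a per-feature likelihood for each class by that class's prior, then normalizing. The following sketch illustrates the idea with Gaussian likelihoods over the Jaccard and cosine features; the class means and spreads are made-up assumptions for illustration, not Unsterwerx internals (only the priors match the run output shown below):

```python
import math

def gaussian(x, mu, sigma):
    """Likelihood of a feature value under one class's Gaussian."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def posterior_duplicate(jaccard, cosine, p_dup=0.306, p_unrel=0.694):
    """Naive Bayes: multiply per-feature likelihoods by the class prior,
    then normalize. The per-class means/spreads below are illustrative."""
    like_dup = gaussian(jaccard, 0.9, 0.1) * gaussian(cosine, 0.95, 0.05)
    like_unrel = gaussian(jaccard, 0.1, 0.15) * gaussian(cosine, 0.2, 0.15)
    numerator = like_dup * p_dup
    denominator = numerator + like_unrel * p_unrel
    return numerator / denominator

# A near-identical pair gets a posterior close to 1.0; a dissimilar pair close to 0.
print(round(posterior_duplicate(0.98, 0.99), 3))
print(round(posterior_duplicate(0.05, 0.10), 3))
```

The key property: high Jaccard and cosine scores pull the posterior toward 1.0 even under a modest prior, which is why the top pairs in the output below all reach 1.000.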
Run the build:
unsterwerx knowledge build
Preflight checks...
All prerequisites met.
Building semantic features...
Corpus: 1042 docs, 1762686 unique terms (IDF snapshot #1)
Training Bayesian model...
Bootstrap labels: 220 positive, 440 negative
Model trained: run #1, P(dup)=0.306, P(unrel)=0.694
Scoring candidates...
Timing: Semantic: 1.1s | Scoring: 0.0s | Total: 1.5s
Candidates scored: 451
Top 20 pairs by posterior:
Doc A Doc B Posterior Jaccard Cosine
--------------------------------------------------------------------------------------------------------------
087b9cf2-9a36-4efc-8d02-5f9eb1acb504 ba160f91-5bc1-4db1-82fb-db1c31bbaf5a 1.000 1.000 1.000
4cd34b51-51b3-41c6-90ae-ea79ec6016c2 c7708a05-817a-4d0e-8cb2-a0d8e50ed522 1.000 0.906 0.930
418249e8-6733-4982-b2d4-c86968e32e54 ae18c21b-9970-4770-b175-27efd73bb56f 1.000 0.984 0.998
2c96bdc7-3055-4347-9ea0-d9c22ed2a1a0 30d29388-f857-4ee4-8f15-0c1d89d6e148 1.000 1.000 1.000
15b60961-7acf-433d-a467-44b76dd36676 ca403989-c9fc-4882-b47e-df2f5f01dec9 1.000 0.992 0.993
The model trained on 220 positive and 440 negative bootstrap labels. P(dup)=0.306 is the prior probability: about 30.6% of candidate pairs are likely duplicates. The remaining 69.4% are likely unrelated.
The top pairs all show a posterior of 1.000 with high Jaccard and cosine scores. These are near-certain duplicates. The model scored 451 candidate pairs total from 1,042 documents.
Step 2 - Evaluate the Model
Before you trust the model's judgments, check its internal consistency. The --evaluate flag runs the model and reports accuracy metrics:
unsterwerx knowledge build --evaluate
Model is current (no retrain needed).
Scoring candidates...
Timing: Semantic: 0.2s | Scoring: 0.0s | Total: 0.3s
Candidates scored: 451
Evaluation:
Post-train consistency: 100.0%
No user feedback labels yet. Add labels with 'knowledge labels add' for real precision/recall.
Post-train consistency is 100%, meaning the model's predictions perfectly match its own training data. That is a good baseline. But the model tells you something important: it has no user feedback yet. Without your labels, it cannot report precision or recall against ground truth.
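Post-train consistency, as described here, is simply the fraction of training pairs whose predicted class agrees with the label the model was trained on. A minimal sketch of that check (the function and data names are illustrative, not the tool's internals):

```python
def post_train_consistency(predictions, training_labels):
    """Fraction of training pairs where the model's predicted class
    matches the label it was trained on (1.0 = perfect agreement)."""
    matches = sum(1 for pair, label in training_labels.items()
                  if predictions.get(pair) == label)
    return matches / len(training_labels)

# Toy example: the model reproduces all three of its training labels.
labels = {("a", "b"): "duplicate", ("c", "d"): "unrelated", ("e", "f"): "duplicate"}
preds = {("a", "b"): "duplicate", ("c", "d"): "unrelated", ("e", "f"): "duplicate"}
print(post_train_consistency(preds, labels))  # 1.0
```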
Time to teach it.
Step 3 - Teach the Model with Feedback Labels
You improve the model by labeling document pairs. Pick one pair from the top results that you know is a duplicate, and one pair you know is unrelated.
Label a known duplicate pair:
unsterwerx knowledge labels add 087b9cf2-9a3 ba160f91-5bc --label duplicate_or_same_concept
Label added: 087b9cf2-9a3 / ba160f91-5bc → duplicate_or_same_concept
Label a known unrelated pair:
unsterwerx knowledge labels add 694cac44-8ef dd3b3a3f-508 --label unrelated
Label added: 694cac44-8ef / dd3b3a3f-508 → unrelated
Even two labels give the model ground truth to measure itself against. In practice, labeling 10-20 pairs across different score ranges will sharpen precision significantly.
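One way to spread those 10-20 labels across score ranges is to bucket candidate pairs by posterior and sample a few from each bucket, so your feedback covers uncertain pairs as well as confident ones. A sketch of that selection strategy (the bucketing scheme is an assumption, not an Unsterwerx feature):

```python
import random

def stratified_label_candidates(scored_pairs, per_bucket=3, seed=42):
    """Bucket (pair, posterior) tuples into low/mid/high score ranges and
    sample a few from each, giving feedback labels broad coverage."""
    buckets = {"low": [], "mid": [], "high": []}
    for pair, posterior in scored_pairs:
        if posterior < 0.4:
            buckets["low"].append(pair)
        elif posterior < 0.8:
            buckets["mid"].append(pair)
        else:
            buckets["high"].append(pair)
    rng = random.Random(seed)
    picks = []
    for pairs in buckets.values():
        picks.extend(rng.sample(pairs, min(per_bucket, len(pairs))))
    return picks

scored = [(f"pair{i}", i / 20) for i in range(20)]  # posteriors 0.00 to 0.95
print(stratified_label_candidates(scored))  # up to 9 pairs spanning all ranges
```

Label each picked pair with knowledge labels add as shown above, then retrain.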
Step 4 - Retrain with Your Feedback
Now rebuild the model with the --retrain and --evaluate flags together. The model incorporates your labels into its training and reports precision/recall against them:
unsterwerx knowledge build --retrain --evaluate
Building semantic features...
Corpus: 1042 docs, 1762686 unique terms (IDF snapshot #1)
Training Bayesian model...
Bootstrap labels: 220 positive, 440 negative
Model trained: run #2, P(dup)=0.306, P(unrel)=0.694
Scoring candidates...
Timing: Semantic: 0.2s | Scoring: 0.0s | Total: 0.5s
Candidates scored: 451
Evaluation:
Post-train consistency: 100.0%
User feedback labels: 2
Feedback precision: 50.0%
Feedback recall: 100.0%
Feedback F1: 66.7%
The model now reports against your 2 feedback labels. Recall is 100% - it found the duplicate you confirmed. Precision is 50% - it also flagged the unrelated pair as a duplicate, which your label corrected. The F1 score of 66.7% reflects both.
With more labels, these metrics improve. The Bayesian approach means each label shifts the model's posterior probabilities, and you can iteratively retrain until precision meets your requirements.
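The reported metrics follow the standard definitions. Applied to the two labels above (one true positive: the confirmed duplicate the model caught; one false positive: the unrelated pair it wrongly flagged), a short sketch reproduces the numbers:

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions over true/false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 1 true positive (confirmed duplicate), 1 false positive (unrelated
# pair flagged as duplicate), 0 false negatives.
p, r, f1 = precision_recall_f1(tp=1, fp=1, fn=0)
print(f"precision={p:.1%} recall={r:.1%} f1={f1:.1%}")
# precision=50.0% recall=100.0% f1=66.7%
```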
Step 5 - Cluster Documents into Knowledge Vectors
Knowledge vectors group related documents into clusters. Each vector represents a concept or topic in your corpus - what the TCA patent calls organizing documents within the Universal Data Set (the normalized canonical representation of all ingested data).
Build the vectors from your trained model:
unsterwerx knowledge vectors build
Knowledge Vector Build Results:
Vectors created: 121
Vectors updated: 0
Vectors deleted: 0
Edges created: 0
Documents clustered: 352
Singletons dropped: 0
Time: 0.07s
Run ID: 6856990f
Upstream similarity: 6a3f9de3
The model grouped 352 documents into 121 knowledge vectors. The remaining 690 documents had no similarity candidates and remain unclustered. This is expected - not every document has a close neighbor.
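Grouping scored pairs into vectors can be pictured as finding connected components in a graph whose edges are pairs that clear the duplicate threshold. The sketch below uses union-find to do that; this is an assumption about the general approach, not Unsterwerx's actual clustering algorithm:

```python
def cluster_pairs(pairs, threshold=0.8):
    """Union-find over document pairs whose posterior clears the threshold;
    each resulting component plays the role of one knowledge vector."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b, posterior in pairs:
        if posterior >= threshold:
            union(a, b)

    clusters = {}
    for doc in parent:
        clusters.setdefault(find(doc), set()).add(doc)
    return list(clusters.values())

pairs = [("d1", "d2", 0.99), ("d2", "d3", 0.95), ("d4", "d5", 0.91), ("d1", "d6", 0.2)]
print(cluster_pairs(pairs))  # two clusters: {d1, d2, d3} and {d4, d5}; d6 never clusters
```

Documents with no edge above the threshold (like d6 here) stay unclustered, which matches the 690 singletons in the run above.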
Step 6 - Explore Your Knowledge Vectors
Before deduplicating, explore the vectors to understand what the model found.
List vectors
unsterwerx knowledge vectors list
ID Label Docs Confidence Method
-------------------------------------------------------------------------------------------------------------------
710175d2-39d5-4f03-a59a-e9dbd15807aa NBIS CCEP Draft Artifact Templates -DRAFT v2 3 1.000 bayesian_lsh_v1
cf99980f-f5bb-4f6e-84fb-335ebe087ce6 CDEROne Development Overview 12-3 3 1.000 bayesian_lsh_v1
2662141d-47a0-4c40-bc46-36de359d71b1 NBIS Product Story - DataLake v 2017 9-20 4 1.000 bayesian_lsh_v1
f8cb9afc-279f-4c78-a982-885ece0c4921 Adv NBIS Prototype Overview v09.25.2017 4 1.000 bayesian_lsh_v1
a56ec25e-3127-4758-8bf9-d40f32ff7c41 NBIS Kickoff Briefing 2017 11 02 v0.13 3 1.000 bayesian_lsh_v1
236c2e99-0201-4efe-9c9b-4cc7a7243fd9 SOFWERX Data Science.7 3 1.000 bayesian_lsh_v1
53daa373-c044-4749-99ac-2b89df0b5296 outline 3 1.000 bayesian_lsh_v1
...
Each vector has a label (derived from the primary document's filename), a document count, a confidence score, and the clustering method. Vectors with 3-4 documents and 1.000 confidence are strong duplicate clusters.
Inspect a specific vector
unsterwerx knowledge vectors show 710175d2-39d5-4f03-a59a-e9dbd15807aa
Vector: 710175d2-39d5-4f03-a59a-e9dbd15807aa
Label: NBIS CCEP Draft Artifact Templates -DRAFT v2
Confidence: 1.000
Members: 3
Method: bayesian_lsh_v1
Representative: 76e98e7b-bad9-498e-9bc8-f53b5ec7da34
Members:
Document ID File Name Score Primary
-----------------------------------------------------------------------------------------------
76e98e7b-bad9-498e-9bc8-f53b5ec7da34 NBIS_CCEP Draft Artifact Templates -DR 1.000 *
82f5dc90-0c97-4b3b-938d-6981a6cb79dd NBIS_CCEP Draft Artifact Templates -DR 1.000
b500fbdd-a59d-4477-b172-b15e99ca64a0 NBIS_CCEP Draft Artifact Templates -DR 1.000
Three copies of the same artifact template, all scoring 1.000. The primary member (marked with *) is the representative document. During dedup, one copy will be kept as the anchor and the other two removed.
Search across vectors
You can search vectors by keyword. This searches document content within all vectors:
unsterwerx knowledge vectors search "business"
Vector ID Vector Label Document Snippet
--------------------------------------------------------------------------------------------------------------------------------------------
1d431706-bc7a-426d-ab7e-26a0c9db5301 4.NBIS Conceptual System Arc 83667607-1187-43d6-8c60-7d028c28b4ae ...Business Application Tier
cf99980f-f5bb-4f6e-84fb-335ebe087ce6 CDEROne Development Overview 832db84c-1b2a-4c89-a321-83e3771cd5cd ...Allows visibility into to the data owners, sources, proce
a047ad7b-4bf2-4a3f-998e-c77fe4f3fdc5 PitchDeck Template - old (1) 43f2cedc-09f6-4624-8022-5bed7ccb578c ...Business Model
...
10 vectors matched.
This is useful for verifying that documents you care about are clustered correctly before running dedup.
Step 7 - Scan for Dedup Candidates
The dedup scan is non-destructive. It analyzes your vectors and produces a plan showing exactly what would be kept and what would be removed, without changing anything.
unsterwerx knowledge dedup scan
Scanning for deduplication candidates (threshold=0.800)...
Dedup scan run: f1e54255
Vector graph run: 6856990f
Model: #2, Threshold: 0.800
Vectors affected: 115, Total kept: 149, Total removed: 190
The scan found 115 vectors with dedup candidates at a posterior threshold of 0.800. It plans to keep 149 documents and remove 190.
Each vector entry shows the decision logic:
Vector: 710175d2-39d5-4f03-a59a-e9dbd15807aa (NBIS CCEP Draft Artifact Templates -DRAF)
Confidence: 1.000 | Kept: 1 | Removed: 2 | Anchor: b500fbdd-a59
Document File Name Weight Posterior Signed Decision
-------------------------------------------------------------------------------------
76e98e7b-bad NBIS_CCEP Draft Artifact Tem 2 1.000 REMOVE (posterior 1.000 >= threshold 0.800)
82f5dc90-0c9 NBIS_CCEP Draft Artifact Tem 2 1.000 REMOVE (posterior 1.000 >= threshold 0.800)
b500fbdd-a59 NBIS_CCEP Draft Artifact Tem 2 — KEEP (primary anchor (highest weight))
The anchor is the document with the highest weight in the vector. It is always kept. Documents with a posterior probability at or above the threshold (0.800) are marked for removal. Documents below the threshold stay.
Here is the key concept: within each vector, one document is the anchor - the authoritative representative. Everything else that scores above the threshold is a redundant copy.
Notice how the scan also respects document protections:
Vector: fd321fb7-0fa8-4a34-9f02-450a95c9b0a4 (2020AUG WHETSEL AT RFO)
Confidence: 0.876 | Kept: 4 | Removed: 5 | Anchor: d1349c98-826
Document File Name Weight Posterior Signed Decision
-------------------------------------------------------------------------------------
d1349c98-826 RFO Request 06-19 September. 2 — KEEP (primary anchor (highest weight))
dfaf0650-537 RFO Request 06-19 September_ 2 — yes KEEP (signed document)
0ff3ba58-8e4 2020AUG_WHETSEL_AT_RFO.pdf 2 1.000 REMOVE (posterior 1.000 >= threshold 0.800)
4a427ccc-1ca 05192020_Request-For-order_v 2 0.525 KEEP (posterior 0.525 < threshold 0.800 with anchor)
The signed document (dfaf0650-537) is automatically kept regardless of its posterior score. Unsterwerx never removes digitally signed documents during dedup. Documents below the threshold (like the one at 0.525) are also preserved.
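Putting the scan rules and the apply-time legal-hold guard together, the per-document decision reads as a precedence of checks. A sketch of that precedence (the field names are illustrative, not the tool's data model):

```python
def dedup_decision(doc, is_anchor, threshold=0.8):
    """Precedence of rules mirroring the plan output:
    legal hold > signature > anchor > posterior threshold."""
    if doc.get("legal_hold"):
        return "SKIP (legal hold)"
    if doc.get("signed"):
        return "KEEP (signed document)"
    if is_anchor:
        return "KEEP (primary anchor)"
    if doc["posterior"] >= threshold:
        return f"REMOVE (posterior {doc['posterior']:.3f} >= threshold {threshold:.3f})"
    return f"KEEP (posterior {doc['posterior']:.3f} < threshold {threshold:.3f})"

print(dedup_decision({"signed": True, "posterior": 1.0}, is_anchor=False))
print(dedup_decision({"posterior": 1.0}, is_anchor=False))
print(dedup_decision({"posterior": 0.525}, is_anchor=False))
```

Protections are checked before the score, which is why a signed document with posterior 1.000 is still kept.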
Step 8 - Preview the Dedup Plan
Before committing to anything, run a dry run. This produces the exact same scan output but explicitly confirms nothing will change:
unsterwerx knowledge dedup apply --dry-run
Scanning for deduplication candidates (threshold=0.800)...
Dedup scan run: 12dc61bf
Vector graph run: 6856990f
Model: #2, Threshold: 0.800
Vectors affected: 115, Total kept: 149, Total removed: 190
Vector: 710175d2-39d5-4f03-a59a-e9dbd15807aa (NBIS CCEP Draft Artifact Templates -DRAF)
Confidence: 1.000 | Kept: 1 | Removed: 2 | Anchor: b500fbdd-a59
...
Dry run — no changes applied.
Review the full output carefully. If any vector's decisions look wrong, go back to Step 3 and add more labels for the pairs in question, then retrain.
Step 9 - Apply Deduplication
When you are satisfied with the plan, apply it:
unsterwerx knowledge dedup apply --confirm
Scanning for deduplication candidates (threshold=0.800)...
Dedup scan run: a15c52fb
Vector graph run: 6856990f
Model: #2, Threshold: 0.800
Vectors affected: 115, Total kept: 149, Total removed: 190
...
Applying deduplication...
WARN Doc under legal hold — skipping doc=86f07ef9-a54f-4c18-9350-2f08b94e10c5
WARN Doc under legal hold — skipping doc=d3cb9102-fd5d-4ca0-a00a-2425d9614e32
WARN Doc under legal hold — skipping doc=e170511f-a011-48cf-bc30-77abac817f47
WARN Doc under legal hold — skipping doc=147930e2-5b63-4dc1-88bf-2e744b28c75a
WARN Doc under legal hold — skipping doc=2886d59f-601f-4687-b76f-e4c426fd8a54
WARN Doc under legal hold — skipping doc=32673ebe-ba20-44c6-84ba-e906709ba891
WARN Doc under legal hold — skipping doc=5755db8d-f11d-4bb8-ae01-479160ca3752
WARN Doc under legal hold — skipping doc=75c300c7-ff59-452d-ae3c-b48d457278bf
WARN Doc under legal hold — skipping doc=aa918a34-3510-4abf-af1b-a6c6c3972be6
WARN Doc under legal hold — skipping doc=b385039a-8c18-4894-9a34-0175c5d33dd1
WARN Doc under legal hold — skipping doc=feabec39-edca-49fb-91fe-543506c83729
Deduplication complete:
Documents removed: 179
Diffs computed: 179
Provenance merged: 68
Labels inserted: 179
Errors: 11
Rule ID: c26f7edd-3a51-4f07-aaf0-9bd71b7f96b2
Time: 0.42s
Several things happened here:
- 179 documents removed. Each removal is recorded as a dedup action with a diff against the anchor.
- 179 diffs computed. Unsterwerx stores the difference between each removed document and its anchor so nothing is truly lost.
- 68 provenance records merged. Metadata from removed documents is merged into their anchors.
- 11 documents under legal hold were skipped. This is the Business Intelligence layer (the TCA patent's rules of hierarchy) enforcing governance policy. Documents with active legal holds cannot be removed by dedup, period. These 11 skips are reported as "errors" in the summary, but they are intentional protections.
- The entire operation completed in 0.42 seconds.
Note: The 11 "errors" are legal-hold protections, not failures. Unsterwerx treats any document it cannot process as an error to ensure you review the output. Legal holds always win over dedup.
Step 10 - Verify and Inspect the Results
List dedup rules
Every dedup operation creates a named rule you can reference later:
unsterwerx knowledge dedup list
Rule ID Name Actions Active Created
-----------------------------------------------------------------------------------------------
c26f7edd-3a51-4f07-aaf0-9bd71b7f96b2 dedup-2026-04-14T15:28:03 (t 179 yes 2026-04-14 15:28:03
1 dedup rules total.
Inspect individual actions
Show the full list of actions in a dedup rule:
unsterwerx knowledge dedup show c26f7edd-3a51-4f07-aaf0-9bd71b7f96b2
Rule: c26f7edd-3a51-4f07-aaf0-9bd71b7f96b2
ID Document Vector Type Restored Status Created
-----------------------------------------------------------------------------------------------
1 76e98e7b-bad 710175d2-39d dedup_remove classified 2026-04-14 15:28:03
2 82f5dc90-0c9 710175d2-39d dedup_remove classified 2026-04-14 15:28:03
3 c74b23dd-fad cf99980f-f5b dedup_remove classified 2026-04-14 15:28:03
4 e158d22c-5ef cf99980f-f5b dedup_remove classified 2026-04-14 15:28:03
5 3f03ab9f-d19 c7975b09-3a7 dedup_remove canonical 2026-04-14 15:28:03
...
179 actions total.
To rollback a single action: knowledge dedup rollback <action-id>
To rollback the entire rule: knowledge dedup rollback <rule-id>
Each action shows the removed document ID, the vector it belonged to, the action type (dedup_remove), and the restored status (classified or canonical). The "Restored" column tells you what state the document would return to if you roll it back.
Rollback if needed
If you need to undo the deduplication, you can roll back individual actions or the entire rule:
# Roll back a single document
unsterwerx knowledge dedup rollback 1
# Roll back the entire dedup operation
unsterwerx knowledge dedup rollback c26f7edd-3a51-4f07-aaf0-9bd71b7f96b2
Every dedup removal is reversible because Unsterwerx stores the diff between the removed document and its anchor. Rollback restores the document to its pre-dedup state. This is the Shared Sandbox principle from the TCA patent: all processing happens in a trusted local environment where operations are auditable and recoverable.
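Diff-based reversibility can be illustrated with Python's standard difflib: keep only the line diff needed to rebuild the removed copy from its anchor, delete the copy, and replay the diff on rollback. This is a sketch of the principle, not Unsterwerx's actual storage format:

```python
import difflib

def remove_with_diff(anchor_text, duplicate_text):
    """Store the line diff needed to rebuild the duplicate from its anchor;
    the duplicate itself can then be deleted."""
    return list(difflib.ndiff(anchor_text.splitlines(keepends=True),
                              duplicate_text.splitlines(keepends=True)))

def rollback(diff):
    """Restore the removed copy by replaying the stored diff."""
    return "".join(difflib.restore(diff, 2))  # 2 selects the duplicate's side

anchor = "Title\nShared body\n"
duplicate = "Title\nShared body\nExtra trailing note\n"
stored = remove_with_diff(anchor, duplicate)
assert rollback(stored) == duplicate  # the removed document is fully recoverable
print("rollback restores the duplicate byte-for-byte")
```

Because the diff preserves everything that differed from the anchor, nothing is lost even though only one physical copy remains.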
Conclusion
You trained a Bayesian knowledge model on 1,042 documents, taught it with feedback labels, clustered 352 documents into 121 knowledge vectors, and applied deduplication to remove 179 redundant documents. That is a 17% corpus reduction, fully auditable and fully reversible.
The pipeline you followed:
- Similarity produces raw Jaccard/cosine scores between document pairs
- Knowledge build trains a Bayesian model on those scores and computes posterior duplicate probabilities
- Labels let you correct the model with ground-truth feedback
- Retrain incorporates your feedback and reports precision/recall
- Vectors cluster related documents into groups
- Dedup scan plans the removal without changing anything
- Dedup apply executes the plan with legal-hold and signed-document protections
- Dedup list/show/rollback gives you full auditability and reversibility
For the next step in your Unsterwerx workflow, see How To Extract and Query Document Metadata with Unsterwerx.