How To Detect and Remove Duplicate Documents with Unsterwerx
Duplicate documents are a tax on every document corpus. They inflate storage, skew search results, and make governance harder. Unsterwerx provides a full deduplication pipeline that uses Bayesian probability scoring to find duplicates, groups them into knowledge vectors, and removes the redundant copies while preserving a single authoritative anchor for each group. Every removal is auditable and reversible.
In this article, you will train a knowledge model on a 1,042-document corpus, teach it with feedback labels, cluster documents into knowledge vectors, and apply deduplication to remove 179 duplicate documents - a 17% corpus reduction - with full rollback capability.
Prerequisites
Before you begin, you need:
- Unsterwerx v0.5.4 or later installed (installation guide)
- A corpus already ingested and indexed. If you have not done this yet, follow the Quick Start guide.
- Similarity scores already computed. Run unsterwerx similarity build if you have not yet. The knowledge model builds on top of similarity data.
- Familiarity with classification and retention policies is helpful but not required. See How To Classify Documents and Set Retention Policies with Unsterwerx for that workflow.
Step 1 - Build the Knowledge Model
The knowledge model uses a Naive Bayes classifier to score document pairs. It takes the similarity data your corpus already has (Jaccard and cosine scores from locality-sensitive hashing) and computes a posterior probability that each pair represents a true duplicate.
The model starts with bootstrap labels derived from your similarity data - it does not need manual input to produce initial results.
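Conceptually, scoring a pair with a Naive Bayes model means multiplying a per-feature likelihood for each class by that class's prior, then normalizing. The following sketch illustrates the idea with Gaussian likelihoods over the Jaccard and cosine features; the class means and spreads are made-up assumptions for illustration, not Unsterwerx internals (only the priors match the run output shown below):

```python
import math

def gaussian(x, mu, sigma):
    """Likelihood of a feature value under one class's Gaussian."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def posterior_duplicate(jaccard, cosine, p_dup=0.306, p_unrel=0.694):
    """Naive Bayes: multiply per-feature likelihoods by the class prior,
    then normalize. The per-class means/spreads below are illustrative."""
    like_dup = gaussian(jaccard, 0.9, 0.1) * gaussian(cosine, 0.95, 0.05)
    like_unrel = gaussian(jaccard, 0.1, 0.15) * gaussian(cosine, 0.2, 0.15)
    numerator = like_dup * p_dup
    denominator = numerator + like_unrel * p_unrel
    return numerator / denominator

# A near-identical pair gets a posterior close to 1.0; a dissimilar pair close to 0.
print(round(posterior_duplicate(0.98, 0.99), 3))
print(round(posterior_duplicate(0.05, 0.10), 3))
```

The key property: high Jaccard and cosine scores pull the posterior toward 1.0 even under a modest prior, which is why the top pairs in the output below all reach 1.000.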
Run the build:
unsterwerx knowledge build
Preflight checks...
All prerequisites met.
Building semantic features...
Corpus: 1042 docs, 1762686 unique terms (IDF snapshot #1)
Training Bayesian model...
Bootstrap labels: 220 positive, 440 negative
Model trained: run #1, P(dup)=0.306, P(unrel)=0.694
Scoring candidates...
Timing: Semantic: 1.1s | Scoring: 0.0s | Total: 1.5s
Candidates scored: 451
Top 20 pairs by posterior:
Doc A Doc B Posterior Jaccard Cosine
--------------------------------------------------------------------------------------------------------------
087b9cf2-9a36-4efc-8d02-5f9eb1acb504 ba160f91-5bc1-4db1-82fb-db1c31bbaf5a 1.000 1.000 1.000
4cd34b51-51b3-41c6-90ae-ea79ec6016c2 c7708a05-817a-4d0e-8cb2-a0d8e50ed522 1.000 0.906 0.930
418249e8-6733-4982-b2d4-c86968e32e54 ae18c21b-9970-4770-b175-27efd73bb56f 1.000 0.984 0.998
2c96bdc7-3055-4347-9ea0-d9c22ed2a1a0 30d29388-f857-4ee4-8f15-0c1d89d6e148 1.000 1.000 1.000
15b60961-7acf-433d-a467-44b76dd36676 ca403989-c9fc-4882-b47e-df2f5f01dec9 1.000 0.992 0.993
The model trained on 220 positive and 440 negative bootstrap labels. P(dup)=0.306 is the prior probability: about 30.6% of candidate pairs are likely duplicates. The remaining 69.4% are likely unrelated.
The top pairs all show a posterior of 1.000 with high Jaccard and cosine scores. These are near-certain duplicates. The model scored 451 candidate pairs total from 1,042 documents.
Step 2 - Evaluate the Model
Before you trust the model's judgments, check its internal consistency. The --evaluate flag runs the model and reports accuracy metrics:
unsterwerx knowledge build --evaluate
Model is current (no retrain needed).
Scoring candidates...
Timing: Semantic: 0.2s | Scoring: 0.0s | Total: 0.3s
Candidates scored: 451
Evaluation:
Post-train consistency: 100.0%
No user feedback labels yet. Add labels with 'knowledge labels add' for real precision/recall.
Post-train consistency is 100%, meaning the model's predictions perfectly match its own training data. That is a good baseline. But the model tells you something important: it has no user feedback yet. Without your labels, it cannot report precision or recall against ground truth.
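Post-train consistency, as described here, is simply the fraction of training pairs whose predicted class agrees with the label the model was trained on. A minimal sketch of that check (the function and data names are illustrative, not the tool's internals):

```python
def post_train_consistency(predictions, training_labels):
    """Fraction of training pairs where the model's predicted class
    matches the label it was trained on (1.0 = perfect agreement)."""
    matches = sum(1 for pair, label in training_labels.items()
                  if predictions.get(pair) == label)
    return matches / len(training_labels)

# Toy example: the model reproduces all three of its training labels.
labels = {("a", "b"): "duplicate", ("c", "d"): "unrelated", ("e", "f"): "duplicate"}
preds = {("a", "b"): "duplicate", ("c", "d"): "unrelated", ("e", "f"): "duplicate"}
print(post_train_consistency(preds, labels))  # 1.0
```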
Time to teach it.
Step 3 - Teach the Model with Feedback Labels
You improve the model by labeling document pairs. Pick one pair from the top results that you know is a duplicate, and one pair you know is unrelated.
Label a known duplicate pair:
unsterwerx knowledge labels add 087b9cf2-9a3 ba160f91-5bc --label duplicate_or_same_concept
Label added: 087b9cf2-9a3 / ba160f91-5bc → duplicate_or_same_concept
Label a known unrelated pair:
unsterwerx knowledge labels add 694cac44-8ef dd3b3a3f-508 --label unrelated
Label added: 694cac44-8ef / dd3b3a3f-508 → unrelated
Even two labels give the model ground truth to measure itself against. In practice, labeling 10-20 pairs across different score ranges will sharpen precision significantly.
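One way to spread those 10-20 labels across score ranges is to bucket candidate pairs by posterior and sample a few from each bucket, so your feedback covers uncertain pairs as well as confident ones. A sketch of that selection strategy (the bucketing scheme is an assumption, not an Unsterwerx feature):

```python
import random

def stratified_label_candidates(scored_pairs, per_bucket=3, seed=42):
    """Bucket (pair, posterior) tuples into low/mid/high score ranges and
    sample a few from each, giving feedback labels broad coverage."""
    buckets = {"low": [], "mid": [], "high": []}
    for pair, posterior in scored_pairs:
        if posterior < 0.4:
            buckets["low"].append(pair)
        elif posterior < 0.8:
            buckets["mid"].append(pair)
        else:
            buckets["high"].append(pair)
    rng = random.Random(seed)
    picks = []
    for pairs in buckets.values():
        picks.extend(rng.sample(pairs, min(per_bucket, len(pairs))))
    return picks

scored = [(f"pair{i}", i / 20) for i in range(20)]  # posteriors 0.00 to 0.95
print(stratified_label_candidates(scored))  # up to 9 pairs spanning all ranges
```

Label each picked pair with knowledge labels add as shown above, then retrain.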
Step 4 - Retrain with Your Feedback
Now rebuild the model with the --retrain and --evaluate flags together. The model incorporates your labels into its training and reports precision/recall against them:
unsterwerx knowledge build --retrain --evaluate
Building semantic features...
Corpus: 1042 docs, 1762686 unique terms (IDF snapshot #1)
Training Bayesian model...
Bootstrap labels: 220 positive, 440 negative
Model trained: run #2, P(dup)=0.306, P(unrel)=0.694
Scoring candidates...
Timing: Semantic: 0.2s | Scoring: 0.0s | Total: 0.5s
Candidates scored: 451
Evaluation:
Post-train consistency: 100.0%
User feedback labels: 2
Feedback precision: 50.0%
Feedback recall: 100.0%
Feedback F1: 66.7%
The model now reports against your 2 feedback labels. Recall is 100% - it found the duplicate you confirmed. Precision is 50% - it also flagged the unrelated pair as a duplicate, which your label corrected. The F1 score of 66.7% reflects both.
With more labels, these metrics improve. The Bayesian approach means each label shifts the model's posterior probabilities, and you can iteratively retrain until precision meets your requirements.
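The reported metrics follow the standard definitions. Applied to the two labels above (one true positive: the confirmed duplicate the model caught; one false positive: the unrelated pair it wrongly flagged), a short sketch reproduces the numbers:

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions over true/false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 1 true positive (confirmed duplicate), 1 false positive (unrelated
# pair flagged as duplicate), 0 false negatives.
p, r, f1 = precision_recall_f1(tp=1, fp=1, fn=0)
print(f"precision={p:.1%} recall={r:.1%} f1={f1:.1%}")
# precision=50.0% recall=100.0% f1=66.7%
```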
Step 5 - Cluster Documents into Knowledge Vectors
Knowledge vectors group related documents into clusters. Each vector represents a concept or topic in your corpus - what the TCA patent calls organizing documents within the Universal Data Set (the normalized canonical representation of all ingested data).
Build the vectors from your trained model:
unsterwerx knowledge vectors build
Knowledge Vector Build Results:
Vectors created: 121
Vectors updated: 0
Vectors deleted: 0
Edges created: 0
Documents clustered: 352
Singletons dropped: 0
Time: 0.07s
Run ID: 6856990f
Upstream similarity: 6a3f9de3
The model grouped 352 documents into 121 knowledge vectors. The remaining 690 documents had no similarity candidates and remain unclustered. This is expected - not every document has a close neighbor.
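Grouping scored pairs into vectors can be pictured as finding connected components in a graph whose edges are pairs that clear the duplicate threshold. The sketch below uses union-find to do that; this is an assumption about the general approach, not Unsterwerx's actual clustering algorithm:

```python
def cluster_pairs(pairs, threshold=0.8):
    """Union-find over document pairs whose posterior clears the threshold;
    each resulting component plays the role of one knowledge vector."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b, posterior in pairs:
        if posterior >= threshold:
            union(a, b)

    clusters = {}
    for doc in parent:
        clusters.setdefault(find(doc), set()).add(doc)
    return list(clusters.values())

pairs = [("d1", "d2", 0.99), ("d2", "d3", 0.95), ("d4", "d5", 0.91), ("d1", "d6", 0.2)]
print(cluster_pairs(pairs))  # two clusters: {d1, d2, d3} and {d4, d5}; d6 never clusters
```

Documents with no edge above the threshold (like d6 here) stay unclustered, which matches the 690 singletons in the run above.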
Step 6 - Explore Your Knowledge Vectors
Before deduplicating, explore the vectors to understand what the model found.
List vectors
unsterwerx knowledge vectors list
ID Label Docs Confidence Method
-------------------------------------------------------------------------------------------------------------------
710175d2-39d5-4f03-a59a-e9dbd15807aa NBIS CCEP Draft Artifact Templates -DRAFT v2 3 1.000 bayesian_lsh_v1
cf99980f-f5bb-4f6e-84fb-335ebe087ce6 CDEROne Development Overview 12-3 3 1.000 bayesian_lsh_v1
2662141d-47a0-4c40-bc46-36de359d71b1 NBIS Product Story - DataLake v 2017 9-20 4 1.000 bayesian_lsh_v1
f8cb9afc-279f-4c78-a982-885ece0c4921 Adv NBIS Prototype Overview v09.25.2017 4 1.000 bayesian_lsh_v1
a56ec25e-3127-4758-8bf9-d40f32ff7c41 NBIS Kickoff Briefing 2017 11 02 v0.13 3 1.000 bayesian_lsh_v1
236c2e99-0201-4efe-9c9b-4cc7a7243fd9 SOFWERX Data Science.7 3 1.000 bayesian_lsh_v1
53daa373-c044-4749-99ac-2b89df0b5296 outline 3 1.000 bayesian_lsh_v1
...
Each vector has a label (derived from the primary document's filename), a document count, a confidence score, and the clustering method. Vectors with 3-4 documents and 1.000 confidence are strong duplicate clusters.
Inspect a specific vector
unsterwerx knowledge vectors show 710175d2-39d5-4f03-a59a-e9dbd15807aa
Vector: 710175d2-39d5-4f03-a59a-e9dbd15807aa
Label: NBIS CCEP Draft Artifact Templates -DRAFT v2
Confidence: 1.000
Members: 3
Method: bayesian_lsh_v1
Representative: 76e98e7b-bad9-498e-9bc8-f53b5ec7da34
Members:
Document ID File Name Score Primary
-----------------------------------------------------------------------------------------------
76e98e7b-bad9-498e-9bc8-f53b5ec7da34 NBIS_CCEP Draft Artifact Templates -DR 1.000 *
82f5dc90-0c97-4b3b-938d-6981a6cb79dd NBIS_CCEP Draft Artifact Templates -DR 1.000
b500fbdd-a59d-4477-b172-b15e99ca64a0 NBIS_CCEP Draft Artifact Templates -DR 1.000
Three copies of the same artifact template, all scoring 1.000. The primary member (marked with *) is the representative document. During dedup, one copy will be kept as the anchor and the other two removed.
Search across vectors
You can search vectors by keyword. This searches document content within all vectors:
unsterwerx knowledge vectors search "business"
Vector ID Vector Label Document Snippet
--------------------------------------------------------------------------------------------------------------------------------------------
1d431706-bc7a-426d-ab7e-26a0c9db5301 4.NBIS Conceptual System Arc 83667607-1187-43d6-8c60-7d028c28b4ae ...Business Application Tier
cf99980f-f5bb-4f6e-84fb-335ebe087ce6 CDEROne Development Overview 832db84c-1b2a-4c89-a321-83e3771cd5cd ...Allows visibility into to the data owners, sources, proce
a047ad7b-4bf2-4a3f-998e-c77fe4f3fdc5 PitchDeck Template - old (1) 43f2cedc-09f6-4624-8022-5bed7ccb578c ...Business Model
...
10 vectors matched.
This is useful for verifying that documents you care about are clustered correctly before running dedup.
Step 7 - Scan for Dedup Candidates
The dedup scan is non-destructive. It analyzes your vectors and produces a plan showing exactly what would be kept and what would be removed, without changing anything.
unsterwerx knowledge dedup scan
Scanning for deduplication candidates (threshold=0.800)...
Dedup scan run: f1e54255
Vector graph run: 6856990f
Model: #2, Threshold: 0.800
Vectors affected: 115, Total kept: 149, Total removed: 190
The scan found 115 vectors with dedup candidates at a posterior threshold of 0.800. It plans to keep 149 documents and remove 190.
Each vector entry shows the decision logic:
Vector: 710175d2-39d5-4f03-a59a-e9dbd15807aa (NBIS CCEP Draft Artifact Templates -DRAF)
Confidence: 1.000 | Kept: 1 | Removed: 2 | Anchor: b500fbdd-a59
Document File Name Weight Posterior Signed Decision
-------------------------------------------------------------------------------------
76e98e7b-bad NBIS_CCEP Draft Artifact Tem 2 1.000 REMOVE (posterior 1.000 >= threshold 0.800)
82f5dc90-0c9 NBIS_CCEP Draft Artifact Tem 2 1.000 REMOVE (posterior 1.000 >= threshold 0.800)
b500fbdd-a59 NBIS_CCEP Draft Artifact Tem 2 — KEEP (primary anchor (highest weight))
The anchor is the document with the highest weight in the vector. It is always kept. Documents with a posterior probability at or above the threshold (0.800) are marked for removal. Documents below the threshold stay.
Here is the key concept: within each vector, one document is the anchor - the authoritative representative. Everything else that scores above the threshold is a redundant copy.
Notice how the scan also respects document protections:
Vector: fd321fb7-0fa8-4a34-9f02-450a95c9b0a4 (2020AUG WHETSEL AT RFO)
Confidence: 0.876 | Kept: 4 | Removed: 5 | Anchor: d1349c98-826
Document File Name Weight Posterior Signed Decision
-------------------------------------------------------------------------------------
d1349c98-826 RFO Request 06-19 September. 2 — KEEP (primary anchor (highest weight))
dfaf0650-537 RFO Request 06-19 September_ 2 — yes KEEP (signed document)
0ff3ba58-8e4 2020AUG_WHETSEL_AT_RFO.pdf 2 1.000 REMOVE (posterior 1.000 >= threshold 0.800)
4a427ccc-1ca 05192020_Request-For-order_v 2 0.525 KEEP (posterior 0.525 < threshold 0.800 with anchor)
The signed document (dfaf0650-537) is automatically kept regardless of its posterior score. Unsterwerx never removes digitally signed documents during dedup. Documents below the threshold (like the one at 0.525) are also preserved.
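Putting the scan rules and the apply-time legal-hold guard together, the per-document decision reads as a precedence of checks. A sketch of that precedence (the field names are illustrative, not the tool's data model):

```python
def dedup_decision(doc, is_anchor, threshold=0.8):
    """Precedence of rules mirroring the plan output:
    legal hold > signature > anchor > posterior threshold."""
    if doc.get("legal_hold"):
        return "SKIP (legal hold)"
    if doc.get("signed"):
        return "KEEP (signed document)"
    if is_anchor:
        return "KEEP (primary anchor)"
    if doc["posterior"] >= threshold:
        return f"REMOVE (posterior {doc['posterior']:.3f} >= threshold {threshold:.3f})"
    return f"KEEP (posterior {doc['posterior']:.3f} < threshold {threshold:.3f})"

print(dedup_decision({"signed": True, "posterior": 1.0}, is_anchor=False))
print(dedup_decision({"posterior": 1.0}, is_anchor=False))
print(dedup_decision({"posterior": 0.525}, is_anchor=False))
```

Protections are checked before the score, which is why a signed document with posterior 1.000 is still kept.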
Step 8 - Preview the Dedup Plan
Before committing to anything, run a dry run. This produces the exact same scan output but explicitly confirms nothing will change:
unsterwerx knowledge dedup apply --dry-run
Scanning for deduplication candidates (threshold=0.800)...
Dedup scan run: 12dc61bf
Vector graph run: 6856990f
Model: #2, Threshold: 0.800
Vectors affected: 115, Total kept: 149, Total removed: 190
Vector: 710175d2-39d5-4f03-a59a-e9dbd15807aa (NBIS CCEP Draft Artifact Templates -DRAF)
Confidence: 1.000 | Kept: 1 | Removed: 2 | Anchor: b500fbdd-a59
...
Dry run — no changes applied.
Review the full output carefully. If any vector's decisions look wrong, go back to Step 3 and add more labels for the pairs in question, then retrain.
Step 9 - Apply Deduplication
When you are satisfied with the plan, apply it:
unsterwerx knowledge dedup apply --confirm
Scanning for deduplication candidates (threshold=0.800)...
Dedup scan run: a15c52fb
Vector graph run: 6856990f
Model: #2, Threshold: 0.800
Vectors affected: 115, Total kept: 149, Total removed: 190
...
Applying deduplication...
WARN Doc under legal hold — skipping doc=86f07ef9-a54f-4c18-9350-2f08b94e10c5
WARN Doc under legal hold — skipping doc=d3cb9102-fd5d-4ca0-a00a-2425d9614e32
WARN Doc under legal hold — skipping doc=e170511f-a011-48cf-bc30-77abac817f47
WARN Doc under legal hold — skipping doc=147930e2-5b63-4dc1-88bf-2e744b28c75a
WARN Doc under legal hold — skipping doc=2886d59f-601f-4687-b76f-e4c426fd8a54
WARN Doc under legal hold — skipping doc=32673ebe-ba20-44c6-84ba-e906709ba891
WARN Doc under legal hold — skipping doc=5755db8d-f11d-4bb8-ae01-479160ca3752
WARN Doc under legal hold — skipping doc=75c300c7-ff59-452d-ae3c-b48d457278bf
WARN Doc under legal hold — skipping doc=aa918a34-3510-4abf-af1b-a6c6c3972be6
WARN Doc under legal hold — skipping doc=b385039a-8c18-4894-9a34-0175c5d33dd1
WARN Doc under legal hold — skipping doc=feabec39-edca-49fb-91fe-543506c83729
Deduplication complete:
Documents removed: 179
Diffs computed: 179
Provenance merged: 68
Labels inserted: 179
Errors: 11
Rule ID: c26f7edd-3a51-4f07-aaf0-9bd71b7f96b2
Time: 0.42s
Several things happened here:
- 179 documents removed. Each removal is recorded as a dedup action with a diff against the anchor.
- 179 diffs computed. Unsterwerx stores the difference between each removed document and its anchor so nothing is truly lost.
- 68 provenance records merged. Metadata from removed documents is merged into their anchors.
- 11 documents under legal hold were skipped. This is the Business Intelligence layer (the TCA patent's rules of hierarchy) enforcing governance policy. Documents with active legal holds cannot be removed by dedup, period. These 11 skips are reported as "errors" in the summary, but they are intentional protections.
- The entire operation completed in 0.42 seconds.
Note: The 11 "errors" are legal-hold protections, not failures. Unsterwerx treats any document it cannot process as an error to ensure you review the output. Legal holds always win over dedup.
Step 10 - Verify and Inspect the Results
List dedup rules
Every dedup operation creates a named rule you can reference later:
unsterwerx knowledge dedup list
Rule ID Name Actions Active Created
-----------------------------------------------------------------------------------------------
c26f7edd-3a51-4f07-aaf0-9bd71b7f96b2 dedup-2026-04-14T15:28:03 (t 179 yes 2026-04-14 15:28:03
1 dedup rules total.
Inspect individual actions
Show the full list of actions in a dedup rule:
unsterwerx knowledge dedup show c26f7edd-3a51-4f07-aaf0-9bd71b7f96b2
Rule: c26f7edd-3a51-4f07-aaf0-9bd71b7f96b2
ID Document Vector Type Restored Status Created
-----------------------------------------------------------------------------------------------
1 76e98e7b-bad 710175d2-39d dedup_remove classified 2026-04-14 15:28:03
2 82f5dc90-0c9 710175d2-39d dedup_remove classified 2026-04-14 15:28:03
3 c74b23dd-fad cf99980f-f5b dedup_remove classified 2026-04-14 15:28:03
4 e158d22c-5ef cf99980f-f5b dedup_remove classified 2026-04-14 15:28:03
5 3f03ab9f-d19 c7975b09-3a7 dedup_remove canonical 2026-04-14 15:28:03
...
179 actions total.
To rollback a single action: knowledge dedup rollback <action-id>
To rollback the entire rule: knowledge dedup rollback <rule-id>
Each action shows the removed document ID, the vector it belonged to, the action type (dedup_remove), and the restored status (classified or canonical). The "Restored" column tells you what state the document would return to if you roll it back.
Rollback if needed
If you need to undo the deduplication, you can roll back individual actions or the entire rule:
# Roll back a single document
unsterwerx knowledge dedup rollback 1
# Roll back the entire dedup operation
unsterwerx knowledge dedup rollback c26f7edd-3a51-4f07-aaf0-9bd71b7f96b2
Every dedup removal is reversible because Unsterwerx stores the diff between the removed document and its anchor. Rollback restores the document to its pre-dedup state. This is the Shared Sandbox principle from the TCA patent: all processing happens in a trusted local environment where operations are auditable and recoverable.
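Diff-based reversibility can be illustrated with Python's standard difflib: keep only the line diff needed to rebuild the removed copy from its anchor, delete the copy, and replay the diff on rollback. This is a sketch of the principle, not Unsterwerx's actual storage format:

```python
import difflib

def remove_with_diff(anchor_text, duplicate_text):
    """Store the line diff needed to rebuild the duplicate from its anchor;
    the duplicate itself can then be deleted."""
    return list(difflib.ndiff(anchor_text.splitlines(keepends=True),
                              duplicate_text.splitlines(keepends=True)))

def rollback(diff):
    """Restore the removed copy by replaying the stored diff."""
    return "".join(difflib.restore(diff, 2))  # 2 selects the duplicate's side

anchor = "Title\nShared body\n"
duplicate = "Title\nShared body\nExtra trailing note\n"
stored = remove_with_diff(anchor, duplicate)
assert rollback(stored) == duplicate  # the removed document is fully recoverable
print("rollback restores the duplicate byte-for-byte")
```

Because the diff preserves everything that differed from the anchor, nothing is lost even though only one physical copy remains.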
Conclusion
You trained a Bayesian knowledge model on 1,042 documents, taught it with feedback labels, clustered 352 documents into 121 knowledge vectors, and applied deduplication to remove 179 redundant documents. That is a 17% corpus reduction, fully auditable and fully reversible.
The pipeline you followed:
- Similarity produces raw Jaccard/cosine scores between document pairs
- Knowledge build trains a Bayesian model on those scores and computes posterior duplicate probabilities
- Labels let you correct the model with ground-truth feedback
- Retrain incorporates your feedback and reports precision/recall
- Vectors cluster related documents into groups
- Dedup scan plans the removal without changing anything
- Dedup apply executes the plan with legal-hold and signed-document protections
- Dedup list/show/rollback gives you full auditability and reversibility
For the next step in your Unsterwerx workflow, see How To Extract and Query Document Metadata with Unsterwerx.