similarity
Runs canonical extraction (if needed) and then performs MinHash + LSH similarity analysis across all documents. Identifies exact duplicates and near-duplicate pairs above the specified Jaccard threshold.
Usage
```bash
unsterwerx similarity [OPTIONS]
```
Options
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| `--threshold` | `-t` | float | config `similarity.threshold` (built-in: 0.3) | Jaccard similarity threshold (0.0–1.0) |
| `--num-hashes` | | integer | config `similarity.num_hashes` (built-in: 128) | Number of MinHash hash functions |
| `--bands` | | integer | config `similarity.lsh_bands` (built-in: 32) | Number of LSH bands |
| `--rows` | | integer | config `similarity.lsh_rows` (built-in: 4) | Number of LSH rows per band |
| `--top` | | integer | 20 | Show top N candidate pairs in output |
Precedence: CLI flag > config file (config set similarity.*) > built-in default. If you omit a flag, the value from your config file is used. If the config file has no entry, the built-in default applies.
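The precedence rule can be pictured as a three-step fallback. The sketch below is illustrative only: `resolve_setting` is a hypothetical helper, not part of unsterwerx, and it assumes an omitted flag or a missing config entry is represented as `None`.

```python
from typing import Optional, TypeVar

T = TypeVar("T")


def resolve_setting(cli_value: Optional[T],
                    config_value: Optional[T],
                    builtin_default: T) -> T:
    """Return the effective value: CLI flag, then config file, then built-in default."""
    if cli_value is not None:       # flag given on the command line
        return cli_value
    if config_value is not None:    # entry present in the config file
        return config_value
    return builtin_default          # nothing set anywhere


# Threshold examples, using the built-in default of 0.3 from the table above.
print(resolve_setting(0.7, 0.5, 0.3))    # flag wins over config
print(resolve_setting(None, 0.5, 0.3))   # no flag: config value is used
print(resolve_setting(None, None, 0.3))  # neither set: built-in default
```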
Examples
Default similarity analysis
```bash
unsterwerx similarity
Running canonical extraction (if needed)...
Running similarity analysis...
Similarity Analysis
══════════════════════════════════
Documents processed: 1806
Candidate pairs: 371
Exact duplicates: 97
Threshold: 0.30
══════════════════════════════════
Top Pairs:
1.000 contacts.csv <-> contacts.xlsx
1.000 CDEROne-Challenge_v0.1.pptx <-> CDEROne-Challenge_v0.2.pptx
1.000 RST_Form_v5.pdf <-> RST_Oct2021.pdf
...
```
Higher threshold (stricter matching)
```bash
unsterwerx similarity --threshold 0.7
Similarity Analysis
══════════════════════════════════
Documents processed: 1806
Candidate pairs: 320
Exact duplicates: 97
Threshold: 0.70
══════════════════════════════════
```
Lower threshold, limited to the top 10 pairs
```bash
unsterwerx similarity --threshold 0.1 --top 10
Similarity Analysis
══════════════════════════════════
Documents processed: 1806
Candidate pairs: 371
Exact duplicates: 97
Threshold: 0.10
══════════════════════════════════
```
Notes
- This command triggers canonical extraction for any documents that have not yet been processed.
- The built-in threshold of 0.3 catches most near-duplicates while avoiding excessive false positives. Override it persistently with `unsterwerx config set similarity.threshold 0.5`, or per-run with `--threshold 0.5`.
- Exact duplicates (Jaccard score = 1.0) indicate documents with identical canonical content, even if their filenames or original formats differ (e.g., `.csv` vs `.xlsx`).
- The MinHash parameters (`--num-hashes`, `--bands`, `--rows`) control the accuracy/performance tradeoff. The built-in defaults work well for collections up to ~10,000 documents. All four tuning parameters (the threshold plus these three) can be persisted via `config set` (see Configuration).
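To get a feel for how the band and row counts shape the tradeoff, the sketch below evaluates the textbook LSH banding formula: a pair with Jaccard similarity `s` lands in at least one shared bucket with probability `1 - (1 - s^rows)^bands`. This is the standard model for banded MinHash, and it is an assumption that unsterwerx generates candidates the same way; the function names are hypothetical.

```python
def candidate_probability(similarity: float, bands: int, rows: int) -> float:
    """Chance that a pair shares at least one LSH bucket (textbook banding model)."""
    # A single band matches only if all of its rows agree: similarity ** rows.
    # The pair is missed only if every one of the bands fails to match.
    return 1.0 - (1.0 - similarity ** rows) ** bands


def approximate_lsh_threshold(bands: int, rows: int) -> float:
    """Similarity at which the S-curve rises most steeply, approx. (1/bands)^(1/rows)."""
    return (1.0 / bands) ** (1.0 / rows)


# Built-in defaults from the options table: 32 bands x 4 rows = 128 hashes.
BANDS, ROWS = 32, 4

print(f"approximate LSH threshold: {approximate_lsh_threshold(BANDS, ROWS):.2f}")
for s in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"similarity {s:.1f} -> candidate probability "
          f"{candidate_probability(s, BANDS, ROWS):.3f}")
```

Under this model, more bands with fewer rows each makes the filter more permissive (more candidates, more comparisons), while fewer bands with more rows each makes it stricter. With the defaults, the curve is steepest near a similarity of 0.42, so a pair close to 0.3 would be surfaced only some of the time if the tool follows this scheme.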
Tip: A Jaccard score of 1.0 means the documents have identical text content after normalization. A score of 0.7+ typically indicates the same document with minor edits.
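For intuition about what these scores measure, here is a minimal sketch of exact Jaccard similarity and a MinHash estimate of it. It is not the tool's implementation: the 3-word shingling, the lowercase normalization, and the BLAKE2b-based hash family are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set[str]:
    """Split text into overlapping k-word shingles (assumed normalization: lowercase)."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity: |A intersect B| / |A union B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(items: set[str], num_hashes: int = 128) -> list[int]:
    """One minimum per salted hash function; items must be non-empty."""
    if not items:
        raise ValueError("cannot compute a MinHash signature of an empty set")
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a distinct salt gives a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature positions that agree, an estimate of Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


doc_a = shingles("the quick brown fox jumps over the lazy dog")
doc_b = shingles("the quick brown fox leaps over the lazy dog")

print("exact Jaccard:   ", jaccard(doc_a, doc_b))
print("MinHash estimate:", estimate_jaccard(minhash_signature(doc_a),
                                            minhash_signature(doc_b)))
```

The two sentences differ by one word, yet with 3-word shingles they share only 4 of 10 distinct shingles, so the exact score is 0.4. Short texts are punished heavily by a single edit; on full-length documents the same edit barely moves the score, which is why 0.7+ usually points to the same document with minor changes.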