similarity

Runs canonical extraction (if needed) and then performs MinHash + LSH similarity analysis across all documents. Identifies exact duplicates and near-duplicate pairs above the specified Jaccard threshold.

Usage

bash

unsterwerx similarity [OPTIONS]

Options

Option	Short	Type	Default	Description
`--threshold`	`-t`	float	config `similarity.threshold` (built-in: 0.3)	Jaccard similarity threshold (0.0–1.0)
`--num-hashes`		integer	config `similarity.num_hashes` (built-in: 128)	Number of MinHash hash functions
`--bands`		integer	config `similarity.lsh_bands` (built-in: 32)	Number of LSH bands
`--rows`		integer	config `similarity.lsh_rows` (built-in: 4)	Number of LSH rows per band
`--top`		integer	20	Show top N candidate pairs in output

Precedence: CLI flag > config file (config set similarity.*) > built-in default. If you omit a flag, the value from your config file is used. If the config file has no entry, the built-in default applies.

Examples

Default similarity analysis

bash

unsterwerx similarity

Running canonical extraction (if needed)...
Running similarity analysis...

Similarity Analysis
══════════════════════════════════
  Documents processed:     1806
  Candidate pairs:          371
  Exact duplicates:          97
  Threshold:               0.30
══════════════════════════════════

  Top Pairs:
    1.000  contacts.csv <-> contacts.xlsx
    1.000  CDEROne-Challenge_v0.1.pptx <-> CDEROne-Challenge_v0.2.pptx
    1.000  RST_Form_v5.pdf <-> RST_Oct2021.pdf
    ...

Higher threshold (stricter matching)

bash

unsterwerx similarity --threshold 0.7

Similarity Analysis
══════════════════════════════════
  Documents processed:     1806
  Candidate pairs:          320
  Exact duplicates:          97
  Threshold:               0.70
══════════════════════════════════

Lower threshold with more results

bash

unsterwerx similarity --threshold 0.1 --top 10

Similarity Analysis
══════════════════════════════════
  Documents processed:     1806
  Candidate pairs:          371
  Exact duplicates:          97
  Threshold:               0.10
══════════════════════════════════

Notes

This command triggers canonical extraction for any documents that have not yet been processed.
The built-in threshold of 0.3 catches most near-duplicates while avoiding excessive false positives. Override it persistently with unsterwerx config set similarity.threshold 0.5, or per-run with --threshold 0.5.
Exact duplicates (Jaccard score = 1.0) indicate documents with identical canonical content, even if their filenames or original formats differ (e.g., .csv vs .xlsx).
The MinHash parameters (--num-hashes, --bands, --rows) control the accuracy/performance tradeoff. The built-in defaults work well for collections up to ~10,000 documents. All four parameters can be persisted via config set (see Configuration).

Tip: A Jaccard score of 1.0 means the documents have identical text content after normalization. A score of 0.7+ typically indicates the same document with minor edits.