Unsterwerx

similarity

Runs canonical extraction (if needed), then performs MinHash + LSH similarity analysis across all documents. It identifies exact duplicates and near-duplicate pairs whose Jaccard similarity is above the specified threshold.
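To make the MinHash step concrete, here is a minimal Python sketch of how a MinHash signature can estimate Jaccard similarity. It is not Unsterwerx's implementation: the 3-word shingling, the BLAKE2b-based hash family, and the function names (`shingles`, `minhash_signature`, `estimate_jaccard`) are all assumptions chosen for illustration. Only the signature length of 128 mirrors the documented built-in default.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set[str]:
    """Split text into overlapping k-word shingles (k=3 is an illustrative choice)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(item: str, seed: int) -> int:
    """One member of a seeded hash family (hypothetical; any good family works)."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items: set[str], num_hashes: int = 128) -> list[int]:
    """Keep the minimum hash value of the set under each seeded hash function."""
    return [min(_hash(item, seed) for item in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature positions estimates the Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


doc_a = shingles("the quick brown fox jumps over the lazy dog near the river bank")
doc_b = shingles("the quick brown fox jumps over the lazy dog near the old mill")

print("exact Jaccard:   ", jaccard(doc_a, doc_b))
print("MinHash estimate:", estimate_jaccard(minhash_signature(doc_a), minhash_signature(doc_b)))
```

The estimate converges on the exact value as the number of hash functions grows, so a larger `--num-hashes` should trade extra computation for a less noisy score.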

Usage

bash
unsterwerx similarity [OPTIONS]

Options

| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --threshold | -t | float | config similarity.threshold (built-in: 0.3) | Jaccard similarity threshold (0.0–1.0) |
| --num-hashes | | integer | config similarity.num_hashes (built-in: 128) | Number of MinHash hash functions |
| --bands | | integer | config similarity.lsh_bands (built-in: 32) | Number of LSH bands |
| --rows | | integer | config similarity.lsh_rows (built-in: 4) | Number of LSH rows per band |
| --top | | integer | 20 | Show top N candidate pairs in output |
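The band and row settings interact with the number of hashes. In the standard LSH banding scheme, a signature is cut into `bands` groups of `rows` values each, and the built-in defaults are consistent with that layout (32 × 4 = 128). The sketch below assumes Unsterwerx follows the textbook scheme, which this page does not confirm; the function names are hypothetical. It computes the probability that a pair with true Jaccard similarity `s` collides in at least one band and so becomes a candidate.

```python
def candidate_probability(s: float, bands: int = 32, rows: int = 4) -> float:
    """P(pair shares at least one identical band) under textbook LSH banding.

    A single band matches with probability s**rows, so the pair is missed
    only if all `bands` bands fail to match.
    """
    return 1.0 - (1.0 - s ** rows) ** bands


def approx_lsh_threshold(bands: int = 32, rows: int = 4) -> float:
    """Common rule of thumb for where the S-curve rises most steeply: (1/b)**(1/r)."""
    return (1.0 / bands) ** (1.0 / rows)


# Defaults from the options table: 32 bands x 4 rows covers 128 hash functions.
assert 32 * 4 == 128

for s in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"s={s:.1f}  P(candidate)={candidate_probability(s):.4f}")

print(f"approximate LSH threshold: {approx_lsh_threshold():.3f}")
```

Under these assumptions the steep part of the curve sits near 0.42, so more bands with fewer rows would surface lower-similarity pairs as candidates, while fewer bands with more rows would make candidate generation stricter.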

Precedence: CLI flag > config file (config set similarity.*) > built-in default. If you omit a flag, the value from your config file is used; if the config file has no entry either, the built-in default applies.
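The precedence rule can be sketched as a small lookup. This is an illustrative Python model, not Unsterwerx's code: the `resolve` function and the dictionary representation of the config file are assumptions, while the keys and built-in defaults are taken from the options table.

```python
from typing import Any, Mapping, Optional

# Built-in defaults as listed in the options table.
BUILTIN_DEFAULTS: dict[str, Any] = {
    "similarity.threshold": 0.3,
    "similarity.num_hashes": 128,
    "similarity.lsh_bands": 32,
    "similarity.lsh_rows": 4,
}


def resolve(key: str, cli_value: Optional[Any], config: Mapping[str, Any]) -> Any:
    """Return the effective value: CLI flag, then config file, then built-in default."""
    if cli_value is not None:      # flag given on the command line
        return cli_value
    if key in config:              # entry present in the config file
        return config[key]
    return BUILTIN_DEFAULTS[key]   # nothing set anywhere


config_file = {"similarity.threshold": 0.5}  # hypothetical config contents

print(resolve("similarity.threshold", 0.7, config_file))   # CLI flag wins
print(resolve("similarity.threshold", None, config_file))  # falls back to config
print(resolve("similarity.lsh_bands", None, config_file))  # falls back to built-in
```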

Examples

Default similarity analysis

bash
unsterwerx similarity
Running canonical extraction (if needed)...
Running similarity analysis...

Similarity Analysis
══════════════════════════════════
  Documents processed:     1806
  Candidate pairs:          371
  Exact duplicates:          97
  Threshold:               0.30
══════════════════════════════════

  Top Pairs:
    1.000  contacts.csv <-> contacts.xlsx
    1.000  CDEROne-Challenge_v0.1.pptx <-> CDEROne-Challenge_v0.2.pptx
    1.000  RST_Form_v5.pdf <-> RST_Oct2021.pdf
    ...

Higher threshold (stricter matching)

bash
unsterwerx similarity --threshold 0.7
Similarity Analysis
══════════════════════════════════
  Documents processed:     1806
  Candidate pairs:          320
  Exact duplicates:          97
  Threshold:               0.70
══════════════════════════════════

Lower threshold with more results

bash
unsterwerx similarity --threshold 0.1 --top 10
Similarity Analysis
══════════════════════════════════
  Documents processed:     1806
  Candidate pairs:          371
  Exact duplicates:          97
  Threshold:               0.10
══════════════════════════════════

Notes

Tip: A Jaccard score of 1.0 means the two documents have identical text content after normalization. A score of 0.7 or higher typically indicates the same document with minor edits.