Fuzziness is a matching technique that enables us to return matches when there are small variations between the spelling of your customer's name and the names referenced in risk lists and media. Fuzziness is sometimes referred to as edit distance or Levenshtein distance. Our matching algorithm uses fuzziness alongside many other techniques to match names.
Fuzziness will match entities with names that have an inserted, omitted or replaced character compared to the name you screened. Fuzziness allows up to one character difference for each word in the search term.
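As a rough illustration of the "one character difference per word" rule, the sketch below checks whether two words differ by at most one inserted, omitted or replaced character, then applies that check word by word. The function names are hypothetical, and the simplifications (same number of words, same word order, case-insensitive comparison) are assumptions made for the example. It is not the production matching algorithm, which combines fuzziness with many other techniques.

```python
def within_one_edit(a: str, b: str) -> bool:
    """Return True if a and b differ by at most one inserted,
    omitted or replaced character (edit distance of 0 or 1)."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    # Ensure a is the shorter (or equal-length) word.
    if len(a) > len(b):
        a, b = b, a
    # Skip the common prefix.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # One replaced character: the rest must be identical.
        return a[i + 1:] == b[i + 1:]
    # One inserted/omitted character: skip it in the longer word.
    return a[i:] == b[i + 1:]


def names_within_fuzziness(searched: str, candidate: str) -> bool:
    """Simplified, hypothetical check: every word in the searched name
    must be within one edit of the word in the same position in the
    candidate name. Word order and word count must match (assumption)."""
    searched_words = searched.lower().split()
    candidate_words = candidate.lower().split()
    if len(searched_words) != len(candidate_words):
        return False
    return all(
        within_one_edit(s, c)
        for s, c in zip(searched_words, candidate_words)
    )
```

Under these assumptions, `names_within_fuzziness("Jon Smith", "John Smyth")` is true because each word differs by one character, while `names_within_fuzziness("Jon Smith", "Joan Smythe")` is false because "Smith" and "Smythe" differ by two.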
Fuzziness is useful in two main scenarios:
1. The customer name has been manually entered and may contain typos or spelling mistakes
2. The customer name has been transliterated into Latin script from a non-Latin script, or is written in a different script compared to a risk list or media article.
By default, our matching algorithm uses fuzziness only for the second of these scenarios: matching transliteration variants of names. This means that, by default, fuzzy name matches are only returned when the searched and matched name words are broadly phonetically equivalent (using an industry-standard phonetic algorithm).
Configuration options are available to change this default behaviour such that the first scenario (typos and spelling mistakes) may also generate matches.
Understanding fuzziness percentages
Fuzziness always allows up to one character difference for each word in the search term. However, you can use the fuzziness percentage to configure how long a word must be for a difference to be considered significant.
This matters because longer names with one character different are more likely to be the same entity than shorter names. For example, the names Leederheimer and Lexderheimer are far more likely to be misspellings of each other than Lee and Lex.
Your choice of fuzziness percentage depends on your risk-based approach and how sure you are that the names you input for searching are correct. For example, if you take the information directly from the customers' identity documents, you will have greater confidence in it than if customers input it themselves, which is more prone to error.
Fuzziness Setting | Minimum word length to allow fuzziness |
0% | None (no fuzziness allowed) |
10% | 25 |
20% | 13 |
30% | 9 |
40% | 7 |
50% | 5 |
60% | 5 |
70% | 4 |
80% | 4 |
90% | 3 |
100% | 3 |
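The table can be read as a simple lookup from fuzziness setting to minimum word length. The sketch below encodes it that way. The names `MIN_WORD_LENGTH` and `fuzziness_allowed` are hypothetical, and treating the minimum as inclusive (a word of exactly that length qualifies) is an assumption made for illustration.

```python
from typing import Dict, Optional

# Minimum word length at which a one-character difference is allowed,
# keyed by fuzziness percentage. None means fuzziness is switched off.
MIN_WORD_LENGTH: Dict[int, Optional[int]] = {
    0: None,
    10: 25,
    20: 13,
    30: 9,
    40: 7,
    50: 5,
    60: 5,
    70: 4,
    80: 4,
    90: 3,
    100: 3,
}


def fuzziness_allowed(word: str, fuzziness_percent: int) -> bool:
    """Return True if a word is long enough for a one-character
    difference to be tolerated at the given fuzziness setting.
    Assumes the minimum length is inclusive."""
    if fuzziness_percent not in MIN_WORD_LENGTH:
        raise ValueError(f"Unsupported fuzziness setting: {fuzziness_percent}")
    min_length = MIN_WORD_LENGTH[fuzziness_percent]
    if min_length is None:
        return False
    return len(word) >= min_length
```

For example, under these assumptions the 12-letter word "Leederheimer" qualifies for fuzzy matching at 30% (minimum 9) but not at 20% (minimum 13), while the 3-letter word "Lee" qualifies only at 90% or 100%.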
What's the difference between 0% fuzziness and exact match?
There are several differences between 0% fuzziness and an exact match. Setting fuzziness to 0% only affects the edit distance matching behaviour described above.
Exact match also affects the following, in addition to setting fuzziness to 0%:
- Disables all pre-processing. For example, honorifics or suffixes like Mr, Ms, Dr or PhD are matched exactly.
- Disables all other forms of inexact name word matching, such as equivalent names and phonetic matching. Word order variations and also-known-as (AKA) matching remain enabled.
- Does not allow for extra words to be added. 'John Smith' won't match 'John Williams Smith'.
- Disables fuzziness for year of birth. When fuzziness is between 10% and 100%, we allow a one-year difference in year of birth.
All of the above matching behaviours are enabled by default and are disabled by using the exact match setting.
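The year-of-birth behaviour described above can be sketched as a small tolerance check. The function name and parameters are hypothetical. The sketch also assumes that 0% fuzziness, like exact match, requires the years to be identical, which the text implies but does not state outright.

```python
def year_of_birth_matches(
    searched_year: int,
    candidate_year: int,
    fuzziness_percent: int,
    exact_match: bool = False,
) -> bool:
    """Hypothetical sketch: allow a one-year difference in year of birth
    when fuzziness is between 10% and 100% and exact match is off.
    Assumes 0% fuzziness requires identical years."""
    if exact_match or fuzziness_percent == 0:
        tolerance = 0
    else:
        tolerance = 1
    return abs(searched_year - candidate_year) <= tolerance
```

Under these assumptions, a search for 1980 matches a record with year of birth 1981 at 50% fuzziness, but not when exact match is enabled.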
What is the impact on false positives?
Name matching is inherently probabilistic. The options described above let you trade off the risk of missing inexact name matches against the operational impact of a higher number of false positives.
To optimise this trade-off, we have capped general edit distance matching at one character per name word. This allows for the overwhelming majority of spelling errors and typos while controlling the number of false positives. This does not mean that only names within a single edit are considered matches. As noted above, we additionally use many other methods to match equivalent names, phonetically equivalent words, abbreviations, hypocorisms and more.
The ComplyAdvantage matching algorithm has been tested extensively (both internally and by independent consultants) across different names and name variations in our database.