During interaction with systems people often misspell or mistype their queries. As a result, the users do not get the expected result back. That is why we need to find the most likely result when an exact match to a query is not found. Fuzzy search algorithms have been here for a while. They try to produce a likely relevant result even when the search argument does not exactly correspond to the desired information.
PostgreSQL via the fuzzystrmatch module provides several functions that can help you determine similarities between strings.
Soundex is a phonetic algorithm for indexing/comparing English words based on phonetic similarity. The main function of this algorithm is to group up the same sounding words together by reducing each word to a four-character alphanumeric code. The first letter of the code corresponds to the first letter of the word. The rest of the code consists of three digits (from 1 to 6) derived from the syllables of the word according to the following groups:
- B, F, P, V
- C, G, J, K, Q, S, X, Z
- D, T
- M, N
The letters A, E, I, O, U, H, W, Y unless if they are in the beginning of a word, are being ignored.
So, for example the word Elephant is converted to the code E415, Computer to C513 etc.
As you may have noticed from the examples above, this algorithm is case-insensitive. Ex: soundex(‘computer’) and soundex(‘Computer’) will both return the same code.
Levenshtein distance is used to measure the similarity between two strings. The distance is the number of deletions, insertions, or substitutions required to transform one string into another string. The bigger the Levenshtein distance, the more different the strings are.
For example, the Levenshtein distance between “scotch” and “shot” is 3, since the following changes occur:
SELECT levenshtein(‘scotch’, ‘shot’);
- shot-> scot (substitution of h for c)
- scot -> scotc (insertion of c at the end)
- scotc -> scotch (insertion of h at the end)
In addition to this, please note that the Levenshtein function in Postgres is case-sensitive. For example:
SELECT levenshtein(‘Phone’, ‘phone’);
Make sure that when you are calculating the Levenshtein distance between two strings in Postgres your strings are either uppercase or lowercase.
The execution time of this algorithm grows proportionally to the product of the two string lengths that are being compared, so you will find that using it for longer words is impractical.
Metaphone and Double Metaphone
Metaphone is a phonetic algorithm based on Soundex. Just like Soundex, these algorithms generate keys based on the phonetic similarity of words.
Metaphone improves on the Soundex algorithm by using additional information about exceptions and anomalies in English spelling and pronunciation in order to produce a more accurate encoding, which in the end results in better matching of words that sound similar.
The Double Metaphone is the second generation of this algorithm. It makes a number of fundamental improvements over the original Metaphone algorithm. It is called Double because unlike the first Metaphone, this algorithm can return two codes (let’s call them primary and secondary). In most of the cases the primary and secondary codes are the same, but for some exceptional cases they can be different.
Based on the primary and secondary codes of two words there are three levels of matching:
- Primary Code = Primary Code Strong Match
- Primary Code = Secondary Code Normal Match
- Secondary Code = Secondary Code Minimal Match
The following examples demonstrate the usage of metaphone and double metaphone in Postgres. Note that metaphone as additional parameter takes the maximum length of the code that will be generated. If the code is longer than the specified length, it will be truncated.
SELECT metaphone(‘computer’, 10); SELECT dmetaphone(‘computer’);
=> KMPTR => KMPT
SELECT metaphone(‘elephant’, 10); SELECT dmetaphone(‘elephant’);
=> ELFNT => ALFN
Keep in mind that Soundex is not very useful for non-English names. PostgreSQL is really powerful for English string similarity problems. This blog covered functions as Levenshtein distance, metaphone and dmetaphone included in fuzzstrmatch, that are rarely found in other relational databases.
Aleksandar Gjorgievski and Aleksandar Mojachevski