Reducing a word to its base form through stemming or lemmatization is part of a technique called canonicalisation. Stemming tries to reduce a word to its root form, while lemmatization tries to reduce it to its lemma. Both the root and the lemma are base forms of the inflected word; only the method used to obtain them differs.
Some cases can be handled by neither stemming nor lemmatization. They need another preprocessing step before the words can be stemmed or lemmatized effectively.
For example, suppose the corpus contains two misspelled versions of the word ‘disappearing’: ‘dissappearing’ and ‘dissapearing’. After you stem these words, you’ll have two different stems, ‘dissappear’ and ‘dissapear’, so you still have the problem of redundant tokens. Lemmatization, on the other hand, won’t work on these two words at all: it returns them unchanged, because it relies on correct dictionary spellings.
To deal with different spellings that arise from different pronunciations, you can use phonetic hashing, which canonicalises different versions of the same word to a single base representation.
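Phonetic hashing is commonly done using the Soundex algorithm, which encodes a word by how it sounds rather than how it is spelled. In principle it doesn’t matter which language the input word comes from: as long as the words sound similar under Soundex’s rules, they get the same hash code. Keep in mind that those rules were designed around English pronunciation.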
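To make the idea concrete, here is a minimal sketch of the classic American Soundex rules in plain Python. The function name `soundex` and the sample word list are illustrative choices, not taken from any particular library, and real implementations may differ in edge cases. The code keeps the first letter, maps the remaining consonants to digits, collapses repeated codes, and pads the result to four characters.

```python
def soundex(word: str) -> str:
    """Return a 4-character American Soundex code (illustrative sketch)."""
    # Consonant groups that share a digit; vowels, y, h and w have no code.
    groups = (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
              ("l", "4"), ("mn", "5"), ("r", "6"))
    codes = {ch: digit for letters, digit in groups for ch in letters}

    # Keep only ASCII letters, lowercased.
    letters = [ch for ch in word.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""

    result = letters[0].upper()
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of identical codes
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
        if len(result) == 4:
            break
    return result.ljust(4, "0")  # pad short codes with zeros


# A few spellings of the same word, correct and misspelled.
for w in ["disappearing", "dissappearing", "dissapearing", "dissappearng"]:
    print(w, soundex(w))  # each line ends with D216
```

Because all of these spellings collapse to the same code, they can be treated as one token before any stemming or lemmatization is applied.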
References:
https://amitg0161.medium.com/phonetic-hashing-and-soundex-in-python-60d4ca7a2843