Saturday, October 8, 2022

AI/ML Porter Stemmer and Snowball Stemmer

Snowball Stemmer: A stemming algorithm also known as the Porter2 stemming algorithm. It is an improved version of the Porter Stemmer that fixes some of the original algorithm's issues.


Stemming: The process of reducing a word to its word stem by stripping affixes such as suffixes and prefixes. The stem does not have to be a valid dictionary word, which is what separates stemming from lemmatization, where a word is reduced to its lemma. In simple words, stemming reduces a word to its base word or stem in such a way that words of a similar kind lie under a common stem. For example, the words care, cared and caring lie under the same stem ‘care’. Stemming is important in natural language processing (NLP).


A few common rules of Snowball stemming are:

ILY  -----> ILI
LY   -----> Nil
SS   -----> SS
S    -----> Nil
ED   -----> E or Nil

Nil means the suffix is simply removed and replaced with nothing.

These rules can vary depending on the word. Take the suffix ‘ed’: the words ‘cared’ and ‘bumped’ are stemmed to ‘care’ and ‘bump’. In ‘cared’, the suffix effectively removed is only ‘d’ and not ‘ed’, because the final ‘e’ belongs to the stem. Another interesting case is the word ‘stemmed’, which is stemmed to ‘stem’ and not ‘stemm’: the doubled consonant left behind after removing ‘ed’ is reduced as well. The suffix that is stripped therefore depends on the word.
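The sketch below applies the suffix rules listed above in the most literal way, to show why the real stemmer needs the word-dependent handling just described. It is a simplified illustration and not the Snowball algorithm itself: the name naive_stem and the RULES list are made up for this example, and the extra checks that the real stemmer performs are deliberately left out.

```python
# Simplified suffix rules taken from the list above, checked in order.
# This is NOT the full Snowball algorithm: it skips the extra conditions
# that restore a final 'e' or reduce a doubled consonant.
RULES = [
    ("ily", "ili"),
    ("ly", ""),
    ("ss", "ss"),   # must come before the plain "s" rule
    ("s", ""),
    ("ed", ""),
]

def naive_stem(word):
    """Apply the first matching suffix rule; return the word unchanged if none match."""
    word = word.lower()
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["easily", "fairly", "grass", "sings", "bumped", "cared", "stemmed"]:
    print(w + " ----> " + naive_stem(w))
```

With these literal rules, ‘cared’ comes out as ‘car’ and ‘stemmed’ as ‘stemm’, while the Snowball stemmer gives ‘care’ and ‘stem’. That gap is exactly the word-dependent behaviour described above.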


Word           Stem
cared          care
university     univers
fairly         fair
easily         easili
singing        sing
sings          sing
sung           sung
singer         singer
sportingly     sport


# stem a list of words with NLTK's Snowball stemmer
from nltk.stem.snowball import SnowballStemmer

# the stemmer requires a language parameter
snow_stemmer = SnowballStemmer(language='english')

# list of tokenized words
words = ['cared', 'university', 'fairly', 'easily', 'singing',
         'sings', 'sung', 'singer', 'sportingly']

# stems of each word
stem_words = []
for w in words:
    x = snow_stemmer.stem(w)
    stem_words.append(x)

# print stemming results
for e1, e2 in zip(words, stem_words):
    print(e1 + ' ----> ' + e2)


Difference Between Porter Stemmer and Snowball Stemmer:


Snowball Stemmer is generally considered slightly more aggressive than Porter Stemmer.

Some issues in Porter Stemmer were fixed in Snowball Stemmer.

The two algorithms work in largely the same way and differ only in a few rules.

Words like ‘fairly’ and ‘sportingly’ are stemmed to ‘fair’ and ‘sport’ by the Snowball stemmer, but the Porter stemmer stems them to ‘fairli’ and ‘sportingli’.

The difference between the two algorithms can be clearly seen in the way each one stems the word ‘sportingly’. The Snowball Stemmer reduces it to the more natural stem.
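To reproduce this comparison, the two stemmers can be run side by side. This is a minimal sketch that assumes NLTK is installed; PorterStemmer and SnowballStemmer are NLTK classes, while compare_stems is a helper name invented for this example.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer(language='english')

def compare_stems(words):
    """Return (word, porter_stem, snowball_stem) tuples, one per input word."""
    return [(w, porter.stem(w), snowball.stem(w)) for w in words]

# print both stems for the words discussed above
for word, p, s in compare_stems(['fairly', 'sportingly']):
    print(word + ' ----> porter: ' + p + ', snowball: ' + s)
```

Running both stemmers over the same word list is a quick way to see where they disagree before picking one for a project.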




References:

https://www.geeksforgeeks.org/snowball-stemmer-nlp

