Tuesday, November 15, 2022

AI/ML: What is CountVectorizer and n-gram analysis?


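CountVectorizer is a class in the scikit-learn package (sklearn.feature_extraction.text). It converts a collection of text documents into a matrix of token counts, and is mainly used for finding commonly occurring words or phrases in a given set of documents such as web pages.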


Typical usage looks something like the example below.


import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer


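# Use the full option names; the short forms are ambiguous in recent pandas versions
pd.set_option('display.max_columns', 10)

pd.set_option('display.max_rows', 10)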


df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/gonutrition.csv')

df.head()


# This works like the fit mechanism used by machine learning models: we fit the vectoriser to the data we want to analyse, so it learns the vocabulary and then counts each term per document.


text = df['product_description']

model = CountVectorizer(ngram_range = (1, 1))

matrix = model.fit_transform(text).toarray()

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df_output = pd.DataFrame(data = matrix, columns = model.get_feature_names_out())

df_output.T.tail(5)


df_output.shape


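Above, we set the ngram_range of the CountVectorizer to (1, 1) to return unigrams, or single words. The two values are the minimum and maximum n-gram length, so increasing the ngram_range expands the vocabulary from single words to short phrases of your desired lengths. For example, setting the ngram_range to (2, 2) returns only bigrams (2-grams), or two-word phrases, while (1, 2) would return both unigrams and bigrams. The example below also passes stop_words='english', which removes common English words such as "the" and "and" before the phrases are built.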


text = df['product_description']

model = CountVectorizer(ngram_range = (2, 2), stop_words='english')

matrix = model.fit_transform(text).toarray()

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df_output = pd.DataFrame(data = matrix, columns = model.get_feature_names_out())

df_output.T.tail(5)


References:

https://practicaldatascience.co.uk/machine-learning/how-to-use-count-vectorization-for-n-gram-analysis

