Tuesday, November 15, 2022

AIML What is CountVectorizer and n-gram analysis

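CountVectorizer is a class in scikit-learn's sklearn.feature_extraction.text module. It converts a collection of text documents, such as web pages, into a matrix of token counts, which makes it useful for finding the most commonly occurring words or phrases in those documents.

Typical usage looks something like this: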

import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer

pd.set_option('display.max_columns', 10)

pd.set_option('display.max_rows', 10)

df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/gonutrition.csv')


# This works like the fit step of a machine learning model: we fit the vectoriser to the text we want to analyse, then transform that text into a matrix of counts

text = df['product_description']

model = CountVectorizer(ngram_range = (1, 1))

matrix = model.fit_transform(text).toarray()

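# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0

df_output = pd.DataFrame(data = matrix, columns = model.get_feature_names_out())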



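Here ngram_range is set to (1, 1), so the vectoriser returns only unigrams, or single words. The two values are the minimum and maximum n-gram length, so widening the range expands the vocabulary from single words to short phrases of the lengths you choose. For example, setting ngram_range to (2, 2) returns only bigrams (2-grams), or two-word phrases. The example below also passes stop_words='english', which removes common English words such as "the" and "and" before the phrases are built.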

text = df['product_description']

model = CountVectorizer(ngram_range = (2, 2), stop_words='english')

matrix = model.fit_transform(text).toarray()

df_output = pd.DataFrame(data = matrix, columns = model.get_feature_names_out())



