CountVectorizer is a class from the scikit-learn library (sklearn.feature_extraction.text). It is mainly used for analysing commonly occurring words or phrases in a set of documents, such as web pages.
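Before looking at a real dataset, the idea can be sketched on a made-up toy corpus (the example strings below are only for illustration): each row of the resulting matrix is a document, and each column counts how often a vocabulary term appears in that document.
from sklearn.feature_extraction.text import CountVectorizer
# Toy corpus of three short "documents" (made-up strings for illustration)
corpus = ["whey protein powder", "whey protein isolate", "creatine powder"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)   # sparse document-term matrix
print(vectorizer.get_feature_names_out())   # ['creatine' 'isolate' 'powder' 'protein' 'whey']
print(counts.toarray())                     # one row per document, one column per term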
The usage on a real dataset looks something like the below.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Limit how much of the dataframe is printed
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 10)

# Load the example product dataset
df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/gonutrition.csv')
df.head()
# This works like the usual machine learning fit mechanism: we fit the vectoriser to the text we want to analyse
text = df['product_description']

# ngram_range=(1, 1) returns unigrams (single words)
model = CountVectorizer(ngram_range=(1, 1))
matrix = model.fit_transform(text).toarray()

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df_output = pd.DataFrame(data=matrix, columns=model.get_feature_names_out())
df_output.T.tail(5)
df_output.shape
We set ngram_range to (1, 1) to return unigrams, or single words, so the output has one row per document and one column per unique word in the vocabulary. Increasing the ngram_range expands the vocabulary from single words to short phrases of your desired length. For example, setting ngram_range to (2, 2) returns bigrams (2-grams), or two-word phrases.
text = df['product_description']

# ngram_range=(2, 2) returns bigrams; stop_words='english' drops common English words
model = CountVectorizer(ngram_range=(2, 2), stop_words='english')
matrix = model.fit_transform(text).toarray()
df_output = pd.DataFrame(data=matrix, columns=model.get_feature_names_out())
df_output.T.tail(5)
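Since the goal is to find the most commonly occurring phrases, a natural next step is to sum the counts in each column and sort them. This is a minimal sketch assuming df_output was built from the bigram vectoriser above; the variable name is only illustrative.
# Sum the counts of each bigram across all documents and show the most frequent ones
term_frequencies = df_output.sum().sort_values(ascending=False)
term_frequencies.head(10)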
References:
https://practicaldatascience.co.uk/machine-learning/how-to-use-count-vectorization-for-n-gram-analysis