The text data is represented as a matrix. Each row of the matrix represents a sentence from the data to be analyzed, and each column represents a word. The values in the cells of the matrix are the number of times each word occurs in each sentence. Let’s understand it with an example.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
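# Illustrative placeholder sentences; replace them with the text you want to analyze
sentence1 = "the cat sat on the mat"
sentence2 = "the dog sat on the log"
sentence3 = "cats and dogs"
docs = [sentence1, sentence2, sentence3]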
print(docs)
vec = CountVectorizer()
X = vec.fit_transform(docs)
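# The sparse count matrix can now be converted to a DataFrame and displayed
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())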
df.head()
[Image: an example view from another workspace]
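To make the row and column structure concrete, the same kind of matrix can be built by hand with only the standard library. This is a minimal sketch, not what CountVectorizer does internally: the sentences are made up for illustration, and the tokenizer here simply lowercases and splits on whitespace, whereas CountVectorizer applies its own default tokenization rules.

```python
from collections import Counter

# Hypothetical example sentences (not taken from the original post)
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Naive tokenization: lowercase each sentence and split on whitespace
tokenized = [doc.lower().split() for doc in docs]

# The vocabulary is the sorted set of all words; each word becomes one column
vocab = sorted({word for tokens in tokenized for word in tokens})

# One row per sentence; each cell holds the count of that word in the sentence
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Each printed row corresponds to one sentence and each position in the row corresponds to one word of the vocabulary, which is the layout the DataFrame above displays with the words as column labels.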
References:
https://analyticsindiamag.com/a-guide-to-term-document-matrix-with-its-implementation-in-r-and-python/