Tuesday, November 15, 2022

AI/ML: What Is a Document-Term Matrix?

Text data can be represented as a matrix, called a document-term matrix. Each row of the matrix represents a sentence (a document) from the data to be analyzed, and each column represents a word. Each cell of the matrix holds the number of times that word occurs in that sentence. Let’s understand it with an example.


import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer

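# Example sentences (illustrative placeholders; substitute your own text)

sentence1 = "the cat sat on the mat"

sentence2 = "the dog sat on the log"

sentence3 = "the cat chased the dog"

docs = [sentence1, sentence2, sentence3]

print(docs)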


vec = CountVectorizer()

X = vec.fit_transform(docs)


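# The sparse count matrix can now be converted to a DataFrame and printed.

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.

df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())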

df.head()
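To make the row, column, and cell layout concrete, here is a minimal sketch that builds a document-term matrix by hand using only the Python standard library. The sentences and the `build_dtm` helper are hypothetical and chosen for illustration. The tokenizer simply lowercases and splits on whitespace, so its results can differ from CountVectorizer, which by default also strips punctuation and ignores single-character tokens.

```python
from collections import Counter

# Hypothetical example sentences, chosen only for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]


def build_dtm(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts word j in document i."""
    # Deliberately simple tokenizer: lowercase, then split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    # Columns: every distinct word across all documents, sorted alphabetically.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One row per document, one count per vocabulary word (0 if absent).
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


vocabulary, matrix = build_dtm(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each printed row corresponds to one sentence and each position in the row to one word of the vocabulary. For example, the word "the" appears twice in every sentence, so the last column is 2 in all three rows.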




References:

https://analyticsindiamag.com/a-guide-to-term-document-matrix-with-its-implementation-in-r-and-python/
