Tuesday, March 7, 2023

What does CountVectorizer do?

CountVectorizer is a tool provided by the scikit-learn library in Python. It transforms a given text into a vector based on the frequency (count) of each word that occurs in the text. This is helpful when we have several such texts and wish to convert each of them into a vector for use in further text analysis. Let us consider a document made up of a few sample texts, each stored as a list element.
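The sample texts themselves are not reproduced in this post. The sketch below uses a `document` list reconstructed from the counts in the table further down, so its exact wording and word order are an assumption. It uses only the standard library to illustrate the counting idea and is not how scikit-learn is implemented; the `count_words` helper is a hypothetical name, and its regular expression only roughly mirrors CountVectorizer's default behaviour (lowercasing, words of two or more characters).

```python
import re
from collections import Counter

# Assumed sample texts, reconstructed from the word counts in the table.
document = [
    "One Geek helps Two Geeks",
    "Two Geeks help Four Geeks",
    "Each Geek helps many other Geeks at GeeksforGeeks",
]


def count_words(text):
    """Lowercase the text, extract words of 2+ characters, and count them."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)


# One Counter per text sample: word -> number of occurrences in that sample.
counts = [count_words(text) for text in document]

print(counts[1]["geeks"])  # 2, since "Geeks" appears twice in the second text
```

Each `Counter` here plays the role of one row of the matrix described next; a word that does not occur in a text simply has a count of 0.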


CountVectorizer creates a matrix in which each unique word is represented by a column and each text sample from the document is a row. The value of each cell is the count of that word in that particular text sample. This can be visualized as follows:


             at  each  four  geek  geeks  geeksforgeeks  help  helps  many  one  other  two
document[0]   0     0     0     1      1              0     0      1     0    1      0    1
document[1]   0     0     1     0      2              0     1      0     0    0      0    1
document[2]   1     1     0     1      1              1     0      1     1    0      1    0


https://www.geeksforgeeks.org/using-countvectorizer-to-extracting-features-from-text/
