Topic modelling is usually done through LDA (Latent Dirichlet Allocation). It identifies the topics that describe a document or a set of documents.
The word "latent" is used because the topics are hidden: they are not labelled in advance and only emerge during the modelling process. Topic modelling is therefore an unsupervised task.
This is mainly done by identifying patterns in which words occur together and how frequently each word appears in the documents.
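As a rough illustration of the word frequencies mentioned above, the following sketch counts words per document using only the Python standard library. The two example sentences and the variable names are made up for illustration, and real pipelines usually add steps such as lowercasing and stop-word removal.

```python
from collections import Counter

# Two tiny example documents (hypothetical data, for illustration only)
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# One word-frequency table per document, splitting on whitespace
counts = [Counter(doc.split()) for doc in docs]

# Words that appear in both documents hint at a shared theme
shared_words = set(counts[0]) & set(counts[1])

print(counts[0].most_common(3))
print(sorted(shared_words))
```

Counts like these are the raw input a topic model works from.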
LDA short summary (Latent Dirichlet Allocation)
Dirichlet is a form of probability distribution that differs from the Normal distribution. Many ML algorithms assume the data is normally distributed, which means working with values drawn from the real numbers. A sample from a Dirichlet distribution is instead a vector of non-negative values that sum to 1. In other words, the Dirichlet distribution samples over the probability simplex rather than over the space of real numbers, as the Normal distribution does.
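The difference can be seen in a small sketch. It assumes the standard construction of a Dirichlet sample from independent Gamma draws that are then normalized. The helper name `sample_dirichlet` and the concentration values are chosen here for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma(a, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)

# A Normal sample can be any real number, including negative values
normal_sample = rng.gauss(0.0, 1.0)

# A Dirichlet sample is a vector of proportions that always sums to 1
dirichlet_sample = sample_dirichlet([0.5, 0.5, 0.5], rng)

print(normal_sample)
print(dirichlet_sample, sum(dirichlet_sample))
```

Because every sample is a set of proportions, the Dirichlet distribution is a natural fit for describing how much of each topic a document contains.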
LDA uses the Dirichlet distribution to assign words to topics. Each topic is represented as a distribution over words, and each document as a distribution over topics.
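To make this concrete, here is a minimal sketch of LDA fitted with collapsed Gibbs sampling, written with the standard library only. Gibbs sampling is one common way to fit LDA and is an assumption here, since the post does not say which inference method is used. The function name `lda_gibbs`, the toy documents, and the values of `alpha` and `beta` are likewise illustrative. In practice a library implementation would normally be used instead.

```python
import random

def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA on tokenized docs; return (vocab, doc-topic, topic-word) estimates."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]            # topic counts per document
    topic_word = [[0] * n_words for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics                           # total words per topic

    # Start from a random topic assignment for every word occurrence
    assignments = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the current assignment from the counts
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Resample the topic in proportion to how well it fits
                # both this document and this word
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed estimates: each row is a probability distribution
    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + n_topics * alpha)
         for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[t][v] + beta) / (topic_total[t] + n_words * beta)
         for v in range(n_words)]
        for t in range(n_topics)
    ]
    return vocab, theta, phi

# Toy corpus (hypothetical data): two documents on pets, two on finance
docs = [
    "cat dog pet cat vet".split(),
    "dog pet vet dog cat".split(),
    "stock market price stock trade".split(),
    "market trade price market stock".split(),
]
vocab, theta, phi = lda_gibbs(docs, n_topics=2)

for t, row in enumerate(phi):
    top = sorted(range(len(vocab)), key=lambda v: row[v], reverse=True)[:3]
    print("topic", t, [vocab[v] for v in top])
```

Each row of `theta` is a document's mix of topics and each row of `phi` is a topic's mix of words, so both sum to 1, matching the Dirichlet behaviour described above.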
References:
https://www.analyticsvidhya.com/blog/2021/05/topic-modelling-in-natural-language-processing/