Topic Model with an Infinite Vocabulary
Abstract
Introduction: With the continuous growth of the internet and the increasing volume of news, email messages, blog posts, etc., natural language processing systems are in high demand. Developing topic model algorithms is a popular and promising direction in machine learning and natural language processing. Most topic modeling methods deal with static information and a limited vocabulary. In practice, however, we need tools that work with an extensible vocabulary. New words appear every year and some words become obsolete, so extensible vocabularies are especially important for online topic models.
Purpose: We develop an approach that determines the topical vector of a new word as the Hadamard product of the topical vectors of the documents in which the word occurs. This approach is an alternative to using the Dirichlet distribution or the Dirichlet process.
Results: The research has shown that summing the topical vectors of the documents containing a new word gives a misleading picture of the word's topic. The Hadamard product is a better way to derive the topic of a new word from the topics of its documents. Multiplying the documents' topical vectors entrywise cancels topics that do not overlap and isolates the common topics with similar meanings. The resulting topical vector for the new word has its highest probability values on the few most important topics, while the values of weakly expressed topics either approach zero or are reset to zero.
Practical relevance: The proposed algorithm allows the online vocabulary of a topic model to be expanded without limit, so that the model can handle both new and old words.
Published
2016-12-19
How to Cite
Karpovich, S. (2016). Topic Model with an Infinite Vocabulary. Information and Control Systems, (6), 43-49. https://doi.org/10.15217/issn1684-8853.2016.6.43
Issue
Section
System and process modeling
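
The contrast drawn in the abstract between summing and entrywise multiplying the topical vectors can be illustrated with a minimal sketch. It is not the author's implementation: the function names, the renormalization to a probability vector, the `eps` cutoff for weak topics, and the toy document vectors are all assumptions made for illustration.

```python
import numpy as np


def normalize(v):
    """Scale a non-negative vector to sum to 1 (left unchanged if all zeros)."""
    total = v.sum()
    return v / total if total > 0 else v


def new_word_topics(doc_topic_vectors, eps=0.0):
    """Hadamard (entrywise) product of the document topical vectors.

    eps is a hypothetical cutoff: product entries below it are reset to zero
    before renormalizing.
    """
    docs = np.asarray(doc_topic_vectors, dtype=float)
    product = np.prod(docs, axis=0)   # entrywise product across documents
    product[product < eps] = 0.0      # drop weakly expressed topics
    return normalize(product)


def sum_topics(doc_topic_vectors):
    """Baseline for comparison: normalized sum of the document topical vectors."""
    docs = np.asarray(doc_topic_vectors, dtype=float)
    return normalize(docs.sum(axis=0))


# Toy data (assumed): three documents containing the new word, four topics.
docs = [
    [0.70, 0.25, 0.05, 0.00],
    [0.60, 0.00, 0.10, 0.30],
    [0.65, 0.05, 0.30, 0.00],
]

by_product = new_word_topics(docs)
by_sum = sum_topics(docs)

print("product:", np.round(by_product, 4))
print("sum:    ", np.round(by_sum, 4))
```

In this toy case only the first topic is strong in every document, so the product concentrates nearly all of the mass on it, and any topic missing from at least one document drops to exactly zero. The sum keeps non-zero weight on every topic that appears in any document.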