TopicSignificanceRank

Goal

In order to determine the number of topics to use when applying LDA to a matrix of telemetry events from a video game, I searched the literature for an NLP-independent (and human-work-independent) method of evaluating the topics of an LDA model. Since my data consists of in-game actions (1,700 different actions in Minecraft: Education Edition, such as breaking a wooden block, opening CodeConnection, or making something), there are no word semantics to run gensim cosine similarities on, so I can't use coherence. Log-likelihoods and perplexity seemed to just run off (see this example; even had they gone with fewer topics, they may not have hit a maximum). A statistical definition of 'junk' topics makes sense, since I can run statistics, so I went with the method described here (the paper linked under About below).
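The kind of sweep that shows the run-off behaviour looks roughly like the following. This is a minimal sketch, assuming scikit-learn's LatentDirichletAllocation and a placeholder count matrix standing in for the real telemetry data; it is not the notebook's actual code.

```python
# Minimal sketch: sweep the topic count and watch log-likelihood / perplexity.
# `X` is a placeholder for the real (documents x events) count matrix.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

X = np.random.randint(0, 5, size=(200, 50))  # placeholder counts, not real data

for n_topics in (5, 10, 20, 40, 80):
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(X)
    # Both metrics tend to keep improving as n_topics grows, so neither
    # gives a clear elbow or maximum to pick the "right" number of topics.
    print(n_topics, lda.score(X), lda.perplexity(X))
```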

Writing the Code

  • Opened up a git repo and started a Jupyter Notebook.
  • Worked through Section 3 of the paper and added its distributions.
  • Section 4 led to the U, V, and B variables, as well as S1, S2, S, and Psi.
  • Psi was the biggest challenge. The paper was vague, especially with regard to the weights in the Psi equation. I ended up setting them all to 1/3 to avoid over-filtering based on the number of topics (since this is intended to be a way to determine the optimal number of topics); see the sketch after this list.
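The distances-to-junk-topics and the equal-weight Psi combination look roughly like the sketch below. It assumes a row-normalized topic-word matrix and document-topic matrix from a fitted LDA model; the names U, Vac, and B mirror the U, V, and B variables above, but the specific choices here (KL divergences, standardizing each distance before combining) are one reading of Section 4, not a verified copy of the notebook, and S1, S2, and S are left out.

```python
# Rough sketch of the junk-topic distances and the equal-weight combination.
# `topic_word`: K x V matrix, each row a topic's distribution over words.
# `doc_topic`:  D x K matrix, each row a document's distribution over topics.
# `term_counts`: length-V vector of corpus-wide word (event) counts.
# Names and formulas are an interpretation of the paper, not a verified port.
import numpy as np
from scipy.stats import entropy  # entropy(p, q) gives the KL divergence D(p || q)

def junk_distances(topic_word, doc_topic, term_counts):
    K, V = topic_word.shape
    D = doc_topic.shape[0]

    U = np.full(V, 1.0 / V)                    # W-Uniform: uniform over the vocabulary
    Vac = term_counts / term_counts.sum()      # W-Vacuous: empirical word distribution of the corpus
    B = np.full(D, 1.0 / D)                    # D-BGround: a topic spread evenly over all documents

    # Per-topic distribution over documents (each column sums to 1).
    theta = doc_topic / doc_topic.sum(axis=0)

    d_U = np.array([entropy(topic_word[k], U) for k in range(K)])
    d_V = np.array([entropy(topic_word[k], Vac) for k in range(K)])
    d_B = np.array([entropy(theta[:, k], B) for k in range(K)])
    return d_U, d_V, d_B

def psi(d_U, d_V, d_B, weights=(1/3, 1/3, 1/3)):
    # Standardize each distance so no single criterion dominates, then take
    # the equal-weight (1/3, 1/3, 1/3) combination described above.
    def z(x):
        return (x - x.mean()) / (x.std() + 1e-12)
    return weights[0] * z(d_U) + weights[1] * z(d_V) + weights[2] * z(d_B)
```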

Testing the Code

  • First, check whether it works at all (word frequency is not the best thing to look at, which makes sense).
  • Next, compare log-likelihood, perplexity, Psi, S, and TSR for a model (a sketch of that comparison follows this list).
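The comparison loop looks roughly like the following. It assumes a document-term count matrix X and the junk_distances/psi helpers sketched above; the per-model summary mean_psi is only an illustrative stand-in for whatever aggregation the notebook actually uses.

```python
# Sketch: fit models at several topic counts and collect the metrics side by side.
# Assumes `junk_distances` and `psi` from the sketch above are in scope.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def compare_models(X, topic_counts=(5, 10, 20, 40)):
    term_counts = np.asarray(X.sum(axis=0)).ravel()
    rows = []
    for k in topic_counts:
        lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
        # Normalize components_ so each row is a proper topic-word distribution.
        topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
        doc_topic = lda.transform(X)
        d_U, d_V, d_B = junk_distances(topic_word, doc_topic, term_counts)
        rows.append({
            "n_topics": k,
            "log_likelihood": lda.score(X),
            "perplexity": lda.perplexity(X),
            "mean_psi": psi(d_U, d_V, d_B).mean(),  # illustrative per-model summary
        })
    return rows
```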

About

An attempt to implement the following paper in Python: https://mimno.infosci.cornell.edu/info6150/readings/ECML09_AlSumaitetal.pdf
