serialisation and topWords info#127

Open
cbadenes wants to merge 31 commits into mimno:master from cbadenes:master

Conversation

@cbadenes

Minor changes to the serialisation process, plus a new method that returns the top words of each topic along with their weights.

@mimno
Owner

mimno commented Apr 25, 2018

Thank you for all of this! Some comments:

Could you say more about the Lexer -> Pattern shift in CharSequence2TokenSequence?

It looks like the validateTopics function is adding stopwords during training? Is there a reference for this? I'm reluctant to make something available without fully understanding when users should and shouldn't use it.

I'm planning to release the HPPC version as 2.1, and I'd like to see this as part of it.

@cbadenes
Author

Hi David,

To make the CharSequence2TokenSequence class thread-safe when a pipe is built, a new CharSequenceLexer must be created for each instance carried through the pipe. For that reason, the regex pattern should be the only class attribute of the CharSequence2TokenSequence object.
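
A minimal sketch of that idea, using only `java.util.regex` rather than MALLET's own classes: the `PatternTokenizer` class and its `tokenize` method are hypothetical names chosen for illustration, not the code in this pull request. It relies on the fact that a compiled `Pattern` is immutable and safe to share between threads, while a `Matcher` is not, so the matcher is created per call instead of being kept as a field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a tokenizing pipe; not MALLET's actual class.
final class PatternTokenizer {
    // The only field: a compiled Pattern, which is immutable and thread-safe.
    private final Pattern pattern;

    PatternTokenizer(Pattern pattern) {
        this.pattern = pattern;
    }

    // Each call builds its own Matcher, so concurrent callers share no mutable state.
    List<String> tokenize(CharSequence text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }
}
```

With this layout a single `PatternTokenizer` can be shared by several worker threads, since the stateful part of the matching lives only on each caller's stack.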

As for the validateTopics function, the idea is to build a list of stopwords iteratively from the words that appear as top words in multiple topics. This is similar to applying TF-IDF to topics instead of documents.
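
One round of that selection could be sketched roughly as follows. This is an illustration under assumptions, not the validateTopics code from this pull request: the `SharedTopWords` class, the `findShared` method, and the `minTopics` threshold are all hypothetical. Given the top words of every topic, it returns the words that show up in at least `minTopics` topics; an iterative scheme would add those words to the stoplist, retrain, and repeat.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper; names and threshold are assumptions for illustration.
final class SharedTopWords {
    // Returns the words that are top words in at least minTopics topics.
    static Set<String> findShared(List<List<String>> topWordsPerTopic, int minTopics) {
        // Number of topics in which each word appears (a "topic frequency").
        Map<String, Integer> topicCount = new HashMap<>();
        for (List<String> topWords : topWordsPerTopic) {
            // De-duplicate so a word is counted at most once per topic.
            for (String word : new HashSet<>(topWords)) {
                topicCount.merge(word, 1, Integer::sum);
            }
        }
        // Sorted set so the result has a deterministic order.
        Set<String> shared = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : topicCount.entrySet()) {
            if (entry.getValue() >= minTopics) {
                shared.add(entry.getKey());
            }
        }
        return shared;
    }
}
```

The per-word topic count plays the role that document frequency plays in TF-IDF: a word that ranks highly in many topics does little to tell those topics apart.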

I hope this helps.

2 participants