Revise coherence calculation #15

@jeroarenas

Currently, calculating coherence requires creating a corpus.txt file that contains the full training corpus. Note that I did this even for the Spark LDA, although in principle Spark does not need the training corpus in this format (we use BoW in Parquet format). For very large training corpora this is very inefficient, since the corpus.txt file, which is only needed for the coherence calculation, can take up several gigabytes.

We need to:
1/ Revise the TMmodel generation process so that it does not crash if the coherence calculation fails (the same could be done for pyLDAvis, by the way).
2/ Revise what is actually needed for the coherence calculation. Is there an alternative to using the original corpus? Could we have one very large corpus per dataset that is used for everything? Could we use a smaller corpus obtained by sampling the original corpus?
3/ Check whether corpus.txt is still needed after model training. If it is needed only for the coherence calculation, could we delete the file afterwards?
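
As a starting point for discussion, the three points above could be sketched roughly as follows. This is only a minimal illustration, not the project's code: `sample_lines`, `safe_coherence`, the `compute` callback, `sample_size` and `delete_corpus` are all hypothetical names, and it assumes corpus.txt holds one document per line. Whether coherence computed on a sample is close enough to the full-corpus value is exactly the open question in point 2.

```python
import logging
import os
import random
from typing import Callable, Iterable, List, Optional

logger = logging.getLogger(__name__)


def sample_lines(lines: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Reservoir-sample up to k items in one pass (never holds the full corpus)."""
    rng = random.Random(seed)
    reservoir: List[str] = []
    for i, line in enumerate(lines):
        if i < k:
            reservoir.append(line)
        else:
            j = rng.randint(0, i)  # randint bounds are inclusive
            if j < k:
                reservoir[j] = line
    return reservoir


def safe_coherence(
    compute: Callable[[List[str]], List[float]],  # hypothetical coherence function
    corpus_path: str,
    sample_size: int = 10000,
    delete_corpus: bool = False,
) -> Optional[List[float]]:
    """Run `compute` on a sample of the corpus; return None instead of raising."""
    try:
        with open(corpus_path, encoding="utf-8") as fin:
            docs = sample_lines((ln.rstrip("\n") for ln in fin), sample_size)
        return compute(docs)
    except Exception:
        # Point 1: a failed coherence calculation should not abort model creation.
        logger.exception("Coherence calculation failed; continuing without it")
        return None
    finally:
        # Point 3: optionally drop the (possibly multi-gigabyte) corpus file.
        if delete_corpus and os.path.exists(corpus_path):
            os.remove(corpus_path)
```

A caller would pass whatever coherence routine is already in use as `compute` and treat a `None` result as "coherence not available", so that the rest of the TMmodel can still be built.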

Metadata

Labels

Design Aspects: Imply rethinking of the structure of the application
High Priority: Issues that need to be prioritized for next release
enhancement / efficiency: New feature or request
help wanted: Extra attention is needed
