Revise coherence calculation

Currently, coherence calculation requires creating a corpus.txt file containing the full training corpus. Note that I did this even for Spark LDA, although Spark does not in principle need the training corpus in this format (we use BoW in Parquet format). For a very large training corpus this is very inefficient, since the corpus.txt file, which is needed only for coherence calculation, can take up several gigabytes.

We need to:
1/ Revise the TMmodel generation process so that it does not crash if coherence calculation fails (the same could be done for pyLDAvis, by the way).
2/ Revise what is actually needed for coherence calculation. Is there an alternative to using the original corpus? Could we have one very large reference corpus per dataset that is used for everything? Could we use a smaller corpus sampled from the original one?
3/ Check whether corpus.txt is still needed after model training. If it is not (or if it is needed only for coherence calculation), could we delete the file?
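
A rough sketch of how points 1/ and 2/ could look, not a proposal for the final implementation. The names `try_compute` and `sample_lines` are hypothetical and not part of the current code base. The first helper wraps an optional step (coherence, pyLDAvis) so that a failure is logged instead of aborting TMmodel generation. The second draws a fixed-size uniform sample of documents in one pass (reservoir sampling), so the sample can be taken from any iterable of documents without holding the full corpus in memory. Whether coherence computed on a sample is close enough to coherence on the full corpus is an open question that would need to be checked.

```python
import logging
import random
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def try_compute(name: str, fn: Callable[[], T]) -> Optional[T]:
    """Run an optional step; on any error, log it and return None.

    Hypothetical helper: the caller decides what a missing value means
    (e.g. store the model without coherence scores).
    """
    try:
        return fn()
    except Exception:
        logger.exception("Optional step '%s' failed; continuing without it", name)
        return None


def sample_lines(lines: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Uniformly sample up to k items from an iterable in a single pass.

    Reservoir sampling (Algorithm R): memory use is O(k) regardless of
    how many documents the iterable yields.
    """
    rng = random.Random(seed)
    reservoir: List[str] = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(line)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both ends inclusive
            if j < k:
                reservoir[j] = line
    return reservoir
```

For example, `sample_lines(open("corpus.txt", encoding="utf-8"), 100_000)` would give a reduced corpus for coherence, and wrapping the coherence call as `try_compute("coherence", lambda: ...)` would let model generation continue when that call raises.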

Revise coherence calculation #15
