Iterate over input data, don't load into memory by jfelectron · Pull Request #170 · mimno/Mallet

jfelectron · 2019-07-26T21:59:30Z

The current implementation loads the entire input file into memory, which leads to memory growth and eventual exhaustion on large data sets. This is a proof of concept (POC) for handling out-of-core data sets.
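To illustrate the difference between loading everything and iterating, here is a minimal sketch of line-by-line reading. It is not the code in this PR: the class name `StreamingLoad` and the method `forEachLine` are hypothetical, and the handler stands in for whatever converts a line into an instance and writes it out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.function.Consumer;

// Hypothetical illustration, not part of Mallet.
public class StreamingLoad {

    /**
     * Reads the input one line at a time and passes each line to the handler.
     * Only the current line (plus the reader's buffer) is held in memory.
     * Returns the number of lines processed.
     */
    public static long forEachLine(Path input, Consumer<String> handler) throws IOException {
        long count = 0;
        // try-with-resources closes the reader even if the handler throws
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                handler.accept(line);
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: StreamingLoad <input-file>");
            return;
        }
        // Placeholder handler: a real one would build an instance and write it out
        long n = forEachLine(Paths.get(args[0]), line -> { });
        System.out.println("processed " + n + " lines");
    }
}
```

Memory use stays roughly constant as long as the handler does not accumulate the lines it receives.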

Notes:

I'm not a Java dev; this works but can likely be improved. The PR is mainly for visibility of the issue, which I spent a week or more on, on and off.
This does not solve the issue of LDA model training trying to load all the data into memory; if possible, we should find a way to make that iterable as well.

Further improvements:

Make use of threads to speed this up; using one thread for 100M-plus instances takes quite a while.

Jonathan Foley added 5 commits July 19, 2019 11:30

use buffered reading of input file during data load

efb8555

handle missing file

e241fbe

don't keep references to written output objects

289e09a

reset OutputStream write to free resources

4c53ed3

log progress

462c700
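The commits "don't keep references to written output objects" and "reset OutputStream write to free resources" can be illustrated with a short sketch. It assumes the output is written through `java.io.ObjectOutputStream`, which remembers every object it has written until `reset()` is called. The class `ResettingWriter`, the method `writeAll`, and the `resetEvery` parameter are hypothetical and are not taken from the PR's diff.

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.util.Iterator;

// Hypothetical illustration, not part of Mallet.
public class ResettingWriter {

    /**
     * Serializes items one at a time to the sink. ObjectOutputStream keeps a
     * handle table referencing every object it has written, so without reset()
     * those objects cannot be garbage collected while the stream is open.
     * Returns the number of objects written.
     */
    public static long writeAll(Iterator<? extends Serializable> items,
                                OutputStream sink,
                                int resetEvery) throws IOException {
        long written = 0;
        try (ObjectOutputStream out = new ObjectOutputStream(sink)) {
            while (items.hasNext()) {
                out.writeObject(items.next());
                written++;
                if (written % resetEvery == 0) {
                    // Discard the handle table so written objects can be collected
                    out.reset();
                }
            }
        }
        return written;
    }
}
```

Resetting after every object minimizes retained memory, while a larger `resetEvery` reduces the overhead of re-emitting class descriptors after each reset. A stream written this way can still be read back with a plain `ObjectInputStream`.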

jfelectron mentioned this pull request Jul 26, 2019

OutOfMemoryError while computing LDA model for large .mallet file #165

Open

jfelectron wants to merge 5 commits into mimno:master from jfelectron:master
