Iterate over input data, don't load into memory #170
Open
jfelectron wants to merge 5 commits into mimno:master
The current implementation loads the entire input file into memory, so memory use grows with input size and is eventually exhausted on large data sets. This is a proof of concept (POC) for handling out-of-core data sets.
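As a rough illustration of the idea, here is a minimal sketch of reading input one line at a time instead of loading the whole file. It is not the code in this PR: the class `StreamingLines` and the method `countTokens` are hypothetical names made up for this example, and it assumes one document per line with whitespace-separated tokens.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical helper (not part of this PR): yields one input line at a time. */
public class StreamingLines implements Iterator<String>, AutoCloseable {
    private final BufferedReader reader;
    private String next; // the only line held by this iterator at any moment

    public StreamingLines(Reader source) {
        this.reader = new BufferedReader(source);
        advance();
    }

    /** Reads the next line from the underlying reader; null marks end of input. */
    private void advance() {
        try {
            next = reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public String next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        String line = next;
        advance();
        return line;
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }

    /**
     * Example consumer: counts whitespace-separated tokens without ever
     * collecting the lines into a list. Assumes one document per line.
     */
    public static long countTokens(Reader source) throws IOException {
        long tokens = 0;
        try (StreamingLines lines = new StreamingLines(source)) {
            while (lines.hasNext()) {
                String line = lines.next().trim();
                if (!line.isEmpty()) {
                    tokens += line.split("\\s+").length;
                }
            }
        }
        return tokens;
    }
}
```

For a real file, the `Reader` could come from `java.nio.file.Files.newBufferedReader(path)`. Memory use then stays bounded by the reader's buffer plus the current line, regardless of file size.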
Notes:
- I'm not a Java dev. This works but can likely be improved. The PR is mainly for visibility of the issue, which I spent a week or more on, off and on.
- This does not solve the issue of LDA model training trying to load all the data into memory. If possible, we should find a way to make that iterable as well.
Further improvements: