@ankit-vaghela30
Contributor

This improved accuracy from 27.xx to 58.xx

return ((label, feature), value)
by_label = x.map(doc_to_label) # ((label, feature), value)
by_label = by_label.reduceByKey(lambda x, y: x+y)
by_label_map = by_label.collectAsMap()
Contributor

This is gonna be 4*vocab... yikes

Contributor Author

I know! But this is the second change which improved my accuracy.

If you remember the equation for the likelihood probability, in the numerator we have to count the words in the class given the document. I was simply counting the number of words in the class.
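
For reference, here is the usual add-one (Laplace) smoothed likelihood, assuming the classifier is multinomial naive Bayes, which the thread suggests but does not state. The symbols are assumptions for illustration: $N_{c,w}$ is the count of word $w$ in documents of class $c$, and $V$ is the vocabulary.

$$
P(w \mid c) = \frac{N_{c,w} + 1}{\sum_{w' \in V} N_{c,w'} + |V|}
$$

The numerator is a per-class, per-word count plus one, so a word that never appears in a class still contributes 1.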

I am using by_label_map to fetch the count of words given the document. If the (label, feature) combination does not exist, I return 0 as the count, and every time I fetch a word count I add 1 to it. Our likelihood numerator becomes (for each class and each word):

by_label_map.get((label, feature), 0) + 1

The important thing is that before this change I was not counting those (0+1)s!
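
A minimal sketch of that lookup in plain Python, assuming by_label_map is the dictionary produced by collectAsMap(). The labels, words, and counts are made up for illustration, and smoothed_count is a hypothetical helper, not a function from this PR.

```python
# Hypothetical counts keyed by (label, feature), standing in for the
# result of by_label.collectAsMap(). All values are illustrative.
by_label_map = {
    ("A", "oil"): 3,
    ("A", "bank"): 1,
    ("B", "bank"): 4,
}


def smoothed_count(label, feature):
    """Return the add-one numerator for one (label, feature) pair."""
    # Unseen pairs fall back to 0, so they still contribute 0 + 1 = 1.
    return by_label_map.get((label, feature), 0) + 1


seen = smoothed_count("A", "oil")    # pair present in the map
unseen = smoothed_count("B", "oil")  # pair absent from the map
print(seen, unseen)
```

Without the default of 0, the absent pair would either raise a KeyError or be skipped entirely, which matches the missing (0+1) terms described above.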

I know the collectAsMap() looks ugly. It was a desperate attempt to execute the idea as soon as possible!

@cbarrick
Contributor

cbarrick commented Feb 2, 2018

The performance improvement is exciting! The collectAsMap not so much.

It looks like a change to the log-likelihood formulation. What's the high-level idea?

@ankit-vaghela30
Contributor Author

The high-level idea is this: in the fit method, while training, we have to create the likelihood RDDs considering all possible labels. Algorithmically, that means iterating over our four labels and calculating the likelihood probabilities for those words.

This is why you see that ugly cartesian product there! I was not doing that before.
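
The idea can be sketched without Spark: pair every label with every vocabulary word, then compute a smoothed log-likelihood for each pair. This assumes the standard add-one denominator (total word count in the class plus vocabulary size), which the thread does not spell out. The function log_likelihoods and the sample data are hypothetical, not code from this PR.

```python
import math
from collections import defaultdict
from itertools import product


def log_likelihoods(counts, labels, vocab):
    """Compute add-one smoothed log P(word | label) for every pair.

    counts maps (label, word) to an observed count; missing pairs count as 0.
    """
    # Total observed words per label, used in the denominator.
    totals = defaultdict(int)
    for (label, _word), n in counts.items():
        totals[label] += n

    result = {}
    # The cartesian product guarantees an entry for every (label, word),
    # including pairs that never occurred in training.
    for label, word in product(labels, vocab):
        numerator = counts.get((label, word), 0) + 1
        denominator = totals[label] + len(vocab)
        result[(label, word)] = math.log(numerator / denominator)
    return result


# Illustrative data only.
labels = ["A", "B"]
vocab = ["oil", "bank", "rate"]
counts = {("A", "oil"): 3, ("A", "bank"): 1, ("B", "bank"): 4}

ll = log_likelihoods(counts, labels, vocab)
print(len(ll))  # one entry per (label, word) pair
```

In Spark, the same pairing would correspond to something like RDD.cartesian between a labels RDD and a vocabulary RDD. The result has one entry per label per vocabulary word, which is where the 4*vocab size mentioned earlier comes from.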
