-
Notifications
You must be signed in to change notification settings - Fork 7
Unk frequency #2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| target_vocabulary_type=word | ||
|
|
||
| ; Vocabulary size in each side. | ||
| unk_frequency=0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need detailed description for this option.
And I guess this option has multiple meanings:
- choosing filtering strategy as either by-frequency or by-ranking,
- specifying the threshold of unknown words in both source/target languages.
Maybe they could be separated into some unique options. For example:
unk_filter_type=frequency/rank
source_unk_frequency=3 (only used when type=frequency)
target_unk_frequency=4 (ditto)
source_vocabulary_size=4100 (only used when type=rank)
target_vocabulary_size=4900 (ditto)
| target_vocabulary_type=word | ||
| source_vocabulary_size=30 | ||
| unk_frequency=0 | ||
| source_vocabulary_size=33 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
back to 30.
| // TODO: This is a workaround for old Boost libraries. The function should | ||
| // return a smart pointer, but boost::scoped_ptr is not movable, and the | ||
| // serialization library does not support std::unique_ptr. | ||
| nmtkit::Vocabulary * createVocabulary( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think basically one parameter should have only one meaning to prevent abusing them. CharacterVocabulary and WordVocabulary could take more 1 parameter to choose unk filtering strategy (just specified in config file) to prevent increasing tne number of meanings in unk_frequency.
|
|
||
| WordVocabulary::WordVocabulary(const string & corpus_filename, unsigned size) { | ||
| NMTKIT_CHECK(size >= 3, "Size should be equal or greater than 3."); | ||
| WordVocabulary::WordVocabulary(const string & corpus_filename, unsigned unk_frequency, unsigned size) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add some test code in src/test/word_vocabulary_test.cc for unk_frequency?
|
|
||
| CharacterVocabulary::CharacterVocabulary( | ||
| const string & corpus_filename, | ||
| unsigned unk_frequency, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add some test code in src/test/character_vocabulary_test.cc for unk_frequency?
odashi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments for designing.
add option unk_frequency to adjust vocabulary size.
in a configuration file,
to ignore this option.
by setting n to it, a word with the frequency under n will be treated as an unknown word.