
@delihiros

Add option unk_frequency to adjust the vocabulary size.
In a configuration file, set

unk_frequency=0

to ignore this option.

When it is set to n, any word with a frequency under n is treated as an unknown word.
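
As a rough illustration of the rule described here, the following minimal sketch counts whitespace-separated words and keeps only those that occur at least unk_frequency times. The helper name filterByFrequency and its signature are hypothetical and are not taken from this pull request; the real constructors read a corpus file and also handle special tokens.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper, not part of nmtkit: counts whitespace-separated words
// in `lines` and returns those whose frequency is at least `unk_frequency`.
// Words below the threshold would be mapped to the unknown-word token.
// With unk_frequency == 0 every word is kept, i.e. the option is ignored.
std::vector<std::string> filterByFrequency(
    const std::vector<std::string> & lines, unsigned unk_frequency) {
  std::map<std::string, unsigned> freq;
  for (const std::string & line : lines) {
    std::istringstream iss(line);
    std::string word;
    while (iss >> word) {
      ++freq[word];
    }
  }

  std::vector<std::string> kept;
  for (const auto & entry : freq) {
    if (entry.second >= unk_frequency) {
      kept.push_back(entry.first);
    }
  }
  // std::map iterates in key order, so `kept` is sorted alphabetically.
  return kept;
}
```

For a corpus in which "a" occurs once, "b" three times and "c" four times, a threshold of 2 keeps "b" and "c", and a threshold of 0 keeps all three words.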

target_vocabulary_type=word

; Vocabulary size in each side.
unk_frequency=0
Owner

@odashi odashi Jan 5, 2017

This option needs a detailed description.
I also think this option carries multiple meanings:

  • choosing filtering strategy as either by-frequency or by-ranking,
  • specifying the threshold of unknown words in both source/target languages.

Maybe they could be separated into distinct options. For example:

unk_filter_type=frequency/rank
source_unk_frequency=3 (only used when type=frequency)
target_unk_frequency=4 (ditto)
source_vocabulary_size=4100 (only used when type=rank)
target_vocabulary_size=4900 (ditto)
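
One way the separated options could be represented in code is sketched below. All type and function names (UnkFilterType, UnkFilterConfig, parseUnkFilterType, sourceThreshold, targetThreshold) are assumptions made for illustration and do not exist in nmtkit; only the option names come from the list above.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical types mirroring the option names proposed above.
enum class UnkFilterType { kFrequency, kRank };

struct UnkFilterConfig {
  UnkFilterType type;
  unsigned source_unk_frequency;    // used only when type == kFrequency
  unsigned target_unk_frequency;    // ditto
  unsigned source_vocabulary_size;  // used only when type == kRank
  unsigned target_vocabulary_size;  // ditto
};

// Converts the value of unk_filter_type into the enum, rejecting anything
// other than "frequency" or "rank".
UnkFilterType parseUnkFilterType(const std::string & value) {
  if (value == "frequency") return UnkFilterType::kFrequency;
  if (value == "rank") return UnkFilterType::kRank;
  throw std::invalid_argument("Unknown unk_filter_type: " + value);
}

// Returns the one number that applies to the source side under the selected
// strategy, so each stored option keeps a single meaning.
unsigned sourceThreshold(const UnkFilterConfig & config) {
  return config.type == UnkFilterType::kFrequency
      ? config.source_unk_frequency
      : config.source_vocabulary_size;
}

// Same as sourceThreshold, for the target side.
unsigned targetThreshold(const UnkFilterConfig & config) {
  return config.type == UnkFilterType::kFrequency
      ? config.target_unk_frequency
      : config.target_vocabulary_size;
}
```

With the example values above, type=frequency selects 3 and 4 as the source and target thresholds, while type=rank selects 4100 and 4900.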

target_vocabulary_type=word
source_vocabulary_size=30
unk_frequency=0
source_vocabulary_size=33
Owner

Please set this back to 30.

// TODO: This is a workaround for old Boost libraries. The function should
// return a smart pointer, but boost::scoped_ptr is not movable, and the
// serialization library does not support std::unique_ptr.
nmtkit::Vocabulary * createVocabulary(
Owner

@odashi odashi Jan 5, 2017

I think each parameter should basically have only one meaning, to prevent it from being abused. CharacterVocabulary and WordVocabulary could take one more parameter that chooses the unk filtering strategy (just specified in the config file), so that unk_frequency does not take on additional meanings.
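
A minimal sketch of that idea, assuming a hypothetical UnkFilter enum and a free function selectWords rather than the actual vocabulary constructors: the strategy is passed explicitly, and the numeric argument has exactly one meaning under each strategy.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names for illustration; none of this is nmtkit's actual API.
enum class UnkFilter { kFrequency, kRank };

// Returns the words kept in the vocabulary, most frequent first.
//   kFrequency: keep words that occur at least `threshold` times.
//   kRank:      keep the `threshold` most frequent words.
std::vector<std::string> selectWords(
    const std::map<std::string, unsigned> & freq,
    UnkFilter filter, unsigned threshold) {
  typedef std::pair<std::string, unsigned> Entry;
  std::vector<Entry> entries(freq.begin(), freq.end());

  // Sort by descending frequency; ties are broken alphabetically so the
  // result is deterministic.
  std::sort(entries.begin(), entries.end(),
      [](const Entry & a, const Entry & b) {
        return a.second != b.second ? a.second > b.second : a.first < b.first;
      });

  std::vector<std::string> kept;
  for (const Entry & entry : entries) {
    if (filter == UnkFilter::kFrequency && entry.second < threshold) break;
    if (filter == UnkFilter::kRank && kept.size() >= threshold) break;
    kept.push_back(entry.first);
  }
  return kept;
}
```

Because the caller states the strategy, a value such as 0 no longer needs to double as a switch that disables one behavior and enables another.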


WordVocabulary::WordVocabulary(const string & corpus_filename, unsigned size) {
NMTKIT_CHECK(size >= 3, "Size should be equal or greater than 3.");
WordVocabulary::WordVocabulary(const string & corpus_filename, unsigned unk_frequency, unsigned size) {
Owner

@odashi odashi Jan 5, 2017

Could you add some test code in src/test/word_vocabulary_test.cc for unk_frequency?


CharacterVocabulary::CharacterVocabulary(
const string & corpus_filename,
unsigned unk_frequency,
Owner

@odashi odashi Jan 5, 2017

Could you add some test code in src/test/character_vocabulary_test.cc for unk_frequency?

Owner

@odashi odashi left a comment

Some comments on the design.
