I want to train T5 from scratch and use my own vocabulary.
I can create the model like this:
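from transformers import T5Config, T5ForConditionalGeneration
config = T5Config.from_json_file(config_file)  # config_file: path to my T5 config JSON
model = T5ForConditionalGeneration(config)  # randomly initialized, no pretrained weights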
My vocabulary looks like the example below, but the tokenizer does not seem to be able to load it. How should I load it into a proper tokenizer?
{
"": 0,
"": 1,
"": 2,
"": 3,
"": 4,
",": 5,
"的": 6,
"?": 7,
"了": 8,
.....
.....
.....
"<s_181>": 33786,
"<s_182>": 33787,
"<s_183>": 33788,
"<s_184>": 33789,
"<s_185>": 33790,
"<s_186>": 33791,
"<s_187>": 33792,
"<s_188>": 33793,
"<s_189>": 33794
}
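
One thing that may be worth checking before loading: as pasted above, the first five keys are all the empty string. If the file really looks like that (and these are not special tokens that got stripped when pasting), a JSON parser keeps only the last of them, so ids 0 to 3 would silently disappear. Below is a minimal sketch of a sanity check using only the standard library; the function name `load_vocab_checked` and the tiny sample string are hypothetical and only for illustration.

```python
import json


def load_vocab_checked(text):
    """Parse a flat token->id JSON vocab; report duplicate tokens and unused ids."""
    # object_pairs_hook=list keeps every (token, id) pair, including repeated keys,
    # which a plain json.loads would collapse into a single entry.
    pairs = json.loads(text, object_pairs_hook=list)
    vocab, duplicates = {}, []
    for token, idx in pairs:
        if token in vocab:
            duplicates.append((token, idx))  # this token was already seen
        vocab[token] = idx  # same behaviour as a normal dict: last one wins
    used = set(vocab.values())
    # ids in 0..max that no token maps to after duplicates are collapsed
    missing = sorted(set(range(max(used) + 1)) - used) if used else []
    return vocab, duplicates, missing


# Hypothetical miniature of the vocab above, with repeated empty keys.
sample = '{"": 0, "": 1, "": 2, ",": 3, "的": 4}'
vocab, duplicates, missing = load_vocab_checked(sample)
print("duplicate tokens:", duplicates)
print("ids without a token:", missing)
```

For the real file, the same function can be called on the file contents read with encoding="utf-8". If it reports duplicates or missing ids, the vocabulary probably needs distinct special-token strings before any tokenizer can use it.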