Visual tokens of HowTo100M

Hi, do you convert the raw videos of the HTM dataset into visual tokens during pre-training? And what is the total size of its visual tokens? Since HTM takes 12T of space, I'm curious how large the visual tokens are.