Labels: question (Further information is requested)
Description
Hi Tim!
Many thanks for this awesome repo and your paper.
It is always cool when someone works on making DL actually useful, accessible, and more efficient!
We are building an open dataset and a set of STT / TTS models for the Russian language.
You can see some of our published work here.
A quick recap of our findings in this field, to give some context for why I am asking my question (bear with me for a moment):
- STT is usually done by large corporations, which use 8-layer 1024-unit bi-LSTMs and / or a lot of compute and / or networks with 150-300M params. Such networks are either slow or over-parametrized;
- We have found that just by applying key achievements from modern CV and NMT (deep residual CNNs, separable convolutions mixed afterwards with 1x1 convolutions, input sequence scaling with BPE, curriculum learning, SCSE layers) you can get a 3-4x speed-up on real "in-the-wild" data while also reducing your network's memory footprint 4-5x without losing performance (!!!) (e.g. a CNN with 150M params converged in 10 days on 4x1080Ti, vs. a network with 30M params that converges in 3-4 days on the same setup)!
- Without reducing the sequence length, convergence suffers: a network with 30M params without down-scaling the input sequence takes 3x as many iterations to converge as a network with 150M params, but each iteration is 3x faster, so there is no REAL gain => I wonder whether the same applies to your method as well;
- TLDR - MobileNet ideas can be applied in any field;
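To make the separable-convolution point above concrete, here is a minimal parameter-count sketch (the channel count 256 and kernel size 3 are illustrative values I picked, not numbers from either paper):

```python
def conv_params(c_in, c_out, k):
    # standard 2D convolution: one k x k filter per (input, output) channel pair
    return c_in * c_out * k * k

def separable_conv_params(c_in, c_out, k):
    # depthwise separable convolution: a per-channel k x k (depthwise) conv,
    # followed by a 1x1 "mix" (pointwise) conv across channels
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise

std = conv_params(256, 256, 3)
sep = separable_conv_params(256, 256, 3)
print(f"standard: {std}, separable: {sep}, reduction: {std / sep:.1f}x")
```

For these settings the separable variant needs roughly 8-9x fewer parameters, which is where most of the MobileNet-style footprint savings come from.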
Obviously, your paper is very different in technical approach, but very similar in spirit to what we have done.
You also report these results (obviously, we are most interested in the ImageNet numbers).
Now, a couple of questions (maybe I missed it in the paper):
- How much time does it take to train a 20% sparse network on ImageNet? Did you compare the convergence speed vs. the plain vanilla baseline network? Does it take 1 / 0.2 = 5x more time?
- Did you run any real-life inference tests on ImageNet? I understand that in real life you use the "masked" approach, because sparse layers are just not there yet;
- Do you think that anything in your code may not work well with separable convolution layers?
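For clarity, this is what I mean by the "masked" approach in the second question: the weight tensor stays dense and a binary mask zeroes most entries, so memory layout and FLOPs at inference are unchanged (a toy sketch with made-up sizes, not your implementation):

```python
import random

def apply_mask(weights, density):
    # "masked" sparsity: keep a random fraction of weights, zero the rest;
    # the list stays dense, so a matmul over it costs the same as before
    n_keep = int(len(weights) * density)
    keep = set(random.sample(range(len(weights)), n_keep))
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]

random.seed(0)
w = [random.gauss(0, 1) for _ in range(1000)]
masked = apply_mask(w, 0.2)  # 20% density, as in the question above
nonzero = sum(1 for x in masked if x != 0.0)
print(f"nonzero weights: {nonzero} / {len(masked)}")
```

Real speed-ups would require kernels that skip the zeros, which is exactly the "sparse layers are not there yet" problem.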
Many thanks for your feedback!