Why does the discreteness of word embeddings lead the optimizer to easily fall into local minima? #47

@xsc1234

Description

I recently read your paper "GPT Understands, Too" and have a question about this passage that I hope you can help explain: "1) Discreteness: the original word embedding e of M has already become highly discrete after pre-training. If h is initialized with random distribution and then optimized with stochastic gradient descent (SGD), which has been proved to only change the parameters in a small neighborhood (AllenZhu et al., 2019), the optimizer would easily fall into local minima." As I understand it, this passage first states that the pretrained model's word embeddings are discrete with respect to one another. But the trainable parameters h are themselves randomly initialized and do not come from the word embeddings, so how does the discreteness of the word embeddings affect the optimization of h?
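As a toy illustration of the quoted argument (not the paper's actual setup), one can picture the pretrained embeddings as a few well-separated points and the loss landscape as having one local minimum near each of them. Gradient descent from a random h then converges to whichever discrete point happened to be closest at initialization, i.e. the random init, not the data, decides which basin you land in. All names and numbers below are made up for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "highly discrete" pretrained embeddings: a few
# well-separated points in a 2-D embedding space.
E = np.array([[10.0, 0.0], [0.0, 10.0], [-10.0, -10.0]])

def loss(h):
    # Squared distance to the *nearest* embedding: a toy non-convex
    # objective with one local minimum at each discrete embedding.
    return float(np.min(np.sum((E - h) ** 2, axis=1)))

def grad(h):
    # Gradient of the toy loss: points toward the nearest embedding.
    i = np.argmin(np.sum((E - h) ** 2, axis=1))
    return 2.0 * (h - E[i])

def gd(h, lr=0.1, steps=200):
    # Plain gradient descent; each step only moves h a short distance,
    # so it never leaves the basin it started in.
    for _ in range(steps):
        h = h - lr * grad(h)
    return h

h0 = rng.normal(size=2)          # randomly initialized prompt vector h
h_final = gd(h0.copy())

nearest_at_init = int(np.argmin(np.sum((E - h0) ** 2, axis=1)))
nearest_at_end = int(np.argmin(np.sum((E - h_final) ** 2, axis=1)))
print(nearest_at_init == nearest_at_end)  # the basin is fixed at init
```

In this caricature, the optimizer ends up at the local minimum nearest its random starting point, which is one way to read the claim that discreteness plus small-neighborhood SGD updates makes local minima hard to escape.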
