
Question about preprocessing for the semantic-based graph and the syntactic-based graph #11

@Annrison

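Hello, and sorry to bother you. I have recently been trying to use your code to build GCNs over the semantic and syntactic dependency graphs. What preprocessing was applied to the sentences used to generate the two files mr_stan.pkl and mr_semantic_0.05.pkl? For example, textGCN replaces numbers with "NUM" and links with "url".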

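Because my training dataset is fairly small, the model has not been able to converge. Could this be caused by a difference in preprocessing, or might there be other reasons?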

import re


def clean_str(string):
    """
    String cleaning
    """
    string = string.lower()
    # Collapse newlines and runs of whitespace.
    string = re.sub(r"\n", " ", string)
    string = re.sub(r"\s{2,}", " ", string)
    # Replace double quotes with a space.
    string = re.sub(r'"', " ", string)
    # Replace links with a placeholder token.
    string = re.sub(r"(http://)?www\.[^ ]+", " _url_ ", string)
    # Keep only lowercase letters, digits, "$", "'" and "_".
    string = re.sub(r"[^a-z0-9$\'_]", " ", string)
    string = re.sub(r"_{2,}", "_", string)
    # Split off English contractions.
    string = re.sub(r"\'s", " \'s", string)
    string = re.sub(r"\'ve", " \'ve", string)
    string = re.sub(r"n\'t", " n\'t", string)
    string = re.sub(r"\'re", " \'re", string)
    string = re.sub(r"\'d", " \'d", string)
    string = re.sub(r"\'m", " \'m", string)
    string = re.sub(r"\'ll", " \'ll", string)
    string = re.sub(r"\$+", " $ ", string)
    # Note: this never matches, because the string is already lowercased.
    string = re.sub(r"(?<=[a-z])I", " I", string)
    # Replace standalone numbers with "NUM".
    string = re.sub(r"(?<= )[0-9]+(?= )", "NUM", string)
    string = re.sub(r"(?<= )[0-9]+$", "NUM", string)
    string = re.sub(r"^[0-9]+(?= )", "NUM", string)
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip()
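
One way I could check whether a preprocessing mismatch is involved is to count how many tokens from my cleaned sentences are missing from the vocabulary the graphs were built on. The sketch below is only a rough idea: I do not know the actual layout of the .pkl files, so `graph_vocab`, `oov_rate`, and the toy sentences are all hypothetical and would need to be replaced with the real vocabulary and data.

```python
from collections import Counter


def oov_rate(cleaned_docs, graph_vocab):
    """Return (fraction of tokens missing from graph_vocab, most common missing tokens)."""
    missing = Counter()
    total = 0
    for doc in cleaned_docs:
        for token in doc.split():
            total += 1
            if token not in graph_vocab:
                missing[token] += 1
    rate = sum(missing.values()) / total if total else 0.0
    return rate, missing.most_common(10)


# Toy data (hypothetical, not taken from the MR dataset or the .pkl files).
docs = ["the film runs NUM minutes", "see _url_ for more"]
graph_vocab = {"the", "film", "runs", "minutes", "see", "for", "more"}

rate, top_missing = oov_rate(docs, graph_vocab)
print(f"out-of-vocabulary rate: {rate:.2f}")
print("most common missing tokens:", top_missing)
```

If placeholder tokens such as NUM or _url_ dominate the missing list, that would suggest the graphs were built from text that was cleaned differently.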
