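Hello, and sorry to bother you. I have recently been trying to use your code to build GCNs over semantic and syntactic dependencies. What preprocessing was applied to the sentences used to generate the two files mr_stan.pkl and mr_semantic_0.05.pkl? For example, textGCN converts numbers to "NUM" and links to "url".
Because the training dataset is small, the model still fails to converge. Could this be caused by a difference in preprocessing, or might there be other causes?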
import re


def clean_str(string):
    """
    String cleaning
    """
    string = string.lower()
    # Collapse newlines and repeated whitespace.
    string = re.sub(r"\n", " ", string)
    string = re.sub(r"\s{2,}", " ", string)
    # Drop double quotes.
    string = re.sub(r'"', " ", string)
    # Replace links with a placeholder token.
    string = re.sub(r"(http://)?www\.[^ ]+", " _url_ ", string)
    # Keep only lowercase letters, digits, "$", "'" and "_".
    string = re.sub(r"[^a-z0-9$\'_]", " ", string)
    string = re.sub(r"_{2,}", "_", string)
    # Split contractions off the preceding word.
    string = re.sub(r"\'s", " \'s", string)
    string = re.sub(r"\'ve", " \'ve", string)
    string = re.sub(r"n\'t", " n\'t", string)
    string = re.sub(r"\'re", " \'re", string)
    string = re.sub(r"\'d", " \'d", string)
    string = re.sub(r"\'m", " \'m", string)
    string = re.sub(r"\'ll", " \'ll", string)
    string = re.sub(r"\$+", " $ ", string)
    # Note: this never matches, because the text was lowercased above.
    string = re.sub(r"(?<=[a-z])I", " I", string)
    # Replace standalone numbers with the NUM token.
    string = re.sub(r"(?<= )[0-9]+(?= )", "NUM", string)
    string = re.sub(r"(?<= )[0-9]+$", "NUM", string)
    string = re.sub(r"^[0-9]+(?= )", "NUM", string)
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip()
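To make the number and link handling easier to compare against another pipeline, here is a minimal sketch that isolates just those two rules. The function name `normalize_numbers_and_urls` is hypothetical. It also replaces standalone numbers token by token rather than with the three lookaround regexes above, so it is a simplified illustration and not necessarily the preprocessing used to build the .pkl files.

```python
import re


def normalize_numbers_and_urls(text: str) -> str:
    """Simplified illustration: lowercase, map links to _url_, map numbers to NUM."""
    text = text.lower()
    # Replace links that start with "www." (optionally prefixed by "http://").
    text = re.sub(r"(http://)?www\.[^ ]+", " _url_ ", text)
    # Keep only lowercase letters, digits, "$", "'" and "_".
    text = re.sub(r"[^a-z0-9$'_]", " ", text)
    # Replace tokens made up entirely of digits with NUM.
    tokens = ["NUM" if tok.isdigit() else tok for tok in text.split()]
    return " ".join(tokens)


if __name__ == "__main__":
    print(normalize_numbers_and_urls("Visit www.example.com in 2024, it costs 5 dollars"))
```

Running the same sentences through both pipelines and diffing the resulting token lists is a quick way to check whether a preprocessing mismatch is plausible.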