Frequency Might be Incorrect #52

@mkmoisen

It seems that there are several characters over at HanziCraft that do not have any High Frequency words, but only Medium Frequency words.

For example, the page for 幌 states that 幌子 and 札幌 are both Medium Frequency words, not High Frequency words.

The source code that determines frequencies defines a High Frequency word for a character as a word whose frequency is more than one standard deviation above the mean frequency of all words that share this character.
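As a minimal sketch of that rule as I understand it (the function name `high_frequency_words` and the toy data are my own and are not taken from the HanziCraft source, and I am assuming the sample standard deviation is the one intended):

```python
import statistics


def high_frequency_words(freqs):
    """Return (word, count) pairs whose count exceeds mean + one sample std.

    `freqs` maps every word containing a given character to its corpus count.
    Hypothetical helper for illustration, not HanziCraft's actual code.
    """
    counts = list(freqs.values())
    if len(counts) < 2:
        # stdev needs at least two data points
        return []
    threshold = statistics.mean(counts) + statistics.stdev(counts)
    return [(word, count) for word, count in freqs.items() if count > threshold]


# Toy data: one common word and four rare ones.
toy = {'aa': 50, 'ab': 2, 'ac': 1, 'ad': 1, 'ae': 1}
print(high_frequency_words(toy))
```

With this toy input only the word with count 50 clears the threshold, which is the behavior I would expect for 幌子 and 札幌.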

I've calculated the mean and standard deviation of the frequencies of words in the Weibo corpus containing 幌 to be 7.8 and 24.8, respectively, which puts the High Frequency threshold at roughly 32.6. 幌子 and 札幌, however, have frequencies of 114 and 44, respectively, and should thus both be considered High Frequency words for the 幌 character.
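The arithmetic can be checked without the corpus file by hard-coding the 23 counts from the listing further down in this issue. This is only a sanity check under the assumption that the listing is complete; the population standard deviation is included because I don't know which variant the site uses.

```python
import statistics

# Counts copied from the listing in this issue: 幌子 (114), 札幌 (44),
# two words seen twice, and nineteen words seen once (23 words in total).
frequencies = [114, 44, 2, 2] + [1] * 19

mean = statistics.mean(frequencies)
sample_std = statistics.stdev(frequencies)       # divides by n - 1
population_std = statistics.pstdev(frequencies)  # divides by n

sample_threshold = mean + sample_std
population_threshold = mean + population_std

print('mean:', round(mean, 2))
print('sample std:', round(sample_std, 2), 'threshold:', round(sample_threshold, 2))
print('population std:', round(population_std, 2), 'threshold:', round(population_threshold, 2))

# Both candidate words clear the threshold under either definition.
print(all(count > sample_threshold for count in (114, 44)))
print(all(count > population_threshold for count in (114, 44)))
```

Either way the threshold lands a little under 33, so the conclusion does not depend on which standard deviation is used.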

Would you please take a look at my calculations below to see whether this makes sense or whether I have reached an incorrect conclusion?

I performed the following in Python 3.6, using the LWC-words/words_types.txt file downloaded from the Weibo corpus open access page here:

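import statistics

# Load the Weibo corpus word list. The parsing below assumes each line is a
# comma-separated record whose fields are wrapped in quotes.
with open('words_types.txt', 'rb') as f:
    content = f.read().decode('utf-8')

# Map each word (field 1) to its frequency (field 3), stripping the quotes.
words = {
    line.split(',')[1][1:-1]: int(line.split(',')[3][1:-1])
    for line in content.split('\n')[:-1]
}

# Every (word, frequency) pair whose word contains 幌.
huang3_words = [
    (word, frequency)
    for word, frequency in words.items() if '幌' in word
]
print('All words with 幌')
for entry in huang3_words:
    print(entry)

frequencies = [frequency for _, frequency in huang3_words]
mean = statistics.mean(frequencies)
std = statistics.stdev(frequencies)  # sample standard deviation

# High Frequency: more than one standard deviation above the mean.
print()
print('High Frequency words with 幌')
for entry in huang3_words:
    if entry[1] > mean + std:
        print(entry)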

Which outputs the following:

All words with 幌
('幌子', 114)
('瞎幌', 1)
('札幌', 44)
('明月幌', 1)
('幌眼', 2)
('係幌子', 1)
('幌卷', 1)
('幌如', 1)
('倚虚幌', 1)
('招幌', 2)
('件幌事', 1)
('札幌乘', 1)
('酒幌', 1)
('愁幌', 1)
('了幌', 1)
('氏幌', 1)
('房间幌', 1)
('幌加内町', 1)
('苏联主义幌', 1)
('幌起', 1)
('一幌', 1)
('扎幌', 1)
('打幌', 1)

High Frequency words with 幌
('幌子', 114)
('札幌', 44)

Thanks and best regards,

Matthew Moisen

PS HanziCraft is awesome!
