Abstract
This paper reports character and syllable frequency statistics for a Tibetan corpus of 20 million characters, and computes the entropy of Tibetan characters (字丁) together with the relative and absolute entropy of syllables. The results show that (1) there are 5,334 standard Tibetan syllables, of which 475 consist of one character, 3,061 of two, 902 of three, and 896 of four; and (2) the frequency distribution of Tibetan characters and syllables is highly uneven: the 703 most frequent syllables cover 90% of the corpus, and the 1,140 most frequent cover 95%.
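The two kinds of figures quoted in the abstract, an entropy value and the number of top-ranked units needed to reach a given coverage, can be illustrated with a minimal sketch. It assumes the standard Shannon entropy over relative frequencies; the abstract does not give the paper's definitions of "relative" and "absolute" entropy, so those are not reproduced here. The function names `shannon_entropy` and `coverage_size` and the toy counts are hypothetical and are not data from the paper.

```python
import math
from collections import Counter


def shannon_entropy(counts: Counter) -> float:
    """Shannon entropy in bits of the distribution implied by raw counts."""
    total = sum(counts.values())
    return -sum(
        (c / total) * math.log2(c / total)
        for c in counts.values()
        if c > 0
    )


def coverage_size(counts: Counter, target: float) -> int:
    """Smallest number of most-frequent items whose counts reach `target` of the total."""
    total = sum(counts.values())
    covered = 0
    for rank, (_, c) in enumerate(counts.most_common(), start=1):
        covered += c
        if covered >= target * total:
            return rank
    return len(counts)


# Toy counts standing in for syllable frequencies (illustrative only).
toy = Counter({"a": 50, "b": 25, "c": 15, "d": 10})
print(shannon_entropy(toy))
print(coverage_size(toy, 0.8))
```

On a real corpus, `counts` would be built by segmenting the text into syllables (or characters) and counting each unit; an uneven distribution shows up as a coverage size far smaller than the inventory size.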
Source
Terminology Standardization & Information Technology (《术语标准化与信息技术》)
2004, No. 2, pp. 27-31 (5 pages)
Keywords
Tibetan
character frequency
syllable frequency
information entropy