Abstract
Deriving reliable linguistic knowledge from corpus data presupposes a thorough and systematic understanding of the properties of those data, and the choice of statistical tools that match them. This paper reviews existing studies of corpus data and attempts a systematic summary of their characteristics. We find that corpus data in many cases do not follow a normal distribution; that they have a hierarchical, nested structure; that they are to some degree unbalanced, non-random, non-representative, and non-independent; and that they potentially involve both fixed-effect and random-effect factors. Given these properties, statistical tools currently well suited to corpus data include the Mann-Whitney-Wilcoxon rank-sum test and mixed-effects/multilevel models, among others.
Source
Corpus Linguistics (《语料库语言学》)
2020, Issue 1, pp. 44-56, 114 (14 pages in total)