Abstract
Based on an analysis of the characteristics of Uighur spam web pages, this paper extracts web page features with the chi-square (χ²) test and designs a Uighur spam web page recognition model using least-squares estimation. To measure how different features affect model performance, comparative experiments were carried out on two kinds of feature: the number of Uighur characters in a page and keywords (feature words). The results show that when the features include the page's Uighur character length and the number of feature words is between 5 and 20, the model's recognition precision reaches about 90%, and that the number of Uighur characters in a page plays an important role in building the Uighur spam web page model.
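The pipeline the abstract describes (chi-square feature-word selection, then a least-squares model over the selected words plus the page's Uighur character count) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy pages, the Latin placeholder tokens, the function names, the `scale` divisor, the tiny `ridge` stabiliser, and the 0.5 decision threshold are all assumptions made here for the example.

```python
def chi_square(docs, labels, term):
    """Chi-square score of `term` against the spam label (2x2 contingency table)."""
    a = b = c = d = 0
    for tokens, is_spam in zip(docs, labels):
        present = term in tokens
        if present and is_spam:
            a += 1          # spam pages containing the term
        elif present:
            b += 1          # normal pages containing the term
        elif is_spam:
            c += 1          # spam pages without the term
        else:
            d += 1          # normal pages without the term
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def select_terms(docs, labels, k):
    """Keep the k terms with the highest chi-square score (ties: alphabetical)."""
    vocab = sorted({t for tokens in docs for t in tokens})
    return sorted(vocab, key=lambda t: -chi_square(docs, labels, t))[:k]


def featurize(tokens, char_count, terms, scale=1000.0):
    """Row = [intercept, scaled character count, one 0/1 flag per feature word]."""
    return [1.0, char_count / scale] + [1.0 if t in tokens else 0.0 for t in terms]


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for j in range(col, n + 1):
                m[r][j] -= factor * m[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_least_squares(rows, targets, ridge=1e-8):
    """Least-squares weights from the normal equations (X^T X) w = X^T y.

    `ridge` is an assumed, tiny diagonal term that only guards against a
    singular matrix; it is not part of the method named in the abstract.
    """
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(p)]
    return solve(xtx, xty)


def predict(weights, row, threshold=0.5):
    """Flag a page as spam when the fitted linear score reaches the threshold."""
    return sum(w * x for w, x in zip(weights, row)) >= threshold


# Hypothetical toy corpus: token sets, Uighur character counts, spam labels.
docs = [{"win", "prize", "free"}, {"free", "click", "prize"}, {"click", "win", "news"},
        {"news", "sport", "city"}, {"city", "school", "news"}, {"sport", "school", "free"}]
char_counts = [120, 80, 150, 900, 1200, 1000]
labels = [1, 1, 1, 0, 0, 0]

terms = select_terms(docs, labels, k=3)
rows = [featurize(t, c, terms) for t, c in zip(docs, char_counts)]
weights = fit_least_squares(rows, [float(y) for y in labels])
predictions = [predict(weights, r) for r in rows]
print(terms, predictions)
```

In this sketch the character count enters the model as an ordinary regression feature alongside the feature-word flags, which is one plausible reading of "the features include the page's Uighur character length"; the paper itself may encode it differently.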
Source
Journal of Xinjiang University (Natural Science Edition)
CAS
2012, No. 2, pp. 218-222 (5 pages)
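Funding
Autonomous Region High-Technology Research and Development Project (201012112)
Autonomous Region Electronics Development Special Fund Project (XJDZZXZJ20109)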
Keywords
Uighur text classification
Uighur web page identification
multiple regression analysis
feature extraction
About the authors
李永可 (Li Yongke, b. 1985), male, master's student; research area: search engines.
Corresponding author: 吴向前 (Wu Xiangqian), E-mail: wxq@xju.edu.cn