Abstract
Text data are large in scale and high in feature dimension. Current text classification methods fail to capture how text characteristics change, which results in low accuracy, large errors, and long classification times. To obtain better classification results, a text classification method based on big data mining technology is designed. First, current research progress in text classification is analyzed to identify the causes of poor classification performance. Then, the original text classification features are extracted, and the kernel principal component analysis (KPCA) algorithm is introduced to process them, reducing the feature dimension and simplifying the structure of the text classifier. Finally, the text classifier is constructed with big data mining technology and compared against other text classification methods. The comparison tests show that the proposed method better describes the characteristics of text change and accurately recognizes and classifies various types of text. Its classification accuracy exceeds 95%, significantly higher than that of other current text classification methods, and its classification time is significantly reduced.
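The KPCA dimension-reduction step named in the abstract can be illustrated with a minimal standard-library sketch. It is not the paper's implementation: the abstract does not specify the raw features, the kernel, or its parameters, so the bag-of-words term frequencies, the RBF kernel, `gamma=0.25`, the toy corpus, and all function names below are assumptions chosen for illustration. The classifier itself is omitted because the abstract does not say which model is used.

```python
import math
from collections import Counter

def tf_vectors(docs):
    """Bag-of-words term-frequency vectors (assumed raw features)."""
    vocab = sorted({w for d in docs for w in d.split()})
    vectors = []
    for d in docs:
        counts = Counter(d.split())
        vectors.append([float(counts[w]) for w in vocab])
    return vectors

def rbf_kernel(X, gamma):
    """K[i][j] = exp(-gamma * ||x_i - x_j||^2); the kernel choice is an assumption."""
    n = len(X)
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
             for j in range(n)] for i in range(n)]

def center_kernel(K):
    """Double-centre K, which centres the data in the implicit feature space."""
    n = len(K)
    row = [sum(r) / n for r in K]  # K is symmetric, so row means equal column means
    total = sum(row) / n
    return [[K[i][j] - row[i] - row[j] + total for j in range(n)] for i in range(n)]

def matvec(A, v):
    return [sum(a * x for a, x in zip(r, v)) for r in A]

def top_eigenpairs(A, k, iters=500):
    """Leading eigenpairs of a symmetric PSD matrix: power iteration with deflation."""
    n = len(A)
    A = [r[:] for r in A]  # work on a copy, since deflation modifies the matrix
    pairs = []
    for c in range(k):
        v = [1.0 + ((7 * i + 3 * c) % 5) for i in range(n)]  # deterministic start
        for _ in range(iters):
            w = matvec(A, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no non-zero direction left
                return pairs
            v = [x / norm for x in w]
        lam = sum(x * y for x, y in zip(v, matvec(A, v)))  # Rayleigh quotient
        pairs.append((lam, v))
        for i in range(n):  # deflate: remove the component just found
            for j in range(n):
                A[i][j] -= lam * v[i] * v[j]
    return pairs

def kpca(X, k, gamma):
    """Project the training points onto the top-k kernel principal components."""
    Kc = center_kernel(rbf_kernel(X, gamma))
    pairs = top_eigenpairs(Kc, k)
    eigvals = [lam for lam, _ in pairs]
    # Projection of training point i on a component is sqrt(lambda) * v[i].
    Z = [[math.sqrt(max(lam, 0.0)) * v[i] for lam, v in pairs] for i in range(len(X))]
    return Z, eigvals

# Toy corpus (hypothetical): three sports-like and three finance-like documents.
docs = [
    "match team goal win",
    "team goal score match",
    "goal win team player",
    "stock market price fund",
    "market fund bank stock",
    "price bank stock market",
]
Z, eigvals = kpca(tf_vectors(docs), k=2, gamma=0.25)
for doc, z in zip(docs, Z):
    print([round(x, 3) for x in z], doc)
```

In this sketch each document's 11-dimensional term-frequency vector is reduced to 2 components, and the first component takes opposite signs for the two topic groups. The reduced vectors `Z` are what would be passed to a downstream classifier in place of the raw features.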
Author
MENG Xinmiao (孟鑫淼), H3C Research Institute of Big Data, Zhengzhou 450001, China
Source
Modern Electronics Technique (《现代电子技术》)
Peking University Core Journal
2020, Issue 17, pp. 126-129 (4 pages)
Keywords
large-scale text data
high-dimensional feature
big data mining technology
text classifier
classification accuracy
classification duration
About the author
MENG Xinmiao (born 1989), male, from Zhengzhou, Henan; master's degree, lecturer. His research focuses on big data technology.