Abstract
Many existing text classification algorithms require a train-test-retrain cycle, and general-purpose models have limited grammatical expressiveness. To address these problems, this paper proposes an improved fuzzy grammar algorithm (IFGA). First, a learning model is built from a set of selected text fragments. To accommodate slight variations, the model is incremental: the selected fragments are converted into an underlying structure that forms the fuzzy grammar. Finally, the grammars of the individual text fragments are combined with a fuzzy union operation, which converts the learned fragments into a more general representation. Two groups of experiments compare IFGA with the decision table algorithm, an improved naive Bayes algorithm, and other algorithms. The first experiment shows no significant performance difference between IFGA and the other machine learning algorithms. The second shows that the incremental learning algorithm has an advantage over standard machine learning algorithms: its performance is more stable and less affected by data size. The proposed algorithm also requires less time to retrain the model.
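The following is a toy sketch of the general idea described in the abstract, not the paper's IFGA. It assumes that a text fragment can be reduced to a sequence of coarse symbols, that each learned sequence acts as one grammar rule, and that membership in a category is the maximum similarity over that category's rules (the max operator is the standard fuzzy union). The names `to_symbols` and `ToyFuzzyGrammar`, the `NUM`/`WORD` symbol set, and the use of `difflib.SequenceMatcher` as the similarity measure are all illustrative assumptions.

```python
import re
from difflib import SequenceMatcher


def to_symbols(fragment):
    """Map each token to a coarse symbol class (hypothetical terminal set)."""
    symbols = []
    for token in re.findall(r"\w+|[^\w\s]", fragment):
        if token.isdigit():
            symbols.append("NUM")
        elif token.isalpha():
            symbols.append("WORD")
        else:
            symbols.append(token)  # keep punctuation and mixed tokens literally
    return tuple(symbols)


class ToyFuzzyGrammar:
    """Illustrative incremental model: one set of symbol sequences per label."""

    def __init__(self):
        self.rules = {}  # label -> set of symbol sequences

    def learn(self, fragment, label):
        """Incremental update: add one rule without revisiting earlier data."""
        self.rules.setdefault(label, set()).add(to_symbols(fragment))

    def membership(self, fragment, label):
        """Degree in [0, 1]: max similarity over the label's rules (fuzzy union)."""
        seq = to_symbols(fragment)
        scores = (
            SequenceMatcher(None, seq, rule).ratio()
            for rule in self.rules.get(label, ())
        )
        return max(scores, default=0.0)

    def classify(self, fragment):
        """Return the label with the highest membership degree."""
        return max(self.rules, key=lambda label: self.membership(fragment, label))


grammar = ToyFuzzyGrammar()
grammar.learn("2017-11-03", "date")
grammar.learn("call 555 1234", "phone")
print(grammar.classify("1999-01-31"))

# A new date format is absorbed by adding one rule; nothing is retrained.
grammar.learn("03/11/2017", "date")
print(grammar.membership("31/01/1999", "date"))
```

In this sketch, adding a fragment only inserts one rule into a set, which is why no retraining pass over earlier fragments is needed. A real fuzzy grammar would generalize rules rather than store every sequence.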
Source
《计算机应用研究》
CSCD
Peking University Core Journals
2017, No. 11, pp. 3355-3358, 3378 (5 pages)
Application Research of Computers
Funding
National Natural Science Foundation of China (61300234)
Hunan Provincial Education Science and Technology Program (13C243, 12C1056)
Keywords
text classification
machine learning
incremental learning
fuzzy grammar
retraining
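About the authors
Gong Jing (b. 1972), female, from Yueyang, Hunan; associate professor, master's degree; main research interests include data mining and natural language processing (gongjinghn@126.com).
Huang Xinyang (b. 1971), male, from Qiyang, Hunan; associate professor, master's degree; main research interests include data mining and information security.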