To avoid the curse of dimensionality, text categorization (TC) algorithms based on machine learning (ML) have to use an feature selection (FS) method to reduce the dimensionality of feature space. Although havin...To avoid the curse of dimensionality, text categorization (TC) algorithms based on machine learning (ML) have to use an feature selection (FS) method to reduce the dimensionality of feature space. Although having been widely used, FS process will generally cause information losing and then have much side-effect on the whole performance of TC algorithms. On the basis of the sparsity characteristic of text vectors, a new TC algorithm based on lazy feature selection (LFS) is presented. As a new type of embedded feature selection approach, the LFS method can greatly reduce the dimension of features without any information losing, which can improve both efficiency and performance of algorithms greatly. The experiments show the new algorithm can simultaneously achieve much higher both performance and efficiency than some of other classical TC algorithms.展开更多
An algorithm of text classification is given that imitates human's in this paper. On one hand, the algorithmenhances weight of theme when feature vector is processed, because of the assumption that the title of a ...An algorithm of text classification is given that imitates human's in this paper. On one hand, the algorithmenhances weight of theme when feature vector is processed, because of the assumption that the title of a document canproject its content. On the other hand,a weight parameter o vector is designed to simulate human's skimming andskipping behavior for calculating method of a document cluster center, and a weight of the feature that there are morepositive examples than negative ones is enhanced . The experiment shows that the algorithm greatly improves the per-formance of a text classification system.展开更多
由于算法的简单和效果的出色,Na ve Bayes被广泛地应用到了垃圾邮件过滤当中。通过理论与实验分析发现,结构差异较大的邮件集特征分布差异也较大,这种特征分布差异影响到了Na ve Bayes算法的效果。在此基础上,论文提出了一种基于结构特...由于算法的简单和效果的出色,Na ve Bayes被广泛地应用到了垃圾邮件过滤当中。通过理论与实验分析发现,结构差异较大的邮件集特征分布差异也较大,这种特征分布差异影响到了Na ve Bayes算法的效果。在此基础上,论文提出了一种基于结构特征的双层过滤模型,对不同结构的邮件使用不同的Na ve Bayes分类器分开训练和学习。实验分析表明,Na ve Bayes使用该模型之后效果有明显的提高,已经与SVM非常接近。展开更多
There are two well-known characteristics about text classification. One is that the dimension of the sample space is very high, while the number of examples available usually is very small. The other is that the examp...There are two well-known characteristics about text classification. One is that the dimension of the sample space is very high, while the number of examples available usually is very small. The other is that the example vectors are sparse. Meanwhile, we find existing support vector machines active learning approaches are subject to the influence of outliers. Based on these observations, this paper presents a new hybr/d active learning approach. In this approach, to select the unlabelled example(s) to query, the learner takes into account both sparseness and high-dimension characteristics of examples as well as its uncertainty about the examples' categorization. This way, the active learner needs less labeled examples, but still can get a good generalization performance more quickly than competing methods. Our empirical results indicate that this new approach is effective.展开更多
文摘To avoid the curse of dimensionality, text categorization (TC) algorithms based on machine learning (ML) have to use an feature selection (FS) method to reduce the dimensionality of feature space. Although having been widely used, FS process will generally cause information losing and then have much side-effect on the whole performance of TC algorithms. On the basis of the sparsity characteristic of text vectors, a new TC algorithm based on lazy feature selection (LFS) is presented. As a new type of embedded feature selection approach, the LFS method can greatly reduce the dimension of features without any information losing, which can improve both efficiency and performance of algorithms greatly. The experiments show the new algorithm can simultaneously achieve much higher both performance and efficiency than some of other classical TC algorithms.
基金Supported by the National Natural Science Foundation of China under Grant Nos.6987301169935010+2 种基金60103014 (国家自然科学基金) the National High Technology Development 863 Program of China under Grant No.863-306-ZD02-02-4 (国家863高科技发展计划) th
文摘An algorithm of text classification is given that imitates human's in this paper. On one hand, the algorithmenhances weight of theme when feature vector is processed, because of the assumption that the title of a document canproject its content. On the other hand,a weight parameter o vector is designed to simulate human's skimming andskipping behavior for calculating method of a document cluster center, and a weight of the feature that there are morepositive examples than negative ones is enhanced . The experiment shows that the algorithm greatly improves the per-formance of a text classification system.
文摘由于算法的简单和效果的出色,Na ve Bayes被广泛地应用到了垃圾邮件过滤当中。通过理论与实验分析发现,结构差异较大的邮件集特征分布差异也较大,这种特征分布差异影响到了Na ve Bayes算法的效果。在此基础上,论文提出了一种基于结构特征的双层过滤模型,对不同结构的邮件使用不同的Na ve Bayes分类器分开训练和学习。实验分析表明,Na ve Bayes使用该模型之后效果有明显的提高,已经与SVM非常接近。
文摘There are two well-known characteristics about text classification. One is that the dimension of the sample space is very high, while the number of examples available usually is very small. The other is that the example vectors are sparse. Meanwhile, we find existing support vector machines active learning approaches are subject to the influence of outliers. Based on these observations, this paper presents a new hybr/d active learning approach. In this approach, to select the unlabelled example(s) to query, the learner takes into account both sparseness and high-dimension characteristics of examples as well as its uncertainty about the examples' categorization. This way, the active learner needs less labeled examples, but still can get a good generalization performance more quickly than competing methods. Our empirical results indicate that this new approach is effective.