Abstract
As research on the LDA (latent Dirichlet allocation) model deepens, its capacity for text representation and mining continues to improve. The "topic" is a central concept in the LDA model: it is a multinomial probability distribution over the feature set. Topic tracking monitors a stream of unseen news stories to find all stories related to a topic that is specified by a small number of known relevant samples. Applying the LDA model to topic tracking serves two purposes: (1) to test how well LDA topics represent the tracked topic; and (2) to examine the relation between LDA topics and the tracked topic when the LDA model is used to mine the tracked topic from the training data. The experimental results show that tracking based on the LDA model outperforms the classical vector space model and the unigram language model, as well as the event model proposed specifically for tracked topics. However, because the two kinds of topics differ in granularity, there is no direct one-to-one correspondence between LDA topics and tracked topics. An LDA model with customizable topics is the goal of future work.
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journal (北大核心)
2011, Issue B10, pp. 136-139, 152 (5 pages)
Funding
Supported by the National Natural Science Foundation of China (60873097, 60933005).
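About the authors
ZHANG Xiaoyan (张晓艳), born 1981, female, Ph.D., lecturer. Main research interests: natural language processing, topic detection and tracking. E-mail: zhangxiaoyan@nudt.edu.cn
WANG Ting (王挺), born 1970, male, Ph.D., professor and doctoral supervisor. Main research interests: natural language processing, computer software.
LIANG Xiaobo (梁晓波), born 1969, male, Ph.D., professor and master's supervisor. Main research interests: corpus linguistics, cognitive linguistics.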