Abstract
Web Query Interfaces (WQIs) on different websites are hard to discover automatically because their structures are highly heterogeneous. To address this problem, this paper proposes a WQI discovery method based on Hyper Text Markup Language (HTML) features and hierarchical clustering. HTML control elements are organized in hierarchical and attachment relationships, and interactive controls generally sit at the terminal positions of the DOM tree; exploiting these facts, the method parses each page by combining pre-order and post-order traversal and builds a suitable tree model of the page. Because interaction density is locally concentrated in query regions, the method uses it to locate local candidate regions and to initialize the clustering set, in which each local region is treated as one class, called an interaction group. The interaction groups are then clustered hierarchically by structure distance, and each interactive control in the resulting candidate interfaces is semantically labeled with the structurally nearest suitable text node, which yields the complete WQI region. Non-query interfaces are filtered out by a text filter that uses the text features of the interface. Experimental results show that the method overcomes the excessive dependence of traditional methods on the <form> tag and generalizes well, with an interface recognition rate and accuracy of 90.7% and 92%, respectively.
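The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions made only for illustration. Structure distance is taken to be the number of tree edges between two nodes through their lowest common ancestor, since the abstract does not define it. The clustering set is initialized with one group per interactive control rather than by interaction density. The names INTERACTIVE, VOID, TreeBuilder, structure_distance, cluster and the max_distance threshold are all hypothetical.

```python
from html.parser import HTMLParser

INTERACTIVE = {"input", "select", "textarea", "button"}  # assumed set of interactive controls
VOID = {"input", "br", "img", "hr", "meta", "link"}      # tags that never take a closing tag


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []
        if parent is not None:
            parent.children.append(self)

    def path(self):
        """Return the nodes from the root down to this node."""
        node, out = self, []
        while node is not None:
            out.append(node)
            node = node.parent
        return out[::-1]


class TreeBuilder(HTMLParser):
    """Builds a simple element tree (text nodes are ignored in this sketch)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, then close it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def interactive_nodes(root):
    """Collect interactive controls in pre-order (document order)."""
    stack, found = [root], []
    while stack:
        node = stack.pop()
        if node.tag in INTERACTIVE:
            found.append(node)
        stack.extend(reversed(node.children))
    return found


def structure_distance(a, b):
    """Assumed metric: tree edges between a and b via their lowest common ancestor."""
    pa, pb = a.path(), b.path()
    common = 0
    for x, y in zip(pa, pb):
        if x is not y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)


def cluster(nodes, max_distance):
    """Single-linkage agglomerative clustering, stopped at max_distance."""
    groups = [[n] for n in nodes]
    while len(groups) > 1:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d = min(structure_distance(a, b) for a in groups[i] for b in groups[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > max_distance:
            break
        _, i, j = best
        groups[i].extend(groups.pop(j))
    return groups


PAGE = """
<html><body>
  <div id="search"><form><input name="q"><select name="cat"></select><button>Go</button></form></div>
  <div><p>News</p><div><div><div><input name="email"></div></div></div></div>
</body></html>
"""
builder = TreeBuilder()
builder.feed(PAGE)
groups = cluster(interactive_nodes(builder.root), max_distance=2)
print([[n.tag for n in g] for g in groups])
```

In this toy page the three sibling controls inside the search form are two edges apart and merge into one interaction group, while the deeply nested e-mail field is eight edges away and stays separate. A fuller treatment would also attach the nearest text node to each control and apply a text filter, as the abstract describes.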
Source
《计算机工程》
CAS
CSCD
Peking University Core Journals (北大核心)
2016, No. 2, pp. 56-61 (6 pages)
Computer Engineering
Keywords
Web Query Interface (WQI)
Hyper Text Markup Language (HTML)
hierarchical clustering
structure distance
interaction density
text filter
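About the authors
Wei Jiaxin (魏佳欣, born 1990), female, master's degree; main research interest: Web semantic understanding.
Ye Feiyue (叶飞跃), Ph.D.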