Discovery of Web Query Interface Based on HTML Features and Hierarchical Clustering
Cited by: 4
Abstract: Web Query Interfaces (WQIs) on different Web sites are hard to discover automatically because their structures are highly heterogeneous. This paper proposes a WQI discovery method based on Hyper Text Markup Language (HTML) features and hierarchical clustering. The method exploits the hierarchical structure and attachment relationships among HTML control elements, together with the fact that interactive controls generally sit at the terminal (leaf) positions of the DOM tree, and parses the page with a combination of pre-order and post-order traversal to build a suitable tree model of the page. Because interactive controls are locally concentrated within a query region, the interaction density is used to locate candidate regions and initialize the clustering set, in which each local region is treated as one class, called an interaction group. The interaction groups are then clustered hierarchically according to their structure distance, and each interactive control in the resulting candidate interfaces is semantically labeled with the structurally nearest suitable text node, which yields the complete WQI region. Finally, text features of the interface are used to filter out non-query interfaces. Experimental results show that the method overcomes the excessive dependence of traditional methods on the <form> tag and generalizes well, with an interface recognition rate and accuracy of 90.7% and 92%, respectively.
Source: Computer Engineering (《计算机工程》), indexed in CAS, CSCD, and PKU Core; 2016, No. 2, pp. 56-61 (6 pages)
Keywords: Web Query Interface (WQI); Hyper Text Markup Language (HTML); hierarchical clustering; structure distance; interaction density; text filter
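About the authors: Wei Jiaxin (魏佳欣, b. 1990), female, master's degree, main research interest: Web semantic understanding; Ye Feiyue (叶飞跃), Ph.D.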
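
The clustering step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define structure distance or interaction density, so this sketch assumes that structure distance is the number of tree edges between two controls, starts with every control as its own interaction group instead of using density-based initialization, and omits text labeling and the text filter. The names `TreeBuilder`, `structure_distance`, `cluster_controls`, and the threshold `max_dist` are hypothetical.

```python
from html.parser import HTMLParser
from itertools import combinations

INTERACTIVE = {"input", "select", "textarea", "button"}
VOID = {"input", "br", "img", "hr", "meta", "link"}  # tags with no end tag


class Node:
    def __init__(self, tag, parent=None, attrs=None):
        self.tag, self.parent, self.attrs = tag, parent, dict(attrs or [])
        self.depth = 0 if parent is None else parent.depth + 1


class TreeBuilder(HTMLParser):
    """Builds a simple page tree and records visible interactive controls."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]
        self.controls = []

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.stack[-1], attrs)
        if tag in INTERACTIVE and node.attrs.get("type") != "hidden":
            self.controls.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def structure_distance(a, b):
    """Assumed metric: edges on the tree path between a and b."""
    dist = 0
    while a is not b:
        if a.depth >= b.depth:
            a = a.parent
        else:
            b = b.parent
        dist += 1
    return dist


def cluster_controls(controls, max_dist=4):
    """Single-linkage agglomerative clustering; no <form> tag is required."""
    groups = [[c] for c in controls]
    while len(groups) > 1:
        best = None
        for i, j in combinations(range(len(groups)), 2):
            d = min(structure_distance(a, b) for a in groups[i] for b in groups[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        if d > max_dist:
            break
        merged = groups.pop(j)  # j > i, so index i is unaffected
        groups[i].extend(merged)
    return groups


if __name__ == "__main__":
    page = """
    <body><div><table><tr>
      <td>Title <input type="text" name="title"></td>
      <td>Author <input type="text" name="author"></td>
      <td><input type="submit" name="go" value="Search"></td>
    </tr></table></div>
    <div><div><div><div><input type="text" name="newsletter"></div></div></div></div>
    </body>
    """
    builder = TreeBuilder()
    builder.feed(page)
    for group in cluster_controls(builder.controls):
        print([c.attrs.get("name") for c in group])
```

In this sketch, controls that share a nearby ancestor (for example, cells of the same table row) merge into one candidate interface, while an isolated control elsewhere on the page stays in its own group. The threshold of 4 edges is an arbitrary illustrative value, not one taken from the paper.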