Discovery of Web Query Interface Based on HTML Features and Hierarchical Clustering
(基于HTML特征与层次聚类的Web查询接口发现)

Cited by: 4


Abstract: Web Query Interfaces (WQIs) differ so much in structure from site to site that they are hard to discover automatically. To address this, the paper proposes a WQI discovery method based on Hyper Text Markup Language (HTML) features and hierarchical clustering. The method exploits the hierarchical structure and attachment relationships among HTML control elements, together with the fact that interactive controls generally sit at the terminal positions of the DOM tree, and parses the page with a combination of pre-order and post-order traversal to build a suitable tree model of the page. Because interaction density is locally concentrated in query regions, the method uses it to locate candidate regions and to initialize the clustering set, in which each local region is treated as one class, called an interaction group. The interaction groups are then clustered hierarchically by the similarity of their structure distances, and each interactive control in the resulting candidate interfaces is semantically labeled with a suitable text node, namely the structurally nearest one, which yields the complete WQI region. Finally, a text filter based on the textual features of the interface removes non-query interfaces. Experimental results show that the method overcomes the excessive dependence of traditional methods on the <form> tag and generalizes well, with an interface recognition rate and accuracy of 90.7% and 92%, respectively.
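The clustering step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's definitions of structure distance or interaction density, so everything below is an assumption made for illustration: structure distance is taken to be the number of tree edges between two controls through their lowest common ancestor, the clustering is plain single-linkage with a hypothetical cut-off `max_distance`, and the names `Node`, `TreeBuilder`, `structure_distance`, and `cluster` are invented here. The density-based initialization, the semantic labeling, and the text filter are not modeled.

```python
from html.parser import HTMLParser

INTERACTIVE = {"input", "select", "textarea", "button"}  # assumed set of controls
VOID = {"input", "br", "img", "hr", "meta", "link"}      # tags without a closing tag


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []

    def path(self):
        """Return the nodes from the root down to this node."""
        node, out = self, []
        while node is not None:
            out.append(node)
            node = node.parent
        return out[::-1]


class TreeBuilder(HTMLParser):
    """Build a simple tag tree and collect the interactive controls in it."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root
        self.controls = []

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag in INTERACTIVE:
            self.controls.append(node)
        if tag not in VOID:
            self.current = node  # descend only into elements that can have children

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def structure_distance(a, b):
    """Assumed metric: tree edges between a and b via their lowest common ancestor."""
    pa, pb = a.path(), b.path()
    common = 0
    for x, y in zip(pa, pb):
        if x is not y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)


def cluster(controls, max_distance=2):
    """Single-linkage agglomerative clustering, stopped at a distance cut-off."""
    groups = [[c] for c in controls]
    while len(groups) > 1:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d = min(structure_distance(a, b)
                        for a in groups[i] for b in groups[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_distance:
            break
        groups[i].extend(groups.pop(j))  # j > i, so index i stays valid
    return groups


# A page fragment with no <form> tag at all.
PAGE = """
<div id="search">
  <label>Title</label><input name="title">
  <label>Year</label><select name="year"><option>2016</option></select>
  <button>Search</button>
</div>
<div id="footer"><div><div><input name="newsletter"></div></div></div>
"""

builder = TreeBuilder()
builder.feed(PAGE)
groups = cluster(builder.controls)
print([[n.tag for n in g] for g in groups])
```

On this fragment the three sibling controls in the search block end up in one group and the deeply nested footer input stays alone, even though the page contains no <form> element. A real page would need a tolerant HTML parser and a tuned cut-off; the value 2 is chosen only to make the toy example separate cleanly.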

Authors: WEI Jiaxin (魏佳欣), YE Feiyue (叶飞跃)

Affiliation: School of Computer Engineering and Science, Shanghai University

Source: Computer Engineering (《计算机工程》), 2016, No. 2, pp. 56-61 (6 pages). Indexed in CAS, CSCD, and the Peking University Core Journals list.

Keywords: Web Query Interface (WQI); Hyper Text Markup Language (HTML); hierarchical clustering; structure distance; interaction density; text filter

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]

About the authors: WEI Jiaxin (born 1990), female, master's degree, main research interest: Web semantic understanding; YE Feiyue, Ph.D.


