Content extraction from HTML pages is the basis of web page clustering and information retrieval, so cluttered information must be eliminated and page content must be extracted accurately. A novel and accurate solution for extracting the content of HTML pages is proposed. First, the HTML page is parsed into a DOM object and IDs are generated for all leaf nodes. Second, a score is calculated for each leaf node and then adjusted according to the node's relationship with its neighbors. Finally, the information blocks are found according to their definition, and a universal classification algorithm is used to identify the content blocks. The experimental results show that the algorithm extracts content effectively and accurately, with a recall rate of 96.5% and a precision of 93.8%.
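The three steps named in this abstract (leaf-node IDs, neighbor-adjusted scores, block selection) can be illustrated with a minimal Python sketch. The abstract does not give the actual ID scheme, scoring formula, or classifier, so everything below is an assumption made for illustration: the path-style leaf ID, the text-length score with a link penalty, the weighted neighbor smoothing, and the fixed threshold that stands in for the paper's classification algorithm. The names `LeafCollector`, `score_leaves`, and `extract_content` are hypothetical.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style"}  # text inside these is never page content
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class LeafCollector(HTMLParser):
    """Collect text leaves as (leaf_id, text, in_link) tuples.

    The ID is the tag path plus a running index; this scheme is an
    assumption, not the one used in the paper.
    """

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open tags, outermost first
        self.link_depth = 0   # > 0 while inside an <a> element
        self.leaves = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return            # void elements never contain text
        self.stack.append(tag)
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return            # stray closing tag, ignore it
        # Pop up to and including the matching tag (tolerates unclosed tags).
        while self.stack:
            popped = self.stack.pop()
            if popped == "a":
                self.link_depth -= 1
            if popped == tag:
                break

    def handle_data(self, data):
        text = data.strip()
        if not text or any(t in SKIP_TAGS for t in self.stack):
            return
        leaf_id = "/".join(self.stack) + f"#{len(self.leaves)}"
        self.leaves.append((leaf_id, text, self.link_depth > 0))


def score_leaves(leaves):
    """Assumed scoring: text length, link text penalised, then smoothed."""
    raw = [len(text) * (0.2 if in_link else 1.0) for _, text, in_link in leaves]
    adjusted = []
    for i, own in enumerate(raw):
        left = raw[i - 1] if i > 0 else own
        right = raw[i + 1] if i + 1 < len(raw) else own
        # Neighbor adjustment: half own score, a quarter from each neighbor.
        adjusted.append(0.5 * own + 0.25 * left + 0.25 * right)
    return adjusted


def extract_content(html, threshold=30.0):
    """Return the text of leaves whose adjusted score reaches the threshold.

    The threshold replaces the paper's classification step in this sketch.
    """
    collector = LeafCollector()
    collector.feed(html)
    collector.close()
    scores = score_leaves(collector.leaves)
    return [text for (_, text, _), s in zip(collector.leaves, scores)
            if s >= threshold]
```

Calling `extract_content(page_source)` on a page with short navigation links around a few long paragraphs should keep the paragraphs and drop the links; the threshold and weights would need tuning on real pages, and the reported recall and precision figures apply to the paper's method, not to this sketch.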
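This paper discusses the principles of developing MATLAB network applications based on MATLAB Web Server, and describes the general processing flow of a MATLAB Web program and the detailed setup of the related configuration files. It addresses two key problems in MATLAB Web development: obtaining input parameters from an HTML page through the input module, and generating an HTML file containing output data and images through the output module. Using the MATLAB Web Server environment, a simulation of control performance for a remote control laboratory was implemented, with the simulation results displayed as two-dimensional plots. This provides a reference for selecting control parameters and verifying experimental results when building an online control laboratory. The remote data processing method can be extended to other remote data processing fields and has high value for wider application.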
Funding: Project (2012BAH18B05) supported by the Supporting Program of Ministry of Science and Technology of China