ReDE: A Regular Expression-Based Method for Extracting Biological Data

Original title: ReDE:一个基于正则表达式的生物数据抽取方法 (cited by: 8)
Abstract: Extracting data from heterogeneous biological data sources to build a query and analysis platform for biologists is an active research topic. The extraction process typically involves many interdependent pieces of metadata, and exploiting these dependencies to generate one piece of metadata from another reduces maintenance effort. Many existing extraction methods overlook the dependencies and therefore require considerable work to construct and maintain their metadata. This paper proposes ReDE, a method based on regular expressions (REs), to avoid this drawback. By building a parse tree around the RE groups, it provides an RE-based algorithm for generating a relational database schema and a general algorithm for extracting and assembling data. Its distinguishing feature is that the RE is the only metadata required, which makes management and maintenance comparatively easy. The method lays the groundwork for a biological database design aid and a highly automated extraction tool, and it has been used to extract data for the first integrated online biological data warehouse in China.
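Source: Journal of Computer Research and Development (《计算机研究与发展》), 2005, No. 12, pp. 2184-2191 (8 pages). Indexed in EI, CSCD, and the Peking University Core Journals list.
Funding: National High Technology Research and Development Program of China ("863" Program) (2002AA231011); Shanghai Municipal Major Science and Technology Fund project (02DJ14013).
Keywords: biological data source; data extraction; metadata; regular expression; extraction algorithm
About the authors: Contact: xbdeng@fudan.edu.cn. Deng Xubin (邓绪斌), born in 1964, Ph.D., lecturer; research interests: databases, data mining, and bioinformatics. Zhu Yangyong (朱扬勇), born in 1963, professor and doctoral supervisor; research interests: databases and knowledge bases, data mining, and bioinformatics.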
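
The abstract's central idea, that a regular expression can be the only metadata driving both schema generation and extraction, can be illustrated with a minimal sketch. This is not the ReDE algorithm itself: the paper builds a parse tree around RE groups, whereas the sketch below handles only a flat list of named groups. The record layout, the sample entries, the group names, and the helper functions `schema_from_regex`, `extract_rows`, and `load` are all hypothetical and chosen for illustration.

```python
import re
import sqlite3

# Made-up flat-file entries; the layout and values are illustrative only.
SAMPLE = """\
ID   X00001
DE   hypothetical protein A
OS   Homo sapiens
//
ID   X00002
DE   hypothetical protein B
OS   Mus musculus
//
"""

# The regular expression is the only metadata: its named groups define both
# what gets extracted and which columns the table has.
ENTRY_RE = re.compile(
    r"ID\s+(?P<accession>\S+)\n"
    r"DE\s+(?P<description>[^\n]+)\n"
    r"OS\s+(?P<organism>[^\n]+)\n"
    r"//"
)


def schema_from_regex(pattern, table):
    """Derive column names and a CREATE TABLE statement from named groups."""
    # groupindex maps group name -> group number; sorting by number keeps
    # the columns in the order the groups appear in the pattern.
    columns = sorted(pattern.groupindex, key=pattern.groupindex.get)
    column_sql = ", ".join(f"{name} TEXT" for name in columns)
    return columns, f"CREATE TABLE {table} ({column_sql})"


def extract_rows(pattern, text, columns):
    """Yield one tuple per matched entry, ordered like the columns."""
    for match in pattern.finditer(text):
        yield tuple(match.group(name) for name in columns)


def load(pattern, text, table="entry"):
    """Create the derived table in an in-memory database and fill it."""
    columns, ddl = schema_from_regex(pattern, table)
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f"INSERT INTO {table} VALUES ({placeholders})",
        extract_rows(pattern, text, columns),
    )
    return conn


if __name__ == "__main__":
    conn = load(ENTRY_RE, SAMPLE)
    query = "SELECT accession, organism FROM entry ORDER BY accession"
    for row in conn.execute(query):
        print(row)
```

In this sketch, changing the data format means editing only `ENTRY_RE`; the table definition and the extraction loop follow from it. Nested or repeated groups, which would presumably map to several related tables, are beyond what this flat version covers.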