Abstract
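Traditional models for developing and using ancient books can no longer meet the needs of humanities research, and humanities scholars look forward to a digital humanities research paradigm that couples technical logic with humanistic logic. Starting from the in-depth development and use of ancient documents, this paper applies new information technology and interdisciplinary methods oriented to digital humanities research. Taking large-scale Chinese ancient texts as its object and adopting a big data approach, it collates, annotates, and automatically segments the texts. With word frequency statistics as the research core, it uses data denoising, statistical computation based on time-window units, sliding-window prediction, and other analysis and mining methods, together with real-time big data analysis technology, to achieve real-time, online, multidimensional, visual, and quantitative analysis of the historical frequency distribution of characters and words. The result is a real-time statistical analysis platform for ancient books, aimed mainly at research in linguistics, historical philology, historical geography, and other humanities disciplines. The platform can help researchers discover new patterns, phenomena, and trends in large volumes of ancient documents, and represents a preliminary attempt at innovation in how ancient books are developed and used. 11 figures. 36 references.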
Digital humanities, as a new research paradigm, offers traditional humanities and social sciences a new way of working, because the traditional mode of developing and using ancient literature resources no longer meets the requirements of humanities research. This paper aims at the in-depth development and use of ancient literature resources, applying new information technology and digital humanities methods to ancient Chinese texts in order to construct a new platform for real-time textual statistical analysis serving linguistics, historical literature studies, historical geography, and related fields. The study adopts a big data approach and applies sorting and labelling to ancient Chinese texts to build a corpus of more than 40,000 titles. It carries out word segmentation using a segment-wise dictionary superposition method together with a bigram model, and applies the Grubbs method for data denoising so as to eliminate as much problematic data as possible. With word frequency statistics over the ancient corpus as the research focus, word frequencies are computed in time-window units, and in-memory real-time computing is applied to overcome the bottleneck of reading big data. The results are displayed along a time axis as a micro-level scatter plot and a macro-level curve graph. Taking the authors of the ancient books as the organizing thread, geographic information system (GIS) technology is used to integrate and display the digitized books, and retrieval of the ancient literature serves as the entry point for showing the geographical distribution of the authors. The study improves the efficiency of real-time queries and visualizes word frequency by year as scatter diagrams and curve graphs.
A statistical and analytical platform for ancient literature and documents in linguistics, history, and historical geography will be established on the basis of these new methods and this paradigm. The study extends the research paradigm and methods of the humanities and enriches the tools available for humanities research. It broadens the ways in which ancient literature and texts can be used and developed, and expands the scope of humanities source materials. The platform has broad application prospects in linguistics, history, and historical geography. This research is a new attempt at the in-depth development and use of ancient texts and documents through digital humanities in a big data setting. First, it builds a large-scale corpus of more than 40,000 titles of ancient books. Second, it combines statistical methods with a superposition-based segmentation method to segment ancient texts. Finally, with the help of big data techniques, it improves the efficiency of real-time queries and visualizes word frequency by year as scatter diagrams and curve graphs, giving a direct visual display of the analysis results. Because the vocabulary database is insufficient, the accuracy of word segmentation needs to be improved. In addition, to improve the quality of the corpus, the edition and author information of the ancient books requires verification. The extraction of entities from the corpus, such as persons, historical events, places, titles, and names, needs further development. 11 figs. 36 refs.
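The abstract does not give implementation details for the time-window word frequency computation, so the following is only a minimal sketch of one way such a statistic could be calculated. The function name `windowed_frequency`, the `(year, tokens)` input shape, the fixed-width window, and the toy data are all assumptions made for illustration, not the author's method.

```python
from collections import defaultdict


def windowed_frequency(docs, term, window=100):
    """Relative frequency of `term` per fixed-width time window.

    docs   -- iterable of (year, tokens) pairs; tokens are already segmented
              (hypothetical input shape, assumed for this sketch)
    window -- window width in years (assumed parameter)
    Returns a dict {window_start_year: hits / total_tokens}, sorted by year.
    """
    hits = defaultdict(int)    # occurrences of term per window
    totals = defaultdict(int)  # total tokens per window
    for year, tokens in docs:
        start = (year // window) * window  # floor division also handles BCE (negative) years
        totals[start] += len(tokens)
        hits[start] += sum(1 for t in tokens if t == term)
    return {s: hits[s] / totals[s] for s in sorted(totals) if totals[s]}


# Toy data, invented purely for illustration.
docs = [
    (1010, ["天", "下", "天"]),
    (1040, ["人", "天"]),
    (1120, ["天", "地", "人", "和"]),
]
freq = windowed_frequency(docs, "天", window=100)
for start, value in freq.items():
    print(start, round(value, 3))
```

Normalizing by the total token count in each window keeps periods with many surviving texts from dominating the curve. A denoising step such as the Grubbs test mentioned in the abstract would be applied to the per-window values separately and is not shown here.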
Source
《中国图书馆学报》
CSSCI
Peking University Core Journals (北大核心)
2016, No. 2, pp. 66-80 (15 pages)
Journal of Library Science in China
Keywords
Digital humanities (数字人文)
Text visualization (文本可视化)
Data mining (数据挖掘)
Ancient literature (古籍文献)
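About the author
Ouyang Jian (欧阳剑): doctoral candidate in computational linguistics, Institute of Linguistics, Shanghai Normal University; research librarian, Guangxi University for Nationalities Library. Shanghai 200234. Corresponding author: Ouyang Jian, Email: oyjjj@163.com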