Abstract
[Purpose/Significance] Understanding the thought of "Laozi" is central to understanding early Chinese culture. This study applies digital humanities methods to empirical research on the text. Using large-scale computation that combines quantitative statistics with qualitative analysis, it addresses long-standing, hard-to-settle questions in Laozi studies, such as the sources and influence of the text, and uncovers features of textual organization and intertextual relationships that are difficult to detect through reading experience alone. [Method/Process] Character frequencies were counted on the Heshanggong edition of "Laozi". Similarity analysis and an analysis of citations in the classics were then carried out. Finally, a BERT model was trained on ancient Chinese, and the character embeddings it generates were used to compute the similarity between sentences of the classics, with the correlation study conducted on texts that predate "Laozi". [Result/Conclusion] With TF-IDF text vectorization, "Huainanzi" was found to be the work most similar to "Laozi" among later texts. The BERT model, trained by self-supervised learning, reached an accuracy of 52.11% on the cloze task and 98.45% on next-sentence prediction. The similarity results indicate that "Mozi" is closely related to "Laozi", which prompts a fresh look at the relationship between the arguments and ideas of the two texts.
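The TF-IDF similarity step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the character-level tokenization, the smoothed IDF formula, the function names (`tfidf_vectors`, `cosine`, `rank_by_similarity`) and the three toy strings are all assumptions made for illustration, and the toy strings are not a real corpus.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build character-level TF-IDF vectors for a dict of {title: text}."""
    counts = {title: Counter(text) for title, text in docs.items()}
    n_docs = len(docs)
    # Document frequency: number of texts in which each character occurs.
    df = Counter(ch for c in counts.values() for ch in c)
    vectors = {}
    for title, c in counts.items():
        total = sum(c.values())
        # Smoothed IDF (an assumption): characters found in every text
        # still receive a small positive weight.
        vectors[title] = {
            ch: (n / total) * (math.log((1 + n_docs) / (1 + df[ch])) + 1)
            for ch, n in c.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(ch, 0.0) for ch, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_by_similarity(target, docs):
    """Rank every other text by its cosine similarity to the target text."""
    vectors = tfidf_vectors(docs)
    scores = {
        title: cosine(vectors[target], vec)
        for title, vec in vectors.items()
        if title != target
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy strings, used only to exercise the functions.
toy_docs = {
    "target": "道可道非常道",
    "text_a": "天道无亲",
    "text_b": "山高水长",
}
ranking = rank_by_similarity("target", toy_docs)
for title, score in ranking:
    print(title, round(score, 3))
```

In this toy setup `text_a` shares one character with the target and `text_b` shares none, so `text_a` ranks first and `text_b` scores zero. A real study would run the same ranking over full texts, and the choice of tokenization and weighting would affect the result.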
Authors
Gao Ruiqing (高瑞卿)
Dong Qiwen (董启文)
Fang Da (方达)
Wang Hongzhi (王弘治)
Fang Yong (方勇)
Affiliations: School of Data Science and Engineering, East China Normal University, Shanghai 200062; Department of Chinese Language and Literature, East China Normal University, Shanghai 200062; School of Humanities, Shanghai Normal University, Shanghai 200234
Source
Journal of Intelligence (《情报杂志》)
CSSCI
Peking University Core Journal (北大核心)
2021, No. 10, pp. 99-107 (9 pages)
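Funding
A research output of the Major Project of the National Social Science Fund of China, "General History of Chinese Zhuzi Studies" (中国诸子学通史) (No. 19ZDA244)
A research output of the National Social Science Fund of China project "Dictionary of Sounds and Meanings in the Jingdian Shiwen" (《经典释文》音义辞典) (No. 19FYYB008)
A research output of the East China Normal University "Flower of Happiness" (幸福之花) Pilot Fund major research program, "'Flower of Happiness' Pilot Research Fund Project: A Study of the Sources and Meaning of Laozi's Thought from a Big Data Perspective" (No. 44300-19312-542500/005).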
Keywords
BERT
digital humanities
similarity
relationship mining
Pre-Qin
Laozi
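About the authors
Gao Ruiqing, female, born 1997, master's degree; research interests: natural language processing and text mining. Dong Qiwen, male, born 1977, Ph.D., professor; research interests: applied data science technologies, including network informatics, machine learning, and computational advertising. Fang Da, male, born 1987, Ph.D., assistant research fellow; research interest: Zhuzi studies. Corresponding author: Wang Hongzhi, male, born 1977, Ph.D., associate professor; research interest: history of the Chinese language. Fang Yong, male, born 1956, Ph.D., professor; research interest: Zhuzi studies.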