Letting AI Do the Reading: Discovering Hidden Gems in Ancient TCM Books

Written by textmining | Published 2025/04/30
Tech Story Tags: natural-language-processing | traditional-chinese-texts | entity-extraction | knowledge-gap | text-mining | llms-in-traditional-medicine | ai-in-medicine | tf-idf-algorithm

TLDR: AI deciphers ancient TCM texts, extracting key entities and relationships to build knowledge graphs and spark new medical discoveries.

Authors:

(1) Hanqing ZHAO, College of Traditional Chinese Medicine, Hebei University; corresponding author (zhaohq@hbu.edu.cn); funded by the National Natural Science Foundation of China (No. 82004503) and the Science and Technology Project of Hebei Education Department (BJK2024108);

(2) Yuehan LI, College of Traditional Chinese Medicine, Hebei University.

Table of Links

Abstract and 1. Introduction

2. Materials and Methods

2.1 Experimental Data and 2.2 Conditional random fields model

2.3 TF-IDF algorithm and 2.4 Dependency Parser Based on Neural Network

2.5 Experimental Environment

3 Experimental results

3.1 Results of word segmentation and entity recognition

3.2 Visualization results of related entity vocabulary map

3.3 Results of dependency parsing

4 Final Remarks

5 References

Abstract: Entity and relationship extraction is a crucial component of natural language processing tasks such as knowledge graph construction, question answering system design, and semantic analysis. Most of the information on the Yishui school of traditional Chinese medicine (TCM) is stored as unstructured classical Chinese text, and extracting key information from these texts plays an important role in mining and studying the academic schools of TCM. To perform this extraction efficiently with artificial intelligence methods, this study builds a word segmentation and entity relationship extraction model based on conditional random fields to identify and extract entities and their relationships from TCM texts, and uses TF-IDF, a weighting technique common in information retrieval and data mining, to extract the key entities in different ancient books. A neural-network-based dependency parser then analyzes the grammatical relationships between entities in each ancient text and visualizes them as a tree structure. This lays the foundation for constructing a knowledge graph of the Yishui school and for using artificial intelligence methods in research on TCM academic schools.

1 Introduction

In the era of artificial intelligence and big data, mining and using the knowledge in ancient Chinese medicine literature is one of the basic tasks for the inheritance, innovation, and development of traditional Chinese medicine. Although related fields have made some progress in recent years [1], great challenges remain, especially in the inheritance and development of the schools of traditional Chinese medicine.

Most of the academic schools of TCM have been passed down in the form of ancient books, and the data are mainly unstructured text. Manually processing ancient book data and extracting items such as named entities is time-consuming and labor-intensive. Ancient documents are recorded in classical Chinese, which uses concise wording and differs considerably from modern Chinese in vocabulary and semantics. In particular, there is a lack of standard data sets for artificial intelligence analysis, which poses a great obstacle to automatic extraction from ancient documents by computer methods.

At present, Chinese named entity recognition relies mainly on rule-based, statistical machine learning, and deep learning methods [2]. The rule-based method relies on manually written rules combined with a named entity library, and determines the type of an entity by how well the entity matches the rules. This method can achieve good recognition results, but rules differ between fields and cannot be transferred from one field to another. Machine learning methods have therefore gradually emerged. The machine learning models used for Chinese named entity recognition mainly include the hidden Markov model (HMM) and the conditional random field (CRF) [3]. With the improvement of hardware computing power, deep learning methods have become more and more common, and they perform better than methods based on statistical machine learning. Deep learning methods train the model through neural networks; the mainstream models include convolutional neural networks (CNN) [4] and recurrent neural networks (RNN) [5].

For relation extraction in the field of traditional Chinese medicine, some scholars have applied pipeline relation extraction models to TCM texts. Xie et al. [6] use a long short-term memory (LSTM) network to recognize entities from labeled data, and then classify the extracted entities for relation extraction to complete the extraction of the entire triplet. During classification, a convolutional neural network (CNN) is used for knowledge fusion to handle polysemy in entity relationships. Zhang et al. [7] use conditional random fields for entity recognition and extraction, use crawlers to collect entity attributes, use a BiLSTM with an attention mechanism for relation extraction, and handle polysemy through the entity attributes. WangShang [8] used a comprehensive cross-entropy loss function and the SEGATT layer of the segmented attention mechanism for relation classification, and used CNN for knowledge fusion.
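To make the rule-based approach described above more concrete, here is a minimal Python sketch of lexicon matching, one of the simplest forms such a method can take. It is not taken from any of the cited works: the function name `match_entities`, the lexicon entries, the type labels, and the sample string are all hypothetical, and a real system would combine a much larger entity library with hand-written rules.

```python
def match_entities(text, lexicon):
    """Forward maximum matching: at each position, take the longest lexicon entry.

    text: an unsegmented string.
    lexicon: dict mapping an entity string to its (hypothetical) type label.
    Returns a list of (entity, label, start_index) tuples.
    """
    max_len = max((len(term) for term in lexicon), default=0)
    found = []
    i = 0
    while i < len(text):
        # Try the longest possible candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in lexicon:
                found.append((candidate, lexicon[candidate], i))
                i += size
                break
        else:
            # No lexicon entry starts here; move on by one character.
            i += 1
    return found


# Toy lexicon and sentence, invented for illustration only.
lexicon = {"人参": "herb", "黄芪": "herb", "脾胃": "organ", "脾": "organ"}
result = match_entities("人参黄芪补脾胃", lexicon)
```

The sketch also shows the limitation noted above: the lexicon is tied to one field, so entities outside it are simply not found.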

Using classical natural language processing methods, this study first segments the text of classical ancient books and performs named entity recognition according to the general PKU scheme. On this basis, the TF-IDF algorithm is used to extract key entity words, and dependency parsing is then carried out to provide data samples for subsequent knowledge graph construction. In the concrete implementation, this study combines a conditional random field natural language processing model, TF-IDF key entity extraction, and a high-performance neural-network-based dependency parser to automatically analyze and visualize representative texts of the Yishui school, providing a reference for research on and application of artificial intelligence technology to the schools of traditional Chinese medicine.
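As a rough illustration of the keyword-extraction step described above, the following Python sketch computes a textbook TF-IDF score (term frequency times the natural log of the inverse document frequency) over texts that have already been segmented. It is an assumption-laden sketch rather than the study's implementation: the study's own formulation is given in its Section 2.3 and may differ, and the function names `tf_idf` and `top_keywords`, the book titles, and the token lists are invented for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF-IDF scores.

    docs: dict mapping a book title to a list of already-segmented tokens.
    Returns a dict mapping each title to a {token: score} dict.
    """
    n_docs = len(docs)
    # Document frequency: the number of books in which each token appears.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    scores = {}
    for title, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[title] = {
            tok: (cnt / total) * math.log(n_docs / df[tok])
            for tok, cnt in counts.items()
        }
    return scores


def top_keywords(scores, title, k=3):
    """Return the k highest-scoring tokens for one book (ties broken by token)."""
    ranked = sorted(scores[title].items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, _ in ranked[:k]]


# Toy, hand-made token lists; not taken from the study's corpus.
books = {
    "book_a": ["脾胃", "元气", "脾胃", "饮食"],
    "book_b": ["脏腑", "元气", "药性", "脏腑"],
    "book_c": ["药性", "元气", "升降", "升降"],
}
scores = tf_idf(books)
```

In this toy corpus a token that occurs in every book, such as 元气, scores zero, while a token concentrated in a single book ranks first for that book. This is the property that makes TF-IDF useful for picking out the entities characteristic of each text.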

This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.


Written by textmining | Text Mining
Published by HackerNoon on 2025/04/30