Authors:
(1) Hanqing ZHAO, College of Traditional Chinese Medicine, Hebei University, funded by the National Natural Science Foundation of China (No. 82004503) and the Science and Technology Project of the Hebei Education Department (BJK2024108); Corresponding Author (zhaohq@hbu.edu.cn);
(2) Yuehan LI, College of Traditional Chinese Medicine, Hebei University.
Table of Links
2. Materials and Methods
2.1 Experimental Data and 2.2 Conditional random fields model
2.3 TF-IDF algorithm, 2.4 Dependency Parser Based on Neural Network and 2.5 Experimental Environment
3 Experimental results
3.1 Results of word segmentation and entity recognition
3.2 Visualization results of related entity vocabulary map
3.3 Results of dependency parsing
2.1 Experimental Data
The data for this study are the publicly available versions of the Origin of Medicine, the Spleen and Stomach Theory, and the Yin Syndrome Lue Case. The full text of each work was converted into a txt document: the table of contents was removed, only the title and the main text were retained, and spaces and blank lines were stripped. No further data cleaning was performed.
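The preprocessing described above can be expressed in a few lines of Python; the file names below are placeholders, as the paper does not specify them:

```python
# A minimal sketch of the preprocessing step (file names are hypothetical).
with open("origin_of_medicine.txt", encoding="utf-8") as f:
    lines = [line.strip() for line in f]

# Keep the title and main text; drop spaces and blank lines.
text = "\n".join(line.replace(" ", "") for line in lines if line)

with open("corpus_clean.txt", "w", encoding="utf-8") as f:
    f.write(text)
```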
2.2 Conditional random fields model
Conditional Random Field (CRF) is a basic model in natural language processing, widely used in Chinese word segmentation, named entity recognition, part-of-speech tagging and other sequence labeling scenarios. Chinese word segmentation uses the BMES word-position scheme, i.e. word beginning (B), word middle (M), word end (E) and single-character word (S). The input sentence S corresponds to the sequence X, and the output label sequence L corresponds to the sequence Y. We want to train a model that finds the optimal L for a given S. The key point in training this model is the selection of the feature functions F and the determination of the weight W of each feature function. Each feature function takes four inputs: ① the sentence S; ② i, the position of the ith word in sentence S; ③ li, the label assigned to the ith word by the sequence being scored; ④ li−1, the label assigned to the (i−1)th word by the sequence being scored. Its output is either 0 or 1, where 0 means that the sequence being scored does not match this feature, and 1 means that it does. For the sequences L and S, we can then construct the conditional probability distribution model:

P(L | S) = (1 / Z(S)) × exp( Σᵢ Σⱼ wⱼ · fⱼ(S, i, lᵢ, lᵢ₋₁) )

where Z(S) is the normalization factor that sums over all candidate label sequences.
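As a concrete illustration, the following is a minimal sketch of BMES-tagged CRF segmentation. The sklearn-crfsuite library, the toy training pairs and the feature template are all our own illustrative choices; the paper does not name a specific CRF implementation.

```python
# Minimal BMES segmentation sketch with sklearn-crfsuite (illustrative choice;
# the toy corpus and feature template below are hypothetical, not the paper's).
import sklearn_crfsuite

def char_features(sent, i):
    """Per-character features: the character itself and its neighbors."""
    return {
        "char": sent[i],
        "prev": sent[i - 1] if i > 0 else "<BOS>",
        "next": sent[i + 1] if i < len(sent) - 1 else "<EOS>",
    }

# Toy corpus: one BMES tag per character.
sents = ["脾胃虚弱", "不通"]
labels = [["B", "E", "B", "E"], ["B", "E"]]

X = [[char_features(s, i) for i in range(len(s))] for s in sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, labels)
print(crf.predict([[char_features("不通", i) for i in range(2)]]))
```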
On the basis of word segmentation, named entity recognition is performed with the IOB labeling scheme, as shown in Figure 1: n-grams are generated from the segmented word sequence of each sentence, a tri-gram model is used to extract features, and the features are finally fed into the CRF model to complete the labeling, as sketched below.
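The tri-gram feature template can be sketched as follows; this is our own illustrative formulation, since the paper does not list its exact feature set:

```python
def trigram_features(words, i):
    """Tri-gram features over a segmented word sequence for IOB tagging
    (an illustrative template; the paper's exact features are not given)."""
    prev_w = words[i - 1] if i > 0 else "<BOS>"
    next_w = words[i + 1] if i < len(words) - 1 else "<EOS>"
    return {
        "w": words[i],                            # unigram
        "w-1:w": prev_w + words[i],               # bigram
        "w:w+1": words[i] + next_w,               # bigram
        "w-1:w:w+1": prev_w + words[i] + next_w,  # trigram
    }
```

Each feature dictionary is then passed to the CRF model in the same way as in the segmentation sketch above.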
2.3 TF-IDF algorithm
TF-IDF (Term Frequency-Inverse Document Frequency) is a common weighting technique in information retrieval and data mining, often used to mine keywords in articles. It is a statistical method for evaluating how important a term is to a document in a document set or corpus. Term Frequency (TF) is the number or proportion of occurrences of a term in a document: the more often a term appears in a document, the more likely it is to be an important term for that document. The formulas are as follows:
Term Frequency (TF) = number of times the term appears in the document / total number of terms in the document

Inverse Document Frequency (IDF) = log(total number of documents in the corpus / (number of documents containing the term + 1))

TF-IDF = Term Frequency (TF) × Inverse Document Frequency (IDF)
The importance of a term is directly proportional to the number of times it appears in the document and inversely proportional to the number of documents in the corpus that contain it. This weighting effectively suppresses the influence of common words and improves the relevance of the extracted keywords to the article.
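The formulas above translate directly into code. Below is a minimal sketch that follows them literally, including the +1 smoothing in the IDF denominator; the toy documents are placeholders:

```python
import math

def tf_idf(term, doc, docs):
    """TF-IDF exactly as defined by the three formulas above.
    `doc` is a list of words; `docs` is the list of all documents."""
    tf = doc.count(term) / len(doc)
    n_containing = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / (n_containing + 1))
    return tf * idf

# Toy corpus (placeholder content, not the study's actual documents).
docs = [
    ["脾胃", "虚", "则", "九窍", "不通"],
    ["脾胃", "论"],
    ["医学", "源流"],
]
print(tf_idf("九窍", docs[0], docs))  # ≈ 0.2 × log(3/2)
```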
2.4 Dependency Parser Based on Neural Network
Dependency parsing helps us understand the meaning of text. Grammatical parsing is an important part of language understanding: its goal is to analyze the grammatical structure of a sentence and represent it in an interpretable form, usually a tree. Dependency grammar holds that words stand in head-dependent relations: if one word modifies another word in a sentence, the modifier is called the dependent word, and the modified word is called the dominant (head) word.
Dependency parsing based on a neural network converts the word sequence of a sentence into a graph structure by analyzing the grammatical relations within the sentence. Common grammatical relations include the verb-object relation, the left-adjunct relation, the right-adjunct relation, the coordinate relation, the attribute-head relation, the subject-verb relation, etc. Dependency grammar is a commonly used grammar system: a dependency arc connects two words in a sentence that stand in a grammatical relation, and the arcs together form a syntactic dependency tree. The dependency tree is built with a stack: starting from the root node, the words waiting in the buffer are pushed onto the stack one by one using three transitions: Shift, Left-Arc (Left-Reduce) and Right-Arc (Right-Reduce). The neural network consists of three layers: an input layer, a hidden layer, and a softmax output layer. The model comes from the 2014 paper "A Fast and Accurate Dependency Parser using Neural Networks" by Danqi Chen and Christopher D. Manning. In this study, HanLP is used to implement the neural-network dependency parser.
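For reference, dependency parsing with HanLP can be invoked as below. This sketch assumes the pyhanlp wrapper around HanLP 1.x, whose default dependency parser implements the Chen and Manning architecture; the input clause is only a placeholder:

```python
from pyhanlp import HanLP  # assumes the pyhanlp wrapper is installed

sentence = "脾胃虚则九窍不通"  # placeholder clause, not from the experiments
tree = HanLP.parseDependency(sentence)
for word in tree.iterator():
    # Each token points to its head word with a labeled dependency relation.
    print("%s --(%s)--> %s" % (word.LEMMA, word.DEPREL, word.HEAD.LEMMA))
```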
2.5 Experimental Environment
The research was carried out on a small artificial intelligence platform in the Laboratory of Traditional Chinese Medicine Informatics of Hebei University, equipped with an Intel Xeon Gold 6248R CPU @ 3.00 GHz (96 threads), 256 GB of memory, and two NVIDIA A100 80 GB GPU computing cards. The software environment was Ubuntu 18.04.6 LTS with Python 3.9.
This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.