Construction method, device and application of text matching model for automatic ICD-10 coding of clinical diagnosis text
By constructing an interactive text matching model and combining BERT and BiLSTM models with convolutional neural networks, the problem of large differences between Chinese clinical diagnostic texts and ICD-10 names was solved, achieving high-precision automatic encoding of ICD-10 and improving the robustness and accuracy of the encoding.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ZHEJIANG UNIV
- Filing Date
- 2022-08-09
- Publication Date
- 2026-06-23
AI Technical Summary
Existing automatic ICD-10 encoding algorithms struggle to effectively address the low robustness issues caused by significant differences between ICD-10 names and clinical diagnostic texts when processing Chinese clinical diagnostic texts, and they also fail to fully utilize the value of human encoding history.
A text matching model is constructed, including a pre-trained BERT model, a BiLSTM model, a matching module, an aggregation module, and a binary classification model. Through an interactive text matching algorithm, element-wise subtraction and multiplication are combined with neural networks, analogical reasoning is performed using the encoding history of previous years, and aggregation is performed using convolutional neural networks to improve the matching effect.
It significantly improved the matching accuracy between Chinese clinical diagnostic texts and ICD-10 names, enhanced the robustness and comprehensiveness of automatic ICD-10 encoding, and achieved high-precision automatic ICD-10 encoding.
Figure CN115359918B_ABST
Description
Technical Field
[0001] This invention belongs to the field of medical data standardization, and particularly relates to the standardization of clinical diagnostic text. Specifically, it relates to a method, apparatus and application for constructing a text matching model for automatic ICD-10 encoding of clinical diagnostic text.
Background Technology
[0002] Since 2002, China has been vigorously promoting the International Statistical Classification of Diseases and Related Health Problems, 10th Revision (ICD-10), which is the primary source of information for the performance evaluation of public hospitals and the settlement of social security payments for disease diagnosis. ICD-10 coding is typically undertaken by coders in hospital medical records departments. To complete this task, coders must possess knowledge of the medical field, coding rules, and medical terminology. Due to the complexity of the ICD-10 structure and the increasing number of ICD-10 codes, ICD-10 coding has become more labor-intensive and time-consuming. Even skilled coders require an average of approximately 30 minutes per code, resulting in a severe shortage of coding personnel. Given these limitations, automated ICD-10 coding algorithms can simplify the coding process and improve the efficiency of coders, which makes them worthy of in-depth research and exploration.
[0003] In recent years, with the continuous development of machine learning, a large number of new autocoding algorithms have emerged. Early research typically used supervised machine learning methods, such as support vector machines and random forests, but these methods struggle to handle clinical diagnostic texts, which are noisy and highly redundant because doctors express the same diagnosis in many different ways. Recent research often applies deep learning, such as bidirectional recurrent neural networks and bidirectional gated recurrent networks, to ICD autocoding. These methods have improved the accuracy of existing ICD-10 autocoding algorithms to some extent. However, most deep learning-based autocoding research focuses on English ICD-10. Due to differences in language features, methods for English texts cannot be directly applied to Chinese texts. Furthermore, because ICD-10 is a clinical classification standard rather than a clinical nomenclature standard, ICD-10 standard names are usually coarse-grained, resulting in significant differences between clinical diagnostic texts and ICD-10 names. For example, a clinical diagnosis of "burns with residual wounds covering 5% of the trunk and left upper and lower extremities" needs to be coded as "T31.001," while the corresponding ICD-10 name is "mild burns covering less than 10% of the body surface area." As a result, few automatic coding algorithms exist for Chinese, and those that do have difficulty matching texts that differ so markedly, so their accuracy still needs to be improved.
[0004] ICD-10 coding has been used for many years, and many hospitals have stored a large amount of human ICD-10 coding history. However, most existing automatic coding algorithms only use these coding histories as the gold standard and training set, without fully exploring the value of this valuable human experience.
[0005] Therefore, there is an urgent need for a text matching model for automatic ICD-10 encoding of clinical diagnostic texts to deeply match previous encoding history and unencoded clinical diagnostic texts, in order to solve the problem of low robustness of existing automatic encoding models caused by the large differences between clinical diagnostic texts and ICD-10 names.
Summary of the Invention
[0006] In view of the above, the purpose of this invention is to provide a method, apparatus and application for constructing a text matching model for automatic ICD-10 encoding of clinical diagnostic text, thereby improving the robustness of ICD-10 encoding by solving the problem of large differences between clinical diagnostic text and ICD-10 names.
[0007] To achieve the above-mentioned objectives, an embodiment provides a method for constructing a text matching model for automatic ICD-10 encoding of clinical diagnostic text, comprising the following steps:
[0008] Step 1: Prepare a matching set containing cleaned clinical diagnosis texts and their corresponding ICD-10 disease classification name texts, and an equally sized non-matching set containing cleaned clinical diagnosis texts paired with unrelated ICD-10 disease classification name texts; together they form the training set.
[0009] Step 2: Construct a text matching model for clinical diagnosis text and ICD-10 disease classification name text, including a pre-trained BERT model, a BiLSTM model, a matching module, an aggregation module, and a binary classification model. The pre-trained BERT model is used to extract vector representations of the clinical diagnosis text and the ICD-10 disease classification name text. The BiLSTM model is used to extract the time step representations of the clinical diagnosis text and the ICD-10 disease classification name text based on the vector representations. The matching module is used to perform multi-angle weighted matching of all time step representations of the clinical diagnosis text and all time step representations of the ICD-10 disease classification name text to obtain a set of matching vectors. The aggregation module is used to aggregate the set of matching vectors to obtain the aggregation result. The binary classification model is used to perform binary classification based on the aggregation result to obtain the matching prediction probability of whether the two texts match.
[0010] Step 3: Use the cross-entropy between the output matching prediction probability and the ground-truth label indicating whether the two texts match as the loss function, and use the training set to optimize the parameters of the text matching model to obtain a text matching model that can determine whether clinical diagnosis text matches ICD-10 disease classification name text.
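The balanced training set of Step 1 can be illustrated with a minimal Python sketch. It assumes that an "unrelated" name can be drawn at random from all ICD-10 names other than the correct one; the text does not say how unrelated names are chosen, and the function name `build_training_pairs` is hypothetical.

```python
import random

def build_training_pairs(matched, icd_names, seed=0):
    """Return (diagnosis, icd_name, label) triples with equal positives and negatives.

    matched   : list of (diagnosis_text, correct_icd_name) pairs
    icd_names : list of all ICD-10 disease classification names
    Assumption: a negative is any name other than the correct one, drawn at random.
    """
    rng = random.Random(seed)
    pairs = [(diag, name, 1) for diag, name in matched]  # matching set
    for diag, name in matched:                           # one negative per positive
        others = [n for n in icd_names if n != name]
        pairs.append((diag, rng.choice(others), 0))      # non-matching set
    rng.shuffle(pairs)
    return pairs
```

Drawing one negative per positive keeps the two sets the same size, as Step 1 requires.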
[0011] In one embodiment, the matching module includes attention weight calculation operation, weighted summation operation, element-wise subtraction operation, element-wise multiplication operation, and fusion operation, wherein:
[0012] The attention weight calculation operation is used to determine the importance of different characters in the text to the matching prediction probability. The cosine similarity between a single time step representation of the clinical diagnosis text and all time step representations of the ICD-10 disease classification name text is calculated as the attention weight.
[0013] The weighted summation operation is used to calculate the weighted summation of all time step representations of the ICD-10 disease classification name text with the corresponding attention weights to obtain the weighted summation representation;
[0014] The element-wise subtraction operation is used to subtract the single time step representation of a clinical diagnosis text from its weighted sum representation element-wise to obtain the subtraction result.
[0015] The element-wise multiplication operation is used to multiply the single time step representation of the clinical diagnosis text with the weighted summation representation element-wise to obtain the multiplication result;
[0016] The fusion operation uses a neural network to compress and fuse the subtraction and multiplication results, and obtains the matching results of a single time step representation of the clinical diagnosis text with all time step representations of the ICD-10 disease classification name text as a matching vector.
[0017] Every time step representation of the clinical diagnostic text is processed through the attention weight calculation, weighted summation, element-wise subtraction, element-wise multiplication, and fusion operations, and the resulting matching vectors together form the set of matching vectors.
[0018] In one embodiment, the aggregation module employs a convolutional neural network. For all matching vectors, the convolutional neural network uses convolutional kernels with the same weights to perform aggregation calculations to obtain the aggregation result.
[0019] In one embodiment, the binary classification model includes a fully connected layer and a sigmoid function layer, used to output the matching prediction probability of whether two texts match.
[0020] To achieve the above-mentioned objectives, the embodiments also provide an apparatus for constructing a text matching model for automatic ICD-10 encoding of clinical diagnostic text, characterized in that it includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the above-mentioned method for constructing a text matching model for automatic ICD-10 encoding of clinical diagnostic text.
[0021] To achieve the above-mentioned objectives, the embodiments also provide a method for automatically encoding clinical diagnostic text using ICD-10, the method utilizing a text matching model constructed by any of the above-described construction methods, and including the following steps:
[0022] Step 1: Prepare a priority matching set containing clinical diagnostic texts that have already been manually ICD-10 encoded;
[0023] Step 2: Obtain the clinical diagnosis text to be encoded and clean the clinical diagnosis text;
[0024] Step 3: Using the already manually ICD-10 encoded clinical diagnostic text, perform ICD-10 encoding matching prediction for the clinical diagnostic text to be encoded through a text matching model, and output the matching prediction probability;
[0025] Step 4: If, based on the matching prediction probability, there is a clinical diagnostic text that has been manually encoded with ICD-10 and is a priority match, then the ICD-10 code corresponding to the clinical diagnostic text that has been manually encoded with ICD-10 and is a priority match is used as the ICD-10 code of the clinical diagnostic text to be encoded.
[0026] In step 3 of one embodiment, using the already manually ICD-10 encoded clinical diagnostic text, a text matching model is used to perform ICD-10 encoding matching prediction for the clinical diagnostic text to be encoded, including:
[0027] Step 3-1: Input the cleaned clinical diagnostic text to be encoded and each manually ICD-10 encoded clinical diagnostic text in the priority matching set into the BERT model to extract the vector representation of the clinical diagnostic text to be encoded and the vector representation of the manually ICD-10 encoded clinical diagnostic text.
[0028] Step 3-2: Input the vector representation of the clinical diagnostic text to be encoded and the vector representation of the clinical diagnostic text that has been manually encoded by ICD-10 into the BiLSTM model to extract the temporal information in the text, so as to obtain the representation of each time step of the clinical diagnostic text to be encoded and the representation of each time step of the clinical diagnostic text that has been manually encoded by ICD-10.
[0029] Step 3-3: The matching module inputs all time step representations of the clinical diagnostic text to be encoded one by one, and performs weighted matching with all time step representations of a single clinical diagnostic text that has already been manually ICD-10 encoded to obtain a set of matching vectors;
[0030] Step 3-4: Input the set of matching vectors into the aggregation module for aggregation, then input the aggregation result into the binary classification model, which outputs the matching prediction result;
[0031] Step 3-5: Repeat steps 3-3 and 3-4 until the clinical diagnostic text to be encoded has been matched and predicted against all manually ICD-10 encoded clinical diagnostic texts in the priority matching set.
[0032] In one embodiment, the method for automatically encoding clinical diagnostic text using ICD-10 further includes preparing a candidate matching set containing ICD-10 disease classification name texts;
[0033] When the matching prediction probability determines that there is no matching clinical diagnosis text that has been manually ICD-10 encoded in the priority matching set, the cleaned clinical diagnosis text to be encoded is matched with the ICD-10 disease classification name text in the candidate matching set. That is, each clinical diagnosis text that has been manually ICD-10 encoded is replaced with the ICD-10 disease classification name text. Step 3 is executed until the diagnosis text to be encoded is matched with all ICD-10 disease classification name texts in the candidate matching set. If the matching prediction probability determines that there is a matching ICD-10 disease classification name text, the ICD-10 code corresponding to the matching ICD-10 disease classification name text is used as the ICD-10 code of the clinical diagnosis text to be encoded.
[0034] The beneficial effects of the technical solutions provided in the above embodiments include at least the following:
[0035] First, by processing the hospital coding history from previous years into a priority matching set, the coding results of uncoded clinical diagnostic texts are inferred by analogy with the manual coding history from previous years. Then, an automatic ICD-10 coding algorithm based on interactive text matching is implemented. During the matching process, each character in the two texts is interactively matched. A matching strategy using element-wise subtraction and element-wise multiplication through a neural network is employed, preserving more original information and achieving better matching results. In the aggregation process, a convolutional neural network is used, employing convolutional kernels with equal weights to merge features and retain even more original information, further improving the matching effect. Finally, data that cannot be matched with the coding history from previous years can still be matched a second time with the ICD-10 disease classification names to ensure comprehensive matching.
Attached Figure Description
[0036] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0037] Figure 1 is a flowchart of the method for constructing a text matching model for automatic ICD-10 encoding of clinical diagnostic text, as provided in the embodiment;
[0038] Figure 2 is a schematic diagram of the text matching model provided in the embodiment;
[0039] Figure 3 is a flowchart of the matching module provided in the embodiment;
[0040] Figure 4 is a flowchart of the method for automatically encoding clinical diagnostic text using ICD-10, as provided in the embodiment.
Detailed Implementation
[0041] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the scope of protection of this invention.
[0042] To address the low robustness of existing automatic coding models caused by the significant discrepancy between clinical diagnostic texts and ICD-10 names, this embodiment proposes to perform text matching between uncoded clinical diagnostic texts and historically encoded texts from previous years. This allows for the inference of the encoding result of the uncoded clinical diagnostic texts based on the experience of manual coding. Analogical reasoning, a key concept in automatic ICD-10 coding of clinical diagnostic texts, refers to inferring that two clinical diagnostic texts have the same ICD-10 code based on their matching or similarity. Analogical reasoning can significantly leverage the advantages of text matching algorithms. Traditional ICD-10 coding algorithms struggle to achieve accurate ICD-10 coding due to the large discrepancy between clinical diagnostic texts and standard ICD-10 names. Using analogical reasoning effectively addresses this issue, enabling high-precision automatic ICD-10 coding through an interactive text matching algorithm.
[0043] To address the significant discrepancy between clinical diagnostic text and ICD-10 nomenclature, this embodiment provides a method for constructing a text matching model for automatic ICD-10 encoding of clinical diagnostic text. As shown in Figure 1, the method includes the following steps:
[0044] Step 1: Prepare a matching set containing cleaned clinical diagnosis texts and their corresponding ICD-10 disease classification name texts, and an equally sized non-matching set containing cleaned clinical diagnosis texts paired with unrelated ICD-10 disease classification name texts; together they form the training set.
[0045] In the embodiments, to ensure that there are no quality issues with the clinical diagnosis texts in the matching set and the non-matching set, the clinical diagnosis texts must be cleaned. Cleaning includes unifying text delimiters, removing text numbering (in various formats), removing meaningless words, standardizing common expressions, and screening for typos. Text cleaning has a great impact on subsequent text matching, and complete and comprehensive cleaning of the clinical diagnosis text helps to improve matching performance. Specifically: unifying text delimiters means using rules to process the diverse delimiters in the text, including ";", ",", ".", "?", " / ", "\", "", "|"; removing text numbering means removing the diverse numbering markers in the text, including "1", "one", "①", "Ⅰ"; removing meaningless words means removing English letters, special symbols, and dates; standardizing common expressions means automatically replacing non-standard words using a total of 21,264 manually reviewed pairs of disease diagnosis synonyms extracted from common medical terms, various versions of ICD-10, MedDRA, and redirect data from Wikipedia and Baidu Encyclopedia; screening for typos uses FASPell to correct Chinese spelling errors, which involves training a BERT-based denoising autoencoder and a decoder based on confidence and on phonetic and glyph similarity. The clinical diagnosis texts in the matching set have a matching relationship with their corresponding ICD-10 disease classification name texts, and those in the non-matching set have a non-matching relationship with the unrelated ICD-10 disease classification name texts. These matching and non-matching relationships serve as the supervision labels for training.
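The rule-based part of this cleaning can be sketched roughly as follows. The regular expressions, the choice of ";" as the unified delimiter, and the name `clean_diagnosis` are illustrative assumptions, not the patent's actual rules. Spelled-out numbering (such as "one") and FASPell typo screening are not covered, and the synonym table is passed in as a plain dictionary.

```python
import re

# Hypothetical rule set: the patent does not publish its exact patterns.
_DELIM_CLASS = r";；,，.。?？|/\\"
# Numbering markers such as "1.", "2)", "①", "Ⅰ." at the start of a segment.
_ENUM = re.compile(rf"(?:^|(?<=[{_DELIM_CLASS}]))\s*(?:\d+[.)、]|[①-⑳]|[Ⅰ-Ⅻ][.)、]?)\s*")
# Dates (YYYY-MM-DD), runs of English letters, and a few special symbols.
_NOISE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}|[A-Za-z]+|[#*@~^]")
_DELIM = re.compile(rf"\s*[{_DELIM_CLASS}]+\s*")

def clean_diagnosis(text, synonyms=None):
    """Apply noise removal, numbering removal, delimiter unification, synonym replacement."""
    text = _NOISE.sub("", text)                   # remove meaningless tokens
    text = _ENUM.sub("", text)                    # remove numbering markers
    text = _DELIM.sub(";", text).strip("; ")      # unify delimiters to ";"
    for variant, standard in (synonyms or {}).items():
        text = text.replace(variant, standard)    # standardize common expressions
    return text
```

For instance, under these assumed rules `clean_diagnosis("1.高血压；2.糖尿病, ①冠心病")` yields `"高血压;糖尿病;冠心病"`.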
[0046] Step 2, construct a text matching model for the to-be-encoded clinical diagnosis texts and the ICD-10 disease classification name texts.
[0047] As shown in Figure 2, the text matching model includes a pre-trained BERT model, a BiLSTM model, a matching module, an aggregation module, and a binary classification model.
[0048] In the embodiments, the pre-trained BERT model refers to loading a BERT model that has already been pre-trained. Specifically, data such as Chinese Wikipedia and other encyclopedias, news, and Q&A are used to pre-train the BERT model. During the optimization of the text matching model's parameters, the parameters of the BERT model are fine-tuned. For extracting word vectors, the pre-trained BERT model improves on traditional models such as FastText, GloVe, Word2Vec, Word2gm, and prob-fasttext.
[0049] This pre-trained BERT model is used to extract vector representations of clinical diagnostic texts and ICD-10 disease classification name texts. Specifically, the clinical diagnostic text and all ICD-10 disease classification name texts are input into the pre-trained BERT model, which then calculates the vector representations of the clinical diagnostic text and all ICD-10 disease classification name texts. It should be noted that the vector representations are character vectors for individual Chinese characters; no word segmentation of the text is involved.
[0050] In this embodiment, the BiLSTM model is used to extract the time step representation of the clinical diagnosis text and the time step representation of the ICD-10 disease classification name text based on the vector representation of the clinical diagnosis text and the vector representation of the ICD-10 disease classification name text, respectively. Each time step representation is the word representation of each Chinese character. This word representation combines the bidirectional temporal information in the text, which is an improvement over traditional RNN, GRU and LSTM methods.
[0051] In this embodiment, the matching module is used to perform multi-angle weighted matching of all time step representations of the clinical diagnosis text with all time step representations of the ICD-10 disease classification name text, resulting in a set of matching vectors. As shown in Figure 3, the matching module includes an attention weight calculation operation, a weighted summation operation, an element-wise subtraction operation, an element-wise multiplication operation, and a fusion operation, wherein:
[0052] The attention weight calculation operation is used to determine the importance of different characters in the text to the matching prediction probability. The cosine similarity $a_{i,j}$ between a single time step representation $h_i^{d}$ of the clinical diagnosis text and each time step representation $h_j^{c}$ of the ICD-10 disease classification name text is calculated as the attention weight, expressed by the following formula:
[0053] $$a_{i,j} = \cos\left(h_i^{d},\, h_j^{c}\right), \quad j = 1, \dots, N$$
[0054] Where i represents the index of the time step representation of the clinical diagnosis text, j represents the index of the time step representation of the ICD-10 disease classification name text, and N represents the total number of time steps of the ICD-10 disease classification name text.
[0055] The weighted summation operation is used to calculate the weighted sum of the time step representations $h_j^{c}$ of the ICD-10 disease classification name text with the corresponding attention weights $a_{i,j}$ to obtain the weighted summation representation $\bar{h}_i$. Expressed as a formula:
[0056] $$\bar{h}_i = \sum_{j=1}^{N} a_{i,j}\, h_j^{c}$$
[0057] The element-wise subtraction operation is used to subtract the single time step representation $h_i^{d}$ of the clinical diagnosis text from its weighted summation representation $\bar{h}_i$ element-wise to obtain the subtraction result $s_i$. Expressed as a formula:
[0058] $$s_i = \bar{h}_i - h_i^{d}$$
[0059] The element-wise multiplication operation is used to multiply the single time step representation $h_i^{d}$ of the clinical diagnosis text element-wise with the weighted summation representation $\bar{h}_i$ to obtain the multiplication result $m_i$. Expressed as a formula:
[0060] $$m_i = h_i^{d} \odot \bar{h}_i$$
[0061] The fusion operation uses a neural network to compress and fuse the subtraction result $s_i$ and the multiplication result $m_i$, obtaining, as a matching vector, the result of matching the single time step representation of the clinical diagnosis text against all time step representations of the ICD-10 disease classification name text. The fusion operation can employ a neural network containing one input layer, two hidden layers, and one output layer. All time step representations of the clinical diagnosis text are processed through attention weight calculation, weighted summation, element-wise subtraction, element-wise multiplication, and fusion operations to obtain a set of matching vectors.
[0062] By first performing element-wise subtraction and element-wise multiplication on the two text representations, and then fusing the two results through a neural network, the original information about the different dimensions of the two text representations is preserved compared to traditional methods.
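A minimal pure-Python sketch of the matching module follows, intended only to make the data flow concrete. The function names, the ReLU activation, the concatenation of the two results before fusion, and the unnormalized weighted sum are assumptions; the patent only states that a neural network compresses and fuses the subtraction and multiplication results.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def match_features(h_i, name_steps):
    """Attention weights, weighted sum, then element-wise subtraction and multiplication."""
    weights = [cosine(h_i, h_j) for h_j in name_steps]
    weighted = [sum(w * h_j[k] for w, h_j in zip(weights, name_steps))
                for k in range(len(h_i))]
    sub = [wk - hk for hk, wk in zip(h_i, weighted)]  # weighted sum minus time step
    mul = [hk * wk for hk, wk in zip(h_i, weighted)]  # element-wise product
    return sub, mul

def fuse(sub, mul, layers):
    """Feed-forward fusion: ReLU on hidden layers, linear output (assumed design).

    layers is a list of (weight_rows, biases); its contents are hypothetical.
    """
    x = sub + mul  # concatenate the two results
    for idx, (rows, biases) in enumerate(layers):
        x = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(rows, biases)]
        if idx < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x

def matching_vectors(diag_steps, name_steps, layers):
    """One matching vector per time step of the clinical diagnosis text."""
    return [fuse(*match_features(h, name_steps), layers) for h in diag_steps]
```

In practice the time step representations would come from the BiLSTM and the fusion weights would be learned; here they are plain lists supplied by the caller.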
[0063] In this embodiment, the aggregation module is used to aggregate the set of matching vectors to obtain an aggregation result. The matching module performs only local matching of the text, so the local results need to be aggregated to obtain the final matching result. Specifically, the aggregation module uses a convolutional neural network to perform aggregation calculations on all matching vectors using convolutional kernels with the same weights, obtaining an aggregation result that retains as much of the original information as possible. For example, the aggregation module can use a convolutional neural network with a 3×3 kernel and average pooling.
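The aggregation step can be sketched as a single shared 3×3 kernel sliding over the stacked matching vectors, followed by average pooling. This sketch reads "kernels with the same weights" as ordinary weight sharing across positions; the single channel, the "valid" padding, and the global (rather than windowed) average pooling are assumptions.

```python
def conv2d_valid(matrix, kernel):
    """Slide one shared kernel over a 2-D matrix (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    rows, cols = len(matrix), len(matrix[0])
    return [[sum(matrix[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(cols - kw + 1)]
            for r in range(rows - kh + 1)]

def aggregate(matching_vectors, kernel):
    """Convolve the stacked matching vectors, then apply global average pooling."""
    feature_map = conv2d_valid(matching_vectors, kernel)
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)
```

The input is the set of matching vectors stacked row by row (time steps × vector dimension), and it must be at least as large as the kernel in both dimensions.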
[0064] In this embodiment, a binary classification model is used to perform binary classification based on the aggregation result to obtain the matching prediction probability of whether two texts match. Specifically, the binary classification model includes a fully connected layer and a sigmoid function layer, which are used to output the matching prediction probability of whether two texts match based on the aggregation result.
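As a rough illustration, a fully connected layer with a single output followed by a sigmoid can be written as below. The 0.5 decision threshold and the names `match_probability` and `is_match` are assumptions; the text does not state how the probability is turned into a match decision.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def match_probability(features, weights, bias):
    """Fully connected layer with one output unit, followed by a sigmoid."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)

def is_match(probability, threshold=0.5):
    """Assumed decision rule: treat the pair as matching at or above the threshold."""
    return probability >= threshold
```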
[0065] Step 3: Optimize the parameters of the text matching model in a supervised manner using the prepared training set.
[0066] In this embodiment, the cross-entropy between the output matching prediction probability and the ground-truth label indicating whether the two texts match is used as the loss function. The parameters of the text matching model are optimized using the training set to obtain a text matching model that can determine whether clinical diagnosis text matches ICD-10 disease classification name text.
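For reference, the binary cross-entropy over a batch of predicted probabilities p and labels y (1 for a matching pair, 0 for a non-matching pair) is the mean of −[y·log p + (1 − y)·log(1 − p)]. A small sketch, where the clipping constant `eps` is an added assumption to avoid log(0):

```python
import math

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep the logarithms finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)
```

Confident correct predictions give a loss near zero, while confident wrong predictions are penalized heavily.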
[0067] The resulting text matching model performs interactive matching on each character in the two texts during the matching process. It adopts a matching strategy of element-wise subtraction and element-wise multiplication through a neural network, which preserves more original information and achieves better matching results. In the aggregation process, a convolutional neural network is used to merge features using convolutional kernels with the same weights to retain more original information and further improve the matching effect.
[0068] Based on the same inventive concept, the embodiment also provides an apparatus for constructing a text matching model for automatic ICD-10 encoding of clinical diagnostic text, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the above-described method for constructing a text matching model for automatic ICD-10 encoding of clinical diagnostic text, including the following steps:
[0069] Step 1: Prepare a matching set containing cleaned clinical diagnosis texts and their corresponding ICD-10 disease classification name texts, and an equally sized non-matching set containing cleaned clinical diagnosis texts paired with unrelated ICD-10 disease classification name texts; together they form the training set.
[0070] Step 2: Construct a text matching model for clinical diagnostic text and ICD-10 disease classification name text.
[0071] Step 3: Optimize the parameters of the text matching model in a supervised manner using the prepared training set.
[0072] It should be noted that in practical applications, the memory can be volatile memory located at the local end, such as RAM, or non-volatile memory, such as ROM, FLASH, floppy disks, hard disks, etc., or even a remote storage cloud. The processor can be a central processing unit (CPU), microprocessor (MPU), digital signal processor (DSP), or field-programmable gate array (FPGA). That is, these processors can be used to implement the steps of constructing the text matching model for automatic ICD-10 encoding of clinical diagnostic texts.
[0073] To address the significant discrepancy between clinical diagnostic texts and ICD-10 nomenclature, this embodiment also provides a method for automatically encoding clinical diagnostic texts using ICD-10. This method utilizes the text matching model constructed using the aforementioned method. As shown in Figure 4, it includes the following steps:
[0074] Step 1: Prepare a priority matching set consisting of clinical diagnostic texts that have already been manually ICD-10 encoded, and a candidate matching set containing ICD-10 disease classification name texts.
[0075] Step 2: Obtain the clinical diagnosis text to be encoded and clean the clinical diagnosis text.
[0076] Step 3: Using the already manually ICD-10 encoded clinical diagnostic text, a text matching model is used to perform ICD-10 encoding matching predictions for the clinical diagnostic text to be encoded, and the matching prediction probability is output. Specifically, this includes:
[0077] Step 3-1: Input the cleaned clinical diagnosis text and each manually ICD-10 encoded clinical diagnosis text in the priority matching set into the BERT model to extract the vector representation of the clinical diagnosis text to be encoded and the vector representation of the manually ICD-10 encoded clinical diagnosis text.
[0078] Step 3-2: Input the vector representation of the clinical diagnostic text to be encoded and the vector representation of the clinical diagnostic text that has been manually encoded by ICD-10 into the BiLSTM model to extract the temporal information in the text, so as to obtain the representation of each time step of the clinical diagnostic text to be encoded and the representation of each time step of the clinical diagnostic text that has been manually encoded by ICD-10.
[0079] Step 3-3: The matching module inputs all time step representations of the clinical diagnostic text to be encoded one by one, and performs weighted matching with all time step representations of a single clinical diagnostic text that has already been manually ICD-10 encoded to obtain a set of matching vectors;
[0080] Step 3-4: Input the set of matching vectors into the aggregation module for aggregation, then input the aggregation result into the binary classification model, which outputs the matching prediction result;
[0081] Step 3-5: Repeat steps 3-3 and 3-4 until the clinical diagnostic text to be encoded has been matched and predicted against all manually ICD-10 encoded clinical diagnostic texts in the priority matching set.
[0082] Step 4: If the matching prediction probability indicates that the priority matching set contains a matching manually ICD-10 encoded clinical diagnostic text, the ICD-10 code of that matched text is used as the ICD-10 code of the clinical diagnostic text to be encoded.
[0083] Step 5: If the matching prediction probability indicates that the priority matching set contains no matching manually ICD-10 encoded clinical diagnostic text, the cleaned clinical diagnostic text is matched against the ICD-10 disease classification name texts in the candidate matching set. That is, step 3 is executed with the ICD-10 disease classification name texts in place of the manually ICD-10 encoded clinical diagnostic texts, until the text to be encoded has been matched against every ICD-10 disease classification name text in the candidate matching set. If the matching prediction probability indicates that a matching ICD-10 disease classification name text exists, the ICD-10 code corresponding to that name text is used as the ICD-10 code of the clinical diagnostic text to be encoded.
[0084] In step 5, ICD-10 disease classification names are used as the candidate matching set because the coding history of previous years has limited coverage: some texts to be coded cannot be resolved by comparison and analogical reasoning against that history. Supplementary matching against the standard ICD-10 disease names therefore makes the automatic ICD-10 coding model more comprehensive and improves the coding effect.
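The fallback logic of steps 3 to 5 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `scorer` stands in for the trained text matching model, and the function names, the 0.5 threshold, and the rule of taking the highest-probability entry are all hypothetical, since the text only states that the decision is based on the matching prediction probability.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical type: a scorer returns the probability that two texts match.
MatchScorer = Callable[[str, str], float]


def best_match(query: str,
               pool: Sequence[Tuple[str, str]],
               scorer: MatchScorer,
               threshold: float) -> Optional[str]:
    """Return the ICD-10 code of the highest-scoring entry in `pool`,
    or None if no entry reaches `threshold`.
    `pool` holds (text, icd10_code) pairs."""
    best_code, best_prob = None, threshold
    for text, code in pool:
        prob = scorer(query, text)
        if prob >= best_prob:
            best_code, best_prob = code, prob
    return best_code


def encode(query: str,
           priority_set: Sequence[Tuple[str, str]],
           candidate_set: Sequence[Tuple[str, str]],
           scorer: MatchScorer,
           threshold: float = 0.5) -> Optional[str]:
    """Try the manually coded history first (steps 3 and 4), then fall
    back to the ICD-10 disease classification names (step 5)."""
    code = best_match(query, priority_set, scorer, threshold)
    if code is not None:
        return code
    return best_match(query, candidate_set, scorer, threshold)


def exact_scorer(a: str, b: str) -> float:
    """Toy stand-in for the model: 1.0 on identical strings, else 0.0."""
    return 1.0 if a == b else 0.0


# Illustrative data only.
PRIORITY = [("t2dm", "E11.9")]
CANDIDATES = [("essential hypertension", "I10")]
```

With the toy scorer, `encode("t2dm", PRIORITY, CANDIDATES, exact_scorer)` resolves from the priority set, while `encode("essential hypertension", PRIORITY, CANDIDATES, exact_scorer)` falls through to the candidate set.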
[0085] In this embodiment, text matching based on the text matching model abandons the late-matching approach of traditional models. Because the global matching degree depends on the local matching degree, word-level matching is performed first in the matching module, and the matching results are then treated like grayscale images for subsequent modeling. This captures the semantic focus, overcomes the semantic and structural limitations of traditional matching algorithms, and models the importance of context appropriately, yielding excellent matching performance. The interactive text matching model therefore provides a better solution for matching the text to be encoded against the coding history of previous years in automatic ICD-10 encoding.
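The word-level matching performed by the matching module (cosine-similarity attention weights, weighted summation, element-wise subtraction and multiplication, then fusion) can be illustrated with a small pure-Python sketch. Several details are assumptions made only for illustration: the fusion network is reduced to a single ReLU layer with hypothetical parameters `fuse_w` and `fuse_b`, the subtraction direction is chosen arbitrarily, and the raw cosine similarities are used as weights without any further normalization.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_step(h: Vector, others: List[Vector],
               fuse_w: List[Vector], fuse_b: Vector) -> Vector:
    """Match one time step `h` of the first text against all time steps
    of the second text and return one matching vector."""
    # Attention weights: cosine similarity with each time step of the other text.
    weights = [cosine(h, o) for o in others]
    # Weighted sum of the other text's time step representations.
    summed = [sum(w * o[i] for w, o in zip(weights, others))
              for i in range(len(h))]
    # Element-wise comparison (subtraction direction is an assumption).
    sub = [a - b for a, b in zip(h, summed)]
    mul = [a * b for a, b in zip(h, summed)]
    # Fusion: one ReLU layer over the concatenated results (assumed form).
    features = sub + mul
    return [max(0.0, sum(w * x for w, x in zip(row, features)) + b)
            for row, b in zip(fuse_w, fuse_b)]


def match_all(text_a: List[Vector], text_b: List[Vector],
              fuse_w: List[Vector], fuse_b: Vector) -> List[Vector]:
    """Produce the set of matching vectors, one per time step of `text_a`."""
    return [match_step(h, text_b, fuse_w, fuse_b) for h in text_a]
```

Applying `match_all` to every time step of the text to be encoded yields the set of matching vectors that the aggregation module then consumes.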
[0086] Experimental Example
[0087] A specific experiment was conducted using the text matching model construction method described above for automatic ICD-10 encoding of clinical diagnostic texts to verify the effectiveness of the method. In the experimental example, the experimental data came from the manual coding results of the medical records department coders of a hospital in Hainan from February 1, 2016 to August 30, 2021.
[0088] Four classic text matching algorithms were selected as baselines for the experiments: Deep Structured Semantic Model (DSSM), Convolutional Neural Network (ConvNet), Enhanced LSTM (ESIM), and Attention-Based Convolutional Neural Network (ABCNN). A total of 162,085 manually encoded clinical diagnostic texts were used as matching pairs, with 143,680 in the training set, 9,214 in the validation set, and 9,191 in the test set. An equal number of non-matching negative samples were randomly generated. The experimental results are shown in Table 1. The text matching model achieved an accuracy of 0.988, a precision of 0.985, a recall of 0.982, and an F1 score of 0.983. These are the highest accuracy, precision, and F1 score among the compared algorithms; only the recall of ESIM (0.986) is higher. The interactive text matching model fully compares the two texts and exploits the rich correlation between them, and thus improves on traditional text matching algorithms.
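The split sizes reported above sum to the stated total (143,680 + 9,214 + 9,191 = 162,085). The negative sampling step can be sketched as follows. This is only one plausible reading: the text says negatives were randomly generated in equal number but does not say how, so the function name `add_negatives`, the one-negative-per-positive pairing, and the fixed seed are assumptions.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]              # (diagnosis text, matching text)
Labelled = Tuple[str, str, int]     # (diagnosis text, other text, label)


def add_negatives(positives: List[Pair], seed: int = 0) -> List[Labelled]:
    """For each matching pair, emit the pair with label 1 plus one randomly
    drawn non-matching pair with label 0, giving a balanced data set."""
    rng = random.Random(seed)
    names = sorted({name for _, name in positives})
    if len(names) < 2:
        raise ValueError("need at least two distinct texts to sample negatives")
    labelled: List[Labelled] = []
    for text, name in positives:
        labelled.append((text, name, 1))
        # Assumption: any other text in the pool counts as a non-match.
        wrong = rng.choice([n for n in names if n != name])
        labelled.append((text, wrong, 0))
    return labelled
```

A real pipeline would also need to guard against sampling a text that is a legitimate alternative match for the same diagnosis; the sketch ignores that case.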
[0089] Table 1 Experimental Results
[0090]
| Algorithm | Accuracy | Precision | Recall | F1 score |
| --- | --- | --- | --- | --- |
| DSSM | 0.723 | 0.722 | 0.844 | 0.778 |
| ConvNet | 0.952 | 0.958 | 0.959 | 0.958 |
| ESIM | 0.971 | 0.965 | 0.986 | 0.976 |
| ABCNN | 0.975 | 0.979 | 0.977 | 0.978 |
| Text matching model | 0.988 | 0.985 | 0.982 | 0.983 |
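As a quick consistency check on Table 1, the F1 score can be recomputed from the second and third value columns, read here as precision and recall (the reading that agrees with the figures quoted in the text). The snippet below is only a sanity check written for this purpose; the names `TABLE_1`, `f1_score`, and `max_f1_gap` are illustrative.

```python
# (precision, recall, reported F1) for each row of Table 1.
TABLE_1 = {
    "DSSM": (0.722, 0.844, 0.778),
    "ConvNet": (0.958, 0.959, 0.958),
    "ESIM": (0.965, 0.986, 0.976),
    "ABCNN": (0.979, 0.977, 0.978),
    "Text matching model": (0.985, 0.982, 0.983),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def max_f1_gap(rows=TABLE_1) -> float:
    """Largest absolute gap between recomputed and reported F1."""
    return max(abs(f1_score(p, r) - f) for p, r, f in rows.values())
```

For every row the recomputed value differs from the reported one by less than 0.001, which is within the rounding of the three-decimal inputs.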
[0091] The specific embodiments described above illustrate the technical solution and beneficial effects of the present invention in detail. It should be understood that the above description is only the most preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, additions, and equivalent substitutions made within the scope of the principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A method for constructing a text matching model for automatic ICD-10 encoding of clinical diagnostic text, characterized in that it includes the following steps:
Step 1: Prepare a matching set containing cleaned clinical diagnosis texts paired with their corresponding ICD-10 disease classification name texts, and a non-matching set of equal size containing cleaned clinical diagnosis texts paired with unrelated ICD-10 disease classification name texts, to form a training set.
Step 2: Construct a text matching model for clinical diagnosis text and ICD-10 disease classification name text, including a pre-trained BERT model, a BiLSTM model, a matching module, an aggregation module, and a binary classification model. The pre-trained BERT model is used to extract vector representations of the clinical diagnosis text and the ICD-10 disease classification name text. The BiLSTM model is used to extract the time step representations of the clinical diagnosis text and the ICD-10 disease classification name text based on the vector representations. The matching module is used to perform multi-angle weighted matching of all time step representations of the clinical diagnosis text and all time step representations of the ICD-10 disease classification name text to obtain a set of matching vectors. The aggregation module is used to aggregate the set of matching vectors to obtain the aggregation result. The binary classification model is used to perform binary classification based on the aggregation result to obtain the matching prediction probability of whether the two texts match.
The matching module includes attention weight calculation, weighted summation, element-wise subtraction, element-wise multiplication, and fusion operations, wherein:
The attention weight calculation operation is used to determine the importance of different characters in the text to the matching prediction probability: the cosine similarity between a single time step representation of the clinical diagnosis text and each time step representation of the ICD-10 disease classification name text is calculated as the attention weight.
The weighted summation operation is used to sum all time step representations of the ICD-10 disease classification name text, weighted by the corresponding attention weights, to obtain the weighted summation representation.
The element-wise subtraction operation is used to subtract the single time step representation of the clinical diagnosis text from its weighted summation representation element-wise to obtain the subtraction result.
The element-wise multiplication operation is used to multiply the single time step representation of the clinical diagnosis text with the weighted summation representation element-wise to obtain the multiplication result.
The fusion operation uses a neural network to compress and fuse the subtraction and multiplication results, and obtains a matching vector as the matching result of a single time step representation of the clinical diagnosis text and all time step representations of the ICD-10 disease classification name text.
All time step representations of the clinical diagnosis text are processed through the attention weight calculation, weighted summation, element-wise subtraction, element-wise multiplication, and fusion operations, and all the resulting matching vectors form the set of matching vectors.
Step 3: Use the cross-entropy between the output matching prediction probability and the label indicating whether the two texts match as the loss function, and use the training set to optimize the parameters of the text matching model, to obtain a text matching model that can determine whether a clinical diagnosis text matches an ICD-10 disease classification name text.
2. The method for constructing a text matching model for automatic ICD-10 encoding of clinical diagnostic text according to claim 1, characterized in that, The aggregation module employs a convolutional neural network. For all matching vectors, the convolutional neural network uses convolutional kernels with the same weights to perform aggregation calculations and obtain the aggregation result.
3. The method for constructing a text matching model for automatic ICD-10 encoding of clinical diagnostic text according to claim 1, characterized in that, The binary classification model includes a fully connected layer and a sigmoid function layer, which are used to output the matching prediction probability of two texts.
4. A device for constructing a text matching model for automatic ICD-10 encoding of clinical diagnostic text, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the method for constructing a text matching model for automatic ICD-10 encoding of clinical diagnostic text as described in any one of claims 1 to 3.
5. A method for automatically encoding clinical diagnostic text using ICD-10, characterized in that the method utilizes a text matching model constructed using the construction method described in any one of claims 1 to 3, and includes the following steps:
Step 1: Prepare a priority matching set containing clinical diagnostic texts that have already been manually ICD-10 encoded;
Step 2: Obtain the clinical diagnostic text to be encoded and clean the clinical diagnostic text;
Step 3: Using the already manually ICD-10 encoded clinical diagnostic texts, perform ICD-10 encoding matching prediction for the clinical diagnostic text to be encoded through the text matching model, and output the matching prediction probability;
Step 4: If the matching prediction probability indicates that the priority matching set contains a matching manually ICD-10 encoded clinical diagnostic text, the ICD-10 code of that matched text is used as the ICD-10 code of the clinical diagnostic text to be encoded.
6. The method for automatically encoding clinical diagnostic text using ICD-10 according to claim 5, characterized in that, in step 3, using the already manually ICD-10 encoded clinical diagnostic texts to perform ICD-10 encoding matching prediction for the clinical diagnostic text to be encoded through the text matching model includes:
Step 3-1: Input the cleaned clinical diagnostic text to be encoded and each manually ICD-10 encoded clinical diagnostic text in the priority matching set into the BERT model to extract the vector representation of the clinical diagnostic text to be encoded and the vector representation of the manually ICD-10 encoded clinical diagnostic text.
Step 3-2: Input the vector representation of the clinical diagnostic text to be encoded and the vector representation of the manually ICD-10 encoded clinical diagnostic text into the BiLSTM model to extract the temporal information in the text, so as to obtain the representation of each time step of the clinical diagnostic text to be encoded and the representation of each time step of the manually ICD-10 encoded clinical diagnostic text.
Step 3-3: Input all time step representations of the clinical diagnostic text to be encoded into the matching module one by one, and perform weighted matching against all time step representations of a single manually ICD-10 encoded clinical diagnostic text to obtain a set of matching vectors;
Step 3-4: Input the set of matching vectors into the aggregation module for aggregation, then input the aggregation result into the binary classification model, which outputs the matching prediction result;
Step 3-5: Repeat steps 3-3 and 3-4 until the clinical diagnostic text to be encoded has been matched against every manually ICD-10 encoded clinical diagnostic text in the priority matching set.
7. The method for automatic ICD-10 encoding of clinical diagnostic text according to claim 5, characterized in that it also includes preparing a candidate matching set containing ICD-10 disease classification name texts; when the matching prediction probability indicates that the priority matching set contains no matching manually ICD-10 encoded clinical diagnostic text, the cleaned clinical diagnostic text to be encoded is matched against the ICD-10 disease classification name texts in the candidate matching set. That is, step 3 is executed with the ICD-10 disease classification name texts in place of the manually ICD-10 encoded clinical diagnostic texts, until the clinical diagnostic text to be encoded has been matched against every ICD-10 disease classification name text in the candidate matching set. If the matching prediction probability indicates that a matching ICD-10 disease classification name text exists, the ICD-10 code corresponding to that name text is used as the ICD-10 code of the clinical diagnostic text to be encoded.
8. The method for automatic ICD-10 encoding of clinical diagnostic text according to claim 5, characterized in that cleaning the clinical diagnostic text includes: unifying text delimiters, removing text numbering, removing meaningless characters, standardizing common expressions, and screening for typos. Among them, unifying text delimiters means using rules to normalize the diverse delimiters in the text, including ";", ",", ".", "?", "、", "\", " ", and "|"; removing text numbering means removing the diverse numbering markers in the text, including "1", "一", "①", and "Ⅰ"; removing meaningless characters means removing English letters, special symbols, and dates; standardizing common expressions means automatically replacing non-standard expressions using disease diagnosis synonyms that are extracted from commonly used medical terms, various versions of ICD-10, MedDRA, and redirect data from Wikipedia and Baidu Encyclopedia, and then manually reviewed; screening for typos uses FASPell to correct Chinese spelling errors, for which a BERT-based deep denoising encoder and a decoder based on confidence and on phonetic and glyph similarity are trained.
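The rule-based part of the cleaning described in claim 8 (delimiters, numbering, meaningless characters) can be illustrated with a short Python sketch. It is a simplified reading and not the claimed implementation: the regular expressions, the date formats recognized, and the function name `clean` are assumptions, and the synonym standardization and FASPell typo screening steps are left out entirely.

```python
import re
from typing import List

# Delimiters listed in the claim: ; , . ? 、 \ space |
DELIMS = re.compile(r"[;,.?、\\|\s]+")
# Numbering markers attached to the start of a segment, such as ① or Ⅰ.
LEADING_MARK = re.compile(r"^[①-⑩Ⅰ-Ⅹ]+")
# Segments that consist only of a number, such as "1" or "一".
NUMBERING = re.compile(r"\d+|[一二三四五六七八九十]+")
# Assumed date shapes: 2021-08-30, 2021/8/30, 2021年8月30日.
DATE = re.compile(r"\d{4}[-/年]\d{1,2}[-/月]\d{1,2}日?")
# English letters and non-word symbols.
NOISE = re.compile(r"[A-Za-z]|[^\w]")


def clean(text: str) -> List[str]:
    """Split a raw diagnosis string into cleaned diagnosis segments."""
    text = DATE.sub("", text)               # drop dates before splitting
    segments = []
    for part in DELIMS.split(text):
        part = LEADING_MARK.sub("", part)   # strip attached ① / Ⅰ markers
        if NUMBERING.fullmatch(part):       # drop bare numbering like "1"
            continue
        part = NOISE.sub("", part)          # drop letters and symbols
        if part:
            segments.append(part)
    return segments
```

For example, `clean("1.高血压病;②2型糖尿病|CT 2021-08-30")` returns `["高血压病", "2型糖尿病"]`: the bare "1" and the date are dropped, the leading "②" is stripped, and the digit inside "2型糖尿病" is kept because it is part of the diagnosis name.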