Method and device for de-identifying clinical information text

A machine learning-based model addresses the challenges of anonymizing Korean clinical texts by training on augmented datasets and applying preprocessing techniques, ensuring effective de-identification of personal information for enhanced text usability.

WO2026127372A1 · PCT designated stage · Publication Date: 2026-06-18 · THE ASAN FOUND

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
THE ASAN FOUND
Filing Date
2025-10-31
Publication Date
2026-06-18

AI Technical Summary

Technical Problem

Existing named entity recognition models for Korean clinical information texts face challenges due to the mixture of Hangul, English, and special characters, and the scarcity of training datasets, limiting their effectiveness and requiring manual efforts for anonymization.

Method used

A machine learning-based de-identification tag generation model is trained using an augmented training dataset and annotation data to identify and replace or mask personal information in clinical information texts, utilizing preprocessing techniques like converting to lowercase, removing special characters, and tokenizing for accurate de-identification.

Benefits of technology

The model effectively anonymizes personal information in Korean clinical texts, enhancing their usability for research and analysis by accurately identifying and transforming tokens into non-personal information expressions.

✦ Generated by Eureka AI based on patent content.


Abstract

An electronic device for de-identifying personal information in clinical information text according to an embodiment comprises: a memory; and a processor for training a machine learning-based de-identification tag generation model on the basis of annotation data indicating a de-identification category for each token of training text and a training dataset including the training text, providing clinical information text for a patient input by medical personnel to the trained de-identification tag generation model, so as to output a de-identification category for each token of the clinical information text, and changing a token corresponding to personal information in the clinical information text into an expression indicating non-personal information on the basis of the output de-identification category.

Description

Method and device for de-identifying clinical information text

[0001] The following disclosure relates to a method and apparatus for training a machine learning-based de-identification tag generation model from training text and annotation data, and for de-identifying personal information within clinical information text based on the trained de-identification tag generation model.

[0002] Clinical information texts, such as Electronic Medical Records (EMRs) and discharge records, contain detailed information including a patient's medical history and current health status. Utilizing these texts for research and analysis can significantly aid in understanding disease patterns, developing new treatments, and improving healthcare policies. However, clinical information texts are not being effectively utilized because they contain patients' personal information (e.g., name, date of birth, medical records, etc.); therefore, measures to anonymize personal information within these texts are necessary to facilitate their use.

[0003] Named Entity Recognition (NER) is a technology in natural language processing that identifies and classifies specific entities (e.g., names of people, organizations, regions, etc.) within text and can be effectively used for the anonymization of personal information. In particular, methods for recognizing named entities using machine learning-based models (e.g., models including neural networks) are receiving attention, and technologies for recognizing named entities using machine learning-based models specialized for English and Chinese have already been developed.

[0004] However, in the case of Korean clinical information texts, the mixture of Hangul, English, and special characters limits the direct application of existing named entity recognition models specialized for English or Chinese. Furthermore, due to the scarcity of training datasets for Korean, experts and engineers frequently have to perform named entity recognition tasks manually.

[0005] The background technology described above is possessed or acquired by the inventor in the process of deriving the content of the disclosure of the present application, and cannot necessarily be considered as prior art disclosed to the general public prior to the filing of this application.

[0006] A method for de-identifying personal information within clinical information text, performed by a processor according to one embodiment, comprises: training a machine learning-based de-identification tag generation model based on annotation data indicating a de-identification category for each token of a training text and a training dataset including said training text; providing a clinical information text about a patient entered by a medical professional to said trained de-identification tag generation model to output a de-identification category for each token of said clinical information text; and changing a token corresponding to personal information in said clinical information text into an expression representing non-personal information based on said output de-identification category.

[0007] The step of changing a token corresponding to the personal information into an expression representing non-personal information includes: a step of determining, among the tokens of the clinical information text, a token for which a de-identification category indicating personal information is output as the token corresponding to the personal information; and a step of performing at least one of replacing at least a portion of the token corresponding to the personal information with an expression representing the de-identification category output for said token, masking at least a portion of the token, and deleting at least a portion of the token.

[0008] The above-mentioned de-identification category includes at least one of a personal name category, an organization category including the name of an institution, a region category including the name of an administrative division, a time category, and a date category.

[0009] The step of training the above-mentioned de-identification tag generation model includes the step of augmenting the training dataset with augmented training text obtained by replacing a specified word within the training text with a synonym of the specified word.

[0010] The step of outputting the de-identification category includes the step of preprocessing the clinical information text, and the step of preprocessing the clinical information text includes at least one of converting English letters of the clinical information text into lowercase letters, removing at least one special character from the remaining characters other than English letters, Korean characters, and numbers in the clinical information text, unifying categorizable words into the word with the highest term frequency among the categorizable words, and tokenizing the clinical information text into word units.

[0011] The above at least one special character includes a special character different from at least one of the following: a period (dot), a double quotation mark, a plus sign, a minus sign, an inequality sign, a tilde, and a percent sign.

[0012] An electronic device for de-identifying personal information within clinical information text according to one embodiment includes: a memory; and a processor that trains a machine learning-based de-identification tag generation model based on annotation data indicating a de-identification category for each token of a training text and a training dataset including said training text, provides a clinical information text about a patient entered by a medical professional to said trained de-identification tag generation model to output a de-identification category for each token of said clinical information text, and, based on said output de-identification category, changes a token corresponding to personal information in said clinical information text into an expression representing non-personal information.

[0013] The processor determines, among the tokens of the clinical information text, a token for which a de-identification category indicating personal information is output as the token corresponding to the personal information, and performs at least one of replacing at least a portion of the token corresponding to the personal information with an expression representing the de-identification category output for the token, masking at least a portion of the token, and deleting at least a portion of the token.

[0014] The above-mentioned de-identification category includes at least one of a personal name category, an organization category including the name of an institution, a region category including the name of an administrative division, a time category, and a date category.

[0015] The processor augments the training dataset with augmented training text obtained by replacing a specified word within the training text with a synonym of the specified word.

[0016] The above processor preprocesses the clinical information text by performing at least one of converting English letters of the clinical information text into lowercase, removing at least one special character from the remaining characters other than English letters, Korean characters, and numbers in the clinical information text, unifying categorizable words into the word with the highest term frequency among the categorizable words, and tokenizing the clinical information text into word units.

[0017] The above at least one special character includes a special character different from at least one of the following: a period (dot), a double quotation mark, a plus sign, a minus sign, an inequality sign, a tilde, and a percent sign.

[0018] FIG. 1 is a diagram schematically illustrating a method for de-identifying clinical information text according to one embodiment.

[0019] FIG. 2 is a diagram showing clinical information text and de-identification tags corresponding to said clinical information text according to one embodiment.

[0020] FIG. 3 is a flowchart illustrating a method for de-identifying clinical information text according to one embodiment.

[0021] FIG. 4 is a diagram showing the training and evaluation operations of a de-identification tag generation model according to one embodiment.

[0022] FIG. 5 is a diagram showing a comparison between de-identification tags generated based on a tag generation model according to one embodiment and tags generated based on a large-scale language model.

[0023] FIG. 6 is a drawing showing a clinical information text de-identification device according to one embodiment.

[0024] Specific structural or functional descriptions of the embodiments are disclosed for illustrative purposes only and may be modified and implemented in various forms. Accordingly, actual implementations are not limited to the specific embodiments disclosed, and the scope of this specification includes modifications, equivalents, or substitutions included in the technical concept described by the embodiments.

[0025] Terms such as "first" or "second" may be used to describe various components, but these terms should be interpreted solely for the purpose of distinguishing one component from another. For example, the first component may be named the second component, and similarly, the second component may be named the first component.

[0026] When it is stated that a component is "connected" to another component, it should be understood that it may be directly connected to or coupled with that other component, or that there may be other components in between.

[0027] The singular expression includes the plural expression unless the context clearly indicates otherwise. In this specification, terms such as "comprising" or "having" are intended to specify the existence of the described features, numbers, steps, actions, components, parts, or combinations thereof, and should be understood as not precluding the existence or addition of one or more other features, numbers, steps, actions, components, parts, or combinations thereof.

[0028] Unless otherwise defined, all terms used herein, including technical or scientific terms, have the same meaning as generally understood by those skilled in the art. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the relevant technology, and should not be interpreted in an ideal or overly formal sense unless explicitly defined in this specification.

[0029] Hereinafter, embodiments will be described in detail with reference to the attached drawings. In the description with reference to the attached drawings, identical components are given the same reference numeral regardless of the drawing number, and redundant descriptions thereof will be omitted.

[0030] FIG. 1 is a diagram schematically illustrating a method for de-identifying clinical information text according to one embodiment.

[0031] An electronic device according to one embodiment can train a de-identification tag generation model (111) and obtain de-identified clinical information text (131) from clinical information text (121) by performing a clinical information text de-identification method. The clinical information text de-identification method includes steps (110, 120, and 130). The electronic device may be, for example, a clinical information text de-identification device or a processor included in a clinical information text de-identification device. A clinical information text de-identification device including a processor is described in detail in FIG. 6 below.

[0032] In step (110), the electronic device may obtain a trained de-identification tag generation model (113) by training (e.g., fine-tuning) a de-identification tag generation model (111) based on a training dataset (112). The de-identification tag generation model (111) may include a machine learning-based neural network, namely a pre-trained model (e.g., Llama 3 or a BERT-based model), that generates from a text a de-identification tag corresponding to each token of the text. The training dataset (112) (e.g., Naver x Changwon University NER dataset) may include training text and annotation data. The annotation data may be data indicating the de-identification category for each token of the training text. For example, the annotation data may be obtained by annotating each token of the training text with a de-identification tag. For example, based on the medical institution names extracted from Asan Medical Center's internal data (e.g., Asan Medical Information System, AMIS), an organization category may be annotated to the token corresponding to a medical institution name in the annotation data. De-identification categories may include categories that indicate personal information within the text (e.g., Person Name Category (PER), Organization Category (ORG) (e.g., name of the institution), Location Category (LOC) (e.g., name of the administrative district), Time Category (TIM), and Date Category (DAT)). Personal information may include information (e.g., person name, organization, location, time, and date) that can be used to identify a specific individual (e.g., patient). De-identification categories may be represented as shown in Table 1 below.

[0033] Table 1

No. | Category | Tag | Definition
1 | Person (PERSON) | PER | Names of real or fictional people
2 | Organization (ORGANIZATION) | ORG | Institutions, organizations, conferences, and meetings
3 | Location (LOCATION) | LOC | Regional names, administrative district names, etc.
4 | Time (TIME) | TIM | Time
5 | Date (DATE) | DAT | Date

[0034] De-identification tags may include tags corresponding to each token within the text (e.g., training text, clinical information text). For example, de-identification tags may include de-identification categories. Tags may include labels for classifying or identifying data (e.g., tokens). De-identification tags are described in detail in FIG. 2 below.

[0035] In step (120), the electronic device may perform annotation on the clinical information text (121). For example, the electronic device may output a de-identification tag (e.g., de-identification category) corresponding to each token of the clinical information text (121) by providing the clinical information text (121) to a trained de-identification tag generation model (113). The clinical information text (121) may be text (e.g., discharge summary) indicating clinical information about a patient (e.g., information about the patient's diagnosis or treatment) entered by a medical professional (e.g., doctor). In this specification, the action of assigning tags corresponding to each token of the text may be referred to as annotation or annotating.

[0036] In step (130), the electronic device may perform de-identification of the clinical information text (121). For example, the electronic device may obtain a de-identified clinical information text (131) by de-identifying the clinical information text (121) based on the de-identification tags output for it. For example, as shown in FIG. 1, the date (e.g., 20240214), name (e.g., Park OO, Kim OO), organization (e.g., Asan Medical Center), and region (e.g., Seoul, Ulsan) of the clinical information text (121) may be replaced with [DATE], [PER], [ORG], and [LOC], respectively, in the de-identified clinical information text (131).

[0037] FIG. 2 is a diagram showing clinical information text and de-identification tags corresponding to said clinical information text according to one embodiment.

[0038] An electronic device according to one embodiment can obtain de-identification tags (220) from a clinical information text (210) by performing a clinical information de-identification method. The clinical information text (210) may include information necessary to identify a patient (e.g., personal information), such as a date (211) and a person's name (212, 213). To use the clinical information text (210), it may be necessary to de-identify the personal information within it. Accordingly, the electronic device can perform a clinical information de-identification method on the clinical information text (210). In the de-identification tags (220) obtained by performing the method, the date (211) and the person's names (212, 213), which correspond to personal information within the clinical information text (210), may be changed (e.g., replaced) to DAT-B (221) and PER-B (222, 223), respectively. In the tags (220) of FIG. 2 (e.g., DAT-B, PER-B, and O), DAT may represent the de-identification category indicating a date (e.g., date category), and PER may represent the de-identification category indicating a person's name (e.g., person name category). Additionally, B may be a BIO expression indicating that the token corresponding to the tag is the first token, located at the start position within the entity name to which the token belongs. An entity name may be the name of an entity having a specific meaning.

[0039] Additionally, O may be a BIO expression indicating that the token corresponding to the tag does not belong to a de-identification category (i.e., does not belong to personal information). The BIO expression includes B (Begin), I (Inside), and O (Outside), and I may be a BIO expression indicating that the token corresponding to the tag is not located at the beginning position within the entity name to which it belongs. As shown in FIG. 2, annotation data or de-identification tags may include not only de-identification categories but also BIO expressions. For example, if a token corresponding to a personal name category is located at the beginning position of an entity name, the de-identification tag (or annotation data) corresponding to that token may be PER-B. Annotation data or de-identification tags may be represented as shown in Table 2 below.

[0040] Table 2

Tags | Number of tags
PER-B, PER-I, ORG-B, ORG-I, LOC-B, LOC-I, TIM-B, TIM-I, DAT-B, DAT-I, O | 11

[0041] As shown in Table 2 above, tags included in annotation data and de-identification tags may be represented as O if they do not correspond to personal information. Additionally, if tags included in annotation data and de-identification tags correspond to personal information, they may be represented as a combination of a de-identification category and a BIO representation.
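The tag inventory described above can be sketched in a few lines. This is only an illustration: the names `CATEGORIES`, `build_tag_set`, and `split_tag` are hypothetical and do not appear in the disclosure. The sketch assumes that each of the five categories of Table 1 is combined with B and I, and that a single O tag is added, which yields the 11 tags counted in Table 2.

```python
# Category codes follow Table 1; the "CATEGORY-B" / "CATEGORY-I" spelling follows Table 2.
CATEGORIES = ["PER", "ORG", "LOC", "TIM", "DAT"]


def build_tag_set(categories):
    """Return every CATEGORY-B and CATEGORY-I tag, followed by the single O tag."""
    tags = []
    for category in categories:
        tags.append(f"{category}-B")
        tags.append(f"{category}-I")
    tags.append("O")
    return tags


def split_tag(tag):
    """Split a tag into (category, bio); the O tag carries no category."""
    if tag == "O":
        return None, "O"
    category, bio = tag.rsplit("-", 1)
    return category, bio


TAGS = build_tag_set(CATEGORIES)
```

Here `len(TAGS)` is 11, and `split_tag("PER-B")` separates the de-identification category from the BIO expression.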

[0042] FIG. 3 is a flowchart illustrating a method for de-identifying clinical information text according to one embodiment.

[0043] In step (310), the electronic device can train a de-identification tag generation model based on a training dataset. For example, the electronic device can obtain temporary output data (e.g., de-identification tags output for the provided training text) by providing training text to the de-identification tag generation model. The electronic device can train the de-identification tag generation model based on the comparison results between the obtained temporary output data and the annotation data. For example, the electronic device can train the de-identification tag generation model by updating the parameters of the de-identification tag generation model (e.g., connection weights between nodes included in the de-identification tag generation model) to reduce the objective function value between the temporary output data and the annotation data. The training of the aforementioned de-identification tag generation model can be performed, for example, through a back-propagation method.
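As a rough illustration of an objective function value between the temporary output data and the annotation data, the sketch below computes a mean per-token cross-entropy. The disclosure does not name a specific objective function, so cross-entropy is an assumption here, as are the function name `mean_token_cross_entropy` and the representation of model output as a per-token dictionary of tag probabilities.

```python
import math


def mean_token_cross_entropy(predicted, annotated):
    """Average negative log-probability that the model assigns to each annotated tag.

    predicted: one dict per token mapping tag -> probability (assumed output format).
    annotated: one gold tag per token, taken from the annotation data.
    """
    eps = 1e-12  # guards against log(0) when a gold tag gets zero probability
    total = 0.0
    for probs, gold_tag in zip(predicted, annotated):
        total += -math.log(max(probs.get(gold_tag, 0.0), eps))
    return total / len(annotated)


annotated = ["PER-B", "O"]
# Hypothetical model outputs before and after a parameter update.
before = [{"PER-B": 0.4, "O": 0.6}, {"PER-B": 0.5, "O": 0.5}]
after = [{"PER-B": 0.9, "O": 0.1}, {"PER-B": 0.2, "O": 0.8}]
loss_before = mean_token_cross_entropy(before, annotated)
loss_after = mean_token_cross_entropy(after, annotated)
```

A parameter update that moves probability mass toward the annotated tags lowers this value, which is the direction in which training is described as proceeding.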

[0044] The electronic device may use an augmented training dataset, generated by augmenting the training dataset, for training a de-identification tag generation model. For example, the electronic device may obtain augmented training text by replacing a specified word within the training text with a synonym of that specified word. A synonym may include a word having the same or similar meaning as the specified word. The electronic device may, for example, specify any word within the training text and replace that word with a synonym of that word. As another example, the electronic device may obtain augmented training text by inserting any word, swapping the positions of any two words, or deleting any word within the training text. The electronic device may obtain augmented annotation data by annotating the obtained augmented training text. That is, the electronic device may obtain an augmented training dataset containing the augmented training text and augmented annotation data. The electronic device may use the obtained augmented training dataset for training a de-identification tag generation model. In this way, the electronic device may augment the training dataset with the augmented training text (and augmented annotation data).
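The synonym-replacement augmentation can be sketched as follows. The synonym table `SYNONYMS` and the function `augment_with_synonyms` are hypothetical. The sketch also makes a simplifying assumption that is not stated in the disclosure: only tokens tagged O are replaced, one token for one token, so the original tags can be carried over as the augmented annotation data.

```python
import random

# Hypothetical synonym table; a real system would need a curated clinical lexicon.
SYNONYMS = {"fever": ["pyrexia"], "improved": ["recovered"]}


def augment_with_synonyms(tokens, tags, synonyms, rng):
    """Return a copy of (tokens, tags) with one non-personal token swapped for a synonym."""
    candidates = [
        i for i, (token, tag) in enumerate(zip(tokens, tags))
        if tag == "O" and token in synonyms
    ]
    if not candidates:
        # Nothing replaceable: return unchanged copies.
        return list(tokens), list(tags)
    index = rng.choice(candidates)
    new_tokens = list(tokens)
    new_tokens[index] = rng.choice(synonyms[tokens[index]])
    # One-to-one replacement of an O token leaves the tag sequence valid.
    return new_tokens, list(tags)


tokens = ["kim", "reported", "fever"]
tags = ["PER-B", "O", "O"]
aug_tokens, aug_tags = augment_with_synonyms(tokens, tags, SYNONYMS, random.Random(0))
```

Insertion, swapping, and deletion change token positions, so those variants would need the tags to be realigned or the augmented text to be re-annotated.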

[0045] In step (320), the electronic device can output a de-identification category for each token of the clinical information text by providing the clinical information text to a trained de-identification tag generation model. For example, the electronic device can obtain tag data (e.g., tags containing a de-identification category for each token of the clinical information text) by providing the clinical information text to a trained de-identification tag generation model.

[0046] The electronic device can preprocess the clinical information text before providing it to a trained de-identification tag generation model. For example, the electronic device can remove at least one special character from among the characters in the clinical information text other than English letters, Korean characters, and numbers. Here, the special characters to be removed may be special characters other than at least one of the following: a period (.), a double quotation mark ("), a plus sign (+), a minus sign (-), an inequality sign (< or >), a tilde (~), and a percent sign (%). Leaving these special characters in place may be advantageous for maintaining the meaning of the clinical information text: a period is used in values expressed with a unit (e.g., 2.48 kg), plus and minus signs are necessary when additional information follows a number, an inequality sign may be used for next steps or size comparisons, a tilde may be used to indicate a range (e.g., 6 to 7 months old), and a percent sign may be used to indicate a ratio (e.g., 8 to 90%). In other words, the electronic device may, for example, remove every character from the clinical information text that is not an English letter, a Korean character, a number, or one of the retained special characters (at least one of a period, a double quotation mark, a plus sign, a minus sign, an inequality sign, a tilde, and a percent sign).
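A minimal sketch of this character filter, using a regular expression, is given below. The names `KEPT_SPECIALS` and `remove_special_characters` are hypothetical, and the Unicode ranges used for Korean characters (Hangul syllables and jamo blocks) are an assumption about what "Korean characters" covers. Whitespace is also retained so that later word-level tokenization remains possible.

```python
import re

# Special characters the text describes as worth keeping.
KEPT_SPECIALS = '."+-<>~%'

# Match anything that is NOT an English letter, a digit, a Hangul character
# (syllables U+AC00-U+D7A3, jamo U+1100-U+11FF, compatibility jamo U+3130-U+318F),
# whitespace, or one of the kept special characters.
_REMOVE = re.compile(
    r"[^A-Za-z0-9\uAC00-\uD7A3\u1100-\u11FF\u3130-\u318F\s"
    + re.escape(KEPT_SPECIALS)
    + "]"
)


def remove_special_characters(text):
    """Delete every character outside the retained set."""
    return _REMOVE.sub("", text)


cleaned = remove_special_characters("wt 2.48 kg #1, f/u <5%")
```

Characters such as `#`, `,`, and `/` are dropped, while the decimal point, the inequality sign, and the percent sign survive.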

[0047] The electronic device may fill missing values within clinical information text with an expression indicating the absence of data (e.g., "missing"). Missing values can indicate a state where the data value is empty. For example, if the value of a specific item in the clinical information text is missing, was not measured, or does not exist due to a problem during the recording process, the electronic device may fill the value of that item with "missing".

[0048] The electronic device may unify categorizable words within the clinical information text into the word with the highest term frequency. Categorizable words may be a set of words with identical or similar meanings. For instance, the English word "improved" and its Korean-language equivalent can be categorizable words because they have the same meaning and differ only in whether they are written in English or in Korean. Term frequency refers to the number of times a word appears within a document (i.e., its frequency of occurrence). For instance, if one of these two forms appears 10 times and the other appears 8 times within the clinical information text, the electronic device may unify them into the form that appears 10 times because its term frequency is the highest (i.e., every occurrence of the less frequent form is replaced with the more frequent form).
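The unification by term frequency might be sketched as follows. The function `unify_by_term_frequency` and the grouping of words into `groups` are hypothetical, since the disclosure does not say how categorizable words are found. The word pair used in the example ("improved" and the Korean word "호전") is chosen only for illustration.

```python
from collections import Counter


def unify_by_term_frequency(tokens, groups):
    """Replace each word in a group with the group's most frequent member.

    groups: lists of words assumed to share a meaning (hypothetical input).
    Ties are resolved in favor of the word listed first in the group.
    """
    counts = Counter(tokens)
    mapping = {}
    for group in groups:
        canonical = max(group, key=lambda word: counts[word])
        for word in group:
            mapping[word] = canonical
    return [mapping.get(token, token) for token in tokens]


tokens = ["improved", "호전", "improved", "stable", "호전", "improved"]
unified = unify_by_term_frequency(tokens, [["improved", "호전"]])
```

In this example "improved" occurs three times and "호전" twice, so every occurrence of the group is rewritten as "improved" and unrelated tokens are left alone.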

[0049] In addition, the electronic device may replace consecutive spaces with a single space if two or more spaces are used consecutively in the clinical information text. Furthermore, the electronic device may convert English letters in the clinical information text to lowercase. The electronic device may also remove line breaks (\n) (i.e., the characters produced by the Enter key) from the clinical information text. The electronic device may tokenize the clinical information text into word units (e.g., by splitting on spaces). The electronic device may also convert the data type of the entire clinical information text to string. While the electronic device can preprocess the training dataset according to the methods described above, the preprocessing method is not limited to these methods, and the training dataset may be preprocessed according to various other methods.
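Several of the preprocessing operations described above can be chained into one function, as in the sketch below. The function name `preprocess` and the order of operations are assumptions. Line breaks are replaced with a space rather than deleted outright, which is a design choice of this sketch so that words on adjacent lines are not fused together.

```python
import re


def preprocess(text):
    """Normalize a clinical text field and split it into word-level tokens."""
    # Fill empty or absent values with an explicit marker.
    if text is None or str(text).strip() == "":
        return ["missing"]
    text = str(text)               # force the string data type
    text = text.lower()            # lowercases English letters; Hangul has no case
    text = text.replace("\n", " ")  # drop line breaks without fusing words
    text = re.sub(r" {2,}", " ", text).strip()  # collapse runs of spaces
    return text.split(" ")         # word units, split on spaces


tokens = preprocess("BP  120/80\nStable")
```

Special-character removal and term-frequency unification are left out here to keep the sketch short; they could be applied before the final split.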

[0050] In step (330), the electronic device may change tokens corresponding to personal information in the clinical information text into expressions representing non-personal information based on the output de-identification category. Non-personal information may be information through which a specific individual cannot be identified. For example, expressions representing non-personal information may include expressions representing de-identification categories (e.g., PER, ORG, LOC, TIM, DAT, etc. as described above). Additionally, expressions representing non-personal information may include BIO expressions (e.g., ORG-I, LOC-B, etc.).

[0051] The electronic device may first determine that a token among the tokens of the clinical information text that has a de-identification category indicating personal information is the token corresponding to the personal information. Then, the electronic device may perform at least one of the operations of replacing at least a portion of the token corresponding to the personal information with an expression (e.g., ORG, PER-B, etc.) representing the de-identification category output for the said token, masking, and deleting. Masking may refer to an operation of replacing specific data (e.g., at least a portion of the token corresponding to the personal information) with another value (e.g., a special character such as *), and deletion may refer to an operation of removing specific data (e.g., at least a portion of the token corresponding to the personal information) from the entire data (e.g., clinical information data).
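The three operations (replacing, masking, and deleting) can be sketched over a token list and its tags. The function `deidentify`, its `mode` parameter, and the bracketed placeholder format such as `[PER]` are assumptions made for illustration. The sketch also collapses a run of tokens belonging to one entity (a B tag followed by I tags of the same category) into a single placeholder in replace mode, which the disclosure does not specify.

```python
def deidentify(tokens, tags, mode="replace", mask_char="*"):
    """Replace, mask, or delete every token whose tag is not O."""
    if mode not in ("replace", "mask", "delete"):
        raise ValueError(f"unknown mode: {mode}")
    output = []
    previous_category = None
    for token, tag in zip(tokens, tags):
        if tag == "O":
            output.append(token)  # non-personal tokens pass through unchanged
            previous_category = None
            continue
        category, bio = tag.rsplit("-", 1)
        continuation = bio == "I" and previous_category == category
        previous_category = category
        if mode == "replace":
            if not continuation:  # one placeholder per entity
                output.append(f"[{category}]")
        elif mode == "mask":
            output.append(mask_char * len(token))
        # mode == "delete": the token is simply omitted
    return output


tokens = ["20240214", "kim", "oo", "visited", "seoul"]
tags = ["DAT-B", "PER-B", "PER-I", "O", "LOC-B"]
replaced = deidentify(tokens, tags, mode="replace")
masked = deidentify(tokens, tags, mode="mask")
deleted = deidentify(tokens, tags, mode="delete")
```

Replacement keeps the category visible to downstream analysis, masking keeps only the token length, and deletion keeps nothing, so the choice depends on how much structure the de-identified text should retain.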

[0052] FIG. 4 is a diagram showing the training and evaluation operations of a de-identification tag generation model according to one embodiment.

[0053] In step (410), the electronic device can train a de-identification tag generation model (411). As described above, the electronic device can obtain a trained de-identification tag generation model (413) by training the de-identification tag generation model (411) based on the training dataset (412).

[0054] In step (420), the electronic device may obtain a dataset (e.g., a second dataset (425)) for further training and evaluation of the de-identification tag generation model (411). Step (420) may include steps (421) and (422). First, in step (421), the electronic device may obtain a first dataset (424) by providing clinical information text (423) (e.g., a discharge summary) to the trained de-identification tag generation model (413). The first dataset (424) may include the clinical information text (423) and the de-identification tags output as a result of providing the clinical information text (423) to the trained de-identification tag generation model (413). Then, in step (422), the user may verify whether the correspondence between the tokens of the clinical information text (423) and the de-identification tags in the first dataset (424) is appropriate. Where verification shows that a token and a de-identification tag are not properly matched, the user can correct the correspondence. The electronic device can obtain the second dataset (425) by receiving the corrected first dataset (424) from the user. The electronic device can divide the received second dataset (425) into a second training dataset (NER train) (e.g., a dataset for training within the second dataset (425)) and a second test dataset (NER test) (e.g., a dataset for testing within the second dataset (425)). For example, the electronic device can determine 70% of the second dataset (425) as the second training dataset (NER train) and 30% as the second test dataset (NER test). The electronic device can further train the trained de-identification tag generation model (413) based on the second training dataset (NER train). Alternatively, the second training dataset (NER train) may be used to train a model other than the trained de-identification tag generation model (413).

[0055] In step (430), the electronic device may evaluate the trained de-identification tag generation model (413) based on the second test dataset (NER test). For example, the electronic device may evaluate the trained de-identification tag generation model (413) by providing the clinical information text (423) within the second test dataset (NER test) to the trained de-identification tag generation model (413) and comparing the de-identification tags thus obtained with the de-identification tags within the second test dataset (NER test) according to an evaluation metric (e.g., F1-score). The F1-score is one of the metrics for evaluating the performance of a classification model and can be calculated as the harmonic mean of precision and recall. Precision represents the proportion of cases predicted as positive by the model that are actually positive, and recall represents the proportion of actually positive cases that the model correctly predicted as positive. The electronic device may use not only the second test dataset (NER test) but also a portion of the training dataset (412) (e.g., a dataset corresponding to 10% of the training dataset (412)) for evaluating the de-identification tag generation model (413) trained according to the method described above. In this case, the portion of the training dataset (412) used for evaluation may not have been used in training the de-identification tag generation model (413). The results of evaluating several models as the trained de-identification tag generation model (413) according to the F1-score may be as shown in Table 3 below.
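The 70%/30% division of the second dataset can be sketched as below. The function `split_dataset` is hypothetical, and shuffling the records before cutting is an assumption; the disclosure only gives the proportions.

```python
import random


def split_dataset(records, train_ratio=0.7, seed=0):
    """Shuffle the records reproducibly and cut them into train and test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Ten placeholder records standing in for user-verified (text, tags) pairs.
ner_train, ner_test = split_dataset(range(10))
```

Every record lands in exactly one of the two parts, so the test part stays unseen during the further training step.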
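The precision, recall, and F1-score definitions above can be made concrete with a short sketch. The disclosure does not state whether scoring is done per token or per entity, so the token-level counting below is an assumption, as is the function name `precision_recall_f1`. A predicted tag counts as a true positive only when it is not O and matches the annotated tag exactly.

```python
def precision_recall_f1(gold, predicted):
    """Token-level precision, recall, and F1 over the non-O tags."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if p != "O" and p == g)
    false_pos = sum(1 for g, p in pairs if p != "O" and p != g)
    false_neg = sum(1 for g, p in pairs if g != "O" and p != g)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = ["PER-B", "O", "DAT-B", "O"]
predicted = ["PER-B", "LOC-B", "O", "O"]
precision, recall, f1 = precision_recall_f1(gold, predicted)
```

In this example one personal token is tagged correctly, one O token is wrongly tagged LOC-B, and one date is missed, so precision, recall, and F1 are all 0.5.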

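[0056] Table 3

Model                        | Dataset   | Precision | Recall | F1-score
Bert-base-multilingual-cased | Dataset 1 | 0.86      | 0.86   | 0.86
Bert-base-multilingual-cased | Dataset 2 | 0.89      | 0.91   | 0.90
KLUE BERT                    | Dataset 1 | 0.88      | 0.89   | 0.89
KLUE BERT                    | Dataset 2 | 0.91      | 0.92   | 0.92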

[0057] In Table 3 above, Dataset 1 represents the training dataset (412), and Dataset 2 represents the training dataset (412) augmented with the augmented training dataset. Bert-base-multilingual-cased represents a BERT-based model pre-trained to process multiple languages, and KLUE BERT represents a BERT-based model pre-trained to process Korean. Each value in Table 3 above represents the evaluation result of the Bert-base-multilingual-cased model or the KLUE BERT model trained with Dataset 1 or Dataset 2. As can be seen from Table 3 above, the models trained using the augmented training dataset (Dataset 2) perform better than the models trained using the unaugmented training dataset (412) (Dataset 1), and the KLUE BERT model scores higher than the Bert-base-multilingual-cased model. Additionally, in Table 3 above, the F1-score of the KLUE BERT model trained with Dataset 2 is the highest, at 0.92. Accordingly, a Korean-specific model (e.g., KLUE BERT) trained using the augmented training dataset may be more suitable as the de-identification tag generation model (411). Additionally, evaluation results comparing Korean-specific models are shown in Table 4 below.

[0058] Table 4

Model                      | Negative | Positive
KLUE-BERT                  | 498      | 502
Llama3-Open-Ko-8B          | 597      | 403
EEVE-Korean-Instruct-10.8B | 784      | 216

[0059] Table 4 above shows the results of comparing the output tags of each model with the actual corresponding tags for 1,000 tokens. In Table 4 above, Llama3-Open-Ko-8B and EEVE-Korean-Instruct-10.8B may be large-scale language models specialized for Korean. When evaluating large-scale language models, a technique that includes a few examples of output in the input prompt (i.e., few-shot prompt technique) may be used. In the table above, positive indicates the number of output tags that match the actual tags, and negative indicates the number of output tags that do not match the actual tags. As shown in Table 4 above, KLUE-BERT has the highest number of positives at 502. That is, among Korean-specialized models, KLUE-BERT, which is a BERT-based model, may be more suitable as a de-identification tag generation model (411).
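The few-shot prompt technique and the positive/negative counts described above can be sketched as follows. The prompt wording, the function names `build_few_shot_prompt` and `count_matches`, and the example tokens and tags are all hypothetical; the embodiment does not disclose the actual prompt, and no language model is called in this sketch.

```python
def build_few_shot_prompt(examples, tokens):
    """Build a prompt that shows a few tagged examples before the new input."""
    lines = ["Tag each token with a de-identification category."]
    for ex_tokens, ex_tags in examples:
        lines.append("Tokens: " + " | ".join(ex_tokens))
        lines.append("Tags: " + " | ".join(ex_tags))
    lines.append("Tokens: " + " | ".join(tokens))
    lines.append("Tags:")  # the model is expected to continue from here
    return "\n".join(lines)


def count_matches(output_tags, actual_tags):
    """Return (positive, negative): tags that match / do not match the actual tags."""
    positive = sum(1 for o, a in zip(output_tags, actual_tags) if o == a)
    negative = len(actual_tags) - positive
    return positive, negative


# Hypothetical example pair and hypothetical model output.
examples = [(["2023-11-17", "Kim OO"], ["DAT-B", "PER-B"])]
prompt = build_few_shot_prompt(examples, ["2023-11-14", "Choi OO"])
positive, negative = count_matches(["DAT-B", "ORG-B"], ["DAT-B", "PER-B"])
print(prompt)
print(positive, negative)
```

Applied to 1,000 tokens, counts of this kind would yield the positive and negative columns of Table 4 (for example, 502 positives correspond to 50.2% of output tags matching the actual tags).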

[0060] The electronic device can perform de-identification of the clinical information text (423) in step (440) using a model evaluated as suitable through step (430) (e.g., KLUE BERT trained using the augmented training dataset (412)).

[0061] FIG. 5 is a diagram comparing de-identification tags generated by a de-identification tag generation model according to one embodiment with tags generated by a large-scale language model.

[0062] In FIG. 5, a de-identification tag generation model (KLUE-BERT (530)) and a large language model (LLM), Llama 3 (specifically, Llama-3-Open-Ko-8B:latest) (520), are compared. Here, Llama 3 is a pre-trained LLM and may be a model that has not been trained on the training dataset. When evaluating Llama 3, a technique that includes a few examples of output in the input prompt (i.e., a few-shot prompt technique) may be used. The tags output by providing the clinical information text (510) shown on the left side of FIG. 5 to Llama 3 (520) and KLUE-BERT (530), respectively, are shown on the right side of FIG. 5. In the second row of the table in FIG. 5, both Llama 3 (520) and KLUE-BERT (530) appropriately output the tags DAT-B and PER-B for "2023-11-17" and "Kim OO," respectively. In the third and fourth rows, KLUE-BERT (530) outputs O for the dates (20230828, 2023-11-14), so some personal information is missed. However, in the third row, KLUE-BERT (530) outputs O for "teacher," whereas Llama 3 (520) outputs the organization category (ORG) for "teacher," and in the fourth row, KLUE-BERT (530) outputs the person category (PER) for "Choi OO," whereas Llama 3 (520) outputs the organization category (ORG) for "Choi OO." As such, the de-identification tag generation model of the present invention (e.g., KLUE-BERT (530)) can de-identify the clinical information text (510) more accurately than an LLM (e.g., Llama 3 (520)).

[0063] FIG. 6 is a drawing showing a clinical information text de-identification device according to one embodiment.

[0064] The clinical information text de-identification device (600) may include a memory (610) and a processor (620). The clinical information text de-identification device (600) (specifically, the processor (620) of the clinical information text de-identification device (600)) can perform the clinical information text de-identification method described above. The memory (610) may be an electronic device capable of storing data. Data required for the clinical information text de-identification method (e.g., training dataset, clinical information text, (trained) de-identification tag generation model, and output de-identification tags, etc.) may be stored in the memory. The processor (620) may be an electronic device capable of processing data and performing calculations. The processor (620) can perform the clinical information text de-identification method described above.

[0065] The clinical information text de-identification device (600) can receive a training dataset and a de-identification tag generation model and store them in the memory (610). The processor (620) can train the de-identification tag generation model based on the training dataset. The clinical information text de-identification device (600) can store the trained de-identification tag generation model in the memory (610). Additionally, the clinical information text de-identification device (600) can receive clinical information text and store it in the memory (610). The processor (620) can output a de-identification tag (or de-identification category) for each token of the clinical information text by providing the clinical information text to the trained de-identification tag generation model. Based on the output de-identification tags (or de-identification categories), the processor (620) can obtain de-identified clinical information text by changing tokens corresponding to personal information within the clinical information text into expressions representing non-personal information. The obtained de-identified clinical information text can be stored in the memory (610). Additionally, if the clinical information text de-identification device (600) has a display unit (e.g., a liquid crystal display) (not shown), the clinical information text de-identification device (600) may output the de-identified clinical information text through the display unit.
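The final step performed by the processor (620), changing tokens tagged as personal information into expressions representing non-personal information, might be sketched as follows. The function name `deidentify`, the bracketed placeholder format, the asterisk mask, and the assumption that tags take the form `CATEGORY-B` with `O` for non-personal tokens are illustrative choices, not details confirmed by the embodiment. The three modes correspond to the replacing, masking, and deleting operations recited in the claims.

```python
def deidentify(tokens, tags, mode="replace", outside="O"):
    """Return the tokens with personal-information tokens replaced, masked, or deleted."""
    result = []
    for token, tag in zip(tokens, tags):
        if tag == outside:
            result.append(token)  # not personal information: keep as-is
            continue
        category = tag.split("-")[0]  # e.g. "PER-B" -> "PER" (assumed tag format)
        if mode == "replace":
            result.append(f"[{category}]")  # expression naming the category
        elif mode == "mask":
            result.append("*" * len(token))
        elif mode == "delete":
            continue
        else:
            raise ValueError(f"unknown mode: {mode}")
    return result


# Hypothetical tokens and model output tags.
tokens = ["2023-11-17", "Kim", "visited", "the", "clinic"]
tags = ["DAT-B", "PER-B", "O", "O", "O"]
print(deidentify(tokens, tags, "replace"))
print(deidentify(tokens, tags, "mask"))
print(deidentify(tokens, tags, "delete"))
```

Replacing with the category name keeps the sentence readable for later research use, whereas masking and deleting remove more information at the cost of context.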

[0066] The embodiments described above may be implemented as hardware components, software components, and / or combinations of hardware and software components. For example, the devices, methods, and components described in the embodiments may be implemented using a general-purpose computer or a special-purpose computer, such as, for example, a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing unit may execute an operating system (OS) and software applications executed on said operating system. Additionally, the processing unit may access, store, manipulate, process, and generate data in response to the execution of the software. For ease of understanding, the processing unit may be described as being used as a single unit, but those skilled in the art will understand that the processing unit may include multiple processing elements and / or multiple types of processing elements. For example, the processing unit may include multiple processors or one processor and one controller. In addition, other processing configurations, such as parallel processors, are also possible.

[0067] Software may include computer programs, code, instructions, or a combination of one or more of these, and may configure a processing unit to operate as desired or instruct the processing unit independently or collectively. Software and / or data may be stored on any type of machine, component, physical device, virtual equipment, computer storage medium, or device so as to be interpreted by the processing unit or to provide instructions or data to the processing unit. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on computer-readable recording media.

[0068] The method according to the embodiment may be implemented in the form of program instructions that can be executed through various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, etc., either alone or in combination, and the program instructions recorded on the medium may be those specifically designed and configured for the embodiment or those known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specifically configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code, such as that generated by a compiler, as well as high-level language code that can be executed by a computer using an interpreter, etc.

[0069] The hardware device described above may be configured to operate as one or more software modules to perform the operation of the embodiment, and vice versa.

[0070] Although the embodiments have been described above with reference to the limited drawings, those skilled in the art can apply various technical modifications and variations based thereon. For example, suitable results may be achieved even if the described techniques are performed in a different order than described, and / or if the components of the described system, structure, device, circuit, etc. are combined or assembled in a form different from described, or replaced or substituted by other components or equivalents.

[0071] Therefore, other implementations, other embodiments, and equivalents to the claims also fall within the scope of the claims set forth below.

Claims

1. A method for de-identifying personal information within clinical information text, performed by a processor, the method comprising: training a machine learning-based de-identification tag generation model based on annotation data indicating a de-identification category for each token of a training text and a training dataset including the training text; providing clinical information text about a patient entered by a medical professional to the trained de-identification tag generation model to output a de-identification category for each token of the clinical information text; and changing tokens corresponding to personal information in the clinical information text into expressions indicating non-personal information, based on the output de-identification category.

2. The method of claim 1, wherein changing the tokens corresponding to personal information into expressions indicating non-personal information comprises: determining, among the tokens of the clinical information text, a token for which a de-identification category indicating personal information is output as the token corresponding to the personal information; and performing at least one of replacing at least a portion of the token corresponding to the personal information with an expression representing the de-identification category output for the token, masking the at least a portion, and deleting the at least a portion.

3. The method of claim 1, wherein the de-identification category includes at least one of a personal name category, an organization category including the name of an institution, a region category including the name of an administrative division, a time category, and a date category.

4. The method of claim 1, wherein training the de-identification tag generation model comprises augmenting the training dataset with augmented training text obtained by replacing a specified word within the training text with a synonym of the specified word.

5. The method of claim 1, wherein outputting the de-identification category comprises preprocessing the clinical information text, and wherein preprocessing the clinical information text comprises at least one of: converting English letters in the clinical information text to lowercase; removing at least one special character from among the characters in the clinical information text other than English letters, Korean characters, and numbers; unifying categorizable words into the word with the highest term frequency among the categorizable words; and tokenizing the clinical information text in word units.

6. The method of claim 5, wherein the at least one special character includes a special character other than at least one of a period (dot), double quotation marks, a plus sign, a minus sign, an inequality sign, a tilde, and a percent sign.

7. A computer-readable recording medium storing one or more computer programs including instructions for performing the method of claim 1.

8. An electronic device for de-identifying personal information within clinical information text, the electronic device comprising: a memory; and a processor that trains a machine learning-based de-identification tag generation model based on annotation data indicating a de-identification category for each token of a training text and a training dataset including the training text, outputs a de-identification category for each token of clinical information text about a patient entered by a medical professional by providing the clinical information text to the trained de-identification tag generation model, and changes tokens corresponding to personal information in the clinical information text into expressions representing non-personal information based on the output de-identification category.

9. The electronic device of claim 8, wherein the processor determines, among the tokens of the clinical information text, a token for which a de-identification category indicating personal information is output as the token corresponding to the personal information, and performs at least one of replacing at least a portion of the token corresponding to the personal information with an expression representing the de-identification category output for the token, masking the at least a portion, and deleting the at least a portion.

10. The electronic device of claim 8, wherein the de-identification category includes at least one of a personal name category, an organization category including the name of an institution, a region category including the name of an administrative division, a time category, and a date category.

11. The electronic device of claim 8, wherein the processor augments the training dataset with augmented training text obtained by replacing a specified word within the training text with a synonym of the specified word.

12. The electronic device of claim 8, wherein the processor preprocesses the clinical information text by performing at least one of: converting English letters in the clinical information text to lowercase; removing at least one special character from among the characters in the clinical information text other than English letters, Korean characters, and numbers; unifying categorizable words into the word with the highest term frequency among the categorizable words; and tokenizing the clinical information text in word units.

13. The electronic device of claim 12, wherein the at least one special character includes a special character other than at least one of a period (dot), double quotation marks, a plus sign, a minus sign, an inequality sign, a tilde, and a percent sign.
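For illustration only, the preprocessing operations recited in claims 5, 6, 12, and 13 might be sketched as follows. This sketch assumes that the listed special characters (period, double quotation marks, plus sign, minus sign, inequality signs, tilde, percent sign) are retained and all other special characters are removed, that removed characters are replaced by a space, that word-unit tokenization means splitting on whitespace, and that term frequency is counted within the single input text. The function name `preprocess` and the `variant_groups` parameter are hypothetical.

```python
import re
from collections import Counter

# Anything that is not a lowercase English letter, digit, Hangul syllable,
# whitespace, or one of the retained special characters is removed.
_REMOVE = re.compile(r'[^a-z0-9가-힣\s."+\-<>~%]')


def preprocess(text, variant_groups=()):
    """Lowercase, strip special characters, tokenize, and unify word variants."""
    text = text.lower()            # convert English letters to lowercase
    text = _REMOVE.sub(" ", text)  # remove other special characters
    tokens = text.split()          # tokenize in word units (assumed: whitespace)
    counts = Counter(tokens)
    mapping = {}
    for group in variant_groups:   # each group lists words treated as one category
        canonical = max(group, key=lambda word: counts[word])
        for word in group:
            mapping[word] = canonical
    return [mapping.get(token, token) for token in tokens]


# Hypothetical clinical text; "htn" and "고혈압" are treated as one category.
result = preprocess("BP 120/80, HTN (+) 고혈압 htn", variant_groups=[("htn", "고혈압")])
print(result)
```

Here "htn" occurs more often than "고혈압", so both are unified into "htn"; in practice the term frequency would more plausibly be computed over the whole corpus rather than a single text.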