Medical document structuring processing method, device, equipment and storage medium
By employing technologies such as character recognition, spatial location analysis, and named entity recognition, the problem of insufficient data accuracy and reliability in the structured processing of medical documents has been solved, achieving the generation of medical document data with high accuracy and high reliability.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SUZHOU LIANGYIHUI NETWORK TECH CO LTD
- Filing Date
- 2026-03-25
- Publication Date
- 2026-06-19
AI Technical Summary
Existing medical document structuring technologies suffer from insufficient data accuracy and reliability, making it difficult to meet the requirements of high accuracy, high standardization, and high credibility.
By using text recognition, spatial location information, and layout analysis, text and its contextual relationships are extracted; by using named entity recognition and relationship analysis, heterogeneous medical information is uniformly converted into standardized structured data; and by combining rule validation, medical history association validation, and manual review, highly reliable standard medical document data is generated.
The method improves the accuracy and standardization of medical document parsing, reduces false extractions and logical errors, and enhances the reliability and usability of the data.
Figure CN122240710A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of medical data structuring technology, and in particular to a method, apparatus, device and storage medium for medical document structuring processing.
Background Technology
[0002] With the rapid development of medical informatization, hospitals, research institutions, and other organizations have generated massive amounts of unstructured medical documents, including medical records, examination reports, imaging reports, discharge summaries, and laboratory reports. These documents contain crucial medical information such as patient conditions, diagnoses, treatment plans, and examination data, serving as the core foundation for clinical research, medical record management, decision support, and the secondary use of medical data. Traditional medical document processing relies primarily on manual data entry, requiring medical staff to spend a significant amount of time extracting and organizing information from unstructured documents into structured data. This method is not only extremely inefficient and unable to meet the processing needs of massive amounts of documents, but it is also prone to data errors due to human negligence, affecting the accuracy of subsequent applications.
[0003] Currently, automated solutions for structuring medical documents on the market are mainly built upon the following technologies, with the core idea being a two-step approach of "text extraction + information parsing": The first type is based on traditional OCR and rule matching: OCR technology converts medical document images into text, and then target information (such as patient names and examination result values) is extracted based on preset regular expressions and keyword matching rules. The second type is based on general NLP models: With the development of deep learning, some solutions introduce pre-trained language models such as BERT for named entity recognition (NER) and relation extraction (RE), replacing traditional rule matching. The third type is a hybrid architecture solution. Some solutions combine OCR, layout analysis, and NLP models. They first use layout analysis tools (such as CUTIE and CloudScan) to identify areas in the document such as headings, paragraphs, and tables, and then employ differentiated text extraction and parsing strategies for different areas.
[0004] While existing technologies have reduced reliance on manual labor to some extent, they still require a significant amount of manual review and data entry, making it difficult to meet the requirements of high accuracy, high standardization, and high reliability in medical settings.
Summary of the Invention
[0005] In view of this, this application provides a method, apparatus, device and storage medium for structuring medical documents to solve the problems of insufficient data accuracy and reliability.
[0006] The first aspect of this application provides a method for structuring medical documents, the method comprising: performing text recognition on the acquired original medical documents to extract text data and corresponding spatial location information from each document; based on the spatial location information, performing named entity recognition and relationship analysis on the text data to obtain entity knowledge units; according to preset standard field mapping rules, standardizing and transforming the entity knowledge units to generate standard structure data; and performing multi-level verification on the standard structure data using a preset verification method to generate standard medical document data.
[0007] In an optional implementation, when the received data is image data, the method further includes the following steps before performing text recognition: performing edge detection on the acquired image data to obtain the vertex coordinates of the document region in the image data; calculating the perspective transformation matrix based on those vertex coordinates and the vertex coordinates of a preset target rectangle; and geometrically correcting the image data according to the perspective transformation matrix to generate a corrected standard document image, which is then used as the original medical document.
[0008] In an optional implementation, the step of performing text recognition on the acquired original medical documents and extracting text data and corresponding spatial location information from each document includes: performing structural analysis on the original medical documents to identify text regions in each document and determine the boundary coordinates of each region; based on the type corresponding to each text region and its boundary coordinates, detecting the text lines in each text region to obtain the spatial location information of each text line; and, based on the preset recognition order and the spatial location information, recognizing the text content within each text line and extracting the text data in each text region.
[0009] In an optional implementation, the step of performing named entity recognition and relation analysis on the text data based on the spatial location information to obtain entity knowledge units includes: based on the spatial location information, determining the reading order of the text lines and sorting the text data according to the reading order to generate a text sequence; based on preset medical entity types, performing named entity recognition on the text sequence to obtain medical entities and their corresponding entity types; and, based on the medical entities, the entity types, and the spatial location information, performing context-aware semantic relationship analysis on the medical entities to obtain the association relationships between them.
[0010] In an optional implementation, the step of standardizing and transforming the entity knowledge units according to preset standard field mapping rules to generate standard structure data includes: classifying the medical entities in the entity knowledge unit according to the preset entity format to obtain standard structure entities and non-standard structure entities; according to the preset standard field mapping rules, converting each standard structure entity into a first standard field; calculating the similarity between each non-standard structure entity and each field in the medical standard terminology database referenced by the standard field mapping rules, and converting the non-standard structure entity into the second standard field with the highest similarity; and generating standard structure data based on a predefined medical data template, the first standard field, and the second standard field.
[0011] In an optional implementation, the step of performing multi-level verification on the standard structure data using a preset verification method to generate standard medical document data includes: validating the standard structure data according to preset medical logic rules and the association relationships to obtain rule violation data; based on the user identifier corresponding to the original medical document, obtaining historical user data from the preset user database, performing medical history association verification between the standard structure data and the historical user data, and recording abnormal indicator data; and sending the rule violation data and the abnormal indicator data to the target device through a preset review method, so as to correct the standard structure data according to the inspection data fed back by the target device and generate standard medical document data.
[0012] In an optional implementation, the method further includes: hashing the standard medical document data to generate the current hash value corresponding to the standard medical document data, and obtaining the historical hash chain corresponding to the historical user data from the preset user database according to the user identifier; verifying the integrity of the historical hash chain and, when the historical hash chain passes the verification, concatenating the current hash value with the historical hash chain to form the current hash chain and updating the historical hash chain according to the current hash chain; and, according to the preset log format, constructing a document log from the standard medical document data and the historical user data, and storing the historical hash chain in association with the document log to generate a traceability log.
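The hash-chain step above can be illustrated with a minimal sketch. It rests on several assumptions that the application does not state: SHA-256 as the hash function, JSON serialization of the record, a fixed all-zero genesis value, and each hash covering the previous hash. The names `record_hash`, `verify_chain`, and `append_record` are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # assumed starting value for an empty chain


def record_hash(record: Dict[str, Any], prev_hash: str) -> str:
    """Hash a record together with the previous hash (assumed linking scheme)."""
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def verify_chain(chain: List[Dict[str, Any]]) -> bool:
    """Recompute every hash in order; any mismatch means the history was altered."""
    prev = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev:
            return False
        if record_hash(entry["record"], prev) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True


def append_record(chain: List[Dict[str, Any]], record: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Verify the historical chain first, then return a new chain with the record appended."""
    if not verify_chain(chain):
        raise ValueError("historical hash chain failed integrity verification")
    prev = chain[-1]["hash"] if chain else GENESIS
    entry = {"record": record, "prev_hash": prev, "hash": record_hash(record, prev)}
    return chain + [entry]
```

Because each hash covers the previous one, changing any stored record invalidates every later entry, which is what makes the log traceable.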
[0013] A second aspect of this application provides a medical document structuring processing apparatus, the apparatus comprising: a text recognition module, used to perform text recognition on the acquired original medical documents and extract the text data and corresponding spatial location information from each document; an entity analysis module, used to perform named entity recognition and relationship analysis on the text data based on the spatial location information to obtain entity knowledge units; a structural standard module, used to standardize and transform the entity knowledge units according to preset standard field mapping rules to generate standard structure data; and a multi-level verification module, used to perform multi-level verification on the standard structure data through preset verification methods to generate standard medical document data.
[0014] A third aspect of this application provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the medical document structuring processing method described above.
[0015] A fourth aspect of this application provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the medical document structuring processing method described above.
[0016] In summary, this application includes at least the following beneficial technical effects: 1. By using text recognition, spatial location information, and layout analysis, it can more accurately extract text and its contextual relationships in medical documents, especially improving the parsing accuracy of complex layouts such as tables and key-value pairs.
[0017] 2. Through named entity recognition, relational analysis, and standard field mapping, heterogeneous medical information can be uniformly converted into standardized structured data, improving the degree of data standardization and system interoperability.
[0018] 3. Through rule-based validation, medical history association validation, and manual review and correction, erroneous extraction, mismapping, and medical logic errors are reduced, thereby improving the reliability and usability of the final medical document data.
Attached Figure Description
[0019] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0020] Figure 1 is a flowchart of a medical document structuring method provided in an embodiment of this application; Figure 2 is a functional block diagram of a medical document structuring processing device provided in an embodiment of this application; Figure 3 is a schematic diagram of the structure of the electronic device provided in the embodiments of this application.
Detailed Implementation
[0021] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application.
[0022] Figure 1 is a flowchart of a medical document structuring method provided in an embodiment of this application. As shown in Figure 1, the medical document structuring method provided in this embodiment includes the following steps.
[0023] Step S1: Perform text recognition on the acquired original medical documents to extract the text data and corresponding spatial location information from each document.
[0024] The original medical documents refer to various unstructured files containing medical information that are to be processed, and their sources and formats are diverse. Specifically, a multimodal acquisition interface is constructed, which can receive and identify input data from different sources. These inputs include, but are not limited to, image files containing medical documents captured by mobile device cameras, PDF documents or image files generated by scanners, and PDF documents in native digital format. For native digital format PDF documents, their text content can be directly extracted; while for image inputs, they need to enter a subsequent image preprocessing process. The role of the original medical documents is to carry key medical information such as the patient's condition, diagnosis results, examination data, and treatment plans.
[0025] When the received data is image data, images captured by mobile devices or generated by scanners often appear as irregular quadrilaterals rather than standard rectangles due to factors such as shooting angle, lens distortion, unstable handheld operation, or tilted scanning placement. This distortion is called perspective distortion. Perspective distortion severely interferes with the accuracy of subsequent optical character recognition because the shape and relative position of the characters are distorted, making it difficult for the recognition model to correctly interpret the text content. This application employs a geometric correction method based on document edge detection and perspective transformation to eliminate the effects of distortion. Specifically, edge detection is performed on the acquired image data. By calculating the horizontal and vertical gradients of the image, edge information is enhanced. Then, the image is binarized using the Otsu thresholding method, converting it into a black and white binary image. The Hough transform is used to detect line segments in the binary image, filtering out candidate lines for document edges. By calculating the intersection points of these lines, the coordinates of the four vertices of the document region are obtained. If fewer than four corner points are detected due to edge occlusion or incomplete imaging, completion is performed based on convex hull analysis: the detected edge segments are extended, and the intersection of the extended lines is calculated as the missing corner points, or the coordinates of the remaining corner points are estimated based on the document's standard aspect ratio to ensure a complete set of quadrilateral corner points. Subsequently, a target standard rectangle is defined, with its width and height preset according to actual needs, for example, a width of 1600 pixels and a height adaptively calculated according to the original document's aspect ratio. 
The coordinates of the four vertices of the target rectangle are the top left, top right, bottom right, and bottom left corners. Based on the coordinates of the four vertices of the document region detected in the original image and the four vertices of the target rectangle, a system of linear equations can be constructed to solve for a 3×3 perspective transformation matrix. This matrix describes the pixel mapping relationship from the original image to the corrected image. This perspective transformation matrix is applied to each pixel in the original image to generate a new image, i.e., the corrected standard document image. At this point, the original image data is converted into a geometrically regular, uniformly sized standard document image, which will serve as the original medical document and proceed to the subsequent text recognition process.
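The system of linear equations mentioned above can be sketched as follows. This is an illustrative, dependency-free version and not the implementation of this application: it assumes the bottom-right matrix element is fixed to 1, so that four vertex pairs yield eight equations in eight unknowns, and the helper names `solve_linear_system`, `perspective_matrix`, and `warp_point` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so that row operations apply to both sides at once.
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate vertex configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def perspective_matrix(src: Sequence[Point], dst: Sequence[Point]) -> List[List[float]]:
    """Return the 3x3 matrix mapping each of four src vertices to its dst vertex."""
    a: List[List[float]] = []
    b: List[float] = []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h11 x + h12 y + h13) / (h31 x + h32 y + 1), rearranged to be linear in h
        a.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        b.append(u)
        # v = (h21 x + h22 y + h23) / (h31 x + h32 y + 1)
        a.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        b.append(v)
    h = solve_linear_system(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def warp_point(h: List[List[float]], p: Point) -> Point:
    """Apply the perspective matrix to one point (homogeneous divide included)."""
    x, y = p
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)
```

In practice, image libraries usually warp by iterating over the pixels of the output image and sampling the source through the inverse matrix, which avoids holes in the corrected image. The sketch only shows how the matrix itself is obtained.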
[0026] Next, to associate the document's visual layout information with its text content, the original medical document is first subjected to structural analysis to understand its layout and distinguish different types of areas, such as title areas, paragraph areas, table areas, and image areas. This embodiment employs a deep learning-based layout analysis model, which can perform pixel-level classification of document images and output type masks and boundary coordinates for each area. Through this structural analysis, the complex document layout can be decomposed into several areas with clear semantic functions. For example, in a laboratory report, structural analysis can identify different areas such as the patient information area, the test item table area, the test result area, and the reference range area. To further locate the specific position of each line of text within the identified text areas, after obtaining the boundary coordinates of each text area, the text lines within these areas need to be detected according to their corresponding type. For paragraph areas, text line detection needs to identify continuous, left-to-right text lines; for table areas, it is necessary to combine the results of table structure recognition to locate the text lines within each cell. The text line detection in this embodiment employs a deep learning-based text detection algorithm (e.g., a differentiable binarization network). Taking a differentiable binarization network as an example, it can predict the probability map and threshold map of the text line, generate an accurate binary segmentation map through differentiable binarization operations, and extract the polygonal bounding boxes of the text line based on the segmentation map. 
The coordinates of these bounding boxes are the spatial location information of the text line, typically represented as the x-coordinate of the top-left corner, the y-coordinate of the top-left corner, the width, and the height, or as a sequence of vertex coordinates of the polygon. The obtained spatial location information binds the text content to its physical location in the document, enabling subsequent semantic understanding to utilize page layout information.
[0027] Finally, based on the preset recognition order and the obtained text line spatial location information, the text content within each text line is recognized. The recognition order is typically determined by document reading habits, i.e., from top to bottom and from left to right. Using the vertical and horizontal coordinates of the text line spatial location information, all text lines can be sorted to form a text sequence that conforms to reading logic. For each text line, its image patch is cropped, scaled to a uniform size, and input into an optical character recognition (OCR) model for text recognition. The OCR model can employ a convolutional recurrent neural network (CRNN) structure combined with connectionist temporal classification (CTC). The convolutional layer extracts image features, the recurrent layer processes sequence features and captures contextual dependencies, and the transcription layer decodes the probability distribution output by the recurrent layer into the final text string. This process outputs the text content of each text line. Thus, through three sub-steps (structural analysis, text line detection, and text recognition), structured text data is extracted from the original medical document, and each piece of text data is precisely associated with its spatial location information within the document, i.e., the region it belongs to and the boundary coordinates of its text line.
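To illustrate the transcription step, the following is a minimal sketch of greedy connectionist temporal classification decoding: take the most probable class per time step, collapse consecutive repeats, and drop the blank symbol. The function name `ctc_greedy_decode` and the convention that index 0 is the blank are assumptions for illustration. A production model might use beam search instead.

```python
from typing import List, Sequence


def ctc_greedy_decode(probs: Sequence[Sequence[float]],
                      labels: Sequence[str],
                      blank: int = 0) -> str:
    """Decode per-time-step class probabilities into a string.

    probs[t][k] is the probability of class k at time step t;
    labels[k] is the character for class k (labels[blank] is unused).
    """
    # Best class index at every time step.
    best_path: List[int] = [
        max(range(len(step)), key=step.__getitem__) for step in probs
    ]
    chars: List[str] = []
    previous = blank
    for index in best_path:
        # Emit only when the class changes and is not the blank symbol.
        if index != previous and index != blank:
            chars.append(labels[index])
        previous = index
    return "".join(chars)
```

A blank between two identical classes is what allows a genuinely doubled character to survive the collapse of repeats.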
[0028] Step S2: Based on the spatial location information, perform named entity recognition and relationship analysis on the text data to obtain entity knowledge units.
[0029] It should be understood that step S1 above, through structural analysis, text line detection and character recognition of the original medical document, has transformed the document content into a series of text lines with precise spatial coordinates. Each text line contains the recognized text string, its region type (such as title area, paragraph area, table area), and its boundary coordinates on the document page.
[0030] The text content in a document is usually organized in a specific order, such as from top to bottom or from left to right. This order implies semantic coherence and logical relationships. Disrupting this order will break the contextual dependencies between sentences, making subsequent named entity recognition and relation analysis unable to accurately understand the text's meaning. Therefore, it is necessary to determine the reading order of all text lines based on the ordinate and abscissa values in the spatial location information, and then sort the text data according to the reading order. Specifically, first, compare the ordinate values of each text line; the text line with the smaller ordinate is at the top of the page and should be read first. When the ordinates are similar, then compare the abscissa values; the text line with the smaller abscissa is on the left and should be read first. In this way, text lines scattered throughout the document can be reorganized into a linear text sequence that conforms to the original reading order.
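The sorting rule described above can be sketched as follows. The text does not specify what counts as "similar" ordinates, so the sketch assumes that two lines share a row when their top edges differ by at most half the height of the first line in that row. The `TextLine` type and the `y_tolerance_ratio` parameter are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextLine:
    text: str
    x: float  # abscissa of the top-left corner
    y: float  # ordinate of the top-left corner
    w: float
    h: float


def reading_order(lines: List[TextLine], y_tolerance_ratio: float = 0.5) -> List[TextLine]:
    """Sort text lines top to bottom, then left to right within each row."""
    rows: List[List[TextLine]] = []
    for line in sorted(lines, key=lambda item: item.y):
        # Compare against the first line of the current row (assumed row anchor).
        if rows and abs(line.y - rows[-1][0].y) <= y_tolerance_ratio * rows[-1][0].h:
            rows[-1].append(line)
        else:
            rows.append([line])
    return [line for row in rows for line in sorted(row, key=lambda item: item.x)]
```

Grouping into rows before sorting by abscissa keeps table cells that are misaligned by a pixel or two in their intended left-to-right order.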
[0031] Subsequently, based on preset medical entity types, named entity recognition is performed on the generated text sequence to obtain medical entities and their corresponding entity types. The goal of named entity recognition is to detect medical information fragments with specific meaning in the text sequence and classify these fragments into predefined categories. The preset medical entity types are designed according to the characteristics and application requirements of medical documents, and typically include, but are not limited to, patient name, age, gender, hospital number, examination item name, examination result value, reference range, diagnosis name, drug name, medication dosage, medication frequency, surgery name, and allergens. These entity types cover the core information elements in medical documents. This application embodiment uses a layout-aware pre-trained language model to implement named entity recognition. This model can not only understand the semantics of the text but also utilize the spatial position information of the text lines obtained in step S1 as auxiliary features, thereby more accurately identifying entities that rely on page layout for correct parsing. For example, in a laboratory report, different cells in the same row may correspond to examination items, results, units, and reference ranges, respectively. The layout-aware model can distinguish these different entities through the spatial position relationship of the text lines, avoiding confusion between them. The output of Named Entity Recognition is a series of medical entities and their corresponding entity type labels, with each entity clearly identified.
[0032] Simply identifying isolated entities is insufficient to fully understand the meaning of a document; it is also necessary to know how these entities are related. For example, "blood sugar" and "5.6" are independent words, but through relational analysis, it can be determined that they have a "test item - test result" relationship, thus forming a complete medical statement. The relational analysis operation used in this application relies on spatial location information and semantic context. Spatial location information provides the physical layout relationship of entities in the document, such as whether two entities are located in the same table row, whether they are adjacent in the horizontal direction, or whether they are aligned in the vertical direction. These layout relationships are often important clues to the implicit semantic relationships in medical documents. Semantic context provides the linguistic logical relationship between entities. For example, by analyzing sentence structure and dependency relationships, it can be determined whether two entities have a modifying relationship or a parallel relationship. Combining the two allows for a more accurate inference of the relationships between entities. For example, in a discharge summary, there may be a "treatment" relationship between the disease name in the "diagnosis" area and the drug name in the "treatment" area; in a tabular laboratory report, there must be a "test item - result" relationship between the test items and test results in the same row. This context-aware semantic relationship analysis can connect isolated medical entities to form a knowledge network consisting of entities and their relationships.
[0033] Finally, the medical entities, entity types, and relationships identified in the above process are integrated to form entity knowledge units. An entity knowledge unit is a structured data set that fully describes the medical knowledge fragments extracted from the original medical document. Each entity knowledge unit can contain one or more medical entities and the semantic relationships between them. For example, for a routine blood test report, multiple entity knowledge units can be constructed: one unit describes the "item-result" relationship between "examination item: white blood cell count" and "examination result: 5.6", and another unit describes the "item-result" relationship between "examination item: red blood cell count" and "examination result: 4.2". For a discharge summary, an entity knowledge unit can be constructed describing the "treatment" relationship between "diagnosis: type 2 diabetes" and "treatment medication: xxx". These entity knowledge units transform unstructured medical document content into structured knowledge representations.
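One possible in-memory shape for an entity knowledge unit, together with the "same table row" pairing rule from the relation analysis, is sketched below. The class names, the `row` attribute, and the type labels `examination_item` and `examination_result` are illustrative assumptions. The application does not prescribe a concrete data layout, and a real system would combine this layout cue with semantic context.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class MedicalEntity:
    text: str
    entity_type: str
    row: int  # hypothetical table-row index derived from spatial location


@dataclass
class EntityKnowledgeUnit:
    entities: List[MedicalEntity]
    # (index of head entity, relation label, index of tail entity)
    relations: List[Tuple[int, str, int]]


def pair_items_with_results(entities: List[MedicalEntity]) -> List[EntityKnowledgeUnit]:
    """Link an examination item to the result found in the same table row."""
    by_row: Dict[int, List[MedicalEntity]] = {}
    for entity in entities:
        by_row.setdefault(entity.row, []).append(entity)
    units: List[EntityKnowledgeUnit] = []
    for row in sorted(by_row):
        items = [e for e in by_row[row] if e.entity_type == "examination_item"]
        results = [e for e in by_row[row] if e.entity_type == "examination_result"]
        # Only pair unambiguous rows; anything else is left for other analysis.
        if len(items) == 1 and len(results) == 1:
            units.append(EntityKnowledgeUnit([items[0], results[0]],
                                             [(0, "item-result", 1)]))
    return units
```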
[0034] Step S3: According to the preset standard field mapping rules, the entity knowledge unit is standardized and transformed to generate standard structure data.
[0035] Each entity knowledge unit includes medical entities, the entity type corresponding to each medical entity, and the association relationships between medical entities. The entity types found in medical documents have different characteristics. Some entities have fixed formats and limited value spaces, such as dates, times, numbers, units, ID numbers, and phone numbers. These entities are called standard structure entities. Other entities have diverse expressions and lack a unified format, such as disease names, surgical names, drug trade names, and examination item names. These entities are called non-standard structure entities. For standard structure entities, precise rules can be used for parsing and conversion; for non-standard structure entities, more flexible semantic matching methods are required. In this application, the medical entities in the entity knowledge units are classified according to the preset entity format. The entity format refers to a set of predefined rules for identifying standard structure entities. For example, the format of a date can be "YYYY-MM-DD" or "YYYY/MM/DD", the format of a number can be an integer or a decimal, and the format of a unit can be "mg/dL" or "mmol/L". Each medical entity in the entity knowledge unit is traversed and its text content is matched against the preset entity formats: if the match succeeds, the entity is classified as a standard structure entity; if the match fails, it is classified as a non-standard structure entity. For example, the medical entity "October 5, 2023" conforms to a date format rule and is classified as a standard structure entity, whereas the medical entity "type 2 diabetes" has no fixed format rule and is classified as a non-standard structure entity.
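The classification step can be sketched with regular expressions. The patterns below cover only the numeric date, number, and unit formats named in the paragraph and are illustrative assumptions; a real format set would need further patterns (for example, written-out dates). The names `ENTITY_FORMATS` and `classify_entity` are hypothetical.

```python
import re
from typing import Optional, Tuple

# Illustrative preset entity formats; a real system would carry many more.
ENTITY_FORMATS = {
    "date": re.compile(r"\d{4}[-/]\d{1,2}[-/]\d{1,2}"),
    "number": re.compile(r"[+-]?\d+(\.\d+)?"),
    "unit": re.compile(r"mg/dL|mmol/L"),
}


def classify_entity(text: str) -> Tuple[str, Optional[str]]:
    """Return ("standard", format name) or ("non-standard", None)."""
    # Remove spaces around slashes so "mmol / L" and "mmol/L" match alike.
    normalized = re.sub(r"\s*/\s*", "/", text.strip())
    for name, pattern in ENTITY_FORMATS.items():
        if pattern.fullmatch(normalized):
            return ("standard", name)
    return ("non-standard", None)
```

Using `fullmatch` rather than `search` prevents a free-text entity that merely contains a number, such as "type 2 diabetes", from being treated as a standard structure entity.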
[0036] The standard structure entities obtained above are converted into first standard fields according to the preset standard field mapping rules. The standard field mapping rules are a set of predefined rules for mapping standard structure entities in a specific format to the corresponding fields in the medical data template. These rules are usually implemented as regular expressions. For example, for date entities, a rule can be defined that converts a string in the format "YYYY-MM-DD" to the format "YYYY/MM/DD" and maps it to the "date" field in the medical data template; for numerical entities, a rule can be defined that converts the identified numerical string to a floating-point number and maps it to the "numerical value" field in the medical data template; for unit entities, a rule can be defined that converts "milligrams per deciliter" to the standard unit code "mg/dL" and maps it to the "unit" field in the medical data template. By applying these rules one by one, each standard structure entity is converted into its corresponding standard format and filled into the predefined field position in the medical data template, forming the set of first standard fields.
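A minimal sketch of such mapping rules follows, using the three examples from the paragraph. The rule identifiers (`R-DATE`, `R-NUM`, `R-UNIT`), the function names, and the second unit entry are hypothetical and added only for illustration.

```python
import re
from typing import Any, Callable, Dict, List, Optional, Tuple

Mapped = Optional[Tuple[str, Any]]  # (template field name, converted value)


def map_date(text: str) -> Mapped:
    match = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", text)
    return ("date", "/".join(match.groups())) if match else None


def map_number(text: str) -> Mapped:
    if re.fullmatch(r"[+-]?\d+(\.\d+)?", text):
        return ("numerical value", float(text))
    return None


# Illustrative lookup; only the first entry comes from the text above.
UNIT_CODES = {"milligrams per deciliter": "mg/dL", "millimoles per liter": "mmol/L"}


def map_unit(text: str) -> Mapped:
    code = UNIT_CODES.get(text.strip().lower())
    return ("unit", code) if code is not None else None


RULES: List[Tuple[str, Callable[[str], Mapped]]] = [
    ("R-DATE", map_date), ("R-NUM", map_number), ("R-UNIT", map_unit),
]


def to_first_standard_field(text: str) -> Optional[Dict[str, Any]]:
    """Apply the rules in order and return the first successful mapping."""
    for rule_id, rule in RULES:
        mapped = rule(text)
        if mapped is not None:
            field_name, value = mapped
            # The rule identifier is kept so the mapping can be traced later.
            return {"field": field_name, "value": value, "rule": rule_id}
    return None
```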
[0037] Non-standard structure entities often correspond to medical concepts, and the same concept may have multiple different expressions, so simple string matching cannot handle this diversity. For non-standard structure entities, this application therefore uses a semantic similarity calculation for conversion. First, a medical standard terminology database is constructed in advance. It includes all standard terms that may be used, such as diagnostic codes and their standard names from the International Classification of Diseases, Tenth Revision (ICD-10), laboratory test item codes and their standard names from Logical Observation Identifiers Names and Codes (LOINC), drug codes and their standard names from the Anatomical Therapeutic Chemical (ATC) classification, and unit codes and their standard names from the Unified Code for Units of Measure (UCUM). For each standard term, in addition to its standard name and code, a semantic vector is generated using a pre-trained medical language model. This vector is a numerical representation of the term's semantic information. The semantic similarity calculation used in this application is based on these vectors. When a non-standard structure entity needs to be converted, the text content of the entity is first converted into its semantic vector. Then, the cosine similarity between this vector and the semantic vector of each standard term in the medical standard terminology database is calculated. Cosine similarity measures the directional consistency between two vectors in a multidimensional space. A value closer to 1 indicates that the two terms are semantically closer, while a value closer to 0 indicates that they are less related. The cosine similarity is the dot product of the two vectors divided by the product of their magnitudes. 
After calculating all similarities, the standard term with the highest similarity is selected as the mapping result, and its corresponding code becomes the second standard field. If the highest similarity is found to be below a preset threshold, such as 0.85, it indicates that the mapping result for that entity has low confidence, and the entity needs to be marked as low-confidence and manual review should be triggered.
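A minimal Python sketch of this selection step follows. It assumes the semantic vectors have already been produced by some embedding model; the three-dimensional toy vectors, the function names, and the dictionary layout of the terminology database are illustrative assumptions, not part of the application.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def map_to_standard_term(entity_vec, term_db, threshold=0.85):
    """Pick the standard term whose vector is most similar to entity_vec.

    term_db maps a standard code to its precomputed semantic vector.
    Results below the threshold are flagged for manual review.
    """
    best_code, best_score = None, -1.0
    for code, vec in term_db.items():
        score = cosine_similarity(entity_vec, vec)
        if score > best_score:
            best_code, best_score = code, score
    return {
        "code": best_code,
        "similarity": best_score,
        "needs_review": best_score < threshold,
    }

# Toy vectors only; real vectors would come from a pre-trained medical model.
TERM_DB = {"E11.9": [1.0, 0.0, 0.0], "I10": [0.0, 1.0, 0.0]}
confident = map_to_standard_term([0.9, 0.1, 0.0], TERM_DB)
uncertain = map_to_standard_term([0.6, 0.6, 0.0], TERM_DB)
```

A linear scan is enough for a sketch; a terminology database with many thousands of terms would more plausibly use a vector index, which the application does not discuss.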
[0038] Finally, standard structured data is generated based on a predefined medical data template, the first standard fields, and the second standard fields. The medical data template is a predefined structured data format based on medical data exchange standards, such as the Health Level Seven (HL7) Fast Healthcare Interoperability Resources (FHIR) specification. This template specifies the field names, field types, value constraints, and hierarchical relationships between the fields to be output. For example, a template representing a test report might include fields such as "Patient Identifier," "Test Item Code," "Test Result Value," "Test Result Unit," "Lower Reference Range," "Upper Reference Range," and "Test Time." When generating standard structured data, the first and second standard fields are filled into their corresponding positions in the template according to their field names. Fields required by the template but not extracted from the entity knowledge unit are left blank and marked as missing. At the same time, the source and confidence information of each field mapping is recorded. For example, for rule-mapped fields, the identifier of the rule used is recorded; for semantically mapped fields, the similarity score and the matched standard terminology code are recorded. This information forms a mapping chain.
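The template-filling step and its mapping chain can be sketched as follows. The field names are taken from the test report example above; the input shapes (a tuple of value and rule identifier for rule-mapped fields, a tuple of code and similarity for semantically mapped fields) and the function name `fill_template` are assumptions for illustration.

```python
TEMPLATE_FIELDS = [
    "Patient Identifier", "Test Item Code", "Test Result Value",
    "Test Result Unit", "Lower Reference Range", "Upper Reference Range",
    "Test Time",
]

def fill_template(first_fields, second_fields):
    """Fill the template and record how each field was obtained.

    first_fields:  {field name: (value, rule_id)}      rule-mapped fields
    second_fields: {field name: (code, similarity)}    semantically mapped fields
    """
    data, missing, mapping_chain = {}, [], []
    for name in TEMPLATE_FIELDS:
        if name in first_fields:
            value, rule_id = first_fields[name]
            data[name] = value
            mapping_chain.append(
                {"field": name, "source": "rule", "rule_id": rule_id})
        elif name in second_fields:
            code, similarity = second_fields[name]
            data[name] = code
            mapping_chain.append(
                {"field": name, "source": "semantic",
                 "code": code, "similarity": similarity})
        else:
            # Required by the template but not extracted: blank and flagged.
            data[name] = None
            missing.append(name)
    return {"data": data, "missing": missing, "mapping_chain": mapping_chain}

record = fill_template(
    {"Test Result Value": (8.5, "R-NUM"), "Test Time": ("2026/03/25", "R-DATE")},
    {"Test Item Code": ("HYPOTHETICAL-CODE", 0.93)},
)
```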
[0039] Step S4: Perform multi-level verification on the standard structure data using a preset verification method to generate standard medical document data.
[0040] While the standard structured data obtained above is standardized in format, the accuracy and medical rationality of its content have not yet been verified. For example, a set of standard structured data may contain obviously contradictory field combinations such as "Patient Gender: Male" and "Diagnosis Code: N97.9 (Female Infertility)," or abnormal data such as "Fasting Blood Glucose: 15.6 mmol / L," which is a medically critical value. If these errors or anomalies are not detected, they will seriously affect the reliability of subsequent clinical decisions and scientific research analyses. Therefore, the goal of step S4 is to perform multi-level verification on the standard structured data, using a combination of rule-based verification, medical history correlation verification, and manual review to identify and correct errors in the data, ultimately generating high-confidence standard medical document data.
[0041] First, the standard structured data undergoes rule validation based on pre-defined medical logic rules and relationships. These medical logic rules are deterministic knowledge extracted from authoritative sources such as medical guidelines, clinical pathways, drug instructions, and hospital management regulations. They are expressed in a first-order logic form: "If the condition is true, then the conclusion is true." For example, "If the patient's gender is 'female,' the diagnosis cannot contain codes related to prostatitis," "If the patient is under 18 years old, the diagnosis cannot contain codes related to Alzheimer's disease," and "If the test item is 'fasting blood glucose,' and the result is less than 3.9 mmol / L, then it is marked as hypoglycemia." These rules constitute the "hard constraints" for data validation; any data that violates these rules is highly likely to be erroneous. The rule validation process involves traversing each field in the standard structured data and matching the field values against each rule in the predefined rule base. When the condition part of a rule is satisfied by the current data, its conclusion part is checked. If the conclusion part is not true, the data is determined to violate the rule, marked as rule-violation data, and the rule identifier, violating field, and violation description are recorded. Meanwhile, the relationships between entities obtained in step S2 are also used to assist in rule validation. For example, the "check item - check result" relationship can be used to locate specific numerical fields, and the values can be compared with threshold rules in the rule base to determine whether the values are abnormal. For example, for a set of standard structured data containing "check item: glycated hemoglobin" and "check result: 8.5%", rule validation will trigger the rule "glycated hemoglobin higher than 7.0% indicates poor blood sugar control", marking the data as rule violation data.
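The rule validation loop can be sketched in Python as below. Each rule is modeled as a condition and a conclusion; a record violates the rule when the condition holds but the conclusion does not. The two rules reuse examples from the text (the male patient with code N97.9, and the 7.0% glycated hemoglobin threshold); the rule identifiers, field names, and the `Rule` class itself are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Rule:
    rule_id: str
    field: str
    description: str
    condition: Callable[[dict], bool]   # "if" part
    conclusion: Callable[[dict], bool]  # "then" part that must hold

# Hypothetical rule base with two entries drawn from the examples in the text.
RULE_BASE = [
    Rule("G-001", "diagnosis_code",
         "male patient cannot carry a female infertility code",
         condition=lambda d: d.get("gender") == "male",
         conclusion=lambda d: d.get("diagnosis_code") != "N97.9"),
    Rule("T-001", "result_percent",
         "glycated hemoglobin higher than 7.0% indicates poor blood sugar control",
         condition=lambda d: d.get("test_item") == "glycated hemoglobin"
         and "result_percent" in d,
         conclusion=lambda d: d["result_percent"] <= 7.0),
]

def validate(data):
    """Return one violation record per rule whose condition holds
    while its conclusion fails."""
    violations = []
    for rule in RULE_BASE:
        if rule.condition(data) and not rule.conclusion(data):
            violations.append({
                "rule_id": rule.rule_id,
                "field": rule.field,
                "description": rule.description,
            })
    return violations
```

For example, `validate({"test_item": "glycated hemoglobin", "result_percent": 8.5})` yields a single violation carrying rule `T-001`, matching the glycated hemoglobin example in the paragraph.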
[0042] Simultaneously, based on the user identifier corresponding to the original medical document, historical user data needs to be retrieved from the preset user database to perform medical history association verification between the standard structured data and the historical user data, and to record abnormal data. The medical history association verification method used in this application utilizes the longitudinal historical data of the same patient to verify the rationality of the current data. Original medical documents typically contain information that uniquely identifies the patient, such as ID card number, medical insurance card number, hospital visit ID, etc., which are referred to as user identifiers. In step S1 or step S2, user identifiers can be extracted from the document through named entity recognition. The preset user database is a dedicated repository for storing historical medical data, containing high-confidence medical document data of the patient that has undergone the same process and verification in the past. This data is organized in a structured form, including past medical history codes, allergy history codes, all previous test and examination items and their results, medication history records, basic physiological information, etc. The first step of the medical history association verification is to retrieve all historical user data of the patient from the user database based on the user identifier. If the patient's historical data is not found in the database, this verification step is skipped. If historical data exists, medical history association verification rules are constructed based on medical logic and clinical treatment guidelines. These rules are also divided into hard rules and soft rules. 
Hard rules are used to detect situations where there is an absolute contradiction with historical data, such as "historical allergy history contains penicillin codes, current data cannot contain penicillin drug codes" or "historical blood type record is type O, current data cannot contain type A, type B, or type AB". Soft rules are used to detect situations where there is an abnormal trend compared with historical data, such as "historical glycated hemoglobin values range from 5.0% to 5.5%, current data if glycated hemoglobin result is greater than 7.0%, marked as a numerical mutation" or "historical average systolic blood pressure is 120±5 mmHg, current data if systolic blood pressure is greater than 160 mmHg, marked as abnormally high". Each medical history association verification rule is iterated by comparing the fields in the current standard structure data with the corresponding fields in the historical user data item by item. If the current data triggers a hard rule, it is determined as a significant medical history anomaly, and the anomaly type, contradictory field, rule identifier, and contradictory description are recorded. If the current data triggers a soft rule, it is judged as an abnormal medical history trend, and the abnormal field, historical value range, current value, and degree of abnormality are recorded. For cases with obvious abnormal medical history and a unique correct result, automatic and accurate error correction can be performed based on historical data. For example, the blood type field in the current data can be directly corrected to a historical blood type, and the source of the error, the original value, and the corrected value are recorded. For cases without a unique correct result, it is marked as abnormal indicator data, which requires manual review.
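A minimal Python sketch of the hard-rule and soft-rule comparison against historical data follows. It hard-codes three of the examples given above (blood type, allergy history, glycated hemoglobin trend); the field names, the placeholder drug code, and the function name `verify_against_history` are assumptions, and a real rule set would be data-driven rather than written inline.

```python
def verify_against_history(current, history):
    """Compare current structured data with a patient's historical data.

    Returns (corrected_data, findings). Only the blood type contradiction
    has a unique correct result, so only it is corrected automatically.
    """
    corrected = dict(current)
    findings = []

    # Hard rule: blood type cannot differ from the historical record.
    old_type, new_type = history.get("blood_type"), current.get("blood_type")
    if old_type and new_type and old_type != new_type:
        corrected["blood_type"] = old_type
        findings.append({"level": "hard", "field": "blood_type",
                         "original": new_type, "corrected": old_type,
                         "action": "auto_corrected"})

    # Hard rule: a drug the patient is allergic to must not be prescribed.
    clashes = set(history.get("allergy_codes", [])) & set(current.get("drug_codes", []))
    for code in sorted(clashes):
        findings.append({"level": "hard", "field": "drug_codes",
                         "value": code, "action": "manual_review"})

    # Soft rule: sudden jump in glycated hemoglobin compared with history.
    past = history.get("hba1c_values", [])
    value = current.get("hba1c")
    if past and value is not None and max(past) <= 5.5 and value > 7.0:
        findings.append({"level": "soft", "field": "hba1c",
                         "historical_range": (min(past), max(past)),
                         "current": value, "action": "manual_review"})
    return corrected, findings

history = {"blood_type": "O", "allergy_codes": ["DRUG-PENICILLIN"],
           "hba1c_values": [5.0, 5.3, 5.5]}
current = {"blood_type": "A", "drug_codes": ["DRUG-PENICILLIN"], "hba1c": 8.5}
corrected, findings = verify_against_history(current, history)
```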
[0043] Finally, the review method adopted in this application refers to the construction of a manual review workbench. This workbench can uniformly collect data marked as suspected errors, suspected anomalies, low-confidence mappings, medical history contradictions requiring manual review, and abnormal medical history trends (i.e., rule violation data and abnormal indicator data) from rule validation and medical history correlation validation, forming a review queue. This data is sent to the target device, i.e., the reviewer's terminal device, such as a personal computer or mobile workstation. The review workbench displays detailed information for each piece of data to be reviewed in a user-friendly interface, including thumbnails or screenshots of key areas of the original document, the current standard structure data, warning messages triggered by rule validation, contradictions or anomalies found in medical history correlation validation, and candidate mapping results for low-confidence fields. Reviewers can view the original context and confirm, correct, or supplement the data. Every correction operation by the reviewer, such as changing an incorrect diagnostic code to a correct code, correcting an abnormal test result value to the correct value in the original document, or confirming the correct mapping result of a low-confidence field, will be recorded in detail by the system, forming a set of feedback data (i.e., test data) containing error samples and correct labels. This test data will be periodically used to optimize the parameters of the named entity recognition and relation analysis model in step S2 and the semantic mapping model in step S3, forming a closed loop of continuous improvement. 
Ultimately, only data that passes rule validation and medical history association validation without any warnings, or data that, although flagged, is confirmed correct after manual review, will be marked as high-confidence standard medical document data and used as the final output. Simultaneously, all validation and review process information, including rule trigger records, anomaly scores, reviewer identifiers, modified content, and modification timestamps, will be recorded in detail and integrated into a complete quality control log.
[0044] In the medical field, the credibility and traceability of data are crucial, directly impacting legal compliance, medical liability determination, and data auditing requirements. For example, medical data used in clinical research, if its source and processing cannot be traced, will not be recognized by regulatory agencies and academic journals. Therefore, it is necessary to build an immutable and fully traceable archive for every standard medical document, recording the entire lifecycle of the data from the original document to the final output, ensuring that the data can be verified and audited at any time.
[0045] First, a hash calculation is performed on the standard medical document data to generate the current hash value corresponding to the standard medical document data. Then, based on the user identifier, the historical hash chain corresponding to the historical user data is retrieved from the preset user database. A hash calculation is a mathematical function that maps data of arbitrary length to a fixed-length string; its output is called a hash value or message digest. Cryptographic hash functions are one-way and collision-resistant, meaning that the original data cannot feasibly be deduced from the hash value, and different data are extremely unlikely to produce the same hash value. Therefore, the hash value can serve as a digital fingerprint of the data; any small modification to the original data will cause a significant change in the hash value. The standard medical document data generated in step S4 is input into a secure hash function, such as a 256-bit secure hash algorithm. This function processes the data content and outputs a 256-bit binary number, usually represented as a 64-character hexadecimal string. This string is the current hash value, which identifies the content of this standard medical document data. At the same time, based on the user identifier already used in step S4, such as the patient's ID number or hospital visit ID, a search is performed in the preset user database. The user database not only stores patients' historical medical data but also maintains a historical hash chain for each patient. The historical hash chain is a chain-like structure composed of the hash values of all the patient's historical medical documents linked together in chronological order; each new piece of data adds its hash value to the end of the chain. The purpose of the historical hash chain is to protect the integrity of all the patient's medical data, ensuring that any tampering with historical data can be detected.
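As a minimal sketch of the fingerprinting step, the following Python code hashes a structured record with SHA-256 from the standard library. Serializing the record as JSON with sorted keys is an assumption made here so that the same content always yields the same digest; the application does not state how the data is serialized before hashing.

```python
import hashlib
import json

def document_hash(document: dict) -> str:
    """Return the SHA-256 digest of a record as 64 hexadecimal characters."""
    # Canonical form: sorted keys and fixed separators, so key order and
    # whitespace cannot change the digest.
    canonical = json.dumps(document, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

original = {"patient_id": "P-0001", "hba1c": 8.5}
reordered = {"hba1c": 8.5, "patient_id": "P-0001"}
modified = {"patient_id": "P-0001", "hba1c": 8.6}
```

Here `original` and `reordered` produce the same digest, while `modified` produces a different one, which is the property the traceability step relies on.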
[0046] Subsequently, the integrity of the historical hash chain is verified to ensure the reliability of data traceability. The historical hash chain is typically associated with a digital signature or a periodically published chain head hash value. For example, the system can digitally sign the latest chain head hash value daily using the private key in a hardware security module and store the signature value with a trusted third party or in local encrypted storage. During integrity verification, the entire hash chain is recalculated from its first entry to its current last entry to obtain a recomputed chain head hash value. Then, the corresponding public key is used to verify the previously stored digital signature, confirming whether the recomputed chain head hash value matches the signed value. If they match, the historical hash chain has not been tampered with since the last signature, and the integrity verification passes. When the historical hash chain passes verification, the newly generated current hash value is appended to the end of the existing historical hash chain, forming a new, longer hash chain. Specifically, the current hash value is concatenated with the last hash value of the historical hash chain (i.e., the hash value of the previous data), and the concatenated string is hashed again to obtain a new hash value. This new hash value becomes the new chain head of the current hash chain. In this way, the hash value of the new data is tightly linked to the historical data chain. Any modification to the historical data will cause all subsequent hash values to change and will therefore be detected. After the concatenation is complete, the new chain head hash value is used to update the historical hash chain stored in the user database, ensuring that the patient's hash chain is always up to date.
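The verify-then-append procedure can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the digital signature check is replaced by a direct comparison with a trusted stored chain head value, the chain starts from an all-zero genesis value, and the previous head is placed before the current hash when concatenating. None of these details are fixed by the application.

```python
import hashlib

GENESIS = "0" * 64  # assumed starting value for an empty chain

def link(previous_head: str, current_hash: str) -> str:
    """Concatenate the previous chain head with the new hash and hash again."""
    return hashlib.sha256((previous_head + current_hash).encode("ascii")).hexdigest()

def chain_head(document_hashes):
    """Recompute the chain head from the first entry to the last."""
    head = GENESIS
    for doc_hash in document_hashes:
        head = link(head, doc_hash)
    return head

def verify_and_append(document_hashes, trusted_head, current_hash):
    """Verify the historical chain, then append the new document hash.

    trusted_head stands in for the signed chain head value; a real system
    would verify a digital signature over it instead of comparing directly.
    """
    if chain_head(document_hashes) != trusted_head:
        raise ValueError("historical hash chain failed integrity verification")
    return document_hashes + [current_hash], link(trusted_head, current_hash)
```

Because each head depends on every earlier entry, reordering or altering any historical hash changes the recomputed head and causes `verify_and_append` to raise.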
[0047] Finally, according to the preset log format, standard medical document data and historical user data are constructed into a document log. The historical hash chain is then associated with the document log for storage, generating a traceability log. The preset log format is a predefined data structure used to systematically record the entire lifecycle information of data. The document log typically contains three types of information: source information, processing information, and quality control information. Source information includes the 256-bit hash value of the original document using a secure hash algorithm, the document's upload time, and the uploader's anonymization identifier. Processing information includes the geometric correction method and parameters used in step S1, the version number and confidence level of the named entity recognition and relational analysis model used in step S2, the version of the mapping rule or semantic similarity score used in step S3, the rule identifier triggered in step S4, the anomaly detection score, the reviewer's identifier, the modified content, and the timestamp. Quality control information includes the final confidence label of the data and the quality control completion time. When constructing the document log, all the above information is organized according to the preset format to form a complete log record. Then, the currently generated document log is associated with the previously updated historical hash chain for storage. The associated storage method can be to add a field to the document log to record the corresponding chain head hash value, or to record the corresponding document log storage location in the hash chain's metadata. 
Through this association, when verifying the authenticity of a standard medical document, one can first find the corresponding position in the hash chain based on its hash value to verify the chain's integrity, and then find the corresponding document log based on the association information to view the detailed processing procedure. The resulting traceability log is a comprehensive archive containing the data itself, data processing history, and data integrity verification information. This archive ensures that every step of every piece of medical data, from the original document input to the final structured output, is clearly traceable and tamper-proof, providing a solid technical guarantee for data legal compliance, medical liability determination, and data auditing.
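One possible shape for the document log, using the first association method (a field in the log that records the chain head hash value), is sketched below in Python. The class name, field names, and sample values are all hypothetical; the application only specifies the three categories of information and the two association options.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class DocumentLog:
    source: dict           # original document hash, upload time, uploader id
    processing: dict       # per-step methods, model versions, rule ids, scores
    quality_control: dict  # final confidence label, completion time
    chain_head_hash: str   # association with the updated hash chain

def serialize(log: DocumentLog) -> str:
    """Serialize the log deterministically so it can be stored or hashed."""
    return json.dumps(asdict(log), sort_keys=True, ensure_ascii=False)

def find_log(logs, chain_head_hash):
    """Look up the document log associated with a given chain head value."""
    for log in logs:
        if log.chain_head_hash == chain_head_hash:
            return log
    return None

log = DocumentLog(
    source={"original_sha256": "ab" * 32, "uploaded_at": "2026-03-25T09:00:00Z",
            "uploader": "anon-17"},
    processing={"s3_rule_ids": ["R-DATE"], "s4_rule_ids": ["T-001"]},
    quality_control={"confidence": "high", "completed_at": "2026-03-25T09:05:00Z"},
    chain_head_hash="cd" * 32,
)
```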
[0048] This application applies to the field of medical data structuring technology. It extracts text data and corresponding spatial location information from original medical documents through text recognition. Based on the spatial location information, it performs named entity recognition and relationship analysis on the text data to obtain entity knowledge units. These entity knowledge units are then standardized and transformed according to standard field mapping rules to generate standard structured data. Finally, multi-level verification is performed on the standard structured data to generate standard medical document data. This application, through the identification, parsing, semantic extraction, standard mapping, and multi-level verification of original medical documents, forms a closed-loop processing flow from unstructured information to standard medical document data, improving the accuracy, standardization level, and data reliability of medical document structuring.
[0049] Figure 2 is a functional block diagram of a medical document structuring processing device provided in an embodiment of this application.
[0050] In some embodiments, the medical document structuring processing device 2 may include multiple functional modules composed of computer program segments. The computer program of each program segment in the medical document structuring processing device 2 may be stored in the memory of a server and executed by at least one processor to perform the functions of the medical document structuring processing method (see the description of Figure 1 for details).
[0051] In this embodiment, the medical document structuring processing device 2 can be divided into multiple functional modules according to its functions. These functional modules may include: a text recognition module 21, an entity analysis module 22, a structural standard module 23, a multi-level verification module 24, and a traceability verification module 25. The module referred to in this invention is a series of computer program segments that can be executed by at least one processor and perform a fixed function, stored in memory. In this embodiment, the functions of each module will be detailed in subsequent embodiments.
[0052] In an optional implementation, when the received data is image data, the text recognition module 21 is used to: perform edge detection on the acquired image data to obtain the vertex coordinates of the document region in the image data; calculate a perspective transformation matrix based on these vertex coordinates and the vertex coordinates of a preset target rectangle; and geometrically correct the image data according to the perspective transformation matrix to generate a corrected standard document image, which is then used as the original medical document.
[0053] In an optional implementation, the text recognition module 21 is specifically used to: perform structural analysis on the original medical documents to identify the text regions in each document and determine the boundary coordinates of each region; detect the text lines in each text region based on the type of the text region and its boundary coordinates, to obtain the spatial location information of each text line in each text region; and recognize the text content within each text line based on a preset recognition order and the spatial location information, to extract the text data in each text region.
[0054] In an optional implementation, the entity analysis module 22 is used to: determine the reading order of the text lines based on the spatial location information, and sort the text data according to the reading order to generate a text sequence; perform named entity recognition on the text sequence based on preset medical entity types to obtain the medical entities and their corresponding entity types; and perform context-aware semantic relationship analysis on the medical entities based on the medical entities, the entity types, and the spatial location information, to obtain the association relationships between the medical entities.
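As a rough illustration of how a reading order might be derived from spatial location information, the following Python sketch groups text lines into rows by vertical position and orders each row from left to right. It assumes a single-column layout, top-left pixel coordinates, and a fixed row tolerance; these are simplifying assumptions for illustration and not the method defined by the application.

```python
def reading_order(lines, row_tolerance=10):
    """Order text lines top to bottom, then left to right within a row.

    lines: list of dicts with 'text', 'x', 'y' (top-left corner, in pixels).
    Lines whose y differs from the first line of the current row by at most
    row_tolerance are treated as belonging to the same row.
    """
    ordered = sorted(lines, key=lambda line: (line["y"], line["x"]))
    rows, current = [], []
    for line in ordered:
        if current and abs(line["y"] - current[0]["y"]) > row_tolerance:
            rows.append(sorted(current, key=lambda item: item["x"]))
            current = []
        current.append(line)
    if current:
        rows.append(sorted(current, key=lambda item: item["x"]))
    return [item["text"] for row in rows for item in row]

sample = [
    {"text": "8.5%", "x": 300, "y": 98},
    {"text": "Glycated hemoglobin", "x": 50, "y": 100},
    {"text": "Test report", "x": 50, "y": 20},
]
sequence = reading_order(sample)
```

The tolerance matters for table rows: the result value sits two pixels higher than its item name in the sample, yet both are read as one row, with the item name first.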
[0055] In an optional implementation, the structural standard module 23 is used to: classify the medical entities in the entity knowledge unit according to a preset entity format to obtain standard structure entities and non-standard structure entities; convert the standard structure entities into first standard fields according to the preset standard field mapping rules; calculate the similarity between each non-standard structure entity and each field in the medical standard terminology database of the standard field mapping rules, and convert the non-standard structure entity into the second standard field corresponding to the highest similarity; and generate standard structured data based on a predefined medical data template, the first standard fields, and the second standard fields.
[0056] In an optional implementation, the multi-level verification module 24 is used to: validate the standard structure data according to preset medical logic rules and the association relationships to obtain rule violation data; obtain historical user data from the preset user database based on the user identifier corresponding to the original medical document, perform medical history association verification between the standard structure data and the historical user data, and record abnormal indicator data; and send the rule violation data and the abnormal indicator data to the target device through a preset review method, so as to correct the standard structure data according to the inspection data fed back by the target device and generate standard medical document data.
[0057] In an optional embodiment, the medical document structuring processing device 2 further includes a traceability verification module 25, which is used to: perform a hash calculation on the standard medical document data to generate the current hash value corresponding to the standard medical document data, and obtain the historical hash chain corresponding to the historical user data from the preset user database according to the user identifier; verify the integrity of the historical hash chain and, when the historical hash chain passes the verification, concatenate the current hash value with the historical hash chain to form the current hash chain and update the historical hash chain according to the current hash chain; and construct a document log from the standard medical document data and the historical user data according to the preset log format, and store the historical hash chain in association with the document log to generate a traceability log.
[0058] It should be understood that the various variations and specific embodiments of the methods provided in the above embodiments are also applicable to the medical document structuring processing device of this embodiment. Through the foregoing detailed description of the medical document structuring processing method, those skilled in the art can clearly understand the implementation method of the medical document structuring processing device in this embodiment. For the sake of brevity, it will not be described in detail here.
[0059] Figure 3 is a structural schematic diagram of an electronic device provided in an embodiment of this application.
[0060] In a preferred embodiment of the present invention, the electronic device 3 may include, but is not limited to, a memory 31, at least one processor 32, and at least one communication bus 33.
[0061] Those skilled in the art should understand that the structure of the electronic device 3 shown in Figure 3 does not constitute a limitation of the embodiments of the present invention. The electronic device 3 may also include more or fewer hardware or software components than shown, or a different arrangement of components.
[0062] In some embodiments, the electronic device 3 is a device capable of automatically performing numerical calculations and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), and embedded devices.
[0063] It should be noted that the electronic device 3 is merely an example. Other existing or future electronic products that are suitable for this application should also be included within the scope of protection of this application and are incorporated herein by reference.
[0064] In some embodiments, the memory 31 stores a computer program that, when executed by the at least one processor 32, implements all or part of the steps in the medical document structuring processing method described above. The memory 31 includes read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), one-time programmable read-only memory (OTPROM), electrically-erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disc storage, disk storage, magnetic tape storage, or any other computer-readable medium capable of carrying or storing data. Further, the computer-readable storage medium may primarily include a program storage area and a data storage area, wherein the program storage area may store an operating system, at least one application program required for a function, etc.
[0065] In some embodiments, the at least one processor 32 is the control unit of the electronic device 3, connecting various components of the electronic device 3 via various interfaces and lines. It executes programs or modules stored in the memory 31 and calls data stored in the memory 31 to perform various functions and process data. For example, when the at least one processor 32 executes a computer program stored in the memory 31, it implements all or part of the steps of the medical document structuring processing method described in this application embodiment; or it implements all or part of the functions of the medical document structuring processing device. The at least one processor 32 may be composed of integrated circuits, such as a single-packaged integrated circuit or multiple integrated circuits with the same or different functions, including combinations of one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and various control chips.
[0066] In some embodiments, the at least one communication bus 33 is configured to enable communication between the memory 31 and the at least one processor 32, etc. Although not shown, the electronic device 3 may also include a power supply (e.g., a battery) to power the various components. Preferably, the power supply can be logically connected to the at least one processor 32 via a power management device, thereby enabling functions such as charging, discharging, and power consumption management. The power supply may also include one or more DC or AC power supplies, recharging devices, power fault detection circuits, power converters or inverters, power status indicators, and other arbitrary components. The electronic device 3 may also include various sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be described in detail here.
[0067] The integrated unit implemented as a software functional module described above can be stored in a computer-readable storage medium. This software functional module, stored in a storage medium, includes several instructions to cause an electronic device (which may be a personal computer, electronic device, or network device, etc.) or processor to execute portions of the methods described in the various embodiments of this application.
[0068] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of modules is only a logical functional division, and other division methods may be used in actual implementation.
[0069] The modules described as separate components may or may not be physically separate. The components shown as modules may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.
[0070] The above are all preferred embodiments of this application, and are not intended to limit the scope of protection of this application. Therefore, all equivalent changes made in accordance with the structure, shape and principle of this application should be covered within the scope of protection of this application.
Claims
1. A method for structuring medical documents, characterized in that, The method includes: The original medical documents are subjected to text recognition to extract text data and corresponding spatial location information from each document. Based on the spatial location information, the text data is subjected to named entity recognition and relationship analysis to obtain entity knowledge units; According to the preset standard field mapping rules, the entity knowledge units are standardized and transformed to generate standard structure data; The standard structure data is subjected to multi-level verification using a preset verification method to generate standard medical document data.
2. The medical document structuring method according to claim 1, characterized in that, When the received data is image data, the method further includes the following steps before performing text recognition: Edge detection is performed on the acquired image data to obtain the vertex coordinates of the document region in the image data; Calculate the perspective transformation matrix based on the vertex coordinates and the vertex coordinates in the preset target rectangle; The image data is geometrically corrected according to the perspective transformation matrix to generate a corrected standard document image, which is then used as the original medical document.
3. The medical document structuring method according to claim 1, characterized in that, The step of performing text recognition on the acquired original medical documents and extracting text data and corresponding spatial location information from each document includes: Perform structural analysis on the original medical documents to identify text regions in each document and determine the boundary coordinates of each region; Based on the type corresponding to the text region and the boundary coordinates, the text lines in each text region are detected to obtain the spatial location information of each text line in each text region. Based on the preset recognition order and the spatial location information, the text content within the text line is recognized, and text data in each text region is extracted.
4. The medical document structuring processing method according to claim 3, wherein the entity knowledge unit includes medical entities, entity types, and the association relationships between medical entities, characterized in that the step of performing named entity recognition and relationship analysis on the text data based on the spatial location information to obtain entity knowledge units comprises: determining, based on the spatial location information, the reading order of the text lines, and sorting the text data according to the reading order to generate a text sequence; performing named entity recognition on the text sequence based on preset medical entity types to obtain medical entities and their corresponding entity types; and performing context-aware semantic relationship analysis on the medical entities, based on the medical entities, the entity types, and the spatial location information, to obtain the association relationships between the medical entities.
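The reading-order step in claim 4 can be sketched as follows. The claim does not fix a sorting rule; this example assumes a simple top-to-bottom, left-to-right order in which text lines whose vertical centers lie close together are treated as one row. The `TextLine` class, the `reading_order` function, and the `row_tolerance` parameter are hypothetical names used for illustration only.

```python
from dataclasses import dataclass


@dataclass
class TextLine:
    text: str
    x: float       # left edge of the bounding box
    y: float       # top edge of the bounding box
    height: float  # bounding-box height


def reading_order(lines, row_tolerance=0.5):
    """Sort text lines top-to-bottom, then left-to-right within each row.

    Two lines share a row when their vertical centers differ by at most
    row_tolerance times the height of the line being placed.
    """
    by_center = sorted(lines, key=lambda l: l.y + l.height / 2)
    rows = []
    for line in by_center:
        center = line.y + line.height / 2
        if rows and abs(center - rows[-1]["center"]) <= row_tolerance * line.height:
            rows[-1]["items"].append(line)
        else:
            rows.append({"center": center, "items": [line]})
    ordered = []
    for row in rows:
        ordered.extend(sorted(row["items"], key=lambda l: l.x))
    return ordered


# Hypothetical recognition output from a two-column laboratory report.
lines = [
    TextLine("5.6 mmol/L", x=200, y=103, height=20),
    TextLine("Patient: A", x=10, y=50, height=20),
    TextLine("Glucose", x=10, y=100, height=20),
    TextLine("Age: 45", x=200, y=48, height=20),
]
text_sequence = [line.text for line in reading_order(lines)]
```

Multi-column pages or tables would need a layout-aware ordering; the row grouping here only handles slightly misaligned lines on the same row.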
5. The medical document structuring processing method according to claim 1, characterized in that the step of standardizing and transforming the entity knowledge units according to preset standard field mapping rules to generate standard structure data comprises: classifying the medical entities in the entity knowledge units according to a preset entity format to obtain standard structure entities and non-standard structure entities; converting each standard structure entity into a first standard field according to the preset standard field mapping rules; calculating the similarity between each non-standard structure entity and each field of the medical standard terminology database in the standard field mapping rules, and converting the non-standard structure entity into the second standard field with the highest similarity; and generating the standard structure data based on a preset medical data template, the first standard field, and the second standard field.
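The similarity-based conversion in claim 5 can be illustrated with a short sketch. The claim does not name a similarity measure; this example assumes character-level matching with `difflib.SequenceMatcher` from the Python standard library. The `TERMINOLOGY` list stands in for the medical standard terminology database, and `map_to_standard_field` is a hypothetical helper.

```python
from difflib import SequenceMatcher

# Hypothetical excerpt of a medical standard terminology database.
TERMINOLOGY = [
    "Hemoglobin",
    "White blood cell count",
    "Platelet count",
    "Fasting blood glucose",
]


def similarity(a, b):
    """Case-insensitive similarity ratio in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def map_to_standard_field(entity, terminology):
    """Return the terminology field most similar to the entity, with its score."""
    scored = [(similarity(entity, term), term) for term in terminology]
    best_score, best_term = max(scored)
    return best_term, best_score


field, score = map_to_standard_field("haemoglobin", TERMINOLOGY)
```

A production system would likely reject matches below a minimum score and route them to review instead of always accepting the top candidate; the claim itself only requires selecting the field with the highest similarity.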
6. The medical document structuring processing method according to claim 4, characterized in that the step of performing multi-level verification on the standard structure data using a preset verification method to generate standard medical document data comprises: validating the standard structure data according to preset medical logic rules and the association relationships to obtain rule violation data; obtaining historical user data from a preset user database based on the user identifier corresponding to the original medical document, performing medical history association verification between the standard structure data and the historical user data, and recording abnormal indicator data; and sending the rule violation data and the abnormal indicator data to a target device through a preset review method, so that the standard structure data is corrected according to the inspection data fed back by the target device to generate the standard medical document data.
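The first two verification levels in claim 6 can be sketched as below. The rules, field names, and the 50% deviation threshold are all assumptions made for illustration; the claim only requires preset medical logic rules and a comparison against historical user data. Sending the result to a target device for manual review is not shown.

```python
from datetime import date

# Hypothetical medical logic rules: (rule name, predicate that is True on violation).
RULES = [
    ("discharge_before_admission",
     lambda r: r["discharge_date"] < r["admission_date"]),
    ("pregnancy_diagnosis_for_male",
     lambda r: r["sex"] == "male" and "pregnancy" in r["diagnosis"].lower()),
]


def find_rule_violations(record):
    """Return the names of all rules the record violates."""
    return [name for name, is_violated in RULES if is_violated(record)]


def find_abnormal_indicators(record, history, max_relative_change=0.5):
    """Flag indicators that deviate strongly from the mean of past values."""
    abnormal = []
    for name, value in record["indicators"].items():
        past = [h["indicators"][name] for h in history if name in h["indicators"]]
        if not past:
            continue  # no history for this indicator, nothing to compare against
        baseline = sum(past) / len(past)
        if baseline and abs(value - baseline) / abs(baseline) > max_relative_change:
            abnormal.append(name)
    return abnormal


def build_review_task(record, history):
    """Bundle both verification results for the manual review step."""
    return {
        "rule_violations": find_rule_violations(record),
        "abnormal_indicators": find_abnormal_indicators(record, history),
    }


record = {
    "admission_date": date(2026, 1, 10),
    "discharge_date": date(2026, 1, 5),
    "sex": "female",
    "diagnosis": "Type 2 diabetes",
    "indicators": {"glucose": 15.0, "hemoglobin": 130},
}
history = [
    {"indicators": {"glucose": 6.0, "hemoglobin": 128}},
    {"indicators": {"glucose": 6.4}},
]
task = build_review_task(record, history)
```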
7. The medical document structuring processing method according to claim 6, characterized in that the method further comprises: hashing the standard medical document data to generate a current hash value corresponding to the standard medical document data, and obtaining the historical hash chain corresponding to the historical user data from the preset user database according to the user identifier; verifying the integrity of the historical hash chain and, when the historical hash chain passes the verification, concatenating the current hash value with the historical hash chain to form a current hash chain, and updating the historical hash chain according to the current hash chain; and constructing a document log from the standard medical document data and the historical user data according to a preset log format, and storing the historical hash chain in association with the document log to generate a traceability log.
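The hash chain handling in claim 7 can be illustrated with a minimal sketch. The claim does not specify the hash function or the chain layout; this example assumes SHA-256 and a chain in which every entry stores the document hash plus a link value derived from the previous link. The names `GENESIS`, `hash_document`, `link`, `verify_chain`, and `append_document` are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting link for an empty chain


def hash_document(document):
    """Hash a document dict using a stable JSON serialization."""
    payload = json.dumps(document, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def link(prev_link, doc_hash):
    """Derive the next chain link from the previous link and the document hash."""
    return hashlib.sha256((prev_link + doc_hash).encode("ascii")).hexdigest()


def verify_chain(chain):
    """Recompute every link and check that it matches the stored value."""
    prev = GENESIS
    for entry in chain:
        if entry["link"] != link(prev, entry["doc_hash"]):
            return False
        prev = entry["link"]
    return True


def append_document(chain, document):
    """Verify the historical chain, then return a new chain with the document added."""
    if not verify_chain(chain):
        raise ValueError("historical hash chain failed integrity verification")
    prev = chain[-1]["link"] if chain else GENESIS
    doc_hash = hash_document(document)
    return chain + [{"doc_hash": doc_hash, "link": link(prev, doc_hash)}]


chain = append_document([], {"id": 1, "diagnosis": "Type 2 diabetes"})
chain = append_document(chain, {"id": 2, "diagnosis": "Hypertension"})
```

Because each link depends on all earlier entries, altering any stored document hash invalidates every later link, which is what makes the associated log traceable.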
8. A medical document structuring processing apparatus, applied to the medical document structuring processing method of claim 1, characterized in that the apparatus comprises: a text recognition module, used to perform text recognition on the acquired original medical documents and extract the text data and corresponding spatial location information from each document; an entity analysis module, used to perform named entity recognition and relationship analysis on the text data based on the spatial location information to obtain entity knowledge units; a structural standard module, used to standardize and transform the entity knowledge units according to preset standard field mapping rules to generate standard structure data; and a multi-level verification module, used to perform multi-level verification on the standard structure data using a preset verification method to generate standard medical document data.
9. An electronic device, characterized in that the electronic device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the medical document structuring processing method according to any one of claims 1 to 7.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the medical document structuring processing method according to any one of claims 1 to 7.