An information mining method for heterogeneous time series data
By constructing a task-adaptive model with a hypergraph structure and attention mechanism, the problem of information mining of heterogeneous time series data is solved, and efficient prediction and interpretation enhancement of EHR data are achieved.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents (China)
- Current Assignee / Owner: NORTHWESTERN POLYTECHNICAL UNIV
- Filing Date: 2023-05-10
- Publication Date: 2026-06-26
AI Technical Summary
Existing technologies do not fully exploit the heterogeneous time-series data in EHRs, which limits model performance and interpretability, and they do not learn the information decay rate of each visit and the correlation between medical events in a unified way.
A task-adaptive model is constructed using a hypergraph structure and attention mechanism. Heterogeneous time series data is analyzed through a multilayer perceptron and sequence learning model, and medical events are predicted by combining a fully connected network.
It improves the accuracy of medical event prediction and the interpretability of the model, adapts to different downstream tasks, and enhances the ability to cope with complex medical situations.
Figure CN116543917B_ABST
Description
Technical Field
[0001] This invention belongs to the field of medical prediction, and in particular relates to an information mining method for heterogeneous time series data.
Background Technology
[0002] An Electronic Health Record (EHR) is a longitudinal system for collecting electronic medical information about a patient, recording data generated across all healthcare institutions. This digitally stored information needs to be shareable across different healthcare facilities to ensure patients receive quality medical care from different doctors, hospitals, clinics, and even in different countries. It also allows doctors, other healthcare professionals, and insurance companies to share the patient's medical records across different devices.
[0003] In EHRs, doctors and other healthcare professionals typically use text to record patients' health information and medical history. This textual data includes medical records, lab results, radiology reports, medical orders, prescriptions, etc., all stored in natural language. NLP technology can analyze this textual information to extract useful insights, helping doctors and other healthcare professionals make more accurate diagnostic and treatment decisions.
[0004] Here are some common NLP applications:
[0005] Entity extraction: This technology can identify entities in text and associate them with specific categories, such as diseases, medications, surgeries, and laboratory test results. Healthcare professionals can use entity extraction to quickly obtain useful information about patients, such as medical history, treatment plans, and allergic reactions.
[0006] Automatic summarization: This technology uses natural language processing algorithms to automatically generate summaries or overviews of text. For large volumes of medical records, automatic summarization can help doctors understand patients' conditions and diagnostic results more quickly.
[0007] Text classification: This technology can automatically categorize text data into different categories, such as diseases and symptoms, clinical laboratories, and medications. This can help doctors better understand a patient's health condition and quickly find relevant information about the illness.
[0008] Sentiment analysis: This technology can analyze the emotional tone contained in text, such as a patient's level of pain or their response to a particular treatment. This can help doctors better assess a patient's condition and medical needs, thereby providing better care and treatment plans.
[0009] Speech recognition: This technology can convert verbal instructions from healthcare professionals into text format and store them in the EHR system. This helps healthcare professionals record patient information more quickly and also helps reduce input errors. In summary, NLP technology can help healthcare professionals better utilize the large amounts of textual information in the EHR system. Through entity extraction, automatic summarization, text classification, sentiment analysis, and other functions, NLP technology can help healthcare professionals access and analyze patient health information more quickly, thereby improving the quality and efficiency of healthcare.
[0010] Electronic health records (EHRs) are a type of time series data, a common data type in data mining. They typically consist of data from multiple time periods, containing rich temporal information. From this information, we can uncover patterns in data evolution and make reasonable inferences, which is crucial for many predictive tasks.
[0011] Heterogeneous time series data is even more complex. On the one hand, real-world graphs are far from homogeneous; heterogeneous information networks are common, such as drug-target biomedical networks and recommendation networks. On the other hand, heterogeneous time series may have different sampling rates, time spans, or timestamps, and may measure different variables or attributes at different time intervals. How to better extract information from heterogeneous time series data has received wide attention in academia in recent years.
[0012] Current technologies do not consider all features of EHR data and therefore fail to uncover the information hidden between medical codes and patients, which affects model performance and interpretability. While existing models can handle irregular time intervals to some extent, none of them treats time information as a "new" type of medical event or learns the information decay rate of each visit and the correlation between the medical events in each visit in a unified way, and they lack task adaptability.
Summary of the Invention
[0013] The purpose of this invention is to provide an information mining method for heterogeneous time series data to solve the problems existing in the prior art.
[0014] To achieve the above objectives, the present invention provides an information mining method for heterogeneous time series data, comprising:
[0015] Electronic medical record data is acquired, and a hypergraph is constructed based on the electronic medical record data. The hypergraph is analyzed and calculated using a multilayer perceptron and an attention mechanism to obtain embedded representation data. A task-adaptive model is constructed based on the attention mechanism, and the embedded representation data is classified and weighted using the task-adaptive model to obtain embedded sequence data.
[0016] A sequence learning model is constructed, and the embedded sequence data is subjected to hidden state access analysis through the sequence learning model to obtain the hidden representation data of the embedded sequence data;
[0017] Obtain the weight data of the hidden representation data, and weight the embedded sequence data based on the weight data to obtain the hidden data of the embedded sequence;
[0018] Acquire time training parameter data, train the sequence learning model using the time training parameter data, weight the embedded sequence hidden data using the trained sequence learning model to obtain the time dimension hidden data of the embedded sequence data, construct a fully connected network, and perform predictive analysis on the time dimension hidden data using the fully connected network to obtain medical event prediction data.
[0019] Optionally, the electronic medical record data includes: patient information data and medical code data.
[0020] Optionally, the process of constructing the hypergraph includes: using the patient information data as a hyperedge set E, using the medical code data as a node set C, and constructing a hypergraph Gh based on the hyperedge set E and the node set C;
[0021] The formula for calculating the constructed hypergraph Gh is as follows:
[0022]
[0023] Gh = (C, E)
[0024] In the formula, each element of E represents the i-th patient (hyperedge) in layer l, and $N_p$ denotes the number of patients.
[0025] Optionally, the process of obtaining the embedded representation data includes:
[0026] The hypergraph Gh is analyzed based on the attention mechanism to obtain important data of the hypergraph Gh. The important data is then iteratively analyzed using a multilayer perceptron to obtain embedded representation data Nodes.
[0027] The formula for calculating the embedded representation data Node is as follows:
[0028]
[0029]
[0030] where $\varphi(c) = \{p_j \mid c \in P_j\}$ denotes the set of hyperedge representations containing node c, w is a learnable parameter matrix, and ψ is a compatibility function between node and hyperedge embeddings, implemented by an MLP.
[0031] Optionally, the task-adaptive model includes a task-aware (task-known) attention model and a task-unaware (task-unknown) attention model.
[0032] The task-aware attention model is:
[0033]
[0034] The task-unaware attention model is:
[0035]
[0036] The output $o_t$ of the task-aware attention model and the output of the task-unaware attention model are:
[0037]
[0038]
[0039] In the formulas, the symbols denote the embedded representation of the primary event and the embedded representations of the secondary events; n is the number of event types; m, d, l, p each denote one type of medical event, representing drugs, diagnoses, laboratory tests, and surgeries, respectively;
[0040] The visit embedding sequence data is constructed from the output $o_t$ of the task-aware attention model and the output of the task-unaware attention model;
[0041] The visit embedding sequence data is $[o_1, o_2, \ldots, o_T]$.
[0042] Optionally, the process of obtaining the hidden representation data includes:
[0043] An arbitrary sequence modeling network, Backbone, is selected as the backbone network. A sequence learning model is constructed based on the backbone network. The visit embedding sequence data is analyzed and calculated using the sequence learning model to obtain the hidden representation data h.
[0044] The formula for calculating the hidden representation data h is as follows:
[0045] $h = [h_1, h_2, \ldots, h_T] = \mathrm{Backbone}([o_1, o_2, \ldots, o_T])$.
[0046] Optionally, the process of obtaining the embedded sequence hidden data includes: obtaining the weight data $[\alpha_1, \ldots, \alpha_T]$ of the hidden representation data h through a visit-level attention mechanism, and weighting the embedded sequence data based on the weight data $[\alpha_1, \ldots, \alpha_T]$ to obtain the embedded sequence hidden data.
[0047] The embedded sequence hidden data is calculated as follows:
[0048]
[0049]
[0050] where the matrix in the formula collects the hidden states of visits 1 to T.
[0051] Optionally, the process of obtaining hidden time-dimension data includes:
[0052] The time training parameter data includes: $W_{\Delta g_{t1}}$, $b_{\Delta g_{t1}}$, $W_{\Delta g_{t2}}$, and $b_{\Delta g_{t2}}$;
[0053] The sequence learning model is trained based on the time training parameter data;
[0054] The computational process for training the sequence learning model is as follows:
[0055]
[0056] where $W_{\Delta g_{t1}} \in \mathbb{R}^{b}$, $b_{\Delta g_{t1}} \in \mathbb{R}^{b}$, $W_{\Delta g_{t2}} \in \mathbb{R}^{m \times b}$, and $b_{\Delta g_{t2}} \in \mathbb{R}^{m}$;
[0057] After the model is trained, the global time decay score data $[\beta_1, \ldots, \beta_T]$ is obtained using the sigmoid function. Based on the global time decay score data $[\beta_1, \ldots, \beta_T]$, the embedded sequence hidden data is weighted to obtain the time dimension hidden data.
[0058] The time dimension hidden data is calculated as follows:
[0059]
[0060]
[0061] Optionally, the process of obtaining medical event prediction data y′ includes:
[0062] $y' = \sigma(W_u [h', e_s] + b_u)$
[0063] where $W_u \in \mathbb{R}^{\rho \times (b+g)}$ and $b_u$ are training parameters.
The technical effects of this invention are:
[0064] This invention provides an information mining method for heterogeneous time series data that integrates a hypergraph structure into the modeling process. The hypergraph preserves the notion of a patient and mirrors a real clinical consultation, in which treatment plans for patients with similar symptoms are compared to support diagnosis and prevention. This gives the model interpretability that is important in medical work and helps doctors with diagnosis. In addition, different attention methods are selected for different downstream tasks, so that the information decay rate of each visit and the correlation between the medical events in each visit are learned in a unified way. This attention mechanism is time-aware and task-adaptive. The model improves performance across various downstream tasks, increasing accuracy without sacrificing generalization, which allows the invention to handle more complex real-world medical situations and assist medical personnel from multiple perspectives. The technical solution of this application dynamically adjusts the learning mode used to update the embeddings according to the task type; the embeddings then enter the sequence learning module, which uses time step information to learn complex information in the time dimension, resulting in accurate medical event predictions.
Attached Figure Description
[0065] The accompanying drawings, which form part of this application, are used to provide a further understanding of this application. The illustrative embodiments and descriptions of this application are used to explain this application and do not constitute an undue limitation of this application. In the drawings:
[0066] Figure 1 is a flowchart of an embodiment of the present invention.
Detailed Implementation
[0067] Various exemplary embodiments of the present invention will now be described in detail. This detailed description should not be considered as a limitation of the present invention, but rather as a more detailed description of certain aspects, features, and embodiments of the present invention.
[0068] It should be noted that, unless otherwise specified, the embodiments and features described in this application can be combined with each other. This application will now be described in detail with reference to the accompanying drawings and embodiments.
[0069] Example 1
[0070] As shown in Figure 1, this embodiment provides an information mining method for heterogeneous time series data, including:
[0071] Electronic medical record data is acquired, and a hypergraph is constructed based on the electronic medical record data. The hypergraph is analyzed and calculated using a multilayer perceptron and an attention mechanism to obtain embedded representation data. A task-adaptive model is constructed based on the attention mechanism, and the embedded representation data is classified and weighted using the task-adaptive model to obtain embedded sequence data.
[0072] A sequence learning model is constructed, and the embedded sequence data is subjected to hidden state access analysis through the sequence learning model to obtain the hidden representation data of the embedded sequence data;
[0073] Obtain the weight data of the hidden representation data, and weight the embedded sequence data based on the weight data to obtain the hidden data of the embedded sequence;
[0074] Acquire time training parameter data, train the sequence learning model using the time training parameter data, weight the embedded sequence hidden data using the trained sequence learning model to obtain the time dimension hidden data of the embedded sequence data, construct a fully connected network, and perform predictive analysis on the time dimension hidden data using the fully connected network to obtain medical event prediction data.
[0075] The hypergraph representation learning method in this embodiment first constructs the heterogeneous time-series data, EHR, into a hypergraph. For each time slice, each patient is considered a hyperedge, and each medical code is considered a node. Based on this, the hypergraph is constructed. Furthermore, the embedding of the hyperedges is learned by fusing information from all nodes on the hyperedges. Then, for each node, the node representation is updated by fusing information from all its hyperedges. This process yields the embedding representation, which is then fed into a task-adaptive attention module. This module dynamically adjusts the learning mode to update the embedding based on the task type. Subsequently, the sequence learning module learns complex information in the time dimension using time step information, and finally, result prediction is performed.
[0076] Specifically, the steps include the following:
[0077] S1. Embed the data from the EHR dataset into the hypergraph structure and update the node and hyperedge information;
[0078] S2. In order to make the model task-adaptive, attention is learned based on the embedding representation data learned from the hypergraph for the classification of downstream task types;
[0079] S3. In order to mine hidden information in the time dimension, local irregular time intervals and global time intervals are used to guide model learning to obtain time dimension hidden embedding data;
[0080] S4. Feed the time dimension hidden embedding data into two fully connected layers for result prediction;
[0081] In step S1, the node information is first aggregated in the hypergraph.
[0082] Gh = (C, E) represents the patient-code hypergraph, where C is the set of nodes in the hypergraph, i.e., the set of medical codes.
[0083] E is the hyperedge set, i.e., the set of patient records, whose elements represent the i-th patient (hyperedge) in layer l. $\varphi(c) = \{p_j \mid c \in P_j\}$ represents the set of hyperedges containing node c;
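The construction in [0082] and [0083] can be illustrated with a minimal sketch. The record layout, the patient identifiers, the example codes, and the function name `build_hypergraph` are assumptions made for illustration and are not taken from the patent; the sketch only shows patients as hyperedges, medical codes as nodes, and φ(c) as the set of hyperedges containing node c.

```python
from collections import defaultdict

def build_hypergraph(records):
    """Build (nodes, hyperedges, phi) from patient records.

    records: dict mapping a patient id to an iterable of medical codes
    (hypothetical layout chosen for this sketch).
    """
    # Node set C: every distinct medical code.
    nodes = sorted({code for codes in records.values() for code in codes})
    # Hyperedge set E: one hyperedge per patient, holding that patient's codes.
    hyperedges = {pid: frozenset(codes) for pid, codes in records.items()}
    # phi(c): the hyperedges (patients) that contain node c.
    phi = defaultdict(set)
    for pid, codes in hyperedges.items():
        for code in codes:
            phi[code].add(pid)
    return nodes, hyperedges, dict(phi)

# Toy records; the codes are placeholders, not data from the patent.
records = {
    "p1": ["E11", "I10", "metformin"],
    "p2": ["I10", "lisinopril"],
    "p3": ["E11", "metformin"],
}
nodes, hyperedges, phi = build_hypergraph(records)
```

Storing φ(c) explicitly makes the later node update cheap, since each node only needs the hyperedges it belongs to.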
[0084] Oversmoothing in neural networks can make medical codes and patient representations difficult to distinguish in a hypergraph. Therefore, it's necessary to select the most important nodes or hyperedges during message passing. An attention mechanism is thus applied, specifically described below. ψ, implemented by an MLP, is a compatibility measure between node and hyperedge embeddings. w is a parameter vector:
[0085]
[0086]
[0087] After several iterations, we obtain patient embedding representation data to serve subsequent steps.
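One round of the node-to-hyperedge and hyperedge-to-node fusion described above might look like the following sketch. The formulas in [0085] and [0086] are not legible in this text, so this is a generic attention-weighted aggregation rather than the patent's equations: the dot product stands in for the MLP-based compatibility function ψ, the learnable parameter w is omitted, and using the mean of a hyperedge's nodes as the query is an assumption. All function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(query, candidates, compat=dot):
    """Attention-weighted sum of candidate embeddings for one query.

    `compat` stands in for psi; a dot product replaces the MLP here.
    """
    weights = softmax([compat(query, c) for c in candidates])
    dim = len(query)
    pooled = [sum(w * c[k] for w, c in zip(weights, candidates)) for k in range(dim)]
    return pooled, weights

def message_passing_round(node_emb, hyperedges):
    """One round: nodes -> hyperedges, then hyperedges -> nodes."""
    # Hyperedge (patient) embedding: attend over the nodes it contains,
    # with the mean of those nodes as the query (assumption).
    edge_emb = {}
    for pid, codes in hyperedges.items():
        members = [node_emb[c] for c in codes]
        mean = [sum(col) / len(members) for col in zip(*members)]
        edge_emb[pid], _ = attend(mean, members)
    # Node update: attend over the hyperedges that contain the node.
    new_node_emb = {}
    for code, emb in node_emb.items():
        containing = [edge_emb[pid] for pid, codes in hyperedges.items() if code in codes]
        new_node_emb[code] = attend(emb, containing)[0] if containing else emb
    return new_node_emb, edge_emb

node_emb = {"E11": [1.0, 0.0], "I10": [0.0, 1.0], "metformin": [1.0, 1.0]}
hyperedges = {"p1": ["E11", "I10"], "p2": ["E11", "metformin"]}
new_nodes, edges = message_passing_round(node_emb, hyperedges)
```

Calling `message_passing_round` repeatedly corresponds to the "several iterations" mentioned in [0087]; the attention weights are what let the update favor the most relevant nodes or hyperedges.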
[0088] In step S2, cross-event attention is divided into two cases based on the type of the target event: task-unaware attention and task-aware attention. Task-unaware attention corresponds to the case where the target event is of a new type, different from all events in the historical visits, while task-aware attention corresponds to the case where the historical visits include medical events of the same type as the target event (i.e., the primary event). In task-aware attention, self-attention is applied only to the primary event (taking drug prediction as an example):
[0089]
[0090] In task-unaware attention, by contrast, attention is applied to all events, as follows:
[0091]
[0092] Here the matrix contains all event embeddings and the time embedding. The final output of this module is as follows:
[0093]
[0094]
[0095] Our proposed method is time-aware due to the attentional weights between the primary event (or all events) and the time interval. It also possesses event-awareness because attention is applied at the event level. Furthermore, the attention mechanism can be adapted to different tasks.
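The choice between the two cases in step S2 can be sketched as a small dispatch function. This is only an illustration of the case split described in [0088] to [0095]: the function name, the dictionary layout, the `"time"` key, and the decision to expose time as an extra key that queries attend to are assumptions, and the attention formulas themselves are not reproduced here.

```python
def select_query_events(target_type, visit):
    """Pick which event types act as attention queries for one visit.

    visit: dict mapping an event type to its embedding, e.g. 'm', 'd',
    'l', 'p' (drugs, diagnoses, laboratory tests, surgeries) plus 'time'.
    Returns (mode, query_types, key_types).
    """
    event_types = [t for t in visit if t != "time"]
    if target_type in event_types:
        # Task-aware: the history contains the target type (the primary event),
        # so only the primary event is used as the query.
        mode, queries = "task-aware", [target_type]
    else:
        # Task-unaware: the target is a new type, so every event is a query.
        mode, queries = "task-unaware", event_types
    # Time is treated as one more event type that the queries can attend to.
    keys = event_types + ["time"]
    return mode, queries, keys

# Toy visit with drug and diagnosis embeddings and a time embedding.
visit = {"m": [0.2, 0.1], "d": [0.4, 0.3], "time": [0.05, 0.9]}
aware = select_query_events("m", visit)
unaware = select_query_events("l", visit)
```

Here predicting drugs (`"m"`) selects the task-aware branch because drugs already appear in the visit, while predicting laboratory tests (`"l"`) falls back to the task-unaware branch.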
[0096] In step S3, any sequence modeling network, such as a GRU, an LSTM, or a Transformer, can be used as the backbone network to model the historical visit sequence. Given the time-aware, event-aware, and task-adaptive visit embedding sequence $[o_1, o_2, \ldots, o_T]$, the hidden representation data can be obtained through the following equation:
[0097] $h = [h_1, h_2, \ldots, h_T] = \mathrm{Backbone}([o_1, o_2, \ldots, o_T])$
[0098] where $h_t \in \mathbb{R}^{b}$ is the hidden state that aggregates all medical information for the t-th visit, and Backbone is any sequence modeling network. After obtaining h, we use visit-level attention to generate an attention weight for each visit, which gives the weight data of the hidden representation data:
[0099]
[0100] where the matrix in the formula collects the hidden states of visits 1 to T;
[0101] Based on the weighted data of the hidden representation data, the embedded sequence hidden data for each patient is obtained:
[0102]
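The visit-level weighting in [0098] to [0102] can be sketched as follows. Because the formulas are not legible in this text, the scoring function (a linear score per visit followed by a softmax), the parameter names `w` and `bias`, and the function name are assumptions; only the idea of weighting the hidden states of visits 1 to T by per-visit weights α follows the description.

```python
import math

def visit_level_attention(hidden_states, w, bias=0.0):
    """Weight T hidden states h_t by per-visit attention weights.

    hidden_states: list of T vectors (each of length b)
    w: scoring vector of length b (assumed linear scorer)
    Returns (alphas, pooled) where pooled = sum_t alpha_t * h_t.
    """
    scores = [sum(wi * hi for wi, hi in zip(w, h)) + bias for h in hidden_states]
    # Softmax over visits, shifted by the maximum for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(hidden_states[0])
    pooled = [sum(a * h[k] for a, h in zip(alphas, hidden_states)) for k in range(dim)]
    return alphas, pooled

# Two visits with hidden size b = 2; a zero scorer gives uniform weights.
alphas, pooled = visit_level_attention([[1.0, 0.0], [3.0, 2.0]], w=[0.0, 0.0])
```

With a zero scoring vector the weights are uniform and the pooled vector is the plain mean of the hidden states, which is a convenient sanity check before training the scorer.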
[0103] In addition to considering local time intervals, we also consider the impact of global time decay on information transmission. Similar to local time information, we treat it as a medical event and train the model accordingly.
[0104]
[0105] where $W_{\Delta g_{t1}} \in \mathbb{R}^{b}$, $b_{\Delta g_{t1}} \in \mathbb{R}^{b}$, $W_{\Delta g_{t2}} \in \mathbb{R}^{m \times b}$, and $b_{\Delta g_{t2}} \in \mathbb{R}^{m}$ are all training parameters. Further, the sigmoid function is used to calculate the global time decay score, which is then weighted and added to the embedded sequence hidden data to obtain the time dimension hidden data of the embedded sequence. The specific formula is as follows:
[0106]
[0107]
[0108] In step S4, a fully connected network with a sigmoid function is used to predict binary vectors as follows:
[0109] $y' = \sigma(W_u [h', e_s] + b_u)$
[0110] where $W_u \in \mathbb{R}^{\rho \times (b+g)}$ and $b_u$ are training parameters, and y′ is the predicted result. The entire network is optimized by applying a binary cross-entropy loss between y′ and the label.
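The prediction step in [0109] and the loss in [0110] can be written out as a short sketch. It follows the stated formula, reading $[h', e_s]$ as vector concatenation; the function names, the toy dimensions (b = 2, g = 1, ρ = 2), and the zero-initialized parameters are assumptions for illustration, and $e_s$ is used as given since the surviving text does not define it.

```python
import math

def sigmoid(x):
    """Sigmoid written to avoid overflow for large negative inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def predict(W_u, b_u, h_prime, e_s):
    """y' = sigmoid(W_u [h', e_s] + b_u), with [., .] as concatenation."""
    x = list(h_prime) + list(e_s)
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W_u, b_u)]

def binary_cross_entropy(y_pred, y_true, eps=1e-12):
    """Mean binary cross-entropy between predictions and 0/1 labels."""
    total = 0.0
    for p, y in zip(y_pred, y_true):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)

# Toy shapes: h' has length b = 2, e_s has length g = 1, output size rho = 2.
W_u = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
b_u = [0.0, 0.0]
y_pred = predict(W_u, b_u, h_prime=[0.5, -1.0], e_s=[1.0])
loss = binary_cross_entropy(y_pred, y_true=[1, 0])
```

Each output is an independent probability, so the same head covers predictions in which several medical events can be positive at once.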
[0111] The advantages of using the method provided in this embodiment are:
[0112] The method in this embodiment first constructs a hypergraph representation learning framework. It then attempts to jointly capture code-code, patient-patient, and patient-code relationships from EHR data.
[0113] Meanwhile, this embodiment treats time information as a "new" type of medical event and proposes a novel attention mechanism (cross-event attention) to learn the information decay rate of each visit and the correlation between medical events of each visit in a unified manner. This attention mechanism is time-aware and task-adaptive; it innovatively introduces visit-level attention to simulate the relationships between historical visits and introduces a global time transformer to model global time information.
[0114] This embodiment provides an information mining method for heterogeneous time series data. It integrates a hypergraph structure into the modeling process of heterogeneous time series data and selects different attention methods for different downstream tasks. It learns the information decay rate of each visit and the correlation between medical events in each visit in a unified way. This attention mechanism is time-aware and task-adaptive.
[0115] The technical solution of this application can dynamically adjust the learning mode to update the embedding according to the task type, and then enter the sequence learning module to learn complex information in the time dimension using time step information, which can obtain accurate medical event prediction results. The experimental results on two commonly used heterogeneous time series datasets and three downstream tasks exceed the current state-of-the-art level.
[0116] The above description is merely a preferred embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A method for information mining of heterogeneous time series data, characterized in that it comprises: acquiring electronic medical record data, constructing a hypergraph based on the electronic medical record data, and analyzing and calculating the hypergraph using a multilayer perceptron and an attention mechanism to obtain embedded representation data; constructing a task-adaptive model based on an attention mechanism, and classifying and weighting the embedded representation data using the task-adaptive model to obtain embedded sequence data; constructing a sequence learning model, and performing hidden state access analysis on the embedded sequence data using the sequence learning model to obtain the hidden representation data of the embedded sequence data; obtaining the weight data of the hidden representation data, and weighting the embedded sequence data based on the weight data to obtain the embedded sequence hidden data; acquiring time training parameter data, training the sequence learning model using the time training parameter data, weighting the embedded sequence hidden data using the trained sequence learning model to obtain the time dimension hidden data of the embedded sequence data, constructing a fully connected network, and performing predictive analysis on the time dimension hidden data using the fully connected network to obtain medical event prediction data; wherein the task-adaptive model includes a task-aware attention model and a task-unaware attention model; the task-aware attention model, the task-unaware attention model, and their respective outputs are each defined by a formula; in the formulas, the symbols denote the embedded representation of the primary event and the embedded representations of the secondary events; n is the number of event types; m, d, l, p each denote one type of medical event, representing drugs, diagnoses, laboratory tests, and surgeries, respectively; the visit embedding sequence data is constructed from the output $o_t$ of the task-aware attention model and the output of the task-unaware attention model; and the visit embedding sequence data is $[o_1, o_2, \ldots, o_T]$.
2. The information mining method for heterogeneous time series data according to claim 1, characterized in that the electronic medical record data includes: patient information data and medical code data.
3. The information mining method for heterogeneous time series data according to claim 2, characterized in that the process of constructing the hypergraph includes: using the patient information data as a hyperedge set E, using the medical code data as a node set C, and constructing a hypergraph Gh based on the hyperedge set E and the node set C; the constructed hypergraph is Gh = (C, E), where each element of E represents the i-th patient (hyperedge) in layer l, and $N_p$ denotes the number of patients.
4. The information mining method for heterogeneous time series data according to claim 3, characterized in that the process of obtaining the embedded representation data includes: analyzing the hypergraph Gh based on the attention mechanism to obtain important data of the hypergraph Gh, and iteratively analyzing the important data using a multilayer perceptron to obtain the embedded representation data; the embedded representation data is calculated by a formula in which $\varphi(c)$ denotes the set of hyperedge representations containing node c, w is a learnable parameter matrix, and ψ is a compatibility function between node and hyperedge embeddings implemented by an MLP.
5. The information mining method for heterogeneous time series data according to claim 4, characterized in that the process of obtaining the hidden representation data includes: selecting an arbitrary sequence modeling network, Backbone, as the backbone network; constructing a sequence learning model based on the backbone network; and analyzing and calculating the visit embedding sequence data using the sequence learning model to obtain the hidden representation data h; the formula for calculating the hidden representation data h is: $h = [h_1, h_2, \ldots, h_T] = \mathrm{Backbone}([o_1, o_2, \ldots, o_T])$.
6. The information mining method for heterogeneous time series data according to claim 5, characterized in that the process of obtaining the embedded sequence hidden data includes: obtaining the weight data $[\alpha_1, \ldots, \alpha_T]$ of the hidden representation data h through a visit-level attention mechanism, and weighting the embedded sequence data based on the weight data to obtain the embedded sequence hidden data; the embedded sequence hidden data is calculated by a formula in which the matrix collects the hidden states of visits 1 to T.
7. The information mining method for heterogeneous time series data according to claim 6, characterized in that the process of obtaining the time dimension hidden data includes: the time training parameter data includes $W_{\Delta g_{t1}}$, $b_{\Delta g_{t1}}$, $W_{\Delta g_{t2}}$, and $b_{\Delta g_{t2}}$; the sequence learning model is trained based on the time training parameter data according to a formula in which $W_{\Delta g_{t1}} \in \mathbb{R}^{b}$, $b_{\Delta g_{t1}} \in \mathbb{R}^{b}$, $W_{\Delta g_{t2}} \in \mathbb{R}^{m \times b}$, and $b_{\Delta g_{t2}} \in \mathbb{R}^{m}$; after the model is trained, the global time decay score data $[\beta_1, \ldots, \beta_T]$ is obtained using the sigmoid function; and based on the global time decay score data, the embedded sequence hidden data is weighted to obtain the time dimension hidden data.
8. The information mining method for heterogeneous time series data according to claim 7, characterized in that the process of obtaining the medical event prediction data y′ includes: $y' = \sigma(W_u [h', e_s] + b_u)$, where $W_u \in \mathbb{R}^{\rho \times (b+g)}$ and $b_u$ are training parameters.