A similar group determination method, device and storage medium

By constructing a similarity matrix of patient hospitalization dates and basic attributes, and fitting hospitalization time series using Wasserstein distance and Gaussian mixture function, a graphical model of patient feature similarity is built. This solves the problems of high complexity and low accuracy in patient hospitalization behavior similarity analysis in traditional methods, and enables rapid identification of medical insurance fraud gangs.

CN116957815B · Active · Publication Date: 2026-06-12 · CHINA MOBILE COMM LTD RES INST +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHINA MOBILE COMM LTD RES INST
Filing Date
2022-08-23
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Traditional methods struggle to analyze the similarity of patients' hospitalization behavior effectively, especially when characterizing similarity across different hospitalizations and time periods. They are also computationally complex and cannot accurately identify medical insurance fraud gangs.

Method used

By constructing a similarity matrix of patient hospitalization dates and basic attributes, using Wasserstein distance to represent the similarity of hospitalization dates, and combining Gaussian mixture function or Gaussian distribution density function to fit the hospitalization time series, a graphical model of patient feature similarity is constructed, and maximal clique processing is performed to identify similar groups.

Benefits of technology

It enables more accurate analysis of patient hospitalization behavior, quickly identifies similar groups that may be involved in medical insurance fraud, reduces computational complexity, and improves the success rate of medical insurance fraud prevention.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN116957815B_ABST

Abstract

The application discloses a similar group determination method, device and storage medium. The method comprises: obtaining patient hospitalization characteristics; constructing, according to the patient hospitalization characteristics, a matrix of patient hospitalization date similarity and a matrix of patient basic attribute similarity, where the patient hospitalization date similarity is characterized by Wasserstein distance; constructing a graph model of patient feature similarity according to the matrix of patient hospitalization date similarity and the matrix of patient basic attribute similarity; performing maximal clique processing according to the graph model of patient feature similarity to obtain at least one maximal clique; and determining a target maximal clique according to the at least one maximal clique.

Description

Technical Field

[0001] This invention relates to the field of artificial intelligence, and more particularly to a method, apparatus, and storage medium for identifying similar groups.

Background Technology

[0002] Social medical insurance effectively reduces economic risks for patients and improves the health of the general public. However, due to loopholes in the management system and implementation procedures of social medical insurance, some individuals exploit these loopholes for personal gain, using methods such as forging invoices, duplicate consultations, duplicate prescriptions, impersonation for medical treatment, and paying for non-medical insurance drugs or services to defraud the medical insurance fund, resulting in significant losses. Furthermore, some individuals collude with certain doctors in hospitals to fabricate or exaggerate treatment costs through group hospitalizations and group prescriptions, thereby defrauding social medical insurance.

[0003] Traditional medical insurance fraud prevention methods primarily rely on business analysis methods, such as interpreting medical insurance policy documents, doctors' treatment experience, and patient diagnostic information, to identify irregularities in medical insurance reimbursement. In addition, some anomaly detection methods identify potential fraudulent behavior by recognizing abnormal indicators in patient data. Effectively characterizing patient traits and developing interpretable methods for judging abnormal patient behavior can fully utilize patient data to uncover potential medical insurance reimbursement anomalies, thus supplementing the shortcomings of business analysis-based fraud prevention methods and improving the success rate of medical insurance fraud prevention. Since inpatient medical insurance reimbursement is a crucial component of all medical insurance reimbursements, effectively analyzing patient inpatient behavior and extracting key features can improve the accuracy of inpatient medical insurance fraud prevention algorithms, making it an important research direction. However, the characteristics of inpatient patients are diverse, and how to identify suspicious fraud groups based on these characteristics remains a problem that needs to be solved.

Summary of the Invention

[0004] In view of this, the main objective of the present invention is to provide a method, apparatus and storage medium for determining similar groups.

[0005] To achieve the above objectives, the technical solution of the present invention is implemented as follows:

[0006] This invention provides a method for determining similar groups, the method comprising:

[0007] Obtain patient hospitalization characteristics;

[0008] Based on the patient hospitalization characteristics, a matrix of patient hospitalization date similarity and a matrix of patient basic attribute similarity are constructed; the patient hospitalization date similarity is represented by Wasserstein distance.

[0009] Based on the matrix of similarity between patient hospitalization dates and the matrix of similarity between patient basic attributes, a graphical model of patient feature similarity is constructed.

[0010] Maximal clique processing is performed based on the graphical model of patient feature similarity to obtain at least one maximal clique;

[0011] The target maximal clique is determined based on the at least one maximal clique.

[0012] In the above scheme, the patient hospitalization characteristics include: patient hospitalization date characteristics;

[0013] Based on the patient hospitalization date characteristics, a matrix of patient hospitalization date similarity is constructed, including:

[0014] Determine the start date and length of stay for each patient's hospitalization;

[0015] The first hospitalization time series for each patient is determined based on the hospitalization start time and the number of hospitalization days for each patient in each hospitalization.

[0016] The second hospitalization time series is obtained by fitting the first hospitalization time series with a Gaussian mixture function or a Gaussian distribution density function.

[0017] The Wasserstein distance is calculated based on the second hospitalization time series of each of any two patients, and the results are obtained.

[0018] Based on the calculation results between any two patients, construct a matrix of the similarity of the patients' hospitalization dates.

[0019] In the above scheme, the matrix of patient hospitalization date similarity includes: the similarity results of the hospitalization dates of any two patients;

[0020] The step of constructing a matrix of patient hospitalization date similarity based on the calculation results between any two patients includes:

[0021] If the calculated result between two patients exceeds the hospitalization time similarity threshold, the similarity result of the hospitalization dates of the two patients is determined to be 1;

[0022] If the calculated result between two patients is less than or equal to the hospitalization time similarity threshold, the similarity result of the hospitalization dates of the two patients is determined to be 0.

[0023] In the above scheme, the step of fitting the first hospitalization time series with a Gaussian mixture function or a Gaussian distribution density function to obtain the second hospitalization time series includes:

[0024] When the number of hospitalizations of a patient exceeds the first threshold, a Gaussian mixture function is used to fit the first hospitalization time series to obtain the second hospitalization time series.

[0025] When the number of hospitalizations of a patient is less than or equal to a first threshold, the first hospitalization time series is fitted with a Gaussian distribution density function to obtain a second hospitalization time series.

[0026] In the above scheme, the patient hospitalization characteristics include: basic patient attributes;

[0027] The step of constructing a matrix of patient basic attribute similarity based on the patient's basic attributes includes:

[0028] Based on the basic patient attributes described for each patient, construct a basic attribute feature vector for each patient;

[0029] Calculate the similarity between the basic attribute feature vectors of any two patients based on the basic attribute feature vector of each patient.

[0030] Construct a matrix of patient basic attribute similarity based on the similarity of the feature vectors of any two patients' basic attributes.

[0031] In the above scheme, the matrix of patient basic attribute similarity includes: the similarity results of the basic attribute feature vectors of any two patients;

[0032] The step of constructing a matrix of patient basic attribute similarity based on the similarity of the feature vectors of any two patients includes:

[0033] If the similarity of the basic attribute feature vectors of two patients exceeds the basic attribute similarity threshold, the similarity result is determined to be 1.

[0034] If the similarity of the basic attribute feature vectors of two patients is less than or equal to the basic attribute similarity threshold, the similarity result is determined to be 0.

[0035] In the above scheme, the step of constructing a graph model of patient feature similarity based on the matrix of similarity between patient hospitalization dates and the matrix of similarity between patient basic attributes includes:

[0036] The patient feature similarity model is obtained by element-wise (component-wise) multiplication of the matrix of patient hospitalization date similarity and the matrix of patient basic attribute similarity.

[0037] The elements in the patient feature similarity model are used as the adjacency matrix of the graph model to form an undirected graph with patients as nodes, which serves as the graph model for the patient feature similarity.
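The two steps above can be sketched in a few lines. This is a minimal illustration that assumes both similarity matrices are already 0/1 matrices of the same size; the function names `feature_similarity` and `to_adjacency` and the sample matrices are hypothetical and do not come from the patent.

```python
def feature_similarity(date_sim, attr_sim):
    """Element-wise product of two 0/1 similarity matrices (lists of lists)."""
    n = len(date_sim)
    return [[date_sim[i][j] * attr_sim[i][j] for j in range(n)] for i in range(n)]


def to_adjacency(matrix):
    """Read the product matrix as an undirected graph with patients as nodes.

    The diagonal is ignored so that no patient is linked to itself.
    """
    n = len(matrix)
    return {i: {j for j in range(n) if j != i and matrix[i][j] == 1} for i in range(n)}


# Hypothetical 3-patient example.
date_sim = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
attr_sim = [[1, 1, 1], [1, 1, 0], [1, 0, 1]]
graph = to_adjacency(feature_similarity(date_sim, attr_sim))
print(graph)
```

An edge survives only when both matrices mark the pair as similar, so in this sample only patients 0 and 1 remain connected.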

[0038] In the above scheme, determining the target maximal clique based on the at least one maximal clique includes:

[0039] Determine the number of patients included in each of the at least one maximal clique;

[0040] Identify, as the target maximal clique, each maximal clique whose number of patients exceeds a preset threshold.
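As an illustration of maximal clique processing followed by the size filter, the sketch below uses the Bron-Kerbosch algorithm with pivoting. The patent does not name a specific clique enumeration algorithm, so this choice is an assumption; `maximal_cliques`, `target_cliques`, and the sample graph are hypothetical.

```python
def maximal_cliques(adj):
    """Enumerate all maximal cliques of an undirected graph.

    adj maps each node to the set of its neighbours.
    Uses Bron-Kerbosch with pivoting (an assumed choice of algorithm).
    """
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))
            return
        # Pivot on the node with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def target_cliques(cliques, min_patients):
    """Keep cliques whose number of patients exceeds the preset threshold."""
    return [c for c in cliques if len(c) > min_patients]


# Hypothetical graph: patients 0, 1, 2 are mutually similar, 3 is linked only to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}, 4: set()}
cliques = maximal_cliques(adj)
targets = target_cliques(cliques, min_patients=2)
print(sorted(cliques), targets)
```

With a threshold of 2, only the three-patient clique is reported as a target group.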

[0041] This invention provides a device for determining similar groups, the device comprising:

[0042] The acquisition module is used to acquire patient hospitalization characteristics;

[0043] The first processing module is used to construct a matrix of patient hospitalization date similarity and a matrix of patient basic attribute similarity based on the patient hospitalization characteristics; the patient hospitalization date similarity is represented by Wasserstein distance.

[0044] The second processing module is used to construct a graph model of patient feature similarity based on the matrix of similarity between the patient hospitalization dates and the matrix of similarity between the patient basic attributes.

[0045] The third processing module is used to perform maximal clique processing on the graphical model based on the patient feature similarity to obtain at least one maximal clique; and to determine the target maximal clique based on the at least one maximal clique.

[0046] In the above scheme, the patient hospitalization characteristics include: patient hospitalization date characteristics;

[0047] The first processing module is used to determine the start time and number of days of hospitalization for each patient in each hospitalization.

[0048] The first hospitalization time series for each patient is determined based on the hospitalization start time and the number of hospitalization days for each patient in each hospitalization.

[0049] The second hospitalization time series is obtained by fitting the first hospitalization time series with a Gaussian mixture function or a Gaussian distribution density function.

[0050] The Wasserstein distance is calculated based on the second hospitalization time series of each of any two patients, and the results are obtained.

[0051] Based on the calculation results between any two patients, construct a matrix of the similarity of the patients' hospitalization dates.

[0052] In the above scheme, the matrix of patient hospitalization date similarity includes: the similarity results of the hospitalization dates of any two patients;

[0053] The first processing module is used to determine the similarity result of the hospitalization dates of the two patients as 1 when the calculation result between the two patients exceeds the hospitalization time similarity threshold;

[0054] If the calculated result between two patients is less than or equal to the hospitalization time similarity threshold, the similarity result of the hospitalization dates of the two patients is determined to be 0.

[0055] In the above scheme, the first processing module is used to fit the first hospitalization time series with a Gaussian mixture function to obtain a second hospitalization time series when the number of hospitalizations of a patient exceeds a first threshold.

[0056] When the number of hospitalizations of a patient is less than or equal to a first threshold, the first hospitalization time series is fitted with a Gaussian distribution density function to obtain a second hospitalization time series.

[0057] In the above scheme, the patient hospitalization characteristics include: basic patient attributes;

[0058] The first processing module is used to construct a basic attribute feature vector for each patient based on the patient's basic attributes.

[0059] Calculate the similarity between the basic attribute feature vectors of any two patients based on the basic attribute feature vector of each patient.

[0060] Construct a matrix of patient basic attribute similarity based on the similarity of the feature vectors of any two patients' basic attributes.

[0061] In the above scheme, the matrix of patient basic attribute similarity includes: the similarity results of the basic attribute feature vectors of any two patients;

[0062] The first processing module is used to determine the similarity result as 1 if the similarity of the basic attribute feature vectors of two patients exceeds the basic attribute similarity threshold.

[0063] If the similarity of the basic attribute feature vectors of two patients is less than or equal to the basic attribute similarity threshold, the similarity result is determined to be 0.

[0064] In the above scheme, the second processing module is used to perform element-wise (component-wise) multiplication of the matrix of patient hospitalization date similarity and the matrix of patient basic attribute similarity to obtain a patient feature similarity model;

[0065] The elements in the patient feature similarity model are used as the adjacency matrix of the graph model to form an undirected graph with patients as nodes, which serves as the graph model for the patient feature similarity.

[0066] In the above scheme, the third processing module is used to determine the number of patients included in each of the at least one maximal clique, and to determine, as the target maximal clique, each maximal clique whose number of patients exceeds a preset threshold.

[0067] This invention provides a similar group identification device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it implements the steps of any of the methods described above.

[0068] This invention also provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of any of the methods described above.

[0069] This invention provides a method, apparatus, and storage medium for determining similar patient groups. The method includes: acquiring patient hospitalization characteristics; constructing a matrix of patient hospitalization date similarity and a matrix of patient basic attribute similarity based on the patient hospitalization characteristics; representing the patient hospitalization date similarity using Wasserstein distance; constructing a graphical model of patient characteristic similarity based on the matrix of patient hospitalization date similarity and the matrix of patient basic attribute similarity; performing maximal clique processing on the graphical model of patient characteristic similarity to obtain at least one maximal clique; and determining a target maximal clique based on the at least one maximal clique. Thus, by calculating similarity using multiple patient hospitalization characteristics (including basic attributes and hospitalization dates), similar patient groups can be analyzed more accurately. Furthermore, representing the similarity of patient hospitalization dates using Wasserstein distance allows for more efficient and rapid analysis of the similarity of patient hospitalization behaviors.

Attached Figure Description

[0070] Figure 1 is a flowchart illustrating a method for determining similar groups according to an embodiment of the present invention.

[0071] Figure 2 is a schematic diagram illustrating a method for determining similar groups provided in an application embodiment of the present invention.

[0072] Figure 3 is a schematic diagram of a similar group determination device provided in an embodiment of the present invention.

[0073] Figure 4 is a schematic diagram of another similar group determination device provided in an embodiment of the present invention.

Detailed Implementation

[0074] As mentioned above, patient hospitalization characteristics are diverse. For example, in addition to features such as place of residence, age, disease name, and reimbursement price, patient characteristics also include the start date of hospitalization and the number of days of hospitalization. When identifying fraudulent groups using fake hospitalizations, it is necessary not only to obtain the routine numerical and discrete features of patient hospitalizations, but also to extract features from the hospitalization date and number of days of hospitalization. Similar patient hospitalization groups require that the length of each hospitalization within the group be similar. However, directly using the start date of hospitalization and the number of days of hospitalization as patient hospitalization characteristics leads to the following technical difficulties in calculating patient hospitalization similarity:

[0075] (1) Different patients have different numbers of hospitalizations within a certain period, resulting in differences in the number of features for hospitalization dates and length of stay. Patient hospitalization times are usually not directly aligned, and traditional methods cannot effectively calculate the similarity of hospitalization behavior across different time periods. Some traditional similarity calculation methods require patients to have the same number of hospitalizations. Normalizing features of variable length can lead to information loss.

[0076] (2) Since the length of a patient's hospital stay is a discrete variable (1 for hospitalization and 0 for non-hospitalization), treating the length of a patient's hospital stay as a time series directly cannot describe the similarity of hospital stay times. For example, the hospital stay time series of patients hospitalized on two consecutive days have no overlap.

[0077] (3) In practical applications, the similarity of patient hospitalization times mainly focuses on identifying highly similar behaviors within a certain time period, ignoring differences at individual time points. Traditional feature extraction methods for assessing the similarity of patient hospitalization times are affected by significant differences at local time points, making it difficult to effectively characterize patient hospitalization behavior in practical applications. Local time differences in patient hospitalization times will result in significant overall differences when arranged chronologically. For example, patient A's hospitalization times are January 5th, January 10th, and January 15th, while patient B's hospitalization times are January 10th, January 14th, and January 15th. Patient A and patient B have similar hospitalization times on the 10th and 15th, and patient A's hospitalization behavior on January 5th is a personal behavior; therefore, patient A and patient B are similar. However, if patient A and patient B are arranged chronologically, significant differences in their hospitalization times are found.

[0078] In summary, it is necessary to overcome the problem that traditional feature extraction methods cannot effectively analyze the similarity of patients' hospitalization behavior across different hospitalization times and time periods, and to overcome the high computational complexity caused by a large number of patients, a long time span of hospitalization behavior, and a large number of hospitalizations.

[0079] Based on this, the method provided in this embodiment of the invention obtains patient hospitalization features; constructs a matrix of patient hospitalization date similarity and a matrix of patient basic attribute similarity based on the patient hospitalization features; the patient hospitalization date similarity is represented by Wasserstein distance; constructs a graphical model of patient feature similarity based on the matrix of patient hospitalization date similarity and the matrix of patient basic attribute similarity; performs maximal clique processing on the graphical model of patient feature similarity to obtain at least one maximal clique; and determines a target maximal clique based on the at least one maximal clique.

[0080] The present invention will be further described in detail below with reference to the embodiments.

[0081] Figure 1 is a flowchart illustrating a method for determining similar groups provided in an embodiment of the present invention. As shown in Figure 1, the method can be applied to a server; the method includes:

[0082] Step 101: Obtain patient hospitalization characteristics;

[0083] Step 102: Based on the patient hospitalization characteristics, construct a matrix of patient hospitalization date similarity and a matrix of patient basic attribute similarity; the patient hospitalization date similarity is represented by Wasserstein distance;

[0084] Step 103: Construct a graph model of patient feature similarity based on the matrix of similarity between patient hospitalization dates and the matrix of similarity between patient basic attributes;

[0085] Step 104: Perform maximal clique processing on the graphical model of the patient feature similarity to obtain at least one maximal clique;

[0086] Step 105: Determine the target maximal clique based on the at least one maximal clique.

[0087] In some embodiments, patient hospitalization characteristics include: basic patient attributes and patient hospitalization date characteristics.

[0088] Basic patient attributes include: age, total out-of-pocket hospitalization expenses, total hospitalization reimbursement amount, disease type, and patient address.

[0089] The patient hospitalization date characteristics include: the start date and number of days of hospitalization for all hospitalization experiences.

[0090] In some embodiments, based on the patient hospitalization characteristics, constructing a matrix of patient hospitalization date similarity and a matrix of patient basic attribute similarity includes:

[0091] Based on the characteristics of the patients' hospitalization dates, construct a matrix of similarity between patients' hospitalization dates;

[0092] Based on the patient's basic attributes, construct a matrix of patient basic attribute similarity.

[0093] In practical applications, considering that traditional feature extraction methods cannot effectively analyze the similarity of patients' hospitalization behavior across different hospitalization times and time periods, a method using Wasserstein distance to characterize the similarity of patients' hospitalization time features is proposed, in order to more effectively analyze the similarity of patients' hospitalization behavior.

[0094] Based on this, in some embodiments, constructing a matrix of patient hospitalization date similarity according to the patient hospitalization date characteristics includes:

[0095] Determine the start date and length of stay for each patient's hospitalization;

[0096] The first hospitalization time series for each patient is determined based on the hospitalization start time and the number of hospitalization days for each patient in each hospitalization.

[0097] The second hospitalization time series is obtained by fitting the first hospitalization time series with a Gaussian mixture function or a Gaussian distribution density function.

[0098] The Wasserstein distance is calculated based on the second hospitalization time series of each of any two patients, and the results are obtained.

[0099] Based on the calculation results between any two patients, construct a matrix of the similarity of the patients' hospitalization dates.

[0100] Here, the patient hospitalization time series is fitted with a Gaussian density function sequence to form the first hospitalization time series.

[0101] For example, if a patient is hospitalized n times within a certain period, and the start and end times of each hospitalization are denoted as $a_i$ and $b_i$, $i = 1, \dots, n$, then the number of days of hospitalization for the i-th hospitalization is $c_i = b_i - a_i$. The first hospitalization time series of this patient is defined as follows:

[0102] Here, the similarity of hospitalization time between patients can be directly measured by the Wasserstein distance of the first hospitalization time series between patients.
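The sketch below illustrates why a Wasserstein distance is better suited than direct overlap for this comparison. It assumes the first hospitalization time series is a daily 0/1 indicator (1 on hospitalized days), which matches the description of the discrete series earlier in the text but is not confirmed by a formula in the source. The function names, dates, and observation window are hypothetical. For two distributions on the same daily grid, the 1-Wasserstein distance equals the summed absolute difference of their cumulative distributions.

```python
from datetime import date
from itertools import accumulate


def indicator_series(stays, window_start, window_days):
    """Build a daily 0/1 series from (start_date, n_days) stays."""
    series = [0] * window_days
    for start, n_days in stays:
        offset = (start - window_start).days
        for d in range(offset, offset + n_days):
            if 0 <= d < window_days:
                series[d] = 1
    return series


def wasserstein_1d(u, v):
    """1-Wasserstein distance (in days) between two series on the same grid.

    Each series is normalized to sum to 1; both must contain at least one 1.
    """
    su, sv = sum(u), sum(v)
    cdf_u = accumulate(x / su for x in u)
    cdf_v = accumulate(x / sv for x in v)
    return sum(abs(a - b) for a, b in zip(cdf_u, cdf_v))


start = date(2022, 1, 1)
a = indicator_series([(date(2022, 1, 5), 1)], start, 10)  # hospitalized on Jan 5
b = indicator_series([(date(2022, 1, 6), 1)], start, 10)  # hospitalized on Jan 6
c = indicator_series([(date(2022, 1, 9), 1)], start, 10)  # hospitalized on Jan 9

overlap = sum(x * y for x, y in zip(a, b))
print(overlap, wasserstein_1d(a, b), wasserstein_1d(a, c))
```

Stays on consecutive days share no overlapping day, yet their Wasserstein distance is only one day, while the stay four days later is four days away.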

[0103] However, considering the high complexity of Wasserstein distance when the time interval is large, this invention proposes to fit the first hospitalization time series with a Gaussian mixture function to obtain a second hospitalization time series. Based on the second hospitalization time series, the Wasserstein distance can be approximated efficiently, overcoming the problem of high computational complexity caused by a large number of patients, long hospitalization time spans, and numerous hospitalizations.

[0104] Based on this, in some embodiments, the first hospitalization time series is fitted with a Gaussian mixture function or a Gaussian distribution density function to obtain a second hospitalization time series, including:

[0105] When the number of hospitalizations of a patient exceeds the first threshold, a Gaussian mixture function is used to fit the first hospitalization time series to obtain the second hospitalization time series.

[0106] When the number of hospitalizations of a patient is less than or equal to the first threshold, the first hospitalization time series is fitted with a Gaussian distribution density function to obtain the second hospitalization time series.

[0107] When fitting the first hospitalization time series using a Gaussian mixture function, the fitting method is the Expectation-Maximization (EM) algorithm, and the number of Gaussian components is the first threshold.

[0108] For example, if the first threshold is 3, meaning there are 3 Gaussian components, and the EM algorithm is used for fitting, the mean of the i-th Gaussian component is $\mu_i$ and its standard deviation is $\sigma_i$. These parameters form a new hospitalization time series, that is, the second hospitalization time series.
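A minimal EM sketch for a one-dimensional Gaussian mixture is given below. It assumes the hospitalized days, expressed as day offsets, are the samples being fitted; the source does not state the sample representation. The quantile-based initialization, the `min_sigma` floor, and the name `fit_gmm_1d` are illustrative choices rather than details from the patent.

```python
from math import exp, pi, sqrt


def normal_pdf(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))


def fit_gmm_1d(samples, k, iters=200, min_sigma=0.5):
    """Fit a k-component 1D Gaussian mixture with EM.

    Returns a list of (weight, mean, std) tuples sorted by mean.
    """
    xs = sorted(samples)
    n = len(xs)
    # Deterministic start: means at evenly spaced quantiles of the data.
    mus = [xs[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    sigmas = [max((xs[-1] - xs[0]) / (2 * k), min_sigma)] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and std of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / n
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = max(sqrt(var), min_sigma)  # floor avoids collapse
    return sorted(zip(weights, mus, sigmas), key=lambda c: c[1])


# Hypothetical patient with four stays (more than the threshold of 3).
days = list(range(10, 13)) + list(range(14, 17)) + list(range(40, 44)) + list(range(70, 74))
components = fit_gmm_1d(days, k=3)
for weight, mu, sigma in components:
    print(round(weight, 3), round(mu, 2), round(sigma, 2))
```

The two nearby stays are absorbed into one component, so four hospitalizations are summarized by three (mean, standard deviation) pairs.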

[0109] When the number of hospitalizations is less than or equal to 3, for the i-th hospitalization, $a_i$ and $c_i$ (i.e., the hospitalization start date and length of hospital stay) are fitted using a Gaussian distribution density function, which is defined as follows:

[0110] $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left(-\frac{(x-\mu_i)^2}{2\sigma_i^2}\right)$

[0111] Where x is a variable that follows a Gaussian distribution (also known as the normal distribution) with parameters $\mu_i$ and $\sigma_i$; $\mu_i$ is the mean and $\sigma_i$ is the standard deviation. The mean $\mu_i$ and standard deviation $\sigma_i$ are determined using the following method:

[0112] Where Γ(0.95) represents the 0.95 quantile of the standard Gaussian distribution.

[0113] Ultimately, a new hospitalization time series is obtained; that is, for patients with 3 or fewer hospitalizations, this series is the second hospitalization time series.
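The formula that maps a stay to $\mu_i$ and $\sigma_i$ is not reproduced in the text, so the sketch below uses one plausible parameterization purely as an assumption: the mean is placed at the midpoint of the stay, and the standard deviation is chosen so that the stay spans the central 90% of the distribution (from the 0.05 to the 0.95 quantile). The function name `stay_to_gaussian` is hypothetical.

```python
from statistics import NormalDist

# 0.95 quantile of the standard Gaussian distribution, about 1.645.
Z95 = NormalDist().inv_cdf(0.95)


def stay_to_gaussian(start_day, n_days):
    """Map one stay to (mu, sigma).

    Assumed mapping, not taken from the patent:
      mu    = midpoint of the stay
      sigma = half the stay length divided by the 0.95 quantile
    """
    mu = start_day + n_days / 2.0
    sigma = n_days / (2.0 * Z95)
    return mu, sigma


mu, sigma = stay_to_gaussian(start_day=10, n_days=6)
dist = NormalDist(mu, sigma)
covered = dist.cdf(16) - dist.cdf(10)  # probability mass inside the stay
print(mu, round(sigma, 3), round(covered, 3))
```

Under this assumed mapping, a patient with up to three stays is described by one (mean, standard deviation) pair per stay, with no EM fitting needed.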

[0114] Here, the Wasserstein distance is calculated based on the second hospitalization time series of each of any two patients, and the results include:

[0115] Suppose any two patients are Patient A and Patient B, and their second hospitalization time series are denoted as $\{(\mu_i^a, \sigma_i^a)\}_{i=1}^{n_a}$ and $\{(\mu_j^b, \sigma_j^b)\}_{j=1}^{n_b}$, where $i$ indexes the hospitalizations of patient A, $\mu_i^a$ and $(\sigma_i^a)^2$ represent the mean and variance of patient A's i-th hospitalization, respectively, $j$ indexes the hospitalizations of patient B, and $\mu_j^b$ and $(\sigma_j^b)^2$ represent the mean and variance of patient B's j-th hospitalization, respectively. The Wasserstein distance is then calculated. The calculation process includes:

[0116] Constructing the matrix $C = (C_{ij})$, where $C_{ij}$ is the Wasserstein distance between the Gaussian distribution $N(\mu_i^a, (\sigma_i^a)^2)$ in patient A's second hospitalization time series and the Gaussian distribution $N(\mu_j^b, (\sigma_j^b)^2)$ in patient B's second hospitalization time series, which has the following calculation form:

[0117]

[0118] Constructing vectors $p_a$ and $p_b$, where each component of $p_a$ is … and each component of $p_b$ is …, and solving the following optimization problem:

[0119] $f = \min_P \langle C, P \rangle$ subject to $P * \mathbf{1}_a = p_a$ and $P^T * \mathbf{1}_b = p_b$,

[0120] Where $\langle \cdot, \cdot \rangle$ denotes the matrix inner product, $*$ denotes matrix multiplication, $\mathbf{1}_a$ and $\mathbf{1}_b$ represent vectors with all components equal to 1, $T$ denotes the matrix transpose, and $f$ denotes the Wasserstein distance between the hospitalization time series of patient A and patient B.
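The optimization above is a transport problem over pairs of Gaussian components, and the following sketch shows one way it could be solved for small inputs. Several parts are assumptions rather than the patent's stated method: the cost uses the standard closed-form 2-Wasserstein distance between one-dimensional Gaussians, the weight vectors are uniform, the plan P is required to be non-negative, and the solver is a simple successive-shortest-path routine. In practice a linear programming solver could be used in its place.

```python
from math import inf, sqrt


def gaussian_w2(mu1, s1, mu2, s2):
    """Closed-form 2-Wasserstein distance between two 1D Gaussians."""
    return sqrt((mu1 - mu2) ** 2 + (s1 - s2) ** 2)


def transport_cost(C, p_a, p_b, eps=1e-12):
    """Minimize <C, P> with row sums p_a, column sums p_b and P >= 0.

    Successive shortest augmenting paths; intended for small examples only.
    """
    n, m = len(p_a), len(p_b)
    supply, demand = list(p_a), list(p_b)
    P = [[0.0] * m for _ in range(n)]
    while max(supply) > eps and max(demand) > eps:
        # Bellman-Ford on the residual graph: rows are 0..n-1, columns n..n+m-1.
        dist = [0.0 if supply[i] > eps else inf for i in range(n)] + [inf] * m
        prev = [None] * (n + m)
        for _ in range(n + m):
            changed = False
            for i in range(n):
                for j in range(m):
                    if dist[i] + C[i][j] < dist[n + j] - eps:  # ship more i -> j
                        dist[n + j], prev[n + j] = dist[i] + C[i][j], i
                        changed = True
                    if P[i][j] > eps and dist[n + j] - C[i][j] < dist[i] - eps:
                        dist[i], prev[i] = dist[n + j] - C[i][j], n + j  # undo i -> j
                        changed = True
            if not changed:
                break
        t = min((j for j in range(m) if demand[j] > eps), key=lambda j: dist[n + j])
        path = [n + t]
        while prev[path[-1]] is not None:
            path.append(prev[path[-1]])
        path.reverse()  # row, column, row, ..., column
        flow = min(supply[path[0]], demand[t])
        for k in range(1, len(path) - 1, 2):  # column -> row steps undo shipments
            flow = min(flow, P[path[k + 1]][path[k] - n])
        for u, v in zip(path, path[1:]):
            if u < n:
                P[u][v - n] += flow
            else:
                P[v][u - n] -= flow
        supply[path[0]] -= flow
        demand[t] -= flow
    return sum(C[i][j] * P[i][j] for i in range(n) for j in range(m)), P


# Hypothetical second hospitalization time series as (mean, std) pairs.
series_a = [(10.0, 1.0), (15.0, 1.0)]
series_b = [(10.0, 1.0), (14.0, 1.0)]
C = [[gaussian_w2(ma, sa, mb, sb) for mb, sb in series_b] for ma, sa in series_a]
p_a = [1.0 / len(series_a)] * len(series_a)  # uniform weights (assumption)
p_b = [1.0 / len(series_b)] * len(series_b)
f, plan = transport_cost(C, p_a, p_b)
print(f, plan)
```

Because the problem size depends on the number of Gaussian components rather than the number of days in the observation window, the cost matrix stays small even for long time spans.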

[0121] In some embodiments, the matrix of patient hospitalization date similarity includes: the similarity results of the patient hospitalization dates of any two patients;

[0122] The step of constructing a matrix of patient hospitalization date similarity based on the calculation results between any two patients includes:

[0123] If the calculated result between two patients exceeds the hospitalization time similarity threshold, the similarity result of the hospitalization dates of the two patients is determined to be 1;

[0124] If the calculated result between two patients is less than or equal to the hospitalization time similarity threshold, the similarity result of the hospitalization dates of the two patients is determined to be 0.

[0125] Here, we assume a hospitalization time similarity threshold τ1. When f > τ1, the hospitalization time similarity between patient A and patient B is recorded as 1; otherwise, it is recorded as 0.

[0126] The method in this invention uses Wasserstein distance to more effectively characterize the intrinsic similarity of patients' hospital stays, while overcoming the problem that small differences in hospital stays can lead to large discrepancies in similarity calculations due to the discrete nature of patient hospital stays. Furthermore, the calculation method for Wasserstein distance (also called Earth Mover's Distance, or EMD for short, used to represent the similarity between two distributions) is simplified, reducing computational complexity and enabling rapid calculation even with a large number of patients and hospitalizations.

[0127] In some embodiments, constructing a matrix of patient basic attribute similarity based on the patient basic attributes includes:

[0128] Based on the basic patient attributes described for each patient, construct a basic attribute feature vector for each patient;

[0129] Calculate the similarity between the basic attribute feature vectors of any two patients based on the basic attribute feature vector of each patient.

[0130] Construct a matrix of patient basic attribute similarity based on the similarity of the feature vectors of any two patients' basic attributes.

[0131] The patient basic attribute similarity matrix includes: the similarity results of the basic attribute feature vectors of any two patients;

[0132] The step of constructing a matrix of patient basic attribute similarity based on the similarity of the feature vectors of any two patients includes:

[0133] If the similarity of the basic attribute feature vectors of two patients exceeds the basic attribute similarity threshold, the similarity result is determined to be 1.

[0134] If the similarity of the basic attribute feature vectors of two patients is less than or equal to the basic attribute similarity threshold, the similarity result is determined to be 0.

[0135] Here, the length of the basic attribute feature vector constructed for each patient is consistent.

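[0136] Suppose that any two patients are patient A and patient B, and their basic attribute feature vectors are $x_a$ and $x_b$ respectively. Writing $\mathrm{sim}(x_a, x_b)$ for the similarity of the two feature vectors [its defining formula is missing in the source text], the similarity between patient A and patient B is calculated as follows:

[0137] 1, if $\mathrm{sim}(x_a, x_b) > \tau_2$;

[0138] 0, if $\mathrm{sim}(x_a, x_b) \le \tau_2$;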

[0139] Where τ2 is a pre-set basic attribute similarity threshold.
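The attribute similarity rule can be illustrated as follows. The text does not preserve which similarity measure is applied to the feature vectors, so cosine similarity is used here purely as an assumption; the names `cosine_similarity` and `attribute_similarity_matrix`, the sample vectors, and `tau2=0.9` are likewise hypothetical.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity of two equal-length vectors (assumed measure, not from the text)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


def attribute_similarity_matrix(vectors, tau2, sim=cosine_similarity):
    """Return the 0/1 matrix: 1 where sim exceeds tau2, 0 where it is <= tau2."""
    m = len(vectors)
    return [[1 if sim(vectors[i], vectors[j]) > tau2 else 0 for j in range(m)]
            for i in range(m)]


# Hypothetical feature vectors of equal length for three patients.
vectors = [[1, 0], [2, 0], [0, 1]]
S2 = attribute_similarity_matrix(vectors, tau2=0.9)
```

Any other similarity function with the same two-argument shape can be passed through the `sim` parameter.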

[0140] In some embodiments, a graphical model of patient feature similarity is constructed based on the matrix of similarity between patient hospitalization dates and the matrix of similarity between patient basic attributes, including:

[0141] The patient feature similarity model is obtained by element-wise multiplication of the matrix of patient hospitalization date similarity and the matrix of patient basic attribute similarity.

[0142] The elements in the patient feature similarity model are used as the adjacency matrix of the graph model to form an undirected graph with patients as nodes, which serves as the graph model for the patient feature similarity.

[0143] Here, in the process of building a graph model based on patient feature similarity, the edges of the weighted graph model are mainly generated by the similarity of patient hospitalization time and the similarity of patient basic attributes.

[0144] Assume that, from the hospitalization data of m patients, a patient hospitalization date similarity matrix $S_1 \in \mathbb{R}^{m \times m}$ and a patient basic attribute similarity matrix $S_2 \in \mathbb{R}^{m \times m}$ are generated;

[0145] Then, the patient feature similarity model S is: $S = S_1 \times S_2$, where $\times$ represents element-wise multiplication of the two matrices.

[0146] Here, S is taken as the adjacency matrix of the graph model, which forms an undirected graph G with patients as nodes, thus obtaining the graph model of patient feature similarity.
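A small sketch of this construction, under stated assumptions: both input matrices are symmetric 0/1 matrices, and self-loops on the diagonal are ignored when building the graph (the text does not say how the diagonal is treated). The helper names `hadamard` and `adjacency_sets` and the sample matrices are hypothetical.

```python
def hadamard(S1, S2):
    """Element-wise product of two equally sized matrices."""
    return [[a * b for a, b in zip(row1, row2)] for row1, row2 in zip(S1, S2)]


def adjacency_sets(S):
    """Read S as an adjacency matrix: node i is linked to j when S[i][j] is nonzero.

    Diagonal entries are skipped (assumption: no self-loops).
    """
    m = len(S)
    return {i: {j for j in range(m) if j != i and S[i][j]} for i in range(m)}


# Hypothetical date-similarity and attribute-similarity matrices for three patients.
S1 = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
S2 = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
S = hadamard(S1, S2)
G = adjacency_sets(S)
```

An edge survives only when both similarity matrices mark the pair as similar, so the product acts as a logical AND of the two criteria.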

[0147] In some embodiments, maximal clique processing is performed based on a graphical model of patient feature similarity to obtain at least one maximal clique, including:

[0148] In the process of mining highly similar groups, maximal cliques are found in the undirected graph G (i.e., the graph model of patient feature similarity). Each maximal clique obtained represents a patient group with highly similar hospitalization behavior. Among the maximal cliques obtained, those containing a large number of patients (e.g., more than a preset threshold) are identified, and the patient groups in these maximal cliques are considered to have potential problems.
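The text does not name a specific clique-enumeration algorithm. As one possible illustration, the sketch below uses the classical Bron-Kerbosch algorithm with pivoting, which is a substituted technique and not necessarily the one used in the invention. The graph `adj` and the size threshold `min_size` are hypothetical.

```python
def maximal_cliques(adj):
    """Enumerate all maximal cliques of an undirected graph given as {node: set(neighbors)}."""
    cliques = []

    def expand(R, P, X):
        # R: current clique, P: candidates that can extend it, X: already processed nodes.
        if not P and not X:
            cliques.append(sorted(R))
            return
        # Pivot on the node with the most neighbors among the candidates.
        pivot = max(P | X, key=lambda v: len(adj[v] & P))
        for v in list(P - adj[pivot]):
            expand(R | {v}, P & adj[v], X & adj[v])
            P = P - {v}
            X = X | {v}

    expand(set(), set(adj), set())
    return cliques


# Hypothetical patient graph: patients 0, 1, 2 are mutually similar.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}, 4: set()}
cliques = maximal_cliques(adj)

min_size = 2  # hypothetical preset threshold
suspicious = [c for c in cliques if len(c) > min_size]
```

Enumerating maximal cliques can take exponential time in the worst case, so the sparsity produced by the two thresholds matters in practice.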

[0149] Figure 2 is a flowchart illustrating a method for determining similar groups provided in an embodiment of the present invention. As shown in Figure 2, the method can be applied to any server used to determine similar groups; the method includes:

[0150] Step 201: Collect patient hospitalization characteristics;

[0151] Patient hospitalization characteristics include: basic patient attributes and patient hospitalization date characteristics.

[0152] Basic patient attributes include: age, total out-of-pocket hospitalization expenses, total hospitalization reimbursement amount, disease type, and patient address.

[0153] The patient hospitalization date characteristics include: the start date and number of days of hospitalization for all hospitalization experiences.

[0154] Step 202: Construct a similarity measure for patient hospitalization dates, and construct a similarity measure for patient basic attributes;

[0155] Step 202 includes: constructing a similarity metric for patient hospitalization dates; specifically including:

[0156] Step 2021: Fit the hospitalization time series of each patient (referred to as the first hospitalization time series) with a Gaussian density function sequence to form a new hospitalization feature series (referred to as the second hospitalization time series).

[0157] For example, if a patient is hospitalized n times within a certain period, and the start and end times of the i-th hospitalization are denoted as $a_i$ and $b_i$, $i = 1, \dots, n$, then the number of days of the i-th hospitalization is $c_i = b_i - a_i$. The first hospitalization time series of the patient is formed from the pairs $(a_i, c_i)$, $i = 1, \dots, n$.
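As an illustration of how such a series could be assembled from admission records, the sketch below converts calendar dates into day offsets. It assumes the number of days is the end date minus the start date (so that it is non-negative) and that start times are measured from an arbitrary reference date; the function name `first_series`, the parameter `origin`, and the sample stays are hypothetical.

```python
from datetime import date


def first_series(stays, origin):
    """Build [(a_i, c_i)] from (start_date, end_date) pairs.

    a_i is the start expressed in days since `origin`; c_i is the length of stay in days.
    """
    series = []
    for start, end in sorted(stays):
        a = (start - origin).days
        c = (end - start).days
        series.append((a, c))
    return series


# Hypothetical admissions for one patient.
stays = [
    (date(2022, 3, 1), date(2022, 3, 4)),
    (date(2022, 1, 10), date(2022, 1, 15)),
]
series = first_series(stays, origin=date(2022, 1, 1))
```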

[0158] The Wasserstein distance between patients' first hospitalization time series can directly measure the similarity of their hospitalizations. However, when the time interval is large, the Wasserstein distance has high complexity. Therefore, an efficient algorithm is needed to approximate the solution.

[0159] This invention proposes an approximate solution method for the Wasserstein distance, which mainly includes the following steps:

[0160] Step 1: Fit the first hospitalization time series with a Gaussian mixture function to obtain a new patient hospitalization time series (i.e., the second hospitalization time series).

[0161] When the number of hospitalizations n is greater than 3, the EM algorithm can be used for fitting. The number of Gaussian components is 3, and the EM algorithm returns the mean $\mu_i$ and standard deviation $\sigma_i$ of the i-th Gaussian component. These values form a new hospitalization time series (i.e., the second hospitalization time series).
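A compact EM fit for a one-dimensional Gaussian mixture is sketched below. Several points are assumptions and not taken from the text: the mixture is fitted to scalar samples (for example, the day offsets of every inpatient day), the components are initialized from contiguous chunks of the sorted samples, and the standard deviation is floored at `min_sigma` to avoid degenerate components. The function name `fit_gmm_1d` and the sample data are hypothetical.

```python
import math


def fit_gmm_1d(xs, k=3, iters=50, min_sigma=1e-3):
    """Fit a k-component 1-D Gaussian mixture with EM; return [(mu, sigma)] sorted by mu."""
    xs = sorted(xs)
    n = len(xs)
    if n < k:
        raise ValueError("need at least k samples")
    # Deterministic initialization from k contiguous chunks of the sorted samples.
    chunks = [xs[j * n // k:(j + 1) * n // k] for j in range(k)]
    weights = [len(c) / n for c in chunks]
    mus = [sum(c) / len(c) for c in chunks]
    sigmas = [max(math.sqrt(sum((x - m) ** 2 for x in c) / len(c)), min_sigma)
              for c, m in zip(chunks, mus)]
    for _ in range(iters):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in xs:
            dens = [w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
                    for w, m, s in zip(weights, mus, sigmas)]
            total = sum(dens)
            resp.append([d / total for d in dens] if total > 0 else [1.0 / k] * k)
        # M-step: re-estimate weight, mean and standard deviation of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / n
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = max(math.sqrt(var), min_sigma)
    return sorted(zip(mus, sigmas))


# Hypothetical inpatient days forming three well-separated hospitalization periods.
xs = list(range(10, 15)) + list(range(100, 105)) + list(range(200, 205))
second_series = fit_gmm_1d(xs)
```

In this example each component settles on one hospitalization period, so a long admission history is summarized by three `(mu, sigma)` pairs.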

[0162] When the number of hospitalizations is less than or equal to 3, the i-th hospitalization $(a_i, c_i)$ is fitted with a Gaussian distribution density function [definition missing in the source text], and $\mu_i$ and $\sigma_i$ are determined by a formula [missing in the source text] in which $\Gamma(0.95)$ represents the 0.95 quantile of the standard Gaussian distribution. This ultimately forms a new hospitalization time series (i.e., the second hospitalization time series).
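Because the formulas for $\mu_i$ and $\sigma_i$ are not preserved in the text, the following sketch rests entirely on an assumption: the stay interval from $a_i$ to $a_i + c_i$ is taken to span the 0.05 to 0.95 quantile range of the fitted Gaussian, which gives $\mu_i = a_i + c_i/2$ and $\sigma_i = c_i / (2\,\Gamma(0.95))$. The actual rule in the invention may differ. The name `gaussian_for_stay` is hypothetical.

```python
from statistics import NormalDist

# 0.95 quantile of the standard Gaussian distribution (about 1.645).
Z95 = NormalDist().inv_cdf(0.95)


def gaussian_for_stay(a, c):
    """Map one hospitalization (start a, length c > 0) to (mu, sigma).

    ASSUMPTION: [a, a + c] covers the central 90% of the Gaussian's mass.
    """
    mu = a + c / 2.0
    sigma = c / (2.0 * Z95)
    return mu, sigma


mu, sigma = gaussian_for_stay(9, 5)
```

Under this assumption, the Gaussian places 95% of its mass before the end of the stay, which can be checked with `NormalDist(mu, sigma).cdf(14)`.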

[0163] Step 2: Calculate the Wasserstein distance between the second hospitalization time series of any two patients and obtain the calculation results.

[0164] Suppose there are two patients, A and B, whose second hospitalization time series are denoted as $\{(\mu_i^a, \sigma_i^a)\}$ and $\{(\mu_j^b, \sigma_j^b)\}$, where i indexes the hospitalizations of patient A, $\mu_i^a$ and $\sigma_i^a$ represent the mean and variance of patient A's i-th hospitalization, j indexes the hospitalizations of patient B, and $\mu_j^b$ and $\sigma_j^b$ represent the mean and variance of patient B's j-th hospitalization. An approximate solution method for the Wasserstein distance is constructed as follows.

[0165] First, construct the matrix $C = (C_{ij})$, where $C_{ij}$ is the Wasserstein distance between the Gaussian distributions $N(\mu_i^a, \sigma_i^a)$ and $N(\mu_j^b, \sigma_j^b)$ [closed-form expression missing in the source text].
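The closed form used by the invention is not preserved in this text. For illustration, the sketch below uses the well-known 2-Wasserstein distance between two one-dimensional Gaussians, $\sqrt{(\mu_1 - \mu_2)^2 + (\sigma_1 - \sigma_2)^2}$ with $\sigma$ as the standard deviation; whether the patent uses this form, its square, or another order is an assumption. The function names and sample series are hypothetical.

```python
import math


def gaussian_w2(mu1, sigma1, mu2, sigma2):
    """2-Wasserstein distance between N(mu1, sigma1^2) and N(mu2, sigma2^2)."""
    return math.sqrt((mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2)


def cost_matrix(series_a, series_b):
    """C[i][j] = distance between A's i-th and B's j-th Gaussian component."""
    return [[gaussian_w2(mu_a, s_a, mu_b, s_b) for (mu_b, s_b) in series_b]
            for (mu_a, s_a) in series_a]


# Hypothetical second hospitalization time series as (mean, standard deviation) pairs.
series_a = [(0, 1), (3, 1)]
series_b = [(0, 1), (3, 5)]
C = cost_matrix(series_a, series_b)
```

Each entry costs constant time, so the matrix is built in time proportional to the product of the two component counts.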

[0166] Construct the vectors $p_a$ and $p_b$ [component formulas missing in the source text].

[0167] Solve the following optimization problem:

[0168] $f = \min_{P} \langle C, P \rangle \quad \text{s.t.} \quad P * \mathbf{1}_a = p_a, \quad P^{T} * \mathbf{1}_b = p_b,$

[0169] where $\langle \cdot, \cdot \rangle$ denotes the matrix inner product, $*$ denotes matrix multiplication, $\mathbf{1}_a$ and $\mathbf{1}_b$ denote vectors with all components equal to 1, $T$ denotes the matrix transpose, and $f$ denotes the Wasserstein distance between the second hospitalization time series of patients A and B.
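The optimization problem above is a small transportation problem. The sketch below solves it exactly with a successive-shortest-path method in plain Python, under assumptions not found in the text: the plan $P$ is non-negative, masses are given as integers with equal totals, and, since the component values of $p_a$ and $p_b$ are not preserved, the wrapper `wasserstein_uniform` assumes uniform weights. All function names are hypothetical.

```python
def transport_cost(C, supply, demand):
    """Minimum of <C, P> over non-negative plans P with row sums `supply`
    and column sums `demand` (integer masses with equal totals)."""
    n, m = len(supply), len(demand)
    if sum(supply) != sum(demand):
        raise ValueError("total supply must equal total demand")
    rs, rd = list(supply), list(demand)   # mass still to ship / to receive
    F = [[0] * m for _ in range(n)]       # current transport plan
    INF = float("inf")
    while any(rs):
        # Bellman-Ford on the residual graph: nodes 0..n-1 are A's components,
        # nodes n..n+m-1 are B's components.
        dist = [0.0 if v < n and rs[v] > 0 else INF for v in range(n + m)]
        prev = [None] * (n + m)
        for _ in range(n + m):
            changed = False
            for i in range(n):
                for j in range(m):
                    if dist[i] + C[i][j] < dist[n + j] - 1e-12:   # forward arc
                        dist[n + j], prev[n + j], changed = dist[i] + C[i][j], i, True
                    if F[i][j] > 0 and dist[n + j] - C[i][j] < dist[i] - 1e-12:  # backward arc
                        dist[i], prev[i], changed = dist[n + j] - C[i][j], n + j, True
            if not changed:
                break
        # Cheapest reachable component of B that still needs mass.
        end = min((j for j in range(m) if rd[j] > 0), key=lambda j: dist[n + j])
        path, v = [], n + end
        while prev[v] is not None:
            path.append((prev[v], v))
            v = prev[v]
        amount = min(rs[v], rd[end])
        for u, w in path:
            if u >= n:                     # backward arc is limited by existing flow
                amount = min(amount, F[w][u - n])
        for u, w in path:
            if u < n:
                F[u][w - n] += amount
            else:
                F[w][u - n] -= amount
        rs[v] -= amount
        rd[end] -= amount
    return sum(C[i][j] * F[i][j] for i in range(n) for j in range(m))


def wasserstein_uniform(C):
    """f for uniform weights 1/n_a and 1/n_b (ASSUMPTION), via integer scaling."""
    n_a, n_b = len(C), len(C[0])
    return transport_cost(C, [n_b] * n_a, [n_a] * n_b) / (n_a * n_b)
```

With at most three Gaussian components per patient, each instance is tiny, which is consistent with the stated goal of keeping the per-pair computation cheap.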

[0170] In the method provided in this embodiment of the invention, a preset threshold τ1 is set. When f > τ1, the similarity of hospitalization time between patient A and patient B is recorded as 1; otherwise, it is recorded as 0.

[0171] Step 202 includes: constructing a similarity measure of patients' basic attributes; specifically including:

[0172] Step 2022: Construct feature vectors based on the patient's basic attributes, with each patient having the same length of basic attribute feature vector.

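[0173] Specifically, given the feature vectors $x_a$ and $x_b$ of patient A and patient B respectively, and writing $\mathrm{sim}(x_a, x_b)$ for their similarity [its defining formula is missing in the source text], the similarity between patient A and patient B is calculated as follows:

[0174] 1, if $\mathrm{sim}(x_a, x_b) > \tau_2$;

[0175] 0, if $\mathrm{sim}(x_a, x_b) \le \tau_2$;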

[0176] Where τ2 is a pre-set threshold.

[0177] Step 203: Building a graph model based on patient feature similarity;

[0178] Specifically, the graph model based on patient feature similarity is constructed by generating the edges of the weighted graph model mainly through the similarity of patients' hospitalization dates and the similarity of patients' basic attributes.

[0179] Here, given the hospitalization data of m patients, a patient hospitalization date similarity matrix $S_1 \in \mathbb{R}^{m \times m}$ and a patient basic attribute similarity matrix $S_2 \in \mathbb{R}^{m \times m}$ can be generated;

[0180] The patient similarity model S is calculated as follows: $S = S_1 \times S_2$, where $\times$ represents element-wise multiplication of the two matrices;

[0181] S is taken as the adjacency matrix of the graph model, which forms an undirected graph G with patients as nodes.

[0182] Step 204: Patient population mining based on maximal clique estimation.

[0183] Highly similar patient groups are primarily identified by solving for maximal cliques in an undirected graph G. Each maximal clique returned by the algorithm represents a patient group with highly similar hospitalization behaviors. The maximal cliques with a high number of patients (e.g., exceeding a preset threshold) are then identified as potentially problematic patient groups.

[0184] The present invention can also provide a similar group determination apparatus, including: a hospital behavior feature acquisition module, a similarity construction module, a graph model building module, and a high similarity group mining module; wherein the hospital behavior feature acquisition module can perform the above step 201, the similarity construction module can perform the above step 202, the graph model building module can perform the above step 203, and the high similarity group mining module can perform the above step 204.

[0185] The method and apparatus provided in this invention, considering that some medical insurance fraud gangs concentrate on fraudulent hospital admissions over a period of time, propose to identify suspicious gangs committing medical insurance fraud by mining groups with similar patient hospitalization behaviors. To mine these groups, patient hospitalization characteristics are modeled by analyzing similar groups (including constructing a similarity matrix of patient hospitalization dates, a similarity matrix of patient basic attributes, and a graphical model of patient feature similarity). Furthermore, when constructing the similarity matrix of patient hospitalization dates, Wasserstein distance is used to more effectively characterize the intrinsic similarity of patient hospitalization times, overcoming the problem that small differences in hospitalization times can lead to large discrepancies in similarity calculation results. Additionally, the calculation method for Wasserstein distance is simplified to reduce computational complexity, thus enabling rapid calculation even with a large number of patients and hospitalizations.

[0186] Figure 3 is a schematic diagram of a similar group determination device provided in an embodiment of the present invention. As shown in Figure 3, the device includes:

[0187] The acquisition module is used to acquire patient hospitalization characteristics;

[0188] The first processing module is used to construct a matrix of patient hospitalization date similarity and a matrix of patient basic attribute similarity based on the patient hospitalization characteristics; the patient hospitalization date similarity is represented by Wasserstein distance.

[0189] The second processing module is used to construct a graph model of patient feature similarity based on the matrix of similarity between the patient hospitalization dates and the matrix of similarity between the patient basic attributes.

[0190] The third processing module is used to perform maximal clique processing on the graphical model based on the patient feature similarity to obtain at least one maximal clique; and to determine the target maximal clique based on the at least one maximal clique.

[0191] In some embodiments, the patient hospitalization characteristics include: patient hospitalization date characteristics;

[0192] The first processing module is used to determine the start time and number of days of hospitalization for each patient in each hospitalization.

[0193] The first hospitalization time series for each patient is determined based on the hospitalization start time and the number of hospitalization days for each patient in each hospitalization.

[0194] The second hospitalization time series is obtained by fitting the first hospitalization time series with a Gaussian mixture function or a Gaussian distribution density function.

[0195] The Wasserstein distance is calculated based on the second hospitalization time series of each of any two patients, and the results are obtained.

[0196] Based on the calculation results between any two patients, construct a matrix of the similarity of the patients' hospitalization dates.

[0197] In some embodiments, the matrix of patient hospitalization date similarity includes: the similarity results of the patient hospitalization dates of any two patients;

[0198] The first processing module is used to determine the similarity result of the hospitalization dates of the two patients as 1 when the calculation result between the two patients exceeds the hospitalization time similarity threshold;

[0199] If the calculated result between two patients is less than or equal to the hospitalization time similarity threshold, the similarity result of the hospitalization dates of the two patients is determined to be 0.

[0200] In some embodiments, the first processing module is configured to fit the first hospitalization time series with a Gaussian mixture function to obtain a second hospitalization time series when the number of hospitalizations of a patient exceeds a first threshold.

[0201] When the number of hospitalizations of a patient is less than or equal to a first threshold, the first hospitalization time series is fitted with a Gaussian distribution density function to obtain a second hospitalization time series.

[0202] In some embodiments, the patient hospitalization characteristics include: basic patient attributes;

[0203] The first processing module is used to construct a basic attribute feature vector for each patient based on the patient's basic attributes.

[0204] Calculate the similarity between the basic attribute feature vectors of any two patients based on the basic attribute feature vector of each patient.

[0205] Construct a matrix of patient basic attribute similarity based on the similarity of the feature vectors of any two patients' basic attributes.

[0206] In some embodiments, the matrix of patient basic attribute similarity includes: the similarity results of the basic attribute feature vectors of any two patients;

[0207] The first processing module is used to determine the similarity result as 1 if the similarity of the basic attribute feature vectors of two patients exceeds the basic attribute similarity threshold.

[0208] If the similarity of the basic attribute feature vectors of two patients is less than or equal to the basic attribute similarity threshold, the similarity result is determined to be 0.

[0209] In some embodiments, the second processing module is used to obtain a patient feature similarity model by element-wise multiplication of the matrix of patient hospitalization date similarity and the matrix of patient basic attribute similarity;

[0210] The elements in the patient feature similarity model are used as the adjacency matrix of the graph model to form an undirected graph with patients as nodes, which serves as the graph model for the patient feature similarity.

[0211] In some embodiments, the third processing module is configured to determine the number of patients included in each of the at least one maximal clique;

[0212] and to determine, as the target maximal clique, a maximal clique in which the number of patients exceeds a preset threshold.

[0213] It should be noted that the similar group determination device provided in the above embodiments is only illustrated by the division of the above program modules when implementing the corresponding similar group determination method. In practical applications, the above processing can be assigned to different program modules as needed, that is, the internal structure of the server can be divided into different program modules to complete all or part of the processing described above. In addition, the device and the corresponding method embodiments provided in the above embodiments belong to the same concept, and their specific implementation process can be found in the method embodiments, which will not be repeated here.

[0214] Figure 4 is a schematic diagram of a similar group determination device provided in an embodiment of the present invention. As shown in Figure 4, the similar group determination device 40 includes: a processor 401 and a memory 402 for storing a computer program capable of running on the processor; when the processor 401 runs the computer program, it performs the following: acquiring patient hospitalization features; constructing a matrix of patient hospitalization date similarity and a matrix of patient basic attribute similarity based on the patient hospitalization features; representing the patient hospitalization date similarity using Wasserstein distance; constructing a graphical model of patient feature similarity based on the matrix of patient hospitalization date similarity and the matrix of patient basic attribute similarity; performing maximal clique processing on the graphical model of patient feature similarity to obtain at least one maximal clique; and determining a target maximal clique based on the at least one maximal clique. Specifically, the similar group determination device can also perform the method shown in Figure 1; it belongs to the same concept as the embodiments of the similar group determination method shown in Figure 1, and the specific implementation process can be found in the method embodiments, which will not be repeated here.

[0215] In practical applications, the similar group determination device 40 may further include at least one network interface 403. The various components in the similar group determination device 40 are coupled together via a bus system 404. It is understood that the bus system 404 is used to implement communication between these components. In addition to a data bus, the bus system 404 also includes a power bus, a control bus, and a status signal bus. However, for clarity, all buses are labeled as bus system 404 in Figure 4. The number of processors 401 can be at least one. The network interface 403 is used for wired or wireless communication between the similar group determination device 40 and other devices.

[0216] The memory 402 in this embodiment of the invention is used to store various types of data to support the operation of the similar group determination device 40.

[0217] The methods disclosed in the above embodiments of the present invention can be applied to processor 401, or implemented by processor 401. Processor 401 may be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the above method can be completed by the integrated logic circuit of the hardware in processor 401 or by instructions in the form of software. The processor 401 may be a general-purpose processor, a digital signal processor (DSP), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. Processor 401 can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present invention. The general-purpose processor may be a microprocessor or any conventional processor, etc. The steps of the methods disclosed in the embodiments of the present invention can be directly manifested as being executed by a hardware decoding processor, or being executed by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium, which is located in memory 402. Processor 401 reads the information in memory 402 and combines its hardware to complete the steps of the aforementioned method.

[0218] In an exemplary embodiment, the similarity group determination device 40 may be implemented by one or more application-specific integrated circuits (ASICs), DSPs, programmable logic devices (PLDs), complex programmable logic devices (CPLDs), field-programmable gate arrays (FPGAs), general-purpose processors, controllers, microcontrollers (MCUs), microprocessors, or other electronic components to perform the aforementioned method.

[0219] This invention also provides a computer-readable storage medium storing a computer program thereon. When the computer program is executed by a processor, it performs the following actions: acquiring patient hospitalization features; constructing a matrix of patient hospitalization date similarity and a matrix of patient basic attribute similarity based on the patient hospitalization features; representing the patient hospitalization date similarity using Wasserstein distance; constructing a graphical model of patient feature similarity based on the matrix of patient hospitalization date similarity and the matrix of patient basic attribute similarity; performing maximal clique processing on the graphical model of patient feature similarity to obtain at least one maximal clique; and determining a target maximal clique based on the at least one maximal clique. Specifically, the computer program can also perform the method shown in Figure 1; it belongs to the same concept as the embodiments of the similar group determination method shown in Figure 1, and the specific implementation process can be found in the method embodiments, which will not be repeated here.

[0220] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of units is only a logical functional division, and in actual implementation, there may be other division methods, such as: multiple units or components can be combined, or integrated into another system, or some features can be ignored or not executed. In addition, the coupling, direct coupling, or communication connection between the various components shown or discussed can be through some interfaces, and the indirect coupling or communication connection between devices or units can be electrical, mechanical, or other forms.

[0221] The units described above as separate components may or may not be physically separate. The components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected to achieve the purpose of this embodiment according to actual needs.

[0222] In addition, in the various embodiments of the present invention, each functional unit can be integrated into one processing unit, or each unit can be a separate unit, or two or more units can be integrated into one unit; the integrated unit can be implemented in hardware or in the form of hardware plus software functional units.

[0223] Those skilled in the art will understand that all or part of the steps of the above method embodiments can be implemented by hardware related to program instructions. The aforementioned program can be stored in a computer-readable storage medium. When the program is executed, it performs the steps of the above method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as mobile storage devices, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0224] Alternatively, if the integrated units of this invention are implemented as software functional modules and sold or used as independent products, they can also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of this invention, or the parts that contribute to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the methods described in the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as mobile storage devices, ROM, RAM, magnetic disks, or optical disks.

[0225] It should be noted that terms such as "first" and "second" are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence.

[0226] Furthermore, the technical solutions described in the embodiments of this application can be combined arbitrarily without conflict.

[0227] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in the present invention should be included within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.

Claims

1. A method for determining similar groups, characterized in that the method includes: obtaining patient hospitalization characteristics; constructing, based on the patient hospitalization characteristics, a matrix of patient hospitalization date similarity and a matrix of patient basic attribute similarity, wherein the patient hospitalization date similarity is represented by Wasserstein distance; constructing a graphical model of patient feature similarity based on the matrix of patient hospitalization date similarity and the matrix of patient basic attribute similarity; performing maximal clique processing based on the graphical model of patient feature similarity to obtain at least one maximal clique; and determining a target maximal clique based on the at least one maximal clique; wherein the patient hospitalization characteristics include patient hospitalization date characteristics, and constructing the matrix of patient hospitalization date similarity based on the patient hospitalization date characteristics includes: determining the hospitalization start time and the number of hospitalization days of each patient in each hospitalization; determining the first hospitalization time series of each patient based on the hospitalization start time and the number of hospitalization days of each patient in each hospitalization; fitting the first hospitalization time series with a Gaussian mixture function or a Gaussian distribution density function to obtain a second hospitalization time series; calculating the Wasserstein distance based on the second hospitalization time series of each of any two patients to obtain a calculation result; and constructing the matrix of patient hospitalization date similarity based on the calculation results between any two patients.

2. The method according to claim 1, characterized in that, The matrix of patient hospitalization date similarity includes: the similarity results of the hospitalization dates of any two patients; The step of constructing a matrix of patient hospitalization date similarity based on the calculation results between any two patients includes: If the calculated result between two patients exceeds the hospitalization time similarity threshold, the similarity result of the hospitalization dates of the two patients is determined to be 1; If the calculated result between two patients is less than or equal to the hospitalization time similarity threshold, the similarity result of the hospitalization dates of the two patients is determined to be 0.

3. The method according to claim 1, characterized in that, The step of fitting the first hospitalization time series with a Gaussian mixture function or a Gaussian distribution density function to obtain the second hospitalization time series includes: When the number of hospitalizations of a patient exceeds the first threshold, a Gaussian mixture function is used to fit the first hospitalization time series to obtain the second hospitalization time series. When the number of hospitalizations of a patient is less than or equal to a first threshold, the first hospitalization time series is fitted with a Gaussian distribution density function to obtain a second hospitalization time series.

4. The method according to claim 1, characterized in that, The patient hospitalization characteristics include: basic patient attributes; The step of constructing a matrix of patient basic attribute similarity based on the patient's basic attributes includes: Based on the basic patient attributes described for each patient, construct a basic attribute feature vector for each patient; Calculate the similarity between the basic attribute feature vectors of any two patients based on the basic attribute feature vector of each patient. Construct a matrix of patient basic attribute similarity based on the similarity of the feature vectors of any two patients' basic attributes.

5. The method according to claim 4, characterized in that, The matrix of patient basic attribute similarity includes: the similarity results of the basic attribute feature vectors of any two patients; The step of constructing a matrix of patient basic attribute similarity based on the similarity of the feature vectors of any two patients includes: If the similarity of the basic attribute feature vectors of two patients exceeds the basic attribute similarity threshold, the similarity result is determined to be 1. If the similarity of the basic attribute feature vectors of two patients is less than or equal to the basic attribute similarity threshold, the similarity result is determined to be 0.

6. The method according to claim 1, characterized in that the step of constructing a graphical model of patient feature similarity based on the matrix of patient hospitalization date similarity and the matrix of patient basic attribute similarity includes: obtaining a patient feature similarity model by element-wise multiplication of the matrix of patient hospitalization date similarity and the matrix of patient basic attribute similarity; and using the elements of the patient feature similarity model as the adjacency matrix of the graph model to form an undirected graph with patients as nodes, which serves as the graph model of patient feature similarity.

7. The method according to claim 1, characterized in that, Determining the target maximal clique based on the at least one maximal clique includes: determining the number of patients included in each of the at least one maximal clique; and identifying, as the target maximal clique, a maximal clique in which the number of patients exceeds a preset threshold.
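
The selection in claim 7 can be sketched as follows. The claims do not say how maximal cliques are enumerated, so the basic Bron–Kerbosch algorithm (without pivoting) is substituted here as an assumption. The function names, the example graph, and the threshold value are hypothetical.

```python
def maximal_cliques(graph):
    """Enumerate all maximal cliques of an undirected graph (node -> neighbour set)."""
    cliques = []

    def expand(current, candidates, excluded):
        # A clique is maximal when no node can extend it and none was excluded earlier.
        if not candidates and not excluded:
            cliques.append(current)
            return
        for node in list(candidates):
            expand(current | {node}, candidates & graph[node], excluded & graph[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(graph), set())
    return cliques


def target_cliques(cliques, min_patients):
    """Keep only cliques whose number of patients exceeds the preset threshold."""
    return [clique for clique in cliques if len(clique) > min_patients]


# Hypothetical patient graph: a triangle 0-1-2, an extra edge 2-3, and isolated patient 4
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}, 4: set()}
all_cliques = maximal_cliques(graph)
suspect_groups = target_cliques(all_cliques, min_patients=2)
print(sorted(sorted(c) for c in all_cliques))
print(sorted(sorted(c) for c in suspect_groups))
```

Only the three-patient clique exceeds the hypothetical threshold of 2, so it is the only group kept.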

8. A device for determining similar groups, characterized in that, The device includes: an acquisition module, used to acquire patient hospitalization characteristics; a first processing module, used to construct a matrix of patient hospitalization date similarity and a matrix of patient basic attribute similarity based on the patient hospitalization characteristics, wherein the patient hospitalization date similarity is represented by the Wasserstein distance; a second processing module, used to construct a graph model of patient feature similarity based on the matrix of patient hospitalization date similarity and the matrix of patient basic attribute similarity; and a third processing module, used to perform maximal clique processing on the graph model of patient feature similarity to obtain at least one maximal clique, and to determine a target maximal clique based on the at least one maximal clique. The patient hospitalization characteristics include: patient hospitalization date characteristics; The first processing module is further configured to: determine the hospitalization start time and the number of hospitalization days for each hospitalization of each patient; determine a first hospitalization time series for each patient based on the hospitalization start time and the number of hospitalization days of each of that patient's hospitalizations; fit the first hospitalization time series with a Gaussian mixture function or a Gaussian distribution density function to obtain a second hospitalization time series; calculate the Wasserstein distance between the second hospitalization time series of any two patients to obtain a calculation result; and construct the matrix of patient hospitalization date similarity based on the calculation results between any two patients.
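
The date-similarity steps assigned to the first processing module can be sketched roughly as below. Every modelling detail is an assumption, not the patent's fitting procedure: each stay becomes one Gaussian centred on the middle of the stay with standard deviation equal to half its length, the mixture is evaluated and normalized on a daily grid that covers all stays, and the one-dimensional Wasserstein-1 distance is computed from the difference of the cumulative sums. All names and example stays are hypothetical.

```python
import math


def fitted_series(stays, days):
    """Second time series: normalized Gaussian mixture over a day grid.

    stays: list of (start_day, length_days); one Gaussian per stay (assumption).
    """
    values = []
    for t in days:
        total = 0.0
        for start, length in stays:
            mean = start + length / 2.0
            sigma = max(length / 2.0, 0.5)
            total += math.exp(-0.5 * ((t - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
        values.append(total)
    norm = sum(values)  # assumes the grid covers the stays, so norm > 0
    return [v / norm for v in values]


def wasserstein_1d(p, q, step=1.0):
    """Wasserstein-1 distance between two distributions on the same evenly spaced grid."""
    distance = cdf_p = cdf_q = 0.0
    for a, b in zip(p, q):
        cdf_p += a
        cdf_q += b
        distance += abs(cdf_p - cdf_q) * step
    return distance


days = range(120)
# Hypothetical stays as (start day, number of days)
series = [
    fitted_series([(10, 5), (40, 7)], days),  # patient A
    fitted_series([(11, 5), (41, 7)], days),  # patient B: same pattern, one day later
    fitted_series([(100, 3)], days),          # patient C: unrelated stay
]
distance_matrix = [[wasserstein_1d(p, q) for q in series] for p in series]
print(distance_matrix[0][1], distance_matrix[0][2])
```

In this example patients A and B, whose stays differ by a one-day shift, end up much closer than A and C. A smaller distance would correspond to a higher hospitalization date similarity in the resulting matrix.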

9. A similar group identification device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the program, it implements the steps of the method according to any one of claims 1 to 7.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 7.