Hospital internal control evaluation method, system and device based on multi-modal data fusion

By employing a hospital internal control evaluation method based on multimodal data fusion, which automatically parses text semantics, detects image features, and identifies log stream patterns, the problems of fragmented multimodal data and insufficient evaluation dimensions in hospital asset management are solved. This enables fine-grained risk assessment and real-time maintenance instruction generation, thereby improving the accuracy and efficiency of asset management.

CN122067739B · Active · Publication Date: 2026-06-19 · SICHUAN ACADEMY OF MEDICAL SCI SICHUAN PROVINCIAL PEOPLES HOSPITAL

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: SICHUAN ACADEMY OF MEDICAL SCI SICHUAN PROVINCIAL PEOPLES HOSPITAL
Filing Date: 2026-04-21
Publication Date: 2026-06-19

Abstract

This invention seeks protection for a hospital internal control evaluation method, system, and device based on multimodal data fusion, belonging to the field of computer-aided medical asset management technology. The method involves acquiring semantic analysis text, medical image sequences, and asset operation log streams of the target asset within a continuous time window; performing dependency parsing on the text to construct semantic feature triples; performing inter-frame differencing on the image sequences to count the number of changing pixels and their spatial distribution centroids, constructing an image feature vector sequence; performing inflection point detection on the log stream to form a log feature sequence; establishing a multimodal feature table, comparing it element-by-element with a standard feature table, and accumulating weighted differences to obtain a total difference score; outputting a risk level based on the interval of the total difference score and generating maintenance instructions containing the operation type and execution time. This invention further performs cluster risk assessment, asset trend prediction, neighbor asset impact analysis, and cross-modal consistency verification on departmental assets, achieving a comprehensive evaluation of hospital hardware assets.

Description

Technical Field

[0001] This invention seeks protection for a hospital internal control evaluation method, system, and device based on multimodal data fusion, belonging to the field of computer-aided medical asset management technology.

Background Technology

[0002] In the internal control evaluation system of medical institutions, hospital asset management evaluation is a crucial link in ensuring the safe operation of medical assets and improving diagnostic quality. Traditional asset management evaluation methods mainly rely on regular manual inspections and post-failure analysis, with evaluation criteria typically based on single-type operation records or simple asset self-inspection reports. In recent years, as hospital information systems have been widely adopted, electronic medical records, medical image archiving and communication systems, and asset log management systems have accumulated large amounts of text descriptions, image sequences, and time-series status data. However, this data is usually scattered across different subsystems, and effective means of correlation analysis are lacking. Existing multimodal data processing technologies in the computer field are mostly geared towards general-purpose image classification or speech recognition tasks. When directly applied to hospital internal control evaluation, they struggle to handle the complex correspondences between asset operation semantics embedded in text, asset imaging features in medical images, and dynamic parameters in log streams, so a large amount of multi-source heterogeneous data remains underutilized.

[0003] For risk assessment of hospital assets, some methods have attempted to use a single data source for analysis. For example, analyzing abnormal numerical changes in asset logs can determine whether an asset needs maintenance, or manually reviewing image quality can indirectly assess the asset's condition. However, these methods have significant shortcomings: relying solely on log data cannot reflect the asset's workload and image output quality in actual clinical operations, while relying solely on image data makes it difficult to trace whether the asset's own operating parameters deviate from standards. Furthermore, existing technologies for semantic text processing are often limited to keyword matching, ignoring the syntactic dependencies between asset identifiers, operational actions, and measured values within the text, resulting in a single dimension of usable information extracted from medical records. In addition, due to the lack of a mechanism to align and fuse data from different modalities on a unified timeline, assessment results often lag behind actual changes in the asset's condition, failing to achieve a comprehensive and forward-looking assessment of the asset's hardware performance.

Summary of the Invention

[0004] The main objective of this application is to provide a multimodal data evaluation method that integrates semantic analysis text, medical image sequences, and asset operation log streams. This method should be able to automatically parse asset operation semantics from electronic medical records, detect changes reflecting asset imaging stability from image sequences, identify fluctuation patterns of asset parameters from log streams, and construct a unified multimodal feature representation through spatiotemporal alignment. Based on this, by comparing actual operational characteristics with standard characteristics, the method outputs the asset's risk level and generates corresponding maintenance operation instructions. This helps solve the problems of fragmented multimodal data, incomplete evaluation dimensions, and inability to correlate asset hardware status in real time in existing hospital internal control evaluations, providing a data-driven, fine-grained evaluation basis for hospital asset management.

[0005] To achieve the above objectives, this application provides the following technical solution:

[0006] The hospital internal control evaluation method based on multimodal data fusion includes the following steps:

[0007] Step A1: Obtain semantic analysis text, medical image sequences, and asset operation log streams generated by the target assets within the hospital within a continuous time window;

[0008] Step A2: Perform dependency parsing on the semantic analysis text to identify text components, extract asset identifiers, action type words, and numerical object words, and combine them into semantic feature triples;

[0009] Step A3: Perform inter-frame difference operation on the medical image sequence, calculate the absolute value of the gray level difference between two adjacent frames and mark the corresponding pixel positions as changed pixels, count the total number of changed pixels and the centroid coordinates of the spatial distribution in each frame and construct an image feature vector sequence;

[0010] Step A4: Perform inflection point detection on the asset operation log stream, traverse the numerical sequence in the log stream, record the inflection points of the values, and calculate the mean, maximum and minimum values of the values in each interval, taking the interval between adjacent inflection points as the unit, and form a log feature sequence.

[0011] Step A5: Match the operation time indication in the semantic feature triple with the timestamps of each frame in the image feature vector sequence, and at the same time match it with the time boundaries of each interval in the log feature sequence to establish a multimodal feature table;

[0012] Step A6: Read the standard feature table that matches the target asset model from the preset standard feature library, compare the multimodal feature table with the standard feature table, calculate the absolute difference between the actual feature value and the standard feature value in each cell, and sum them up to obtain the total difference score;

[0013] Step A7: Based on the preset score range to which the total difference score belongs, output the risk level of the target asset and generate operation instructions for the target asset.

[0014] Further, step A2 includes:

[0015] The semantically analyzed text is divided into multiple sentences by periods, and a dependency parser is used to output the dependency arc structure for each sentence.

[0016] Find the core predicate node from the dependency arc structure, and starting from the core predicate node, find the subject node along the subject relation arc. Extract the words corresponding to the subject node as asset identifiers.

[0017] Find the object node along the direct object relation arc, and extract the words corresponding to the object node as numerical object words;

[0018] Treat the words corresponding to the core predicate nodes as action type words;

[0019] If a single sentence contains multiple subject relation arcs or multiple direct object relation arcs, then multiple asset identifiers or multiple numerical object words are extracted to form multiple semantic feature triples.

[0020] Arrange the semantic feature triples generated from all the sentences in the order in which the sentences appear in the text to form a sequence of semantic feature triples.
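The sub-steps of step A2 can be illustrated with a minimal Python sketch. It assumes the dependency parser has already run and that each sentence is supplied as a token list plus a list of arcs. The arc format `(head_index, relation, dependent_index)`, the relation labels `root`, `nsubj` and `dobj`, and the names `Triple`, `extract_triples` and `extract_sequence` are hypothetical and not taken from the source. Pairing every subject with every object when both are plural is also an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical arc format: (head_index, relation, dependent_index).
# The labels "root", "nsubj" and "dobj" are assumptions; a real parser
# may name the core predicate, subject and direct object arcs differently.
Arc = Tuple[int, str, int]


@dataclass(frozen=True)
class Triple:
    asset_id: str    # word at the subject node
    action: str      # word at the core predicate node
    value_word: str  # word at the direct object node


def extract_triples(tokens: List[str], arcs: List[Arc]) -> List[Triple]:
    """Build semantic feature triples for one already-parsed sentence."""
    roots = [dep for _, rel, dep in arcs if rel == "root"]
    if not roots:
        return []
    pred = roots[0]
    subjects = [dep for head, rel, dep in arcs if rel == "nsubj" and head == pred]
    objects = [dep for head, rel, dep in arcs if rel == "dobj" and head == pred]
    # One triple per subject/object pair, so parallel subjects or parallel
    # objects each yield their own triple (pairing rule is an assumption).
    return [Triple(tokens[s], tokens[pred], tokens[o])
            for s in subjects for o in objects]


def extract_sequence(sentences: List[Tuple[List[str], List[Arc]]]) -> List[Triple]:
    """Concatenate triples in the order the sentences appear in the text."""
    sequence: List[Triple] = []
    for tokens, arcs in sentences:
        sequence.extend(extract_triples(tokens, arcs))
    return sequence


if __name__ == "__main__":
    sentence = (["X-ray tube", "record", "120", "0.5"],
                [(-1, "root", 1), (1, "nsubj", 0), (1, "dobj", 2), (1, "dobj", 3)])
    for triple in extract_sequence([sentence]):
        print(triple)
```

In this example the sentence has one subject and two parallel objects, so two triples sharing the same asset identifier and action type word are produced.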

[0021] Furthermore, step A3 also includes:

[0022] The first frame of the medical image sequence is read as the reference frame. For the second frame and each subsequent frame, the grayscale value of the current image is subtracted from that of the previous frame at the same pixel coordinate position to obtain the difference image.

[0023] Set a grayscale difference threshold, traverse all pixels in the difference image, mark pixels with an absolute grayscale difference greater than the threshold as changed pixels, and mark pixels with an absolute grayscale difference less than or equal to the threshold as unchanged pixels.

[0024] The total number of changing pixels in each frame of the difference image is counted and used as the first feature element of that frame.

[0025] For each frame of the difference image, the sum of the row coordinates of all the changed pixels is divided by the total number of changed pixels to obtain the centroid row coordinates, and the sum of the column coordinates of all the changed pixels is divided by the total number of changed pixels to obtain the centroid column coordinates. The centroid row coordinates and centroid column coordinates are used as the second and third feature elements of the frame image.

[0026] The total number of changed pixels, the row coordinates of the centroid, and the column coordinates of the centroid corresponding to each frame are arranged in frame order to form a three-dimensional image feature vector sequence.

[0027] When the total number of changed pixels in a frame of the difference image is zero, the centroid row coordinates and centroid column coordinates of that frame are set to be the same as the centroid coordinates of the previous frame of the difference image.
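As an illustration of the step A3 sub-steps, the following Python sketch computes the changed-pixel count and centroid for each pair of adjacent frames. It is a sketch under stated assumptions, not the patented implementation: frames are represented as nested lists of integer grayscale values, the function name `frame_features` is hypothetical, and the fallback centroid of (0.0, 0.0) used when the very first difference image contains no changed pixels is an assumption, since the source does not cover that case.

```python
from typing import List, Tuple

Frame = List[List[int]]  # grayscale image as rows of pixel values


def frame_features(frames: List[Frame], threshold: int) -> List[Tuple[int, float, float]]:
    """Return (changed_pixel_count, centroid_row, centroid_col) per difference image."""
    features: List[Tuple[int, float, float]] = []
    # Assumption: used only if the first difference image has no changed pixels.
    last_centroid = (0.0, 0.0)
    for prev, cur in zip(frames, frames[1:]):
        count = 0
        row_sum = 0
        col_sum = 0
        for r, (prev_row, cur_row) in enumerate(zip(prev, cur)):
            for c, (p, q) in enumerate(zip(prev_row, cur_row)):
                # A pixel is "changed" when the absolute grayscale
                # difference is strictly greater than the threshold.
                if abs(q - p) > threshold:
                    count += 1
                    row_sum += r
                    col_sum += c
        if count > 0:
            last_centroid = (row_sum / count, col_sum / count)
        # With zero changed pixels the previous centroid is carried over.
        features.append((count, last_centroid[0], last_centroid[1]))
    return features


if __name__ == "__main__":
    frames = [
        [[0, 0], [0, 0]],
        [[0, 50], [0, 50]],
        [[0, 50], [0, 50]],
    ]
    print(frame_features(frames, threshold=10))
```

Because the first frame has no predecessor, the sequence returned for N frames has N - 1 entries.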

[0028] Furthermore, step A4 also includes:

[0029] Obtain a sequence of values from the asset operation log stream, which contains value points sampled at equal time intervals;

[0030] Starting from the second value point, calculate the difference between each value point and its previous value point to obtain a difference sequence;

[0031] For each difference in the difference sequence, record the sign of the difference as positive, negative, or zero.

[0032] Traverse the difference sequence, and when the signs of two adjacent differences change, mark the next value point corresponding to the sign change position as the inflection point. Specifically:

[0033] If the current difference is positive and the next difference is negative, then the value point corresponding to the next difference is the peak inflection point;

[0034] If the current difference is negative and the next difference is positive, then the value point corresponding to the next difference is the valley inflection point;

[0035] If one or more zero differences lie between two non-zero differences of opposite sign, then the first numerical point in the zero-value interval is marked as the inflection point;

[0036] Arrange all inflection points in chronological order, and use them together with the first and last numerical points of the numerical sequence as dividing points to divide the numerical sequence into multiple consecutive intervals, with each interval located between two adjacent dividing points.

[0037] For each interval, take the values of all the data points in the interval, calculate the arithmetic mean of these values as the mean of the interval, find the maximum value among these values as the maximum value of the interval, and find the minimum value among these values as the minimum value of the interval.

[0038] The mean, maximum, and minimum values of each interval are arranged in the order of the start time of that interval. The three values of each interval form a sub-tuple, and all sub-tuples are arranged in chronological order to form a log feature sequence.
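The inflection point detection and interval statistics of step A4 can be sketched in Python as follows. This is one possible reading of the text, with the assumptions stated here and in the comments: the marked inflection point is the sample shared by the two differences whose signs differ (the local peak or valley itself, or the first sample of a flat run), and each interval includes both of its dividing points. The function names `find_inflections` and `log_features` are hypothetical.

```python
from typing import List, Tuple


def sign(x: float) -> int:
    """Return 1, -1 or 0 for positive, negative or zero."""
    return (x > 0) - (x < 0)


def find_inflections(values: List[float]) -> List[Tuple[int, str]]:
    """Return (index, kind) pairs, where kind is 'peak' or 'valley'."""
    inflections: List[Tuple[int, str]] = []
    last_sign = 0     # sign of the most recent non-zero difference
    candidate = None  # index of the sample where that difference ended
    for i in range(1, len(values)):
        s = sign(values[i] - values[i - 1])
        if s == 0:
            continue  # flat run: keep the candidate at its first sample
        if last_sign != 0 and s != last_sign:
            kind = "peak" if last_sign > 0 else "valley"
            inflections.append((candidate, kind))
        last_sign = s
        candidate = i
    return inflections


def log_features(values: List[float]) -> List[Tuple[float, float, float]]:
    """Return (mean, max, min) for each interval between dividing points."""
    if len(values) < 2:
        return []
    cuts = sorted({0, len(values) - 1, *(i for i, _ in find_inflections(values))})
    features = []
    for a, b in zip(cuts, cuts[1:]):
        segment = values[a:b + 1]  # assumption: both dividing points included
        features.append((sum(segment) / len(segment), max(segment), min(segment)))
    return features


if __name__ == "__main__":
    print(find_inflections([1, 3, 3, 3, 1]))
    print(log_features([1, 3, 2, 4]))
```

Since the sampling interval is fixed, sample indices stand in for timestamps here, and the output order is already chronological.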

[0039] Furthermore, step A5 also includes:

[0040] Determine the start and end times of a unified timeline. The start time is the minimum of the earliest operation time indication in the semantic analysis text, the timestamp of the first frame image in the medical image sequence, and the time of the first sampling point in the asset operation log stream. The end time is the maximum of the three. The time range between the start and end times is divided into multiple consecutive time units.

[0041] For each time unit, perform the following operations:

[0042] Check if a semantic feature triple exists within the time range covered by the time unit. If it exists, convert the numerical object words in the triple into numerical values and fill them into the corresponding semantic feature column of the row. If it does not exist, copy the numerical values from the previous time unit where semantic features exist and fill them in.

[0043] If a frame in the image feature vector sequence falls within the time range covered by the time unit, the total number of changed pixels, the centroid row coordinate, and the centroid column coordinate of that frame are filled into the three image feature columns of the row. If no such frame exists, the three columns are filled with values obtained by linear interpolation from the existing frames closest in time.

[0044] If an interval in the log feature sequence exists within the time range covered by the time unit, the mean, maximum, and minimum values of the interval are filled into the three log feature columns corresponding to the row. If the time unit spans multiple interval boundaries, the feature value of the interval with the largest time proportion is filled in.
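Two of the filling rules in step A5 can be sketched in Python: carrying the last semantic value forward into time units that have none, and choosing the log interval with the largest overlap for each time unit. This sketch makes several assumptions not stated in the source: times are integer seconds, time units are half-open ranges of a fixed width, a unit before the first semantic event stays empty (`None`), and when several events fall in one unit the latest one is used. The image columns are omitted for brevity, and all function names are hypothetical.

```python
from typing import List, Optional, Tuple

Unit = Tuple[int, int]                  # half-open range [start, end) in seconds
LogFeats = Tuple[float, float, float]   # (mean, max, min) of a log interval


def make_units(start: int, end: int, width: int) -> List[Unit]:
    """Split [start, end) into consecutive time units of the given width."""
    return [(t, min(t + width, end)) for t in range(start, end, width)]


def fill_semantic(units: List[Unit],
                  events: List[Tuple[int, float]]) -> List[Optional[float]]:
    """events are (time, value) pairs sorted by time; missing units copy the last value."""
    column: List[Optional[float]] = []
    last: Optional[float] = None
    for lo, hi in units:
        hits = [value for t, value in events if lo <= t < hi]
        if hits:
            last = hits[-1]  # assumption: the latest event in the unit wins
        column.append(last)
    return column


def fill_log(units: List[Unit],
             intervals: List[Tuple[int, int, LogFeats]]) -> List[Optional[LogFeats]]:
    """intervals are (start, end, features); pick the one covering most of each unit."""
    column: List[Optional[LogFeats]] = []
    for lo, hi in units:
        best: Optional[LogFeats] = None
        best_overlap = 0
        for a, b, feats in intervals:
            overlap = min(hi, b) - max(lo, a)
            if overlap > best_overlap:
                best, best_overlap = feats, overlap
        column.append(best)
    return column


if __name__ == "__main__":
    units = make_units(0, 30, 10)
    print(fill_semantic(units, [(12, 5.0)]))
    print(fill_log(units, [(0, 14, (1.0, 2.0, 0.5)), (14, 30, (3.0, 4.0, 2.0))]))
```

The columns returned by these helpers share the same row order as `units`, so they can be placed side by side to form rows of the multimodal feature table.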

[0045] Furthermore, step A6 also includes:

[0046] Read the standard feature table corresponding to the standard asset with the same model as the target asset from the preset standard feature library;

[0047] For each row in the multimodal feature table, the actual feature values in each cell are extracted sequentially;

[0048] For the standard feature values in the standard feature table at the same row and column number, calculate the absolute difference between the actual feature value and the standard feature value;

[0049] For the semantic feature column, the absolute difference is multiplied by the first weighting coefficient to obtain the weighted difference of the cell. For the total number of changed pixels column in the image feature column, the absolute difference is multiplied by the second weighting coefficient to obtain the weighted difference. For the centroid row coordinate column and centroid column coordinate column in the image feature column, the absolute difference is multiplied by the third weighting coefficient and the fourth weighting coefficient respectively to obtain the weighted difference.

[0050] For the mean, maximum, and minimum value columns in the log feature columns, the absolute difference is multiplied by the fifth, sixth, and seventh weighting coefficients, respectively, to obtain the weighted difference;

[0051] The weighted differences of all cells in the multimodal feature table are summed to obtain the total difference score.
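The weighted accumulation in step A6 can be sketched in Python. The sketch assumes each table row holds seven numeric cells in the column order semantic value, changed-pixel count, centroid row, centroid column, log mean, log maximum, log minimum, matched by the first to seventh weighting coefficients. The function name `total_difference_score` and the sample numbers are hypothetical.

```python
from typing import List, Sequence


def total_difference_score(actual: List[Sequence[float]],
                           standard: List[Sequence[float]],
                           weights: Sequence[float]) -> float:
    """Sum weight * |actual - standard| over every cell of the two tables."""
    if len(actual) != len(standard):
        raise ValueError("tables must have the same number of rows")
    total = 0.0
    for actual_row, standard_row in zip(actual, standard):
        if len(actual_row) != len(weights) or len(standard_row) != len(weights):
            raise ValueError("each row needs one cell per weighting coefficient")
        for a, s, w in zip(actual_row, standard_row, weights):
            total += w * abs(a - s)
    return total


if __name__ == "__main__":
    # Column order (assumption): semantic, count, centroid row, centroid col,
    # log mean, log max, log min.
    weights = [1.0, 0.5, 0.1, 0.1, 2.0, 1.0, 1.0]
    actual = [[120, 10, 5, 5, 1, 2, 0]]
    standard = [[118, 8, 5, 6, 1, 3, 0]]
    print(total_difference_score(actual, standard, weights))
```

Because every term is a non-negative weight times an absolute value, the score is zero only when the actual table matches the standard table in every weighted cell.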

[0052] Furthermore, the preset score interval in step A7 includes a low-risk interval, a medium-risk interval, and a high-risk interval. The lower limit of the low-risk interval is zero, and the upper limit of the low-risk interval is a first limit value. The lower limit of the medium-risk interval is equal to the first limit value, and the upper limit of the medium-risk interval is a second limit value. The lower limit of the high-risk interval is equal to the second limit value, and the upper limit of the high-risk interval is positive infinity.

[0053] When the total difference score falls into the low-risk range, the risk level of the target asset is output as low risk, and the instruction type identifier in the generated operation instruction is the routine inspection identifier.

[0054] When the total difference score falls into the medium risk range, the risk level of the target asset is output as medium risk, and the instruction type identifier in the generated operation instruction is parameter calibration identifier;

[0055] When the total difference score falls into the high-risk range, the risk level of the target asset is output as high risk, and the instruction type identifier in the generated operation instruction is a shutdown check identifier.
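The interval lookup of step A7 can be sketched in Python. Because adjacent intervals share their limit values in the text, this sketch assumes half-open intervals, so that a score equal to the first limit value counts as medium risk and a score equal to the second limit value counts as high risk. The function name `assess_risk` and the identifier strings are hypothetical placeholders for the routine inspection, parameter calibration, and shutdown check identifiers.

```python
import math
from typing import Tuple


def assess_risk(score: float, first_limit: float, second_limit: float) -> Tuple[str, str]:
    """Map a total difference score to (risk_level, instruction_type_identifier)."""
    if not 0 < first_limit < second_limit:
        raise ValueError("limits must satisfy 0 < first_limit < second_limit")
    if math.isnan(score) or score < 0:
        raise ValueError("score must be a non-negative number")
    # Assumption: intervals are [0, first), [first, second), [second, +inf).
    if score < first_limit:
        return "low", "ROUTINE_INSPECTION"
    if score < second_limit:
        return "medium", "PARAMETER_CALIBRATION"
    return "high", "SHUTDOWN_CHECK"


if __name__ == "__main__":
    for score in (0.0, 4.1, 10.0, 25.0):
        print(score, assess_risk(score, first_limit=5.0, second_limit=10.0))
```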

[0056] Furthermore, after step A5 is completed and before step A6 begins, a multimodal data consistency verification step is also included:

[0057] For each time unit on the unified time axis, read the values in the semantic feature column, the total number of changed pixels in the image feature column, and the mean value in the log feature column corresponding to that time unit from the multimodal feature table;

[0058] Determine whether there is a preset numerical correspondence between the values in the semantic feature column and the total number of changed pixels in the image feature column at that time unit. The numerical correspondence is defined as follows:

[0059] When the value in the semantic feature column belongs to the first value range, the total number of changed pixels should belong to the first number range. When the value in the semantic feature column belongs to the second value range, the total number of changed pixels should belong to the second number range. Otherwise, it is marked as semantic-image inconsistency.

[0060] Determine whether there is a preset numerical correspondence between the values in the semantic feature column and the mean in the log feature column for that time unit. The numerical correspondence is defined as follows:

[0061] When the value in the semantic feature column belongs to the first value range, the mean value in the log feature column should belong to the first mean value range. When the value in the semantic feature column belongs to the second value range, the mean value in the log feature column should belong to the second mean value range. Otherwise, it is marked as semantic-log inconsistency.

[0062] If both semantic-image inconsistency and semantic-log inconsistency occur simultaneously in a time unit, then that time unit is marked as a severely inconsistent time unit.

[0063] The number of all seriously inconsistent time units is counted. If the number exceeds the preset proportion of the total number of time units in the unified time axis, steps A6 and A7 are stopped, and a maintenance operation instruction containing a data acquisition review instruction is generated. The instruction type of this maintenance operation instruction is identified as a re-acquisition identifier.
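The consistency verification step can be sketched in Python. The sketch assumes each time unit is reduced to a `(semantic_value, changed_pixel_count, log_mean)` row and each correspondence rule is a `(value_range, count_range, mean_range)` triple of inclusive `(low, high)` bounds. These representations, the function names, and the sample ranges are hypothetical. A semantic value that falls in no value range is treated as inconsistent in both checks, following the "otherwise" clause.

```python
from typing import List, Tuple

Range = Tuple[float, float]        # inclusive (low, high) bounds
Rule = Tuple[Range, Range, Range]  # (value_range, count_range, mean_range)
Row = Tuple[float, float, float]   # (semantic_value, changed_pixel_count, log_mean)


def in_range(x: float, bounds: Range) -> bool:
    low, high = bounds
    return low <= x <= high


def check_consistency(rows: List[Row], rules: List[Rule],
                      max_ratio: float) -> Tuple[int, bool]:
    """Return (severely inconsistent unit count, whether to halt steps A6 and A7)."""
    severe = 0
    for semantic, count, mean in rows:
        rule = next((r for r in rules if in_range(semantic, r[0])), None)
        image_ok = rule is not None and in_range(count, rule[1])
        log_ok = rule is not None and in_range(mean, rule[2])
        # Severe only when both cross-modal checks fail in the same time unit.
        if not image_ok and not log_ok:
            severe += 1
    halt = severe > max_ratio * len(rows)
    return severe, halt


if __name__ == "__main__":
    rules = [((0, 100), (0, 50), (0.0, 1.0)),
             ((101, 200), (51, 500), (1.0, 5.0))]
    rows = [(50, 10, 0.5), (150, 10, 0.5), (150, 100, 0.5), (150, 10, 9.0)]
    severe, halt = check_consistency(rows, rules, max_ratio=0.25)
    print(severe, "RE_ACQUISITION" if halt else "continue to step A6")
```

When the halt flag is set, a caller would skip the scoring steps and issue the re-acquisition instruction instead. The string `RE_ACQUISITION` is a placeholder for the re-acquisition identifier.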

[0064] According to a second aspect of the present invention, the present invention claims protection for a hospital internal control evaluation device based on multimodal data fusion, for executing the hospital internal control evaluation method based on multimodal data fusion, comprising the following features:

[0065] Data acquisition module: Acquires semantic analysis text, medical image sequences, and asset operation log streams generated by target assets within a continuous time window within the hospital;

[0066] Text analysis module: Performs dependency parsing on the semantically analyzed text, identifies text components, extracts asset identifiers, action type words, and numerical object words, and combines them into semantic feature triples;

[0067] Image analysis module: Performs inter-frame difference operation on the medical image sequence, calculates the absolute value of the gray level difference between two adjacent frames and marks the corresponding pixel positions as changed pixels, counts the total number of changed pixels and the centroid coordinates of the spatial distribution in each frame and constructs an image feature vector sequence;

[0068] Log analysis module: Performs inflection point detection operation on the asset operation log stream, traverses the numerical sequence in the log stream, records the inflection points of the values, and calculates the mean, maximum and minimum values of the values in each interval, taking the interval between adjacent inflection points as the unit, and forms a log feature sequence;

[0069] Multimodal matching module: Matches the operation time indication in the semantic feature triple with the timestamps of each frame in the image feature vector sequence, and at the same time matches it with the time boundaries of each interval in the log feature sequence to establish a multimodal feature table;

[0070] Difference calculation module: Reads a standard feature table that matches the target asset model from a preset standard feature library, compares the multimodal feature table with the standard feature table, calculates the absolute difference between the actual feature value and the standard feature value in each cell, and accumulates them to obtain the total difference score;

[0071] Risk assessment module: Based on the preset score range to which the total difference score belongs, outputs the risk level of the target asset and generates operation instructions for the target asset.

[0072] According to a third aspect of the present invention, the present invention claims protection for a hospital internal control evaluation system based on multimodal data fusion, comprising:

[0073] One or more processors;

[0074] A memory that stores one or more programs, which, when executed by one or more processors, enable the one or more processors to implement the hospital internal control evaluation method based on multimodal data fusion.

[0075] This invention seeks protection for a hospital internal control evaluation method, system, and device based on multimodal data fusion, belonging to the field of computer-aided medical asset management technology. The method involves acquiring semantic analysis text, medical image sequences, and asset operation log streams of the target asset within a continuous time window; performing dependency parsing on the text to construct semantic feature triples; performing inter-frame differencing on the image sequences to count the number of changing pixels and their spatial distribution centroids, constructing an image feature vector sequence; performing inflection point detection on the log stream to form a log feature sequence; establishing a multimodal feature table, comparing it element-by-element with a standard feature table, and accumulating weighted differences to obtain a total difference score; outputting a risk level based on the interval of the total difference score and generating maintenance instructions containing the operation type and execution time. This invention further performs cluster risk assessment, asset trend prediction, neighbor asset impact analysis, and cross-modal consistency verification on departmental assets, achieving a comprehensive evaluation of hospital hardware assets.

Attached Figure Description

[0076] Figure 1 is a flowchart illustrating the workflow of the hospital internal control evaluation method based on multimodal data fusion, as claimed in the embodiments of the present invention.

[0077] Figure 2 is a second flowchart of the hospital internal control evaluation method based on multimodal data fusion, as claimed in the embodiments of the present invention.

[0078] Figure 3 is a third flowchart of the hospital internal control evaluation method based on multimodal data fusion, as claimed in the embodiments of the present invention.

Detailed Implementation

[0079] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of the embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application.

[0080] The terms "first," "second," and "third" in this application are for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Therefore, a feature defined as "first," "second," or "third" may explicitly or implicitly include at least one of that feature. In the description of this application, "multiple" means at least two, such as two, three, etc., unless otherwise explicitly specified. All directional indications (such as up, down, left, right, front, back, etc.) in the embodiments of this application are only used to explain the relative positional relationships and movements between components in a specific orientation (as shown in the figures). If the specific orientation changes, the directional indications also change accordingly. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or asset that includes a series of steps or units is not limited to the listed steps or units, but may optionally include steps or units not listed, or may optionally include other steps or units inherent to these processes, methods, products, or assets.

[0081] In this document, the term "embodiment" means that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of this application. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor is it a mutually exclusive, independent, or alternative embodiment. It will be explicitly and implicitly understood by those skilled in the art that the embodiments described herein can be combined with other embodiments.

[0082] According to the first embodiment of the present invention, and referring to Figure 1, this invention seeks protection for a hospital internal control evaluation method based on multimodal data fusion, comprising the following steps:

[0083] Step A1: Obtain semantic analysis text, medical image sequences, and asset operation log streams generated by the target assets within the hospital within a continuous time window;

[0084] Step A2: Perform dependency parsing on the semantic analysis text to identify text components, extract asset identifiers, action type words, and numerical object words, and combine them into semantic feature triples;

[0085] Step A3: Perform inter-frame difference operation on the medical image sequence, calculate the absolute value of the gray level difference between two adjacent frames and mark the corresponding pixel positions as changed pixels, count the total number of changed pixels and the centroid coordinates of the spatial distribution in each frame and construct an image feature vector sequence;

[0086] Step A4: Perform inflection point detection on the asset operation log stream, traverse the numerical sequence in the log stream, record the inflection points of the values, and calculate the mean, maximum and minimum values of the values in each interval, taking the interval between adjacent inflection points as the unit, and form a log feature sequence.

[0087] Step A5: Match the operation time indication in the semantic feature triple with the timestamps of each frame in the image feature vector sequence, and at the same time match it with the time boundaries of each interval in the log feature sequence to establish a multimodal feature table;

[0088] Step A6: Read the standard feature table that matches the target asset model from the preset standard feature library, compare the multimodal feature table with the standard feature table, calculate the absolute difference between the actual feature value and the standard feature value in each cell, and sum them up to obtain the total difference score;

[0089] Step A7: Based on the preset score range to which the total difference score belongs, output the risk level of the target asset and generate operation instructions for the target asset.

[0090] In this embodiment, a data acquisition server is deployed within the hospital's internal network. This server establishes communication connections with the hospital information system, the image archiving and communication system, and the asset log management system. The server first receives instructions from the user interface, specifying a target asset to be evaluated, such as a computed tomography (CT) scan asset, and a continuous time window, for example, from 8:00 AM to 5:00 PM on the current date. Within this time window, the server retrieves all electronic medical records related to the target asset from the hospital information system's database. These records contain text descriptions entered by doctors or technicians, such as records of patient examinations, asset operation parameter settings, and textual descriptions of asset self-inspection results. This text content is extracted as semantic analysis text. Simultaneously, the server retrieves all medical images generated by the target asset within the time window from the image archiving and communication system. These images are arranged according to the examination sequence and frame order, forming a medical image sequence. In addition, the server retrieves the status parameter stream recorded by the built-in sensors of the target asset from the asset log management system. This parameter stream records various physical quantities during asset operation at a fixed sampling interval (e.g., once per second), such as tube voltage, filament current, rotating anode speed, and internal asset temperature. These records are arranged in chronological order to form the asset operation log stream. The server temporarily stores the above three types of data in local memory and records their respective timestamp information for subsequent processing.

[0091] The server performs dependency parsing on the semantic analysis text obtained in step A1. First, the server segments the entire text content according to periods, question marks, exclamation marks, and line breaks, resulting in several independent sentences. For each sentence, the server invokes a dependency parser, which performs part-of-speech tagging on the words in the sentence based on a grammar rule base and identifies the grammatical relationships between words. The parser first identifies the core predicate verb in the sentence, such as words indicating actions or states like "set," "adjust," "record," and "display." Starting from the core predicate verb, it finds the subject component along the grammatical arc indicating subject relations (e.g., subject-verb relations). This subject component is usually the name of an asset or asset component, such as "CT asset," "X-ray tube," and "detector." The server extracts this word as an asset identifier. Then, it finds the object component along the grammatical arc indicating direct object relations. The object component is usually a numerical description, such as "one hundred and twenty," "zero point five," and "two." The server extracts this numerical word as a numerical object word. At the same time, the core predicate verb itself is treated as an action type word. If a sentence contains multiple subject relation arcs (e.g., parallel subjects), a semantic feature triple is generated for each subject. If a sentence contains multiple direct object relation arcs (e.g., parallel objects), a semantic feature triple is generated for each object. These triples share the same asset identifier and the same action type word. The server arranges all the triples generated from each sentence in the order in which the sentences appear in the original text, forming a sequence of semantic feature triples. Each triple contains three fields: asset identifier, action type word, and numeric object word.

[0092] The server performs inter-frame difference operations on the medical image sequence obtained in step A1. First, the server reads the first frame of the image sequence and stores it as a reference frame in a temporary buffer. Then, starting from the second frame, it processes each current frame sequentially. For each current frame, the server creates a blank difference matrix of the same size. The server iterates through each pixel position in the current image, extracts the grayscale value at that position, and simultaneously extracts the grayscale value at the same position in the reference frame, calculating the absolute value of the difference between these two grayscale values. The server pre-sets a grayscale difference threshold, the specific value of which is determined by the operator during system configuration based on the image bit depth and image content characteristics. If the calculated absolute difference is greater than the grayscale difference threshold, the server marks the pixel as a changed pixel and records the difference value at the corresponding position in the difference matrix; if the absolute difference is less than or equal to the threshold, it is marked as a non-changed pixel, and the corresponding position in the difference matrix is set to zero. After the iteration is complete, the server counts the total number of changed pixels in the difference matrix, obtaining a numerical value. Next, the server calculates the centroid coordinates of the spatial distribution of all changed pixels: the server initializes two accumulators, one for accumulating the row coordinates and the other for accumulating the column coordinates of the changed pixels. It then iterates through all changed pixels again, adding the row coordinates to the row accumulator and the column coordinates to the column accumulator for each changed pixel encountered. 
After the iteration, the server divides the value of the row accumulator by the total number of changed pixels to obtain the centroid row coordinates; it also divides the value of the column accumulator by the total number of changed pixels to obtain the centroid column coordinates. If the total number of changed pixels in a frame is zero, the centroid coordinates cannot be calculated by division. In this case, the server copies the centroid row and column coordinates from the previous frame's difference image to the current frame. The server combines the total number of changed pixels, the centroid row coordinates, and the centroid column coordinates in a fixed order for the current frame as the feature elements of that frame, and writes these feature elements sequentially into the image feature vector sequence according to the frame order. For the first frame of the image, since it has no previous frame, no difference results are generated. Therefore, the total length of the image feature vector sequence is equal to the total number of frames in the medical image sequence minus one.

[0093] The server performs inflection point detection on the asset operation log stream obtained in step A1. The asset operation log stream is a sequence of numerical points sampled at equal time intervals, each numerical point containing a sampling time and a numerical value. Starting from the second numerical point, the server calculates the difference between each numerical point and its preceding numerical point, and records the sign (positive, negative, or zero) of each difference. The server iterates through these difference signs, searching for the positions where adjacent signs change. When a sign changes from positive to negative, the server marks the following numerical point as a peak inflection point; when a sign changes from negative to positive, the server marks the following numerical point as a valley inflection point; when a zero sign appears, the server further checks the signs before and after the zero sign. If the signs are opposite, the server marks the first numerical point within the zero sign interval as an inflection point. The server arranges all detected inflection points in chronological order, and also adds the first and last numerical points of the entire numerical sequence to the inflection point list, using these points as dividing boundaries. The server divides the original numerical sequence into several consecutive intervals based on these dividing boundaries. Each interval lies between two adjacent dividing boundaries and does not contain boundary points. For each interval, the server extracts the values of all points within that interval, calculates their arithmetic mean, and identifies the maximum and minimum values. The server then arranges the arithmetic mean, maximum, and minimum values of each interval in chronological order, forming a sub-tuple for each interval. All sub-tuples are arranged chronologically to form a log feature sequence, the length of which is equal to the number of intervals.

[0094] The server matches the semantic feature triplet sequence obtained in step A2, the image feature vector sequence obtained in step A3, and the log feature sequence obtained in step A4 to establish a multimodal feature table indexed by a unified timeline. The server first determines the start and end times of the unified timeline: the start time is the minimum of the earliest occurrence of the operation time indication in the semantic analysis text, the timestamp of the first frame image in the medical image sequence, and the time of the first sampling point in the asset operation log stream; the end time is the maximum of the three. The server divides the time range between the start and end times into several consecutive time units, each with the same length determined by system configuration (e.g., one minute). Each time unit corresponds to a row index in the multimodal feature table. For each time unit, the server performs the following fill operation: For the semantic feature column, the server checks if the semantic feature triplet generated in step A2 exists within the time range covered by the time unit. If it exists, the server extracts the numerical object words from the triplet, converts them to numerical form, and fills them into the corresponding semantic feature cell in the row; if it does not exist, the server searches backward for the nearest time unit with a semantic feature triplet and copies the value from that time unit to the current row. For the image feature column, the server checks if a frame from the image feature vector sequence exists within the time range covered by the given time unit. If it does, the server fills the three image feature cells in that row with the total number of changed pixels, the row coordinates of the centroid, and the column coordinates of the centroid. 
If it doesn't exist, the server finds the two closest time units with image features, performs linear interpolation on the feature values of these two time units according to the time distance ratio, and fills the interpolation result into the current row. For the log feature column, the server checks which interval in the log feature sequence the time range covered by the given time unit belongs to. If the time unit falls entirely within an interval, the server fills the mean, maximum, and minimum values of that interval into the three log feature cells in that row. If the time unit crosses the boundaries of two or more intervals, the server calculates the proportion of time each interval occupies within that time unit, selects the interval with the largest proportion, and fills its feature value into the current row. After these operations, the server obtains a multi-row, multi-column multimodal feature table, where each row corresponds to a time unit and each column corresponds to a feature dimension.
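The two gap-handling rules in this paragraph can be sketched as follows. This is a minimal illustration, not the patented implementation. The function names, the use of plain numbers for times, and the `(start, end, features)` layout of an interval are assumptions made for the example.

```python
def interpolate_features(t, t0, f0, t1, f1):
    """Linearly interpolate between feature vector f0 (at time t0) and f1 (at time t1)."""
    if t1 == t0:
        return list(f0)
    ratio = (t - t0) / (t1 - t0)  # time distance ratio
    return [a + ratio * (b - a) for a, b in zip(f0, f1)]


def dominant_interval(unit_start, unit_end, intervals):
    """Return the features of the log interval that covers most of the time unit.

    intervals is a list of (start, end, (mean, maximum, minimum)) entries
    (hypothetical layout).
    """
    best, best_overlap = None, 0.0
    for start, end, feats in intervals:
        overlap = min(unit_end, end) - max(unit_start, start)
        if overlap > best_overlap:
            best, best_overlap = feats, overlap
    return best
```

For example, a time unit halfway between two frames receives the midpoint of their feature values, and a one-minute unit that spends 20 seconds in one log interval and 40 seconds in the next takes the statistics of the second interval.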

[0095] The server reads a standard feature table matching the target asset model from a pre-set standard feature library. This standard feature library is generated by the manufacturer or hospital asset department under standard operating conditions, based on data collected during asset manufacturing or after comprehensive calibration. The standard feature table has the exact same number of rows and columns as the multimodal feature table generated in step A5, and the column order is also completely identical. The server compares the multimodal feature table with the standard feature table row by row and column by column. For each row index and each column index, the server retrieves the actual feature value of that cell in the multimodal feature table and the standard feature value at the same row and column position in the standard feature table. The server calculates the absolute value of the difference between these two values to obtain the absolute difference of that cell. The server pre-sets different weighting coefficients for different feature columns. The specific values of these weighting coefficients are determined during system configuration. All weighting coefficients are positive and their sum equals one. The server multiplies the absolute difference of each cell by the weighting coefficient corresponding to the column containing that cell to obtain the weighted difference value of that cell. The server iterates through all cells in the multimodal feature table, sums up the weighted difference values of each cell, and finally obtains a total value, called the total difference score.
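The accumulation described above can be sketched as a short function. This is an illustrative sketch only. The function name and the representation of both tables as lists of rows are assumptions, and the weight values in the usage example are hypothetical.

```python
def total_difference_score(actual, standard, weights):
    """Sum of weighted absolute cell differences between two equally shaped tables.

    actual and standard are lists of rows; weights holds one coefficient per column.
    """
    if len(actual) != len(standard):
        raise ValueError("tables must have the same number of rows")
    if any(w <= 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be positive and sum to one")
    total = 0.0
    for row_actual, row_standard in zip(actual, standard):
        if len(row_actual) != len(weights) or len(row_standard) != len(weights):
            raise ValueError("every row needs one value per weighted column")
        for a, s, w in zip(row_actual, row_standard, weights):
            total += w * abs(a - s)  # weighted difference of one cell
    return total
```

With two columns weighted 0.5 each, the tables `[[1, 2], [3, 4]]` and `[[1, 1], [2, 4]]` differ by one unit in two cells, giving a total difference score of 1.0.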

[0096] The server determines which preset score interval the total difference score calculated in step A6 falls into. These preset score intervals include a low-risk interval, a medium-risk interval, and a high-risk interval. The lower limit of the low-risk interval is zero, and the upper limit is the first threshold value; the lower limit of the medium-risk interval is equal to the first threshold value, and the upper limit is the second threshold value; the lower limit of the high-risk interval is equal to the second threshold value, and the upper limit is positive infinity. The first and second threshold values are fixed values preset according to the asset type and clinical application requirements. The server compares the total difference score with these two threshold values: if the total difference score is less than the first threshold value, the target asset is classified as low-risk; if the total difference score is greater than or equal to the first threshold value and less than the second threshold value, it is classified as medium-risk; if the total difference score is greater than or equal to the second threshold value, it is classified as high-risk. Based on the determined risk level, the server generates an operation instruction for the target asset. This operation instruction is a structured data object containing two fields: an instruction type identifier field and an instruction execution time field. For low-risk levels, the instruction type identifier is set to "Routine Inspection Identifier," and the instruction execution time is set to 00:00 on the seventh natural day after the current time. For medium-risk levels, the instruction type identifier is set to "Parameter Calibration Identifier," and the instruction execution time is set to the time corresponding to the twenty-fourth hour after the current time. 
For high-risk levels, the instruction type identifier is set to "Stop Inspection Identifier," and the instruction execution time is set to the time corresponding to the fourth hour after the current time. The server sends this operation instruction to the message queue of the hospital asset maintenance management system, which is responsible for scheduling and executing it.
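The threshold comparison and instruction generation can be sketched as follows. The function names and dictionary keys are assumptions. The sketch also assumes that "00:00 on the seventh natural day after the current time" means midnight at the start of the calendar day seven days after the current date, which the text does not state explicitly.

```python
from datetime import datetime, timedelta


def classify_risk(score, first_threshold, second_threshold):
    """Map a total difference score to a risk level using half-open intervals."""
    if score < first_threshold:
        return "low"
    if score < second_threshold:
        return "medium"
    return "high"


def build_instruction(level, now):
    """Build the two-field operation instruction for a risk level."""
    if level == "low":
        # Assumed reading: midnight of the calendar day seven days from now.
        execute_at = (now + timedelta(days=7)).replace(
            hour=0, minute=0, second=0, microsecond=0
        )
        return {"type": "Routine Inspection Identifier", "execute_at": execute_at}
    if level == "medium":
        return {
            "type": "Parameter Calibration Identifier",
            "execute_at": now + timedelta(hours=24),
        }
    return {"type": "Stop Inspection Identifier", "execute_at": now + timedelta(hours=4)}
```

A score exactly equal to the first threshold is classified as medium-risk, and a score exactly equal to the second threshold as high-risk, matching the interval boundaries given above.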

[0097] Further, step A2 includes:

[0098] The semantic analysis text is divided into multiple sentences by periods, and a dependency parser is used to output the dependency arc structure for each sentence.

[0099] Find the core predicate node from the dependency arc structure, and starting from the core predicate node, find the subject node along the subject relation arc. Extract the words corresponding to the subject node as asset identifiers.

[0100] Find the object node along the direct object relation arc, and extract the words corresponding to the object node as numerical object words;

[0101] Treat the words corresponding to the core predicate nodes as action type words;

[0102] If a single sentence contains multiple subject relation arcs or multiple direct object relation arcs, then multiple asset identifiers or multiple numerical object words are extracted to form multiple semantic feature triples.

[0103] Arrange the semantic feature triples generated from all the sentences in the order in which the sentences appear in the text to form a sequence of semantic feature triples.

[0104] In this embodiment, the server reads all text content from the semantic analysis text obtained in step A1. The server segments the text content at periods, question marks, exclamation marks, and line breaks, treating each segment as an independent short sentence unit. The server skips any short sentence unit that is an empty string or contains only whitespace characters.
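A minimal sketch of this segmentation step is shown below. The function name is an assumption, and the inclusion of full-width punctuation marks is an assumption made because hospital records may be written in Chinese. Splitting on every period would also break decimal numerals such as "0.5"; the sketch relies on the embodiment's examples, in which numbers are written out in words.

```python
import re


def split_sentences(text):
    """Split text at sentence-ending punctuation and line breaks, dropping empty units."""
    parts = re.split(r"[.?!。？！\n]+", text)
    return [part.strip() for part in parts if part.strip()]
```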

[0105] For each non-empty short sentence unit, the server invokes a pre-loaded dependency parsing model. This model analyzes the short sentences using either a graph-based or transition-based approach. The analysis process is as follows: First, the model segments the short sentence, dividing the continuous character sequence into independent words, such as "CT assets," "adjust," "to," "two hundred and fifty," and "milliamperes." Then, the model tags each word with its part of speech; for example, "CT assets" is tagged as a noun, "adjust" as a verb, "two hundred and fifty" as a quantifier, and "milliamperes" as a quantifier. Next, the model determines the dependency relationships between words according to grammatical rules, establishing a directed arc for each pair of words with a grammatical relationship. The arc begins at a modifier or subordinate word and ends at a core word or governing word, with the relationship type labeled on the arc, such as "subject-verb relationship," "verb-object relationship," or "attributive relationship."

[0106] After the dependency arc structure is constructed, the server locates the core predicate node. The core predicate is typically the main verb of the entire sentence. In the dependency arc structure, it is the verb node that has no arc pointing to another predicate, or it is identified from the relation type. The server iterates through all nodes and selects the verb node that serves as the sentence's center as the core predicate node. If multiple verbs exist, the topmost verb in the dependency structure is taken as the core predicate.

[0107] Starting from the core predicate node, the server finds the subject node along a directed arc of the "subject relation" type. The endpoint of the subject relation arc is the core predicate node, and the starting point is the subject node. The server extracts the words corresponding to the subject node as asset identifiers. If there are multiple subject relation arcs (e.g., parallel subjects), the server extracts the words corresponding to each subject node separately, and each word is treated as an independent asset identifier.

[0108] Starting from the core predicate node, the server finds the object node along a directed arc of the "direct object relation" type. The core predicate node is the endpoint of the direct object relation arc, and the object node is the starting point. The server extracts the words corresponding to the object nodes as numerical object words. If there are multiple direct object relation arcs (e.g., parallel objects), the words corresponding to each object node are extracted separately, and each word is treated as an independent numerical object word.

[0109] The server uses the words corresponding to the core predicate nodes as action type words.

[0110] For each combination of an asset identifier and a numeric object word, the server generates a semantic feature triple, where the first element of the triple is the asset identifier, the second element is the action type word, and the third element is the numeric object word. If multiple asset identifiers and multiple numeric object words are detected in a short sentence, multiple triples of Cartesian product combinations are generated.
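The Cartesian-product combination can be sketched as follows, assuming the parser output has already been reduced to a list of `(dependent_word, relation, head_word)` arcs. That arc layout, the relation labels `"subject"` and `"direct_object"`, and the function name are hypothetical and do not correspond to any particular parser's API.

```python
from itertools import product


def build_triples(core_predicate, arcs):
    """Generate (asset identifier, action type word, numeric object word) triples."""
    subjects = [
        dep for dep, rel, head in arcs
        if rel == "subject" and head == core_predicate
    ]
    objects = [
        dep for dep, rel, head in arcs
        if rel == "direct_object" and head == core_predicate
    ]
    # One triple per (subject, object) pair; all share the same action type word.
    return [(s, core_predicate, o) for s, o in product(subjects, objects)]
```

A sentence with two parallel subjects and one object therefore yields two triples, and a sentence with no subject or no object yields none.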

[0111] The server arranges all semantic feature triples generated from all short sentences in the order in which the sentences appear in the original text, forming a sequence of semantic feature triples. The order of the triples in this sequence reflects the temporal relationship of the operations described in the text. The server stores this sequence in memory for use in subsequent step A5.

[0112] Furthermore, referring to Figure 2, step A3 further includes:

[0113] The first frame of the medical image sequence is read as the reference frame. For the second frame and each subsequent frame, the grayscale value of the current image is subtracted from that of the previous frame at the same pixel coordinate position to obtain the difference image.

[0114] Set a grayscale difference threshold, traverse all pixels in the difference image, mark pixels with an absolute grayscale difference greater than the threshold as changed pixels, and mark pixels with an absolute grayscale difference less than or equal to the threshold as unchanged pixels.

[0115] The total number of changing pixels in each frame of the difference image is counted and used as the first feature element of that frame.

[0116] For each frame of the difference image, the sum of the row coordinates of all the changed pixels is divided by the total number of changed pixels to obtain the centroid row coordinates, and the sum of the column coordinates of all the changed pixels is divided by the total number of changed pixels to obtain the centroid column coordinates. The centroid row coordinates and centroid column coordinates are used as the second and third feature elements of the frame image.

[0117] The total number of changed pixels, the row coordinates of the centroid, and the column coordinates of the centroid corresponding to each frame are arranged in frame order to form a three-dimensional image feature vector sequence.

[0118] When the total number of changed pixels in a frame of the difference image is zero, the centroid row coordinates and centroid column coordinates of that frame are set to be the same as the centroid coordinates of the previous frame of the difference image.

[0119] In this embodiment, the server first acquires the medical image sequence collected in step A1. This sequence contains multiple frames of images, each with the same number of rows and columns, i.e., the same resolution. The server uses the first frame as a reference frame and copies its pixel data into a dedicated reference frame buffer.

[0120] The server begins iterating through each frame from the second frame to the last frame, performing the following operations on each current frame:

[0121] The server creates a difference image matrix with the exact same dimensions as the image. The matrix has the same number of rows and columns as the image, and each element in the matrix is initially set to zero.

[0122] The server sets a grayscale difference threshold, which is a fixed non-negative integer. The specific value of this threshold is pre-configured by the system administrator based on the image bit depth and clinical requirements. For example, for a 12-bit image, the threshold can be set as a proportion of a certain grayscale range.

[0123] The server iterates through every pixel in the current image. For each pixel, the server retrieves the grayscale value of the current image at that location and simultaneously retrieves the grayscale value of the reference frame at the same location from the reference frame buffer. The server calculates the absolute value of the difference between these two grayscale values. If the absolute value is greater than a grayscale difference threshold, the server classifies the pixel as a changed pixel and sets the value at the corresponding position in the difference image matrix to this absolute value; if the absolute value is less than or equal to the threshold, the server classifies the pixel as a non-changed pixel, and the corresponding position in the difference image matrix remains zero.

[0124] After traversing all pixels, the server counts the number of non-zero elements in the difference image matrix; this value represents the total number of changed pixels. The server stores this total number as a variable.

[0125] Next, the server calculates the centroid coordinates of the spatial distribution of all changed pixels. The server initializes two accumulation variables to zero: one for the row coordinates and one for the column coordinates of changed pixels. The server iterates through all pixel positions again. For each position marked as a changed pixel, the server adds the row coordinate to the row accumulation variable and the column coordinate to the column accumulation variable. After the iteration is complete, the server checks whether the total number of changed pixels is zero. If it is not zero, the server divides the row accumulation variable by the total number of changed pixels to obtain the centroid row coordinate, and divides the column accumulation variable by the total number of changed pixels to obtain the centroid column coordinate. If the total number of changed pixels is zero, division cannot be performed, and the server sets the centroid row and column coordinates of the current frame to the centroid coordinates of the previous frame's difference image. For the first difference image (i.e., the difference between the second frame and the first frame), there is no previous difference image to refer to if the total number of changed pixels is zero. In this case, the server sets the centroid row and column coordinates to the image center, i.e., half the number of rows and half the number of columns.

[0126] The server combines the total number of changed pixels, the row coordinates of the centroid, and the column coordinates of the centroid in a fixed order (number first, then row coordinates, then column coordinates) into a subvector and appends this subvector to the end of the image feature vector sequence. The image feature vector sequence is a list structure where each element is a subvector containing three values, and the order of the subvectors is consistent with the processing order of the image frames.

[0127] After all frames of images have been processed, the length of the image feature vector sequence obtained by the server is equal to the total number of frames in the medical image sequence minus one. This sequence will be stored in memory for use in step A5.
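The per-frame processing of this embodiment can be sketched in plain Python as follows. The function name, the representation of each frame as a list of rows of integer grayscale values, and the zero-based pixel coordinates are assumptions. The sketch counts changed pixels directly instead of materializing the difference matrix, which gives the same count because the threshold is non-negative.

```python
def image_feature_sequence(frames, threshold):
    """Return one (changed_count, centroid_row, centroid_col) tuple per frame after the first."""
    reference = frames[0]
    rows, cols = len(reference), len(reference[0])
    features = []
    # Fallback for the first difference image: the image center.
    previous_centroid = (rows / 2, cols / 2)
    for frame in frames[1:]:
        count = row_sum = col_sum = 0
        for r in range(rows):
            for c in range(cols):
                if abs(frame[r][c] - reference[r][c]) > threshold:
                    count += 1
                    row_sum += r
                    col_sum += c
        if count > 0:
            centroid = (row_sum / count, col_sum / count)
        else:
            centroid = previous_centroid  # reuse the previous difference image's centroid
        features.append((count, centroid[0], centroid[1]))
        previous_centroid = centroid
    return features
```

The returned list has one entry fewer than the number of input frames, consistent with the sequence length stated above.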

[0128] Furthermore, step A4 also includes:

[0129] Obtain a sequence of values from the asset operation log stream, which contains value points sampled at equal time intervals;

[0130] Starting from the second value point, calculate the difference between each value point and its previous value point to obtain a difference sequence;

[0131] For each difference in the difference sequence, record the sign of the difference as positive, negative, or zero.

[0132] Traverse the difference sequence, and when the signs of two adjacent differences change, mark the next value point corresponding to the sign change position as the inflection point. Specifically:

[0133] If the current difference is positive and the next difference is negative, then the value point corresponding to the next difference is the peak inflection point;

[0134] If the current difference is negative and the next difference is positive, then the value point corresponding to the next difference is the valley inflection point;

[0135] If one or more consecutive differences are zero and the non-zero differences immediately before and after them have opposite signs, then the first numerical point in the zero value interval is marked as the inflection point;

[0136] Arrange all inflection points in chronological order, and use them together with the first and last numerical points of the numerical sequence as dividing points to divide the numerical sequence into multiple consecutive intervals, with each interval located between two adjacent dividing points.

[0137] For each interval, take the values of all the data points in the interval, calculate the arithmetic mean of these values as the mean of the interval, find the maximum value among these values as the maximum value of the interval, and find the minimum value among these values as the minimum value of the interval.

[0138] The mean, maximum, and minimum values of each interval are arranged in the order of the start time of that interval. The three values of each interval form a sub-tuple, and all sub-tuples are arranged in chronological order to form a log feature sequence.

[0139] In this embodiment, the server acquires the asset operation log stream collected in step A1. This log stream is a sequence of numerical points, each containing two attributes: sampling time and numerical value. The server arranges the numerical points in the sequence in ascending order of sampling time to ensure correct chronological order.

[0140] The server processes values sequentially from the second value in the sequence to the last value. For the currently processed i-th value, the server calculates the difference between its value and the value of the (i-1)-th value. The server records the sign of this difference: a positive sign if the difference is greater than zero; a negative sign if the difference is less than zero; and a zero sign if the difference is equal to zero. The server stores these signs in a sign sequence of the same length as the difference sequence.

[0141] The server iterates through the sign sequence, looking for positions where two adjacent signs differ. The server compares the j-th sign with the (j+1)-th sign. If the j-th sign is positive and the (j+1)-th sign is negative, the server marks the next value point corresponding to the (j+1)-th difference (i.e., the (j+2)-th original value point) as a peak inflection point. If the j-th sign is negative and the (j+1)-th sign is positive, the server marks the next value point corresponding to the (j+1)-th difference as a valley inflection point. If either the j-th sign or the (j+1)-th sign is zero, further processing is needed: when a run of zeros lies between two opposite signs, such as positive, zero, negative or negative, zero, positive, the server marks the first value point within the zero interval as an inflection point. The zero interval is the range of value points covered by one or more consecutive zero signs.
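The sign-based detection can be sketched as follows, using zero-based indices. The function names and the `"plateau"` label for zero-interval inflection points are assumptions, as is the reading that the first value point of a zero interval is the first of the equal-valued points. The sketch follows the indexing stated above, in which the marked point is the one at the end of the (j+1)-th difference; this is one sample after the local extremum itself.

```python
def sign(value):
    """Return 1, -1, or 0 according to the sign of value."""
    return (value > 0) - (value < 0)


def detect_inflections(values):
    """Return (point_index, kind) pairs, where kind is 'peak', 'valley', or 'plateau'."""
    signs = [sign(b - a) for a, b in zip(values, values[1:])]
    inflections = []
    for k, current in enumerate(signs):
        if current == 0:
            continue
        m = k + 1
        while m < len(signs) and signs[m] == 0:  # skip over a run of zero signs
            m += 1
        if m == len(signs) or signs[m] != -current:
            continue  # no opposite sign follows, so no inflection here
        if m == k + 1:
            # Direct sign change: mark the point ending the later difference.
            inflections.append((k + 2, "peak" if current > 0 else "valley"))
        else:
            # Zero run between opposite signs: mark its first value point.
            inflections.append((k + 1, "plateau"))
    return inflections
```

Because of the one-sample offset, a direct sign change immediately followed by a zero run can mark the same index twice, so a caller would likely de-duplicate indices before using them as dividing boundaries.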

[0142] The server arranges all detected inflection points into an inflection point list according to their sampling time. Then, the server adds the first value point (i.e., the earliest value point sampled) and the last value point (i.e., the latest value point sampled) of the entire log stream sequence to the inflection point list, and re-sorts them by time to ensure that the beginning and end of the list are the beginning and end of the sequence, respectively.

[0143] The server uses each pair of adjacent inflection points in the inflection point list as interval boundaries to divide the original numerical sequence into multiple consecutive intervals. Specifically, for the k-th and (k+1)-th inflection points in the list, the server extracts all numerical points in the original sequence located strictly between these two inflection points. An interval could alternatively be defined to include one of its boundary points; here, the boundary points are excluded so that no point is counted in two intervals. These numerical points constitute an interval.

[0144] For each interval, the server performs the following statistical calculations: The server iterates through all the numerical points within the interval, collecting the value for each point. The server calculates the arithmetic mean of these values, which is the sum of all values divided by the number of numerical points. The server finds the maximum value among these values, which is the largest value recorded during the iteration. The server finds the minimum value among these values, which is the smallest value recorded during the iteration. The server stores these three statistics (mean, maximum, and minimum) as a sub-tuple and, in the order of the interval's start time, stores these sub-tuples sequentially into a list structure.

[0145] The final log feature sequence obtained by the server is this list, the length of which is equal to the number of intervals. The order of the three values in each sub-tuple is fixed: average, then maximum, then minimum. This log feature sequence will be stored in memory for use in step A5.
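The interval division and per-interval statistics can be sketched as follows. The function name and the use of zero-based point indices for inflection points are assumptions. The text does not say what happens when two adjacent boundaries leave no points between them; skipping such empty intervals is an assumption of this sketch.

```python
def log_feature_sequence(values, inflection_indices):
    """Return a (mean, maximum, minimum) sub-tuple for each interval between boundaries."""
    # The first and last points always act as dividing boundaries.
    boundaries = sorted(set(inflection_indices) | {0, len(values) - 1})
    features = []
    for left, right in zip(boundaries, boundaries[1:]):
        inner = values[left + 1:right]  # boundary points themselves are excluded
        if not inner:
            continue  # assumed handling of an empty interval
        features.append((sum(inner) / len(inner), max(inner), min(inner)))
    return features
```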

[0146] Furthermore, step A5 also includes:

[0147] Determine the start and end times of a unified timeline. The start time is the minimum of the earliest operation time indication in the semantic analysis text, the timestamp of the first frame image in the medical image sequence, and the time of the first sampling point in the asset operation log stream. The end time is the maximum of the three. The time range between the start and end times is divided into multiple consecutive time units.

[0148] For each time unit, perform the following operations:

[0149] Check if a semantic feature triple exists within the time range covered by the time unit. If it exists, convert the numerical object words in the triple into numerical values and fill them into the corresponding semantic feature column of the row. If it does not exist, copy the numerical values from the previous time unit where semantic features exist and fill them in.

[0150] If a frame exists in the image feature vector sequence within the time range covered by the time unit, the total number of changed pixels, the row coordinates of the centroid, and the column coordinates of the centroid corresponding to the frame are respectively filled into the three image feature columns corresponding to the row. If no such frame exists, the columns are filled with values obtained by linear interpolation from the existing frames that are closest in time.

[0151] If an interval in the log feature sequence exists within the time range covered by the time unit, the mean, maximum, and minimum values of the interval are filled into the three log feature columns corresponding to the row. If the time unit spans multiple interval boundaries, the feature value of the interval with the largest time proportion is filled in.

[0152] In this embodiment, after completing steps A2, A3, and A4, the server obtains a semantic feature triplet sequence, an image feature vector sequence, and a log feature sequence, respectively. Each element in the sequence is associated with time information: the semantic feature triplet is associated with an operation time indication parsed from the text (e.g., a timestamp field extracted from medical records); each sub-vector in the image feature vector sequence is associated with the acquisition timestamp of the corresponding image frame; and each sub-tuple in the log feature sequence is associated with the time boundaries (start and end times) of the corresponding interval.

[0153] The server first determines the start and end times of the unified timeline. It extracts all operation time indicators from the semantic feature triplet sequence and finds the minimum value; it extracts all image frame timestamps from the image feature vector sequence and finds the minimum value; and it extracts the start times of all intervals from the log feature sequence and finds the minimum value. The server takes the minimum of these three minimum values as the start time of the unified timeline. Similarly, the server takes the maximum value of the semantic operation time indicator, the maximum value of the image frame timestamp, and the maximum value of the log interval end time as the end time of the unified timeline.

[0154] The server divides the time range from the start time to the end time into several consecutive time units. The length of each time unit is a pre-configured fixed value, such as one minute or five seconds. The server calculates the total number of time units based on their lengths, which is calculated by dividing (end time minus start time) by the unit length and rounding up. Each time unit corresponds to a row index, from the first row to the last row.

[0155] The server creates a multimodal feature table, which is a two-dimensional table structure. The number of rows equals the number of time units. The number of columns equals the sum of the number of semantic feature columns (one), the number of image feature columns (three), and the number of log feature columns (three), for a total of seven columns. All cells in the table are initially empty.

[0156] The server processes each row sequentially according to the time unit. The start of the time interval of a row is the start time of the unified timeline plus (row index minus one) multiplied by the time unit length, and the end of the interval is the start time of the unified timeline plus the row index multiplied by the time unit length.
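
The timeline construction of paragraphs [0153] to [0156] can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `unified_timeline` and `row_interval`, the use of seconds as plain floats, and the representation of the empty table as a list of lists are all assumptions made for the example.

```python
import math

def unified_timeline(semantic_times, frame_times, log_intervals, unit_len):
    """Return (start, end, n_units) for the unified timeline.

    semantic_times: operation times of the semantic triples (seconds)
    frame_times:    acquisition timestamps of the image frames (seconds)
    log_intervals:  (start, end) pairs of the log feature intervals (seconds)
    unit_len:       pre-configured time unit length (seconds)
    """
    start = min(min(semantic_times), min(frame_times),
                min(s for s, _ in log_intervals))
    end = max(max(semantic_times), max(frame_times),
              max(e for _, e in log_intervals))
    # Number of units: (end - start) / unit length, rounded up.
    n_units = math.ceil((end - start) / unit_len)
    return start, end, n_units

def row_interval(start, unit_len, row_index):
    """Time interval of a 1-based row index, following [0156]."""
    lo = start + (row_index - 1) * unit_len
    hi = start + row_index * unit_len
    return lo, hi

start, end, n_units = unified_timeline(
    [12.0, 70.0], [10.0, 130.0], [(11.0, 60.0), (60.0, 125.0)], 60.0)
# Seven columns: 1 semantic + 3 image + 3 log; all cells initially empty.
table = [[None] * 7 for _ in range(n_units)]
```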

[0157] For the semantic feature column corresponding to the current row, the server checks if there is a triplet in the semantic feature triplet sequence whose operation time indication falls within the time interval of the current row. If it exists, the server extracts the numeric object word from the triplet, converts it to a number (e.g., converts the string "one hundred and twenty" to the floating-point number one hundred and twenty), and fills it into the semantic feature cell of the current row. If multiple triplets fall within the same time interval, the server selects the triplet that is closest to the center point of the interval in time. If no triplet falls within the time interval of the current row, the server searches backward for the nearest time row containing a triplet and copies the value from the semantic feature cell of that row to the current row.
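
The fill rule for the semantic feature column in [0157] might look like the sketch below. It assumes the numeric object words have already been converted to floats, that each row covers a half-open interval (a triple exactly on a boundary belongs to the later row), and that rows preceding the first triple stay empty; none of these details is fixed by the text. The name `fill_semantic_column` is hypothetical.

```python
def fill_semantic_column(triples, start, unit_len, n_units):
    """triples: list of (operation_time, numeric_value) pairs."""
    column = [None] * n_units
    for row in range(n_units):
        lo = start + row * unit_len
        hi = lo + unit_len
        center = (lo + hi) / 2
        # Assumption: intervals are half-open, [lo, hi).
        hits = [(t, v) for t, v in triples if lo <= t < hi]
        if hits:
            # Several triples in one unit: keep the one nearest the center.
            column[row] = min(hits, key=lambda tv: abs(tv[0] - center))[1]
        elif row > 0:
            # No triple here: the previous row already holds the value of
            # the nearest earlier row that contained a triple, so copy it.
            column[row] = column[row - 1]
    return column

semantic = fill_semantic_column(
    [(5.0, 120.0), (8.0, 110.0), (25.0, 90.0)], 0.0, 10.0, 4)
```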

[0158] For the three image feature columns corresponding to the current row (total number of changed pixels, centroid row coordinates, and centroid column coordinates), the server checks if there is a frame in the image feature vector sequence whose timestamp falls within the time interval of the current row. If it exists, the server extracts the three feature values corresponding to that frame and fills them into the three image feature cells of the current row. If it does not exist, the server searches for the two frames closest in time to the center point of the current row's time interval. Let the timestamp of the previous frame be T1 and the timestamp of the next frame be T2, with corresponding feature values V1 and V2, and the time of the center point of the current row's time interval be T. The server calculates the interpolation coefficients such that the interpolation result V = V1 + (V2 - V1)*(T - T1) / (T2 - T1), and then fills V into the corresponding cells. If there is only the previous frame and no next frame, the feature values of the previous frame are directly copied.
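
The interpolation V = V1 + (V2 - V1)*(T - T1) / (T2 - T1) from [0158] can be sketched as below. The sketch assumes frames are sorted by strictly increasing timestamp and applies the same formula to all three feature values. Copying the first frame when no previous frame exists is an assumption; the text only specifies the case where the next frame is missing. The name `interpolate_features` is hypothetical.

```python
from bisect import bisect_left

def interpolate_features(frames, t):
    """frames: list of (timestamp, (count, centroid_row, centroid_col)),
    sorted by strictly increasing timestamp. Returns the features at time t."""
    times = [ts for ts, _ in frames]
    i = bisect_left(times, t)
    if i == 0:
        # Assumption: no previous frame, so copy the first frame.
        return frames[0][1]
    if i == len(frames):
        # Only a previous frame exists: copy its values directly.
        return frames[-1][1]
    (t1, v1), (t2, v2) = frames[i - 1], frames[i]
    w = (t - t1) / (t2 - t1)
    return tuple(a + (b - a) * w for a, b in zip(v1, v2))

frames = [(0.0, (100.0, 10.0, 20.0)), (10.0, (200.0, 30.0, 40.0))]
mid_row = interpolate_features(frames, 5.0)
```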

[0159] For the three log feature columns (mean, maximum, and minimum) corresponding to the current row, the server first determines the time range covered by each interval in the log feature sequence. The server then checks which log intervals overlap with the current row's time interval. If the current row's time interval is completely contained within a single log interval, the mean, maximum, and minimum values of that log interval are directly extracted and filled into the three log feature cells of the current row. If the current row's time interval spans the boundaries of multiple log intervals, the server calculates the overlap duration of each log interval within the current row's time interval, selects the log interval with the largest overlap duration, and fills its feature value into the current row. If two intervals have equal overlap durations, the earlier interval is selected.
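
A minimal sketch of the overlap rule in [0159] follows. It assumes log intervals are given as `(start, end, mean, maximum, minimum)` tuples sorted by start time, and it returns `None` when no interval overlaps the row; both the tuple layout and the name `pick_log_interval` are illustrative choices.

```python
def pick_log_interval(log_intervals, lo, hi):
    """Return (mean, maximum, minimum) of the log interval that overlaps
    the row interval [lo, hi] the longest; ties go to the earlier interval."""
    best, best_overlap = None, 0.0
    for s, e, mean, mx, mn in log_intervals:
        overlap = min(e, hi) - max(s, lo)
        # Strict '>' keeps the earlier interval when overlaps are equal.
        if overlap > best_overlap:
            best, best_overlap = (mean, mx, mn), overlap
    return best

intervals = [(0.0, 6.0, 1.0, 2.0, 0.5),
             (6.0, 12.0, 3.0, 4.0, 2.5),
             (12.0, 20.0, 5.0, 6.0, 4.0)]
```

A row fully contained in one interval simply picks that interval, since no other interval has positive overlap with it.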

[0160] After all rows have been processed, the server obtains a fully populated multimodal feature table, which will be passed to step A6 for further processing.

[0161] Furthermore, referring to Figure 3, step A6 further includes:

[0162] Read the standard feature table corresponding to the standard asset with the same model as the target asset from the preset standard feature library;

[0163] For each row in the multimodal feature table, the actual feature values in each cell are extracted sequentially;

[0164] For the standard feature values in the standard feature table at the same row and column number, calculate the absolute difference between the actual feature value and the standard feature value;

[0165] For the semantic feature column, the absolute difference is multiplied by the first weighting coefficient to obtain the weighted difference of the cell. For the total number of changed pixels column in the image feature column, the absolute difference is multiplied by the second weighting coefficient to obtain the weighted difference. For the centroid row coordinate column and centroid column coordinate column in the image feature column, the absolute difference is multiplied by the third weighting coefficient and the fourth weighting coefficient respectively to obtain the weighted difference.

[0166] For the mean, maximum, and minimum value columns in the log feature columns, the absolute difference is multiplied by the fifth, sixth, and seventh weighting coefficients, respectively, to obtain the weighted difference;

[0167] The weighted differences of all cells in the multimodal feature table are summed to obtain the total difference score.

[0168] In this embodiment, the server reads a standard feature table matching the target asset model from a pre-built standard feature library. The standard feature library is a database that stores standard feature tables for different asset models under different operating conditions. The standard feature table has the same number of rows and columns as the multimodal feature table generated in step A5, and the column order is also consistent. Each cell in the standard feature table stores the standard value of the asset in the corresponding time unit and feature dimension under standard operating conditions.

[0169] The server initializes an accumulated variable called the total difference score, with an initial value of zero. The server traverses each cell in the multimodal feature table in the order of row index from the first row to the last row and column index from the first column to the last column.

[0170] For the currently visited cell, the server retrieves the actual feature value of that cell and simultaneously obtains the standard feature value from the same row and column number in the standard feature table. The server calculates the absolute value of the difference between these two values to obtain the absolute difference.

[0171] The server determines the category of a feature column based on the column index of the current cell and retrieves the corresponding weighting coefficient. The weighting coefficients are a pre-configured set of positive numbers stored in the system configuration file. Specifically, the semantic feature column corresponds to the first weighting coefficient; the total number of changed pixels in the image feature columns corresponds to the second weighting coefficient; the centroid row coordinate in the image feature columns corresponds to the third weighting coefficient; the centroid column coordinate in the image feature columns corresponds to the fourth weighting coefficient; the mean in the log feature columns corresponds to the fifth weighting coefficient; the maximum value in the log feature columns corresponds to the sixth weighting coefficient; and the minimum value in the log feature columns corresponds to the seventh weighting coefficient. Each of the seven weighting coefficients lies between zero and one, and their sum equals one.

[0172] The server multiplies the absolute difference of the current cell by the weighting coefficient corresponding to the column containing that cell to obtain the weighted difference value. The server then adds this weighted difference value to the total difference score variable.

[0173] The server continues to traverse the next cell, repeating the above operation until all cells in the multimodal feature table have been processed.

[0174] After the iteration is complete, the value stored in the total difference score variable is the sum of the weighted difference values of all cells. This total difference score will be passed to step A7 for risk level determination. The weighting coefficients can be set to adjust the contribution of features from different modalities to the final evaluation result according to clinical importance. For example, for some assets, the mean in log features may be more important than the centroid coordinates in image features, so the corresponding weighting coefficients can be set higher.
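
The accumulation described in [0169] to [0174] amounts to summing, over every cell, the column weight times the absolute difference between the actual and standard values. A minimal sketch, assuming both tables are fully populated lists of seven-element rows and that the weight values shown are purely illustrative:

```python
def total_difference_score(actual, standard, weights):
    """Sum of weight * |actual - standard| over all cells.

    actual, standard: tables with identical shape (rows of 7 numbers)
    weights:          one weighting coefficient per column
    """
    if len(actual) != len(standard):
        raise ValueError("tables must have the same number of rows")
    total = 0.0
    for a_row, s_row in zip(actual, standard):
        for a, s, w in zip(a_row, s_row, weights):
            total += w * abs(a - s)
    return total

# Illustrative weights only; they sum to one as required by [0171].
weights = [0.2, 0.2, 0.1, 0.1, 0.2, 0.1, 0.1]
actual = [[120.0, 500.0, 10.0, 20.0, 3.0, 5.0, 1.0]]
standard = [[100.0, 400.0, 12.0, 20.0, 2.5, 6.0, 1.0]]
score = total_difference_score(actual, standard, weights)
```

Because the columns are not normalized in this sketch, a column with large raw values (such as the changed pixel count) can dominate the score unless its weight is chosen accordingly.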

[0175] Furthermore, the preset score interval in step A7 includes a low-risk interval, a medium-risk interval, and a high-risk interval. The lower limit of the low-risk interval is zero, and the upper limit of the low-risk interval is a first limit value. The lower limit of the medium-risk interval is equal to the first limit value, and the upper limit of the medium-risk interval is a second limit value. The lower limit of the high-risk interval is equal to the second limit value, and the upper limit of the high-risk interval is positive infinity.

[0176] When the total difference score falls into the low-risk range, the risk level of the target asset is output as low risk, and the instruction type identifier in the generated operation instruction is the routine inspection identifier.

[0177] When the total difference score falls into the medium-risk range, the risk level of the target asset is output as medium risk, and the instruction type identifier in the generated operation instruction is a parameter calibration identifier.

[0178] When the total difference score falls into the high-risk range, the risk level of the target asset is output as high risk, and the instruction type identifier in the generated operation instruction is a shutdown check identifier.

[0179] In this embodiment, the server calculates the total difference score in step A6. The server pre-stores two thresholds, referred to as the first threshold and the second threshold. The specific values of these two thresholds are pre-set based on asset type, clinical application standards, and the hospital's quality management requirements. The first threshold is less than the second threshold.

[0180] The server compares the total difference score with a first threshold and a second threshold. If the total difference score is greater than or equal to zero and less than the first threshold, it is determined that the total difference score falls into the first interval. If the total difference score is greater than or equal to the first threshold and less than the second threshold, it is determined that it falls into the second interval. If the total difference score is greater than or equal to the second threshold, it is determined that it falls into the third interval.

[0181] Based on the range in which the total difference score falls, the server determines the evaluation level of the target asset. If it falls into the first range, the evaluation level is Level 1; if it falls into the second range, the evaluation level is Level 2; and if it falls into the third range, the evaluation level is Level 3.

[0182] The server generates maintenance instructions based on the evaluation level. A maintenance instruction is a data structure containing two fields: a maintenance operation type field and an operation time window field.

[0183] When the evaluation level is Level 1, the server will set the maintenance operation type to "cleaning and inspection operation" and set the operation time window to within 72 hours from the current moment. In other words, the instruction requires relevant personnel to complete a cleaning and routine inspection of the assets within the next 72 hours.

[0184] When the evaluation level is Level 2, the server will set the maintenance operation type to "Component Calibration Operation" and set the operation time window to within 24 hours from the current moment. The instruction requires relevant personnel to complete the parameter calibration of critical asset components within the next 24 hours.

[0185] When the evaluation level is Level 3, the server will set the maintenance operation type to "shutdown and repair operation" and set the operation time window to within four hours from the current moment. The instruction requires relevant personnel to stop the asset operation and conduct a comprehensive overhaul within the next four hours.

[0186] The server encapsulates the generated maintenance instructions into a message and sends it to the task queue of the asset maintenance management system via the hospital's internal network. The maintenance management system prioritizes tasks according to their urgency within the operation time window and notifies the corresponding maintenance personnel to perform the operation.
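
The threshold comparison and instruction generation of [0180] to [0185] can be sketched as follows. The `MaintenanceInstruction` class, its field names, and the representation of the operation time window as a single deadline are assumptions for illustration; the thresholds are passed in because the text leaves their values to configuration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class MaintenanceInstruction:
    operation_type: str   # maintenance operation type field
    deadline: datetime    # end of the operation time window

# Level -> (operation type, time window in hours), per [0183] to [0185].
LEVEL_ACTIONS = {
    1: ("cleaning and inspection operation", 72),
    2: ("component calibration operation", 24),
    3: ("shutdown and repair operation", 4),
}

def evaluation_level(score, first_threshold, second_threshold):
    """Map the total difference score to Level 1, 2, or 3."""
    if score < 0:
        raise ValueError("score must be non-negative")
    if score < first_threshold:
        return 1
    if score < second_threshold:
        return 2
    return 3

def make_instruction(level, now):
    operation_type, hours = LEVEL_ACTIONS[level]
    return MaintenanceInstruction(operation_type, now + timedelta(hours=hours))

instruction = make_instruction(evaluation_level(60.0, 10.0, 50.0),
                               datetime(2026, 1, 1, 0, 0))
```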

[0187] Furthermore, after step A5 is completed and before step A6 begins, a multimodal data consistency verification step is also included:

[0188] For each time unit on the unified time axis, read the values in the semantic feature column, the total number of changed pixels in the image feature column, and the mean value in the log feature column corresponding to that time unit from the multimodal feature table;

[0189] Determine whether there is a preset numerical correspondence between the values in the semantic feature column and the total number of changed pixels in the image feature column at that time unit. The numerical correspondence is defined as follows:

[0190] When the value in the semantic feature column belongs to the first value range, the total number of changed pixels should belong to the first number range. When the value in the semantic feature column belongs to the second value range, the total number of changed pixels should belong to the second number range. Otherwise, it is marked as semantic-image inconsistency.

[0191] Determine whether there is a preset numerical correspondence between the values in the semantic feature column and the mean in the log feature column for that time unit. The numerical correspondence is defined as follows:

[0192] When the value in the semantic feature column belongs to the first value range, the mean value in the log feature column should belong to the first mean value range. When the value in the semantic feature column belongs to the second value range, the mean value in the log feature column should belong to the second mean value range. Otherwise, it is marked as semantic-log inconsistency.

[0193] If both semantic-image inconsistency and semantic-log inconsistency occur simultaneously in a time unit, then that time unit is marked as a severely inconsistent time unit.

[0194] The number of all seriously inconsistent time units is counted. If the number exceeds the preset proportion of the total number of time units in the unified time axis, steps A6 and A7 are stopped, and a maintenance operation instruction containing a data acquisition review instruction is generated. The instruction type of this maintenance operation instruction is identified as a re-acquisition identifier.

[0195] In this embodiment, after the server completes the construction of the multimodal feature table in step A5, it performs a consistency check step before performing the difference comparison in step A6.

[0196] The server iterates through each row of the multimodal feature table, i.e., each time unit. For the current time unit, the server reads the values of three specific cells from that row: the first is the operating parameter value (e.g., tube voltage setting) in the semantic feature column; the second is the average grayscale value in the image feature column (which is calculated as part of the pixel intensity distribution feature in step A3, representing the average grayscale within the image sub-region); and the third is the amplitude fluctuation magnitude in the temporal feature column (which is calculated as a feature of the waveform segment in step A4, representing the degree of drastic change in signal amplitude).

[0197] The server performs the first judgment: checking whether a preset correlation exists between the operating parameter value and the grayscale average value within that time unit. This preset correlation is defined by a set of rules stored in the system configuration. For example, a rule might state: when the operating parameter value belongs to the high parameter range, the grayscale average value should belong to the high grayscale range. The specific boundary values of the high parameter range and the high grayscale range are preset based on asset type and clinical experience. The server compares the operating parameter value with the boundary of the high parameter range, and simultaneously compares the grayscale average value with the boundary of the high grayscale range. If the operating parameter value falls within the high parameter range but the grayscale average value does not, it is considered inconsistent; conversely, if the operating parameter value does not fall within the high parameter range but the grayscale average value does, it is also considered inconsistent. Only when both fall within the corresponding range or neither falls within the corresponding range is it considered consistent.

[0198] The server performs a second check: it checks whether a pre-defined correlation exists between the operating parameter value and the amplitude fluctuation range within that time unit. Similarly, this pre-defined correlation is defined by rules. For example, when the operating parameter value belongs to the high parameter range, the amplitude fluctuation range should also belong to the high fluctuation range. The server compares the operating parameter value with the boundary of the high parameter range and the amplitude fluctuation range with the boundary of the high fluctuation range. If the operating parameter value falls within the high parameter range but the amplitude fluctuation range does not, it is considered inconsistent; conversely, if the operating parameter value does not fall within the high parameter range but the amplitude fluctuation range does, it is also considered inconsistent.

[0199] If either the first or second check is inconsistent for the current time unit, the server marks that time unit as an abnormal time unit. If both checks are consistent, no mark is made.

[0200] After the server iterates through all time units, it counts the total number of time units marked as abnormal. Simultaneously, it calculates the total number of time units on the unified timeline. The server then calculates the proportion of abnormal time units to the total number of time units.

[0201] The server compares this proportion to a preset threshold (e.g., 10%). If the proportion of abnormal time units exceeds the threshold, the server determines that there is a serious inconsistency between the multimodal data, meaning that the multimodal feature table constructed in step A5 may be unreliable due to data acquisition errors, time synchronization errors, or asset anomalies. In this case, the server skips steps A6 and A7, i.e., it does not perform difference comparison and risk level judgment, but directly generates a special maintenance instruction. The maintenance operation type of this maintenance instruction is set to "data review instruction," requiring the operator to check the data acquisition assets and time synchronization system. The operation time window for this maintenance instruction is set to within one hour from the current moment. The server sends this instruction to the maintenance management system.

[0202] If the proportion of abnormal time units does not exceed the threshold, the server continues to execute steps A6 and A7 and proceeds with the subsequent evaluation normally.
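
The consistency check of this embodiment ([0196] to [0202]) can be sketched as below, using the rule that a time unit is abnormal when either check is inconsistent. The range boundaries, the closed-interval membership test, and the names `in_range` and `should_skip_evaluation` are assumptions; the text says only that the boundaries are preset per asset type.

```python
def in_range(value, bounds):
    lo, hi = bounds
    return lo <= value <= hi

def should_skip_evaluation(rows, high_param, high_image, high_log,
                           max_ratio=0.10):
    """rows: one (parameter, image_value, log_value) tuple per time unit.
    Returns True when steps A6 and A7 should be skipped."""
    abnormal = 0
    for param, image_value, log_value in rows:
        p_high = in_range(param, high_param)
        # Consistent only if both values are in their high range or neither is.
        check1 = p_high == in_range(image_value, high_image)
        check2 = p_high == in_range(log_value, high_log)
        if not (check1 and check2):
            abnormal += 1
    return abnormal / len(rows) > max_ratio

ok_row = (120.0, 200.0, 7.0)    # all three in their high ranges
bad_row = (120.0, 100.0, 7.0)   # parameter high, image value not
ranges = ((100.0, 150.0), (180.0, 255.0), (5.0, 10.0))
```

With ten time units, one abnormal unit gives a proportion of exactly 10%, which does not exceed the threshold, so evaluation continues; two abnormal units trigger the data review instruction.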

[0203] According to a second embodiment of the present invention, the present invention claims protection for a hospital internal control evaluation device based on multimodal data fusion, used to execute the hospital internal control evaluation method based on multimodal data fusion, comprising the following features:

[0204] Data acquisition module: Acquires semantic analysis text, medical image sequences, and asset operation log streams generated by target assets within a continuous time window within the hospital;

[0205] Text analysis module: Performs dependency parsing on the semantically analyzed text, identifies text components, extracts asset identifiers, action type words, and numerical object words, and combines them into semantic feature triples;

[0206] Image analysis module: Performs inter-frame difference operation on the medical image sequence, calculates the absolute value of the gray level difference between two adjacent frames and marks the corresponding pixel positions as changed pixels, counts the total number of changed pixels and the centroid coordinates of the spatial distribution in each frame and constructs an image feature vector sequence;

[0207] Log analysis module: Performs inflection point detection operation on the asset operation log stream, traverses the numerical sequence in the log stream, records the inflection points of the values, and calculates the mean, maximum and minimum values of the values in each interval, taking the interval between adjacent inflection points as the unit, and forms a log feature sequence;

[0208] Multimodal matching module: Matches the operation time indication in the semantic feature triple with the timestamps of each frame in the image feature vector sequence, and at the same time matches it with the time boundaries of each interval in the log feature sequence to establish a multimodal feature table;

[0209] Difference calculation module: Reads a standard feature table that matches the target asset model from a preset standard feature library, compares the multimodal feature table with the standard feature table, calculates the absolute difference between the actual feature value and the standard feature value in each cell, and accumulates them to obtain the total difference score;

[0210] Risk assessment module: Based on the preset score range to which the total difference score belongs, output the risk level of the target asset and generate operation instructions for the target asset.
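
As an illustration of what the image analysis module computes for one pair of adjacent frames, the following sketch counts changed pixels and their centroid. It assumes frames are plain 2-D lists of gray levels of equal shape, uses a grayscale difference threshold as in claim 3, and reuses the previous centroid when no pixel changes; the function name and signature are hypothetical.

```python
def frame_difference_features(prev, curr, threshold, prev_centroid=(0.0, 0.0)):
    """Return (changed_pixel_count, centroid_row, centroid_col) for the
    difference between two adjacent frames."""
    count = row_sum = col_sum = 0
    for r, (p_row, c_row) in enumerate(zip(prev, curr)):
        for c, (p, q) in enumerate(zip(p_row, c_row)):
            # A pixel is "changed" when its absolute gray-level
            # difference exceeds the threshold.
            if abs(q - p) > threshold:
                count += 1
                row_sum += r
                col_sum += c
    if count == 0:
        # No changed pixels: keep the centroid of the previous difference image.
        return 0, prev_centroid[0], prev_centroid[1]
    return count, row_sum / count, col_sum / count

frame_a = [[0, 0, 0], [0, 0, 0]]
frame_b = [[0, 50, 0], [0, 0, 60]]
features = frame_difference_features(frame_a, frame_b, threshold=10)
```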

[0211] According to a third embodiment of the present invention, the present invention claims protection for a hospital internal control evaluation system based on multimodal data fusion, comprising:

[0212] One or more processors;

[0213] A memory that stores one or more programs, which, when executed by one or more processors, enable the one or more processors to implement the hospital internal control evaluation method based on multimodal data fusion.

[0214] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces, or indirect coupling or communication connection between apparatuses or units, and may be electrical, mechanical, or other forms.

[0215] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated units described above can be implemented in hardware or as software functional units. The above are merely embodiments of this application and do not limit the patent scope of this application. Any equivalent structural or procedural transformations made based on the description and drawings of this application, or direct or indirect applications in other related technical fields, are similarly included within the patent protection scope of this application.

[0216] The specific embodiments of the invention have been described in detail above, but they are only examples, and this application is not limited to the specific embodiments described above. For those skilled in the art, any equivalent modifications or substitutions to the invention are also within the scope of this application. Therefore, all equivalent changes, modifications, and improvements made without departing from the spirit and principles of this application should be covered within the scope of this application.

Claims

1. A hospital internal control evaluation method based on multi-modal data fusion, characterized in that it includes the following steps:
Step A1: Obtain semantic analysis text, medical image sequences, and asset operation log streams generated by the target assets within the hospital within a continuous time window;
Step A2: Perform dependency parsing on the semantic analysis text to identify text components, extract asset identifiers, action type words, and numerical object words, and combine them into semantic feature triples;
Step A3: Perform inter-frame difference operation on the medical image sequence, calculate the absolute value of the gray level difference between two adjacent frames and mark the corresponding pixel positions as changed pixels, count the total number of changed pixels and the centroid coordinates of the spatial distribution in each frame and construct an image feature vector sequence;
Step A4: Perform inflection point detection on the asset operation log stream, traverse the numerical sequence in the log stream, record the inflection points of the values, and calculate the mean, maximum and minimum values of the values in each interval, taking the interval between adjacent inflection points as the unit, and form a log feature sequence;
Step A5: Match the operation time indication in the semantic feature triple with the timestamps of each frame in the image feature vector sequence, and at the same time match it with the time boundaries of each interval in the log feature sequence to establish a multimodal feature table. Determine the start and end times of a unified timeline. The start time is the minimum of the earliest operation time indication in the semantic analysis text, the timestamp of the first frame image in the medical image sequence, and the time of the first sampling point in the asset operation log stream. The end time is the maximum of the three. The time range between the start and end times is divided into multiple consecutive time units. For each time unit, perform the following operations: Check if a semantic feature triple exists within the time range covered by the time unit. If it exists, convert the numerical object words in the triple into numerical values and fill them into the corresponding semantic feature column of the row. If it does not exist, copy the numerical values from the previous time unit where semantic features exist and fill them in. If a frame exists in the image feature vector sequence within the time range covered by the time unit, the total number of changed pixels, the centroid row coordinate, and the centroid column coordinate of that frame are filled into the three image feature columns of the row. If no such frame exists, the values are obtained by linear interpolation from the existing frames closest in time and then filled in. Check if there is an interval in the log feature sequence within the time range covered by the time unit. If so, fill the mean, maximum and minimum values of the interval into the three log feature columns corresponding to the row. If the time unit spans multiple interval boundaries, fill in the feature values of the interval with the largest time proportion;
Step A6: Read the standard feature table that matches the target asset model from the preset standard feature library, compare the multimodal feature table with the standard feature table, calculate the absolute difference between the actual feature value and the standard feature value in each cell, and sum them up to obtain the total difference score;
Step A7: Based on the preset score range to which the total difference score belongs, output the risk level of the target asset and generate an operation instruction for the target asset. The instruction type identifier of the operation instruction includes a routine inspection identifier, a parameter calibration identifier, and a shutdown inspection identifier.

2. The method of claim 1, wherein step A2 includes: dividing the semantic analysis text into multiple sentences at periods, and using a dependency parser to output the dependency arc structure of each sentence; finding the core predicate node in the dependency arc structure and, starting from the core predicate node, finding the subject node along the subject relation arc and extracting the word corresponding to the subject node as the asset identifier; finding the object node along the direct object relation arc and extracting the word corresponding to the object node as the numerical object word; taking the word corresponding to the core predicate node as the action type word; if a single sentence contains multiple subject relation arcs or multiple direct object relation arcs, extracting multiple asset identifiers or multiple numerical object words to form multiple semantic feature triples; and arranging the semantic feature triples generated from all sentences in the order in which the sentences appear in the text to form a sequence of semantic feature triples.

3. The method of claim 1, wherein step A3 further includes: The first frame of the medical image sequence is read as the reference frame. For the second frame and each subsequent frame, the grayscale value of the previous frame is subtracted from that of the current frame at the same pixel coordinate position to obtain the difference image. Set a grayscale difference threshold, traverse all pixels in the difference image, mark pixels with an absolute grayscale difference greater than the threshold as changed pixels, and mark pixels with an absolute grayscale difference less than or equal to the threshold as unchanged pixels. The total number of changed pixels in each frame of the difference image is counted and used as the first feature element of that frame. For each frame of the difference image, the sum of the row coordinates of all the changed pixels is divided by the total number of changed pixels to obtain the centroid row coordinate, and the sum of the column coordinates of all the changed pixels is divided by the total number of changed pixels to obtain the centroid column coordinate. The centroid row coordinate and centroid column coordinate are used as the second and third feature elements of the frame. The total number of changed pixels, the centroid row coordinate, and the centroid column coordinate corresponding to each frame are arranged in frame order to form a sequence of three-dimensional image feature vectors. When the total number of changed pixels in a frame of the difference image is zero, the centroid row coordinate and centroid column coordinate of that frame are set to be the same as the centroid coordinates of the previous frame of the difference image.

4. The method of claim 1, wherein step A4 further comprises: obtaining a numerical sequence from the asset operation log stream, the numerical sequence containing value points sampled at equal time intervals; starting from the second value point, calculating the difference between each value point and its previous value point to obtain a difference sequence; for each difference in the difference sequence, recording the sign of the difference as positive, negative, or zero; traversing the difference sequence and, when the signs of two adjacent differences change, marking the value point corresponding to the sign change position as an inflection point, specifically: if the current difference is positive and the next difference is negative, the value point corresponding to the next difference is a peak inflection point; if the current difference is negative and the next difference is positive, the value point corresponding to the next difference is a valley inflection point; if a difference is zero and the signs of the differences on either side of it are different, the first value point in the zero-value interval is marked as the inflection point; arranging all inflection points in chronological order and using them, together with the first and last value points of the numerical sequence, as dividing points to divide the numerical sequence into multiple consecutive intervals, each interval lying between two adjacent dividing points; for each interval, taking the values of all value points in the interval, calculating their arithmetic mean as the mean of the interval, and finding the largest and smallest of these values as the maximum and minimum of the interval; and forming a sub-tuple from the mean, maximum, and minimum of each interval, and arranging all sub-tuples in the order of the start times of their intervals to form a log feature sequence.
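
The inflection point detection and interval statistics of claim 4 can be sketched as follows. The claim wording leaves two points open, and this sketch fills them with assumptions: the marked inflection point is taken to be the value point shared by the two differences whose signs differ (the local peak or valley), and each interval includes both of its dividing points. Function names are illustrative.

```python
def sign(x):
    return (x > 0) - (x < 0)


def inflection_indices(values):
    """Indices of peak and valley points, found by sign changes in the first differences."""
    diffs = [b - a for a, b in zip(values, values[1:])]  # diffs[i] = values[i+1] - values[i]
    points = []
    last_sign = 0      # sign of the most recent non-zero difference
    flat_start = None  # index of the first point of the current zero-difference run
    for i, d in enumerate(diffs):
        s = sign(d)
        if s == 0:
            if flat_start is None:
                flat_start = i
            continue
        if last_sign != 0 and s != last_sign:
            # A flat run between opposite signs is marked at its first point.
            points.append(flat_start if flat_start is not None else i)
        last_sign = s
        flat_start = None
    return points


def log_feature_sequence(values):
    """Return one (mean, maximum, minimum) sub-tuple per interval, in time order."""
    if len(values) < 2:
        return []
    cuts = sorted({0, len(values) - 1, *inflection_indices(values)})
    features = []
    for a, b in zip(cuts, cuts[1:]):
        segment = values[a:b + 1]  # assumption: both dividing points belong to the interval
        features.append((sum(segment) / len(segment), max(segment), min(segment)))
    return features


readings = [1, 3, 5, 4, 3, 3, 6]
log_features = log_feature_sequence(readings)
```

For `readings`, the peak is at index 2 and the valley at index 4 (the first point of the flat run), which gives three intervals.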

5. The method of claim 1, wherein step A6 further comprises: reading, from the preset standard feature library, the standard feature table corresponding to a standard asset of the same model as the target asset; for each row in the multimodal feature table, extracting the actual feature value in each cell in sequence; for each cell, calculating the absolute difference between the actual feature value and the standard feature value at the same row and column number in the standard feature table; for the semantic feature column, multiplying the absolute difference by the first weighting coefficient to obtain the weighted difference of the cell; for the total number of changed pixels column among the image feature columns, multiplying the absolute difference by the second weighting coefficient to obtain the weighted difference; for the centroid row coordinate column and the centroid column coordinate column among the image feature columns, multiplying the absolute difference by the third weighting coefficient and the fourth weighting coefficient, respectively, to obtain the weighted difference; for the mean, maximum, and minimum value columns among the log feature columns, multiplying the absolute difference by the fifth, sixth, and seventh weighting coefficients, respectively, to obtain the weighted difference; and summing the weighted differences of all cells in the multimodal feature table to obtain the total difference score.
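
The weighted accumulation of claim 5 amounts to a sum over all cells of weight times absolute difference. The sketch below assumes each table row is a dictionary, that the semantic feature has already been encoded as a number, and that the column names and weight values are hypothetical; the claim specifies none of these.

```python
# Hypothetical column names, listed in the order of the first to seventh weighting coefficients.
COLUMNS = (
    "semantic", "changed_pixels", "centroid_row", "centroid_col",
    "log_mean", "log_max", "log_min",
)


def total_difference_score(actual_rows, standard_rows, weights):
    """Sum weight * |actual - standard| over every cell of the two tables."""
    if len(actual_rows) != len(standard_rows):
        raise ValueError("tables must have the same number of rows")
    total = 0.0
    for actual, standard in zip(actual_rows, standard_rows):
        for column in COLUMNS:
            total += weights[column] * abs(actual[column] - standard[column])
    return total


weights = {"semantic": 2.0, "changed_pixels": 0.5, "centroid_row": 1.0,
           "centroid_col": 1.0, "log_mean": 1.0, "log_max": 1.0, "log_min": 1.0}
actual = [{"semantic": 2, "changed_pixels": 120, "centroid_row": 10.0,
           "centroid_col": 12.0, "log_mean": 5.0, "log_max": 8.0, "log_min": 1.0}]
standard = [{"semantic": 1, "changed_pixels": 100, "centroid_row": 10.0,
             "centroid_col": 14.0, "log_mean": 4.0, "log_max": 8.0, "log_min": 2.0}]
score = total_difference_score(actual, standard, weights)
```

With these illustrative numbers the cell contributions are 2, 10, 0, 2, 1, 0 and 1.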

6. The method of claim 1, wherein the preset score range in step A7 includes a low-risk range, a medium-risk range, and a high-risk range; the lower limit of the low-risk range is zero and its upper limit is the first limit value; the lower limit of the medium-risk range is equal to the first limit value and its upper limit is the second limit value; the lower limit of the high-risk range is equal to the second limit value and its upper limit is positive infinity; when the total difference score falls into the low-risk range, the risk level of the target asset is output as low risk, and the instruction type identifier in the generated operation instruction is a routine inspection identifier; when the total difference score falls into the medium-risk range, the risk level of the target asset is output as medium risk, and the instruction type identifier in the generated operation instruction is a parameter calibration identifier; when the total difference score falls into the high-risk range, the risk level of the target asset is output as high risk, and the instruction type identifier in the generated operation instruction is a shutdown check identifier.
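
The range lookup of claim 6 can be sketched as a pair of comparisons. Because adjacent ranges share their limit values, the sketch assumes half-open ranges (a score equal to a limit value falls into the higher range); the identifier strings are hypothetical placeholders.

```python
def assess_risk(score, first_limit, second_limit):
    """Map a total difference score to (risk level, instruction type identifier)."""
    if not 0 <= first_limit < second_limit:
        raise ValueError("limits must satisfy 0 <= first_limit < second_limit")
    if score < 0:
        raise ValueError("the total difference score cannot be negative")
    # Assumption: ranges are half-open, [lower, upper).
    if score < first_limit:
        return ("low", "ROUTINE_INSPECTION")
    if score < second_limit:
        return ("medium", "PARAMETER_CALIBRATION")
    return ("high", "SHUTDOWN_CHECK")


level, instruction = assess_risk(16.0, first_limit=10.0, second_limit=30.0)
```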

7. The method of claim 1, wherein, after step A5 is completed and before step A6 begins, the method further comprises a multimodal data consistency check step: for each time unit on the unified time axis, reading from the multimodal feature table the value in the semantic feature column, the total number of changed pixels in the image feature column, and the mean value in the log feature column corresponding to that time unit; determining whether a preset numerical correspondence exists between the value in the semantic feature column and the total number of changed pixels in the image feature column at that time unit, the numerical correspondence being defined as follows: when the value in the semantic feature column belongs to the first value range, the total number of changed pixels should belong to the first number range; when the value in the semantic feature column belongs to the second value range, the total number of changed pixels should belong to the second number range; otherwise, the time unit is marked as a semantic-image inconsistency; determining whether a preset numerical correspondence exists between the value in the semantic feature column and the mean value in the log feature column at that time unit, the numerical correspondence being defined as follows: when the value in the semantic feature column belongs to the first value range, the mean value in the log feature column should belong to the first mean value range; when the value in the semantic feature column belongs to the second value range, the mean value in the log feature column should belong to the second mean value range; otherwise, the time unit is marked as a semantic-log inconsistency; if both a semantic-image inconsistency and a semantic-log inconsistency occur in the same time unit, marking that time unit as a severely inconsistent time unit; counting the number of severely inconsistent time units; and if this number, as a proportion of the total number of time units on the unified time axis, exceeds the preset proportion, stopping steps A6 and A7 and generating a maintenance operation instruction containing a data acquisition review instruction, the instruction type identifier of the maintenance operation instruction being a re-acquisition identifier.
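
The consistency check of claim 7 can be sketched as follows. Ranges are assumed to be closed `(low, high)` pairs, a semantic value that belongs to neither value range is treated as inconsistent (one reading of "otherwise"), and the row keys and function names are hypothetical.

```python
def in_range(value, bounds):
    low, high = bounds
    return low <= value <= high


def is_consistent(semantic, observed, correspondences):
    """correspondences: list of (semantic value range, expected range of the other modality)."""
    return any(in_range(semantic, s) and in_range(observed, e) for s, e in correspondences)


def count_severe(rows, image_pairs, log_pairs):
    """Count time units that are both semantic-image and semantic-log inconsistent."""
    return sum(
        1 for row in rows
        if not is_consistent(row["semantic"], row["changed_pixels"], image_pairs)
        and not is_consistent(row["semantic"], row["log_mean"], log_pairs)
    )


def needs_reacquisition(rows, image_pairs, log_pairs, max_proportion):
    """True when steps A6 and A7 should be skipped in favour of a re-acquisition instruction."""
    return count_severe(rows, image_pairs, log_pairs) > max_proportion * len(rows)


image_pairs = [((0, 1), (0, 100)), ((2, 3), (101, 1000))]
log_pairs = [((0, 1), (0, 5)), ((2, 3), (6, 20))]
rows = [
    {"semantic": 1, "changed_pixels": 50, "log_mean": 3},    # consistent in both checks
    {"semantic": 2, "changed_pixels": 50, "log_mean": 3},    # inconsistent in both: severe
    {"semantic": 2, "changed_pixels": 500, "log_mean": 3},   # semantic-log inconsistency only
    {"semantic": 1, "changed_pixels": 500, "log_mean": 10},  # inconsistent in both: severe
]
severe = count_severe(rows, image_pairs, log_pairs)
```

Two of the four example time units are severely inconsistent, so a preset proportion of 0.25 would trigger re-acquisition while 0.5 would not.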

8. A hospital internal control evaluation device based on multi-modal data fusion, characterized in that the device is configured to perform the hospital internal control evaluation method based on multimodal data fusion according to any one of claims 1-7, and comprises: a data acquisition module, which acquires the semantic analysis text, medical image sequences, and asset operation log streams generated by a target asset within a continuous time window in the hospital; a text analysis module, which performs dependency parsing on the semantic analysis text, identifies text components, extracts asset identifiers, action type words, and numerical object words, and combines them into semantic feature triples; an image analysis module, which performs an inter-frame difference operation on the medical image sequence, calculates the absolute value of the grayscale difference between two adjacent frames and marks the corresponding pixel positions as changed pixels, counts the total number of changed pixels and the centroid coordinates of their spatial distribution in each frame, and constructs an image feature vector sequence; a log analysis module, which performs an inflection point detection operation on the asset operation log stream, traverses the numerical sequence in the log stream, records the inflection points of the values, calculates the mean, maximum, and minimum of the values in each interval, taking the interval between adjacent inflection points as the unit, and forms a log feature sequence; a multimodal matching module, which matches the operation time indication in the semantic feature triple with the timestamp of each frame in the image feature vector sequence and with the time boundaries of each interval in the log feature sequence to establish a multimodal feature table; a difference calculation module, which reads a standard feature table matching the target asset model from a preset standard feature library, compares the multimodal feature table with the standard feature table, calculates the absolute difference between the actual feature value and the standard feature value in each cell, and accumulates the differences to obtain the total difference score; and a risk assessment module, which, based on the preset score range to which the total difference score belongs, outputs the risk level of the target asset and generates operation instructions for the target asset.
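
The time matching performed by the multimodal matching module can be sketched as follows. The matching rule is an assumption, since the claim does not define it: each semantic event is paired with the latest frame at or before its operation time and with the first log interval whose time boundaries contain that time, and events lacking either match are skipped. All names and tuple layouts are hypothetical.

```python
from bisect import bisect_right


def build_feature_table(events, frames, intervals):
    """
    events:    (time, semantic_value) tuples, sorted by time
    frames:    (timestamp, changed_pixels, centroid_row, centroid_col) tuples, sorted by timestamp
    intervals: (start, end, mean, maximum, minimum) tuples, sorted by start
    """
    frame_times = [frame[0] for frame in frames]
    table = []
    for time, semantic in events:
        i = bisect_right(frame_times, time) - 1  # latest frame at or before `time`
        interval = next((iv for iv in intervals if iv[0] <= time <= iv[1]), None)
        if i < 0 or interval is None:
            continue  # assumption: rows missing a modality are left out of the table
        _, count, c_row, c_col = frames[i]
        _, _, mean, maximum, minimum = interval
        table.append({
            "time": time, "semantic": semantic,
            "changed_pixels": count, "centroid_row": c_row, "centroid_col": c_col,
            "log_mean": mean, "log_max": maximum, "log_min": minimum,
        })
    return table


events = [(5, 2), (12, 1), (100, 3)]
frames = [(0, 4, 1.0, 1.0), (10, 9, 2.0, 3.0)]
intervals = [(0, 8, 3.0, 5, 1), (8, 20, 4.0, 6, 3)]
table = build_feature_table(events, frames, intervals)
```

The event at time 100 falls outside every log interval and therefore produces no row.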

9. A hospital internal control evaluation system based on multi-modal data fusion, characterized by comprising: one or more processors; and a memory storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the hospital internal control evaluation method based on multimodal data fusion according to any one of claims 1 to 7.