Screen pixel recognition-based non-invasive mimicry interaction method and system for legacy systems

By constructing a knowledge graph of interface elements and optimizing interactive hot zones in real time, the invention addresses the poor adaptability of legacy system interaction technology to dynamic interface changes. It enables human-like operation with resource-aware timing control and ensures the stability and reliability of interaction with legacy systems.

CN121807435B · Active · Publication Date: 2026-06-19 · BEIJING YIZHUANG INTELLIGENT CITY RES INST GRP CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING YIZHUANG INTELLIGENT CITY RES INST GRP CO LTD
Filing Date
2026-03-06
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing legacy system interaction technologies lack the ability to adapt to dynamic changes in the interface, making it difficult to simulate the operation behavior of real users. They are easily identified as non-human operations and cannot dynamically adjust the operation speed and frequency according to the system resource status, resulting in overload of target system resources or operation response timeouts.

Method used

By collecting screen pixel stream data from legacy systems, a knowledge graph of interface elements is constructed, the coordinate range of interactive hot zones is extracted, and real-time optimization is performed based on pixel difference values. Hot zone positioning parameters and timing control parameters are configured to generate human-like operation sequences, monitor resource usage in real time, and dynamically adjust operation timing.

Benefits of technology

It enables non-intrusive interaction with legacy systems, adapts to dynamic interface changes, reduces the risk of being identified as an automation tool, prevents system resource overload, and ensures the stability and reliability of the interaction process.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention provides a non-intrusive, human-like interaction method and system for legacy systems based on screen pixel recognition, belonging to the field of human-computer interaction technology. The method includes: collecting screen pixel stream data to extract interactive element features, constructing an interface element knowledge graph and storing hotspot coordinates; continuously collecting pixel stream data, calculating difference values, and dynamically matching and updating hotspots; generating operation instructions based on hotspot coordinates and timing parameters; and adjusting control parameters in real time based on resource usage to output the optimal operation sequence. This invention does not require modification of the legacy system's source code, enabling stable and efficient human-like interaction, improving automation efficiency, and reducing system resource consumption.

Description

Technical Field

[0001] This invention relates to the field of human-computer interaction technology, and in particular to a non-intrusive mimicry interaction method and system for legacy systems based on screen pixel recognition.

Background Technology

[0002] With the rapid development of information technology, many enterprises and organizations are still using outdated legacy systems. These systems are often retained for their stability and specific functionalities, but they are typically built on earlier technologies, have simple interfaces, and lack modern APIs. Traditional methods for automating interaction with legacy systems rely on techniques such as script recording and playback, OCR text recognition, and image template matching to simulate user actions. However, since legacy systems often lack programming interfaces, automation requires pixel-level recognition.

[0003] However, existing legacy system interaction technologies still suffer from several problems. They adapt poorly to dynamic changes in the interface. They have difficulty simulating the operating behavior patterns of real users: operations are overly mechanical and lack natural variation in rhythm and timing, so they are easily identified as non-human operations and are rejected or trigger security mechanisms. They also lack perception of system resource status and an adaptive adjustment mechanism, so they cannot dynamically adjust the operation speed and frequency according to the actual load of the legacy system, which can easily lead to overload of target system resources or operation response timeouts.

Summary of the Invention

[0004] This invention provides a non-intrusive mimicry interaction method and system for legacy systems based on screen pixel recognition, which can solve at least some of the problems existing in the prior art.

[0005] A first aspect of this invention provides a non-intrusive mimicry interaction method for legacy systems based on screen pixel recognition, comprising:

[0006] Collect screen pixel stream data from legacy systems and extract spatial features and operation behavior sequences of interface interaction elements. Construct an interface element knowledge graph based on the operation behavior sequence and the spatial features and store the coordinate range of interaction hotspots. Generate anthropomorphic operation sequences based on the interface element knowledge graph and task requirements and configure hotspot positioning parameters and timing control parameters.

[0007] Continuously collect screen pixel stream data and calculate pixel difference values. If the pixel difference value exceeds a preset change threshold, extract the pixel features of the change area and perform similarity matching with the knowledge graph of interface elements to obtain the current spatial features and calculate the position offset. Based on the position offset, perform geometric transformation on the coordinate range of the interactive hot zone to update the optimized hot zone coordinate range.

[0008] The optimized hot zone coordinate range is synchronized to the anthropomorphic operation sequence to obtain the optimized operation sequence. The optimized hot zone coordinate range is sampled according to the hot zone positioning parameters to obtain the target coordinates, and the operation duration is determined in combination with the timing control parameters. An initial operation instruction is generated and executed based on the target coordinates and the operation duration. Resource occupancy indicators are collected in real time. If a resource occupancy indicator exceeds the preset occupancy threshold, the over-limit amplitude is calculated and the timing control parameters in the optimized operation sequence are corrected to obtain the corrected control parameters.

[0009] The duration of subsequent operations is determined based on the corrected control parameters, and the optimal operation instruction sequence is obtained by solving for it. The optimal operation instruction sequence is then output and executed.

[0010] In one alternative implementation,

[0011] Collect screen pixel stream data from legacy systems and extract spatial features and operation behavior sequences of interface interaction elements. Based on the operation behavior sequences and spatial features, construct an interface element knowledge graph and store the coordinate range of interactive hot zones, including:

[0012] The screen pixel stream data is collected frame by frame and converted into a pixel matrix. Multi-scale convolution feature extraction is performed on the pixel matrix to obtain a feature map. The feature map is semantically segmented to identify the boundary contours of the interface interactive elements. Morphological analysis is performed on the boundary contours to extract geometric topological features. Based on the geometric topological features and the pixel texture features corresponding to the boundary contours, pattern recognition is performed to determine the functional semantic type. The geometric spatial features of the interface interactive elements are determined by combining the position coordinates and size information corresponding to the boundary contours.

[0013] Collect the user's operation input sequence on the interface interaction elements and extract the operation coordinates and operation timestamps. Perform spatial association matching between the operation coordinates and the position coordinates in the geometric space features to determine the operation target element. Extract the time interval between adjacent operations based on the operation timestamps. Organize the operation target elements according to the time intervals to obtain the operation behavior sequence.

[0014] Directed edges are established between adjacent operation target elements in the operation behavior sequence, and frequency statistics are performed to obtain the operation transition probability. Based on the operation transition probability, an operation path topology graph is constructed and mapped and boundary expansion is performed with the interface interaction elements in the geometric space features to obtain the coordinate range of the interaction hot zone. The interface element knowledge graph is constructed by combining the functional semantic type and the operation transition probability.

[0015] In one alternative implementation,

[0016] Based on the interface element knowledge graph and task requirements, a human-like operation sequence is generated and hotspot positioning parameters and timing control parameters are configured, including:

[0017] Extract the pre-set task requirements, perform semantic parsing to obtain the task target description, and convert it into a functional semantic type sequence. Calculate the semantic similarity between the functional semantic type sequence and the functional semantic types in the interface element knowledge graph and determine the candidate operation node set. Combine the pre-acquired operation transition probabilities to perform state transition reasoning to obtain the selection priority of each candidate operation node. Sort the nodes in the candidate operation node set according to the selection priority to obtain the initial anthropomorphic operation sequence.

[0018] The coordinate range of the interactive hot zone is extracted from the knowledge graph of interface elements and the center coordinate is calculated. The boundary of the interactive hot zone coordinate range is extracted to obtain the boundary coordinate set. The mean vector and covariance matrix of the coordinate distribution are calculated based on the center coordinate and the boundary coordinate set. The probability density of the coordinate points within the interactive hot zone coordinate range is calculated to obtain the coordinate density distribution. The hot zone positioning parameters are determined by combining the mean vector and the covariance matrix.
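As a non-limiting illustration of the hot zone positioning parameters described above, the following Python sketch computes a mean vector and a covariance matrix from a center coordinate and a boundary coordinate set. The function name `hot_zone_positioning_params`, the equal weighting of the center and boundary points, and the use of the population covariance are assumptions made for the example and are not specified in the text.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Matrix2 = Tuple[Tuple[float, float], Tuple[float, float]]


def hot_zone_positioning_params(center: Point,
                                boundary: List[Point]) -> Tuple[Point, Matrix2]:
    """Return (mean_vector, covariance_matrix) for one hot zone.

    Assumption: the center and every boundary point are pooled with equal
    weight, and the population (divide-by-n) covariance is used.
    """
    pts = [center] + list(boundary)
    n = len(pts)
    mx = sum(p[0] for p in pts) / n
    my = sum(p[1] for p in pts) / n
    sxx = sum((p[0] - mx) ** 2 for p in pts) / n
    syy = sum((p[1] - my) ** 2 for p in pts) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pts) / n
    return (mx, my), ((sxx, sxy), (sxy, syy))


# Hypothetical 120 x 40 pixel hot zone; its four corners form the boundary set.
corners = [(0.0, 0.0), (120.0, 0.0), (120.0, 40.0), (0.0, 40.0)]
mean, cov = hot_zone_positioning_params((60.0, 20.0), corners)
# mean is (60.0, 20.0); the covariance is diagonal because the zone is
# axis-aligned and symmetric about its center.
```

A wider zone yields a larger variance along the x axis, so sampled click positions spread further horizontally than vertically.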

[0019] Extract the time intervals corresponding to the initial anthropomorphic operation sequence and perform statistical analysis to obtain the expected duration and the duration standard deviation. Perform a logarithmic transformation on the expected duration to obtain the logarithmic expected value, and calculate the distribution parameters by combining it with the duration standard deviation. Initialize the random disturbance factor and combine it with the distribution parameters to obtain the timing control parameters. Configure the hot zone positioning parameters and the timing control parameters into the initial anthropomorphic operation sequence to obtain the anthropomorphic operation sequence.
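The text does not give the formula that turns the expected duration and the duration standard deviation into distribution parameters. One plausible reading, shown below as an assumption, is moment matching for a log-normal distribution: choose the parameters so that the distribution reproduces the observed mean and standard deviation. The function name `lognormal_params` and the sample values are hypothetical.

```python
import math
from typing import Tuple


def lognormal_params(expected: float, std: float) -> Tuple[float, float]:
    """Moment-matched (mu, sigma) of a log-normal distribution.

    Assumption: 'expected' and 'std' are the mean and standard deviation of
    the observed inter-operation intervals, in seconds.
    """
    sigma_sq = math.log(1.0 + (std / expected) ** 2)
    mu = math.log(expected) - sigma_sq / 2.0
    return mu, math.sqrt(sigma_sq)


# Hypothetical statistics: intervals average 0.8 s with a 0.3 s deviation.
mu, sigma = lognormal_params(expected=0.8, std=0.3)

# Sanity check: the log-normal mean and variance implied by (mu, sigma)
# reproduce the inputs.
implied_mean = math.exp(mu + sigma ** 2 / 2.0)
implied_var = (math.exp(sigma ** 2) - 1.0) * math.exp(2.0 * mu + sigma ** 2)
```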

[0020] In one alternative implementation,

[0021] Continuously collect screen pixel stream data and calculate pixel difference values. If the pixel difference value exceeds a preset change threshold, extract pixel features of the changed area and perform similarity matching with the interface element knowledge graph to obtain the current spatial features and calculate the position offset. Based on the position offset, perform geometric transformation on the coordinate range of the interactive hotspot to update the optimized hotspot coordinate range, including:

[0022] The screen pixel stream data is continuously captured frame by frame to obtain the current pixel frame and historical pixel frames, and the pixel difference is calculated to obtain a pixel difference matrix. The absolute values of the entries of the pixel difference matrix are summed to obtain the pixel difference value, which is then compared with a preset change threshold. If the pixel difference value exceeds the preset change threshold, the pixel difference matrix is binarized to obtain a change mask. Based on the change mask, the current pixel frame is segmented and convolutional features are extracted to obtain the pixel features of the change region. The pixel features of the change region are compared with the spatial features stored in the interface element knowledge graph to calculate the feature vector similarity and obtain a similarity score. Based on the similarity score, the maximum matching selection is performed to obtain the current spatial feature.
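The frame differencing step can be illustrated with a minimal Python sketch on small grayscale frames. The threshold values `CHANGE_THRESHOLD` and `PIXEL_THRESHOLD`, the function names, and the use of single-channel frames are assumptions chosen to keep the example short; the text does not state them.

```python
from typing import List, Optional

Frame = List[List[int]]  # grayscale intensities 0..255, indexed [row][col]


def pixel_difference(current: Frame, previous: Frame) -> Frame:
    """Element-wise difference between two frames of equal size."""
    return [[c - p for c, p in zip(cur_row, prev_row)]
            for cur_row, prev_row in zip(current, previous)]


def difference_value(diff: Frame) -> int:
    """Sum of absolute differences over the whole matrix."""
    return sum(abs(v) for row in diff for v in row)


def change_mask(diff: Frame, pixel_threshold: int) -> Frame:
    """Binarize the difference matrix: 1 marks a changed pixel."""
    return [[1 if abs(v) > pixel_threshold else 0 for v in row]
            for row in diff]


CHANGE_THRESHOLD = 100  # hypothetical frame-level change threshold
PIXEL_THRESHOLD = 30    # hypothetical per-pixel binarization threshold

previous = [[10, 10, 10], [10, 10, 10]]
current = [[10, 200, 10], [10, 190, 12]]

diff = pixel_difference(current, previous)
value = difference_value(diff)  # 190 + 180 + 2 = 372
mask: Optional[Frame] = (change_mask(diff, PIXEL_THRESHOLD)
                         if value > CHANGE_THRESHOLD else None)
```

Here only the middle column changes enough to be marked in the mask; the 2-unit flicker in the last pixel contributes to the difference value but is filtered out by binarization.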

[0023] The system extracts the current location coordinates from the current spatial features and extracts the historical location coordinates corresponding to the current spatial features from the interface element knowledge graph. It calculates the vector difference between the current location coordinates and the historical location coordinates to obtain the location offset. It applies the location offset to the interactive hot zone coordinate range stored in the interface element knowledge graph to perform a translation transformation and obtain the candidate hot zone coordinate range. It then verifies the spatial inclusion relationship between the candidate hot zone coordinate range and the location coordinates in the current spatial features, and adjusts the boundary of the candidate hot zone coordinate range according to the verification result to obtain the optimized hot zone coordinate range.
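A minimal sketch of the translation step follows, assuming axis-aligned rectangular hot zones. The `Rect` type, the function `update_hot_zone`, and the fallback of re-centering the zone on the element when the containment check fails are illustrative assumptions; the text only states that the boundary is adjusted according to the verification result.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Rect:
    """Axis-aligned hot zone: top-left corner plus width and height."""
    x: float
    y: float
    w: float
    h: float

    def translated(self, dx: float, dy: float) -> "Rect":
        return Rect(self.x + dx, self.y + dy, self.w, self.h)

    def contains(self, px: float, py: float) -> bool:
        return (self.x <= px <= self.x + self.w
                and self.y <= py <= self.y + self.h)


def update_hot_zone(hot_zone: Rect,
                    historical_pos: Tuple[float, float],
                    current_pos: Tuple[float, float]) -> Rect:
    """Shift the stored hot zone by the element's position offset."""
    dx = current_pos[0] - historical_pos[0]
    dy = current_pos[1] - historical_pos[1]
    candidate = hot_zone.translated(dx, dy)
    if not candidate.contains(*current_pos):
        # Assumed boundary adjustment: re-center the zone on the element.
        candidate = Rect(current_pos[0] - candidate.w / 2,
                         current_pos[1] - candidate.h / 2,
                         candidate.w, candidate.h)
    return candidate


# Hypothetical case: a button moved 30 pixels down after a layout change.
optimized = update_hot_zone(Rect(100, 50, 120, 40), (160, 70), (160, 100))
```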

[0024] In one alternative implementation,

[0025] The optimized hot zone coordinate range is synchronized to the anthropomorphic operation sequence to obtain the optimized operation sequence. The optimized hot zone coordinate range is sampled according to the hot zone positioning parameters to obtain the target coordinates, and the operation duration is determined by combining the timing control parameters. This includes:

[0026] Based on the optimized hot zone coordinate range, the interaction hot zone coordinate range of each operation node in the anthropomorphic operation sequence is replaced and updated to obtain the updated operation node sequence and perform integrity verification. If all operation nodes have completed the hot zone coordinate range update, the updated operation node sequence is output as the optimized operation sequence.

[0027] Extract hot zone positioning parameters from the optimized operation sequence and determine the mean vector, covariance matrix and corresponding sampling weights. Construct a multivariate normal distribution function based on the mean vector and covariance matrix and perform probability sampling according to the sampling weights to obtain a candidate coordinate set. Determine the spatial inclusion relationship between the candidate coordinate set and the optimized hot zone coordinate range, and select the candidate coordinates located within the optimized hot zone coordinate range as the target coordinates.
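The sampling step can be sketched as rejection sampling from a bivariate normal distribution. The example below is illustrative only: it omits the sampling weights mentioned above, assumes a positive-definite 2 x 2 covariance matrix, and falls back to the mean when no candidate lands inside the zone, none of which is prescribed by the text. The function name `sample_target` and the numeric values are hypothetical.

```python
import math
import random
from typing import Tuple

Point = Tuple[float, float]
Zone = Tuple[float, float, float, float]  # x_min, y_min, x_max, y_max


def sample_target(mean: Point,
                  cov: Tuple[Tuple[float, float], Tuple[float, float]],
                  zone: Zone,
                  rng: random.Random,
                  max_tries: int = 1000) -> Point:
    """Draw a click coordinate inside the hot zone."""
    sxx, sxy = cov[0]
    syy = cov[1][1]
    # Cholesky factor of the 2 x 2 covariance matrix.
    l11 = math.sqrt(sxx)
    l21 = sxy / l11
    l22 = math.sqrt(syy - l21 * l21)
    for _ in range(max_tries):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x = mean[0] + l11 * z1
        y = mean[1] + l21 * z1 + l22 * z2
        # Keep only candidates that satisfy the spatial inclusion check.
        if zone[0] <= x <= zone[2] and zone[1] <= y <= zone[3]:
            return x, y
    return mean  # assumed fallback when every candidate is rejected


rng = random.Random(7)  # seeded only to make the example repeatable
target = sample_target((60.0, 20.0), ((400.0, 0.0), (0.0, 49.0)),
                       (0.0, 0.0, 120.0, 40.0), rng)
```

Because candidates cluster around the mean, most clicks land near the middle of the element, with occasional off-center hits, which is closer to human pointing behavior than always clicking the exact center.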

[0028] The timing control parameters are extracted from the optimized operation sequence and sampled according to a log-normal distribution to obtain the reference duration. The duration perturbation is initialized with a uniform distribution and the reference duration is multiplied with the duration perturbation to obtain the operation duration.
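As an illustration, the duration computation can be written with the Python standard library, which provides `random.lognormvariate` and `random.uniform`. The multiplicative perturbation range `[1 - jitter, 1 + jitter]` and the parameter values are assumptions; the text states only that the perturbation is initialized with a uniform distribution and multiplied with the reference duration.

```python
import random


def operation_duration(mu: float, sigma: float, jitter: float,
                       rng: random.Random) -> float:
    """Reference duration (log-normal) times a uniform perturbation."""
    reference = rng.lognormvariate(mu, sigma)
    perturbation = rng.uniform(1.0 - jitter, 1.0 + jitter)
    return reference * perturbation


rng = random.Random(42)  # seeded only to make the example repeatable
# Hypothetical timing control parameters for one operation node.
durations = [operation_duration(mu=-0.3, sigma=0.35, jitter=0.1, rng=rng)
             for _ in range(5)]
```

A log-normal reference duration is always positive and right-skewed, so short pauses are common and long pauses are rare but possible.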

[0029] In one alternative implementation,

[0030] Based on the target coordinates and operation duration, an initial operation command is generated and executed. Resource occupancy indicators are collected in real time. If the resource occupancy exceeds a preset threshold, the over-limit magnitude is calculated, and the timing control parameters in the optimized operation sequence are corrected to obtain the corrected control parameters, including:

[0031] The target coordinates are converted into absolute coordinates in the screen coordinate system. An operation instruction data structure is constructed based on the absolute coordinates and the operation duration, and the initial operation instruction is obtained by encoding and encapsulation. The initial operation instruction is sent to the input interface of the legacy system and triggered to execute. The legacy system is monitored for resources to obtain resource occupancy indicators and compared with preset occupancy thresholds. If the resource occupancy indicators exceed the preset occupancy thresholds, it is determined that the resource occupancy is abnormal.

[0032] If abnormal resource usage exists, the difference between the resource usage index and the preset usage threshold is calculated to obtain the absolute value of the excess and the excess amplitude is calculated. The timing control parameters and the historical execution records of the corresponding operation nodes in the optimized operation sequence are extracted. The resource usage index in the historical execution records is subjected to timing analysis to obtain the resource usage trend and is linearly regressed with the excess amplitude to obtain the predicted resource load value. The duration adjustment factor is calculated based on the predicted resource load value and combined with the pre-acquired baseline duration to obtain the corrected baseline duration. The variance analysis of the pre-acquired duration disturbance is performed to obtain the disturbance fluctuation range and combined with the excess amplitude to perform shrinkage calculation to obtain the corrected disturbance fluctuation range and generate the corrected duration disturbance. The corrected duration disturbance is combined with the corrected baseline duration to obtain the corrected control parameters.
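The text does not give formulas for the over-limit amplitude, the load prediction, or the shrinkage calculation. The sketch below is therefore a simplified reading under stated assumptions: the amplitude is the excess relative to the threshold, the predicted load is a least-squares linear extrapolation of recent usage, the baseline duration is stretched by the ratio of predicted load to threshold, and the perturbation range is shrunk by a factor of `1 + amplitude`. All function names and values are hypothetical.

```python
from typing import List, Tuple


def over_limit_amplitude(usage: float, threshold: float) -> float:
    """Excess over the threshold, relative to the threshold (assumed form)."""
    return max(0.0, usage - threshold) / threshold


def predict_next(history: List[float]) -> float:
    """Least-squares linear extrapolation one step past the history."""
    n = len(history)
    mean_x = (n - 1) / 2.0
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    if sxx == 0:
        return mean_y
    slope = sum((x - mean_x) * (y - mean_y)
                for x, y in enumerate(history)) / sxx
    return mean_y + slope * (n - mean_x)


def correct_timing(baseline: float, jitter: float,
                   history: List[float],
                   threshold: float) -> Tuple[float, float]:
    """Return (corrected baseline duration, corrected perturbation range)."""
    amplitude = over_limit_amplitude(history[-1], threshold)
    predicted = predict_next(history)
    factor = max(1.0, predicted / threshold)  # never speed up under load
    return baseline * factor, jitter / (1.0 + amplitude)


# Hypothetical CPU usage samples rising past an 80 % threshold.
new_baseline, new_jitter = correct_timing(
    baseline=0.8, jitter=0.18, history=[0.70, 0.80, 0.90], threshold=0.80)
```

In this example the usage trend extrapolates to full load, so the baseline interval is lengthened by a quarter while the random variation is narrowed, which slows the operation rhythm and makes it more regular until the load recovers.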

[0033] In one alternative implementation,

[0034] Based on the corrected control parameters, the duration of subsequent operations is determined, and the optimal operation instruction sequence is obtained by solving for it. The optimal operation instruction sequence is then output and executed, including:

[0035] Unexecuted operation nodes are extracted from the optimized operation sequence to obtain a set of operation nodes to be executed. The timing control parameters of each operation node in the set of operation nodes to be executed are replaced with the corrected control parameters. The corrected reference duration and the corrected duration perturbation amount of the corrected control parameters are extracted and sampled using a log-normal distribution to obtain the sampling duration of each operation node. The subsequent operation duration of each operation node is obtained based on the sampling duration and the pre-obtained corrected duration perturbation amount.

[0036] Based on the subsequent operation duration, the optimal execution sequence is obtained by performing time-series optimization on the set of operation nodes to be executed. The optimal execution sequence is then used to rearrange the set of operation nodes to be executed to obtain a rearranged operation node sequence. The target coordinates and subsequent operation durations of each operation node in the rearranged operation node sequence are encoded into instructions to obtain an operation instruction set. The optimal operation instruction sequence is then organized according to the time sequence.

[0037] Each operation instruction in the optimal operation instruction sequence is sent sequentially to the input interface of the legacy system and the response status is monitored. Based on the response status, the instruction execution status is determined and the operation instructions are repeatedly triggered until the optimal operation instruction sequence is completed.

[0038] A second aspect of the present invention provides a non-intrusive mimicry interaction system for legacy systems based on screen pixel recognition, comprising:

[0039] The parameter construction unit is used to collect screen pixel stream data of legacy systems and extract spatial features and operation behavior sequences of interface interaction elements. Based on the operation behavior sequences and spatial features, it constructs an interface element knowledge graph and stores the coordinate range of interactive hot zones. Based on the interface element knowledge graph and task requirements, it generates anthropomorphic operation sequences and configures hot zone positioning parameters and timing control parameters.

[0040] The difference update unit is used to continuously collect screen pixel stream data and calculate pixel difference values. If the pixel difference value exceeds a preset change threshold, the pixel features of the change area are extracted and similarity matching is performed with the knowledge graph of interface elements to obtain the current spatial features and calculate the position offset. Based on the position offset, the coordinate range of the interactive hot zone is geometrically transformed and updated to obtain the optimized hot zone coordinate range.

[0041] The instruction execution unit is used to synchronize the optimized hot zone coordinate range to the anthropomorphic operation sequence to obtain an optimized operation sequence, sample the optimized hot zone coordinate range according to the hot zone positioning parameters to obtain the target coordinates and determine the operation duration in combination with the timing control parameters, generate an initial operation instruction based on the target coordinates and the operation duration and execute it, collect resource occupancy indicators in real time, and if the resource occupancy exceeds a preset occupancy threshold, calculate the over-limit amplitude and correct the timing control parameters in the optimized operation sequence to obtain corrected control parameters.

[0042] The parameter optimization unit is used to determine the duration of subsequent operations based on the corrected control parameters, solve for the optimal operation instruction sequence, output the optimal operation instruction sequence, and execute it.

[0043] A third aspect of the present invention provides an electronic device, comprising:

[0044] A processor and a memory for storing processor-executable instructions, wherein the processor is configured to invoke instructions stored in the memory to perform the aforementioned method.

[0045] A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, implement the aforementioned method.

[0046] This invention achieves non-intrusive interaction with legacy systems through screen pixel stream data recognition, significantly reducing the cost of automating legacy systems. By constructing a knowledge graph of interface elements and storing the coordinates of interactive hot zones, it can automatically learn and memorize the structural features of the operating interface, adapting to different legacy system interfaces. Based on screen pixel difference values and similarity matching, it calculates position offsets and achieves real-time optimization and adjustment of interactive hot zone coordinates, effectively solving the problem of interaction failure caused by dynamic changes in the interface. By configuring hot zone positioning parameters and timing control parameters, it simulates the randomness and naturalness of human operation, reducing the risk of being identified as an automated tool. It monitors system resource usage in real time and dynamically adjusts the operation timing to prevent high-frequency operations from causing system resource overload, ensuring the stability and reliability of the interaction process.

Attached Figure Description

[0047] Figure 1 is a flowchart illustrating a non-intrusive mimicry interaction method for legacy systems based on screen pixel recognition, according to an embodiment of the present invention.

[0048] Figure 2 is a flowchart illustrating the coordinate transformation and operation instruction optimization of a non-intrusive mimicry interaction method for legacy systems based on screen pixel recognition, according to an embodiment of the present invention.

Detailed Implementation

[0049] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0050] The technical solution of the present invention will be described in detail below with reference to specific embodiments. These specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments.

[0051] Figure 1 is a flowchart illustrating a non-intrusive mimicry interaction method for legacy systems based on screen pixel recognition, as described in an embodiment of the present invention. As shown in Figure 1, the method includes:

[0052] Collect screen pixel stream data from legacy systems and extract spatial features and operation behavior sequences of interface interaction elements. Construct an interface element knowledge graph based on the operation behavior sequence and the spatial features and store the coordinate range of interaction hotspots. Generate anthropomorphic operation sequences based on the interface element knowledge graph and task requirements and configure hotspot positioning parameters and timing control parameters.

[0053] Continuously collect screen pixel stream data and calculate pixel difference values. If the pixel difference value exceeds a preset change threshold, extract the pixel features of the change area and perform similarity matching with the knowledge graph of interface elements to obtain the current spatial features and calculate the position offset. Based on the position offset, perform geometric transformation on the coordinate range of the interactive hot zone to update the optimized hot zone coordinate range.

[0054] The optimized hot zone coordinate range is synchronized to the anthropomorphic operation sequence to obtain the optimized operation sequence. The optimized hot zone coordinate range is sampled according to the hot zone positioning parameters to obtain the target coordinates, and the operation duration is determined in combination with the timing control parameters. An initial operation instruction is generated and executed based on the target coordinates and the operation duration. Resource occupancy indicators are collected in real time. If a resource occupancy indicator exceeds the preset occupancy threshold, the over-limit amplitude is calculated and the timing control parameters in the optimized operation sequence are corrected to obtain the corrected control parameters.

[0055] The duration of subsequent operations is determined based on the corrected control parameters, and the optimal operation instruction sequence is obtained by solving for it. The optimal operation instruction sequence is then output and executed.

[0056] In one alternative implementation,

[0057] Collect screen pixel stream data from legacy systems and extract spatial features and operation behavior sequences of interface interaction elements. Based on the operation behavior sequences and spatial features, construct an interface element knowledge graph and store the coordinate range of interactive hot zones, including:

[0058] The screen pixel stream data is collected frame by frame and converted into a pixel matrix. Multi-scale convolution feature extraction is performed on the pixel matrix to obtain a feature map. The feature map is semantically segmented to identify the boundary contours of the interface interactive elements. Morphological analysis is performed on the boundary contours to extract geometric topological features. Based on the geometric topological features and the pixel texture features corresponding to the boundary contours, pattern recognition is performed to determine the functional semantic type. The geometric spatial features of the interface interactive elements are determined by combining the position coordinates and size information corresponding to the boundary contours.

[0059] Collect the user's operation input sequence on the interface interaction elements and extract the operation coordinates and operation timestamps. Perform spatial association matching between the operation coordinates and the position coordinates in the geometric space features to determine the operation target element. Extract the time interval between adjacent operations based on the operation timestamps. Organize the operation target elements according to the time intervals to obtain the operation behavior sequence.
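As a non-limiting example, the spatial association matching and interval extraction can be sketched as follows. The bounding-box hit test, the element names, and the choice to skip inputs that hit no known element are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]   # x_min, y_min, x_max, y_max
Input = Tuple[float, float, float]        # x, y, timestamp in seconds


def target_element(x: float, y: float,
                   elements: Dict[str, Box]) -> Optional[str]:
    """Return the first element whose bounding box contains the point."""
    for name, (x0, y0, x1, y1) in elements.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            return name
    return None


def behavior_sequence(inputs: List[Input],
                      elements: Dict[str, Box]) -> List[Tuple[str, float]]:
    """Ordered (element, seconds since previous matched operation) pairs."""
    sequence: List[Tuple[str, float]] = []
    prev_t: Optional[float] = None
    for x, y, t in sorted(inputs, key=lambda item: item[2]):
        name = target_element(x, y, elements)
        if name is None:
            continue  # assumed: inputs outside every element are ignored
        sequence.append((name, 0.0 if prev_t is None else t - prev_t))
        prev_t = t
    return sequence


# Hypothetical login form and recorded clicks.
elements = {"username": (100, 100, 300, 130),
            "password": (100, 150, 300, 180),
            "login": (100, 200, 220, 240)}
clicks = [(150, 110, 0.0), (160, 165, 1.5), (500, 500, 2.0), (150, 220, 2.75)]
sequence = behavior_sequence(clicks, elements)
# sequence is [("username", 0.0), ("password", 1.5), ("login", 1.25)]
```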

[0060] Directed edges are established between adjacent operation target elements in the operation behavior sequence, and frequency statistics are performed to obtain the operation transition probability. Based on the operation transition probability, an operation path topology graph is constructed and mapped and boundary expansion is performed with the interface interaction elements in the geometric space features to obtain the coordinate range of the interaction hot zone. The interface element knowledge graph is constructed by combining the functional semantic type and the operation transition probability.
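The frequency statistics over directed edges amount to estimating first-order transition probabilities, and boundary expansion can be read as padding each element's bounding box. The following sketch illustrates both; the margin value, the clamping to screen bounds, and the function names are assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # x_min, y_min, x_max, y_max


def transition_probabilities(
        sequences: List[List[str]]) -> Dict[str, Dict[str, float]]:
    """Estimate P(next element | current element) from observed sequences."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for seq in sequences:
        for src, dst in zip(seq, seq[1:]):  # directed edge src -> dst
            counts[src][dst] += 1
    return {src: {dst: c / sum(ctr.values()) for dst, c in ctr.items()}
            for src, ctr in counts.items()}


def expand_box(box: Box, margin: float,
               screen_w: float, screen_h: float) -> Box:
    """Pad a bounding box by 'margin' pixels, clamped to the screen."""
    x0, y0, x1, y1 = box
    return (max(0.0, x0 - margin), max(0.0, y0 - margin),
            min(screen_w, x1 + margin), min(screen_h, y1 + margin))


# Hypothetical recorded sessions on a login form.
sessions = [["username", "password", "login"],
            ["username", "password", "login"],
            ["username", "login"]]
probs = transition_probabilities(sessions)
# probs["username"] is {"password": 2/3, "login": 1/3}

hot_zone = expand_box((100, 200, 220, 240), margin=4,
                      screen_w=1920, screen_h=1080)
```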

[0061] The screen capture module captures screen content at a rate of 25 frames per second. Each frame is stored at a resolution of 1920×1080 pixels, with each pixel containing three channels: red, green, and blue, each with a value ranging from 0 to 255. Each captured frame is converted into a three-dimensional pixel matrix, with dimensions of height × width × number of channels. For example, for a 1080p resolution screen, the converted pixel matrix size is 1080×1920×3.
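The figures above imply a raw data rate that is easy to check. The sketch below computes it and shows one way to reshape a flat capture buffer into a height × width × channels matrix; the row-major interleaved RGB buffer layout and the function name `to_pixel_matrix` are assumptions, since the capture format is not specified.

```python
from typing import List

WIDTH, HEIGHT, CHANNELS, FPS = 1920, 1080, 3, 25

# Matrix dimensions are height x width x channels, as in the description.
frame_shape = (HEIGHT, WIDTH, CHANNELS)

# Each channel value (0 to 255) fits in one byte.
bytes_per_frame = HEIGHT * WIDTH * CHANNELS   # 6,220,800 bytes
bytes_per_second = bytes_per_frame * FPS      # 155,520,000 bytes


def to_pixel_matrix(raw: bytes, height: int, width: int,
                    channels: int = 3) -> List[List[List[int]]]:
    """Reshape a flat buffer into a height x width x channels nested list.

    Assumption: the buffer is row-major with interleaved R, G, B bytes.
    """
    if len(raw) != height * width * channels:
        raise ValueError("buffer size does not match the frame dimensions")
    row_bytes = width * channels
    return [[list(raw[r * row_bytes + c * channels:
                      r * row_bytes + (c + 1) * channels])
             for c in range(width)]
            for r in range(height)]


# Tiny 2 x 2 frame: red, green on the first row; blue, white on the second.
tiny = to_pixel_matrix(bytes([255, 0, 0, 0, 255, 0,
                              0, 0, 255, 255, 255, 255]), 2, 2)
```

At roughly 155 MB of uncompressed pixels per second, processing every frame in full is costly, which is one practical reason to gate the heavier steps behind a change check.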

[0062] Feature maps are obtained by performing multi-scale convolutional feature extraction on the pixel matrix. A convolutional network with five different scales is constructed, using convolutional kernels of sizes 3×3, 5×5, 7×7, 9×9, and 11×11 for each scale. The first convolutional layer uses 32 feature channels, and subsequent layers increase the number of feature channels to 64, 128, 256, and 512 respectively. After each convolutional layer, an activation function is applied and max pooling is performed with a pooling window size of 2×2 and a stride of 2 to reduce the feature map size and extract important features, resulting in feature maps of multiple sizes.
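To make the resulting feature map sizes concrete, the following sketch tracks the spatial dimensions through the pooling stages. It assumes that the five stages are applied one after another, that each convolution is padded so it preserves the spatial size, and that pooling uses no padding; the text does not state these details.

```python
from typing import List, Tuple


def pooled_sizes(height: int, width: int, stages: int,
                 window: int = 2, stride: int = 2) -> List[Tuple[int, int]]:
    """Spatial size after each max-pooling stage (no padding)."""
    sizes = []
    for _ in range(stages):
        height = (height - window) // stride + 1
        width = (width - window) // stride + 1
        sizes.append((height, width))
    return sizes


KERNELS = [3, 5, 7, 9, 11]            # kernel sizes listed in the description
CHANNELS = [32, 64, 128, 256, 512]    # feature channels per stage

sizes = pooled_sizes(1080, 1920, stages=len(CHANNELS))
feature_maps = [(h, w, ch) for (h, w), ch in zip(sizes, CHANNELS)]
# feature_maps[0] is (540, 960, 32); feature_maps[-1] is (33, 60, 512)
```

Under these assumptions the odd height at the fourth stage (135) is rounded down to 67 by pooling, so a decoder that restores the original resolution has to pad or crop when fusing features across scales.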

[0063] The feature maps are semantically segmented to identify the boundary contours of interactive elements in the interface. A semantic segmentation network based on an encoder-decoder architecture is employed. The encoder part uses the previously extracted feature maps, while the decoder part gradually restores the spatial resolution of the feature maps through transposed convolution. During the decoding process, a skip connection mechanism is introduced to fuse low-level detailed features with high-level semantic features to preserve boundary information. The output segmentation map is a probability map with the same size as the original image, where the value of each pixel represents the probability that the pixel belongs to an interactive element. By setting a threshold of 0.75, regions with probabilities greater than the threshold are marked as foreground, i.e., the interface interactive element regions. An edge detection algorithm is used to process the segmentation results to obtain closed boundary contours. For example, for a button element, its boundary contour is represented as a series of continuous two-dimensional coordinate points.
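The thresholding of the probability map can be illustrated on a toy example. The helper `bounding_box` is an assumption added to show how a foreground region maps to element coordinates; it is a simplification of the edge detection step, not a replacement for it.

```python
from typing import List, Optional, Tuple


def foreground_mask(prob_map: List[List[float]],
                    threshold: float = 0.75) -> List[List[int]]:
    """Mark pixels whose probability is strictly greater than the threshold."""
    return [[1 if p > threshold else 0 for p in row] for row in prob_map]


def bounding_box(mask: List[List[int]]) -> Optional[Tuple[int, int, int, int]]:
    """Return (x_min, y_min, x_max, y_max) of the foreground, if any."""
    coords = [(r, c) for r, row in enumerate(mask)
              for c, v in enumerate(row) if v]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return min(cols), min(rows), max(cols), max(rows)


# Toy 4 x 4 probability map with one element in the middle.
prob_map = [[0.10, 0.20, 0.10, 0.05],
            [0.30, 0.90, 0.80, 0.10],
            [0.20, 0.95, 0.76, 0.75],
            [0.10, 0.20, 0.10, 0.00]]
mask = foreground_mask(prob_map)
box = bounding_box(mask)  # (1, 1, 2, 2)
```

The pixel with probability exactly 0.75 is excluded because the description marks only probabilities greater than the threshold as foreground.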

[0064] Morphological analysis is performed on the boundary contours to extract geometric topological features. For each detected boundary contour, its area, perimeter, minimum bounding rectangle, aspect ratio, circularity, convexity, and other geometric features are calculated. The area is calculated as the number of pixels inside the contour; the perimeter is calculated as the pixel length of the contour boundary; the minimum bounding rectangle is represented by four parameters: the coordinates of the top left corner, width, and height; the aspect ratio is the ratio of width to height; the circularity is 4π multiplied by the area divided by the square of the perimeter; and the convexity is the ratio of the contour area to its convex hull area. Topological features such as the number of corner points, corner point distribution, and curvature variation of the contour are extracted and collectively constitute the geometric topological feature vector of the interactive element.
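
The geometric features listed above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `contour_features` is hypothetical, the contour is assumed to be a simple closed polygon, the area is approximated with the shoelace formula instead of counting interior pixels, and the bounding rectangle is assumed to be axis-aligned.

```python
import math

def contour_features(points):
    """Geometric features of a closed contour given as (x, y) vertices.

    Assumption: the polygon (shoelace) area stands in for the pixel count.
    """
    n = len(points)
    twice_area = 0.0
    perimeter = 0.0
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]  # wrap around to close the contour
        twice_area += x0 * y1 - x1 * y0
        perimeter += math.hypot(x1 - x0, y1 - y0)
    area = abs(twice_area) / 2.0
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    width, height = max(xs) - min(xs), max(ys) - min(ys)
    return {
        "area": area,
        "perimeter": perimeter,
        # (left, top, width, height) of the axis-aligned bounding rectangle
        "bounding_rect": (min(xs), min(ys), width, height),
        "aspect_ratio": width / height,
        # 4 * pi * area / perimeter^2, equal to 1.0 for a perfect circle
        "circularity": 4.0 * math.pi * area / perimeter ** 2,
    }
```

For a 120×40 pixel rectangular button, the circularity is well below 1, which is one cue that separates elongated buttons from round controls.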

[0065] Functional semantic types are determined by pattern recognition based on geometric topological features and pixel texture features corresponding to the boundary contours. Color histograms, gradient histograms, and local binary patterns of the internal regions of the boundary contours are extracted from the original pixel matrix as texture features. The geometric topological features and texture features are combined to form a complete feature description vector. A pre-trained classifier is used to classify the feature vectors and identify the functional semantic types of the elements. The classifier is trained on 10,000 manually labeled interface element samples to identify interactive element types such as buttons, checkboxes, input boxes, drop-down menus, sliders, and labels.

[0066] By combining the position coordinates and size information corresponding to the boundary contour, the geometric spatial features of the interface interactive elements are determined. The boundary contour of each interactive element is converted into a standardized spatial representation, including the center point coordinates, width, height, rotation angle, and normalized position relative to the screen. For example, a button located in the upper right corner of the screen can be represented by the following geometric spatial features: center point coordinates are 0.85 times the screen width and 0.12 times the screen height, width is 120 pixels, height is 40 pixels, and rotation angle is 0 degrees.

[0067] The system collects the sequence of user input actions performed on interface elements and extracts the action coordinates and timestamps. Hook functions monitor user actions such as mouse clicks, drags, scrolls, and keyboard input, recording the screen coordinates and time of each action. For mouse clicks, the system records the corresponding screen coordinates and click type; for drag operations, it records the start coordinates, path points, and end coordinates; for keyboard input, it records the input content and the position of the focused element. Each record includes the action type, coordinate information, and a timestamp accurate to milliseconds.

[0068] The operation coordinates are spatially correlated and matched with the position coordinates in the geometric space features to determine the operation target element. A spatial index structure is used to quickly find the interface element where the coordinate point is located. When the user's operation coordinates fall within the geometric boundary of an element, the current element is identified as the operation target. For cases with ambiguous boundaries, a distance-weighted algorithm is used to calculate the distance from the operation coordinates to the boundaries of each candidate element, and the element with the smallest distance is selected as the operation target.
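
A minimal sketch of the target matching step, under stated assumptions: elements are axis-aligned rectangles `(left, top, right, bottom)`, a linear scan replaces the spatial index mentioned above, and the names `distance_to_rect` and `find_target` are hypothetical. A point inside an element has distance 0, so the containment case and the nearest-boundary fallback share one rule.

```python
def distance_to_rect(point, rect):
    """Distance from a point to an axis-aligned rectangle (0 if inside)."""
    x, y = point
    left, top, right, bottom = rect
    dx = max(left - x, 0, x - right)
    dy = max(top - y, 0, y - bottom)
    return (dx * dx + dy * dy) ** 0.5

def find_target(point, elements):
    """Return the name of the element containing or nearest to the point.

    elements: dict mapping element name -> (left, top, right, bottom).
    """
    if not elements:
        return None
    return min(elements, key=lambda name: distance_to_rect(point, elements[name]))
```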

[0069] The time interval between adjacent operations is extracted based on the operation timestamp, and the operation target elements are organized according to the time interval to obtain an operation behavior sequence. The time difference between two adjacent operations is calculated in seconds. Operations are grouped according to the time interval; intervals less than 1.5 seconds are considered as the same group of operations, and intervals greater than 30 seconds are considered as operations of different tasks. Within each group of operations, the operation target elements are organized in order of timestamp to form an operation behavior sequence. For example, the user's actions of clicking the "File" menu, the "Open" option, the file selection box, and the "OK" button in sequence will be organized into a complete operation behavior sequence.
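
The grouping rule can be illustrated with a short sketch. The function name `group_operations` is hypothetical, and the sketch makes one assumption the text does not state: any interval of 1.5 seconds or more starts a new group, so the 30-second task boundary is not modelled separately.

```python
def group_operations(ops, same_group_s=1.5):
    """Group (timestamp_seconds, element) records, sorted by time.

    Consecutive operations closer than same_group_s stay in one group.
    Returns a list of operation behavior sequences (lists of elements).
    """
    groups = []
    for ts, element in ops:
        if groups and ts - groups[-1][-1][0] < same_group_s:
            groups[-1].append((ts, element))
        else:
            groups.append([(ts, element)])
    return [[element for _, element in group] for group in groups]
```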

[0070] Directed edges are established between adjacent target elements in the operation sequence, and frequency statistics are performed to obtain the operation transition probability. Statistical analysis is performed on all recorded operation sequences to calculate the number of transitions from element A to element B. The transition probability is calculated based on the number of transitions; that is, the probability of transitioning from element A to element B is equal to the number of transitions from A to B divided by the sum of all transitions originating from A. For example, if the transition from the "Save" button to the "Close" button occurs 80 times, and there are a total of 100 transitions originating from the "Save" button, then the transition probability is 0.8.
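
The frequency statistics reduce to counting adjacent pairs and normalizing per source element. A minimal sketch, assuming each operation sequence is a list of element names; `transition_probabilities` is a hypothetical name.

```python
from collections import Counter, defaultdict

def transition_probabilities(sequences):
    """Estimate P(B | A) as count(A -> B) / count(all transitions from A)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for src, dst in zip(seq, seq[1:]):  # adjacent pairs form directed edges
            counts[src][dst] += 1
    return {
        src: {dst: n / sum(c.values()) for dst, n in c.items()}
        for src, c in counts.items()
    }
```

With 80 recorded "Save" to "Close" transitions out of 100 transitions leaving "Save", the estimate is 0.8, matching the example above.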

[0071] Based on the operation transition probabilities, an operation path topology graph is constructed and mapped to the interface interaction elements in the geometric space features. Boundary expansion calculations are then performed to obtain the coordinate range of the interaction hotspot. Using interaction elements as nodes and transition probabilities as edge weights, the operation path topology graph is constructed. For paths with high transition probabilities, the interaction hotspot range of related elements is expanded. An adaptive algorithm is used for hotspot expansion; the higher the transition probability, the larger the hotspot expansion range. For example, for a high-frequency operation path with a transition probability of 0.9, the hotspot range of the target element is expanded by 20%, ensuring that the expected operation is triggered even if the user's click is inaccurate. For each interface element, a hotspot description containing the coordinate range is generated.
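
A sketch of the boundary expansion, assuming that "expanded by 20%" means the width and height each grow by that ratio around the hotspot center. The function name `expand_hotspot` is hypothetical, and the mapping from transition probability to expansion ratio is left to the caller because the text gives only one data point (0.9 maps to 20%).

```python
def expand_hotspot(rect, ratio):
    """Grow a (left, top, right, bottom) hotspot about its center.

    ratio=0.2 enlarges width and height by 20% (an assumed interpretation).
    """
    left, top, right, bottom = rect
    dx = (right - left) * ratio / 2.0
    dy = (bottom - top) * ratio / 2.0
    return (left - dx, top - dy, right + dx, bottom + dy)
```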

[0072] A knowledge graph of interface elements is constructed by combining the functional semantic types and the operation transition probabilities. The functional semantic types of interface elements are used as node attributes, and the operation transition probabilities are used as edge attributes to construct a complete knowledge graph. The knowledge graph contains information such as the spatial location, functional type, interaction relationships, and usage frequency of elements.

[0073] In this embodiment, by directly converting the original screen pixel stream into a pixel matrix and performing multi-scale feature extraction and semantic segmentation, the general adaptability across platforms and application scenarios is improved. By performing geometric topological analysis of the boundary contours and combining pixel texture features for functional semantic discrimination, the accuracy and consistency of interface element recognition are improved. By spatially associating and matching user operation inputs with interface geometric spatial features, the ability to express user intent and operation patterns is enhanced. By probabilistically modeling the transfer relationships between adjacent operation targets and constructing an operation path topology graph, the automatic mining of high-frequency interaction paths and key operation nodes is realized, reducing the interference of noisy clicks or occasional operations on the analysis results.

[0074] In one alternative implementation,

[0075] Based on the interface element knowledge graph and task requirements, a human-like operation sequence is generated and hotspot positioning parameters and timing control parameters are configured, including:

[0076] Extract the pre-set task requirements, perform semantic parsing to obtain the task target description, and convert it into a functional semantic type sequence. Calculate the semantic similarity between the functional semantic type sequence and the functional semantic types in the interface element knowledge graph and determine the candidate operation node set. Combine the pre-acquired operation transition probabilities to perform state transition reasoning to obtain the selection priority of each candidate operation node. Sort the nodes in the candidate operation node set according to the selection priority to obtain the initial anthropomorphic operation sequence.

[0077] The coordinate range of the interactive hot zone is extracted from the knowledge graph of interface elements and the center coordinate is calculated. The boundary of the interactive hot zone coordinate range is extracted to obtain the boundary coordinate set. The mean vector and covariance matrix of the coordinate distribution are calculated based on the center coordinate and the boundary coordinate set. The probability density of the coordinate points within the interactive hot zone coordinate range is calculated to obtain the coordinate density distribution. The hot zone positioning parameters are determined by combining the mean vector and the covariance matrix.

[0078] Extract the time intervals corresponding to the initial anthropomorphic operation sequence and perform statistical analysis to obtain the expected duration and the duration standard deviation. Perform a logarithmic transformation on the expected duration to obtain the logarithmic expected value, and calculate the distribution parameters by combining it with the duration standard deviation. Initialize the random disturbance factor and combine it with the distribution parameters to obtain the timing control parameters. Configure the hot zone positioning parameters and the timing control parameters into the initial anthropomorphic operation sequence to obtain the anthropomorphic operation sequence.

[0079] A bag-of-words model-based text analysis method is used to extract keywords from task requirements. For example, "save file" will extract the keywords "save" and "file". Word segmentation is used to decompose the task requirements into a sequence of word units, and a pre-trained domain word vector model is used to map these word units to semantic vectors. Cluster analysis is performed on the semantic vectors to extract core semantic components, forming a task objective description. Action words in the task objective description are mapped to the functional semantic types of interface interaction elements, resulting in a functional semantic type sequence. For example, the task requirement "enter data into a table and save" is converted into a functional semantic type sequence of "input box - button". The model trained with 5000 labeled task samples achieves an accuracy of 91.2%.

[0080] The semantic similarity between the functional semantic type sequence and the functional semantic types in the interface element knowledge graph is calculated to determine the candidate operation node set. A cosine similarity calculation method is used, representing functional semantic types as multi-dimensional vectors, and calculating the cosine of the angle between the vectors as the semantic similarity. A threshold of 0.75 is set; when the similarity is greater than the threshold, the corresponding interface element is added to the candidate operation node set. For the "input box" type, all input box elements with a similarity greater than 0.75 are found in the interface element knowledge graph to form a candidate set. For example, for the task "Enter name in customer information table", five input box elements in the interface are found as candidates, with similarities of 0.92, 0.87, 0.83, 0.78, and 0.76, respectively.
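
The candidate selection can be sketched as a cosine similarity filter. This is an illustration only: the vectors, the node names, and the function names `cosine_similarity` and `candidate_nodes` are assumptions, and real type embeddings would come from the word vector model described earlier.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def candidate_nodes(query, nodes, threshold=0.75):
    """Return (name, similarity) pairs above the threshold, best first.

    nodes: dict mapping node name -> semantic vector.
    """
    scored = [(name, cosine_similarity(query, vec)) for name, vec in nodes.items()]
    kept = [item for item in scored if item[1] > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)
```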

[0081] The selection priority of each candidate operation node is obtained by combining the pre-acquired operation transition probabilities with state transition reasoning. A Markov decision process model is used, with the current interface state as the initial state, candidate operation nodes as possible actions, and operation transition probabilities as state transition probabilities. The long-term expected return of each candidate node is calculated using a value iteration algorithm, with 100 iterations and a convergence threshold of 0.001. Nodes with higher expected returns have higher selection priority. For example, for five candidate input boxes, the expected returns calculated based on historical operation data are 0.87, 0.75, 0.61, 0.42, and 0.29, indicating that the first input box is most likely the target the user wants to operate on.

[0082] The nodes in the candidate operation node set are sorted according to selection priority to obtain the initial anthropomorphic operation sequence. The candidate operation nodes are arranged in descending order of expected return to form a priority queue. For each type in the functional semantic type sequence, operation nodes are selected sequentially from the corresponding priority queue to form the initial anthropomorphic operation sequence. When multiple choices exist for the same type of node, selection is based on contextual relevance. For example, for the "input box-button" sequence, the input box with the highest expected return and the button with the highest probability of operation transition with that input box are selected to form the initial anthropomorphic operation sequence "name input box-save button".

[0083] The interaction hotspot coordinate range is extracted from the interface element knowledge graph, and the center coordinates are calculated. For each operation node, its corresponding interaction hotspot coordinate range is queried from the knowledge graph, represented as two-dimensional coordinates of the top-left and bottom-right corners. The center coordinates of the hotspot are calculated, i.e., the x-coordinate is the average of the x-coordinates of the top-left and bottom-right corners, and the y-coordinate is the average of the y-coordinates of the top-left and bottom-right corners. For example, if the hotspot coordinate range of a button is (500, 300) for the top-left corner and (600, 350) for the bottom-right corner, then the center coordinates are (550, 325).

[0084] Boundary coordinates are extracted from the coordinate range of the interactive hotspot to obtain a set of boundary coordinates. A contour tracking algorithm is used to sample a point every 5 pixels along the hotspot boundary in a clockwise direction, forming the boundary coordinate set. For regularly shaped hotspots, this can be simplified to the coordinates of the four corner points; for irregular hotspots, enough boundary points are retained to accurately describe their shape. For example, the boundary coordinate set of a rectangular button contains four coordinate points: top left (500, 300), top right (600, 300), bottom right (600, 350), and bottom left (500, 350).

[0085] The mean vector and covariance matrix of the coordinate distribution are calculated based on the center coordinates and the boundary coordinate set. The mean vector is the center coordinate, i.e., a two-dimensional vector. The deviations of the boundary coordinate points relative to the center coordinates are calculated, resulting in a set of deviation vectors. The covariance matrix is then calculated based on this set of deviation vectors. The covariance matrix describes the distribution characteristics of the hotspot in different directions. For a regular rectangular hotspot, the diagonal elements of the covariance matrix represent the variance of the hotspot in the horizontal and vertical directions, respectively, while the off-diagonal elements are 0. For an irregular hotspot, the off-diagonal elements reflect the degree of inclination of the hotspot.
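
A minimal sketch of the center, boundary set, and covariance computation for a rectangular hotspot. It assumes the boundary set is simplified to the four corner points and that the covariance is the population covariance (division by the number of points); `hotspot_statistics` is a hypothetical name.

```python
def hotspot_statistics(rect):
    """Return (center, corners, covariance) for a (left, top, right, bottom) rect."""
    left, top, right, bottom = rect
    center = ((left + right) / 2.0, (top + bottom) / 2.0)
    # Boundary coordinate set in clockwise order starting at the top-left corner
    corners = [(left, top), (right, top), (right, bottom), (left, bottom)]
    deviations = [(x - center[0], y - center[1]) for x, y in corners]
    n = len(deviations)
    sxx = sum(dx * dx for dx, _ in deviations) / n
    syy = sum(dy * dy for _, dy in deviations) / n
    sxy = sum(dx * dy for dx, dy in deviations) / n
    return center, corners, ((sxx, sxy), (sxy, syy))
```

For the button with corners (500, 300) and (600, 350), the off-diagonal term comes out as 0, consistent with the statement about regular rectangles.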

[0086] The coordinate density distribution is obtained by calculating the probability density of coordinate points within the interactive hotspot coordinate range. Based on a two-dimensional normal distribution model, the mean vector is used as the distribution center, and the covariance matrix is used as the shape parameter of the distribution. The probability density value of each coordinate point within the hotspot is calculated. Areas with higher probability density indicate a greater likelihood of clicks during user interaction. The coordinate points within the hotspot are sampled on a grid with a sampling interval of 3 pixels, and the probability density value of each sampled point is calculated to form a density distribution map.
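
The density map can be sketched with the closed-form bivariate normal density. The function names `gaussian_density` and `density_map` are hypothetical, the covariance is assumed to be a non-singular 2×2 matrix given as nested tuples, and the grid is assumed to start at the top-left corner of the hotspot.

```python
import math

def gaussian_density(point, mean, cov):
    """Bivariate normal density at point for a 2x2 covariance ((a, b), (b, c))."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Quadratic form d^T * inverse(cov) * d, expanded for the 2x2 case
    q = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))

def density_map(rect, mean, cov, step=3):
    """Density at grid points spaced `step` pixels apart inside the rect."""
    left, top, right, bottom = rect
    return {
        (x, y): gaussian_density((x, y), mean, cov)
        for x in range(left, right + 1, step)
        for y in range(top, bottom + 1, step)
    }
```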

[0087] The hotspot location parameters are determined by combining the mean vector and the covariance matrix. These parameters include center coordinates, principal direction vector, and scale factor. The center coordinates are directly obtained from the mean vector; the principal direction vector is obtained by calculating the eigenvectors of the covariance matrix, representing the main extension direction of the hotspot; and the scale factor is determined by the eigenvalues of the covariance matrix, representing the distribution range along the principal direction. These hotspot location parameters collectively describe the location, shape, and size characteristics of the hotspot, which are used to subsequently generate anthropomorphic click locations.
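
For a 2×2 symmetric covariance matrix the eigen-decomposition has a closed form, so the positioning parameters can be sketched without a linear algebra library. The name `positioning_parameters` is hypothetical, and taking the scale factors as the square roots of the eigenvalues is an assumption; the text only says they are determined by the eigenvalues.

```python
import math

def positioning_parameters(mean, cov):
    """Center, unit principal direction, and (major, minor) scale factors."""
    (a, b), (_, c) = cov
    half_trace = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    lam_major, lam_minor = half_trace + radius, half_trace - radius
    if b != 0:
        vx, vy = lam_major - c, b      # eigenvector for the larger eigenvalue
    elif a >= c:
        vx, vy = 1.0, 0.0              # diagonal matrix, wider than tall
    else:
        vx, vy = 0.0, 1.0              # diagonal matrix, taller than wide
    norm = math.hypot(vx, vy)
    return {
        "center": mean,
        "principal_direction": (vx / norm, vy / norm),
        # Assumed: scale = standard deviation along each principal axis
        "scale": (math.sqrt(lam_major), math.sqrt(max(lam_minor, 0.0))),
    }
```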

[0088] The time intervals corresponding to the initial anthropomorphic operation sequence are extracted and statistically analyzed to obtain the expected duration and standard deviation. Operation sequences similar to the initial anthropomorphic operation sequence are selected from historical operation data, and the time interval data between adjacent operations in these sequences are extracted. The mean of the time interval data is calculated as the expected duration, and the standard deviation is calculated to represent the degree of time fluctuation. For example, for the operation sequence "click the input box - enter content - click the save button", by analyzing 1000 historical operations, the expected duration from clicking the input box to starting input is 0.8 seconds, with a standard deviation of 0.2 seconds; the expected duration from completing input to clicking the save button is 1.5 seconds, with a standard deviation of 0.4 seconds.

[0089] The expected duration is logarithmically transformed to obtain the logarithmic expected value, which is then combined with the standard deviation of the duration to calculate the distribution parameters. Since human operational time intervals typically follow a log-normal distribution, the natural logarithm of the expected duration is taken to obtain the logarithmic expected value. The location and scale parameters of the log-normal distribution are then calculated using the standard deviation. The location parameter equals the logarithmic expected value minus half the square of the scale parameter; the scale parameter is obtained by jointly solving for the expected duration and the standard deviation, i.e., its square equals the natural logarithm of one plus the squared ratio of the standard deviation to the expected duration. For example, with an expected duration of 1.5 seconds and a standard deviation of 0.4 seconds, the logarithmic transformation yields a location parameter of approximately 0.37 and a scale parameter of approximately 0.26.
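
The parameter calculation can be sketched with the standard moment-matching relations for a log-normal distribution, where the mean is exp(mu + sigma²/2) and the squared coefficient of variation is exp(sigma²) − 1. The function name `lognormal_parameters` is hypothetical.

```python
import math

def lognormal_parameters(expected, std):
    """Location (mu) and scale (sigma) of a log-normal with the given mean and std."""
    # sigma^2 = ln(1 + (std / mean)^2)
    sigma = math.sqrt(math.log(1.0 + (std / expected) ** 2))
    # mu = ln(mean) - sigma^2 / 2
    mu = math.log(expected) - sigma ** 2 / 2.0
    return mu, sigma
```

Calling `lognormal_parameters(1.5, 0.4)` reproduces the worked example; the round trip `exp(mu + sigma**2 / 2)` recovers the 1.5-second expected duration.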

[0090] Initialize the random disturbance factor and combine it with the distribution parameters to obtain the timing control parameters. Generate a random disturbance factor that follows a standard normal distribution, with its value controlled between -2 and +2. Combine the random disturbance factor with the distribution parameters to generate a random time interval that conforms to a log-normal distribution. The timing control parameters include a baseline time interval and a random fluctuation range, used to control the rhythm and naturalness of the anthropomorphic operation. For example, setting the baseline time interval to 1.5 seconds and the fluctuation range to ±0.4 seconds, the generated actual operation time interval may be 1.63 seconds.
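
A sketch of the interval generation, assuming that "controlled between -2 and +2" means the standard normal draw is clamped to that range (resampling would be an equally plausible reading). The name `sample_interval` and the `rng` parameter are hypothetical.

```python
import math
import random

def sample_interval(mu, sigma, rng=random):
    """Draw one operation interval in seconds from a clamped log-normal."""
    z = rng.gauss(0.0, 1.0)            # random disturbance factor
    z = max(-2.0, min(2.0, z))         # assumed: clamp to [-2, +2]
    return math.exp(mu + sigma * z)
```

Passing a seeded `random.Random` instance as `rng` makes the generated timing reproducible for testing, while the default module-level generator gives varying intervals in normal use.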

[0091] The anthropomorphic operation sequence is obtained by configuring the hotspot positioning parameters and timing control parameters into the initial anthropomorphic operation sequence. For each operation node in the operation sequence, a specific operation coordinate point is generated based on the hotspot positioning parameters, and an operation time point is generated based on the timing control parameters. The generated operation coordinate points are located within the hotspot and conform to the probability density distribution, exhibiting click position preferences similar to those of humans; the operation time points have natural interval variations, simulating the randomness of human operations. Finally, an anthropomorphic operation sequence is generated, containing the target element, precise coordinates, operation type, and timestamp for each operation, which can be directly used to automate interactive tasks while maintaining a high degree of similarity to real human operations.
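
Generating a click point inside the hotspot can be sketched with rejection sampling. This is a simplified illustration under stated assumptions: the hotspot is an axis-aligned rectangle, the covariance is diagonal so the two axes are sampled independently, and the name `sample_click` and the `max_tries` fallback are hypothetical.

```python
import random

def sample_click(rect, sigma_x, sigma_y, rng=random, max_tries=100):
    """Draw a click point near the hotspot center, always inside the rect."""
    left, top, right, bottom = rect
    cx, cy = (left + right) / 2.0, (top + bottom) / 2.0
    for _ in range(max_tries):
        x, y = rng.gauss(cx, sigma_x), rng.gauss(cy, sigma_y)
        if left <= x <= right and top <= y <= bottom:
            return (x, y)              # accept only points inside the hotspot
    return (cx, cy)                    # fallback: click the exact center
```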

[0092] In this embodiment, by semantically parsing the task requirements and converting them into a sequence of functional semantic types, and then calculating the semantic similarity with the functional semantic types in the interface element knowledge graph, the problem of insufficient adaptability caused by relying on hard-coded control identifiers or absolute coordinate matching is avoided. By combining the operation transition probability for state transition reasoning, the selection of candidate nodes not only reflects the semantic matching relationship, but also comprehensively considers the order and transition preferences in the real user operation path, thereby improving the rationality and success rate of the task execution path. By statistically modeling the spatial distribution of interactive hot zones in the interface element knowledge graph, and using the center coordinates, boundary coordinates, and the mean vector and covariance matrix calculated therefrom, the probability density of coordinate points within the hot zone is characterized, which effectively reduces the mechanicalness of the operation trajectory and improves the naturalness and concealment of spatial behavior. By statistically analyzing the historical operation time intervals and introducing logarithmic transformation and random disturbance factors to generate time sequence control parameters, the operation intervals maintain a stable expected rhythm while having a reasonable fluctuation range, avoiding the problem of monotonous rhythm caused by fixed delay or linear delay strategies.

[0093] In one alternative implementation,

[0094] Continuously collect screen pixel stream data and calculate pixel difference values. If the pixel difference value exceeds a preset change threshold, extract pixel features of the changed area and perform similarity matching with the interface element knowledge graph to obtain the current spatial features and calculate the position offset. Based on the position offset, perform geometric transformation on the coordinate range of the interactive hotspot to update the optimized hotspot coordinate range, including:

[0095] The screen pixel stream data is continuously captured frame by frame to obtain the current pixel frame and historical pixel frames, and the pixel difference is calculated to obtain a pixel difference matrix. The absolute value of the pixel difference matrix is summed to obtain the pixel difference value, which is then compared with a preset change threshold. If the pixel difference value exceeds the preset change threshold, the pixel difference matrix is binarized to obtain a change mask. Based on the change mask, the current pixel frame is segmented and convolutional features are extracted to obtain the pixel features of the change region. The pixel features of the change region are compared with the spatial features stored in the interface element knowledge graph to calculate the feature vector similarity and obtain a similarity score. Based on the similarity score, the maximum matching selection is performed to obtain the current spatial feature.

[0096] The system extracts the current location coordinates from the current spatial features, extracts the historical location coordinates corresponding to the current spatial features from the interface element knowledge graph, calculates the vector difference between the current location coordinates and the historical location coordinates to obtain the location offset, applies the location offset to the interactive hotspot coordinate range stored in the interface element knowledge graph to perform a translation transformation to obtain the candidate hotspot coordinate range, verifies the spatial inclusion relationship between the candidate hotspot coordinate range and the location coordinates in the current spatial features, and adjusts the boundary of the candidate hotspot coordinate range according to the verification result to obtain the optimized hotspot coordinate range.

[0097] The screen pixel stream data is continuously captured frame by frame to obtain the current pixel frame and historical pixel frames, and the pixel difference is calculated to obtain a pixel difference matrix. The screen capture module captures the screen in real time at a frequency of 15 frames per second, with a resolution of 1920×1080 pixels. Each captured image frame is stored as the current pixel frame, and the image from the previous moment is retained as the historical pixel frame. Both images are stored in the form of a three-dimensional matrix with dimensions of height × width × number of channels, i.e., 1080×1920×3. Each pixel in the current pixel frame and the historical pixel frame is compared one by one, and the difference between its red, green and blue channels is calculated to obtain a pixel difference matrix of the same size. For example, if the red, green and blue values of a point in the current pixel frame are (200, 150, 100), and the value of the corresponding point in the historical pixel frame is (190, 145, 105), then the value of that point in the difference matrix is (10, 5, -5).

[0098] The absolute values of the pixel difference matrix are summed to obtain the pixel difference value, which is then compared with a preset change threshold. The absolute values of the differences in the red, green, and blue channels of each pixel in the pixel difference matrix are taken and summed to obtain the degree of difference for a single pixel. The degree of difference for all pixels is accumulated to obtain the pixel difference value for the entire screen, which represents the overall degree of change between two frames. The preset change threshold is 50000, determined through analysis of 500 samples of stable and changing interfaces. When the pixel difference value is below this threshold, the interface is considered basically stable; when it exceeds this threshold, the interface is determined to have changed significantly. For example, a calculated pixel difference value of 75200 exceeds the preset threshold, indicating a significant change in the interface.

[0099] If the pixel difference value exceeds a preset change threshold, the pixel difference matrix is binarized to obtain a change mask. During binarization, the sum of the absolute values of the differences in the three channels of each pixel is calculated to obtain a grayscale difference image. A pixel-level threshold of 30 is set, and the grayscale difference image is binarized: when the difference value at a pixel location is greater than the threshold, the corresponding change mask value is set to 1, indicating that a change has occurred at that location; when the difference value is not greater than the threshold, the change mask value is set to 0, indicating that the location remains stable. The change mask has the same width and height as the original image, but only one channel. The change mask can visually identify areas in the interface that have changed.
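
The differencing, accumulation, and binarization steps can be sketched on nested lists. This is an illustration only: a production system would more likely use an array library, frames are assumed to be lists of rows of (R, G, B) tuples with identical dimensions, and the function names are hypothetical.

```python
def frame_difference(current, previous):
    """Signed per-channel difference of two H x W x 3 frames."""
    return [
        [tuple(c - p for c, p in zip(cur_px, prev_px))
         for cur_px, prev_px in zip(cur_row, prev_row)]
        for cur_row, prev_row in zip(current, previous)
    ]

def pixel_difference_value(diff):
    """Sum of absolute channel differences over the whole frame."""
    return sum(abs(v) for row in diff for px in row for v in px)

def change_mask(diff, pixel_threshold=30):
    """1 where a pixel's summed absolute difference exceeds the threshold, else 0."""
    return [
        [1 if sum(abs(v) for v in px) > pixel_threshold else 0 for px in row]
        for row in diff
    ]
```

The frame-level value returned by `pixel_difference_value` is what would be compared against the preset change threshold (50000 in the text) before the mask is computed.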

[0100] Based on the change mask, region segmentation and convolutional feature extraction are performed on the current pixel frame to obtain the pixel features of the change region. Morphological operations are used to process the change mask, including dilation and closing operations. The dilation operation uses a 5×5 rectangular kernel, and the closing operation uses a 7×7 rectangular kernel to connect adjacent change regions and fill small holes. A connected component analysis algorithm is used to identify continuous regions in the change mask, and regions with an area less than 100 pixels are filtered out to eliminate noise. For each valid connected region, its bounding box coordinates are recorded, and the corresponding image patch is cropped from the current pixel frame. A pre-trained feature extraction network is applied to the cropped image patch. This network contains four convolutional layers with kernel sizes of 5×5, 3×3, 3×3, and 3×3, and channel numbers of 32, 64, 128, and 256, respectively, followed by a max pooling layer. Finally, a feature vector describing the visual features of the change region is obtained, with a dimension of 1024. For example, for a newly appearing button area in the interface, the extracted feature vector can describe its shape, color, and texture features.
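
The connected component analysis and area filter can be sketched with a breadth-first search. This sketch omits the morphological preprocessing and the feature network; it assumes 4-connectivity, a non-empty mask, and bounding boxes with inclusive corner coordinates, and `changed_regions` is a hypothetical name.

```python
from collections import deque

def changed_regions(mask, min_area=100):
    """Bounding boxes (left, top, right, bottom) of 4-connected regions of 1s."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if mask[y][x] != 1 or seen[y][x]:
                continue
            queue = deque([(x, y)])
            seen[y][x] = True
            area, left, top, right, bottom = 0, x, y, x, y
            while queue:
                px, py = queue.popleft()
                area += 1
                left, right = min(left, px), max(right, px)
                top, bottom = min(top, py), max(bottom, py)
                for nx, ny in ((px + 1, py), (px - 1, py), (px, py + 1), (px, py - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and mask[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            if area >= min_area:       # drop regions smaller than min_area as noise
                boxes.append((left, top, right, bottom))
    return boxes
```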

[0101] The similarity score is obtained by calculating the feature vector similarity between the pixel features of the changed region and the spatial features stored in the interface element knowledge graph. The spatial features of all known interface elements are retrieved from the interface element knowledge graph, and each element's spatial feature is also represented as a 1024-dimensional feature vector. Cosine similarity is used to calculate the similarity between the feature vector of the changed region and the feature vectors of each element in the knowledge graph. The cosine similarity value ranges from -1 to 1, with values closer to 1 indicating greater similarity. The calculated similarity scores are normalized to obtain similarity scores between 0 and 1. For example, the similarity scores between a certain changed region and five elements in the knowledge graph are 0.92, 0.87, 0.64, 0.53, and 0.31, respectively.

[0102] The current spatial features are obtained by selecting the maximum match based on the similarity score. A similarity threshold of 0.75 is set, and only candidate matches with similarity scores greater than the threshold are retained. If multiple candidates exist, the element with the highest similarity score is selected as the matching result; if no candidates exist, the current changing region is considered a newly discovered interface element. For successfully matched elements, their current spatial features are updated, including location coordinates, size, and shape description. The location coordinates are the center point of the bounding box of the changing region, obtained by adding half the width and half the height of the bounding box to the coordinates of its top-left corner; the size uses the width and height of the bounding box; and the shape description uses the outline information of the changing region. For example, if a button is matched as the "Confirm Button" in the knowledge graph, its current spatial features are updated to center coordinates (950, 580) and a size of 120×40 pixels.
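
A sketch of the maximum matching selection and the center computation. The names `best_match` and `bounding_box_center` are hypothetical, and the center is assumed to be the top-left corner of the bounding box plus half its width and height.

```python
def best_match(scores, threshold=0.75):
    """Return the element with the highest score above the threshold, or None.

    scores: dict mapping element name -> normalized similarity in [0, 1].
    None signals a newly discovered interface element.
    """
    candidates = {name: s for name, s in scores.items() if s > threshold}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

def bounding_box_center(left, top, width, height):
    """Center point of a bounding box given its top-left corner and size."""
    return (left + width / 2.0, top + height / 2.0)
```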

[0103] The system extracts the current location coordinates from the current spatial features and the historical location coordinates corresponding to the current spatial features from the interface element knowledge graph. The current location coordinates are directly obtained from the current spatial features and represented as two-dimensional coordinate points; the historical location coordinates are retrieved from the interface element knowledge graph to find the location information of the element at its most recent update. For example, the current location coordinates of a "Confirm Button" are (950, 580), while its historical location coordinates are (800, 580).

[0104] The position offset is calculated by performing a vector difference between the current position coordinates and the historical position coordinates. The difference between the horizontal and vertical coordinates is calculated separately to obtain a two-dimensional vector representing the position offset. The position offset represents the distance and direction of movement of the interface element on the screen. For example, if the difference between the current position coordinates (950, 580) and the historical position coordinates (800, 580) is (150, 0), it means that the button has moved 150 pixels to the right horizontally, while remaining unchanged vertically.

[0105] The position offset is applied to the interaction hotspot coordinate range stored in the interface element knowledge graph to perform a translation transformation to obtain the candidate hotspot coordinate range. The interaction hotspot coordinate range associated with the current element is obtained from the interface element knowledge graph, represented as a coordinate pair of the top-left and bottom-right corners. The position offset is added to the top-left and bottom-right corner coordinates respectively to obtain the translated candidate hotspot coordinate range. For example, if the original hotspot coordinate range of a button is top-left (780, 560) and bottom-right (820, 600), and the position offset is (150, 0), then the translated candidate hotspot coordinate range is top-left (930, 560) and bottom-right (970, 600).
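
The offset calculation and translation can be sketched as follows, using the numbers from the example. The function names and the tuple representation of a hot zone are illustrative assumptions.

```python
def position_offset(current, historical):
    """Vector difference between current and historical coordinates."""
    return (current[0] - historical[0], current[1] - historical[1])

def translate_range(hot_zone, offset):
    """Shift a ((x1, y1), (x2, y2)) range by the offset (dx, dy)."""
    (x1, y1), (x2, y2) = hot_zone
    dx, dy = offset
    return ((x1 + dx, y1 + dy), (x2 + dx, y2 + dy))
```

For the "Confirm Button", the offset between (950, 580) and (800, 580) is (150, 0), and applying it to the range (780, 560) to (820, 600) yields (930, 560) to (970, 600).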

[0106] The spatial inclusion relationship between the candidate hotspot coordinate range and the position coordinates in the current spatial features is verified, and the boundary of the candidate hotspot coordinate range is adjusted according to the verification result to obtain the optimized hotspot coordinate range. It is checked whether the candidate hotspot is completely contained within the boundary of the current interface element. If the hotspot exceeds the element boundary, boundary adjustment is performed. The size of the hotspot is scaled proportionally according to the actual size of the current element, or aligned along the element center. For example, if the actual boundary of the current button is the upper left corner (910, 560) and the lower right corner (990, 600), while the candidate hotspot is the upper left corner (930, 560) and the lower right corner (970, 600), the hotspot is found to be completely inside the button, but small. It can be appropriately expanded and adjusted to the upper left corner (915, 565) and the lower right corner (985, 595), so that the hotspot covers most of the button area but retains a certain margin.
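
The containment check and one possible boundary adjustment can be sketched as follows. The text allows proportional scaling or center alignment; this sketch instead insets the element boundary by a fixed margin, which is an assumption chosen because a 5-pixel margin reproduces the example result. Function names are hypothetical.

```python
def is_contained(inner, outer):
    """True if the inner range lies completely inside the outer range."""
    (ix1, iy1), (ix2, iy2) = inner
    (ox1, oy1), (ox2, oy2) = outer
    return ox1 <= ix1 and oy1 <= iy1 and ix2 <= ox2 and iy2 <= oy2

def fit_to_element(element_bounds, margin=5):
    """Assumed adjustment: element boundary shrunk by a fixed margin."""
    (x1, y1), (x2, y2) = element_bounds
    # Never let the margin exceed half of the element size.
    m = min(margin, (x2 - x1) // 2, (y2 - y1) // 2)
    return ((x1 + m, y1 + m), (x2 - m, y2 - m))
```

For the button bounded by (910, 560) and (990, 600), the candidate (930, 560) to (970, 600) passes the containment check, and the inset range is (915, 565) to (985, 595).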

[0107] In this embodiment, by performing pixel-by-pixel difference between the current pixel frame and historical pixel frames and combining it with change threshold judgment, the identification of changed areas can be automatically triggered when the interface undergoes local or overall changes. Only the areas that have actually changed are analyzed, avoiding the unnecessary computational overhead caused by repeatedly processing the entire frame image, thus improving the sensitivity and processing efficiency of change detection. By calculating the vector difference between the current position coordinates and historical position coordinates and applying the position offset to the translation correction of the interactive hotspot coordinate range, the adaptation cycle is significantly shortened and maintenance costs are reduced. Through spatial inclusion relationship verification and boundary adjustment, the corrected hotspot range is kept consistent with the current interface element position, effectively avoiding the problem of clicks going out of bounds or insufficient coverage caused by offset estimation errors.

[0108] In one alternative implementation,

[0109] The optimized hot zone coordinate range is synchronized to the anthropomorphic operation sequence to obtain the optimized operation sequence. The optimized hot zone coordinate range is sampled according to the hot zone positioning parameters to obtain the target coordinates, and the operation duration is determined by combining the timing control parameters. This includes:

[0110] Based on the optimized hot zone coordinate range, the interaction hot zone coordinate range of each operation node in the anthropomorphic operation sequence is replaced and updated to obtain the updated operation node sequence and perform integrity verification. If all operation nodes have completed the hot zone coordinate range update, the updated operation node sequence is output as the optimized operation sequence.

[0111] Extract hot zone positioning parameters from the optimized operation sequence and determine the mean vector, covariance matrix and corresponding sampling weights. Construct a multivariate normal distribution function based on the mean vector and covariance matrix and perform probability sampling according to the sampling weights to obtain a candidate coordinate set. Determine the spatial inclusion relationship between the candidate coordinate set and the optimized hot zone coordinate range, and select the candidate coordinates located within the optimized hot zone coordinate range as the target coordinates.

[0112] The timing control parameters are extracted from the optimized operation sequence and sampled according to a log-normal distribution to obtain the baseline duration. The duration perturbation is initialized with a uniform distribution, and the baseline duration is multiplied by the duration perturbation to obtain the operation duration.

[0113] Based on the optimized hotspot coordinate range, the interaction hotspot coordinate range of each operation node in the anthropomorphic operation sequence is replaced and updated to obtain the updated operation node sequence, and its integrity is verified. Each operation node in the anthropomorphic operation sequence is traversed, and the corresponding interface element identifier is retrieved. The latest optimized hotspot coordinate range for that element is found in the interface element knowledge graph based on the interface element identifier. The original interaction hotspot coordinate range of the operation node is replaced with the found optimized hotspot coordinate range, completing the update of a single operation node. For example, for the "Click Confirm Button" node in the operation sequence, the original interaction hotspot coordinate range is the top left corner (780, 560) and the bottom right corner (820, 600), which is updated to the optimized hotspot coordinate range of the top left corner (915, 565) and the bottom right corner (985, 595).

[0114] Integrity checks are performed to verify that all nodes in the operation sequence have completed hotspot coordinate updates. The updated sequence of operation nodes is traversed, and an update flag is set for each node, initially set to 0. When a node completes its hotspot coordinate update, the corresponding flag is set to 1. Integrity checks are determined by calculating whether the sum of all flags equals the total number of operation nodes. If the check passes, all nodes have completed the update; if it fails, the node that did not complete the update is recorded, and an exception log is generated. For example, in a sequence with 5 operation nodes, the sum of the flags after the update is 5, so the check passes; if the sum is 4, it means one node did not complete the update, and the check fails. If all operation nodes have completed the hotspot coordinate range update, the updated sequence of operation nodes is output as an optimized operation sequence for subsequent interactive execution.
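
The flag-based integrity check can be sketched as follows. The function name and the list-of-flags representation are illustrative assumptions.

```python
def verify_updates(update_flags):
    """update_flags holds 0 (not updated) or 1 (updated) per node.

    Returns (passed, missing), where missing lists the zero-based
    positions of nodes whose hot zone was not updated.
    """
    missing = [i for i, flag in enumerate(update_flags) if flag != 1]
    passed = sum(update_flags) == len(update_flags)
    return passed, missing
```

A sequence of five nodes with a flag sum of 5 passes; a sum of 4 fails, and the returned list identifies the node to record in the exception log.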

[0115] The hot zone positioning parameters are extracted from the optimized operation sequence, and the mean vector, covariance matrix, and corresponding sampling weights are determined. For each operation node in the optimized operation sequence, the hot zone positioning parameters are extracted from its updated hot zone coordinate range. The mean vector is obtained by calculating the center coordinates of the hot zone coordinate range; that is, the x-coordinate is the average of the x-coordinates of the upper left and lower right corners, and the y-coordinate is the average of the y-coordinates of the upper left and lower right corners. For example, the mean vector of the upper left corner (915, 565) and lower right corner (985, 595) of the hot zone coordinate range is (950, 580).

[0116] The covariance matrix is constructed by calculating the variance of the hot zone in the horizontal and vertical directions. The horizontal variance is equal to the square of the hot zone width divided by 12, and the vertical variance is equal to the square of the hot zone height divided by 12, applicable to rectangular hot zones. For a hot zone with a width of 70 pixels and a height of 30 pixels, the horizontal variance is approximately 408.33, and the vertical variance is 75.00. The constructed covariance matrix is a 2×2 diagonal matrix, with the diagonal elements representing the horizontal and vertical variances, and the off-diagonal elements being 0.
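
The mean vector and covariance matrix can be derived from a rectangular range as sketched below. The width squared over 12 is the variance of a uniform distribution over that width. The function name and the nested-list matrix are illustrative assumptions.

```python
def hot_zone_distribution(hot_zone):
    """Return (mean vector, 2x2 diagonal covariance) for a rectangle."""
    (x1, y1), (x2, y2) = hot_zone
    mean = ((x1 + x2) / 2, (y1 + y2) / 2)
    width, height = x2 - x1, y2 - y1
    var_x = width ** 2 / 12   # horizontal variance
    var_y = height ** 2 / 12  # vertical variance
    covariance = [[var_x, 0.0], [0.0, var_y]]
    return mean, covariance
```

For the range (915, 565) to (985, 595), this gives the mean (950, 580) and diagonal entries of about 408.33 and 75.00.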

[0117] Sampling weights are determined by analyzing historical interaction data. A kernel density estimation method is used, with historical click coordinates as sample points to calculate the click probability density at each location within the hotspot. For new hotspots with fewer than 20 historical data points, a Gaussian distribution with a higher center weight is used by default as the sampling weight. For example, if historical interaction data for a button shows the highest click frequency in the center area and lower click frequency in the edge areas, the resulting sampling weight is 1.0 at the center, decreasing to 0.3 at the edge as the distance from the center increases.

[0118] A multivariate normal distribution function is constructed based on the mean vector and the covariance matrix, and a candidate coordinate set is obtained by probability sampling according to the sampling weights. A two-dimensional normal distribution function is defined using the mean vector and the covariance matrix. A rejection sampling method is employed, combined with sampling weights, to generate random coordinate points that follow a multivariate normal distribution. The sampling weight value corresponding to each random coordinate point is calculated, and a uniformly random number between 0 and 1 is generated. If the random number is less than the sampling weight value, the coordinate point is accepted; otherwise, it is rejected and resampling is performed. This process is repeated until a predetermined number of candidate coordinate points are obtained, typically set to 50. For example, for a hot zone with center coordinates (950, 580) and diagonal elements of the covariance matrix of 408.33 and 75.00, the generated candidate coordinates may be distributed near the center, such as (952, 583), (945, 578), (960, 585), etc.

[0119] The spatial inclusion relationship between the candidate coordinate set and the optimized hot zone coordinate range is determined. Each candidate coordinate is checked to see if it lies within the hot zone boundary. The x-coordinate of the candidate coordinate is determined to be greater than or equal to the x-coordinate of the upper left corner of the hot zone and less than or equal to the x-coordinate of the lower right corner of the hot zone. Simultaneously, the y-coordinate of the candidate coordinate is determined to be greater than or equal to the y-coordinate of the upper left corner of the hot zone and less than or equal to the y-coordinate of the lower right corner of the hot zone. Candidate coordinates that pass the check are marked as valid coordinates; those that fail are marked as invalid coordinates and removed from the candidate set. For example, candidate coordinates (952, 583) are located within the hot zone boundaries (915, 565) and (985, 595) and are determined to be valid coordinates; while candidate coordinates (990, 600) are outside the hot zone boundaries and are determined to be invalid coordinates.

[0120] Candidate coordinates within the optimized hot zone coordinate range are selected as target coordinates. A coordinate is randomly selected from the set of valid candidate coordinates as the final interactive target coordinate. If the number of valid candidate coordinates is zero, the center coordinates of the hot zone are used directly as the target coordinates. For each operation node in the operation sequence, its own target coordinates are generated. For example, the candidate coordinates (952, 583) are ultimately selected as the precise coordinates for clicking the "Confirm" button.
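
The sampling, rejection, containment filtering, and final selection can be sketched together as follows. This is a sketch under stated assumptions: the weight function is a simple linear falloff from 1.0 at the center to 0.3 at the edge, standing in for the kernel density estimate, and `center_weight`, `sample_target`, and `max_attempts` are hypothetical names. Because the covariance matrix is diagonal, each axis is drawn independently. The hot zone is assumed to have non-zero width and height.

```python
import math
import random

def center_weight(point, hot_zone, edge_weight=0.3):
    """Assumed weight: 1.0 at the center, edge_weight at the boundary."""
    (x1, y1), (x2, y2) = hot_zone
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    d = max(abs(point[0] - cx) / ((x2 - x1) / 2),
            abs(point[1] - cy) / ((y2 - y1) / 2))
    return max(0.0, 1.0 - (1.0 - edge_weight) * d)

def sample_target(hot_zone, count=50, rng=None, max_attempts=10000):
    """Draw candidates by rejection sampling and pick one inside the zone."""
    if rng is None:
        rng = random.Random()
    (x1, y1), (x2, y2) = hot_zone
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    sd_x = math.sqrt((x2 - x1) ** 2 / 12)
    sd_y = math.sqrt((y2 - y1) ** 2 / 12)
    candidates, attempts = [], 0
    while len(candidates) < count and attempts < max_attempts:
        attempts += 1
        point = (rng.gauss(cx, sd_x), rng.gauss(cy, sd_y))
        # Accept the point with probability equal to its sampling weight.
        if rng.random() < center_weight(point, hot_zone):
            candidates.append(point)
    valid = [p for p in candidates
             if x1 <= p[0] <= x2 and y1 <= p[1] <= y2]
    if not valid:
        return (round(cx), round(cy))  # fall back to the zone center
    x, y = rng.choice(valid)
    return (round(x), round(y))
```

Passing a seeded `random.Random` makes a run reproducible for testing, while the default generator gives a different target on each call.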

[0121] The timing control parameters are extracted from the optimized operation sequence and sampled using a log-normal distribution to obtain the baseline duration. The timing control parameters include two distribution parameters: a location parameter and a scale parameter, which together define the shape of the log-normal distribution. The location parameter determines the center position of the distribution, and the scale parameter determines the dispersion of the distribution. Based on the location and scale parameters, a log-normal random number generator is used to generate the baseline duration. A random number following a standard normal distribution is generated, multiplied by the scale parameter, added to the location parameter, and then the exponential function value is calculated to obtain a random number following a log-normal distribution, which is then used as the baseline duration. For example, if the location parameter of an operation node is 0.38 and the scale parameter is 0.26, the generated baseline duration is 1.63 seconds.

[0122] The operation duration is obtained by initializing a uniformly distributed duration perturbation and multiplying the baseline duration by the duration perturbation. The duration perturbation, used to simulate minute fluctuations in human operation time, is generated by a uniformly distributed random number generator with a value range of 0.95 to 1.05. The final operation duration is obtained by multiplying the baseline duration by the duration perturbation. For example, multiplying a baseline duration of 1.63 seconds by a perturbation of 1.02 yields an operation duration of 1.66 seconds.
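
The duration generation can be sketched as follows: a standard normal draw is scaled, shifted, and exponentiated to give a log-normal baseline, which is then multiplied by a uniform perturbation. The function name and return layout are illustrative assumptions.

```python
import math
import random

def sample_operation_duration(location, scale, rng=None,
                              low=0.95, high=1.05):
    """Return (baseline, perturbation, operation duration) in seconds."""
    if rng is None:
        rng = random.Random()
    z = rng.gauss(0.0, 1.0)                    # standard normal draw
    baseline = math.exp(location + scale * z)  # log-normal baseline
    perturbation = rng.uniform(low, high)      # small human-like jitter
    return baseline, perturbation, baseline * perturbation
```

With a location of 0.38 and a scale of 0.26, the median baseline is exp(0.38), about 1.46 seconds, so the 1.63 seconds in the example is one plausible draw.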

[0123] The target coordinates and operation duration are configured into each node of the optimized operation sequence to form complete anthropomorphic operation instructions. Each instruction contains three core elements: operation type, target coordinates, and execution time. For example, for the operation node "click the confirmation button", the generated operation instruction is: operation type is "left mouse click", target coordinates are (952, 583), and execution time is the current time plus 1.66 seconds.

[0124] In this embodiment, the optimized hot zone coordinate range obtained in the preceding step dynamically replaces the hot zone coordinate range of each operation node in the anthropomorphic operation sequence, and the operation node sequence undergoes integrity verification before output. This avoids the problem of some nodes still using outdated hot zone information after interface changes or local adjustments, improves the coherence and reliability of the operation sequence at the execution level, and reduces the risk of failure or accidental touches. By constructing a multivariate normal distribution based on the mean vector, covariance matrix, and sampling weights, performing probability sampling, and then filtering for valid target coordinates through spatial inclusion relationships, the mechanical appearance caused by highly consistent click positions is effectively avoided, the probability of the landing point deviating from the effective area is reduced, and the operation hit rate is improved. By sampling the timing control parameters using a log-normal distribution and superimposing a uniformly distributed perturbation to generate the operation duration, the operation interval is made to conform overall to the statistical pattern of human operation rhythm. While ensuring task efficiency, this significantly enhances the authenticity and unpredictability of timing behavior and reduces the risk of being judged as non-human operation by rule-based detection or anomaly identification mechanisms.

[0125] In one alternative implementation,

[0126] Based on the target coordinates and operation duration, an initial operation command is generated and executed. Resource occupancy indicators are collected in real time. If the resource occupancy exceeds a preset threshold, the over-limit magnitude is calculated, and the timing control parameters in the optimized operation sequence are corrected to obtain the corrected control parameters, including:

[0127] The target coordinates are converted into absolute coordinates in the screen coordinate system. An operation instruction data structure is constructed based on the absolute coordinates and the operation duration, and the initial operation instruction is obtained by encoding and encapsulation. The initial operation instruction is sent to the input interface of the legacy system and triggered to execute. The legacy system is monitored for resources to obtain resource occupancy indicators and compared with preset occupancy thresholds. If the resource occupancy indicators exceed the preset occupancy thresholds, it is determined that the resource occupancy is abnormal.

[0128] If abnormal resource usage exists, the difference between the resource usage index and the preset usage threshold is calculated to obtain the absolute value of the excess and the excess amplitude is calculated. The timing control parameters and the historical execution records of the corresponding operation nodes in the optimized operation sequence are extracted. The resource usage index in the historical execution records is subjected to timing analysis to obtain the resource usage trend and is linearly regressed with the excess amplitude to obtain the predicted resource load value. The duration adjustment factor is calculated based on the predicted resource load value and combined with the pre-acquired baseline duration to obtain the corrected baseline duration. The variance analysis of the pre-acquired duration disturbance is performed to obtain the disturbance fluctuation range and combined with the excess amplitude to perform shrinkage calculation to obtain the corrected disturbance fluctuation range and generate the corrected duration disturbance. The corrected duration disturbance is combined with the corrected baseline duration to obtain the corrected control parameters.

[0129] The target system's display settings are used to obtain the current screen resolution and scaling. If the target system is running in a multi-monitor environment, the monitor index of the target window and its position offset relative to the primary monitor must also be determined. The calculation to convert the target coordinates to absolute coordinates includes multiplying the relative coordinates by the scaling factor to obtain the actual pixel coordinates; adding the position offset of the top-left corner of the target window on the screen; and if the target window is not on the primary monitor, the position offset between monitors must also be added. For example, if the target coordinates are (952, 583), the position of the top-left corner of the target window on the screen is (100, 50), and the screen scaling factor is 1.25, then the converted absolute coordinates are (1290, 779).
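
The coordinate conversion can be sketched as follows. The function name, the argument layout, and rounding to the nearest pixel are assumptions; the text only gives the rounded result.

```python
def to_absolute(target, window_origin, scale, monitor_offset=(0, 0)):
    """Scale window-relative coordinates, then add window and monitor offsets."""
    x = round(target[0] * scale) + window_origin[0] + monitor_offset[0]
    y = round(target[1] * scale) + window_origin[1] + monitor_offset[1]
    return (x, y)
```

For the example, 952 × 1.25 + 100 = 1290 and 583 × 1.25 + 50 = 778.75, which rounds to 779.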

[0130] An operation instruction data structure is constructed based on the absolute coordinates and the operation duration, and then encoded and encapsulated to obtain the initial operation instruction. The operation instruction data structure contains multiple fields: instruction type, coordinate value, timestamp, and additional parameters. The instruction type field indicates the specific action of the operation, such as mouse movement, single click, double click, key press, etc., and is encoded with integers, for example, 1 indicates mouse movement, and 2 indicates left click. The coordinate value field stores the converted absolute coordinates, containing two integer values, horizontal and vertical. The timestamp field indicates the precise time of instruction execution, calculated by adding the operation duration to the current system time, with a precision of milliseconds. The additional parameter field is used to store additional information for specific instruction types, such as the key code of a key press instruction. The constructed data structure is binary encoded to obtain the initial operation instruction that can be recognized by the input interface. For example, the encoding format of the left click instruction is: instruction type (2) occupies 1 byte, coordinate value (1290, 779) occupies 8 bytes, timestamp occupies 8 bytes, no additional parameters, and the total length is 17 bytes of binary data.
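
The 17-byte layout in the example can be sketched with Python's `struct` module. The little-endian byte order, the two signed 32-bit coordinates, and the unsigned 64-bit millisecond timestamp are assumptions; the text only fixes the field sizes of 1, 8, and 8 bytes.

```python
import struct

# "<" disables padding: 1 byte type + 4 + 4 bytes coords + 8 bytes timestamp.
INSTRUCTION_FORMAT = "<BiiQ"
MOUSE_MOVE, LEFT_CLICK = 1, 2  # integer codes given in the text

def encode_instruction(kind, x, y, timestamp_ms):
    """Pack an instruction without additional parameters into 17 bytes."""
    return struct.pack(INSTRUCTION_FORMAT, kind, x, y, timestamp_ms)

def decode_instruction(payload):
    """Inverse of encode_instruction; returns (kind, x, y, timestamp_ms)."""
    return struct.unpack(INSTRUCTION_FORMAT, payload)
```

Instructions that carry additional parameters, such as a key code, would need an extended format that this sketch does not cover.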

[0131] The initial operation command is sent to the legacy system's input interface and triggered for execution. This initial operation command is sent through the low-level input device driver interface provided by the operating system. Depending on the operating system environment, the corresponding input interface API is selected. For example, in one operating system, the input device service interface is used; in another, the human-machine interface device access interface is used. Sufficient execution permissions must be obtained before sending the command. A precise timer is used to control the execution time of the command, ensuring that the command is executed at the specified timestamp. For example, if the current system time is 13:45:30.500 and the operation duration is 1.66 seconds, then the command execution time is 13:45:32.160, and the timer will wait 1.66 seconds before precisely triggering the command execution. After the command is sent, the execution result is monitored through a status callback mechanism, recording the execution status and completion time.

[0132] Resource monitoring of the legacy system yields resource usage metrics, which are then compared to preset thresholds. Resource usage data, including CPU utilization, memory usage, disk read / write speed, and network transfer rate, is obtained through the operating system's performance counter API. Monitoring is performed every 200 milliseconds, and the average of five consecutive data collections is used as the current resource usage metric. Preset thresholds are determined based on the target system's performance baseline: CPU utilization at 80%, memory usage at 85% of available memory, disk read / write speed at 100 MB / s, and network transfer rate at 50 MB / s. The monitored resource usage metrics are compared to their corresponding preset thresholds to determine if any anomalies exist. For example, if CPU utilization reaches 92% after an operation, exceeding the preset threshold of 80%, an anomaly is identified.

[0133] If abnormal resource usage is detected, the difference between the resource usage indicator and the preset usage threshold is calculated to obtain the absolute value of the excess and the magnitude of the excess. The absolute value of the excess is calculated by subtracting the preset usage threshold from the resource usage indicator. The magnitude of the excess is calculated by dividing the absolute value of the excess by the preset usage threshold. For example, if the CPU utilization rate is 92% and the preset threshold is 80%, the absolute value of the excess is 12 percentage points, and the magnitude of the excess is 12 / 80 = 0.15. If multiple resource usage anomalies occur simultaneously, the one with the largest magnitude of excess is selected as the primary abnormal indicator.
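
The threshold comparison and the selection of the primary abnormal indicator can be sketched as follows. The metric keys and function names are hypothetical; the threshold values are those listed in the preceding paragraph.

```python
THRESHOLDS = {"cpu": 80.0, "memory": 85.0,
              "disk_mb_s": 100.0, "network_mb_s": 50.0}

def average_samples(samples):
    """Mean of consecutive readings, e.g. five taken 200 ms apart."""
    return sum(samples) / len(samples)

def primary_anomaly(metrics, thresholds=THRESHOLDS):
    """Return (name, excess, magnitude) of the worst over-limit metric.

    Returns None when no metric exceeds its threshold.
    """
    worst = None
    for name, value in metrics.items():
        limit = thresholds[name]
        if value > limit:
            excess = value - limit
            magnitude = excess / limit
            if worst is None or magnitude > worst[2]:
                worst = (name, excess, magnitude)
    return worst
```

For a CPU reading of 92 against the threshold of 80, the excess is 12 and the magnitude is 0.15.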

[0134] Extract the timing control parameters and corresponding historical execution records of the operation nodes from the optimized operation sequence. Retrieve the historical records of the 10 most recent executions of the current operation sequence from the operation log database. Each record includes execution time, resource usage, and operation result. Sort the historical records by time and extract the resource usage indicators and corresponding timing control parameters for each execution. For example, the historical records of an operation node "Click the Confirm Button" show that the CPU utilization rates for the 10 most recent executions were 75%, 78%, 76%, 82%, 79%, 85%, 80%, 83%, 87%, and 92%, with corresponding baseline durations of 1.20 seconds, 1.35 seconds, 1.30 seconds, 1.50 seconds, 1.45 seconds, 1.60 seconds, 1.55 seconds, 1.65 seconds, 1.70 seconds, and 1.63 seconds, respectively.

[0135] A time-series analysis of resource usage indicators in the historical execution records is performed to obtain the resource usage trend, and a linear regression is used to fit this trend with the over-limit amplitude to obtain the predicted resource load value. A moving average method is used to smooth the historical resource usage indicators with a window size of 3, resulting in a smoothed resource usage sequence. The first difference of the resource usage indicators is calculated to determine whether there is a significant upward or downward trend in resource usage. A linear regression method is used, with the historical baseline duration as the independent variable and the resource usage indicator as the dependent variable, to fit a model of the relationship between resource usage and operation duration. Based on the fitted model and the current over-limit amplitude, the resource load required for the next execution is predicted. For example, the fitted linear relationship shows that for every 0.1 second increase in the baseline duration, the CPU utilization rate increases by an average of 3%. The current over-limit amplitude is 0.15, and the predicted resource load for the next execution will increase by 4.5%, with a predicted resource load value of 96.5%.
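
The smoothing, differencing, and regression steps can be sketched with ordinary least squares on the ten historical records from the preceding paragraph. The function names are hypothetical, and `predict_load` simply mirrors the arithmetic of the example (current load plus slope times over-limit amplitude), which is an assumed reading of how the fitted model and the amplitude are combined.

```python
CPU = [75, 78, 76, 82, 79, 85, 80, 83, 87, 92]
DURATION = [1.20, 1.35, 1.30, 1.50, 1.45, 1.60, 1.55, 1.65, 1.70, 1.63]

def moving_average(values, window=3):
    """Simple moving average; output is shorter by window - 1."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def first_difference(values):
    """Change between consecutive readings; positive means rising."""
    return [b - a for a, b in zip(values, values[1:])]

def fit_line(xs, ys):
    """Ordinary least squares; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict_load(current, slope_per_second, magnitude):
    """Assumed combination used in the example: 92 + 30 * 0.15."""
    return current + slope_per_second * magnitude
```

On this sample data the fitted slope is roughly 27 percentage points per second, in the same range as the 3% per 0.1 second quoted in the example.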

[0136] The corrected baseline duration is obtained by calculating the duration adjustment factor based on the predicted resource load value and combining it with the pre-acquired baseline duration. The duration adjustment factor is calculated based on the inverse relationship between resource load and operation duration; when resource load is too high, the operation time needs to be extended to reduce system load. The formula for calculating the duration adjustment factor is 1 plus the product of the over-limit amplitude and a proportional coefficient. The proportional coefficient is set according to the resource type: 2.0 for CPU and memory, and 1.5 for disk and network. The corrected baseline duration is obtained by multiplying the pre-acquired baseline duration by the duration adjustment factor. For example, if the current baseline duration is 1.63 seconds, the over-limit amplitude is 0.15, the resource type is CPU, and the proportional coefficient is 2.0, the calculated duration adjustment factor is 1.30, and the corrected baseline duration is 2.12 seconds.
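
The adjustment factor can be computed as sketched below. The dictionary keys and the function name are illustrative assumptions; the coefficients are those stated above.

```python
COEFFICIENTS = {"cpu": 2.0, "memory": 2.0, "disk": 1.5, "network": 1.5}

def corrected_baseline(baseline, magnitude, resource):
    """Return (adjustment factor, corrected baseline duration)."""
    factor = 1.0 + magnitude * COEFFICIENTS[resource]
    return factor, baseline * factor
```

For a baseline of 1.63 seconds, an over-limit amplitude of 0.15, and the CPU coefficient of 2.0, the factor is 1.30 and the corrected baseline is 1.63 × 1.30 ≈ 2.12 seconds.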

[0137] A variance analysis is performed on the pre-acquired duration disturbance to obtain the disturbance fluctuation range. This range is then combined with the over-limit amplitude to calculate a corrected disturbance fluctuation range and generate a corrected duration disturbance. The pre-acquired duration disturbance range is 0.95 to 1.05, with a fluctuation range of 0.10. The fluctuation range is narrowed based on the over-limit amplitude, with a narrowing ratio of 1 minus the over-limit amplitude, ensuring more stable operation duration fluctuations under high system load. The corrected disturbance fluctuation range is the original fluctuation range multiplied by the narrowing ratio. Within the corrected disturbance fluctuation range, a uniformly distributed random number is generated as the corrected duration disturbance. For example, with an over-limit amplitude of 0.15 and a narrowing ratio of 0.85, the corrected disturbance fluctuation range is 0.085, the corrected duration disturbance range is 0.9575 to 1.0425, and the randomly generated corrected duration disturbance is 0.98.
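
The narrowing of the perturbation range can be sketched as follows. The function name is hypothetical, and clamping the narrowing ratio at zero for amplitudes above 1 is an added assumption that the text does not address.

```python
def corrected_perturbation_range(low, high, magnitude):
    """Shrink the range around its center by the ratio 1 - magnitude."""
    ratio = max(0.0, 1.0 - magnitude)  # assumed clamp for large overruns
    center = (low + high) / 2
    half_width = (high - low) * ratio / 2
    return center - half_width, center + half_width
```

For the range 0.95 to 1.05 and an amplitude of 0.15, the width shrinks from 0.10 to 0.085 and the corrected range is 0.9575 to 1.0425.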

[0138] The corrected duration perturbation is combined with the corrected baseline duration to obtain the corrected control parameters. The corrected baseline duration is multiplied by the corrected duration perturbation to obtain the operation duration, which is the core component of the corrected control parameters. The corrected control parameters also include the operation type and target coordinates, retained from the original operation command. For example, if the corrected baseline duration is 2.12 seconds and the corrected duration perturbation is 0.98, the calculated final operation duration is 2.08 seconds. In the corrected operation command data structure, the operation type and target coordinates remain unchanged; only the timestamp field is updated to the current system time plus 2.08 seconds. The corrected control parameters are used to update the corresponding nodes in the operation sequence, replacing the original timing control parameters, and re-triggering operation execution.

[0139] In this embodiment, by converting the target coordinates into absolute coordinates in the screen coordinate system and combining them with the operation duration to generate standardized operation instructions, which are then sent to the legacy system input interface, automated interaction can be completed without altering the internal structure of the legacy system, ensuring good compatibility with existing systems. By quantifying the extent of resource overruns and combining them with historical execution records for trend analysis, the load changes on system resources from subsequent operations can be predicted, enabling smoother control of resource usage, reducing drastic fluctuations in resource consumption, and improving the stability of system operation. By making targeted corrections to the baseline duration and disturbance fluctuation range based on the resource load prediction results, the operation rhythm actively converges towards lower loads while maintaining human-like characteristics, achieving a dynamic balance between operational efficiency, system load, and natural behavior. This significantly enhances the system's security, robustness, and sustainable application capabilities under high load or long-term operation scenarios.

[0140] Figure 2 is a flowchart illustrating the coordinate transformation and operation instruction optimization of a non-intrusive mimicry interaction method for legacy systems based on screen pixel recognition, according to an embodiment of the present invention.

[0141] In one alternative implementation,

[0142] Based on the corrected control parameters, the duration of subsequent operations is determined, and the optimal operation instruction sequence is obtained by solving for it. The optimal operation instruction sequence is then output and executed, including:

[0143] Unexecuted operation nodes are extracted from the optimized operation sequence to obtain a set of operation nodes to be executed. The timing control parameters of each operation node in the set of operation nodes to be executed are replaced with the corrected control parameters. The corrected reference duration and the corrected duration perturbation amount of the corrected control parameters are extracted and sampled using a log-normal distribution to obtain the sampling duration of each operation node. The subsequent operation duration of each operation node is obtained based on the sampling duration and the pre-obtained corrected duration perturbation amount.

[0144] Based on the subsequent operation duration, the optimal execution sequence is obtained by performing time-series optimization on the set of operation nodes to be executed. The optimal execution sequence is then used to rearrange the set of operation nodes to be executed to obtain a rearranged operation node sequence. The target coordinates and subsequent operation durations of each operation node in the rearranged operation node sequence are encoded into instructions to obtain an operation instruction set. The optimal operation instruction sequence is then organized according to the time sequence.

[0145] Each operation instruction in the optimal operation instruction sequence is sent sequentially to the input interface of the legacy system and the response status is monitored. Based on the response status, the instruction execution status is determined and the operation instructions are repeatedly triggered until the optimal operation instruction sequence is completed.

[0146] The set of operation nodes to be executed is obtained by extracting unexecuted operation nodes from the optimized operation sequence. By traversing the optimized operation sequence and checking the execution status flag of each operation node, operation nodes with the status flag "not executed" are extracted to form the set of operation nodes to be executed; in the example below, a node flagged "execution failed" is likewise included so that it can be re-executed. The execution status flag has three values: not executed, executed, and execution failed. During operation sequence initialization, the status flag of all nodes is set to "not executed"; when a node successfully completes an operation, the status flag is updated to "executed"; when an operation fails, the status flag is updated to "execution failed". For example, if an optimized operation sequence contains 10 operation nodes, where the first 3 nodes have the status flag "executed", the 4th node has the status flag "execution failed", and the 5th to 10th nodes have the status flag "not executed", then the set of operation nodes to be executed contains the 4th to 10th nodes, a total of 7 operation nodes.
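The extraction step can be sketched as a simple filter over status flags. This is a minimal illustration, not the patented implementation: the names `ExecStatus`, `OperationNode`, and `extract_pending` are hypothetical, and the `include_failed` switch is an assumption introduced here because the 10-node example counts the failed 4th node among the nodes to be executed.

```python
from dataclasses import dataclass
from enum import Enum


class ExecStatus(Enum):
    NOT_EXECUTED = "not executed"
    EXECUTED = "executed"
    EXECUTION_FAILED = "execution failed"


@dataclass
class OperationNode:
    name: str
    status: ExecStatus = ExecStatus.NOT_EXECUTED  # initial flag for every node


def extract_pending(sequence, include_failed=True):
    """Return the nodes still to be executed, preserving sequence order."""
    wanted = {ExecStatus.NOT_EXECUTED}
    if include_failed:
        # Assumption: failed nodes are queued again, as in the 10-node example.
        wanted.add(ExecStatus.EXECUTION_FAILED)
    return [node for node in sequence if node.status in wanted]


# Nodes 1-3 executed, node 4 failed, nodes 5-10 not yet executed.
statuses = (
    [ExecStatus.EXECUTED] * 3
    + [ExecStatus.EXECUTION_FAILED]
    + [ExecStatus.NOT_EXECUTED] * 6
)
sequence = [OperationNode(f"node{i}", s) for i, s in enumerate(statuses, start=1)]
pending = extract_pending(sequence)
print(len(pending), pending[0].name)  # 7 node4
```

With `include_failed=False` the same sequence yields only the six nodes flagged as not executed.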

[0147] The corrected control parameters are applied to replace the timing control parameters of each operation node in the set of operation nodes to be executed. For each operation node in the set of operation nodes to be executed, the original timing control parameters are extracted, including the base duration and duration perturbation. The original base duration is replaced with the corrected base duration from the previously obtained corrected control parameters, and the original duration perturbation is replaced with the corrected duration perturbation. If the set of operation nodes to be executed contains multiple different types of operations, such as click, swipe, and input, the corresponding corrected control parameters are applied for different types of operations. For example, for click operations, the original base duration is 1.63 seconds, and the corrected duration is 2.12 seconds; for swipe operations, the original base duration is 2.15 seconds, and the corrected duration is 2.80 seconds. The correction of the duration perturbation is performed in the same way.
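As a hedged sketch of this replacement step, corrected parameters can be kept in a lookup keyed by operation type and swapped into each pending node. The names `TimingParams`, `PendingNode`, and `apply_corrected` are hypothetical. The base durations come from the example above, while every perturbation bound except the 0.9575 to 1.0425 range is a placeholder chosen only for illustration. Leaving a node unchanged when no corrected entry exists for its type is also an assumption.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TimingParams:
    base_duration: float   # seconds
    perturb_low: float     # lower bound of the multiplicative perturbation
    perturb_high: float    # upper bound of the multiplicative perturbation


@dataclass(frozen=True)
class PendingNode:
    name: str
    op_type: str           # e.g. "click", "swipe", "input"
    timing: TimingParams


def apply_corrected(nodes, corrected_by_type):
    """Swap each node's timing parameters for the corrected ones of its type."""
    return [
        replace(node, timing=corrected_by_type.get(node.op_type, node.timing))
        for node in nodes
    ]


corrected_by_type = {
    "click": TimingParams(2.12, 0.9575, 1.0425),
    "swipe": TimingParams(2.80, 0.9575, 1.0425),  # bounds are placeholders
}
pending_nodes = [
    PendingNode("confirm", "click", TimingParams(1.63, 0.95, 1.05)),  # placeholder bounds
    PendingNode("scroll", "swipe", TimingParams(2.15, 0.95, 1.05)),   # placeholder bounds
]
updated = apply_corrected(pending_nodes, corrected_by_type)
print([(n.name, n.timing.base_duration) for n in updated])
```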

[0148] The corrected reference duration and corrected duration perturbation of the corrected control parameters are extracted and sampled using a log-normal distribution to obtain the sampling duration of each operation node. The corrected reference duration is extracted from the corrected control parameters, and its logarithm serves as the mean parameter of the log-normal distribution; the standard deviation parameter of the log-normal distribution is calculated based on the fluctuation range of the corrected duration perturbation. A log-normal sampling function is constructed from these two parameters to generate a random sampling duration for each node in the set of operation nodes to be executed. The sampling method is to generate a random number that follows a standard normal distribution, multiply it by the standard deviation parameter, add the mean parameter, and finally exponentiate the result to obtain the sampling duration. For example, the corrected reference duration of a certain operation node is 2.12 seconds, the corrected duration perturbation range is 0.9575 to 1.0425, the calculated mean parameter is 0.75 (the natural logarithm of 2.12), the standard deviation parameter is 0.04, and the randomly drawn sampling duration is 2.15 seconds.
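The sampling rule described above amounts to drawing exp(ln(base) + sigma * z) with z standard normal. The following minimal sketch uses only the Python standard library. The function name `sample_duration` is hypothetical, and the standard deviation 0.04 is taken directly from the example because the text does not spell out how it is derived from the perturbation range.

```python
import math
import random


def sample_duration(base_duration, sigma, rng=random):
    """Draw exp(ln(base_duration) + sigma * z) with z ~ N(0, 1)."""
    mu = math.log(base_duration)   # log-space mean parameter
    z = rng.gauss(0.0, 1.0)        # standard normal draw
    return math.exp(mu + sigma * z)


rng = random.Random(0)             # seeded only to make the sketch repeatable
print(round(math.log(2.12), 2))    # 0.75
samples = [sample_duration(2.12, 0.04, rng) for _ in range(5)]
print([round(s, 2) for s in samples])
```

Note that exp(mu) is the median of a log-normal distribution, so in this sketch 2.12 seconds is the median of the sampled durations rather than their arithmetic mean.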

[0149] The subsequent operation duration of each operation node is obtained from the sampling duration and the previously obtained corrected duration perturbation. It is calculated by multiplying the sampling duration of each operation node by the corresponding corrected duration perturbation. The corrected duration perturbation follows a uniform distribution, and its value range is determined by the corrected perturbation fluctuation range. For example, if the sampling duration is 2.15 seconds and the corrected duration perturbation is randomly generated as 0.98, the calculated subsequent operation duration is 2.11 seconds. This calculation is repeated for all nodes in the set of operation nodes to be executed, resulting in a complete set of subsequent operation durations.
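A short sketch of this multiplication step, assuming the perturbation is drawn uniformly from a closed range. The function name `subsequent_duration` is hypothetical; the bounds 0.9575 and 1.0425 are the range quoted in the preceding example.

```python
import random


def subsequent_duration(sampled, low, high, rng=random):
    """Scale a sampled duration by a factor drawn uniformly from [low, high]."""
    factor = rng.uniform(low, high)
    return sampled * factor, factor


# Worked example from the text, with the perturbation fixed at 0.98.
print(round(2.15 * 0.98, 2))   # 2.11

duration, factor = subsequent_duration(2.15, 0.9575, 1.0425, random.Random(1))
print(round(duration, 2), round(factor, 4))
```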

[0150] The optimal execution sequence is obtained by performing time-series optimization on the set of nodes to be executed based on the duration of subsequent operations. A heuristic search algorithm is used to optimize the execution order of the nodes to be executed to minimize the total execution time while ensuring the logical dependencies of the operations. The optimization process includes constructing an operation dependency graph, calculating the critical path, and adjusting non-critical nodes. The operation dependency graph is constructed by analyzing the dependencies between operation nodes, and the edges between nodes represent execution order constraints. The critical path is the longest path from the start node to the end node in the dependency graph, which determines the shortest execution time of the entire sequence. By adjusting the execution time and order of non-critical nodes, the overall execution time can be optimized. For example, the set of nodes to be executed contains four operations: "Login - Select Menu - Fill in Form - Click Confirm". There are dependencies between "Login" and "Select Menu", "Select Menu" and "Fill in Form", and "Fill in Form" and "Click Confirm". The optimized execution sequence is: Login (2.11 seconds) - Select Menu (1.85 seconds) - Fill in Form (3.25 seconds) - Click Confirm (2.15 seconds).
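The dependency graph and critical path can be illustrated with a small sketch. It substitutes a plain topological sort and longest-path pass for the heuristic search named in the text, so it shows the constraints rather than the optimizer itself. The function name `schedule` is hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def schedule(durations, deps):
    """Return (execution order, critical path, critical path length).

    durations: {node: seconds}; deps: {node: set of prerequisite nodes}.
    """
    graph = {n: set(deps.get(n, ())) for n in durations}
    order = list(TopologicalSorter(graph).static_order())
    finish, prev = {}, {}
    for node in order:
        # Earliest finish = own duration + latest-finishing prerequisite.
        best = max(graph[node], key=lambda p: finish[p], default=None)
        prev[node] = best
        finish[node] = durations[node] + (finish[best] if best is not None else 0.0)
    end = max(finish, key=finish.get)
    path = []
    while end is not None:         # walk back along the longest chain
        path.append(end)
        end = prev[end]
    return order, path[::-1], max(finish.values())


durations = {"Login": 2.11, "Select Menu": 1.85, "Fill in Form": 3.25, "Click Confirm": 2.15}
deps = {
    "Select Menu": {"Login"},
    "Fill in Form": {"Select Menu"},
    "Click Confirm": {"Fill in Form"},
}
order, path, total = schedule(durations, deps)
print(order, round(total, 2))
```

Because the four example operations form a single dependency chain, every node lies on the critical path, and the total of 9.36 seconds is simply the sum of the four durations.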

[0151] The set of operation nodes to be executed is rearranged according to the optimal execution sequence to obtain a rearranged operation node sequence. The operation nodes to be executed are reorganized according to the order in the optimal execution sequence to form the rearranged operation node sequence. For operation nodes with parallel execution conditions, the optimal parallel execution scheme is determined based on dependencies and resource consumption. During the rearrangement process, the natural smoothness of the operation must also be considered to avoid unnatural interaction rhythms caused by operation intervals that are too short or too long. For example, the original order of the set of operation nodes to be executed is "Login - Fill in form - Select menu - Click confirm", and the rearranged order according to the optimal execution sequence is "Login - Select menu - Fill in form - Click confirm", adjusting the positions of "Fill in form" and "Select menu".

[0152] The target coordinates and subsequent operation durations of each operation node in the rearranged operation node sequence are encoded into instructions to obtain an operation instruction set, which is then organized in temporal order to obtain the optimal operation instruction sequence. For each operation node in the rearranged operation node sequence, its target coordinates and subsequent operation duration are extracted to construct an operation instruction data structure. The operation instruction data structure includes fields such as instruction identifier, instruction type, target coordinates, execution time, and status flag. The instruction identifier is a unique identifier for the instruction, generated from a 16-bit random number; the instruction type indicates the specific action of the operation, such as mouse movement, single click, or double click; the target coordinates are the precise location of the operation; the execution time is calculated by adding the accumulated operation duration to the current time; and the status flag indicates the current state of the instruction, initially set to "pending execution". The constructed operation instructions are organized according to the order of the rearranged operation node sequence to form the optimal operation instruction sequence. For example, the rearranged operation node "Login", with target coordinates (952, 583) and a subsequent operation duration of 2.11 seconds, is encoded as an instruction with the identifier "1234", the instruction type "left click", the target coordinates (952, 583), the execution time "current time + 2.11 seconds", and the status flag "pending execution".
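One possible shape for the instruction data structure is sketched below. The class and function names are hypothetical. Redrawing the 16-bit identifier on a collision is an assumption added here to keep identifiers unique, and the coordinates of the second node are made up for illustration, since the text gives coordinates only for "Login".

```python
import random
import time
from dataclasses import dataclass


@dataclass
class OperationInstruction:
    instruction_id: int        # 16-bit random identifier
    instruction_type: str      # e.g. "left click"
    target: tuple              # (x, y) screen coordinates
    execute_at: float          # absolute time in seconds
    status: str = "pending execution"


def encode_instructions(nodes, now=None, rng=random):
    """nodes: (instruction_type, (x, y), duration_seconds) tuples in execution order."""
    now = time.time() if now is None else now
    elapsed, used_ids, out = 0.0, set(), []
    for instruction_type, target, duration in nodes:
        elapsed += duration                    # accumulated operation duration
        ident = rng.getrandbits(16)
        while ident in used_ids:               # redraw on collision (assumption)
            ident = rng.getrandbits(16)
        used_ids.add(ident)
        out.append(OperationInstruction(ident, instruction_type, target, now + elapsed))
    return out


nodes = [
    ("left click", (952, 583), 2.11),   # "Login", from the example
    ("left click", (100, 200), 1.85),   # hypothetical coordinates for "Select Menu"
]
instructions = encode_instructions(nodes, now=1000.0, rng=random.Random(0))
print([round(i.execute_at, 2) for i in instructions])  # [1002.11, 1003.96]
```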

[0153] Each operation instruction in the optimal operation instruction sequence is sequentially sent to the input interface of the legacy system, and the response status is monitored. A timer queue is established, and the sending plan is scheduled according to the execution time of each operation instruction. When the execution time of the instruction is reached, the corresponding operation instruction is sent through the operating system's input device interface. Instruction sending is asynchronous and does not block the execution of the main thread. At the same time, a monitoring thread is started to capture the interface changes and response status of the legacy system in real time. The monitoring methods include screen pixel comparison, interface element recognition, and system event capture. For example, after sending the "click the confirmation button" instruction, the state changes of the confirmation button (such as color change, pressed effect) and the subsequent interface response (such as the appearance of a prompt box, page jump) are monitored.
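The timer queue and non-blocking dispatch can be sketched with the standard-library `sched` and `threading` modules. This is only an illustration under stated assumptions: `dispatch` and `send` are hypothetical names, `send` stands in for the operating system's input device interface, and the monitoring thread described above is not shown.

```python
import sched
import threading
import time


def dispatch(instructions, send, timefunc=time.monotonic, delayfunc=time.sleep):
    """Schedule send(payload) at each absolute time and run the queue off the main thread.

    instructions: iterable of (execute_at, payload), with execute_at on timefunc's clock.
    """
    queue = sched.scheduler(timefunc, delayfunc)
    for order, (execute_at, payload) in enumerate(instructions):
        # priority=order keeps the sequence order when two times coincide
        queue.enterabs(execute_at, order, send, argument=(payload,))
    worker = threading.Thread(target=queue.run, daemon=True)
    worker.start()                 # the caller is not blocked while the queue drains
    return worker


sent = []
start = time.monotonic()
worker = dispatch([(start + 0.02, "confirm"), (start + 0.01, "login")], sent.append)
worker.join()                      # joined here only so the sketch can print the result
print(sent)                        # ['login', 'confirm']
```

The instructions are fired by execution time rather than by insertion order, which is why "login" is sent first even though it was queued second.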

[0154] Based on the response status, the execution status of the instruction is determined, and the operation instruction is repeatedly triggered until the optimal operation instruction sequence is completed. The monitored response status is analyzed to determine whether the operation instruction was successfully executed. The response status is divided into three categories: success, failure, and timeout. A success status indicates that the system response after instruction execution meets expectations, such as the interface showing the expected changes; a failure status indicates that the system response after instruction execution is abnormal, such as an error message appearing or the interface not changing as expected; a timeout status indicates that no system response was detected within a predetermined time. For a success status, the status flag of the instruction is updated to "executed," and the next instruction in the sequence is executed. For failure or timeout status, a preset retry strategy is used: for simple operations such as clicking, the original instruction is directly repeated; for complex operations such as inputting text, a clearing operation may be performed before re-entry. The maximum number of retries is set to 3. If the operation still fails after 3 retries, the instruction is marked as "execution failed," an error log is recorded, and the next instruction in the sequence is executed. For example, after sending the "Click Confirm Button" command, if the button status changes to "Pressed" but the expected page redirection is not detected, the execution is deemed to have failed, and the system automatically resends the command. After two retries, if the page redirection is detected to be successful, the command status is updated to "Executed," and the next command in the sequence is executed.
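The retry strategy can be sketched as a bounded loop. The names `run_instruction`, `send`, `check_response`, and `prepare_retry` are hypothetical, and the sketch assumes that "3 retries" means up to three attempts in addition to the initial one.

```python
MAX_RETRIES = 3


def run_instruction(send, check_response, prepare_retry=None, max_retries=MAX_RETRIES):
    """Send once, then retry up to max_retries times on failure or timeout.

    send(): dispatches the instruction.
    check_response(): returns "success", "failure", or "timeout".
    prepare_retry(): optional cleanup before a retry, e.g. clearing a text field.
    Returns (final_status, retries_used).
    """
    for attempt in range(max_retries + 1):
        if attempt > 0 and prepare_retry is not None:
            prepare_retry()
        send()
        if check_response() == "success":
            return "executed", attempt
    return "execution failed", max_retries


# The "Click Confirm Button" example: two unsuccessful attempts, then success.
responses = iter(["failure", "timeout", "success"])
clicks = []
result = run_instruction(lambda: clicks.append("click"), lambda: next(responses))
print(result, len(clicks))  # ('executed', 2) 3
```

When every attempt fails, the function returns `("execution failed", 3)` after four sends in total, at which point the caller would log the error and move on to the next instruction.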

[0155] In this embodiment, by identifying unexecuted operation nodes from the optimized operation sequence and uniformly applying corrected control parameters, the timing rhythm of subsequent operations is calibrated as a whole. This avoids the problem of inconsistent rhythms caused by some nodes still using old parameters. By probabilistically sampling the corrected baseline duration and disturbance amount to generate the subsequent operation duration of each operation node, the adjusted timing not only conforms to the current system carrying capacity but also maintains reasonable randomness and continuity, which is conducive to maintaining the natural characteristics of operation behavior. By optimizing the timing of operation nodes to be executed based on the subsequent operation duration and reordering them, the operation execution order can take into account both task logic constraints and system load conditions. This avoids response delays or failure accumulation caused by strictly following the original order when resources are tight, thus improving the flexibility and fault tolerance of the overall execution process. By monitoring the response status of the legacy system in real time after each operation instruction is executed and deciding whether to continue triggering subsequent instructions or repeat the current instruction based on the execution feedback, it is possible to more accurately identify whether the operation has taken effect, reducing the problem of missed operations or repeated erroneous operations caused by misjudgment, and significantly improving the reliability and consistency of task execution.

[0156] A second aspect of the present invention provides a non-intrusive mimicry interaction system for legacy systems based on screen pixel recognition, comprising:

[0157] The parameter construction unit is used to collect screen pixel stream data of legacy systems and extract spatial features and operation behavior sequences of interface interaction elements. Based on the operation behavior sequences and spatial features, it constructs an interface element knowledge graph and stores the coordinate range of interactive hot zones. Based on the interface element knowledge graph and task requirements, it generates anthropomorphic operation sequences and configures hot zone positioning parameters and timing control parameters.

[0158] The difference update unit is used to continuously collect screen pixel stream data and calculate pixel difference values. If the pixel difference value exceeds a preset change threshold, the pixel features of the change area are extracted and similarity matching is performed with the knowledge graph of interface elements to obtain the current spatial features and calculate the position offset. Based on the position offset, the coordinate range of the interactive hot zone is geometrically transformed and updated to obtain the optimized hot zone coordinate range.

[0159] The instruction execution unit is used to synchronize the optimized hot zone coordinate range to the anthropomorphic operation sequence to obtain an optimized operation sequence, sample the optimized hot zone coordinate range according to the hot zone positioning parameters to obtain the target coordinates and determine the operation duration in combination with the timing control parameters, generate an initial operation instruction based on the target coordinates and the operation duration and execute it, collect resource occupancy indicators in real time, and if the resource occupancy exceeds a preset occupancy threshold, calculate the over-limit amplitude and correct the timing control parameters in the optimized operation sequence to obtain corrected control parameters.

[0160] The parameter optimization unit is used to determine the duration of subsequent operations based on the corrected control parameters, solve for the optimal operation instruction sequence, and output and execute the optimal operation instruction sequence.

[0161] A third aspect of the present invention provides an electronic device, comprising:

[0162] A processor and a memory for storing processor-executable instructions, wherein the processor is configured to invoke instructions stored in the memory to perform the aforementioned method.

[0163] A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, implement the aforementioned method.

[0164] The present invention may be embodied as a method, an apparatus, a system, and/or a computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions loaded thereon for performing various aspects of the invention.

[0165] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A non-intrusive mimicry interaction method for legacy systems based on screen pixel recognition, characterized in that the method comprises: Collect screen pixel stream data from legacy systems and extract spatial features and operation behavior sequences of interface interaction elements. Construct an interface element knowledge graph based on the operation behavior sequences and the spatial features and store the coordinate range of interactive hot zones. Generate anthropomorphic operation sequences based on the interface element knowledge graph and task requirements and configure hot zone positioning parameters and timing control parameters. Continuously collect screen pixel stream data and calculate pixel difference values. If the pixel difference value exceeds a preset change threshold, extract the pixel features of the change area and perform similarity matching with the interface element knowledge graph to obtain the current spatial features and calculate the position offset. Based on the position offset, perform geometric transformation on the coordinate range of the interactive hot zone and update it to obtain the optimized hot zone coordinate range. The optimized hot zone coordinate range is synchronized to the anthropomorphic operation sequence to obtain the optimized operation sequence. The optimized hot zone coordinate range is sampled according to the hot zone positioning parameters to obtain the target coordinates, and the operation duration is determined in combination with the timing control parameters. An initial operation instruction is generated and executed based on the target coordinates and the operation duration. Resource occupancy indicators are collected in real time. If the resource occupancy exceeds the preset occupancy threshold, the over-limit amplitude is calculated and the timing control parameters in the optimized operation sequence are corrected to obtain the corrected control parameters. 
The duration of subsequent operations is determined based on the modified control parameters, and the optimal operation instruction sequence is obtained by solving the problem. The optimal operation instruction sequence is then output and executed. Collect screen pixel stream data from legacy systems and extract spatial features and operation behavior sequences of interface interaction elements. Based on the operation behavior sequences and spatial features, construct an interface element knowledge graph and store the coordinate range of interactive hot zones, including: The screen pixel stream data is collected frame by frame and converted into a pixel matrix. Multi-scale convolution feature extraction is performed on the pixel matrix to obtain a feature map. The feature map is semantically segmented to identify the boundary contours of the interface interactive elements. Morphological analysis is performed on the boundary contours to extract geometric topological features. Based on the geometric topological features and the pixel texture features corresponding to the boundary contours, pattern recognition is performed to determine the functional semantic type. The geometric spatial features of the interface interactive elements are determined by combining the position coordinates and size information corresponding to the boundary contours. Collect the user's operation input sequence on the interface interaction elements and extract the operation coordinates and operation timestamps. Perform spatial association matching between the operation coordinates and the position coordinates in the geometric space features to determine the operation target element. Extract the time interval between adjacent operations based on the operation timestamps. Organize the operation target elements according to the time intervals to obtain the operation behavior sequence. 
Directed edges are established between adjacent operation target elements in the operation behavior sequence, and frequency statistics are performed to obtain the operation transition probability. Based on the operation transition probability, an operation path topology graph is constructed and mapped and boundary expansion is performed with the interface interaction elements in the geometric space features to obtain the coordinate range of the interaction hot zone. The interface element knowledge graph is constructed by combining the functional semantic type and the operation transition probability.

2. The method of claim 1, wherein, Based on the interface element knowledge graph and task requirements, a human-like operation sequence is generated and hotspot positioning parameters and timing control parameters are configured, including: Extract the pre-set task requirements, perform semantic parsing to obtain the task target description, and convert it into a functional semantic type sequence. Calculate the semantic similarity between the functional semantic type sequence and the functional semantic types in the interface element knowledge graph and determine the candidate operation node set. Combine the pre-acquired operation transition probabilities to perform state transition reasoning to obtain the selection priority of each candidate operation node. Sort the nodes in the candidate operation node set according to the selection priority to obtain the initial anthropomorphic operation sequence. The coordinate range of the interactive hot zone is extracted from the knowledge graph of interface elements and the center coordinate is calculated. The boundary of the interactive hot zone coordinate range is extracted to obtain the boundary coordinate set. The mean vector and covariance matrix of the coordinate distribution are calculated based on the center coordinate and the boundary coordinate set. The probability density of the coordinate points within the interactive hot zone coordinate range is calculated to obtain the coordinate density distribution. The hot zone positioning parameters are determined by combining the mean vector and the covariance matrix. Extract the time intervals corresponding to the initial anthropomorphic operation sequence and perform statistical analysis to obtain the expected duration and duration standard deviation. Perform a logarithmic transformation on the expected duration to obtain the logarithmic expected value and calculate the distribution parameters by combining the duration standard deviation. 
Initialize the random disturbance factor and combine it with the distribution parameters to obtain the timing control parameters. Configure the hot zone positioning parameters and the timing control parameters into the initial anthropomorphic operation sequence to obtain the anthropomorphic operation sequence.

3. The method of claim 1, wherein, Continuously collect screen pixel stream data and calculate pixel difference values. If the pixel difference value exceeds a preset change threshold, extract pixel features of the changed area and perform similarity matching with the interface element knowledge graph to obtain the current spatial features and calculate the position offset. Based on the position offset, perform geometric transformation on the coordinate range of the interactive hotspot to update the optimized hotspot coordinate range, including: The screen pixel stream data is continuously captured frame by frame to obtain the current pixel frame and historical pixel frames, and the pixel difference is calculated to obtain a pixel difference matrix. The absolute value of the pixel difference matrix is summed to obtain the pixel difference value, which is then compared with a preset change threshold. If the pixel difference value exceeds the preset change threshold, the pixel difference matrix is binarized to obtain a change mask. Based on the change mask, the current pixel frame is segmented and convolutional features are extracted to obtain the pixel features of the change region. The pixel features of the change region are compared with the spatial features stored in the interface element knowledge graph to calculate the feature vector similarity and obtain a similarity score. Based on the similarity score, the maximum matching selection is performed to obtain the current spatial feature. 
The system extracts the current location coordinates from the current spatial features, extracts the historical location coordinates corresponding to the current spatial features from the interface element knowledge graph, calculates the vector difference between the current location coordinates and the historical location coordinates to obtain the location offset, applies the location offset to the interactive hotspot coordinate range stored in the interface element knowledge graph to perform a translation transformation to obtain the candidate hotspot coordinate range, verifies the spatial inclusion relationship between the candidate hotspot coordinate range and the location coordinates in the current spatial features, and adjusts the boundary of the candidate hotspot coordinate range according to the verification result to obtain the optimized hotspot coordinate range.

4. The method according to claim 1, characterized in that, The optimized hot zone coordinate range is synchronized to the anthropomorphic operation sequence to obtain the optimized operation sequence. The optimized hot zone coordinate range is sampled according to the hot zone positioning parameters to obtain the target coordinates, and the operation duration is determined by combining the timing control parameters. This includes: Based on the optimized hot zone coordinate range, the interaction hot zone coordinate range of each operation node in the anthropomorphic operation sequence is replaced and updated to obtain the updated operation node sequence and perform integrity verification. If all operation nodes have completed the hot zone coordinate range update, the updated operation node sequence is output as the optimized operation sequence. Extract hot zone positioning parameters from the optimized operation sequence and determine the mean vector, covariance matrix and corresponding sampling weights. Construct a multivariate normal distribution function based on the mean vector and covariance matrix and perform probability sampling according to the sampling weights to obtain a candidate coordinate set. Determine the spatial inclusion relationship between the candidate coordinate set and the optimized hot zone coordinate range, and select the candidate coordinates located within the optimized hot zone coordinate range as the target coordinates. The timing control parameters are extracted from the optimized operation sequence and sampled according to a log-normal distribution to obtain the reference duration. The duration perturbation is initialized with a uniform distribution and the reference duration is multiplied with the duration perturbation to obtain the operation duration.

5. The method of claim 1, wherein, Based on the target coordinates and operation duration, an initial operation command is generated and executed. Resource occupancy indicators are collected in real time. If the occupancy exceeds a preset threshold, the over-limit magnitude is calculated, and the timing control parameters in the optimized operation sequence are corrected to obtain the corrected control parameters, including: The target coordinates are converted into absolute coordinates in the screen coordinate system. An operation instruction data structure is constructed based on the absolute coordinates and the operation duration, and the initial operation instruction is obtained by encoding and encapsulation. The initial operation instruction is sent to the input interface of the legacy system and triggered to execute. The legacy system is monitored for resources to obtain resource occupancy indicators and compared with preset occupancy thresholds. If the resource occupancy indicators exceed the preset occupancy thresholds, it is determined that the resource occupancy is abnormal. If abnormal resource usage exists, the difference between the resource usage index and the preset usage threshold is calculated to obtain the absolute value of the excess and the excess amplitude is calculated. The timing control parameters and the historical execution records of the corresponding operation nodes in the optimized operation sequence are extracted. The resource usage index in the historical execution records is subjected to timing analysis to obtain the resource usage trend and is linearly regressed with the excess amplitude to obtain the predicted resource load value. The duration adjustment factor is calculated based on the predicted resource load value and combined with the pre-acquired baseline duration to obtain the corrected baseline duration. 
The variance analysis of the pre-acquired duration disturbance is performed to obtain the disturbance fluctuation range and combined with the excess amplitude to perform shrinkage calculation to obtain the corrected disturbance fluctuation range and generate the corrected duration disturbance. The corrected duration disturbance is combined with the corrected baseline duration to obtain the corrected control parameters.

6. The method of claim 1, wherein, Based on the corrected control parameters, the duration of subsequent operations is determined, and the optimal operation instruction sequence is obtained by solving for it. The optimal operation instruction sequence is then output and executed, including: Unexecuted operation nodes are extracted from the optimized operation sequence to obtain a set of operation nodes to be executed. The timing control parameters of each operation node in the set of operation nodes to be executed are replaced with the corrected control parameters. The corrected reference duration and the corrected duration perturbation amount of the corrected control parameters are extracted and sampled using a log-normal distribution to obtain the sampling duration of each operation node. The subsequent operation duration of each operation node is obtained based on the sampling duration and the pre-obtained corrected duration perturbation amount. Based on the subsequent operation duration, the optimal execution sequence is obtained by performing time-series optimization on the set of operation nodes to be executed. The optimal execution sequence is then used to rearrange the set of operation nodes to be executed to obtain a rearranged operation node sequence. The target coordinates and subsequent operation durations of each operation node in the rearranged operation node sequence are encoded into instructions to obtain an operation instruction set. The optimal operation instruction sequence is then organized according to the time sequence. Each operation instruction in the optimal operation instruction sequence is sent sequentially to the input interface of the legacy system and the response status is monitored. Based on the response status, the instruction execution status is determined and the operation instructions are repeatedly triggered until the optimal operation instruction sequence is completed.

7. A non-intrusive mimicry interaction system for legacy systems based on screen pixel recognition, for implementing the method of any one of claims 1 to 6, characterized by comprising: a parameter construction unit, configured to collect screen pixel stream data of a legacy system and extract spatial features and operation behavior sequences of interface interaction elements, to construct an interface element knowledge graph based on the operation behavior sequences and the spatial features and store the coordinate range of interactive hot zones, and to generate an anthropomorphic operation sequence based on the interface element knowledge graph and task requirements and configure hot zone positioning parameters and timing control parameters; a difference update unit, configured to continuously collect screen pixel stream data and calculate a pixel difference value, and, if the pixel difference value exceeds a preset change threshold, to extract the pixel features of the changed area, match them for similarity against the interface element knowledge graph to obtain the current spatial features, calculate the position offset, and geometrically transform and update the coordinate range of the interactive hot zone based on the position offset to obtain an optimized hot zone coordinate range;
an instruction execution unit, configured to synchronize the optimized hot zone coordinate range to the anthropomorphic operation sequence to obtain an optimized operation sequence, to sample the optimized hot zone coordinate range according to the hot zone positioning parameters to obtain target coordinates and determine the operation duration in combination with the timing control parameters, to generate an initial operation instruction based on the target coordinates and the operation duration and execute it, to collect resource occupancy indicators in real time, and, if the resource occupancy exceeds a preset occupancy threshold, to calculate the over-limit amplitude and correct the timing control parameters in the optimized operation sequence to obtain corrected control parameters; and a parameter optimization unit, configured to determine the duration of subsequent operations based on the corrected control parameters, solve for the optimal operation instruction sequence, and output and execute the optimal operation instruction sequence.
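
As a non-limiting illustration, the threshold test and hot zone update performed by the difference update unit can be sketched as follows. The sketch assumes grayscale frames given as nested lists, a mean absolute difference as the pixel difference value, and a pure translation as the geometric transformation; none of these choices is fixed by the claim. The position offset, which the claim derives from similarity matching against the knowledge graph, is simply passed in by the caller here. All function names are hypothetical.

```python
def mean_abs_diff(prev_frame, curr_frame):
    """Mean absolute per-pixel difference between two equally sized frames."""
    total = 0
    count = 0
    for prev_row, curr_row in zip(prev_frame, curr_frame):
        for p, c in zip(prev_row, curr_row):
            total += abs(p - c)
            count += 1
    return total / count if count else 0.0


def shift_hot_zone(zone, offset):
    """Translate a hot zone (x0, y0, x1, y1) by a position offset (dx, dy)."""
    x0, y0, x1, y1 = zone
    dx, dy = offset
    return (x0 + dx, y0 + dy, x1 + dx, y1 + dy)


def update_hot_zone(prev_frame, curr_frame, zone, offset, change_threshold=10.0):
    """Return the shifted zone only when the frame change exceeds the threshold."""
    if mean_abs_diff(prev_frame, curr_frame) > change_threshold:
        return shift_hot_zone(zone, offset)
    return zone


prev_frame = [[0, 0], [0, 0]]
curr_frame = [[100, 100], [0, 0]]
new_zone = update_hot_zone(prev_frame, curr_frame, (10, 10, 50, 30), (5, -3))
```

Gating the update on the difference value keeps the stored hot zone coordinates stable while the interface is static and avoids re-matching on every frame.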

8. An electronic device, comprising: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to invoke the instructions stored in the memory to execute the method according to any one of claims 1 to 6.

9. A computer-readable storage medium having computer program instructions stored thereon, characterized in that, when executed by a processor, the computer program instructions implement the method according to any one of claims 1 to 6.