Game control method and device, electronic equipment and storage medium

By extracting and understanding the text descriptions of image blocks from game images, the system automatically determines operation commands, solving the problem of users being unable to operate the game for a short period of time and improving the efficiency of automatic control in the game.

CN122183142A · Pending · Publication Date: 2026-06-12 · TENCENT DIGITAL (SHENZHEN) CO LTD +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: TENCENT DIGITAL (SHENZHEN) CO LTD
Filing Date: 2024-12-10
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

In existing technologies, a game cannot continue to be operated while the user is temporarily unable to control it, so the need for automatic game control goes unmet.

Method used

By acquiring game images, extracting image blocks of game elements and performing semantic understanding, text description statements are generated, and operation instructions are determined based on these statements to automatically control the game.

Benefits of technology

It enables accurate understanding of the game state and automatic operation without relying on manual user input, thus improving game response efficiency.



Abstract

The application relates to the technical field of computers, and provides a game control method and device, electronic equipment and a storage medium, which comprises the following steps: acquiring a game image collected in a running process of a game application; intercepting image blocks of multiple game elements from the game image; performing semantic understanding on the image blocks of the game elements to obtain text description sentences of the image blocks; determining an operation instruction matched with a game state represented by the game image according to the text description sentences of the multiple image blocks in the game image; and performing simulated operation on the game application according to the operation instruction. The scheme can automatically understand the content in the game image, automatically determine an operation instruction suitable for the game state represented by the current game image, and automatically operate the game application.


Description

Technical Field

[0001] This application relates to the field of computer technology, specifically to a game control method, device, electronic device, and storage medium.

Background Technology

[0002] With the development of computer technology, game applications have become increasingly sophisticated. Typically, game applications require user control, meaning they respond to user actions. In practice, however, users may be temporarily unable to operate the game. How to achieve automatic game control is therefore a pressing technical problem in related technologies.

Summary of the Invention

[0003] In view of this, embodiments of this application propose a game control method, device, electronic device, and storage medium to achieve automatic game control.

[0004] The embodiments of this application are implemented using the following technical solutions:

[0005] In a first aspect, embodiments of this application provide a game control method, comprising: acquiring a game image captured during the operation of a game application; extracting multiple image blocks of game elements from the game image; performing semantic understanding on each image block of the game elements to obtain a text description statement for each image block; determining an operation instruction adapted to the game state represented by the game image based on the text description statements of the multiple image blocks in the game image; and performing simulated operation on the game application according to the operation instruction.
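The five steps of the first aspect can be sketched as a single control pass. This is only an illustrative outline, not the implementation of this application: the names `control_step`, `detect`, `describe`, `decide`, and `simulate` are hypothetical placeholders for the detection, semantic understanding, instruction determination, and simulated operation stages, and the image is modeled as a plain list of pixel rows.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
Image = List[List[int]]          # toy grayscale image: rows of pixel values


def crop(image: Image, box: Box) -> Image:
    """Cut the pixel region covered by `box` out of `image`."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def control_step(
    image: Image,
    detect: Callable[[Image], Sequence[Box]],
    describe: Callable[[Image], str],
    decide: Callable[[Sequence[str]], str],
    simulate: Callable[[str], None],
) -> str:
    """Run one pass of the method on an already acquired game image.

    All four callables are hypothetical stand-ins supplied by the caller.
    """
    blocks = [crop(image, box) for box in detect(image)]  # extract image blocks
    statements = [describe(block) for block in blocks]    # semantic understanding
    instruction = decide(statements)                      # match the game state
    simulate(instruction)                                 # simulated operation
    return instruction
```

Passing the stages in as callables keeps the sketch independent of any particular detector or language model; only the order of the steps is taken from the text.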

[0006] In a second aspect, embodiments of this application provide a game control device, comprising: an acquisition module for acquiring game images captured during the operation of a game application; a cropping module for cropping image blocks of multiple game elements from the game images; a semantic understanding module for performing semantic understanding on the image blocks of each game element to obtain text description statements for each image block; an operation instruction determination module for determining operation instructions adapted to the game state represented by the game images based on the text description statements of the multiple image blocks in the game images; and a simulation operation module for performing simulated operation on the game application according to the operation instructions.

[0007] In some embodiments, the game control device further includes: a data acquisition module for acquiring game response data of the game application in response to the operation command; and a test result determination module for determining the test result of the game application based on the game response data and benchmark response data corresponding to the game state represented by the game image and the operation command.

[0008] In some embodiments, the operation instruction determination module includes: a game state determination unit, configured to determine the game state represented by the game image based on text description statements of multiple image blocks in the game image; and an operation instruction determination unit, configured to determine an operation instruction adapted to the game state represented by the game image based on the game state represented by the game image.

[0009] In some embodiments, the plurality of game elements includes at least a main virtual character; the game state determination unit includes: a first determination unit, configured to determine, from the text description statements of a plurality of image blocks in the game image, a text description statement related to the main virtual character; a first state determination unit, configured to determine, based on the text description statement related to the main virtual character, the main virtual character's own state in the game image; a second state determination unit, configured to determine, based on the text description statements of the plurality of image blocks other than the text description statement related to the main virtual character, the game environment state of the main virtual character in the game image; and a third state determination unit, configured to determine, based on the main virtual character's own state in the game image and the game environment state of the main virtual character in the game image, the game state represented by the game image.

[0010] In some embodiments, the first determining unit includes: an element descriptor acquisition unit, configured to acquire element descriptors for each of the image blocks; a target image block determining unit, configured to determine, among the plurality of image blocks, a target image block whose element descriptor represents the main virtual character; and a target determining unit, configured to use the text description statement corresponding to the target image block as a text description statement related to the main virtual character.
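As a rough illustration of how element descriptors can separate the main virtual character's statements from the rest, the sketch below partitions (descriptor, statement) pairs into an own-state part and an environment part. The `GameState` container, the function name `split_statements`, and the descriptor string `"main character"` are assumptions made for the example; the text does not fix any of them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GameState:
    own_state: List[str]          # statements about the main virtual character
    environment_state: List[str]  # statements about all other game elements


def split_statements(
    described_blocks: List[Tuple[str, str]],
    main_descriptor: str = "main character",  # hypothetical descriptor value
) -> GameState:
    """Partition (element descriptor, text statement) pairs by target element."""
    own = [s for d, s in described_blocks if d == main_descriptor]
    env = [s for d, s in described_blocks if d != main_descriptor]
    return GameState(own_state=own, environment_state=env)
```

In this sketch the combined game state is simply the pair of lists; a real system would presumably interpret the statements further before choosing an operation instruction.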

[0011] In some embodiments, the element descriptor acquisition unit is configured to: compress the text description statements of each image block to obtain element descriptors for each image block.

[0012] In some embodiments, the element descriptor acquisition unit is further configured to: compress the text description statements of each image block by guiding the large language model through compression guidance text to obtain the element descriptors of each image block.
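A minimal sketch of compression guided by a guidance text is shown below. The wording of `COMPRESSION_GUIDANCE` is invented for illustration, since the text does not give the guidance text, and `llm` is a hypothetical callable that maps a prompt string to a reply string rather than any specific model API.

```python
from typing import Callable

# Hypothetical guidance text; the source does not specify its wording.
COMPRESSION_GUIDANCE = (
    "Compress the following description of a game element into a short "
    "descriptor of at most three words. Reply with the descriptor only.\n\n"
    "Description: {statement}"
)


def build_prompt(statement: str) -> str:
    """Insert one text description statement into the guidance text."""
    return COMPRESSION_GUIDANCE.format(statement=statement)


def element_descriptor(statement: str, llm: Callable[[str], str]) -> str:
    """Ask a language model (any prompt -> reply callable) for a descriptor."""
    reply = llm(build_prompt(statement))
    return reply.strip().lower()  # normalize so descriptors compare reliably
```

Normalizing the reply is a design choice of the sketch: later steps compare descriptors for equality, so stray whitespace or capitalization would otherwise cause mismatches.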

[0013] In other embodiments, the element descriptor acquisition unit is further configured to: acquire image features of each of the image blocks;

[0014] match the image features of each image block against a feature library to determine the target image features that match the image features of the image block; and

[0015] use the element descriptors associated with the target image features in the feature library as the element descriptors of the corresponding image blocks.
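The feature library lookup can be illustrated with a nearest-neighbor search. The text does not say how features are matched, so the use of cosine similarity, the `min_similarity` cutoff of 0.8, and the function names here are assumptions for the example only.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_descriptor(
    feature: Sequence[float],
    library: List[Tuple[Sequence[float], str]],
    min_similarity: float = 0.8,  # assumed cutoff, not from the source
) -> Optional[str]:
    """Return the descriptor of the most similar library feature, if any."""
    best_descriptor, best_similarity = None, min_similarity
    for library_feature, descriptor in library:
        similarity = cosine_similarity(feature, library_feature)
        if similarity >= best_similarity:
            best_descriptor, best_similarity = descriptor, similarity
    return best_descriptor
```

Returning `None` when nothing is similar enough lets the caller fall back to another descriptor source, such as compressing the text description statement.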

[0016] In some embodiments, the cropping module includes: a target detection unit, configured to perform target detection on the game image to obtain a target detection result; the target detection result includes pixel position information of each game element in the game image and semantic category of each game element; a first cropping unit, configured to crop image blocks of each game element from the game image based on the pixel position information of each game element in the game image; and correspondingly, an element descriptor acquisition unit, configured to: use the semantic category of each game element as the element descriptor of the image block where the game element is located.

[0017] In some embodiments, the operation instruction determination module includes: a use case determination unit, configured to determine, from the game test case set of the game application, a target game test case that matches the game state; and an operation instruction determination unit, configured to acquire operation instructions from the target game test case as operation instructions that match the game state represented by the game image; the game control device further includes: a benchmark response data acquisition module, configured to acquire benchmark response data from the target game test case as benchmark response data corresponding to the game state represented by the game image and the operation instructions.

[0018] In other embodiments, the cropping module includes: a center point coordinate acquisition unit, used to acquire the center point coordinates of each game element in the game image; a size prediction unit, used to predict the size of the detection box based on the center point coordinates and the game image using N different target size detection networks, to obtain N size prediction results corresponding to each center point coordinate, wherein the size prediction results include predicted size information and prediction probability; N is a positive integer greater than 1; a probability threshold determination unit, used to determine a reference probability threshold corresponding to each center point coordinate based on the prediction probability among the N size prediction results for the same center point coordinate; a target result determination unit, used to determine the target size prediction result with the highest prediction probability exceeding the corresponding reference probability threshold among the N size prediction results for the same center point coordinate based on the reference probability threshold corresponding to each center point coordinate; and a second cropping unit, used to crop the game image according to the predicted size information in the target size prediction results corresponding to each center point coordinate and the corresponding center point coordinate, to obtain an image block of the game element corresponding to the center point coordinate.
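The selection among the N size prediction results can be sketched for one center point as below. The text says the reference probability threshold is derived from the N prediction probabilities but not how, so taking their mean is purely an assumption of this example, as are the function names.

```python
from typing import List, Optional, Tuple

# One result per size detection network: ((width, height), probability)
SizePrediction = Tuple[Tuple[int, int], float]


def reference_threshold(predictions: List[SizePrediction]) -> float:
    """Assumed rule: the mean of the N prediction probabilities."""
    return sum(prob for _, prob in predictions) / len(predictions)


def select_target(predictions: List[SizePrediction]) -> Optional[SizePrediction]:
    """Highest-probability result among those above the reference threshold."""
    threshold = reference_threshold(predictions)
    above = [pred for pred in predictions if pred[1] > threshold]
    return max(above, key=lambda pred: pred[1]) if above else None


def box_from_center(center: Tuple[int, int], size: Tuple[int, int]) -> Tuple[int, int, int, int]:
    """Detection box (left, top, right, bottom) centered on `center`."""
    cx, cy = center
    width, height = size
    left, top = cx - width // 2, cy - height // 2
    return (left, top, left + width, top + height)
```

With a mean threshold, `select_target` returns `None` when all N probabilities are equal, so a caller using this particular rule would need a fallback for that case.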

[0019] In some embodiments, the game image presents at least two game characters, and the game control device further includes: an image block set determination module, configured to aggregate image blocks representing the same game character from N consecutively acquired game images according to the element descriptors of each image block, to obtain an image block set corresponding to each game character; N is an integer greater than 1; and a trajectory generation module, configured to perform trajectory fitting for each game character according to the position points corresponding to each image block in the corresponding image block set, in the order of the acquisition time of the source game images from first to last, to obtain the movement trajectory of the game character.
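Aggregation by element descriptor followed by trajectory fitting can be illustrated as follows. The text does not name a fitting method, so this sketch assumes a straight-line least-squares fit of x and y against acquisition time; the `Observation` layout and function names are likewise hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]
Observation = Tuple[float, str, Point]  # (acquisition time, descriptor, position)


def group_by_character(observations: List[Observation]) -> Dict[str, List[Tuple[float, Point]]]:
    """Collect position points per descriptor, ordered from first to last."""
    groups: Dict[str, List[Tuple[float, Point]]] = defaultdict(list)
    for time, descriptor, point in observations:
        groups[descriptor].append((time, point))
    return {name: sorted(items) for name, items in groups.items()}


def fit_line(ts: Sequence[float], vs: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit v = slope * t + intercept."""
    n = len(ts)
    mean_t, mean_v = sum(ts) / n, sum(vs) / n
    var_t = sum((t - mean_t) ** 2 for t in ts)
    if var_t == 0:
        return 0.0, mean_v  # a single time stamp: no motion can be inferred
    slope = sum((t - mean_t) * (v - mean_v) for t, v in zip(ts, vs)) / var_t
    return slope, mean_v - slope * mean_t


def fit_trajectories(observations: List[Observation]) -> Dict[str, Tuple[Tuple[float, float], Tuple[float, float]]]:
    """Per character: ((x slope, x intercept), (y slope, y intercept))."""
    result = {}
    for name, items in group_by_character(observations).items():
        ts = [t for t, _ in items]
        xs = [p[0] for _, p in items]
        ys = [p[1] for _, p in items]
        result[name] = (fit_line(ts, xs), fit_line(ts, ys))
    return result
```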

[0020] In some embodiments, the semantic understanding module includes: an image encoding unit, configured to encode the image blocks of each game element to obtain image encoding features of each image block; and a text decoding unit, configured to decode the text based on the image encoding features of each image block to obtain a text description statement of each image block.

[0021] In a third aspect, embodiments of this application provide an electronic device, including: a processor; and a memory storing computer instructions, which, when executed by the processor, implement the above-described method.

[0022] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the above-described method.

[0023] In a fifth aspect, embodiments of this application provide a computer program product, including computer instructions that, when executed by a processor, implement the above-described method.

[0024] In this application, after extracting multiple image blocks representing game elements from a game image, semantic understanding is performed on each image block to obtain a text description statement for each image block. Since the text description statements of the image blocks can describe the relevant states and information of the game elements presented on the image blocks, the game state presented by the game image can be accurately understood using the text description statements of multiple image blocks. This enables automatic understanding of the content in the game image and automatic determination of operation instructions applicable to the game state represented by the current game image, thus achieving automatic game control without relying on manual user operation. Furthermore, game images contain a large number of background pixels, which contribute little to understanding the game state represented by the game image. Therefore, extracting image blocks representing game elements from the game image for semantic understanding, rather than performing semantic understanding on the entire game image, improves the efficiency of semantic understanding and thus enhances the response efficiency of the game application.

[0025] These or other aspects of this application will become more apparent in the following description of the embodiments.

Description of the Drawings

[0026] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0027] Figure 1A is a schematic diagram illustrating the implementation environment of this application according to an embodiment of this application.

[0028] Figure 1B is a schematic diagram illustrating the implementation environment of this application according to an embodiment of this application.

[0029] Figure 2 is a flowchart illustrating a game control method according to an embodiment of this application.

[0030] Figure 3 is a flowchart illustrating the training process of an object detection model according to an embodiment of this application.

[0031] Figure 4 is a schematic diagram illustrating the extraction of image blocks from a game image according to an embodiment of this application.

[0032] Figure 5A is a flowchart illustrating step 240 according to an embodiment of this application.

[0033] Figure 5B is a flowchart illustrating a game control method according to another embodiment of this application.

[0034] Figure 6 is a flowchart illustrating step 220 according to an embodiment of this application.

[0035] Figure 7 is a flowchart illustrating step 510 according to an embodiment of this application.

[0036] Figure 8 is a flowchart illustrating a game control method according to another embodiment of this application.

[0037] Figure 9 is a flowchart illustrating a game control method according to another embodiment of this application.

[0038] Figure 10 is a block diagram of a game control device according to an embodiment of this application.

[0039] Figure 11 is a schematic diagram of the structure of a computer system suitable for implementing the electronic device of this application.

Detailed Implementation

[0040] The embodiments of this application are described in detail below. Examples of the embodiments are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and are only used to explain this application, and should not be construed as limiting this application.

[0041] To enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are merely some embodiments of the present application, and not all embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without creative effort are within the scope of protection of the present application.

[0042] In the following description, the terms "first" and "second" are used merely to distinguish similar objects and do not represent a specific ordering of objects. It is understood that "first" and "second" may be interchanged in a specific order or sequence where permitted, so that the embodiments of this application described herein can be implemented in an order other than that illustrated or described herein.

[0043] In this document, "multiple" refers to two or more. "And / or" describes the relationship between associated objects, indicating that three relationships can exist. For example, A and / or B can represent: A alone, A and B simultaneously, or B alone. The character " / " generally indicates that the preceding and following associated objects are in an "or" relationship. In the following description, references to "some embodiments or some embodiment methods" describe a subset of all possible embodiments. However, it is understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with each other without conflict.

[0044] Figure 1A is a schematic diagram illustrating the implementation environment of this application according to an embodiment of this application. As shown in Figure 1A, the implementation environment includes a terminal 110 and a first electronic device 120, which is communicatively connected to the terminal 110. A game application client can run on the terminal 110, and the first electronic device 120 provides services to the game application. The terminal 110 can be used to execute the methods provided in this application. As shown in Figure 1A, the terminal can capture game images in real time, then extract image blocks of multiple game elements from the game images, perform semantic understanding on each image block, and generate a text description statement for each image block. Subsequently, by combining the text description statements of the multiple image blocks, the game state represented by the game image is determined, and an operation instruction adapted to this game state is determined. A simulated operation is then performed on the game application according to the determined operation instruction, which achieves automatic game control.

[0045] Figure 1B is a schematic diagram illustrating the implementation environment of this application according to an embodiment of this application. As shown in Figure 1B, the implementation environment includes a terminal 110, a first electronic device 120, and a second electronic device 130. The terminal 110 is communicatively connected to both the first electronic device 120 and the second electronic device 130. A test environment can be deployed on the first electronic device 120, and the server program of the game application to be tested can be run in the test environment provided by the first electronic device 120. The terminal 110 can also run the game client of the game application to be tested and display the game interface of the game application.

[0046] The second electronic device 130 can be used to perform the methods provided in this application. As shown in Figure 1B, the second electronic device can acquire game images captured on the terminal, then extract multiple image blocks of game elements from the game images, perform semantic understanding on each image block, and generate text description statements for each image block. Subsequently, by combining the text description statements of multiple image blocks, the game state represented by the game image is determined, and operation instructions adapted to the game state represented by the game image are determined. A simulated operation is then performed on the game application according to the determined operation instructions, and after the simulated operation, response data of the game application in response to the simulated operation is collected from the terminal 110 and / or the first electronic device 120. The second electronic device 130 can determine the test results of the game application based on the response data and the benchmark response data corresponding to the game state represented by the game image and the operation instructions.

[0047] Terminal 110 can be a smartphone, tablet, laptop, desktop computer, smart TV, smartwatch, in-vehicle terminal, virtual interaction device, etc., but is not limited to these.

[0048] The first electronic device 120 and the second electronic device 130 can be electronic devices with certain computing capabilities, such as servers. The server can be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDNs), and big data and artificial intelligence platforms.

[0049] In some embodiments, the method provided in this application can also be executed by the first electronic device 120, and the game application can be tested through the interaction between the terminal 110 and the first electronic device 120. This eliminates the need to deploy a second electronic device.

[0050] In some embodiments, the method provided in this application can also be executed by the terminal 110, and the game application can be tested through the interaction between the terminal 110 and the first electronic device 120.

[0051] Figure 2 is a flowchart illustrating a game control method according to an embodiment of this application. The method includes:

[0052] Step 210: Acquire game images captured during the operation of the game application.

[0053] The game image can be captured in real time by the terminal running the game application.

Step 220: Extract image blocks of multiple game elements from the game image.

[0054] Game elements refer to key visual objects presented on the game interface of a game application. Examples include virtual characters and virtual items used by these characters. Virtual characters can be divided into non-player characters and player-controllable characters. Player-controllable characters include the main virtual character controlled by the current terminal. Depending on the interaction with the main virtual character, game characters interacting with it can include either teammate virtual characters (allies) or opponent virtual characters (opponents). Virtual characters can also include those playing a supporting role or serving as background elements within the game application. Game elements can also include controllable gestures on the game application interface, as well as pop-ups and other visual elements. A single image block represents a game element; in other words, an image block of a game element refers to the pixel area containing the game element.

[0055] In some embodiments, step 220 includes the following (1)-(2):

[0056] (1) Perform object detection on the game image to obtain the object detection results; the object detection results include the pixel position information of each game element in the game image and the semantic category of each game element.

[0057] Object detection models can be used to detect objects in game images. In this embodiment, the objects to be detected by the object detection model are game elements in the game image. An object detection model refers to a neural network model used to identify the location and semantic category of an object. The semantic category of an object can refer to the name of the object or the category to which the object belongs. The object detection model can be constructed using one or more neural networks, such as convolutional neural networks, pooling neural networks, fully connected networks, etc., without specific limitations here.

[0058] The object detection model involves two recognition tasks: object location recognition and object semantic category recognition. The object detection model can include a branch network corresponding to each recognition task, such as a location recognition branch network for recognizing the object's location and a semantic category recognition branch network for recognizing the semantic category.

[0059] In addition, the object detection model also includes an image feature extraction network, which extracts image features from the game image to obtain the image features of the game image; the image features of the game image are input into the position recognition branch network, which performs position regression based on the image features of the game image and outputs the pixel position information of each game element; the image features of the game image are input into the semantic category recognition branch network, which classifies based on the image features of the game image and outputs the semantic category of each game element.

[0060] It is understandable that if the position recognition branch network outputs the pixel position information of K game elements for the input game image, the corresponding semantic category recognition branch network outputs the semantic category of K game elements for the input game image, where K is a positive integer.

[0061] In some embodiments, the pixel position information of a game element can be determined by the position information of the center point of the detection box surrounding the game element in the game image and the size information of the detection box. The position of the detection box surrounding a game element in the game image can be approximated as the pixel position information of the game element surrounded by the detection box.

[0062] To ensure the accuracy of the object detection model in identifying objects from game images captured from game applications, it needs to be trained beforehand using initial training data. This initial training data includes multiple training samples. Each training sample consists of a sample game image captured during the game application's operation and its annotation information. The annotation information includes the location information of game elements within the sample game image and their semantic category information. The semantic category information of a game element can be its name (e.g., the name of a game character or virtual item). The annotation information for the sample game images can be determined manually by annotators.

[0063] Figure 3 is a flowchart illustrating the training process of an object detection model according to an embodiment of this application. As shown in Figure 3, the process includes:

[0064] Step 310: The image feature extraction network in the object detection model extracts image features from the sample game image to obtain the sample image features of the sample game image.

[0065] Step 320: The position recognition branch network in the object detection model performs position regression based on the image features of the sample game image to obtain the position recognition result of the sample game image. The position recognition result includes the pixel position information of each game element identified from the sample game image.

[0066] Step 330: The semantic category recognition branch network in the object detection model performs semantic classification based on the image features of the sample game image to obtain the semantic classification result of the sample game image. The semantic classification result includes the predicted semantic category of each game element identified from the sample game image.

[0067] Step 340: Calculate the location prediction loss based on the labeled location information of the sample game images and the location recognition results of the sample game images.

[0068] The location prediction loss can be calculated by applying a first loss function to the labeled location information and the location recognition results of the sample game image. The first loss function can be the absolute value loss function, the mean squared error loss function, the intersection-over-union (IoU) loss function, etc.; no specific limitation is made here.

[0069] Step 350: Calculate the semantic classification loss based on the labeled semantic category information of the sample game images and the semantic classification results of the sample game images.

[0070] The semantic classification loss can be calculated by applying the second loss function to the labeled semantic category information of the sample game images and the semantic classification results of the sample game images. The second loss function can be the absolute value loss function, the mean squared error loss function, the cross-entropy loss function, etc. The second loss function can be different from or the same as the first loss function; no specific restrictions are imposed here.

[0071] Step 360: Compute the object detection loss as a weighted sum of the location prediction loss and the semantic classification loss. The weighting coefficients of the location prediction loss and the semantic classification loss can be set according to actual needs; for example, both can be 0.5, or they can differ.

[0072] Step 370: Adjust the parameters of the object detection model according to the object detection loss until the training termination condition is met.

[0073] The training termination condition can be either reaching a threshold number of iterations for the object detection model or the convergence of the object detection loss; no specific limitation is made here. Alternatively, after training the object detection model for a period of time, the object detection accuracy of the model can be tested. If the detection accuracy exceeds an accuracy threshold, the training termination condition is considered met.
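The loss computation in steps 340 to 360 can be illustrated for a single game element as below. This sketch picks the absolute value loss for location and the cross-entropy loss for classification from the options the text lists; the function names and the box and probability layouts are assumptions, and a real model would compute these over batches with an automatic differentiation framework.

```python
import math
from typing import Sequence


def location_loss(pred_box: Sequence[float], true_box: Sequence[float]) -> float:
    """Mean absolute error between predicted and labeled box coordinates."""
    return sum(abs(p - t) for p, t in zip(pred_box, true_box)) / len(true_box)


def classification_loss(probs: Sequence[float], true_index: int) -> float:
    """Cross-entropy for one element: -log of the labeled category's probability."""
    return -math.log(max(probs[true_index], 1e-12))  # clamp to avoid log(0)


def detection_loss(
    pred_box: Sequence[float],
    true_box: Sequence[float],
    probs: Sequence[float],
    true_index: int,
    w_loc: float = 0.5,  # weighting coefficients; 0.5 each as in the example above
    w_cls: float = 0.5,
) -> float:
    """Weighted sum of the location prediction and semantic classification losses."""
    return (w_loc * location_loss(pred_box, true_box)
            + w_cls * classification_loss(probs, true_index))
```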

[0074] Different game applications involve different game elements, such as different game characters and virtual items. In the above embodiment, sample game images from the game application to be tested and corresponding annotation information are used to train the object detection model. This allows the object detection model to learn the features of each game element in the game application to be tested during the training process. This ensures the adaptability of the object detection model to object detection of game images from the game application to be tested after training, and can accurately identify game elements in game images from the game application to be tested, thus ensuring the accuracy of the detection results.

[0075] (2) Based on the pixel position information of each game element in the game image, extract the image block of each game element from the game image.

[0076] The pixel position information of a game element in the game image indicates the pixel boundary of the game element. Therefore, the game image can be cropped according to the pixel boundary indicated by the pixel position information to obtain the image blocks of each game element.
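The cropping described above can be sketched as follows. The sketch assumes the game image is a row-major grid of pixel values and that each pixel boundary is given as (left, top, right, bottom) with exclusive right and bottom edges; these representations and the function name are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def crop_blocks(image: Sequence[Sequence[int]],
                boxes: Sequence[Box]) -> List[List[List[int]]]:
    """Cut one image block out of `image` for each pixel boundary in `boxes`."""
    height = len(image)
    width = len(image[0]) if height else 0
    blocks = []
    for left, top, right, bottom in boxes:
        # Clamp the boundary so that a box touching the edge stays valid.
        left, right = max(0, left), min(width, right)
        top, bottom = max(0, top), min(height, bottom)
        blocks.append([list(row[left:right]) for row in image[top:bottom]])
    return blocks


# A 3x4 toy image whose pixel value encodes (row, column).
image = [[r * 10 + c for c in range(4)] for r in range(3)]
blocks = crop_blocks(image, [(1, 0, 3, 2)])
```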

[0077] Figure 4 is a schematic diagram illustrating the extraction of image blocks from a game image according to an embodiment of this application. As shown in Figure 4, the game image 400 presents multiple virtual characters; by way of example, Figure 4 shows a first virtual character 410, a second virtual character 420, a third virtual character 430, and a fourth virtual character 440. The dashed box surrounding each virtual character in Figure 4 is the detection box determined for that virtual character. Therefore, based on the position of the detection box corresponding to each virtual character in the game image, the part enclosed by the detection box can be extracted from the game image; in the example of Figure 4, this yields four image blocks.

[0078] Since virtual characters and virtual items in game applications are generated through 3D modeling in the game engine, the game engine stores the model data of the 3D models of virtual characters and virtual items. From the corresponding model data, the center point coordinates of the 3D model of the virtual character and the center point coordinates of the 3D model of the virtual item can be obtained. The center point coordinates of the virtual character in the game image are equivalent to projecting the center point coordinates of the 3D model of the virtual character onto the screen coordinate system. Similarly, the center point coordinates of the virtual item in the game image are equivalent to projecting the center point coordinates of the 3D model of the virtual item onto the screen coordinate system.
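The projection of a 3D model center point onto the screen coordinate system can be illustrated with a simple pinhole camera model. This is a sketch under strong assumptions: the point is already expressed in camera space (camera at the origin, looking along +z, y pointing up), the screen origin is the top-left corner, and the focal length is given in pixels. A real game engine performs this projection through its own camera interface.

```python
from typing import Tuple


def project_to_screen(point: Tuple[float, float, float],
                      focal_length: float,
                      screen_width: int,
                      screen_height: int) -> Tuple[float, float]:
    """Project a camera-space 3D point to pixel coordinates (pinhole model)."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    u = screen_width / 2 + focal_length * x / z
    # Screen y grows downward, camera-space y grows upward.
    v = screen_height / 2 - focal_length * y / z
    return (u, v)


# A point on the optical axis lands at the center of the screen.
center = project_to_screen((0.0, 0.0, 5.0), 800.0, 1920, 1080)
```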

[0079] Furthermore, related technologies provide plugins that can automatically calculate the center point coordinates of a virtual object (a virtual character or a virtual item) in the screen coordinate system based on the model data of the virtual object in the game engine. Such plugins can therefore automatically obtain the center point coordinates, in the game image, of game elements that represent virtual objects. Examples of such plugins include the UnityEditor.Tools class. Alternatively, the center point coordinates of the virtual object's 3D model can be obtained from the model data of the virtual object through the calling interface provided by the game engine. These center point coordinates can then be projected onto the screen coordinate system to obtain the center point coordinates of the virtual object in the game image. In addition, for control elements on the game interface, plugins for obtaining a control's center point coordinates can be used to obtain those coordinates automatically. Examples of such plugins include the NGUI plugin.

[0080] Therefore, in some embodiments, the center point coordinates of each game element in the game image can be obtained by calling a plugin, and then the size detection network can predict the size of the detection box of each game element based on the game image and the center point coordinates of each game element in the game image.

[0081] Size detection networks can be neural networks constructed using convolutional neural networks, fully connected networks, etc. Based on an input image and input center point coordinates, a size detection network outputs the size of a detection box that is centered on the input center point coordinates and encloses a target in the image.

[0082] The size detection network includes an image feature extraction network and a detection box size recognition branch network. The image feature extraction network extracts features from the game image to obtain the image features of the game image. Then, the detection box size recognition branch network predicts the size of the detection box based on the image features of the game image and the center point coordinates of each game element in the game image, and outputs the size of the detection box corresponding to each game element, such as the height and width of a rectangular detection box.

[0083] If the detection box of a game element is a rectangle, the size detection network can output the width and height of the detection box for each game element. Then, by combining the center point coordinates of the game element with the width and height of its detection box, the pixel position information of the game element in the game image can be determined. For example, if the center point coordinates of a game element in the game image are $(x_{\text{center}}, y_{\text{center}})$, and the width and height of the detection box output by the size detection network for this game element are $W$ and $H$, then the coordinates of the top-left vertex of the detection box are $(x_{\text{center}} - W/2,\ y_{\text{center}} - H/2)$, and the coordinates of the bottom-right vertex are $(x_{\text{center}} + W/2,\ y_{\text{center}} + H/2)$. The position information of the rectangle formed from the top-left vertex to the bottom-right vertex of the detection box is the pixel position information of the game element in the game image.
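The vertex calculation in the example above can be written directly as a small function; the function name and the (left, top, right, bottom) return layout are illustrative choices.

```python
from typing import Tuple


def detection_box(x_center: float, y_center: float,
                  width: float, height: float) -> Tuple[float, float, float, float]:
    """Return (left, top, right, bottom) of a rectangle centered on the point."""
    left = x_center - width / 2
    top = y_center - height / 2
    right = x_center + width / 2
    bottom = y_center + height / 2
    return (left, top, right, bottom)


# Center (100, 60) with W = 40 and H = 20.
box = detection_box(100, 60, 40, 20)
```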

[0084] Step 230: Perform semantic understanding on the image blocks of each game element to obtain the text description statements of each image block.

[0085] The text description statement of an image block is a natural language statement that describes the game element presented in the image block. For example, if the game element presented in an image block is a game character, the text description statement can describe the game character's form, actions, color, name, etc.; if the game element is a control, the text description statement can describe the control's name, color, etc.; if the game element is a virtual item, the text description statement can describe the virtual item's name, form, location, etc.

[0086] In some embodiments, considering that the operation controls in the game application may not change as the game interaction process progresses, in step 230, image blocks representing operation controls can be pre-filtered out, and then semantic understanding can be performed on image blocks representing other game elements besides operation controls. This can reduce the amount of computation for semantic understanding and improve the efficiency of game testing.

[0087] In some embodiments, if the pixel position information of game elements in the game image is determined by means of an object detection model in step 220, the semantic categories of the game elements in the object detection results output by the object detection model can be used to filter out image blocks representing operation controls. It is understood that the semantic category of a game element indicates whether the game element presented in the corresponding image block is an operation control or, more specifically, which operation control it is.
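The pre-filtering described above can be sketched as a simple selection over detected image blocks. The `ImageBlock` record, its field names, and the category label are hypothetical; an actual object detection model would define its own category set.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence


@dataclass(frozen=True)
class ImageBlock:
    element_id: str
    semantic_category: str  # category taken from the object detection result


# Hypothetical label used for operation controls.
CONTROL_CATEGORIES: FrozenSet[str] = frozenset({"operation_control"})


def blocks_to_describe(blocks: Sequence[ImageBlock],
                       control_categories: FrozenSet[str] = CONTROL_CATEGORIES
                       ) -> List[ImageBlock]:
    """Keep only blocks that are not operation controls."""
    return [b for b in blocks if b.semantic_category not in control_categories]


detected = [
    ImageBlock("e1", "virtual_character"),
    ImageBlock("e2", "operation_control"),
    ImageBlock("e3", "virtual_item"),
]
remaining = blocks_to_describe(detected)
```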

[0088] In some embodiments, step 230 includes the following steps A1 and A2:

[0089] Step A1: Image encoding is performed on the image blocks of each game element to obtain the image encoding features of each image block.

[0090] Image encoding of the image blocks of game elements can be performed using the image encoder in an image-to-text model. The image encoding feature of an image block can be a feature vector of a preset length that reflects the semantic information of the image block.

[0091] For example, if the image encoder is constructed using a convolutional neural network and a normalization processing network, the image encoding feature $h$ output by the image encoder for image block $I$ is $h = \text{CNN-Norm}(I)$, where CNN-Norm denotes the image encoder built from a convolutional neural network (CNN) followed by a normalization processing network (Norm). Of course, image encoders are not limited to those constructed using convolutional neural networks and normalization processing networks; they can also be constructed using other neural networks.

[0092] Step A2: Decode the text based on the image encoding features of each image block to obtain the text description statement for each image block.

[0093] The image encoding features of the image blocks can be decoded into text using the text decoder in the image-to-text model. The text decoder decodes the input features and outputs a text sequence. The text decoder can be constructed using recurrent neural networks (unidirectional or bidirectional), long short-term memory networks, etc. The text decoder outputs the text word by word: after outputting the first word, it decodes and outputs each subsequent word based on the input features and the words already decoded.

[0094] For example, if the text decoder is a recurrent neural network, then for the $t$-th word to be output, the text decoder determines the hidden state vector $h_t = \mathrm{RNN}(h_{t-1}, y_{t-1})$ from the hidden state vector $h_{t-1}$ determined for the $(t-1)$-th word and the $(t-1)$-th decoded output word $y_{t-1}$. Then, based on $h_t$, the probability that each word $a$ in the dictionary is the $t$-th word $y_t$ is calculated as $P(y_t = a \mid h_t, y_{t-1}) = \mathrm{softmax}(W h_t)_a$, where $W$ is the weight parameter of the text decoder and the subscript $a$ selects the entry for word $a$. The word with the highest probability is selected as the $t$-th word. This process continues, sequentially outputting each word in the text description statement until the end-of-text marker is output.
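The word-by-word decoding loop described above can be sketched as greedy decoding. The recurrent update and the scoring function are passed in as callables, standing in for RNN(·) and for the product of W with the hidden state; the toy `step` and `score` functions, the vocabulary, and the marker tokens are hypothetical and only make the loop runnable.

```python
import math
from typing import Callable, List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(step: Callable, score: Callable, h0, vocab: Sequence[str],
                  start: str = "<s>", end: str = "</s>",
                  max_len: int = 20) -> List[str]:
    """Pick the most probable word at each step until `end` is produced."""
    h, prev, words = h0, start, []
    for _ in range(max_len):
        h = step(h, prev)                      # h_t from h_{t-1} and y_{t-1}
        probs = softmax(score(h))              # distribution over the dictionary
        best = max(range(len(vocab)), key=probs.__getitem__)
        prev = vocab[best]
        if prev == end:
            break
        words.append(prev)
    return words


# Toy decoder: the hidden state is a step counter, and the scores favor
# "red", then "knight", then the end-of-text marker.
vocab = ["</s>", "red", "knight"]
order = [1, 2, 0]


def toy_step(h: int, prev_word: str) -> int:
    return h + 1


def toy_score(h: int) -> List[float]:
    scores = [0.0] * len(vocab)
    scores[order[min(h, len(order)) - 1]] = 5.0
    return scores


sentence = greedy_decode(toy_step, toy_score, 0, vocab)
```

Greedy selection matches the description of choosing the highest-probability word at each step; other strategies such as beam search are not covered by the text.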

[0095] In some embodiments, to ensure the effectiveness and accuracy of the text description statements generated by the image-to-text model for each game element in a game application, or to help the image-to-text model better understand the game elements in the game application, the image-to-text model can be pre-trained using second training data. The second training data includes multiple sample image blocks from the game application to be tested and a sample description statement for each sample image block. The sample description statements can be annotated by an annotator, and each sample image block can represent a game element. In some embodiments, the training of the image-to-text model can be divided into a first training phase and a second training phase, and the second training data is correspondingly divided into two parts: first-phase training data and second-phase training data. For ease of description, the sample image blocks in the first-phase training data are referred to as first sample image blocks, and the sample image blocks in the second-phase training data are referred to as second sample image blocks.

[0096] In the first training phase, some words in the sample description statement of the first sample image block can be masked beforehand to obtain a partially masked description statement of the first sample image block. Then, the image-to-text model can be trained as follows: the image encoder in the image-to-text model extracts image features from the first sample image block to obtain the corresponding sample image features; the text decoder predicts the masked words in the partially masked description statement of the first sample image block based on the sample image features and the embedding vector sequence of the partially masked description statement; the masking prediction loss is calculated based on the real words at the masked positions and the predicted masked words; and the parameters of the image-to-text model are adjusted based on the masking prediction loss until the first training termination condition is met.

[0097] After the first training phase, the second-phase training data can be used to continue training the image-to-text model. The image encoder in the image-to-text model extracts image features from the second sample image block to obtain the corresponding sample image features; the text decoder generates text based on the sample image features of the second sample image block to obtain the predicted text description statement for the second sample image block; the prediction loss is calculated based on the sample description statement and the predicted text description statement of the second sample image block; and the parameters of the image-to-text model are adjusted based on the prediction loss until the second training termination condition is met.

[0098] In the training process described above, the image-to-text model is first trained to predict the masked words in the partially masked description statement of the first sample image block, so that it can quickly learn how the states of the game elements in the game application are described and the characteristics of those descriptions. The image-to-text model is then trained with the second-phase training data to generate text description statements from image blocks, which can improve the training efficiency of the image-to-text model.
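The masking step of the first training phase can be sketched as follows. The masking ratio, the `[MASK]` token, and the random choice of positions are assumptions for illustration; the text only states that some words are masked.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple


def mask_description(words: Sequence[str],
                     mask_ratio: float = 0.3,
                     mask_token: str = "[MASK]",
                     seed: Optional[int] = None
                     ) -> Tuple[List[str], Dict[int, str]]:
    """Mask some words; return the masked sequence and the real words by position."""
    if not words:
        return [], {}
    rng = random.Random(seed)
    n_mask = min(len(words), max(1, round(len(words) * mask_ratio)))
    positions = sorted(rng.sample(range(len(words)), n_mask))
    masked = list(words)
    targets = {}
    for p in positions:
        targets[p] = masked[p]   # real word, used to compute the masking prediction loss
        masked[p] = mask_token
    return masked, targets


words = "the knight raises a red shield".split()
masked, targets = mask_description(words, mask_ratio=0.5, seed=0)
```

The `targets` mapping holds the real words at the masked positions, which is what the masking prediction loss compares against the predicted words.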

[0099] Step 240: Based on the text descriptions of multiple image blocks in the game image, determine the operation instructions that match the game state represented by the game image.

[0100] The game state represented by the game image refers to the state of the game process at the moment captured by the game image. For example, it indicates whether the game process is running normally. When the game process is paused because of a pop-up window or another prompt, it can be determined that the game state represented by the game image is a paused state.

[0101] In the event of an unexpected pop-up window appearing in the game interface, by following steps 220-230 above, an image block representing the pop-up window can be extracted from the game image, along with a corresponding text description statement representing the pop-up window. The text description statement of the image block can characterize the text content displayed in the pop-up window and the type of pop-up window. Therefore, based on the text description statement of the image block representing the pop-up window, it can be identified that the game state represented by the game image is that a pop-up window XX has appeared and the game process is paused.

[0102] If the game process is in a normal state, the game state represented by the game image can also be used to indicate the game state faced by the main virtual character controlled by the test terminal in the current game image. The game state represented by the game image can be reflected by the main virtual character's own state in the game image and the game environment state of the main virtual character in the game image.

[0103] The state of the main virtual character in the game image refers to the attribute state of the main virtual character in the game image. Attribute states include the main virtual character's energy value, skill state (such as skills that can be released), movement state (location, movement routes, etc.), and usable virtual items, etc.

[0104] The game environment state of the main virtual character in the game image refers to the influence attributes of other virtual characters in the game image on the main virtual character. For example, it may include the attribute state of the main virtual character's teammates (such as energy value, attack status, etc.), the number of teammates, the attribute state of the main virtual character's opponents, the number of opponents, and the distance between other virtual characters and the main virtual character.

[0105] In some embodiments, as shown in Figure 5, step 240 includes:

[0106] Step 510: Determine the game state represented by the game image based on the text description statements of multiple image blocks in the game image.

[0107] In some embodiments, considering that some pop-up windows appearing during the game may pause the game, it is possible to determine, based on the description objects (i.e., the game elements described) of the text description statements of the multiple image blocks, whether there is a pop-up window that causes the game to pause. If there is, the game state represented by the game image can be determined as follows: pop-up window XX has appeared and the game process is paused.

[0108] If it is determined, based on the description objects of the text description statements of the multiple image blocks, that there is no pop-up window that causes the game to pause, the game state faced by the main virtual character in the current game image can be determined based on the text description statements of the multiple image blocks in the game image and the main virtual character selected when entering the current target game level. Examples include whether the main virtual character is being attacked and whether it can continue to move forward.
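Step 510 can be sketched as a check over the description objects. The sketch assumes each text description statement is available together with the name of the game element it describes, and uses a hypothetical set of pop-up types that pause the game; how the description object is obtained from the statement is not specified here.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical description objects that pause the game when present.
PAUSING_POPUPS = {"network error pop-up", "announcement pop-up"}


def determine_game_state(descriptions: List[Tuple[str, str]]
                         ) -> Dict[str, Optional[str]]:
    """`descriptions` holds (described_object, statement) pairs, one per image block."""
    for described_object, statement in descriptions:
        if described_object in PAUSING_POPUPS:
            return {"state": "paused", "cause": described_object,
                    "detail": statement}
    # No pausing pop-up: the game process is treated as running normally.
    return {"state": "normal", "cause": None, "detail": None}


state = determine_game_state([
    ("main virtual character", "a knight walking along a bridge"),
    ("network error pop-up", "a dialog saying the connection was lost"),
])
```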

[0109] Step 520: Determine the operation instructions that are compatible with the game state represented by the game image, based on the game state represented by the game image.

[0110] If the game state represented by the game image is that the game is paused due to an abnormality, the operation command to exit the paused state can be determined based on the text description of the game element that caused the game process to pause, and in combination with the interaction processing logic in the game application. The operation command to exit the paused state is, for example, the operation command to restore the game process to normal.

[0111] If the game state represented by the game image is not in a paused state, the operation commands that correspond to the game state represented by the game image are those that are beneficial to the main virtual character in the game state represented by the game image. Operation commands that are beneficial to the main virtual character can be those that help the main virtual character win the game, those that allow the main virtual character to avoid current attacks or dangers, those that reduce the damage the main virtual character is currently taking in the game scene, or those that allow the main virtual character to pass the current level or complete the current game task.

[0112] In some embodiments, if multiple operation commands are beneficial to the main virtual character in the game state represented by the game image, then in step 240 one of these operation commands can be selected at random, or an operation command that has not yet been tested in that game state can be selected from among them.

[0113] In some embodiments, a first set of candidate operation instructions can be pre-defined based on the game processing logic of the game application, corresponding to each game pause factor in a game paused state. The operation instructions in the first set of candidate operation instructions are used to exit the corresponding game paused state. Based on this, if it is determined that the game state represented by the game image is a game paused state, and the game pause factor causing the game pause is determined according to the text description of the image block that caused the game pause, an operation instruction is selected from the first set of candidate operation instructions corresponding to that game pause factor as the operation instruction adapted to the game state represented by the game image.

[0114] In some embodiments, a second set of candidate operation instructions for each game character in each game state can be pre-defined based on the game processing logic of the game application. The second set of candidate operation instructions for a game character in a game state is the set of operation instructions that are beneficial to the game character in that game state. Based on this, after determining that the game state represented by the game image is normal game progress based on the text description statements of multiple image blocks in the game image, and after determining the game state faced by the main virtual character in the current game image, an operation instruction is selected from the second set of candidate operation instructions corresponding to the main virtual character in the current game state as the operation instruction that is compatible with the game state represented by the game image.
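The pre-defined candidate sets and the selection rules described above (random selection, or preference for an operation instruction not yet tested in the game state) can be sketched as lookups followed by a choice. All keys and instruction names below are hypothetical placeholders.

```python
import random
from typing import AbstractSet, Dict, List, Optional, Sequence, Tuple

# First sets: keyed by game pause factor (hypothetical entries).
FIRST_CANDIDATES: Dict[str, List[str]] = {
    "network error pop-up": ["tap_retry"],
    "reward pop-up": ["tap_close", "tap_claim"],
}

# Second sets: keyed by (game character, game state) (hypothetical entries).
SECOND_CANDIDATES: Dict[Tuple[str, str], List[str]] = {
    ("knight", "under_attack"): ["dodge", "raise_shield"],
    ("knight", "path_clear"): ["move_forward"],
}


def select_instruction(candidates: Sequence[str],
                       already_tested: AbstractSet[str] = frozenset(),
                       rng: Optional[random.Random] = None) -> str:
    """Prefer an untested instruction; otherwise pick any candidate at random."""
    untested = [c for c in candidates if c not in already_tested]
    pool = untested or list(candidates)
    return (rng or random).choice(pool)


choice = select_instruction(SECOND_CANDIDATES[("knight", "under_attack")],
                            already_tested={"dodge"})
```

An empty candidate set raises `IndexError` in this sketch; how such a case should be handled is not addressed in the text.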

[0115] Step 250: Perform simulated operation of the game application according to the operation instructions.

[0116] The operation instruction adapted to the game state represented by the game image indicates the operation that needs to be triggered in the game interface. Therefore, that operation can be simulated so that the game application responds to the operation instruction. For example, if the operation instruction indicates controlling the main virtual character to move forward, a simulated operation that moves the main virtual character forward is triggered, so that the game application controls the main virtual character to move forward. In this application, after multiple image blocks of game elements are extracted from the game image, semantic understanding is performed on each image block to obtain its text description statement. Since the text description statement of an image block can describe the state and related information of the game element presented in the image block, the game state presented by the game image can be accurately understood from the text description statements of the multiple image blocks. This realizes automatic understanding of the content in the game image and automatic determination of the operation instruction applicable to the game state represented by the current game image, and thus realizes automatic game operation.

[0117] In some embodiments, when the user selects an automatic game mode on the game application's interface displayed on the terminal, the game application can automatically operate the game according to the method provided in this application, without relying on user operation. If the user selects a normal game mode, the game application controls the game in response to the user's operation triggered on the game interface. Furthermore, after selecting to enter automatic game mode, the user can choose to exit automatic game mode and switch to normal game mode as needed.

[0118] Figure 5B is a flowchart illustrating a game control method according to another embodiment of this application. As shown in Figure 5B, after step 250, the method further includes:

[0119] Step 260: Collect game response data of the game application responding to operation commands.

[0120] Game response data refers to the data generated by a game application in response to simulated operations performed by the application according to operation instructions adapted to the game state represented by the game image. In some embodiments, game response data may include a response record of the game application to the simulated operation. This response record may be a record generated during the process of the game application responding to the simulated operation. Through this response record, the response logic executed by the game application in response to operation instructions adapted to the game state represented by the game image can be determined.

[0121] Step 270: Determine the test results of the game application based on the game response data and the baseline response data corresponding to the game state and operation commands represented by the game image.

[0122] The baseline response data, corresponding to the game state and operation commands represented by the game image, refers to the ideal response data of the game application after executing the operation commands in the game state represented by the game image. In other words, the baseline response data indicates the expected response logic that the game application should execute according to the operation commands in the game state represented by the game image.

[0123] If the collected response data is compared with the corresponding baseline response data, and it is determined that the response logic executed by the game application reflected in the response data is the same as the expected response logic indicated by the baseline response data, it can be determined that the game application, in the game state represented by the game image, executes the same response logic according to the operation instructions as expected.

[0124] In step 270, based on the response data and the baseline response data corresponding to the game state and operation commands represented by the game image, test records for the game application in response to the game state represented by the game image and the corresponding operation commands can be determined. These test records indicate whether the response logic executed by the game application for operation commands adapted to the game state represented by the game image is the same as the expected response logic. The test records are then added to the game application's test results. In the embodiment corresponding to Figure 5B, the solution of this application is applied to game testing. The game application to be tested can run on a terminal, and game images are captured in real time while the game application is running on the terminal.
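The comparison in step 270 can be sketched as follows, assuming (purely for illustration) that both the collected response record and the baseline response data are ordered lists of event names describing the executed response logic. The record type and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ResponseCheck:
    game_state: str
    instruction: str
    matches_expected: bool


def check_response(game_state: str, instruction: str,
                   response: Sequence[str],
                   baseline: Sequence[str]) -> ResponseCheck:
    """Record whether the executed response logic equals the expected one."""
    return ResponseCheck(game_state, instruction,
                         list(response) == list(baseline))


test_results: List[ResponseCheck] = []
test_results.append(check_response(
    "paused: network error pop-up", "tap_retry",
    response=["close_popup", "reconnect", "resume_game"],
    baseline=["close_popup", "reconnect", "resume_game"],
))
test_results.append(check_response(
    "normal: under_attack", "dodge",
    response=["play_dodge_animation"],
    baseline=["play_dodge_animation", "grant_invulnerability"],
))
```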

[0125] In some embodiments, considering that a game application involves multiple game levels, and the processing logic of different game levels varies significantly, the game application can be tested on a level-by-level basis. The method described in this application can be used to test each game level involved in the game application. A game level refers to a game stage, a game process, or a game task that divides the game flow. Correspondingly, after determining the target game level to be tested, the user can select to enter the target game level on the terminal's game interface. The game engine deployed in the test environment can load the virtual scene data corresponding to the target game level, and the game application's screen in the target game level can be displayed on the terminal's game interface.

[0126] In some embodiments, considering that there are multiple virtual characters available for user operation in a game application, and that the interaction logic of different virtual characters differs in the same game level, different virtual characters can be selected in each game level to test the interaction processing logic of each virtual character in the game level.

[0127] Understandably, after the game application performs simulated operations according to the operation instructions, the interface of the game application displayed on the terminal side is updated accordingly. Then, the updated game image can be captured and processed according to the process shown in steps 220-270 above.
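The repeated capture-and-process cycle described above can be sketched as a loop. Each stage is passed in as a callable because the concrete capture, detection, description, decision, and simulation components are outside this sketch; stopping when no instruction is returned and the step limit are assumptions added to keep the loop bounded.

```python
from typing import Callable, List, Optional


def run_control_loop(capture: Callable[[], object],
                     extract_blocks: Callable[[object], list],
                     describe: Callable[[object], str],
                     decide: Callable[[List[str]], Optional[str]],
                     simulate: Callable[[str], None],
                     max_steps: int) -> List[str]:
    """Capture, understand, decide, and operate until stopped."""
    history = []
    for _ in range(max_steps):
        image = capture()                               # acquire the game image
        blocks = extract_blocks(image)                  # step 220
        descriptions = [describe(b) for b in blocks]    # step 230
        instruction = decide(descriptions)              # step 240
        if instruction is None:                         # assumed stop signal
            break
        simulate(instruction)                           # step 250
        history.append(instruction)
    return history


# Stub components standing in for the real ones.
frames = iter(["frame-1", "frame-2", "frame-3"])
performed: List[str] = []
history = run_control_loop(
    capture=lambda: next(frames),
    extract_blocks=lambda image: [image + ":block"],
    describe=lambda block: "description of " + block,
    decide=lambda d: None if "frame-3" in d[0] else "move_forward",
    simulate=performed.append,
    max_steps=10,
)
```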

[0128] In related technologies, game applications need to undergo testing after updates, and a key testing phase is UI testing. UI testing typically involves pre-setting a test flow, defining the actions to be simulated within the game interface, and automating these actions during testing. However, unexpected pop-ups not defined in the test flow may occur during testing, causing the test to be interrupted or fail.

[0129] In this application, after extracting multiple image blocks representing game elements from a game image, semantic understanding is performed on each image block to obtain a text description statement for each image block. Since the text description statements of the image blocks can describe the relevant states and information of the game elements presented on the image blocks, the game state presented by the game image can be accurately understood using the text description statements of multiple image blocks. This enables automatic understanding of the content in the game image and automatic determination of operation instructions applicable to the game state represented by the current game image, allowing the game testing process to continue and ensuring the effective execution of the automated testing process for game applications. This effectively solves the problem in related technologies where the test process is interrupted or fails due to the inability to identify the game state. Furthermore, game images contain a large number of background pixels, which contribute little to understanding the game state represented by the game image. Therefore, extracting image blocks representing game elements from the game image for semantic understanding, instead of performing semantic understanding on the entire game image, can improve the efficiency of semantic understanding during game testing, thereby improving the testing efficiency of game applications.

[0130] In other embodiments, as shown in Figure 6, step 220 includes the following steps 610-650:

[0131] Step 610: Obtain the center point coordinates of each game element in the game image. See the description above for the method of obtaining the center point coordinates of each game element.

[0132] Step 620: Predict the size of the detection box using N different target size detection networks based on the center point coordinates and the game image, and obtain N size prediction results corresponding to each center point coordinate. The size prediction results include predicted size information and prediction probability; N is a positive integer greater than 1.

[0133] The target size detection network functions similarly to the size detection network described above; both use the input game image and the coordinates of the center points of each game element in the image to predict the size of the detection box for each game element. Therefore, the predicted size information in each size prediction result indicates the size of the detection box for the corresponding game element, where the coordinates of the center point of the detection box for a game element are the coordinates of the center point of the corresponding game element.

[0134] Considering the significant differences in size and shape among different types of game elements in game images, using a single target size detection network to predict the size of the detection boxes for different types of game elements may result in insufficient accuracy for some types of game elements. For example, the predicted detection box size may be too large or too small, leading to unreasonable subsequent image patches. For instance, the image patch may not fully represent the game element, or it may contain too much other image content besides the game element, interfering with the subsequent generation of text descriptions for the game element. Therefore, in this embodiment, N different target size detection networks are used to predict the size of the detection boxes for each game element in the game image.

[0135] N different target size detection networks are suitable for detecting different types of game elements, or in other words, N different target size detection networks are suitable for detecting different ranges of target sizes. In some embodiments, N can be 3, for example, one target size detection network is suitable for predicting the detection box size of virtual characters, one target size detection network is suitable for predicting the detection box size of virtual props, and one target size detection network is suitable for predicting the detection box size of operation controls. In other embodiments, N can be more or less, which is not specifically limited here.

[0136] The structure of each target size detection network can be similar to that of the size detection network mentioned above. For example, each includes an image feature extraction network and a detection box size recognition branch network. However, the structure of the image feature extraction network in different target size detection networks can be the same or different, and the structure of the detection box size recognition branch network in different target size detection networks can be different. For example, the number of layers in the neural network can be different, or the type of neural network deployed in each layer can be different.

[0137] In some embodiments, considering that N different target size detection networks all involve image feature extraction of game images, in order to avoid performing feature extraction on the same game image multiple times and wasting computational resources, N different target size detection networks can share the same image feature extraction network. That is, the same image feature extraction network extracts features from the game image to obtain the corresponding image features. Then, the coordinates of multiple center points and image features are input into the detection box size recognition branch network of each target size detection network for detection box size prediction.

[0138] If the center point coordinates of K game elements in the game image are obtained in step 610, then in step 620, each target size detection network predicts the detection box size based on the K center point coordinates and the game image, and outputs a size prediction result corresponding to each of the K center point coordinates. The predicted size information in the size prediction result corresponding to a center point coordinate is used to indicate the size of the detection box of the game element corresponding to that center point coordinate (e.g., including width and height). The prediction probability in the size prediction result corresponding to a center point coordinate reflects the confidence level of the predicted size information for the game element corresponding to that center point coordinate. The higher the prediction probability, the higher the confidence level, and the higher the accuracy of the predicted size information.
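The sharing arrangement described above can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment: the names `predict_sizes`, `toy_backbone` and `make_toy_head` are hypothetical, and the toy functions stand in for trained networks. It shows the image features being extracted once and then passed, together with the K center point coordinates, to each of the N branch networks.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]
SizePrediction = Tuple[float, float, float]  # (width, height, prediction probability)
Features = List[float]
Head = Callable[[Features, Sequence[Point]], List[SizePrediction]]


def predict_sizes(
    image: Sequence[Sequence[float]],
    centers: Sequence[Point],
    extract_features: Callable[[Sequence[Sequence[float]]], Features],
    heads: Sequence[Head],
) -> List[List[SizePrediction]]:
    """Return results[i][j]: the prediction of branch network i for center point j."""
    features = extract_features(image)  # computed once, shared by all N branch networks
    return [head(features, centers) for head in heads]


# Toy stand-ins (hypothetical); a real system would use trained networks.
calls = {"backbone": 0}


def toy_backbone(image: Sequence[Sequence[float]]) -> Features:
    calls["backbone"] += 1
    return [sum(row) / len(row) for row in image]


def make_toy_head(width: float, height: float, prob: float) -> Head:
    def head(features: Features, centers: Sequence[Point]) -> List[SizePrediction]:
        # One (width, height, probability) result per center point.
        return [(width, height, prob) for _ in centers]
    return head


heads = [make_toy_head(64, 128, 0.9), make_toy_head(32, 32, 0.4), make_toy_head(96, 40, 0.2)]
centers = [(10.0, 20.0), (50.0, 60.0)]
results = predict_sizes([[0.1, 0.2], [0.3, 0.4]], centers, toy_backbone, heads)
```

With N = 3 branch networks and K = 2 center points, `results` holds 3 lists of 2 size prediction results each, while the feature extractor runs only once.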

[0139] Step 630: Determine the reference probability threshold corresponding to each center point coordinate based on the prediction probabilities in the N size prediction results for the same center point coordinates.

[0140] In some embodiments, for each center point coordinate, the average of the predicted probabilities among the N size prediction results corresponding to that center point coordinate can be used as the reference probability threshold for that center point coordinate. For example, if the predicted probability output by the i-th target size detection network for the j-th center point coordinate in the N target size detection networks is Pij, then the reference probability threshold P(j) corresponding to the j-th center point coordinate is expressed as:

[0141] P(j) = (1/N) * (P1j + P2j + ... + PNj); (Formula 1)

[0142] In some embodiments, for each center point coordinate, the predicted probability of the target percentile among the predicted probabilities of the N size prediction results corresponding to the center point coordinate can be used as the reference probability threshold corresponding to the center point coordinate. The target percentile is greater than 50% and less than 100%, for example, the target percentile is 75%, 80%, 85%, 90%, etc.
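The two ways of deriving the reference probability threshold can be sketched as below. The function names are hypothetical, and the nearest-rank definition of a percentile is an assumption, since the text does not specify how the percentile is computed.

```python
import math
from typing import Sequence


def mean_threshold(probs: Sequence[float]) -> float:
    """Average of the N prediction probabilities for one center point."""
    return sum(probs) / len(probs)


def percentile_threshold(probs: Sequence[float], percentile: float) -> float:
    """Nearest-rank percentile of the N prediction probabilities (assumed definition)."""
    if not 50.0 < percentile < 100.0:
        raise ValueError("target percentile must be greater than 50 and less than 100")
    ordered = sorted(probs)
    rank = math.ceil(percentile / 100.0 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


probs_for_center = [0.9, 0.4, 0.2, 0.7, 0.6]  # illustrative values for N = 5
mean_ref = mean_threshold(probs_for_center)
p75_ref = percentile_threshold(probs_for_center, 75)
```

Note that with a small N, a high percentile under this definition can coincide with the maximum probability, in which case no result strictly exceeds the threshold.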

[0143] Step 640: Based on the reference probability threshold corresponding to each center point coordinate, among the N size prediction results for the same center point coordinate, determine the target size prediction result with the highest prediction probability that exceeds the corresponding reference probability threshold.

[0144] In other words, for each center point coordinate, among the N size prediction results of that center point coordinate whose prediction probability exceeds the corresponding reference probability threshold, the size prediction result with the highest prediction probability is taken as the target size prediction result corresponding to that center point coordinate.

[0145] Step 650: According to the predicted size information and corresponding center point coordinates in the target size prediction results corresponding to each center point coordinate, the game image is cropped to obtain the image block of the game element corresponding to the center point coordinate.

[0146] In some embodiments, if, for a given center point coordinate, none of the N size prediction results corresponding to that center point coordinate have a prediction probability exceeding the corresponding reference probability threshold, then it is not necessary to determine the target predicted size information for that center point coordinate, and subsequent image patch cropping is also unnecessary. It is understood that if none of the N size prediction results corresponding to a center point coordinate have a prediction probability exceeding the corresponding reference probability threshold, it indicates that the confidence levels of the size prediction results for that center point coordinate are all low. If subsequent image patch cropping is performed using the predicted size information from these low-confidence size prediction results, the effectiveness of the cropped image patch will be low.

[0147] In the above embodiment, since the predicted probability of the target size prediction result corresponding to a center point coordinate exceeds the reference probability threshold corresponding to that center point coordinate, it indicates that the accuracy or confidence of the target size prediction result corresponding to that center point coordinate is high. Among the N size prediction results for the same center point coordinate, the target size prediction result with the highest predicted probability exceeding the corresponding reference probability threshold is determined. This ensures that the size prediction result with the highest confidence is determined from the size prediction results with high confidence, and is used as the corresponding target size prediction result. This can avoid the problem that even if the size prediction result with the highest predicted probability is used as the target size prediction result, the confidence of the target size prediction result is still not high because the confidence of the N size prediction results is low.
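Steps 640 and 650 can be illustrated with the following sketch. The helper names are hypothetical, the image is modeled as a nested list of pixel values, and clamping the crop to the image bounds is an assumption not stated in the text.

```python
from typing import List, Optional, Sequence, Tuple

SizePrediction = Tuple[float, float, float]  # (width, height, prediction probability)


def select_target(preds: Sequence[SizePrediction], threshold: float) -> Optional[SizePrediction]:
    """Pick the most probable result among those exceeding the reference threshold."""
    confident = [p for p in preds if p[2] > threshold]
    if not confident:
        return None  # all results are low-confidence: skip cropping for this center point
    return max(confident, key=lambda p: p[2])


def crop(
    image: Sequence[Sequence[int]],
    center: Tuple[float, float],
    width: float,
    height: float,
) -> List[List[int]]:
    """Cut the detection box centered on `center` out of the image (row-major pixels)."""
    cx, cy = center
    rows, cols = len(image), len(image[0])
    left = max(0, int(round(cx - width / 2)))
    right = min(cols, int(round(cx + width / 2)))
    top = max(0, int(round(cy - height / 2)))
    bottom = min(rows, int(round(cy + height / 2)))
    return [list(row[left:right]) for row in image[top:bottom]]


image = [[r * 10 + c for c in range(6)] for r in range(6)]  # toy 6x6 image
target = select_target([(2, 2, 0.9), (4, 4, 0.4), (6, 6, 0.2)], threshold=0.5)
patch = crop(image, (3, 3), target[0], target[1]) if target else None
```

Returning `None` when no result exceeds the threshold mirrors the case in which cropping is skipped for a center point.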

[0148] In some embodiments, as shown in Figure 7, step 510 includes the following steps 710-740:

[0149] Step 710: Among the text description statements of multiple image blocks in the game image, determine the text description statement related to the main virtual character.

[0150] The text description statements related to the main virtual character can include the text description statements of image blocks whose presented game element is the main virtual character. Such an image block may present the main virtual character as a whole, or may present an item worn or carried by the main virtual character, such as a virtual prop or an accessory.

[0151] In some embodiments, the text description statements associated with the main virtual character may also include text description statements of image tiles of game elements that present the skill status attributes of the main virtual character.

[0152] In some embodiments, step 710 includes the following ①-③:

[0153] ① Obtain the element descriptors for each image block.

[0154] The element descriptor of an image patch can refer to the name of the game element presented by the image patch, or the semantic category of the game element presented by the image patch. In some embodiments, if the object detection result is obtained by an object detection model in step 220, the semantic category of each game element in the object detection result can be used as the element descriptor of the image patch where the game element is located.

[0155] In some embodiments, obtaining the element descriptors in ① includes: acquiring image features of each image block; matching the image features of the image block against a feature library to determine target image features that match the image features of the image block; and using the element descriptor associated with the target image features in the feature library as the element descriptor of the corresponding image block. The feature library stores the element descriptor associated with each image feature. Therefore, after the image features of an image block are determined, the feature similarity between the image features of the image block and each image feature in the feature library can be calculated. If the feature similarity between an image feature in the feature library (denoted image feature A) and the image features of the current image block (denoted image block B) exceeds a similarity threshold, image feature A can be determined to be a target image feature that matches the image features of image block B. Consequently, the element descriptor associated with image feature A in the feature library can be used as the element descriptor of image block B.
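A minimal sketch of the feature library lookup follows. Cosine similarity, the threshold value 0.9, and the choice to return the best match when several stored features exceed the threshold are all assumptions; the text only requires that the feature similarity exceed a similarity threshold. The library contents are illustrative.

```python
import math
from typing import Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_descriptor(
    feature: Sequence[float],
    library: Sequence[Tuple[Sequence[float], str]],
    threshold: float = 0.9,
) -> Optional[str]:
    """Return the descriptor of the most similar stored feature above the threshold."""
    best_descriptor, best_score = None, threshold
    for stored_feature, descriptor in library:
        score = cosine_similarity(feature, stored_feature)
        if score > best_score:
            best_descriptor, best_score = descriptor, score
    return best_descriptor


# Illustrative library: (image feature, element descriptor) pairs.
feature_library = [([1.0, 0.0, 0.0], "chicken"), ([0.0, 1.0, 0.0], "sword")]
matched = match_descriptor([0.9, 0.1, 0.0], feature_library)
unmatched = match_descriptor([1.0, 1.0, 0.0], feature_library)
```

A block whose features match nothing in the library yields `None`, in which case another way of obtaining the element descriptor would be needed.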

[0156] In some embodiments, considering that the text description of an image tile is used to describe the game element presented in the image tile, the text description generally names the object it describes. Therefore, keywords representing the described object can be extracted from the text description as element descriptors. For example, a regular expression can be determined based on the syntax of the text description. The regular expression is then used to locate, within the text description, the position of the keyword representing the described object, and the keyword at that position is extracted as an element descriptor.

[0157] For example, if the text description of an image patch is "a white chicken with an orange beak is standing in the grass", the syntax of this text description is subject-linking verb-predicate. In this syntax, the words that represent the object described by the text description are generally in the subject. Therefore, the head noun of the subject ("chicken") can be extracted as the element descriptor.
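The regular-expression approach can be sketched as follows. The pattern is a hypothetical one written for descriptions of the form "article, optional modifiers, noun, then a word such as 'with' or 'is'"; it is not taken from the embodiment and would need adapting to the actual output syntax of the text description model.

```python
import re
from typing import Optional

# Article, optional modifiers, then the noun that directly precedes a
# linking word ("with", "is", "are", "that", "who").
SUBJECT_PATTERN = re.compile(
    r"^(?:a|an|the)\s+(?:[\w-]+\s+)*?([\w-]+)\s+(?:with|is|are|that|who)\b",
    re.IGNORECASE,
)


def extract_descriptor(description: str) -> Optional[str]:
    """Return the subject noun of the description, or None if the syntax differs."""
    match = SUBJECT_PATTERN.match(description.strip())
    return match.group(1).lower() if match else None


descriptor = extract_descriptor(
    "a white chicken with an orange beak is standing in the grass"
)
```

Descriptions that do not follow the assumed syntax return `None`, so a fallback such as the compression approach described next may still be required.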

[0158] In some embodiments, considering that the text description statements of image tiles are used to describe the game elements presented in the image tiles, the text description statements of each image tile can be compressed to obtain element descriptive terms for each image tile. Specifically, compressing the text description statements of image tiles means converting longer text description statements of image tiles into shorter text tags.

[0159] In some embodiments, the text description statements of each image block can be compressed using a large language model guided by compression guidance text to obtain the element description words of the image block.

[0160] Large language models, also known as large models, natural language models, or large-scale language models, are deep learning models trained on large amounts of text data. They are capable of generating natural language text or understanding the meaning of language text and can be applied to various natural language processing tasks, such as text classification and question answering.

[0161] The compression guide text is the instruction text given to the large language model. In this embodiment, the compression guide text is used to instruct the large language model to compress the input text description. For example, if the text description of an image patch is: "a white chicken with an orange beak is standing in the grass", the compression guide text could be: "What category does the object described in 'a white chicken with an orange beak is standing in the grass' belong to?" Under the guidance of the compression guide text, the large language model will output the element description word "chicken".

[0162] In some embodiments, to facilitate subsequent processing, the compression guidance text can instruct the large language model not only to compress the input text description statement and output the corresponding element descriptors, but also to output the numeric tags corresponding to the element descriptors. Understandably, in this case, a numeric tag dictionary can be pre-constructed, maintaining the correspondence between element descriptors and numeric tags. After the large language model learns this numeric tag dictionary, it can compress the text description statement of the image patch according to the compression guidance text, outputting the element descriptors of the image patch and their corresponding numeric tags.

[0163] Continuing with the example above, the compression guidance text could be: "What category does the object described in 'a white chicken with an orange beak is standing in the grass' belong to, and what is its corresponding numerical label?" Guided by this compression guidance text, the large language model could output the response: "The object described in 'a white chicken with an orange beak is standing in the grass' belongs to the category 'chicken,' and in the previously constructed numerical label dictionary, the numerical label corresponding to this category is 2."
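The construction of the compression guidance text and the lookup in the numeric tag dictionary can be sketched as below. No large language model is called here; the function names are hypothetical, and every dictionary entry other than "chicken" mapping to 2 is an illustrative assumption.

```python
from typing import Dict, Optional, Tuple

# Pre-constructed numeric tag dictionary; only "chicken" -> 2 comes from the
# example in the text, the other entries are illustrative.
LABEL_DICTIONARY: Dict[str, int] = {"virtual character": 0, "virtual prop": 1, "chicken": 2}


def build_compression_prompt(description: str, with_numeric_label: bool = False) -> str:
    """Build the compression guidance text for one text description statement."""
    prompt = f"What category does the object described in '{description}' belong to"
    if with_numeric_label:
        prompt += ", and what is its corresponding numerical label"
    return prompt + "?"


def attach_numeric_label(
    descriptor: str, dictionary: Dict[str, int]
) -> Tuple[str, Optional[int]]:
    """Pair an element descriptor with its numeric tag, if the dictionary has one."""
    return descriptor, dictionary.get(descriptor)


description = "a white chicken with an orange beak is standing in the grass"
prompt = build_compression_prompt(description, with_numeric_label=True)
label = attach_numeric_label("chicken", LABEL_DICTIONARY)
```

Looking the tag up locally, as in `attach_numeric_label`, is one alternative to having the model output the numeric tag itself.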

[0164] Compressing the text description of an image tile into shorter element descriptions makes it easier to directly identify the name of the game element presented in the image tile, or to intuitively determine which game element the text description describes.

[0165] ② Among multiple image blocks, determine the target image block whose element descriptor represents the main virtual character.

[0166] A target image block refers to an image block whose element description represents the main virtual character. Correspondingly, the game element presented by the target image block is the main virtual character.

[0167] ③ Use the text description statement corresponding to the target image block as the text description statement related to the main virtual character.

[0168] In some embodiments, considering that icons or operation controls on the interface of a game application represent skill states and are used to represent the skill states of the main virtual character, the text descriptions of the icons or operation controls representing skill states can also be used as text descriptions related to the main virtual character.

[0169] Step 720: Determine the state of the main virtual character in the game image based on the text description statements related to the main virtual character.

[0170] Based on the text descriptions associated with the main virtual character, the character's status in the game image can be determined, such as the character's energy level, skill status (e.g., skills that can be used), movement status (location, movement routes, etc.), available virtual items, and whether the character is currently under attack.

[0171] Step 730: Determine the game environment state of the main virtual character in the game image based on the text description statements of multiple image blocks, excluding the text description statements related to the main virtual character.

[0172] Based on the text descriptions other than those related to the main virtual character, the game environment state of the main virtual character in the game image can be determined accordingly. For example, the attribute state of the main virtual character's teammates (such as energy value, attack status, etc.), the number of teammates, the attribute state of the main virtual character's opponents, and the number of opponents.

[0173] Step 740: Determine the game state represented by the game image based on the main virtual character's own state in the game image and the game environment state of the main virtual character in the game image.

[0174] The state of the main virtual character in the game image and the state of the game environment in the game image can be combined to form the game state represented by the game image. It can be understood that the game state represented by the game image reflects the game state faced by the main virtual character in the current game image.

[0175] In some embodiments, before determining the target image block whose element descriptor represents the main virtual character among multiple image blocks, the method may further include: determining, based on the element descriptors of each image block, whether there exists a first image block whose element descriptor is located in a specified pop-up name set, wherein the pop-up name in the specified pop-up name set refers to the name of a pop-up that causes the game application to pause. If a first image block exists, the game element presented by the first image block corresponds to a pop-up that causes the game to pause, and then the game state represented by the game image can be determined based on the text description of the first image block.

[0176] If it is determined that there is no first image block in the specified pop-up name set among multiple image blocks, the target image block representing the main virtual character can be determined among multiple image blocks, thereby determining the game state faced by the main virtual character in the current game image.
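The flow of steps 710-740, including the preliminary pop-up check, can be sketched as follows. The data classes, the descriptor strings and the representation of a state as a list of description statements are all hypothetical simplifications; in the embodiment the states would be derived by interpreting the statements rather than merely collecting them.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Set


@dataclass
class Block:
    descriptor: str   # element descriptor of the image block
    description: str  # text description statement of the image block


@dataclass
class GameState:
    paused_by_popup: bool = False
    popup_descriptions: List[str] = field(default_factory=list)
    own_state: List[str] = field(default_factory=list)          # main virtual character
    environment_state: List[str] = field(default_factory=list)  # everything else


def determine_game_state(
    blocks: Sequence[Block], main_character: str, popup_names: Set[str]
) -> GameState:
    # Pop-up check first: a pausing pop-up determines the game state on its own.
    popups = [b for b in blocks if b.descriptor in popup_names]
    if popups:
        return GameState(paused_by_popup=True,
                         popup_descriptions=[b.description for b in popups])
    state = GameState()
    for b in blocks:
        if b.descriptor == main_character:
            state.own_state.append(b.description)
        else:
            state.environment_state.append(b.description)
    return state


blocks = [
    Block("hero", "a hero with low energy is standing near a wall"),
    Block("enemy", "an enemy with a sword is approaching"),
]
state = determine_game_state(blocks, "hero", {"settings pop-up"})
```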

[0177] In the above embodiments, the main virtual character's own state in the game image is determined based on the text description statements related to the main virtual character, and the game environment state of the main virtual character in the game image is determined based on other text description statements besides those related to the main virtual character. Combining the main virtual character's own state and the game environment state in the game image, the game state represented by the game image is determined. This approach not only focuses on the main virtual character itself but also considers the influence of other virtual characters or the game scene in the game image on the main virtual character, ensuring the accuracy of the determined game state and thus the accuracy of the subsequently determined operation commands.

[0178] In some embodiments, where the solution of this application is applied to game testing, as shown in Figure 8, step 520 includes:

[0179] Step 810: In the game test case set of the game application, determine the target game test cases that are compatible with the game state.

[0180] The game test case set for a game application includes one or more test cases applicable to each game state. Therefore, after determining the game state represented by the game image, test cases applicable to the current game state represented by the game image can be identified from the game test case set; these are the target game test cases adapted to the game state represented by the game image.

[0181] In some embodiments, the game test case set of a game application may include a set of tested cases and a set of untested cases. Correspondingly, in step 810, a target game test case that matches the game state can be determined from the set of untested cases. Correspondingly, after step 260, the target game test case can be added to the set of tested cases and removed from the set of untested cases. This ensures that all game test cases in the game application can be tested without being reused, avoiding duplicate testing.

[0182] In some embodiments, when testing a game application on a level-by-level basis, a set of test cases can be set for each game level, and during the testing of a game level, the set of tested and untested test cases corresponding to that game level can be dynamically maintained. In step 810, target game test cases that match the game state represented by the game image can be determined from the set of untested test cases corresponding to the target game level.

[0183] In some embodiments, if there are multiple game test cases that are compatible with the game state, one can be selected as the target game test case.

[0184] Step 820: Obtain the operation instructions from the target game test case as operation instructions that are compatible with the game state represented by the game image.

[0185] Each game test case records the game state to which the test case applies, as well as the operation instructions to be executed and the benchmark response data. Therefore, the operation instructions in the target game test case can be used as the operation instructions adapted to the game state represented by the current game image.

[0186] Correspondingly, before step 270, the method further includes: step 830, obtaining benchmark response data in the target game test case as benchmark response data corresponding to the game state and operation instructions represented by the game image.
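Steps 810-830, together with the tested and untested sets, can be sketched as follows. The class and field names are hypothetical, game states are modeled as plain strings, and picking the first matching untested case is an assumption (the text only says that one of several matching cases can be selected).

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class GameTestCase:
    name: str
    game_state: str          # game state to which the case applies
    operation: str           # operation instruction to execute
    benchmark_response: str  # expected response data


class GameTestCaseSet:
    def __init__(self, cases: Sequence[GameTestCase]) -> None:
        self.untested: List[GameTestCase] = list(cases)
        self.tested: List[GameTestCase] = []

    def pick(self, game_state: str) -> Optional[GameTestCase]:
        """Step 810: choose an untested case that applies to the game state."""
        for case in self.untested:
            if case.game_state == game_state:
                return case
        return None

    def mark_tested(self, case: GameTestCase) -> None:
        """Move a case from the untested set to the tested set after it has run."""
        self.untested.remove(case)
        self.tested.append(case)


cases = GameTestCaseSet([
    GameTestCase("heal", "low_energy", "use potion", "energy increases"),
    GameTestCase("retreat", "low_energy", "move back", "position changes"),
])
target = cases.pick("low_energy")
operation, benchmark = target.operation, target.benchmark_response  # steps 820 and 830
cases.mark_tested(target)
next_target = cases.pick("low_energy")
```

Because a case leaves the untested set once it has run, a second request for the same game state returns a different case, which avoids duplicate testing.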

[0187] In the above embodiment, after determining the game state represented by the game image, target game test cases applicable to the game state represented by the game image are obtained from the set of game test cases constructed for the game application. The game application is then tested based on the operation instructions and benchmark response data in the target game test cases. In this way, the game test cases are used to guide the game testing process. That is, the game state and operation instructions defined in the game test cases indicate the functions that need to be tested in the game application, making the game testing process targeted and able to cover all game test cases in the set of game test cases.

[0188] In some embodiments, the game image presents at least two game characters; the method further includes:

[0189] Based on the element descriptors of each image patch, the image patches representing the same game character from N consecutively acquired game images are aggregated to obtain the image patch set corresponding to each game character; N is an integer greater than 1.

[0190] For each game character, the movement trajectory of the game character is obtained by fitting the location points corresponding to the image blocks in the corresponding image block set, in the order of the acquisition times of the source game images.

[0191] Multiple image blocks with the same element descriptor can be considered image blocks representing the same game character. It can be understood that all image blocks in the image block set corresponding to a game character represent that game character.

[0192] The location point corresponding to an image block refers to the position, in the game scene, of the game character represented by the image block at the time when the game image from which the image block originates was acquired. During image acquisition for a game application, the server of the game application can be requested to provide the position information of each game character at the corresponding time. This position information indicates the location point of the game character in the game scene at that time. Since the game character represented by an image block can be determined through its element descriptor, the location point corresponding to the image block can be looked up, based on the element descriptor of the image block, in the character position information obtained for the acquisition time of the source game image. In this embodiment, since the movement trajectory of a game character is generated from the character's location points in the game scene, the accuracy of the determined movement trajectory can be guaranteed even if the game images acquired at different times have different viewpoints (that is, the position and/or angle of the virtual camera differ).
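The aggregation and ordering described above can be sketched as follows. The names are hypothetical, and "fitting" is simplified here to connecting the location points in acquisition order; an actual embodiment might fit a smoother curve through them.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Sequence, Tuple


class BlockObservation(NamedTuple):
    descriptor: str                # element descriptor of the image block
    acquired_at: float             # acquisition time of the source game image
    position: Tuple[float, float]  # location point in the game scene


def build_trajectories(
    observations: Sequence[BlockObservation],
) -> Dict[str, List[Tuple[float, float]]]:
    """Group image blocks by game character and order their location points by time."""
    grouped: Dict[str, List[BlockObservation]] = defaultdict(list)
    for obs in observations:
        grouped[obs.descriptor].append(obs)
    return {
        descriptor: [o.position for o in sorted(items, key=lambda o: o.acquired_at)]
        for descriptor, items in grouped.items()
    }


observations = [
    BlockObservation("hero", 2.0, (3.0, 1.0)),
    BlockObservation("enemy", 1.0, (9.0, 9.0)),
    BlockObservation("hero", 1.0, (1.0, 1.0)),
    BlockObservation("enemy", 2.0, (8.0, 7.0)),
]
trajectories = build_trajectories(observations)
```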

[0193] Figure 9 is a flowchart illustrating a game control method according to an embodiment of this application. As shown in Figure 9, after the game image is obtained, image blocks of multiple game elements are extracted from the game image; the example in Figure 9 shows four image blocks. Then, an image-to-text model is used to generate a text description for each image block. Next, a label is determined for each image block; for example, a label can be an element descriptor, or an element descriptor combined with a numeric label.

[0194] In a specific embodiment, the label of an image patch can be determined using either of the following two methods:

[0195] Method 1: Use a large language model to compress the text description of the image patch to obtain the element description words of the image patch, or obtain the element description words of the image patch and the numerical label.

[0196] In Method 1, the large language model can perform text compression based on an attention mechanism, following the steps ①-⑦ below:

[0197] ① Long text preprocessing;

[0198] The large language model can first perform word segmentation on the input long text (e.g., the compression guidance text, which includes the text description of an image block). The word segmentation result of the input long text X can be represented as:

[0199] Segment(X) = {X1, X2, ..., Xn}; (Formula 2); where X1, X2, ..., Xn represent n segments sequentially divided from a long text, and n is a positive integer greater than 1. Specifically, a word segmentation tool can be used to segment the long text.

[0200] ② Map the text to an embedded vector representation;

[0201] The embedding layer in a large language model can map each word in the segmentation result to an embedding vector representation. The embedding vector representation Ei of the i-th word Xi in the segmentation result Segment(X) is:

[0202] Ei=Embed(Xi), i=1, 2,...,n; (Formula 3)

[0203] Large language models can compress text descriptions of image patches based on attention mechanisms. For example, the QKV (Query-Key-Value) attention mechanism can be used: leveraging QKV attention allows large language models to better capture the relationships between word segments. This mechanism allows for the simultaneous consideration of long texts and the interactions between individual word segments when generating element descriptions.

[0204] ③ Determine the query vector, key vector, and value vector

[0205] The large language model can generate query vectors (Q), key vectors (K), and value vectors (V) for each segment in the word segmentation result. For example, for a segment s, the query vector Qs, the key vector Ks, and the value vector Vs corresponding to segment s can be calculated according to the following formula:

[0206] Qs = WQs * Es + bQs; (Formula 4)

[0207] Ks = WKs * Es + bKs; (Formula 5)

[0208] Vs = WVs * Es + bVs; (Formula 6)

[0209] Where Es is the embedding vector representation of word segment s; WQs is the weight matrix used to calculate the query vector Qs corresponding to word segment s, bQs is the bias matrix used to calculate the query vector Qs corresponding to word segment s; WKs is the weight matrix used to calculate the key vector Ks corresponding to word segment s; bKs is the bias matrix used to calculate the key vector Ks corresponding to word segment s; WVs is the weight matrix used to calculate the value vector Vs corresponding to word segment s; bVs is the bias matrix used to calculate the value vector Vs corresponding to word segment s; where WQs, bQs, WKs, bKs, WVs, and bVs can all be determined through training.

[0210] Similarly, a query vector (Q), key vector (K), and value vector (V) can be generated for the text description statement of an image patch. For example, for the text description statement l of an image patch, the query vector Ql, the key vector Kl, and the value vector Vl corresponding to the text description statement l can be calculated according to the following formula:

[0211] Ql = WQl * El + bQl; (Formula 7)

[0212] Kl=WKl*El+bKl; (Formula 8)

[0213] Vl = WVl*El + bVl; (Formula 9)

[0214] Wherein, El is the embedding vector representation of the text description statement l; WQl is the weight matrix used to calculate the query vector Ql corresponding to the text description statement l, bQl is the bias matrix used to calculate the query vector Ql corresponding to the text description statement l; WKl is the weight matrix used to calculate the key vector Kl corresponding to the text description statement l; bKl is the bias matrix used to calculate the key vector Kl corresponding to the text description statement l; WVl is the weight matrix used to calculate the value vector Vl corresponding to the text description statement l; bVl is the bias matrix used to calculate the value vector Vl corresponding to the text description statement l; where WQl, bQl, WKl, bKl, WVl and bVl can all be determined through training.

[0215] ④ Calculate attention scores;

[0216] Large language models can calculate the attention score between each word segment and the text description statement, enabling interaction between long and short texts. For example, the attention score between word segment s and text description statement l can be calculated using the following formula, including the first attention score Asl of word segment s to text description statement l, and the second attention score Als of text description statement l to word segment s:

[0217] Asl=Softmax(Qs*KlT) / sqrt(dk1); (Formula 10)

[0218] Als=Softmax(Ql*KsT) / sqrt(dk2); (Formula 11)

[0219] Where KlT represents the transpose of the key vector Kl corresponding to the text description statement l; dk1 represents the dimension of the key vector Kl corresponding to the text description statement l; sqrt(dk1) represents the arithmetic square root of dk1; KsT represents the transpose of the key vector Ks corresponding to the word segment s; dk2 represents the dimension of the key vector Ks corresponding to the word segment s; and sqrt(dk2) represents the arithmetic square root of dk2.

[0220] ⑤ Sum the attention scores by weight.

[0221] The weighted sum of attention scores for word segment s can be calculated using the following formula 12:

[0222] Os = Asl * Vl; (Formula 12)

[0223] And calculate the weighted sum of attention scores for the text description statement l according to the following formula 13:

[0224] Ol = Als * Vs; (Formula 13)

[0225] ⑥ Combine the weighted sums of attention scores and then activate them;

[0226] The weighted sum of attention scores Os for word segment s and the weighted sum of attention scores Ol for text description sentence l are concatenated to obtain the concatenated result O:

[0227] O = Concat(Os, Ol); (Formula 14)

[0228] Next, the concatenation result O is activated using the tanh activation function to obtain the activation result H:

[0229] H = tanh(WH*O + bH); (Formula 15)

[0230] WH is the weight matrix of the activation layer, and bH is the bias matrix of the activation layer, which can be determined through training.

[0231] ⑦ Output by category.

[0232] In the classification layer of a large language model (e.g., it can be constructed using a fully connected network), the activation processing result H can be mapped to labels, such as element descriptors.

[0233] The output Y of the classification layer can be represented as:

[0234] Y = Softmax(WC*H + bC); (Formula 16)

[0235] WC is the weight matrix of the classification layer, and bC is the bias matrix of the classification layer, which can be determined through training.

[0236] Through the above steps, the long text is compressed into a short text. By utilizing the QKV attention mechanism, the large language model can focus on the parts of the long text that are relevant to the target, and the calculation of attention weights allows the model to select key information automatically. In addition, based on the QKV attention mechanism, the large language model can better capture the correlation between texts and simultaneously consider the interaction between the text description statement and the word segments in the long text, making the generated element descriptors more accurate.
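Steps ③-⑦ can be illustrated with the small pure-Python sketch below. It is an illustration under stated assumptions, not the model of the embodiment: all weights are toy values rather than trained parameters, the helper names are hypothetical, and the scaling by the square root of the key dimension is applied inside the softmax, as in the standard scaled dot-product formulation, which differs from the literal placement in the attention score formulas above.

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def affine(w: Matrix, x: Sequence[float], b: Sequence[float]) -> Vector:
    """Return W*x + b (the form of the Q, K, V, activation and classification layers)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def softmax(scores: Sequence[float]) -> Vector:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Sequence[float], keys: Sequence[Sequence[float]],
           values: Sequence[Sequence[float]]) -> Vector:
    """Scaled dot-product attention of one query over several key/value pairs."""
    scale = math.sqrt(len(query))
    weights = softmax([sum(q * k for q, k in zip(query, key)) / scale for key in keys])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]


def classify(o_s: Vector, o_l: Vector, w_h: Matrix, b_h: Vector,
             w_c: Matrix, b_c: Vector) -> Vector:
    """Concatenate, apply the tanh activation layer, then the softmax classification layer."""
    hidden = [math.tanh(h) for h in affine(w_h, o_s + o_l, b_h)]
    return softmax(affine(w_c, hidden, b_c))


# Toy parameters (hypothetical): identity projections and zero biases.
IDENTITY, ZERO = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
segment_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
description_embedding = [1.0, 0.0]

q_l = affine(IDENTITY, description_embedding, ZERO)
k_l = affine(IDENTITY, description_embedding, ZERO)
v_l = affine(IDENTITY, description_embedding, ZERO)
seg_q = [affine(IDENTITY, e, ZERO) for e in segment_embeddings]
seg_k = [affine(IDENTITY, e, ZERO) for e in segment_embeddings]
seg_v = [affine(IDENTITY, e, ZERO) for e in segment_embeddings]

o_l = attend(q_l, seg_k, seg_v)       # description statement attends over the word segments
o_s = attend(seg_q[0], [k_l], [v_l])  # first word segment attends to the description statement

w_h, b_h = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]], [0.0, 0.0]
w_c, b_c = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], [0.0, 0.0, 0.0]
probabilities = classify(o_s, o_l, w_h, b_h, w_c, b_c)  # one probability per candidate label
```

Because there is only one description statement, a word segment attending to it receives an attention weight of 1, so `o_s` equals `v_l`; the informative direction in this toy setup is the description statement attending over the word segments.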

[0237] Method 2: Match the image features of the image block with the image features associated with each label in the feature library to obtain the element descriptor of the image block, or obtain the element descriptor of the image block and the numerical label.

[0238] The feature library stores the labels corresponding to each game element and the image features of each game element. For example, for the labels of image blocks determined according to method 1, the labels of the image blocks and the image features of the image blocks can be associated and stored in the feature library for subsequent feature matching.
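Feature matching against the feature library can be sketched as a nearest-neighbour lookup. The example below is a minimal sketch under stated assumptions: cosine similarity as the matching measure, a minimum similarity of 0.8, and the function name `match_descriptor` are all illustrative choices, since this application does not fix a particular metric or threshold.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def match_descriptor(feature, library, min_similarity=0.8):
    """Return (label, score) of the best library entry, or (None, score) if too dissimilar.

    library: list of (label, feature_vector) pairs, one per stored game element.
    """
    best_label, best_score = None, -1.0
    for label, reference in library:
        score = cosine(feature, reference)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < min_similarity:
        return None, best_score
    return best_label, best_score
```

Returning `None` for a weak match leaves room to fall back to Method 1 and then store the new label and feature in the library.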

[0239] Subsequently, based on the text description statements of the multiple image blocks and the labels of each image block, the game state represented by the game image is determined, and the operation instructions corresponding to that game state are determined. Then, a simulated operation is performed on the game application according to the operation instructions, and game response data is collected. The test result of the game application is determined based on the game response data and the benchmark response data corresponding to the game state represented by the game image and the operation instructions. For the specific implementation details of each step in the embodiment of Figure 9, refer to the description above; they are not repeated here.

[0240] The solution presented in this application uses image blocks of game elements captured from game images and performs semantic understanding on these blocks. This effectively allows for the understanding of the game state represented by the game image, enabling automatic image reading and the automatic determination and execution of appropriate operation commands for the current game state. It eliminates the need for script logic and can perform automated operations even in complex game scenarios, ensuring continuous automated game testing. Furthermore, since this solution can identify multiple game characters in the game image, during game application testing, movement trajectories can be generated for different game characters from multiple consecutively acquired game images, achieving multi-character memorization.

[0241] The following describes an apparatus embodiment of this application, which can be used to perform the methods described in the above embodiments of this application. For details not disclosed in the apparatus embodiments of this application, please refer to the method embodiments described in the above embodiments of this application.

[0242] Figure 10 is a block diagram of a game control device according to an embodiment of this application. As shown in Figure 10, the game control device includes: an acquisition module 1010 for acquiring game images captured during the operation of the game application; a cropping module 1020 for cropping image blocks of multiple game elements from the game images; a semantic understanding module 1030 for performing semantic understanding on the image blocks of each game element to obtain text description statements for each image block; an operation instruction determination module 1040 for determining operation instructions that match the game state represented by the game images based on the text description statements of multiple image blocks in the game images; and a simulation operation module 1050 for performing simulated operations on the game application according to the operation instructions.

[0243] In some embodiments, the game control device further includes: a data acquisition module for acquiring game response data of the game application in response to operation commands; and a test result determination module for determining the test result of the game application based on the game response data and the benchmark response data corresponding to the game state and operation commands represented by the game image.

[0244] In some embodiments, the operation instruction determination module 1040 includes: a game state determination unit, configured to determine the game state represented by the game image based on the text description statements of multiple image blocks in the game image; and an operation instruction determination unit, configured to determine an operation instruction adapted to the game state represented by the game image based on the game state represented by the game image.

[0245] In some embodiments, the multiple game elements include at least a main virtual character; the game state determination unit includes: a first determination unit, configured to determine, from the text description statements of multiple image blocks in the game image, a text description statement related to the main virtual character; a first state determination unit, configured to determine, based on the text description statement related to the main virtual character, the main virtual character's own state in the game image; a second state determination unit, configured to determine, based on the text description statements of the multiple image blocks other than the text description statement related to the main virtual character, the game environment state of the main virtual character in the game image; and a third state determination unit, configured to determine the game state represented by the game image based on the main virtual character's own state in the game image and the game environment state of the main virtual character in the game image.

[0246] In some embodiments, the first determination unit includes: an element descriptor acquisition unit, configured to acquire element descriptors for each image block; a target image block determining unit, configured to determine, among multiple image blocks, a target image block whose element descriptor represents the main virtual character; and a target determining unit, configured to use the text description statement corresponding to the target image block as a text description statement related to the main virtual character.
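The split between the main virtual character's own state and its game environment state can be sketched as follows. This is a minimal illustration: the `Block` record, the descriptor string "main virtual character", and the dictionary returned as the game state are hypothetical names chosen for the example.

```python
from dataclasses import dataclass

@dataclass
class Block:
    descriptor: str   # element descriptor of the image block
    description: str  # text description statement of the image block

def determine_game_state(blocks, main_descriptor="main virtual character"):
    """Separate statements about the main character from those about its environment."""
    # Target image blocks: those whose descriptor represents the main virtual character.
    own = [b.description for b in blocks if b.descriptor == main_descriptor]
    # All remaining statements describe the game environment.
    environment = [b.description for b in blocks if b.descriptor != main_descriptor]
    return {"own_state": own, "environment_state": environment}
```

A downstream component, for instance a large language model prompt or a rule table, would then turn this pair into the game state used to choose operation instructions.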

[0247] In some embodiments, the element descriptor acquisition unit is configured to: compress the text description statements of each image block to obtain the element descriptor of each image block.

[0248] In some embodiments, the element descriptor acquisition unit is further configured to: compress the text description statements of each image block by guiding the large language model through compression guidance text to obtain the element descriptors of each image block.

[0249] In other embodiments, the element descriptor acquisition unit is further configured to: acquire image features of each image block; match the image features of the image block in a feature library to determine target image features that match the image features of the image block; and use the element descriptors associated with the target image features in the feature library as element descriptors of the corresponding image block.

[0250] In some embodiments, the cropping module 1020 includes: a target detection unit, configured to perform target detection on the game image to obtain a target detection result, the target detection result including pixel position information of each game element in the game image and the semantic category of each game element; and a first cropping unit, configured to crop image blocks of each game element from the game image based on the pixel position information of each game element in the game image. Correspondingly, the element descriptor acquisition unit is configured to use the semantic category of each game element as the element descriptor of the image block where the game element is located.

[0251] In some embodiments, the operation instruction determination module 1040 includes: a use case determination unit, configured to determine a target game test case that matches the game state from a set of game test cases for the game application; and an operation instruction determination unit, configured to acquire operation instructions from the target game test case as operation instructions that match the game state represented by the game image. The game control device further includes: a benchmark response data acquisition module, configured to acquire benchmark response data from the target game test case as benchmark response data corresponding to the game state represented by the game image and the operation instructions.
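Selecting a test case and judging the response can be sketched as below. This is a hedged illustration: the `GameTestCase` record, exact string matching of game states, and the "pass"/"fail" result values are assumptions for the example; the application does not specify how matching or comparison is carried out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GameTestCase:
    game_state: str          # game state the case applies to
    operation: str           # operation instruction to simulate
    benchmark_response: str  # expected game response data

def select_test_case(game_state, cases):
    """Return the first test case matching the game state, or None if there is none."""
    for case in cases:
        if case.game_state == game_state:  # assumption: exact match
            return case
    return None

def judge(case, game_response):
    """Compare collected game response data with the benchmark response data."""
    return "pass" if game_response == case.benchmark_response else "fail"
```

The operation instruction and the benchmark response data both come from the same target test case, which keeps the expected result tied to the action that was actually simulated.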

[0252] In other embodiments, the cropping module 1020 includes: a center point coordinate acquisition unit, used to acquire the center point coordinates of each game element in the game image; a size prediction unit, used to predict the size of the detection box based on the center point coordinates and the game image using N different target size detection networks, to obtain N size prediction results corresponding to each center point coordinate, the size prediction results including predicted size information and prediction probability; N is a positive integer greater than 1; a probability threshold determination unit, used to determine a reference probability threshold corresponding to each center point coordinate based on the prediction probability among the N size prediction results for the same center point coordinate; a target result determination unit, used to determine the target size prediction result with the highest prediction probability exceeding the corresponding reference probability threshold among the N size prediction results for the same center point coordinate based on the reference probability threshold corresponding to each center point coordinate; and a second cropping unit, used to crop in the game image according to the predicted size information in the target size prediction results corresponding to each center point coordinate and the corresponding center point coordinate, to obtain an image block of the game element corresponding to the center point coordinate.
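The selection among N size predictions and the subsequent crop can be sketched as follows. The sketch makes assumptions that are not stated in this paragraph: the reference probability threshold is taken as the mean of the N prediction probabilities, predictions are `(width, height, probability)` tuples, and the crop box is clamped to the image bounds. The function names are hypothetical.

```python
def select_size(predictions):
    """Pick the size prediction with the highest probability above the reference threshold.

    predictions: list of (width, height, probability) from N size detection networks
    for one center point, with N greater than 1.
    """
    if len(predictions) < 2:
        raise ValueError("N must be greater than 1")
    # Assumption: the reference threshold is the mean prediction probability.
    threshold = sum(p for _, _, p in predictions) / len(predictions)
    candidates = [pred for pred in predictions if pred[2] > threshold]
    if not candidates:
        return None  # no prediction exceeds the threshold
    return max(candidates, key=lambda pred: pred[2])

def crop_box(center, size, image_width, image_height):
    """Return (left, top, right, bottom) of the block around center, clamped to the image."""
    cx, cy = center
    w, h = size
    left = max(0, int(round(cx - w / 2)))
    top = max(0, int(round(cy - h / 2)))
    right = min(image_width, int(round(cx + w / 2)))
    bottom = min(image_height, int(round(cy + h / 2)))
    return left, top, right, bottom
```

With an image library, the returned box could be passed to a crop call to obtain the image block of the game element at that center point.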

[0253] In some embodiments, the game image presents at least two game characters, and the game control device further includes: an image block set determination module, used to aggregate image blocks representing the same game character from N consecutively acquired game images according to the element descriptors of each image block, to obtain an image block set corresponding to each game character; and a trajectory generation module, used to perform trajectory fitting for each game character according to the position points corresponding to each image block in the corresponding image block set, in the order of the acquisition time of the source game images from first to last, to obtain the movement trajectory of the game character.
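Aggregating image blocks per game character and ordering their position points can be sketched as below. As a simplification, the "fitted" trajectory here is just the polyline through the time-ordered points; an actual implementation might fit a curve instead. The input layout and the function name `build_trajectories` are assumptions for illustration.

```python
from collections import defaultdict

def build_trajectories(frames):
    """Group position points by element descriptor and order them by acquisition time.

    frames: list of (acquisition_time, [(descriptor, (x, y)), ...]), one entry per
    consecutively acquired game image.
    """
    tracks = defaultdict(list)
    for acquisition_time, blocks in frames:
        for descriptor, point in blocks:
            # Blocks sharing a descriptor are treated as the same game character.
            tracks[descriptor].append((acquisition_time, point))
    # Order each character's points from earliest to latest source image.
    return {
        descriptor: [point for _, point in sorted(observations, key=lambda o: o[0])]
        for descriptor, observations in tracks.items()
    }
```

Because grouping relies only on element descriptors, two characters with the same descriptor would be merged; distinguishing them would need extra cues such as position continuity.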

[0254] In some embodiments, the semantic understanding module 1030 includes: an image encoding unit, used to encode the image blocks of each game element to obtain the image encoding features of each image block; and a text decoding unit, used to decode the text based on the image encoding features of each image block to obtain the text description statement of each image block.

[0255] Figure 11 shows a schematic diagram of a computer system suitable for implementing the electronic device of the embodiments of this application. The electronic device may be the second electronic device, the first electronic device, or a terminal as described above, and is used to implement the game control method provided in this application. It should be noted that the computer system 1600 of the electronic device shown in Figure 11 is merely an example and should not impose any limitation on the functionality and scope of use of the embodiments of this application.

[0256] As shown in Figure 11, the computer system 1600 includes a Central Processing Unit (CPU) 1601, which can perform various appropriate actions and processes, such as executing the methods described in the above embodiments, based on programs stored in Read-Only Memory (ROM) 1602 or programs loaded from storage section 1608 into Random Access Memory (RAM) 1603. The RAM 1603 also stores various programs and data required for system operation. The CPU 1601, ROM 1602, and RAM 1603 are interconnected via a bus 1604. An Input/Output (I/O) interface 1605 is also connected to the bus 1604.

[0257] The following components are connected to the I/O interface 1605: an input section 1606 including a keyboard, mouse, microphone, etc.; an output section 1607 including a cathode ray tube (CRT), liquid crystal display (LCD), speakers, etc.; a storage section 1608 including a hard disk, etc.; and a communication section 1609 including a network interface card such as a LAN (Local Area Network) card, a modem, etc. The communication section 1609 performs communication processing via a network such as the Internet. A drive 1610 is also connected to the I/O interface 1605 as needed. Removable media 1611, such as a magnetic disk, optical disk, magneto-optical disk, or semiconductor memory, are installed on the drive 1610 as needed so that computer instructions read from them can be loaded into the storage section 1608 as needed.

[0258] In particular, according to embodiments of this application, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of this application include a computer program product comprising computer instructions. When these computer instructions are executed by the central processing unit (CPU) 1601, various functions defined in the system of this application are performed.

[0259] This application also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the methods described in any of the above method embodiments.

[0260] It should be noted that the computer-readable medium shown in the embodiments of this application can be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. Computer-readable storage media can be, for example, but not limited to, electrical, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: electrical connections having one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), flash memory, optical fiber, portable compact disc read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof. In this application, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In this application, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. The computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to wireless, wired, etc., or any suitable combination thereof.

[0261] In the embodiments of this application, the terms "module" or "unit" refer to computer instructions or a portion of computer instructions that have a predetermined function and work together with other related parts to achieve a predetermined goal. These instructions can be implemented, wholly or partially, using software, hardware (e.g., processing circuitry or memory), or a combination thereof. Similarly, a processor (or multiple processors or memory) can be used to implement one or more modules or units. Furthermore, each module or unit can be part of an overall module or unit that functions as a whole.

[0262] The above are merely preferred embodiments of this application and are not intended to limit this application in any way. Although this application has disclosed preferred embodiments as above, it is not intended to limit this application. Any person skilled in the art can make some modifications or alterations to the above-disclosed technical content to create equivalent embodiments without departing from the scope of the technical solution of this application. Any simple modifications, equivalent changes and alterations made to the above embodiments based on the technical essence of this application without departing from the scope of the technical solution of this application shall still fall within the scope of the technical solution of this application.

Claims

1. A game control method, characterized in that the method includes: Acquiring game images captured during the operation of the game application; Extracting image blocks of multiple game elements from the game image; Performing semantic understanding on the image blocks of each game element to obtain text description statements for each image block; Determining, based on the text description statements of multiple image blocks in the game image, operation instructions adapted to the game state represented by the game image; Performing simulated operations on the game application according to the operation instructions.

2. The method according to claim 1, characterized in that, After performing simulated operations on the game application according to the operation instructions, the method further includes: Collect game response data of the game application in response to the operation command; The test results of the game application are determined based on the game response data and the benchmark response data corresponding to the game state represented by the game image and the operation instructions.

3. The method according to claim 1, characterized in that, The step of determining operation instructions adapted to the game state represented by the game image based on the text description statements of multiple image blocks in the game image includes: The game state represented by the game image is determined based on the text description statements of multiple image blocks in the game image; Based on the game state represented by the game image, determine the operation instructions that are compatible with the game state represented by the game image.

4. The method according to claim 3, characterized in that, The plurality of game elements includes at least a main virtual character; determining the game state represented by the game image based on the text descriptions of multiple image blocks in the game image includes: Among the text description statements of multiple image blocks in the game image, determine the text description statement related to the main virtual character; The state of the main virtual character in the game image is determined based on the text description statements related to the main virtual character. Based on the text description statements of the plurality of image blocks, excluding the text description statements related to the main virtual character, the game environment state of the main virtual character in the game image is determined; The game state represented by the game image is determined based on the main virtual character's own state in the game image and the game environment state of the main virtual character in the game image.

5. The method according to claim 4, characterized in that, The step of determining the text description statement related to the main virtual character from the text description statements of multiple image blocks in the game image includes: Obtain the element descriptors for each of the image blocks; Among the plurality of image blocks, determine the target image block whose element descriptor represents the main virtual character; The text description statement corresponding to the target image block is used as the text description statement related to the main virtual character.

6. The method according to claim 5, characterized in that, The step of obtaining the element descriptors for each of the image blocks includes: Text compression is performed on the text description statements of each image block to obtain the element descriptors of each image block.

7. The method according to claim 6, characterized in that, The text compression of the text description statements for each image block to obtain the element descriptors of each image block includes: By guiding a large language model with compression guidance text, the text description statements of each image block are compressed to obtain the element descriptors of each image block.

8. The method according to claim 5, characterized in that, The step of obtaining the element descriptors for each of the image blocks includes: Obtain the image features of each of the image blocks; The image features of the image block are matched in a feature library to determine the target image features that match the image features of the image block. The element descriptors associated with the target image features in the feature library are used as element descriptors for the corresponding image blocks.

9. The method according to claim 4, characterized in that, The step of extracting image blocks containing multiple game elements from the game image includes: The game image is subjected to object detection to obtain object detection results; the object detection results include the pixel position information of each game element in the game image and the semantic category of each game element; Based on the pixel position information of each game element in the game image, image blocks of each game element are extracted from the game image; The step of obtaining the element descriptors for each of the image blocks includes: The semantic category of each game element is used as the element descriptor of the image block in which the game element is located.

10. The method according to any one of claims 2 to 7, characterized in that, The step of determining operation instructions adapted to the game state represented by the game image, based on the game state represented by the game image, includes: From the set of game test cases for the game application, determine the target game test case that is adapted to the game state; Obtain the operation instructions from the target game test case, and use them as operation instructions that are adapted to the game state represented by the game image; Before determining the test result of the game application based on the game response data and the benchmark response data corresponding to the game state represented by the game image and the operation instructions, the method further includes: Obtain the benchmark response data from the target game test case as the benchmark response data corresponding to the game state represented by the game image and the operation instructions.

11. The method according to any one of claims 1 to 8, characterized in that, The step of extracting image blocks containing multiple game elements from the game image includes: Obtain the center point coordinates of each game element in the game image; The detection box size is predicted by N different target size detection networks based on the center point coordinates and the game image, resulting in N size prediction results corresponding to each center point coordinate. The size prediction results include predicted size information and prediction probability; N is a positive integer greater than 1. Based on the predicted probabilities from the N size prediction results for the same center point coordinates, determine the reference probability threshold corresponding to each center point coordinate. Based on the reference probability threshold corresponding to each center point coordinate, among the N size prediction results for the same center point coordinate, determine the target size prediction result with the highest prediction probability that exceeds the corresponding reference probability threshold. Based on the predicted size information and corresponding center point coordinates in the target size prediction results corresponding to the center point coordinates, the game image is cropped to obtain the image block of the game element corresponding to the center point coordinates.

12. The method according to any one of claims 5 to 9, characterized in that, The game image depicts at least two game characters; The method further includes: Based on the element descriptors of each image block, the image blocks representing the same game character from N consecutively acquired game images are aggregated to obtain the image block set corresponding to each game character; N is an integer greater than 1. For each game character, trajectory fitting is performed according to the position points corresponding to each image block in the corresponding image block set, in the order of the acquisition time of the source game images from first to last, to obtain the movement trajectory of the game character.

13. The method according to any one of claims 1 to 9, characterized in that, The step of performing semantic understanding on the image blocks of each game element to obtain text description statements for each image block includes: Image encoding is performed on the image blocks of each game element to obtain the image encoding features of each image block; Text decoding is performed based on the image encoding features of each image block to obtain the text description statement of each image block.

14. A game control device, characterized in that the device includes: An acquisition module, used to acquire game images captured during the operation of the game application; A cropping module, used to crop image blocks of multiple game elements from the game image; A semantic understanding module, used to perform semantic understanding on the image blocks of each game element to obtain the text description statements of each image block; An operation instruction determination module, used to determine operation instructions adapted to the game state represented by the game image based on the text description statements of multiple image blocks in the game image; A simulation operation module, used to perform simulated operations on the game application according to the operation instructions.

15. An electronic device, characterized in that the electronic device includes: a processor; and a memory storing computer instructions that, when executed by the processor, implement the method as described in any one of claims 1-13.

16. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores computer instructions that, when executed by a processor, implement the method described in any one of claims 1-13.

17. A computer program product comprising computer instructions, characterized in that, When executed by a processor, the computer instructions implement the method as described in any one of claims 1-13.