Information acquisition method, image processing method, device, and electronic device

By transmitting and correcting information between subnetworks of a neural network, the problem of inaccurate identification of object attributes and relationships in existing technologies is solved, enabling more efficient information acquisition and virtual reality interaction.

CN113191462B · Active · Publication Date: 2026-06-16 · BEIJING SAMSUNG TELECOM R&D CENT +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING SAMSUNG TELECOM R&D CENT
Filing Date
2020-01-13
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Existing technologies struggle to accurately identify the attributes and relationships of objects that are similar in appearance or occluded, resulting in low accuracy in information acquisition.

Method used

By employing a method of joint training of multiple sub-networks, information is transferred and corrected between the sub-networks of the neural network to obtain the attribute features and relationship features of objects, thereby achieving more accurate identification of the attribute information of objects and the relationship information between objects in images.

Benefits of technology

It improves the accuracy of identifying object attributes and relationship information, can better handle objects that are similar in appearance or are occluded, and enhances the natural interaction between virtual objects and real-world scenes.

✦ Generated by Eureka AI based on patent content.


Abstract

Embodiments of the present application provide an information acquisition method, an image processing method, an apparatus and an electronic device, and relate to the technical field of image processing. The information acquisition method comprises: acquiring attribute features of objects in an image and relationship features between the objects; correcting the relationship features according to the attribute features, acquiring relationship information between the objects according to the corrected relationship features, and / or correcting the attribute features according to the relationship features, acquiring attribute information of the objects according to the corrected attribute features. The information acquisition method provided by the embodiments of the present application can improve the accuracy of information acquisition.

Description

Technical Field

[0001] This application relates to the field of image processing technology, and more specifically, to an information acquisition method, an image processing method, an apparatus, and an electronic device.

Background Technology

[0002] Object detection is a technique in computer vision used to identify specific categories of objects from images or videos. In recent years, researchers have begun to explore using overall information from images to perform 3D object detection, that is, detecting the attribute information of objects, such as their location, category, and the relationships between different objects.

[0003] In existing technologies, multiple different neural networks are typically used to identify images separately, obtaining the attributes of objects and the relationships between different objects. However, in some cases, such as when different categories of objects with similar appearances appear in an image, or when objects are occluded, it is difficult to identify the attributes of objects and the relationships between different objects by relying solely on different neural networks, resulting in low accuracy in information acquisition.

Summary of the Invention

[0004] This application provides an information acquisition method, an image processing method, an apparatus, and an electronic device to address the low accuracy of information acquisition in the prior art. The technical solution is as follows:

[0005] In a first aspect, an information acquisition method is provided, the method comprising:

[0006] obtaining attribute features of objects in an image and relationship features between the objects;

[0007] correcting the relationship features based on the attribute features and obtaining relationship information between the objects based on the corrected relationship features, and / or correcting the attribute features based on the relationship features and obtaining attribute information of the objects based on the corrected attribute features.

[0008] In a second aspect, an image processing method is provided, the method comprising:

[0009] obtaining attribute information and relationship information of objects in an image;

[0010] adding virtual objects to the image based on the attribute information and the relationship information.

[0011] In a third aspect, an information acquisition device is provided, the device comprising:

[0012] The first acquisition module is used to acquire the attribute features of objects in the image and the relationship features between objects;

[0013] The correction module is used to correct relation features based on attribute features, obtain relation information between objects based on the corrected relation features, and / or correct attribute features based on relation features, obtain attribute information of objects based on the corrected attribute features.

[0014] In a fourth aspect, an image processing apparatus is provided, the apparatus comprising:

[0015] The second acquisition module is used to acquire attribute information and relationship information of objects in the image;

[0016] The adding module is used to add virtual objects to the image based on the attribute information and the relationship information.

[0017] In a fifth aspect, an electronic device is provided, comprising:

[0018] One or more processors;

[0019] Memory;

[0020] One or more applications, wherein the applications are stored in memory and configured to be executed by one or more processors, the applications being configured to: perform operations corresponding to the information acquisition method shown in the first aspect.

[0021] In a sixth aspect, an electronic device is provided, the electronic device comprising:

[0022] One or more processors;

[0023] Memory;

[0024] One or more applications, wherein the applications are stored in memory and configured to be executed by one or more processors, the applications being configured to: perform operations corresponding to the image processing method shown in the second aspect.

[0025] In a seventh aspect, a computer-readable storage medium is provided, the storage medium storing at least one instruction, at least one program, code set, or instruction set, wherein the at least one instruction, at least one program, code set, or instruction set is loaded and executed by a processor to implement the information acquisition method as described in the first aspect.

[0026] In an eighth aspect, a computer-readable storage medium is provided, the storage medium storing at least one instruction, at least one program, code set, or instruction set, wherein the at least one instruction, at least one program, code set, or instruction set is loaded and executed by a processor to implement the image processing method as shown in the second aspect.

[0027] The beneficial effects of the technical solution provided in this application are:

[0028] This application provides an information acquisition method, an image processing method, an apparatus, and an electronic device. Compared with the prior art, the information acquisition method of this application uses multiple sub-networks of a neural network to detect feature regions and acquire the attribute features of objects in the image and the relationship features between objects. The multiple sub-networks are interconnected and transmit information to each other during the detection process. That is, the relationship features are corrected according to the attribute features, and the relationship information between objects is obtained according to the corrected relationship features, and / or the attribute features are corrected according to the relationship features, and the attribute information of objects is obtained according to the corrected attribute features, thereby more accurately identifying the attribute information of objects and the relationship information between objects in the target image.

[0029] The image processing method of this application inputs attribute information and / or relationship information into a rendering prediction network to obtain virtual position information, virtual pose information and virtual action information of virtual objects that can be rendered into the image. It can estimate the possible position, pose and action of virtual objects according to the category, pose and relationship of real objects in the image, thereby realizing natural interaction between virtual and reality.

Attached Figure Description

[0030] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments of this application will be briefly introduced below.

[0031] Figure 1 is a schematic diagram of a prior art method for estimating the three-dimensional pose of an object;

[0032] Figure 2a is a schematic diagram of the object to be identified in one example;

[0033] Figure 2b is a schematic diagram of the overall scene in one example;

[0034] Figure 3a is a schematic diagram of the object to be identified in one example;

[0035] Figure 3b is a schematic diagram of the overall scene in one example;

[0036] Figure 4a is a schematic diagram of an existing image recognition scheme;

[0037] Figure 4b is a schematic diagram of an existing image recognition scheme;

[0038] Figure 4c is a schematic diagram of an existing image recognition scheme;

[0039] Figure 4d is a schematic diagram of an existing image recognition scheme;

[0040] Figure 5 is a schematic flowchart of an information acquisition method provided in an embodiment of this application;

[0041] Figure 6 is a schematic diagram illustrating a sub-network mutual correction scheme provided in an embodiment of this application;

[0042] Figure 7 is a schematic diagram illustrating a sub-network mutual correction scheme provided in an embodiment of this application;

[0043] Figure 8 is a flowchart illustrating an information acquisition method in one example of this application;

[0044] Figure 9 is a schematic diagram illustrating information retrieval in one example of this application;

[0045] Figure 10 is a schematic diagram illustrating information retrieval in one example of this application;

[0046] Figure 11 shows a scene and the virtual character to be added in one example;

[0047] Figure 12 is a schematic diagram illustrating the rendering of virtual characters in existing technologies;

[0048] Figure 13 is a schematic diagram illustrating the rendering of a virtual character in one example of this application;

[0049] Figure 14 is a schematic flowchart of an image processing method provided in an embodiment of this application;

[0050] Figure 15 is a schematic diagram of an image processing method in one example of this application;

[0051] Figure 16 is a schematic diagram illustrating the image recognition and rendering process in one example of this application;

[0052] Figure 17 is a schematic diagram of an image processing method in one example of this application;

[0053] Figure 18 is a schematic diagram of an image processing method in one example of this application;

[0054] Figure 19 is a schematic diagram of an image processing method in one example of this application;

[0055] Figure 20 is a schematic diagram of an information acquisition device provided in an embodiment of this application;

[0056] Figure 21 is a schematic diagram of the structure of an image processing device provided in an embodiment of this application;

[0057] Figure 22 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0058] The embodiments of this application are described in detail below. Examples of these embodiments are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and are only used to explain this application, and should not be construed as limiting the invention.

[0059] Those skilled in the art will understand that, unless specifically stated otherwise, the singular forms “a,” “an,” “said,” and “the” used herein may also include the plural forms. It should be further understood that the term “comprising” as used in this application means the presence of the stated features, integers, steps, operations, elements, and / or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and / or groups thereof. It should be understood that when an element is said to be “connected” or “coupled” to another element, it can be directly connected or coupled to the other element, or there may be intermediate elements. Furthermore, “connected” or “coupled” as used herein can include wireless connections or wireless coupling. The term “and / or” as used herein includes all or any units and all combinations of one or more associated listed items.

[0060] To make the objectives, technical solutions, and advantages of this application clearer, the embodiments of this application will be described in further detail below with reference to the accompanying drawings.

[0061] Object detection is a technique in computer vision used to detect objects of a specific category from images or videos. Specifically, for an input image, it can provide the bounding box for each object and its corresponding category label, and it is widely applied in computer vision. In the past, object detection was typically performed on two-dimensional color images (2D RGB), resulting in bounding boxes that were also two-dimensional rectangles. However, in three-dimensional scenes containing depth data (RGB-D), simply obtaining the 2D bounding box is insufficient; it is also necessary to accurately estimate the object's 3D pose to obtain a three-dimensional (cuboid) bounding box.

[0062] As shown in Figure 1, when performing object detection and 3D pose estimation, the original input image is subjected to multiple convolution operations to obtain features at different scales. Then, the features at each scale are concatenated and fused together, and object classification and bounding box regression are performed on this basis.

[0063] Existing methods typically only utilize local information in an image to estimate the 3D pose of an object. Multiple different detection networks are usually used to detect the image separately. However, in some cases, such as objects of different categories that look similar or occluded objects, using only one network to detect a local part of the image is not enough to accurately determine the object's category and pose.

[0064] As shown in Figure 2a, it is difficult to tell from Figure 2a alone whether the object shown is a painting or a television. As shown in Figure 2b, if the entire image is considered holistically and the relationships between objects within the image are analyzed, more accurate results can be obtained. For example, in Figure 2b, based on the fact that the object is hanging on the wall and located behind the sofa, it is easy to deduce that it is a picture frame rather than a television.

[0065] Similarly, as shown in Figure 3a, it is difficult to judge the category and orientation of the object from Figure 3a alone. Combined with Figure 3b, based on the object's relationships to its surroundings (beside the bed, on its left side, and under the lamp), it can be determined that the object is a bedside table.

[0066] In the past two years, some researchers in academia have begun to try to use the overall information of images to improve the performance of 3D object detection.

[0067] Currently, there are several methods for 3D image detection:

[0068] 1) As shown in Figures 4a and 4b, for the input image, preliminary object detection and pose estimation are first performed. Then, the scene is remodeled using a pre-prepared CAD model. For both the input image and the modeled scene, the surface normal, depth map, and object mask are extracted and compared pairwise to correct the object pose estimation results. After correction, rendering and comparison are performed again, iteratively improving the accuracy of object pose estimation.

[0069] While this method leverages information from the entire scene to improve object pose estimation performance to some extent, it still has several shortcomings: First, the initial object detection and pose estimation, as well as the surface normal directions, depth maps, and object masks used for comparison, are all extracted separately from the input image. There is no information transfer between these extraction processes. Second, CAD models of the objects are required to render a scene map based on the object category and pose for comparison. However, in practice, accurate models of various objects in the scene are unavailable. Using coarse approximations inevitably leads to significant differences in features between the rendered and input images, even if the object pose and category are correctly estimated.

[0070] 2) As shown in Figure 4c, a holistic 3D indoor scene understanding approach can also be adopted. Holistic 3D indoor scene understanding refers to understanding indoor scenes in a 3D environment by jointly considering object bounding boxes, room layout, and camera pose. Given only a single RGB image, the model in Figure 4c can simultaneously solve all three tasks: 2D detection, global 3D inference, and 2D projection. Essentially, this method improves prediction by parameterizing the target rather than directly estimating it. Rather than training the different modules separately, it employs three cooperative losses (3D bounding boxes, 2D projection, and physical constraints) to estimate geometrically consistent and physically plausible 3D scenes.

[0071] 3) As shown in Figure 4d, a dense fusion model is proposed as a general framework for estimating the 6D pose of a set of known objects from RGB-D images by making full use of the two complementary data sources (color and depth). The model is a heterogeneous architecture that processes the two data sources separately and uses a novel dense fusion network to extract pixel-level dense feature embeddings from which the pose is estimated.

[0072] To estimate the 3D pose of an object, existing techniques typically employ multiple sub-networks to detect images separately. These sub-networks are used to perform three tasks: object category, object pose, and object relationship recognition. Each sub-network contains multiple convolutional / fully connected layers. Existing methods train these three networks as independent tasks, or only address one or two of these tasks. During independent training, features extracted by different networks cannot be transferred between them; therefore, it is impossible to update the current network's features using information from other networks.

[0073] Therefore, this invention proposes a method to jointly train three related tasks—object category, pose, and inter-object relationship recognition—to improve system performance. Specifically, a gated message passing system is added after each of the three branches to achieve feature refinement among the three networks, and the refined features are used for final recognition.
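The idea of gated message passing between branches can be illustrated with a small sketch. The application does not define the gate at this point, so everything below is an assumption made for illustration: the gate is taken to be a sigmoid of the dot product between the receiving and sending feature vectors, all vectors have the same length, and the names `gated_message` and `refine` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_message(receiver, sender, gate_weight, gate_bias=0.0):
    """Scale the sender's features by a gate computed from both branches.

    Assumed gate: sigmoid of a weighted dot product of receiver and sender.
    """
    dot = sum(r * s for r, s in zip(receiver, sender))
    gate = sigmoid(gate_weight * dot + gate_bias)
    return [gate * s for s in sender], gate


def refine(receiver, senders, gate_weight=1.0):
    """Add the gated messages from all other branches to the receiver."""
    refined = list(receiver)
    for sender in senders:
        message, _ = gated_message(receiver, sender, gate_weight)
        refined = [x + m for x, m in zip(refined, message)]
    return refined


# Toy feature vectors standing in for the outputs of the three branches.
category = [1.0, 0.0]
pose = [0.0, 2.0]       # orthogonal to category, so the gate is sigmoid(0)
relation = [1.0, 1.0]

refined_category = refine(category, [pose, relation])
```

The gate lets each branch decide how much of another branch's features to absorb; a gate near zero leaves the receiving features almost unchanged.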

[0074] To address at least one of the aforementioned technical problems or areas requiring improvement in the prior art, embodiments of this application provide an image detection method, apparatus, electronic device, and computer-readable storage medium. The image detection method of this application can employ multiple sub-networks of a neural network to detect feature regions, obtaining attribute features of objects in the image and relationship features between objects. The multiple sub-networks are interconnected and exchange information during the detection process, that is, the relationship features are corrected based on the attribute features, and the relationship information between objects is obtained based on the corrected relationship features, and / or the attribute features are corrected based on the relationship features, and the attribute information of objects is obtained based on the corrected attribute features, thereby more accurately identifying the attribute information of objects and the relationship information between objects in the target image.

[0075] The technical solution of this application and how the technical solution of this application solves the above-mentioned technical problems are described in detail below with specific embodiments. These specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments. The embodiments of this application will now be described with reference to the accompanying drawings.

[0076] An embodiment of this application provides a possible implementation. As shown in Figure 5, an information acquisition method is provided, which may include the following steps:

[0077] Step S501: Obtain the attribute features of objects in the image and the relationship features between objects.

[0078] The image can be a color image or a color image containing depth information; the feature region of the image can be a region formed by related object pairs in the image.

[0079] In the specific implementation process, images can be acquired through image acquisition devices such as AR (Augmented Reality) devices.

[0080] Specifically, existing basic networks can be used to perform feature extraction and object detection on images. After obtaining candidate bounding boxes, the object detection module combines objects in pairs, filters out related object pairs, forms feature regions, and thus obtains the image.

[0081] Specifically, obtaining the attribute features of objects in an image and the relationship features between objects can include:

[0082] Images are input into a neural network for recognition. During the recognition process, the intermediate layers of each subnetwork of the neural network output attribute features and relational features.

[0083] Step S502: Correct the relation features according to the attribute features, obtain the relation information between objects according to the corrected relation features, and / or correct the attribute features according to the relation features, obtain the attribute information of the objects according to the corrected attribute features.

[0084] The attribute information includes the object's category information and the object's pose information; the neural network includes multiple sub-networks, including a category recognition network for recognizing category information, a pose recognition network for recognizing pose information, and a relationship recognition network for recognizing relationship information.

[0085] One possible implementation of this application embodiment is that the attribute information includes the object's category information and the object's posture information.

[0086] Specifically, the object's category information indicates what kind of object it is, the object's pose information can be the rotation angle of the object in the target image, and the relationship between objects can include the object's action information, as well as the connection between two objects. For example, "reading" in "a person reading" refers to the relationship, and "a painting hanging on the wall" can also refer to the relationship of "hanging on...".
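To make the distinction between attribute information and relationship information concrete, the sketch below shows one way the recognition results could be represented. The class and field names (`ObjectAttributes`, `Relation`, `rotation_deg`) are hypothetical and not taken from the application, and the pose is simplified to a single rotation angle; the example values mirror the "painting hanging on the wall" case.

```python
from dataclasses import dataclass


@dataclass
class ObjectAttributes:
    """Attribute information of one object: category plus pose."""
    category: str          # what kind of object it is
    rotation_deg: float    # pose, simplified here to one rotation angle


@dataclass
class Relation:
    """Relationship information between two objects."""
    subject: int           # index of the subject in the object list
    predicate: str         # e.g. "hanging on", "reading"
    obj: int               # index of the related object


objects = [
    ObjectAttributes(category="painting", rotation_deg=0.0),
    ObjectAttributes(category="wall", rotation_deg=0.0),
]
relations = [Relation(subject=0, predicate="hanging on", obj=1)]


def describe(rel: Relation, objs) -> str:
    """Render a relation as a readable phrase."""
    return f"{objs[rel.subject].category} {rel.predicate} {objs[rel.obj].category}"


summary = describe(relations[0], objects)
```

Keeping relations as index triples over the object list makes it straightforward to look up the attributes of both participants when a relation is used downstream.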

[0087] It is understandable that while identifying the category of an object, its location can be directly identified, meaning the object's location information can be obtained directly.

[0088] As shown in Figure 6, during the recognition process, multiple sub-networks can exchange information to correct the attribute features of objects and the relationship features between objects, thereby obtaining the attribute information and relationship information of the objects.

[0089] In the above embodiments, multiple sub-networks of a neural network are used to detect feature regions, thereby obtaining the attribute features of objects in the image and the relationship features between objects. The multiple sub-networks work together and exchange information during the detection process. That is, the relationship features are corrected according to the attribute features, and the relationship information between objects is obtained according to the corrected relationship features, and / or the attribute features are corrected according to the relationship features, and the attribute information of objects is obtained according to the corrected attribute features, thereby more accurately identifying the attribute information of objects and the relationship information between objects in the target image.

[0090] In one possible implementation of this application embodiment, before step S501, it may further include:

[0091] (1) Obtain the initial image, and extract features from the initial image based on the feature extraction network to obtain the shared feature map;

[0092] (2) Identify objects in the shared feature map based on the recognition network;

[0093] (3) Select related object pairs from the identified objects, form feature regions based on the related object pairs, and use the feature regions as images.

[0094] Specifically, the feature extraction network can use the VGG16 network, which is a convolutional neural network containing 16 convolutional and fully connected layers in total. The VGG network simplifies the neural network structure. The recognition network can use the Faster R-CNN network, which is a neural network that includes convolutional layers, an RPN (Region Proposal Network), RoI pooling, and classification and regression networks.

[0095] In the specific implementation process, other networks can also be used for feature extraction of the initial image and recognition of the shared feature map, and no restrictions are imposed here.

[0096] Specifically, after identifying objects in the shared feature map, the objects are paired up to filter out related object pairs and form candidate regions.

[0097] One possible implementation of this application embodiment involves inputting an image into a neural network for recognition. During the recognition process, the intermediate layer structures of each subnetwork of the neural network output attribute features and relationship features, which may include:

[0098] a. Obtain the object features in the image, input the object features into the category recognition network, and obtain the category features output by the intermediate layer structure of the category recognition network;

[0099] b. Obtain the pose features in the image, input the pose features into the pose recognition network, and obtain the pose features output by the intermediate layer structure of the pose recognition network.

[0100] c. Obtain the scene features of the image, input the scene features into the relationship recognition network, and obtain the relationship features output by the intermediate layer structure of the relationship recognition network.

[0101] In one possible implementation of the embodiments of this application, as shown in Figure 7, step S502, which involves correcting the relationship features based on the attribute features and obtaining the relationship information between objects based on the corrected relationship features, and / or correcting the attribute features based on the relationship features and obtaining the attribute information of the objects based on the corrected attribute features, may include:

[0102] (1) Correct the relation features based on the attribute features to obtain the corrected relation features, and / or correct the attribute features based on the relation features to obtain the corrected attribute features.

[0103] Among them, attribute features include category features obtained in the process of identifying the category information of objects in an image, and pose features obtained in the process of identifying the pose information of objects in an image; relation features are features obtained in the process of identifying the relationship information between objects in an image.

[0104] In the actual implementation process, when recognizing images, the relation features are corrected based on the attribute features, and / or the attribute features are corrected based on the relation features, but the parameters of each sub-network are not changed.

[0105] (2) Input the corrected relation features and / or corrected attribute features into the next layer of the intermediate layer of each sub-network to continue identification, and obtain attribute information and relation information.

[0106] Specifically, the corrected first attribute information is input into the next layer of the intermediate layer of the category recognition network for further recognition, obtaining the object's category information; the corrected second attribute information is input into the next layer of the intermediate layer of the pose recognition network for further recognition, obtaining the object's pose information; and the corrected relation features are input into the next layer of the intermediate layer of the relation recognition network for further recognition, obtaining the relation information between objects. In the specific implementation, the category recognition network, pose recognition network, and relation recognition network can all use CNN (Convolutional Neural Networks), Faster R-CNN, and YOLO (You Only Look Once: Unified, Real-Time Object Detection), etc., and the specific network type used is not limited here.

[0107] Taking the correction of the output of a certain layer of the category recognition network as an example, the category features output by that layer can be combined with part of the pose features output by a layer of the pose recognition network and part of the relation features output by a layer of the relation recognition network to obtain the corrected category features of that layer. The corrected category features are then input into the next layer of the category recognition network to obtain the final category information of the object.
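The flow described above (intermediate features are corrected, then passed to the next layer of the same sub-network) can be sketched in plain Python. This is an illustration under stated assumptions, not the application's network: each sub-network is reduced to a hypothetical `Branch` with one intermediate layer and one following layer, weights are random, all branches receive the same stand-in input, the intermediate features share one dimension, and the correction is a fixed weighted sum.

```python
import random

random.seed(0)
DIM, N_CLASSES = 4, 3  # assumed feature size and number of categories


def linear(vec, weights):
    """Multiply a feature vector by a weight matrix (list of rows)."""
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*weights)]


def relu(vec):
    return [max(v, 0.0) for v in vec]


def random_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


class Branch:
    """Hypothetical sub-network: one intermediate layer and one next layer."""

    def __init__(self, out_dim):
        self.w_mid = random_matrix(DIM, DIM)
        self.w_next = random_matrix(DIM, out_dim)

    def intermediate(self, x):
        return relu(linear(x, self.w_mid))

    def finish(self, features):
        return linear(features, self.w_next)


category_net, pose_net, relation_net = Branch(N_CLASSES), Branch(1), Branch(2)
x = [random.uniform(-1, 1) for _ in range(DIM)]  # stand-in input features

# Step 1: each sub-network outputs its intermediate features.
f_cat = category_net.intermediate(x)
f_pose = pose_net.intermediate(x)
f_rel = relation_net.intermediate(x)

# Step 2: correct the category features with the other branches' features.
# Illustrative weights; the sub-network parameters themselves are not changed.
a11, a12, a13 = 0.6, 0.2, 0.2
f_cat_corrected = [a11 * c + a12 * p + a13 * r
                   for c, p, r in zip(f_cat, f_pose, f_rel)]

# Step 3: continue recognition in the next layer of the category network.
scores = category_net.finish(f_cat_corrected)
predicted_category = scores.index(max(scores))
```

The same three steps would apply to the pose and relation branches, each with its own set of weights.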

[0108] One possible implementation of this application embodiment, which corrects relation features based on attribute features to obtain corrected relation features, and / or corrects attribute features based on relation features to obtain corrected attribute features, may include:

[0109] a. Obtain the corrected category features based on the category features and pose features in the attribute features, the relation features, and a preset first weight coefficient array;

[0110] b. Obtain the corrected pose features based on category features, pose features, relation features, and a preset second weight coefficient array;

[0111] c. Obtain the corrected relation features based on category features, pose features, relation features, and a preset third weight coefficient array.

[0112] The first weight coefficient array may include the weight coefficients of the category features, the pose features, and the relation features used when correcting the category features. Similarly, the second weight coefficient array may include the weight coefficients of the category features, the pose features, and the relation features used when correcting the pose features, and the third weight coefficient array may include the weight coefficients of the category features, the pose features, and the relation features used when correcting the relation features.

[0113] Taking the correction of the category features as an example, the correction can be performed as follows:

[0114] The first weight coefficient array is $[a_{11}, a_{12}, a_{13}]$. The corrected category features can be expressed as:

[0115] $$\hat{A}_1 = a_{11} A_1 + a_{12} A_2 + a_{13} A_3 \qquad (1)$$

[0116] In the formula, $A_1$ represents the category features; $A_2$ represents the pose features; $A_3$ represents the relation features; and $\hat{A}_1$ represents the corrected category features.

[0117] Similarly, the corrected pose features can be calculated based on the second weight coefficient array, and the corrected relation features can be calculated based on the third weight coefficient array.

[0118] The values of the first, second, and third weight coefficient arrays can be set according to the importance of the category features, pose features, and relationship features.
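The weighted correction described above can be sketched as follows. This is a minimal illustration, assuming the correction is an element-wise weighted sum and that the three feature vectors have the same length; the function names, the feature values, and the weight values are hypothetical and are not taken from this application.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def weighted_correction(features: Sequence[Vector], weights: Sequence[float]) -> Vector:
    """Element-wise weighted sum of feature vectors (assumed form of the correction)."""
    if len(features) != len(weights):
        raise ValueError("one weight per feature vector is required")
    length = len(features[0])
    if any(len(f) != length for f in features):
        raise ValueError("all feature vectors must have the same length")
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(length)]


def mutual_correction(
    category: Vector,
    pose: Vector,
    relation: Vector,
    weight_arrays: Sequence[Sequence[float]],
) -> Tuple[Vector, Vector, Vector]:
    """Apply the first, second, and third weight coefficient arrays in turn."""
    feats = [category, pose, relation]
    corrected = [weighted_correction(feats, row) for row in weight_arrays]
    return corrected[0], corrected[1], corrected[2]


# Hypothetical intermediate-layer features and weight coefficient arrays.
category = [1.0, 0.0]
pose = [0.0, 2.0]
relation = [4.0, 4.0]
weight_arrays = [
    [0.5, 0.25, 0.25],  # first array: used to correct the category features
    [0.25, 0.5, 0.25],  # second array: used to correct the pose features
    [0.25, 0.25, 0.5],  # third array: used to correct the relation features
]

corrected_category, corrected_pose, corrected_relation = mutual_correction(
    category, pose, relation, weight_arrays
)
```

In this sketch each corrected feature gives the largest weight to its own sub-network's output, which is one way of reflecting "importance"; other weightings are equally possible.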

[0119] In the above embodiments, the category features, pose features, and relationship features are mutually corrected to obtain corrected category features, corrected pose features, and corrected relationship features. Each kind of information is thus corrected by combining it with the other two, which improves the accuracy of the category information, the pose information, and the relationship information between objects.

[0120] In one possible implementation of this application embodiment, before the feature region is identified based on multiple sub-networks of the recognition neural network, the method may further include:

[0121] The initial recognition neural network is trained based on multiple sample images to obtain the recognition neural network; each sample image has corresponding object attribute information and relationship information between objects.

[0122] Specifically, multiple sample images with preset object attribute information and inter-object relationship information are input into an initial recognition neural network. The initial recognition neural network includes three interconnected sub-networks that perform mutual information correction during the recognition process. The parameters of the initial recognition neural network are adjusted based on the real object attribute information and real inter-object relationship information set for the sample images, together with the attribute information and relationship information output by the initial recognition neural network. For example, a loss value can be calculated between the real attribute information and real relationship information on the one hand and the attribute information and relationship information output by the initial recognition neural network on the other. The parameters of the initial recognition neural network are adjusted based on the loss value until the calculated loss value is less than a preset threshold, thus obtaining the trained recognition neural network.

[0123] It should be noted that during the training process, the recognition information of the three sub-networks of the initial neural network is mutually corrected, and the parameters of the three sub-networks are continually adjusted. When the trained neural network recognizes images, the three sub-networks still transfer information during the recognition process, that is, the recognition information is mutually corrected, but the network parameters of the three sub-networks do not change.

[0124] Alternatively, the initial recognition neural network can be trained a preset number of times to obtain the trained recognition neural network. The specific training method for the initial recognition neural network is not limited here.
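The two stopping rules described above (loss below a preset threshold, or a preset number of training iterations) can be sketched as a single loop. This is an illustration only: a hypothetical one-parameter model trained by gradient descent stands in for the recognition neural network, and the loss function, learning rate, and sample data are assumptions rather than details from this application.

```python
from typing import Callable, List, Tuple


def train(step: Callable[[], float], loss_threshold: float, max_iterations: int) -> Tuple[int, float]:
    """Run step() until its loss drops below loss_threshold or max_iterations is reached."""
    loss = float("inf")
    iterations = 0
    while iterations < max_iterations:
        loss = step()  # one parameter update; returns the loss after the update
        iterations += 1
        if loss < loss_threshold:
            break
    return iterations, loss


class ToyModel:
    """Hypothetical stand-in for the network: predicts y = w * x."""

    def __init__(self) -> None:
        self.w = 0.0

    def loss(self, samples: List[Tuple[float, float]]) -> float:
        # Mean squared error between predictions and the labelled ("real") values.
        return sum((self.w * x - y) ** 2 for x, y in samples) / len(samples)

    def step(self, samples: List[Tuple[float, float]], lr: float) -> float:
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (self.w * x - y) * x for x, y in samples) / len(samples)
        self.w -= lr * grad
        return self.loss(samples)


samples = [(1.0, 2.0), (2.0, 4.0)]  # hypothetical labelled samples (y = 2x)

# Rule 1: stop once the loss is below a preset threshold.
model_a = ToyModel()
iterations_a, loss_a = train(lambda: model_a.step(samples, 0.1), 1e-3, 100)

# Rule 2: train a preset number of times (a threshold of 0.0 is never met).
model_b = ToyModel()
iterations_b, loss_b = train(lambda: model_b.step(samples, 0.1), 0.0, 3)
```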

[0125] The aforementioned information acquisition method uses multiple sub-networks of a neural network to detect feature regions, thereby acquiring the attribute features of objects in the image and the relationship features between objects. The multiple sub-networks work together and exchange information during the detection process, that is, they correct the relationship features based on the attribute features, and obtain the relationship information between objects based on the corrected relationship features, and / or correct the attribute features based on the relationship features, and obtain the attribute information of objects based on the corrected attribute features, thereby more accurately identifying the attribute information of objects and the relationship information between objects in the target image.

[0126] Furthermore, the category features, pose features, and relationship features are mutually corrected to obtain corrected category features, corrected pose features, and corrected relationship features. Each kind of information is thus corrected by combining it with the other two, which improves the accuracy of the category information, the pose information, and the relationship information between objects.

[0127] To better understand the above information acquisition method, the following details an example of information acquisition according to the present invention:

[0128] In one example, as shown in Figure 8, the information acquisition method provided in this application may include the following steps:

[0129] 1) Obtain the image to be identified, extract image features based on the VGG16 network, and obtain shared features;

[0130] 2) The Faster R-CNN network is used to perform object recognition on shared features to obtain feature regions;

[0131] 3) Based on the feature region, extract the candidate object region, the region surrounding the candidate object, and the region containing the associated object pair; where the candidate object region represents the region where the object is located in the feature region; the region surrounding the candidate object represents the region surrounding the object in the feature region; and the region containing the associated object pair represents the region containing the associated object pair in the feature region.

[0132] 4) Input the candidate object region, the region surrounding the candidate object, and the region where the associated object pair is located into the category recognition network to detect the category features of the object, into the pose recognition network to detect the pose features of the object, and into the relationship recognition network to detect the scene graph features of the object;

[0133] 5) During the detection process, the category recognition network, pose recognition network, and relationship recognition network perform feature corrections on each other, i.e., information correction;

[0134] 6) The detection networks output the attribute information of the objects, i.e., the object category (such as a person, a hat, or a kite) and the pose information, and output the relationship information between objects, i.e., the scene graph (such as a person wearing a hat, a person flying a kite, and a person standing on grass).

[0135] In the example above, the category recognition network, pose recognition network, and relationship recognition network are interconnected and perform mutual information correction during the detection process. This means that attribute information and relationship information between objects can be mutually corrected, thereby more accurately identifying the attribute information of objects and the relationship information between objects in the target image.
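The three kinds of regions named in step 3) can be illustrated with axis-aligned bounding boxes. The sketch below is only one plausible construction and is not specified by this application: it assumes boxes in `(x1, y1, x2, y2)` form, takes the surrounding region to be the object box enlarged by a hypothetical `margin` ratio and clipped to the image, and takes the region of an associated object pair to be the smallest box enclosing both objects.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed convention


def surrounding_region(box: Box, image_size: Tuple[float, float], margin: float = 0.5) -> Box:
    """Enlarge the object box by `margin` times its size on each side, clipped to the image."""
    x1, y1, x2, y2 = box
    width, height = image_size
    dx = (x2 - x1) * margin
    dy = (y2 - y1) * margin
    return (max(0.0, x1 - dx), max(0.0, y1 - dy), min(width, x2 + dx), min(height, y2 + dy))


def pair_region(a: Box, b: Box) -> Box:
    """Smallest box that encloses both objects of an associated pair."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


# Hypothetical candidate object regions in a 200 x 150 image.
person: Box = (10.0, 20.0, 50.0, 100.0)
hat: Box = (20.0, 5.0, 40.0, 25.0)

person_surroundings = surrounding_region(person, (200.0, 150.0))
person_hat_region = pair_region(person, hat)
```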

[0136] The image detection method presented in this application supports 3D scene understanding, which encompasses object detection, pose estimation, and inter-object relationship recognition. Compared with existing methods that train each task independently, the joint estimation module in this application makes fuller use of the entire scene and the relationships between objects, which improves the accuracy of its results. Its detection results can be applied not only to augmented reality systems but also to smart homes, autonomous driving, and security applications.

[0137] Furthermore, the joint estimation module can also provide necessary information as input for other applications. For example, in a smart home, the object relationships identified in this application can be used, as shown in Figure 9, to recognize events such as "person-fall-to the ground" and issue an alarm to remind the user.

[0138] When objects are occluded, the system can better identify the category and pose of the occluded object by utilizing information from surrounding objects. For example, as shown in Figure 10, chair 2 on the right is largely obscured by the table in front and chair 1 on the left, making it difficult for existing methods to identify its object category and three-dimensional pose. Using the joint training module in this invention, its three-dimensional pose and object category can be identified more accurately.
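As a small illustration of how a downstream application might consume the recognized relationships, the sketch below scans relationship triples for events that should raise an alarm. The triple format `(subject, predicate, object)`, the label strings, and the function name are hypothetical; this application does not specify how the smart-home application represents or matches events.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object), assumed format

# Hypothetical set of relationship triples that should trigger an alarm.
ALERT_TRIPLES: Set[Triple] = {("person", "fall", "ground")}


def find_alerts(relationships: Iterable[Triple], alert_triples: Set[Triple] = ALERT_TRIPLES) -> List[Triple]:
    """Return the recognized relationships that match an alarm-worthy event."""
    return [triple for triple in relationships if triple in alert_triples]


# Hypothetical relationship information output by the joint estimation module.
detected = [("person", "fall", "ground"), ("chair", "next to", "table")]
alerts = find_alerts(detected)
for subject, predicate, obj in alerts:
    print(f"ALERT: {subject} {predicate} {obj}")
```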

[0139] This application also proposes a virtual object prediction module, which predicts the possible position and posture of virtual objects in the scene, as well as their relationship with surrounding objects, based on the posture relationships of real objects in the scene, so that virtual objects can have realistic and natural interactions with the surrounding environment.

[0140] To make the objectives, technical solutions, and advantages of this application clearer, the prior art for image processing will be briefly described below with reference to the accompanying drawings.

[0141] When an augmented reality system adds a virtual character to a chair in a real-world scene next to a bookshelf, the character might sit in the chair and read a book to interact realistically. If the chair faces a table with a laptop, the virtual character can sit in the chair and use the computer on the table. However, if the chair has its back to the table but faces a television, the virtual character might sit in the chair and watch TV. In general, this system estimates the possible positions, postures, and actions of virtual objects based on the categories, postures, and relationships of real-world objects in the scene, thus achieving natural interaction between the virtual and real worlds.

[0142] When an AR system needs to add a virtual character or object, such as the one shown on the right of Figure 11, to the scene shown on the left, for example adding a virtual character to the sofa in the image, the virtual character is usually rendered directly on the sofa without adjusting its pose according to the scene. The final generated image is as shown in Figure 12: the virtual character stands directly on the sofa, blending unnaturally with the surrounding scene.

[0143] However, using the image processing method of this application, it is possible to take the surrounding scene into account and render a person sitting on the sofa reading a book, as shown in Figure 13, so that the display effect is more realistic and natural.

[0144] The image processing method of this application will be described in detail below with reference to the embodiments and accompanying drawings.

[0145] This application provides one possible implementation. As shown in Figure 14, an image processing method is provided, which may include the following steps:

[0146] Step S1401: Obtain the attribute information and relationship information of the objects in the image;

[0147] Step S1402: Add virtual objects to the image based on attribute information and relationship information.

[0148] Specifically, the information acquisition method described in the above embodiments can be used to obtain the attribute information and / or relationship information of objects in the image.

[0149] Specifically, adding virtual objects to an image based on attribute and relationship information can include:

[0150] (1) Based on the attribute information and relationship information, obtain the virtual position information, virtual posture information and virtual action information of the virtual object;

[0151] In the specific implementation process, attribute information and relationship information can be input into the rendering prediction network to obtain virtual position information, virtual pose information and virtual action information of virtual objects that can be rendered into the image.

[0152] The virtual position information represents the position where the virtual object can be rendered, the virtual pose information represents the rotation angle of the virtual object, and the virtual action information represents the action of the virtual object.

[0153] Specifically, virtual objects can include virtual characters or virtual items.

[0154] Specifically, when virtual objects are rendered to an image with the predicted virtual position, pose, and motion information, a realistic and natural scene can be obtained.

[0155] In the specific implementation process, the rendering prediction network includes three sub-networks, which are used to predict the position information, the pose information, and the action information of the virtual object, respectively.

[0156] Of the three sub-networks: the position regression network uses object features as input and predicts the appropriate position of the virtual object through multiple convolutional, pooling, and fully connected layers; the pose prediction network is a regression network used to estimate the 3D pose of the virtual object in the scene; and the action candidate network predicts the relationship between the virtual object and its surrounding objects, and its output is a scene graph containing both the virtual object and the real object.

[0157] (2) Add virtual objects to the image based on virtual position information, virtual posture information and virtual action information.

[0158] Specifically, the virtual position information output by the rendering prediction network can include at least one position, the virtual pose information can include different poses of the virtual object at various positions, and the virtual action information can include at least one action of the virtual object. When multiple poses, positions, and actions are predicted, the user can select one pose, position, and action from the predicted multiple poses, positions, and actions, and render the virtual object according to the selected pose, position, and action.
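The candidate outputs and the user's selection described above could be organized as in the following sketch. It is a hypothetical data layout, not one defined by this application: positions are assumed to be 3D coordinates, each pose is assumed to be a single rotation angle in degrees, pose candidates are stored per position, and all names and values are illustrative.

```python
from dataclasses import dataclass
from typing import List, Tuple

Position = Tuple[float, float, float]


@dataclass
class Prediction:
    """Candidate outputs of the rendering prediction network (assumed layout)."""
    positions: List[Position]
    poses_by_position: List[List[float]]  # candidate rotation angles for each position
    actions: List[str]


@dataclass
class Placement:
    """The single position, pose, and action chosen for rendering."""
    position: Position
    pose: float
    action: str


def select_placement(pred: Prediction, position_index: int, pose_index: int, action_index: int) -> Placement:
    """Pick one position, one pose at that position, and one action."""
    return Placement(
        position=pred.positions[position_index],
        pose=pred.poses_by_position[position_index][pose_index],
        action=pred.actions[action_index],
    )


# Hypothetical prediction with two candidate positions and two candidate actions.
prediction = Prediction(
    positions=[(0.0, 0.0, 1.0), (2.0, 0.0, 1.5)],
    poses_by_position=[[0.0, 90.0], [180.0]],
    actions=["sit", "read"],
)
chosen = select_placement(prediction, position_index=1, pose_index=0, action_index=1)
```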

[0159] In the specific implementation process, based on virtual position information, virtual posture information, and virtual action information, the corresponding virtual objects are rendered in the image. This can be done by rendering the corresponding virtual objects in the image in a form that conforms to the virtual position information, virtual posture information, and virtual action information.

[0160] In the above embodiments, by inputting attribute information and relationship information into the rendering prediction network, virtual position information, virtual pose information and virtual action information of virtual objects that can be rendered into the image are obtained. The possible position, pose and action of the virtual object can be estimated according to the category, pose and relationship of the real object in the image, thereby realizing natural interaction between virtual and reality.

[0161] This application provides a possible implementation in which the three sub-networks of the rendering prediction network can also be combined with each other. While predicting the position information, the pose information, and the action information of the virtual object respectively, they can also perform information correction on each other, so that the interaction between the added virtual object and reality is more natural.

[0162] As shown in Figure 15, in one example, a color image containing a depth image is acquired and input into a joint estimation module for estimation; that is, the information acquisition method of the above embodiment is used to obtain the category information, pose information, and relationship information of the objects in the color image (the category, three-dimensional pose, and scene graph shown in the figure). The category information, pose information, and relationship information of the objects are then input into a rendering prediction network (the virtual object prediction module in the figure) for prediction, to obtain the virtual position information, virtual pose information, and virtual action information of the virtual object (the virtual object in the figure) that can be rendered into the image.

[0163] Taking Figure 16 as an example, for the scene shown in the upper left image, the joint estimation module of this application, i.e., the information acquisition method of this application, can obtain the category and 3D pose of the objects in the image and the relationships between them. When the AR system needs to add a virtual object (the yellow Pokemon) to the scene, the virtual object prediction module, i.e., the image processing method of this application, uses the previous recognition results to predict the possible location and pose of the virtual object and its relationship with surrounding real objects (upper right image). This prediction result is sent to the CG engine, and the virtual object is rendered relatively naturally in the real scene (lower image).

[0164] This application provides a possible implementation method, and the image processing method may further include:

[0165] (1) For each of the multiple sample images, obtain the scene parts other than the preset object in the sample image;

[0166] (2) The initial rendering prediction network is trained by taking the attribute information and relationship information of the objects in the scene as input and the position information, pose information and action information of the preset objects as output, and the rendering prediction network is obtained.

[0167] Specifically, for each sample image, the preset object in the sample image is separated from other scene parts, and then the attribute information and relationship information of other scene parts are obtained, as well as the position information, pose information and action information of the preset object, and the initial rendering prediction network is trained.

[0168] For example, if a sample image includes a person sitting in a chair, the person and the chair can be separated to obtain the chair's attribute information and the relationship between the chair and the ground. The person's position, posture, and action information can also be obtained. The initial rendering prediction network is trained using the person's actual position, posture, and action information as output and the chair's attribute information and the relationship between the chair and the ground as input, thus obtaining the rendering prediction network.

[0169] To achieve this functionality, we first designed a method for generating a virtual object database. Specifically, we first select data containing people from the existing dataset, i.e., extract data of preset objects. Next, we use the joint estimation module mentioned earlier, i.e., the information acquisition method described above, to extract object category, pose, and relationship information (i.e., the object attribute information and relationship information mentioned above) from the real data. Finally, we separate the human-related information from the other information to generate a new dataset: the human-related information (human position, pose, and relationship) serves as the target output of the initial rendering prediction network, and the other information (position, pose, and relationship of the other objects) serves as its input.
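The separation of the preset object from the rest of the scene can be sketched as follows. This is a minimal illustration under assumed data formats: each object is a dictionary with hypothetical keys `id`, `category`, `position`, and `pose`, relationships are `(subject_id, predicate, object_id)` triples, and the function name is not taken from this application.

```python
from typing import Dict, List, Tuple

Relationship = Tuple[int, str, int]  # (subject_id, predicate, object_id), assumed format


def split_sample(objects: List[dict], relationships: List[Relationship],
                 preset_category: str = "person") -> Dict[str, dict]:
    """Split one annotated sample into network input (scene) and target (preset object)."""
    preset_ids = {o["id"] for o in objects if o["category"] == preset_category}
    scene_objects = [o for o in objects if o["id"] not in preset_ids]
    target_objects = [o for o in objects if o["id"] in preset_ids]
    # Relationships that do not involve the preset object describe the remaining scene.
    scene_rels = [r for r in relationships if r[0] not in preset_ids and r[2] not in preset_ids]
    # Relationships that involve the preset object are what the network should predict.
    target_rels = [r for r in relationships if r[0] in preset_ids or r[2] in preset_ids]
    return {
        "input": {"objects": scene_objects, "relationships": scene_rels},
        "target": {"objects": target_objects, "relationships": target_rels},
    }


# Hypothetical annotation of a person sitting on a chair that stands on the ground.
objects = [
    {"id": 0, "category": "person", "position": (1.0, 0.5, 2.0), "pose": 90.0},
    {"id": 1, "category": "chair", "position": (1.0, 0.0, 2.0), "pose": 90.0},
    {"id": 2, "category": "ground", "position": (0.0, 0.0, 0.0), "pose": 0.0},
]
relationships = [(0, "sitting on", 1), (1, "standing on", 2)]

sample = split_sample(objects, relationships)
```

Here the chair, the ground, and their relationship form the input, while the person's position, pose, and "sitting on" relationship form the target.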

[0170] In its implementation, the virtual object prediction network, given an input image, first uses a combination of modules at various levels to extract the categories, poses, and scene graph of real objects in the scene. (The scene graph can be understood as an N×N matrix, where N is the number of identified objects. Each row and column of the matrix corresponds to an object, and each element corresponds to a relationship between objects.) These features are then used as input to the virtual object prediction module. This module comprises three sub-networks: virtual object location regression, pose prediction, and action candidate. Ultimately, it outputs candidate locations, poses, and actions that allow virtual objects to blend naturally with the scene.
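The N×N scene graph matrix described above can be illustrated with a short sketch. The row/column convention (row = subject, column = object), the use of `None` for "no relationship", and the label strings are assumptions made for this example; the object and relationship names reuse the person, hat, kite, and grass example given earlier in this description.

```python
from typing import List, Optional, Sequence, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object), assumed format


def build_scene_graph(objects: Sequence[str], triples: Sequence[Triple]) -> List[List[Optional[str]]]:
    """Build an N x N matrix whose entry [i][j] is the relationship from object i to object j."""
    index = {name: i for i, name in enumerate(objects)}
    n = len(objects)
    matrix: List[List[Optional[str]]] = [[None] * n for _ in range(n)]
    for subject, predicate, obj in triples:
        matrix[index[subject]][index[obj]] = predicate
    return matrix


objects = ["person", "hat", "kite", "grass"]
triples = [
    ("person", "wearing", "hat"),
    ("person", "flying", "kite"),
    ("person", "standing on", "grass"),
]
graph = build_scene_graph(objects, triples)
```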

[0171] To better understand the image processing method described above, the following is a detailed example of an image processing method of the present invention:

[0172] As shown in Figure 17, in one example, the image processing method of this application includes the following steps:

[0173] 1) Virtual reality devices capture images;

[0174] 2) Perform joint estimation on the images, i.e., acquire information to obtain the attribute information and relationship information of the objects in the images;

[0175] 3) When the virtual reality device receives a rendering instruction for a virtual object, i.e., a control command, the attribute information and relationship information of the objects are input into the virtual object prediction network, i.e., the rendering prediction network, for prediction (the virtual object prediction shown in the figure), obtaining the predicted virtual position information, virtual pose information, and virtual action information of the virtual object;

[0176] 4) The CG engine renders the virtual objects in the image based on the predicted virtual position information, virtual pose information, and virtual action information, thereby obtaining the image with the virtual objects rendered in it.

[0177] In the example above, the joint estimation module acquires color and depth images (RGB-D) captured by the AR device as input data. It then uses a deep learning network to calculate the attributes (category, pose) of objects in the scene and the relationships between them (corresponding to the object attributes and relationships in the diagram above). When the AR system receives a control command from the user or the system to add virtual objects to the scene, the virtual object prediction module takes the object attributes and relationships obtained from the previous module as input. Through the deep learning network, it outputs the virtual object's position, pose, and interactions with surrounding objects (corresponding to the motion prediction in the diagram above). Finally, the CG engine renders the virtual objects in the scene based on the virtual object prediction results.

[0178] Visual feature extraction includes: object recognition, single object visual feature extraction (corresponding to the current object feature extraction in the above figure), nearby object visual feature extraction (corresponding to the surrounding object feature extraction in the above figure), whole image feature extraction (corresponding to the complete image feature extraction in the above figure), object size and position feature extraction, and inter-object relationship feature extraction, etc.

[0179] To better understand the image processing method described above, the following is a detailed example of an image processing method of the present invention:

[0180] As shown in Figure 18, in one example, the image processing method of this application includes the following steps:

[0181] 1) Obtain the input image and input it into the CNN network for feature extraction to obtain a shared feature map;

[0182] 2) Input the shared feature map into the RPN network for target recognition to obtain the feature region;

[0183] 3) Input the feature regions into the joint estimation network, i.e. the recognition neural network, to obtain the attribute information of the objects in the input image and the relationship information between the objects, i.e., the object category, the object's 3D pose, and the scene graph;

[0184] 4) Input the object's attribute information and relationship information into the virtual object prediction network, i.e., the rendering prediction network, to make predictions and obtain the predicted virtual position information (i.e., the virtual object position prediction in the figure), virtual pose information, and virtual action information of the virtual object; or specify the virtual position information of the virtual object (i.e., the specified virtual object position in the figure), input the object's attribute information and relationship information into the virtual object prediction network, and obtain the virtual pose information and virtual action information.

[0185] To better understand the image processing method described above, the following is a detailed example of an image processing method of the present invention:

[0186] As shown in Figure 19, in one example, the image processing method of this application includes the following steps:

[0187] 1) Obtain a color image containing a depth image, i.e., the RGB-D image shown in the figure;

[0188] 2) Obtain candidate regions from RGB-D images;

[0189] 3) Joint estimation of candidate regions is performed, that is, three sub-networks are used to identify category features, pose features and scene feature maps respectively;

[0190] 4) During the recognition process, the three sub-networks work together and perform mutual information correction to obtain correction category features, correction pose features, and correction scene feature maps;

[0191] 5) Based on the correction category features, correction pose features, and correction scene feature map, the object category, 3D pose, and scene map in the color image are identified;

[0192] 6) Predict the position, pose, and action of virtual objects based on the scene graph (i.e., the virtual object position and pose prediction and the virtual object action prediction shown in the figure).

[0193] The image processing method described above inputs attribute information and relationship information into a rendering prediction network to obtain virtual position information, virtual pose information, and virtual action information of virtual objects that can be rendered into the image. It can estimate the possible position, pose, and action of virtual objects based on the category, pose, and relationship of real objects in the image, thereby realizing natural interaction between virtual and reality.

[0194] The above embodiments describe the information acquisition method from the perspective of the method flow. The following describes it from the perspective of virtual modules:

[0195] This application provides an information acquisition device 200. As shown in Figure 20, the device 200 may include a first acquisition module 201 and a correction module 202, wherein:

[0196] The first acquisition module 201 is used to acquire the attribute features of objects in the image and the relationship features between objects;

[0197] The correction module 202 is used to correct relation features based on attribute features, obtain relation information between objects based on the corrected relation features, and / or correct attribute features based on relation features, obtain attribute information of objects based on the corrected attribute features. The aforementioned information acquisition device uses multiple sub-networks of a neural network to detect feature regions, acquiring attribute features of objects and relation features between objects in the image. These multiple sub-networks collaborate and exchange information during the detection process, i.e., correcting relation features based on attribute features, obtaining relation information between objects based on the corrected relation features, and / or correcting attribute features based on relation features, obtaining attribute information of objects based on the corrected attribute features, thereby more accurately identifying the attribute information of objects and the relation information between objects in the target image.

[0198] In one possible implementation of this application embodiment, when the first acquisition module 201 acquires the attribute features of objects in the image and the relationship features between objects, it is specifically used for:

[0199] Images are input into a neural network for recognition. During the recognition process, the intermediate layers of each subnetwork of the neural network output attribute features and relational features.

[0200] The attribute information includes the object's category information and the object's pose information; the neural network includes multiple sub-networks, including a category recognition network for recognizing category information, a pose recognition network for recognizing pose information, and a relationship recognition network for recognizing relationship information.

[0201] In one possible implementation of this application embodiment, when the correction module 202 corrects the relationship features based on the attribute features and obtains the relationship information between objects based on the corrected relationship features, and / or corrects the attribute features based on the relationship features and obtains the attribute information of the objects based on the corrected attribute features, it is specifically used for:

[0202] Corrected relation features are obtained by correcting relation features based on attribute features, and / or corrected attribute features are obtained by correcting attribute features based on relation features;

[0203] The corrected relation features and / or corrected attribute features are respectively input into the next layer of the intermediate layer structure of each sub-network for further identification, to obtain attribute information and relation information. In one possible implementation of this application embodiment, when the correction module 202 corrects the relation features based on the attribute features to obtain corrected relation features, and / or corrects the attribute features based on the relation features to obtain corrected attribute features, it is specifically used for:

[0204] The corrected category features are obtained based on the category features, pose features, relation features, and a preset first weight coefficient array in the attribute features.

[0205] The corrected pose features are obtained based on category features, pose features, relation features, and a preset second weight coefficient array;

[0206] The corrected relation features are obtained based on category features, pose features, relation features, and a preset third weight coefficient array. This application embodiment provides an image processing apparatus 210. As shown in Figure 21, the device 210 may include a second acquisition module 211 and an adding module 212, wherein:

[0207] The second acquisition module 211 is used to acquire attribute information and relationship information of objects in the image;

[0208] The addition module 212 is used to add virtual objects to the image based on the attribute information and relationship information.
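The weighted correction described in [0204] to [0206] can be illustrated with a minimal sketch. It assumes, purely for illustration, that each weight coefficient array holds one scalar per feature type and that the three feature vectors have the same length, so that each corrected feature is a linear combination of the three inputs. The text above does not fix this form; the names `weighted_sum` and `correct_features` and all numeric values are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def weighted_sum(features: Sequence[Vector], weights: Sequence[float]) -> Vector:
    """Combine equally sized feature vectors using one scalar weight per vector."""
    if len(features) != len(weights):
        raise ValueError("one weight is required per feature vector")
    length = len(features[0])
    if any(len(f) != length for f in features):
        raise ValueError("feature vectors must have the same length")
    return [sum(w * f[i] for f, w in zip(features, weights)) for i in range(length)]


def correct_features(
    category: Vector,
    pose: Vector,
    relation: Vector,
    first_weights: Sequence[float],
    second_weights: Sequence[float],
    third_weights: Sequence[float],
) -> Tuple[Vector, Vector, Vector]:
    """Return (corrected category, corrected pose, corrected relation) features.

    Assumption: each corrected feature mixes all three inputs, using its own
    weight coefficient array (first, second, and third respectively).
    """
    feats = (category, pose, relation)
    return (
        weighted_sum(feats, first_weights),
        weighted_sum(feats, second_weights),
        weighted_sum(feats, third_weights),
    )


# Hypothetical two-dimensional features and weight arrays.
corrected_category, corrected_pose, corrected_relation = correct_features(
    category=[1.0, 0.0],
    pose=[0.0, 1.0],
    relation=[1.0, 1.0],
    first_weights=[0.5, 0.25, 0.25],
    second_weights=[0.0, 1.0, 0.0],
    third_weights=[0.25, 0.25, 0.5],
)
```

In this sketch a weight array such as `[0.0, 1.0, 0.0]` leaves the pose features unchanged, which shows how the preset coefficients control how much each sub-network's features influence the others.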

[0209] The aforementioned image processing device, by inputting attribute information and relationship information into a rendering prediction network, obtains virtual position information, virtual pose information, and virtual action information of virtual objects that can be rendered into the image. It can estimate the possible position, pose, and action of virtual objects based on the category, pose, and relationship of real objects in the image, thereby realizing natural interaction between virtual and reality.

[0210] In one possible implementation of this application embodiment, when adding virtual objects to an image based on attribute information and relationship information, the addition module 212 is specifically used for:

[0211] Based on the attribute information and relationship information, the virtual position information, virtual pose information, and virtual action information of the virtual object are obtained;

[0212] The virtual object is added to the image based on the virtual position information, virtual pose information, and virtual action information.
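The two steps in [0211] and [0212] can be sketched as follows. This is only an illustration: a hand-written rule table stands in for the trained rendering prediction network, and the types `ObjectInfo` and `Placement`, the `position` field, the relation values `"free"` and `"occupied"`, and the action names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ObjectInfo:
    """Attribute and relationship information for one real object."""
    category: str                  # category information
    rotation_deg: float            # pose information (rotation angle)
    position: Tuple[float, float]  # assumed location of the object in the image
    relation: str                  # relationship information, e.g. "free" or "occupied"


@dataclass(frozen=True)
class Placement:
    """Virtual position, virtual pose, and virtual action of a virtual object."""
    position: Tuple[float, float]
    rotation_deg: float
    action: str


# Hypothetical rule table standing in for a trained rendering prediction network.
ACTION_BY_CATEGORY: Dict[str, str] = {"chair": "sit", "bed": "lie", "ball": "kick"}


def predict_placement(obj: ObjectInfo) -> Placement:
    """Step [0211]: derive position, pose, and action from the object's information."""
    if obj.relation == "occupied":
        # The object is already in use, so place the virtual object beside it.
        x, y = obj.position
        return Placement((x + 1.0, y), obj.rotation_deg, "stand")
    action = ACTION_BY_CATEGORY.get(obj.category, "stand")
    return Placement(obj.position, obj.rotation_deg, action)


def add_virtual_object(overlays: List[dict], name: str, placement: Placement) -> List[dict]:
    """Step [0212]: record the virtual object to draw; returns a new overlay list."""
    entry = {
        "name": name,
        "position": placement.position,
        "rotation_deg": placement.rotation_deg,
        "action": placement.action,
    }
    return overlays + [entry]


chair = ObjectInfo("chair", 90.0, (2.0, 3.0), "free")
overlays = add_virtual_object([], "avatar", predict_placement(chair))
```

Keeping prediction and addition as separate functions mirrors the split between the two steps, so the placement could be inspected or replaced before anything is drawn.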

[0213] The image information acquisition device of this disclosure can execute an image information acquisition method provided in the embodiments of this disclosure. The implementation principle is similar. The actions performed by each module in the image information acquisition device in each embodiment of this disclosure correspond to the steps in the image information acquisition method in each embodiment of this disclosure. For detailed functional descriptions of each module of the image information acquisition device, please refer to the descriptions of the corresponding image information acquisition methods shown above, which will not be repeated here.

[0214] The image processing apparatus of this disclosure can execute an image processing method provided in the embodiments of this disclosure. The implementation principle is similar. The actions performed by each module in the image processing apparatus of this disclosure correspond to the steps in the image processing method of this disclosure. For detailed functional descriptions of each module of the image processing apparatus, please refer to the descriptions of the corresponding image processing methods shown above. They will not be repeated here.

[0215] The information acquisition device and image processing device provided in the embodiments of this application have been described above from the perspective of functional modules. Next, the electronic device provided in the embodiments of this application will be described from the perspective of physical hardware, together with the computing system of the electronic device.

[0216] Based on the same principles as the methods shown in the embodiments of this disclosure, the embodiments of this disclosure also provide an electronic device, which may include, but is not limited to: a processor and a memory; the memory for storing computer operation instructions; and the processor for executing the information acquisition method shown in the embodiments by invoking the computer operation instructions. Compared with the prior art, the information acquisition method in this application can more accurately identify the attribute information of objects in an image and the relationship information between objects.

[0217] Based on the same principles as the methods shown in the embodiments of this disclosure, the embodiments of this disclosure also provide an electronic device, which may include, but is not limited to: a processor and a memory; the memory for storing computer operation instructions; and the processor for executing the image processing method shown in the embodiments by invoking the computer operation instructions. Compared with the prior art, the image processing method in this application can realize natural interaction between virtual and reality.

[0218] In an alternative embodiment, an electronic device is provided. As shown in Figure 22, the electronic device 2200 includes a processor 2201 and a memory 2203. The processor 2201 and the memory 2203 are connected, for example, via a bus 2202. Optionally, the electronic device 2200 may also include a transceiver 2204. It should be noted that in practical applications, the transceiver 2204 is not limited to one unit, and the structure of this electronic device 2200 does not constitute a limitation on the embodiments of this application.

[0219] Processor 2201 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array), or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof. It may implement or execute the various exemplary logic blocks, modules, and circuits described in conjunction with the disclosure of this application. Processor 2201 may also be a combination that implements computational functions, such as a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.

[0220] Bus 2202 may include a pathway for transmitting information between the aforementioned components. Bus 2202 may be a PCI (Peripheral Component Interconnect) bus or an EISA (Extended Industry Standard Architecture) bus, etc. Bus 2202 can be divided into an address bus, a data bus, a control bus, etc. For ease of representation, the bus is shown in Figure 22 as a single thick line, but this does not mean that there is only one bus or one type of bus.

[0221] The memory 2203 may be a ROM (Read Only Memory) or other type of static storage device capable of storing static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device capable of storing information and instructions, or an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium capable of carrying or storing desired program code in the form of instructions or data structures and accessible by a computer, but is not limited thereto.

[0222] The memory 2203 is used to store application code that executes the solution of this application, and its execution is controlled by the processor 2201. The processor 2201 is used to execute the application code stored in the memory 2203 to implement the content shown in the foregoing method embodiments.

[0223] The electronic devices include, but are not limited to: mobile terminals such as mobile phones, laptops, digital radio receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and in-vehicle terminals (such as in-vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in Figure 22 is merely an example and should not be construed as limiting the functionality and scope of the embodiments disclosed herein.

[0224] This application provides a computer-readable storage medium storing a computer program that, when run on a computer, enables the computer to execute the corresponding content described in the aforementioned method embodiments. Compared with the prior art, the information acquisition method in this application can more accurately identify the attribute information of objects in an image and the relationship information between objects; the image processing method in this application can realize natural interaction between virtual and reality.

[0225] It should be understood that although the steps in the flowcharts of the accompanying figures are shown sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the accompanying figures may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same time, but can be executed at different times, and their execution order is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the sub-steps or stages of other steps.

[0226] It should be noted that the computer-readable medium described in this disclosure can be a computer-readable signal medium or a computer-readable storage medium, or any combination thereof. A computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this disclosure, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In this disclosure, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium can be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to: wires, optical fibers, RF (radio frequency), etc., or any suitable combination thereof.

[0227] The aforementioned computer-readable medium may be included in the aforementioned electronic device, or it may exist independently without being assembled into the electronic device.

[0228] The aforementioned computer-readable medium carries one or more programs, which, when executed by the electronic device, cause the electronic device to perform the methods shown in the above embodiments.

[0229] Computer program code for performing the operations of this disclosure can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computer (e.g., via the Internet using an Internet service provider).

[0230] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.
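The remark that consecutively drawn blocks may run substantially in parallel can be illustrated with a short sketch. It assumes three independent steps that read the same input and do not depend on each other's output; the three feature functions are toy placeholders invented for this example, not the sub-networks of the disclosure, and the thread pool is just one possible way to run them concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

Image = List[List[int]]
Features = List[int]


def extract_category_features(image: Image) -> Features:
    """Toy placeholder: one value per row (row sum)."""
    return [sum(row) for row in image]


def extract_pose_features(image: Image) -> Features:
    """Toy placeholder: one value per row (row maximum)."""
    return [max(row) for row in image]


def extract_relation_features(image: Image) -> Features:
    """Toy placeholder: one value per row (first element minus last element)."""
    return [row[0] - row[-1] for row in image]


BRANCHES: Sequence[Callable[[Image], Features]] = (
    extract_category_features,
    extract_pose_features,
    extract_relation_features,
)


def run_sequentially(image: Image) -> Tuple[Features, ...]:
    """Execute the blocks one after another, in the drawn order."""
    return tuple(branch(image) for branch in BRANCHES)


def run_in_parallel(image: Image) -> Tuple[Features, ...]:
    """Execute the same blocks concurrently; results are collected in drawn order."""
    with ThreadPoolExecutor(max_workers=len(BRANCHES)) as pool:
        futures = [pool.submit(branch, image) for branch in BRANCHES]
        return tuple(future.result() for future in futures)


image = [[1, 2, 3], [4, 5, 6]]
same_result = run_sequentially(image) == run_in_parallel(image)
```

Because the blocks share no state, the order in which they actually finish does not change the collected results, which is the property the paragraph above relies on.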

[0231] The modules described in the embodiments of this disclosure can be implemented in software or in hardware. The name of a module does not necessarily limit the module itself; for example, an acquisition module can also be described as an "image acquisition module".

[0232] The above description is merely a preferred embodiment of this disclosure and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of this disclosure is not limited to technical solutions formed by specific combinations of the above-described technical features, but should also cover other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalents without departing from the above-described concept. An example is a technical solution formed by replacing the above features with technical features that have similar functions and are disclosed in (but not limited to) this disclosure.

Claims

1. A method performed by an electronic device, the method comprising: acquiring an image; obtaining category features, pose features, and relation features of an object contained in the image; based on the category features, pose features, and relation features of the object, correcting each of the category features, pose features, and relation features separately, or mutually correcting the category features, pose features, and relation features; obtaining category information, pose information, and relationship information of the object based on the corrected category features, corrected pose features, and corrected relation features, respectively; and determining, based on attribute information and the relationship information, virtual position information, virtual pose information, and virtual action information of a virtual object to be rendered on the image, wherein the attribute information includes the category information and pose information of the object.

2. The method of claim 1, wherein the correction includes: correcting one of the category features, pose features, and relation features by applying a preset weight to each of the category features, pose features, and relation features of the object.

3. The method of claim 1, wherein obtaining the category features, pose features, and relation features includes: obtaining the category features, pose features, and relation features from the intermediate layers of the sub-networks of a neural network corresponding to the category features, pose features, and relation features, respectively.

4. The method of claim 3, wherein the sub-networks exchange information with each other, and the category features, pose features, and relation features are transmitted by different sub-networks.

5. The method of claim 1, wherein obtaining the category information, pose information, and relationship information of the object includes: inputting the corrected category features, corrected pose features, and corrected relation features respectively into the subsequent network structures of each intermediate layer of the neural network corresponding to the corrected category features, corrected pose features, and corrected relation features, and obtaining the category information, pose information, and relationship information from the corresponding sub-networks.

6. The method of claim 1, wherein the pose information includes information about the rotation angle of the object in the image; and the relationship information includes the action information of the object and / or the connection between two objects.

7. The method of claim 1, further comprising: adding the virtual object to the image based on the virtual position information, virtual pose information, and virtual action information.

8. The method of claim 7, wherein adding the virtual object includes: when at least one of the virtual position information, virtual pose information, or virtual action information comprises multiple sets of information, adding the virtual object to the image based on the information selected by the user from the multiple sets of information.

9. The method of claim 7, wherein the virtual position information includes information indicating the position at which the virtual object is rendered in the image; the virtual pose information includes information indicating the rotation angle of the virtual object; and the virtual action information includes information that instructs the virtual object to perform actions.

10. The method according to any one of claims 1-9, characterized in that the image is a red-green-blue-depth (RGB-D) image or a red-green-blue (RGB) image.

11. The method as described in claim 1, characterized in that determining the virtual position information, virtual pose information, and virtual action information of the virtual object to be rendered on the image based on the attribute information and relationship information includes: determining virtual object information based on the attribute information and relationship information using a rendering prediction network; wherein determining the virtual object information includes: determining the virtual position information, virtual pose information, and virtual action information of the virtual object to be rendered on the image based on the category information, pose information, and relationship information; and the virtual object information represents the relationship between the virtual object in the image and surrounding objects, so that the virtual object is rendered on the image and can interact with the surrounding objects.

12. The method as described in claim 1, characterized in that the acquisition of the category features, pose features, and relation features of the object, and the correction of each of the category features, pose features, and relation features, are achieved by providing a neural network with input corresponding to the image.

13. The method as described in claim 1, characterized in that the acquisition of the category features, pose features, and relation features of the object, the correction of each of the category features, pose features, and relation features, and / or the acquisition of the category information, pose information, and relationship information are performed using a neural network, which is provided with input corresponding to the image; wherein the acquisition of the category features, the acquisition of the pose features, and the acquisition of the relation features are carried out in parallel, and / or the correction of the category features, the correction of the pose features, and the correction of the relation features are carried out in parallel, and / or the acquisition of the category information, the acquisition of the pose information, and the acquisition of the relationship information are carried out in parallel.

14. The method as described in claim 1, further comprising: rendering the virtual object on the image based on the determined virtual position information, virtual pose information, and virtual action information.

15. The method as described in claim 12, characterized in that the virtual object information includes the available location of the virtual object in the image, as well as the action of the virtual object.

16. An electronic device, characterized in that it includes: one or more processors; a memory; and one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications being configured to perform the method according to any one of claims 1 to 15.

17. A computer-readable storage medium, characterized in that the storage medium stores at least one instruction, at least one program, a code set, or an instruction set, wherein the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the method as described in any one of claims 1 to 15.