Object recognition method, apparatus and electronic device

By supplying prompt words to a large language model, and by combining a multimodal large language model with visual positioning technology, interfering elements in screen images are filtered out, and the coordinate boxes of objects are accurately identified and displayed. This addresses the low positioning accuracy of target objects in screen scenes and enables accurate and comprehensive object recognition.

CN122244423A · Pending · Publication Date: 2026-06-19 · VIVO MOBILE COMM CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
VIVO MOBILE COMM CO LTD
Filing Date
2026-03-30
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies struggle to distinguish target objects from interfering elements in screen scenarios, resulting in low positioning accuracy and an inability to accurately identify target objects in images. This is especially true when there are interfering elements such as icons, text, and advertisements, which can easily lead to misjudgment or occlusion, causing recognition failure.

Method used

A prompt word is supplied to the large language model to guide it to identify the coordinates of objects in the screen image while filtering out interfering elements. A multimodal large language model is combined with visual positioning technology to determine the coordinates of objects accurately and display the corresponding coordinate boxes, thus overcoming the influence of interference and occlusion.

Benefits of technology

It achieves accurate recognition and comprehensive positioning of objects in screen images, improves positioning accuracy, meets users' needs for accurate and comprehensive positioning of multiple objects, reduces user operation steps, and supports natural language question answering and product navigation.

✦ Generated by Eureka AI based on patent content.


Abstract

This application discloses an object recognition method, apparatus, and electronic device, belonging to the field of artificial intelligence technology. The method may include receiving a first input from a user while displaying a first image; responding to the first input, inputting the first image and a first prompt word into a large language model to obtain first positioning information output by the large language model; the first positioning information is the coordinate information of each object in the first image, and the first prompt word is used to guide the large language model to determine the coordinate information of each object in the first image; and displaying a coordinate box corresponding to the first positioning information in the first image.

Description

Technical Field

[0001] This application belongs to the field of artificial intelligence technology, and specifically relates to an object recognition method, apparatus, and electronic device.

Background Technology

[0002] With the development of multimodal large language models and computer vision technology, visual positioning technology has become the support for realizing cross-modal interaction between images and natural language. This technology, through image-text feature alignment, similarity calculation and other methods, combined with deep neural network models, can predict the position of a target object in an image based on text description.

[0003] However, screen scenes often contain numerous irrelevant and distracting elements such as icons, text, advertisements, and floating buttons. When locating target objects in images based on text descriptions, the aforementioned methods cannot effectively distinguish between target objects and distracting elements. During the recognition process, distracting elements are easily misidentified as target objects, or the target object may be unrecognizable due to occlusion by distracting elements, reducing positioning accuracy and thus failing to accurately identify target objects in images.

Summary of the Invention

[0004] The purpose of this application is to provide an object recognition method, apparatus, electronic device, storage medium, chip, and computer program product that can improve the accuracy of recognizing target objects in images.

[0005] In a first aspect, embodiments of this application provide an object recognition method, including: while displaying a first image, receiving a first input from a user; in response to the first input, inputting the first image and a first prompt word into a large language model to obtain first positioning information output by the large language model, where the first positioning information is the coordinate information of each object in the first image, and the first prompt word is used to guide the large language model to determine the coordinate information of each object in the first image; and displaying, in the first image, a coordinate box corresponding to the first positioning information.

[0006] In a second aspect, embodiments of this application provide an object recognition apparatus, including: a receiving module, used to receive a first input from a user while a first image is displayed; a processing module, used to input, in response to the first input, the first image and a first prompt word into a large language model to obtain first positioning information output by the large language model, where the first positioning information is the coordinate information of each object in the first image, and the first prompt word is used to guide the large language model to determine the coordinate information of each object in the first image; and a display module, used to display, in the first image, a coordinate box corresponding to the first positioning information.

[0007] In a third aspect, embodiments of this application provide an electronic device including a processor, a memory, and a program or instructions stored in the memory and executable on the processor. When the program or instructions are executed by the processor, they implement the steps of the object recognition method as described in the first aspect.

[0008] In a fourth aspect, embodiments of this application provide a readable storage medium on which a program or instructions are stored, which, when executed by a processor, implement the steps of the object recognition method as described in the first aspect.

[0009] In a fifth aspect, embodiments of this application provide a chip, which includes a processor and a display interface, the display interface being coupled to the processor, and the processor being used to run programs or instructions to implement the steps of the object recognition method as described in the first aspect.

[0010] In a sixth aspect, embodiments of this application provide a computer program product stored in a storage medium, which is executed by at least one processor to implement the steps of the object recognition method as described in the first aspect.

[0011] In the embodiments of this application, a first input from the user is received while a first image is displayed. In response to the first input, the first image and a first prompt word are input into a large language model to obtain first positioning information output by the large language model. The first positioning information is the coordinate information of each object in the first image, and the first prompt word is used to guide the large language model to determine the coordinate information of each object in the first image. A coordinate box corresponding to the first positioning information is displayed in the first image. In this way, the constraint of the first prompt word guides the large language model to comprehensively identify all objects in the first image and makes explicit that the model's positioning task is to obtain the coordinate information of all objects. Each object in the first image is thereby distinguished from irrelevant interfering elements such as icons, text, advertisements, and floating buttons, which prevents the large language model, at the level of the positioning logic itself, from misjudging interfering elements as objects. Furthermore, because the large language model determines the coordinate information of each object in the first image based on the first prompt word, it can overcome occlusion by interfering elements and identify the actual position of each occluded object in the first image, which solves the problem of objects being unrecognizable due to occlusion by interfering elements. In addition, displaying the coordinate boxes corresponding to the coordinate information of each object in the first image makes the positioning results of the large language model concrete, improves the positioning accuracy of each object in the first image, and achieves accurate recognition and comprehensive positioning of all objects in the first image, thus meeting the user's need for accurate and comprehensive positioning of multiple objects in the image.

Attached Figure Description

[0012] Figure 1 is a flowchart of an object recognition method provided in some embodiments of this application;
Figure 2 is a schematic diagram of an interface for an object recognition method provided in some embodiments of this application;
Figure 3 is a schematic diagram of an interface for an object recognition method provided in some embodiments of this application;
Figure 4 is a schematic diagram of an interface for an object recognition method provided in some embodiments of this application;
Figure 5 is a schematic diagram of an interface for an object recognition method provided in some embodiments of this application;
Figure 6 is a schematic diagram of an interface for an object recognition method provided in some embodiments of this application;
Figure 7 is a schematic diagram of an interface for an object recognition method provided in some embodiments of this application;
Figure 8 is a schematic diagram of an interface for an object recognition method provided in some embodiments of this application;
Figure 9 is a schematic diagram of an interface for an object recognition method provided in some embodiments of this application;
Figure 10 is a schematic diagram of the structure of an object recognition apparatus provided in some embodiments of this application;
Figure 11 is a schematic diagram of the structure of an electronic device provided in some embodiments of this application;
Figure 12 is a schematic diagram of the hardware structure of an electronic device provided in some embodiments of this application.

Detailed Implementation

[0013] The technical solutions of the embodiments of this application will be clearly described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those skilled in the art based on the embodiments of this application are within the scope of protection of this application.

[0014] The terms "first," "second," etc., used in the specification and claims of this application are used to distinguish similar objects and not to describe a specific order or sequence. It should be understood that such terms can be used interchangeably where appropriate so that embodiments of this application can be implemented in orders other than those illustrated or described herein, and the objects distinguished by "first," "second," etc., are generally of the same class, not limited in number; for example, a first object can be one or more. Furthermore, in the specification and claims, "and / or" indicates at least one of the connected objects, and the character " / " generally indicates that the preceding and following objects are in an "or" relationship. Embodiments of this application provide an object identification method, apparatus, electronic device, and storage medium.

[0015] With the development of multimodal large language models and visual positioning technology, electronic devices have acquired the ability to understand images and natural language. Existing visual positioning technologies locate target objects described in text within images through image-text feature alignment and similarity calculation. These methods are mostly based on deep neural network structures: Transformer-based visual neural networks extract image features, a text encoder generates text features, and cross-modal positioning is then achieved through feature fusion and similarity measurement. However, in the field of screen content processing, existing technologies still mainly rely on Optical Character Recognition (OCR), User Interface (UI) structure parsing, or template matching. These methods are only suitable for recognizing structured text and icons; they cannot recognize complex visual elements such as clothing and objects, and they are difficult to combine with semantic reasoning to achieve flexible interactive positioning. At the same time, the traditional multi-step processing flow cannot complete end-to-end multimodal perception from language input to visual understanding to result output.

[0016] Today, users have increasingly higher demands for intelligent interactive experiences. The visual positioning and information understanding of screen content have long since broken through the limitations of text recognition. For example, in general scenarios, users hope to directly locate objects on the screen through verbal descriptions, or to filter out irrelevant information such as icons and text, to achieve accurate target recognition and navigation. However, many shortcomings of existing technologies cannot meet this need. The deficiencies of existing technologies are mainly reflected in four aspects: First, the ability to recognize general objects is insufficient, only able to recognize text or UI elements, and unable to accurately locate complex visual objects such as people and clothing on the screen; second, there is a lack of semantic reasoning ability, only able to match based on direct descriptions, and unable to meet complex positioning needs through semantic reasoning; third, the ability to deal with noise interference is weak, making it difficult to distinguish and filter irrelevant elements such as icons and advertisements on the screen, affecting positioning accuracy; fourth, the information extraction method is singular, only able to extract explicit text content, and unable to achieve cross-modal information retrieval or natural language question answering.

[0017] To address the aforementioned problems, embodiments of this application provide an object recognition method, apparatus, device, storage medium, and program product, aiming to overcome the limitations of existing screen information retrieval and visual positioning methods, achieve a more comprehensive and intelligent understanding and interaction of screen content, and simultaneously achieve intelligent understanding, precise positioning, and visual reasoning of screen content. This can reduce user operation steps during the browsing of electronic devices and directly answer questions or provide shopping links for users.

[0018] The object recognition method provided in this application is described in detail below, with reference to Figures 1 to 12, through specific embodiments and application scenarios.

[0019] The terminology used in the implementation section of this application is for the purpose of explaining specific embodiments of this application only, and is not intended to limit this application.

[0020] The terminology used in the embodiments of this invention will be explained below.

[0021] Large Language Model (LLM). In this application, a large language model refers to a language model with 3B (3 billion) parameters, which can be embedded directly in an electronic device to handle various scenario tasks on the device side.

[0022] Device side (end side) refers to electronic devices, including but not limited to portable devices such as mobile phones and computers.

[0023] Multimodal large language models refer to large-scale neural network models that can simultaneously process and understand multiple types of data such as text and images, and are used to realize the perception, understanding and generation of cross-modal information.

[0024] A prompt is an instruction given to a large language model to guide it in generating specific, expected outputs such as answers, text, or code. In this embodiment, it refers to a first prompt, a second prompt, etc.

[0025] Visual positioning refers to the technique of accurately determining the spatial location or coordinates of a target object in an image or screen using computer vision technology.

[0026] Screen data retrieval perception refers to the technology that combines screen perception and information retrieval technology to understand screen content and directly locate or extract the information needed by the user.

[0027] Reinforcement learning is a machine learning method that enables large language models to learn optimal decision-making strategies autonomously through environmental interaction, trial and error, and reward feedback mechanisms.

[0028] ViT-LoRA training refers to a method that, based on the Vision Transformer (ViT) model, uses low-rank adaptation (LoRA) to efficiently fine-tune a large language model on a specific task.

[0029] Data matching refers to the technical means of combining data of different types or sources according to a preset ratio during the training of a large language model in order to optimize the training effect and improve the generalization ability. In the embodiments of this application, data matching may refer to adjusting the number of sample data for each of the N positioning scenarios according to the preset data matching information of N positioning scenarios to obtain training sample data for N positioning scenarios.
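The per-scenario adjustment described in [0029] can be illustrated with a short sketch. This is only one possible reading: the function name `match_data`, the ratio values, and the choice to downsample by random selection and upsample by repeating samples are all assumptions made for illustration, since the text does not specify how the sample counts are adjusted.

```python
import random
from typing import Dict, List


def match_data(
    samples_by_scenario: Dict[str, List[dict]],
    ratios: Dict[str, float],
    total: int,
    seed: int = 0,
) -> Dict[str, List[dict]]:
    """Adjust the number of samples per positioning scenario to a preset ratio.

    Hypothetical helper: ratios are relative weights, total is the desired
    overall number of training samples.
    """
    rng = random.Random(seed)
    weight_sum = sum(ratios.values())
    matched: Dict[str, List[dict]] = {}
    for scenario, pool in samples_by_scenario.items():
        target = round(total * ratios[scenario] / weight_sum)
        if not pool:
            # Nothing to draw from; leave the scenario empty.
            matched[scenario] = []
        elif target <= len(pool):
            # Downsample without replacement.
            matched[scenario] = rng.sample(pool, target)
        else:
            # Upsample (assumption): keep every sample, then repeat random ones.
            matched[scenario] = list(pool) + rng.choices(pool, k=target - len(pool))
    return matched
```

With two scenarios weighted 1:1 and `total=8`, each scenario ends up with four samples regardless of how many were collected for it originally.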

[0030] It should be noted that the object recognition method provided in this application can be executed by electronic devices such as mobile phones, tablets, laptops, PDAs, and wearable devices. Some embodiments of this application use electronic devices as the executing entity to illustrate the object recognition method provided in this application. The object recognition method provided in this application can be applied to screen recognition scenarios, enabling intelligent understanding, visual positioning, and interaction of mobile screen content. This includes locating objects on the screen through language description, filtering irrelevant information, accurately identifying targets and completing product navigation, and natural language question answering. Simultaneously, this method can also be applied to assistive scenarios for visually impaired individuals, providing them with support for screen content perception and information acquisition. Furthermore, this method can be extended to intelligent composition scenarios in cameras and photo albums, enriching the application boundaries of multimodal visual positioning technology.

[0031] In some embodiments of this application, the object recognition method provided in this application can be applied to scenarios in which targets are identified and product navigation is completed. Specifically, the electronic device displays a first image, such as an outfit image-and-text post. A user browsing the post in a lifestyle sharing application sees a white canvas bag in the image and wants to find the same bag and jump to a purchase page. The user long-presses the screen with two fingers; this operation is the first input. In response to the first input, the electronic device obtains a preset first prompt word, which is used to guide the large language model to determine the coordinate information of each object in the post. The prompt word content can be set to "identify all objects in the image, ignore text areas, application icons, floating icons and other non-subject elements, and output the coordinate information of each object". The post and the first prompt word are then input together into the large language model on the device side. Guided by the first prompt word, the large language model comprehensively identifies each object in the first image and, after filtering out irrelevant information such as the title text, the blogger's avatar, and application function icons, outputs the coordinate information of all objects in the first image, such as the coordinates of the white canvas bag, <box>(326,458),(412,596)</box>, and the coordinates of other clothing items, ornaments, and other objects in the image. These coordinates constitute the first positioning information. The electronic device then directly displays the coordinate boxes corresponding to the first positioning information within the outfit image. Each coordinate box defines the area where the corresponding object in the first image is located, allowing the user to see all positioned objects at a glance. The coordinate box defining the white canvas bag is a first-type coordinate box; clicking on this first-type coordinate box redirects the user to the shopping interface for the white canvas bag.
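The coordinate notation used in the example above can be turned into numeric boxes with a small parser. The following is a minimal sketch, assuming the model emits integer pixel coordinates in the form `<box>(x1,y1),(x2,y2)</box>` and that the two points are opposite corners of the box (the text does not state the corner convention); `parse_boxes` is a hypothetical helper name.

```python
import re
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)

# Optional whitespace is tolerated everywhere, including inside the closing tag.
_BOX_RE = re.compile(
    r"<\s*box\s*>\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)\s*,"
    r"\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)\s*<\s*/\s*box\s*>"
)


def parse_boxes(model_output: str) -> List[Box]:
    """Extract every <box>(x1,y1),(x2,y2)</box> span from the model's text output."""
    boxes: List[Box] = []
    for match in _BOX_RE.finditer(model_output):
        x1, y1, x2, y2 = map(int, match.groups())
        # Normalize so that (x1, y1) is the top-left corner (assumed convention).
        boxes.append((min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2)))
    return boxes
```

An output that contains no box spans yields an empty list, which a caller could treat as "no object located".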

[0032] In other embodiments, the object recognition method provided in this application can be applied to scenarios in which targets are identified and natural language question answering is completed. Specifically, the electronic device displays a first image, such as a frame of a food preparation video. A user watching the video in a short video application sees a ceramic bowl with a retro pattern on the screen and wants to ask about it. The user long-presses the screen with two fingers; this operation is the first input. In response to the first input, the electronic device obtains a preset first prompt word, which guides the large language model to determine the coordinate information of each object in the video frame. The prompt word can be set to "identify all objects in the image, ignoring non-subject elements such as video subtitles, play buttons, and like icons, and output the coordinate information of each object." The video frame and the first prompt word are then input into the device-side large language model. Guided by the first prompt word, the large language model filters out irrelevant information such as subtitles and video playback controls, comprehensively identifies each object in the first image, and outputs the coordinate information of all objects, such as the coordinates of the ceramic bowl, <box>(215,389),(368,542)</box>, and the coordinates of kitchen utensils, ingredients, and other objects in the image. These coordinates are the first positioning information. The electronic device then directly displays the coordinate boxes corresponding to the first positioning information on the video screen. Each coordinate box defines the area where the corresponding object in the first image is located. The coordinate box defining the ceramic bowl is a second-type coordinate box, which gives the user a clear entry point for subsequently initiating natural language question answering about the ceramic bowl.

[0033] Furthermore, if the user's first input includes both the first image and user request information, the first prompt word can be reconstructed based on the user request information, as shown in the above embodiment.

[0034] In some embodiments of this application, the object recognition method provided in this application can be applied to scenarios in which a target is identified and product navigation is completed. Specifically, a user browsing an outfit image-and-text post in a lifestyle sharing application sees a white canvas bag in the picture and wants to find the same bag and jump to a purchase page. The user long-presses the screen with two fingers; this operation serves as the first input that triggers the screen visual positioning task. After receiving this input, the electronic device activates the screen visual positioning function. That is, in response to the first input, it obtains the user request information "find the white canvas bag in the picture, ignore text and icons", which the user enters by voice and/or text, and generates a corresponding first prompt word based on that information. The first prompt word may include the user description instruction "locate the white canvas bag in the picture, ignore non-subject elements such as text areas, application icons, and floating icons, and output target coordinate information". The current post and the first prompt word are then input into the device-side large language model. Based on the first prompt word, the large language model identifies the first object, the white canvas bag, in the first image. After filtering out irrelevant information such as the title text, the blogger's avatar, and application function icons, it outputs the coordinate information of the canvas bag in the first image (e.g., <box>(326,458),(412,596)</box>). This coordinate information is the first positioning information. The electronic device can then display, on the current screen, a first interface that includes the post, and simultaneously display a first-type coordinate box at the coordinate location corresponding to the first positioning information. This coordinate box defines the area where the white canvas bag is located in the first image, so the user can see the targeted product directly. If the user clicks on this first-type coordinate box, they are redirected to the shopping interface for the white canvas bag.
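How a first prompt word might be assembled from a target description can be sketched as a simple template. This is an illustrative assumption rather than the method of this application: the text does not say how the target ("the white canvas bag") is extracted from the user request information, so the sketch takes it as an already extracted argument, and `build_first_prompt` and `DEFAULT_IGNORED` are hypothetical names.

```python
from typing import Optional, Sequence

# Interfering elements named in the example prompt wording.
DEFAULT_IGNORED = ("text areas", "application icons", "floating icons")


def _join(items: Sequence[str]) -> str:
    """Join items as 'a', 'a and b', or 'a, b, and c'."""
    items = list(items)
    if len(items) <= 1:
        return "".join(items)
    if len(items) == 2:
        return " and ".join(items)
    return ", ".join(items[:-1]) + ", and " + items[-1]


def build_first_prompt(
    target: Optional[str] = None,
    ignored: Sequence[str] = DEFAULT_IGNORED,
) -> str:
    """Build a first prompt word; with no target, fall back to the all-objects form."""
    ignore_clause = "ignore non-subject elements such as " + _join(ignored)
    if target:
        return (
            f"Locate {target} in the picture, {ignore_clause}, "
            "and output target coordinate information."
        )
    return (
        f"Identify all objects in the image, {ignore_clause}, "
        "and output the coordinate information of each object."
    )
```

Calling `build_first_prompt("the white canvas bag")` reproduces the wording of the user description instruction quoted in the embodiment, while calling it without a target yields a preset-style prompt that asks for every object.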

[0035] In other embodiments, the object recognition method provided in this application can be applied to scenarios in which a target is identified and natural language question answering is completed. Specifically, a user watches a food preparation video in a short video application. The screen displays a ceramic bowl with a retro pattern, and the user wants to ask about it. The user long-presses the screen with two fingers; this operation serves as the first input that triggers the screen visual positioning task. The electronic device receives the input and activates the screen visual positioning function. That is, in response to the first input, it obtains the natural language user request information entered by the user: "What style is the ceramic bowl in the picture? Ignore video subtitles and function buttons." Based on this user request information, a corresponding first prompt word is generated. The first prompt word includes the user description instruction: "Locate the ceramic bowl in the picture, ignore non-subject elements such as video subtitles, play buttons, and like icons, and output target coordinate information." The video image on the current screen and the first prompt word are then input into the device-side large language model. Following the constraints of the first prompt word, the large language model filters out irrelevant information such as subtitles and video playback controls, identifies the ceramic bowl, and outputs its coordinate information in the first image (e.g., <box>(215,389),(368,542)</box>). This coordinate information is the first positioning information. The electronic device then displays the first interface on the current video playback page. While retaining the video image in the interface, it displays a second-type coordinate box at the coordinate position corresponding to the first positioning information. This coordinate box defines the area where the ceramic bowl is located in the first image, giving the user a clear entry point for subsequently initiating natural language question answering about this target.
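The two kinds of coordinate box described in these scenarios lead to different follow-up interactions, which can be sketched as a hit test on the user's click position. The class and function names below are hypothetical, and the rule of preferring the smallest box when several overlap is an assumption; the text only states that a first-type box leads to a shopping interface and a second-type box supports natural language question answering.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class BoxType(Enum):
    FIRST = "product_navigation"   # clicking redirects to a shopping interface
    SECOND = "question_answering"  # entry point for natural language Q&A


@dataclass
class CoordinateBox:
    x1: int
    y1: int
    x2: int
    y2: int
    box_type: BoxType

    def contains(self, x: int, y: int) -> bool:
        return self.x1 <= x <= self.x2 and self.y1 <= y <= self.y2

    def area(self) -> int:
        return (self.x2 - self.x1) * (self.y2 - self.y1)


def box_at(boxes: List[CoordinateBox], x: int, y: int) -> Optional[CoordinateBox]:
    """Return the box under the click, or None if the click hits no box."""
    hits = [box for box in boxes if box.contains(x, y)]
    # Assumption: when boxes overlap, the smallest one is the most specific target.
    return min(hits, key=CoordinateBox.area, default=None)
```

A caller would then branch on `box_type` of the returned box to open either the shopping interface or the question answering flow.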

[0036] The object recognition method provided in the embodiments of this application is described in detail below with reference to Figure 1.

[0037] Figure 1 is a flowchart of an object recognition method provided in some embodiments of this application.

[0038] As shown in Figure 1, the object recognition method provided in the embodiments of this application can be applied to electronic devices or servers. Based on this, the object recognition method may include steps 110 to 130, as detailed below.

[0039] Step 110: While displaying the first image, receive the user's first input;
Step 120: In response to the first input, input the first image and the first prompt word into the large language model to obtain the first positioning information output by the large language model; the first positioning information is the coordinate information of each object in the first image, and the first prompt word is used to guide the large language model to determine the coordinate information of each object in the first image;
Step 130: Display the coordinate box corresponding to the first positioning information in the first image.
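The flow of steps 110 to 130 can be sketched as a single handler. This is a minimal sketch under stated assumptions: `locate` stands in for the large language model call and is passed in as a plain callable, `handle_first_input` and `CoordinateBoxOverlay` are hypothetical names, and actual rendering of the boxes is left to the display layer.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in first-image pixels

# Prompt wording taken from the embodiment text; other wording could be used.
FIRST_PROMPT = (
    "Identify all objects in the image, ignore text areas, application icons, "
    "floating icons and other non-subject elements, and output the coordinate "
    "information of each object."
)


@dataclass
class CoordinateBoxOverlay:
    """One coordinate box to draw over the first image."""
    box: Box


def handle_first_input(
    first_image: object,
    locate: Callable[[object, str], List[Box]],
    first_prompt: str = FIRST_PROMPT,
) -> List[CoordinateBoxOverlay]:
    """Called once step 110 has delivered the user's first input."""
    # Step 120: image and prompt go to the model; boxes come back.
    first_positioning_info = locate(first_image, first_prompt)
    # Step 130: one overlay per located object, for the display layer to draw.
    return [CoordinateBoxOverlay(box) for box in first_positioning_info]
```

Passing the model as a callable keeps the sketch independent of any particular model runtime, which the text does not name.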

[0040] In this way, the constraint of the first prompt word guides the large language model to comprehensively identify all objects in the first image, and the model's positioning task is clearly defined as obtaining the coordinate information of all objects. Objects in the first image are thereby distinguished from irrelevant interfering elements such as icons, text, advertisements, and floating buttons, which prevents the large language model, at the level of the positioning logic itself, from misjudging interfering elements as objects. Furthermore, because the large language model determines the coordinate information of each object in the first image based on the first prompt word, it can overcome occlusion by interfering elements and identify the actual position of occluded objects in the first image, which solves the problem of objects being unrecognizable due to occlusion by interfering elements. Moreover, a coordinate box corresponding to the coordinate information of each object is displayed in the first image, which visualizes the positioning results of the large language model and improves the positioning accuracy of each object in the first image. This achieves accurate recognition and comprehensive positioning of all objects in the first image, meeting users' needs for accurate and comprehensive positioning of multiple objects in images.

[0041] The steps described above are explained in detail below.

[0042] Regarding step 110, in some embodiments of this application, considering that the initial large language model lacks dedicated adaptation capability for screen visual positioning and that screen scenes have diverse positioning requirements, and that it is difficult to guarantee positioning accuracy without targeted training data, embodiments of this application provide a dedicated sample data construction method for screen visual positioning tasks. Based on this, before step 110, embodiments of this application also adjust the large language model. Therefore, the object recognition method may also include steps 1601 and 1602.

[0043] Step 1601: Obtain sample images for the screen visual positioning task; the screen visual positioning task corresponds to N positioning scenarios, and different positioning scenarios correspond to different annotation requirements. The annotation requirements are used to constrain the corresponding quantitative relationship between the sample description instructions and the sample objects in the sample images. The sample description instructions are instructions used to describe the sample objects, and N is an integer greater than or equal to 2.

[0044] In this step, the screen visual positioning task is taken to correspond to four positioning scenarios as an example, i.e., N=4. The four positioning scenarios in this embodiment can include a single-object description scenario, a same-category multi-object description scenario, a multi-category multi-object description scenario, and a no-object description scenario. All sample images are screenshots of various applications on the device side. Diverse device-side screen images can be collected as sample images, 100,000 images in total, including screenshots of outfit pictures and text from lifestyle sharing applications, screenshots of food video playback from short video applications, screenshots of chat interfaces from social applications, and screenshots of e-commerce product displays from shopping applications, covering the mainstream scenarios of daily screen browsing. The annotation requirements of the different scenarios impose differentiated constraints on the quantitative relationship between sample description instructions and sample objects in the sample image. Sample description instructions are natural language instructions used to describe sample objects in the sample image. The specific quantitative relationship constraint rules are as follows. Single-object description scenario: 1 sample description instruction corresponds to 1 sample object in the sample image. Same-category multi-object description scenario: 1 sample description instruction corresponds to at least 2 sample objects of the same category in the sample image. Multi-category multi-object description scenario: 1 sample description instruction corresponds to at least 2 sample objects of different categories in the sample image. No-object description scenario: 1 sample description instruction has no corresponding sample object in the sample image.
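As an illustration of the quantitative relationship rules above, the following minimal sketch infers the positioning scenario from the set of sample objects that one sample description instruction refers to. The names `SampleObject` and `classify_scenario` and the scenario labels are hypothetical and are not taken from this application.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SampleObject:
    category: str                      # e.g. "lipstick" (hypothetical label)
    box: Tuple[int, int, int, int]     # (x1, y1, x2, y2) in pixels


def classify_scenario(objects: List[SampleObject]) -> str:
    """Return the scenario implied by the objects one instruction refers to."""
    if not objects:
        return "no_object"                      # 1 instruction, no object
    if len(objects) == 1:
        return "single_object"                  # 1 instruction, 1 object
    categories = {obj.category for obj in objects}
    if len(categories) == 1:
        return "same_category_multi_object"     # >= 2 objects, one category
    return "multi_category_multi_object"        # >= 2 objects, several categories


if __name__ == "__main__":
    lipsticks = [SampleObject("lipstick", (10, 10, 40, 90)),
                 SampleObject("lipstick", (60, 10, 90, 90))]
    print(classify_scenario(lipsticks))
```

Such a check could be run over annotated data to confirm that each sample satisfies the annotation requirement of the scenario it is assigned to.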

[0045] Step 1602: Based on the sample images and the annotation requirements for each positioning scenario, determine the sample data for each positioning scenario. The sample data is used to adjust the initial large language model. The adjusted initial large language model is used to identify objects in the first image based on user requirements and the first prompt word. The sample data for each positioning scenario includes the first sample image in the sample images and the annotation information corresponding to the positioning scenario. The annotation information includes sample description instructions and sample positioning information. The sample positioning information is used to characterize the position of the sample object in the first sample image. The sample object includes at least interfering objects in the screen scene and reference positioning objects.

[0046] In this step, for the 100,000 sample images collected in step 1601, and combining the annotation requirements of the four positioning scenarios, scene matching, object annotation, and instruction writing are performed on each image to generate scene-specific sample data. Each set of sample data includes a sample image and annotation information. The annotation information includes sample description instructions, coordinate information of the sample object in the image, etc., which are used to adjust the initial large language model. The adjusted large language model can accurately identify objects in the first image based on user requirement information and the first prompt word. The sample data production process for each scenario is as follows. Scenario 1: Single-object description scenario. Select a sample image containing a single core object, such as an e-commerce screenshot showing only a water cup. Following the annotation requirement of "one instruction corresponding to one object," write the sample description instruction "white ceramic water cup with handle," and annotate the coordinates of the water cup in the image as <box>(256,312),(489,658)</box>. The e-commerce screenshot, sample description instruction, and coordinate information are used as one sample data point for the single-object description scenario.

[0047] Scenario 2: Description of multiple objects within the same category. Select a sample image containing multiple objects of the same category, such as a screenshot of a beauty product image showing multiple lipsticks. Following the labeling requirement that "one instruction corresponds to at least two objects of the same category," write the sample description instruction "lipstick with metal decoration on the tube," and label the coordinate information of the three lipsticks in the image that match the description. Use this beauty product screenshot, the sample description instruction, and the three coordinate information as one sample data point for the description of multiple objects within the same category.

[0048] Scenario 3: Multi-category, multi-object description scenario. Select sample images containing multiple objects of different categories, such as a screenshot of a daily life scene showing desktop ornaments, including a mug, a cartoon doll, and a wooden photo frame. According to the labeling requirement of "one instruction corresponding to at least two objects of different categories", write one sample description instruction "a mug with a cartoon pattern and a pink plush cartoon doll", and label the coordinate information of the two objects of different categories respectively. Use this daily life screenshot, one sample description instruction, and two coordinate information as one sample data for the multi-category, multi-object description scenario.

[0049] Scenario 4: No-object description scenario. Select a sample image that does not contain the specified object, such as a screenshot of a home showing only green plants and no electronic products. According to the labeling requirement of "one instruction with no corresponding object", write the sample description instruction "black wireless Bluetooth headphones" and label the location result as "none" to indicate that there is no corresponding object. This home screenshot, the sample description instruction, and the "none" label are used as one sample data point for the no-object description scenario.
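The annotation targets in the four scenario examples above are either one or more coordinate boxes or the label "none". The following sketch shows one way to serialize and parse such targets. It assumes the box notation is `<box>(x1,y1),(x2,y2)</box>` with integer pixel coordinates; the function names are hypothetical.

```python
import re
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)

# Assumed textual form of a box: <box>(x1,y1),(x2,y2)</box>
BOX_PATTERN = re.compile(
    r"<box>\s*\((\d+),\s*(\d+)\),\s*\((\d+),\s*(\d+)\)\s*</box>"
)


def format_box(box: Box) -> str:
    x1, y1, x2, y2 = box
    return f"<box>({x1},{y1}),({x2},{y2})</box>"


def format_target(boxes: List[Box]) -> str:
    """Serialize the located boxes, or "none" for the no-object scenario."""
    if not boxes:
        return "none"
    return "".join(format_box(b) for b in boxes)


def parse_target(text: str) -> List[Box]:
    """Inverse of format_target: "none" yields an empty list."""
    if text.strip() == "none":
        return []
    return [tuple(int(v) for v in m.groups())
            for m in BOX_PATTERN.finditer(text)]


if __name__ == "__main__":
    print(format_target([(256, 312, 489, 658)]))
    print(format_target([]))
```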

[0050] In this way, the annotation and sample data production of the four scenes can be completed one by one for 100,000 sample images, and finally the sample data set of each positioning scene can be obtained. All sample data will be used together for the customized training of the initial 3B multimodal large language model, so that the model has the ability to recognize different screen positioning scenes.

[0051] It should be noted that, in order to improve the general object localization capability of the large language model in real-world application scenarios, the constructed sample data not only covers conventional, locatable objects such as cars, electronic products, sunglasses, and clothing, but also explicitly includes highly interfering elements in the screen scene, such as icons, text, and interface decorations. Furthermore, the data is systematically divided and proportioned from the perspective of modal relationships and annotation structure. Based on this, interfering objects can include at least one of the following: interface function objects, text information objects, and image quality-related objects. Specifically, interface function objects refer to interactive visual components inherent to the screen, such as application icons, floating icons, function buttons, and UI interface decorations. Text information objects refer to various explicit text content on the screen, such as title text, descriptive text, timestamps, and comment text. Image quality-related objects refer to blurred and / or low-resolution areas of the background in the screen image, visual background parts without actual target value. Reference positioning objects refer to conventional positioning objects, such as cars, skirts, and hats. Therefore, multi-scene training samples including screen interference objects can be constructed, providing a data foundation for the initial model to adapt to screen visual localization tasks and improving the model's adaptability to screen scenes and its ability to distinguish targets.

[0052] It should be noted that, in order to clarify the specific positioning scenario types of the screen visual positioning task and to standardize the quantitative relationship between sample description instructions and sample objects in different scenarios, this application provides specific definitions and corresponding quantitative relationship constraint rules for four types of positioning scenarios. Based on this, the N positioning scenarios in this application include single-object description scenarios, multi-object description scenarios of the same category, multi-object description scenarios of multiple categories, and no-object description scenarios. Specifically, a single-object description scenario corresponds to the first annotation requirement information, which constrains that one sample description instruction corresponds to one sample object in the sample image. A multi-object description scenario of the same category corresponds to the second annotation requirement information, which constrains that one sample description instruction corresponds to at least two sample objects of the same category in the sample image. A multi-object description scenario of multiple categories corresponds to the third annotation requirement information, which constrains that at least two sample description instructions correspond to at least two sample objects of different categories in the sample image. A no-object description scenario corresponds to the fourth annotation requirement information, which constrains that at least one sample description instruction does not have a corresponding sample object in the sample image.

[0053] Unlike existing expert models that primarily rely on a single detection paradigm (strong alignment between text and boxes), this application abstracts practical applications into four scenarios: single-object description scenario (1v1), same-category multi-object description scenario (1vN), multi-category multi-object description scenario (NvN), and no-object description scenario (NvNone). The sample data for each positioning scenario is explained below with reference to Figure 2.

[0054] A single-object description scenario (1v1) refers to a scenario where one sample description instruction precisely corresponds to one sample object in a sample image. The sample description instruction must include detailed feature descriptions of the object to achieve accurate localization of a single target. For example, as shown in Figure 2, a screenshot including a person is used as the sample image. The sample description instruction is "a girl wearing an orange dress is lying on the grass with a yellow flower covering her face in front of her". This single instruction points only to this one girl in the image, who is the only sample object. The model needs to output the unique coordinate information of the girl in the sample image according to the instruction. This type of sample data is used to improve the model's single-target detail recognition and accurate localization capabilities.

[0055] A same-category multi-object description scenario (1vN) refers to a scenario where a single sample description instruction corresponds to at least two sample objects of the same category in a sample image. The instruction focuses on the category features of the sample objects, achieving full-coverage localization of multiple targets within the same category. For example, as shown in Figure 2, a screenshot containing flowers and plants is used as the sample image. The sample description instruction is "orange and yellow flowers". This single instruction points to multiple sample objects of the same category, namely all orange and yellow flowers in the image. The large language model needs to output the coordinate information of each flower in the image that matches the description according to the instruction. This type of sample data is used to improve the model's ability to detect dense objects of the same category and to locate them in detail.

[0056] A multi-category multi-object description scenario (NvN) refers to a scenario where a single sample description instruction corresponds to at least two sample objects of different categories in a sample image. The instruction covers multi-category features, enabling fine-grained localization of multiple categories and multiple targets. For example, as shown in Figure 2, a screenshot containing various elements such as people, flowers, text, and icons is used as the sample image. The sample description instructions are "orange and yellow flowers", "laughing girl", "March 17th", and "icon". The multiple instructions point to the four different sample objects in the image: flowers, girl, text, and icon. The model needs to output the coordinate information of the object of the corresponding category according to each instruction. This type of sample data includes tens of thousands of category labels, which are used to improve the model's fine-grained recognition and multi-category multi-target localization capabilities.

[0057] A no-object description scenario (NvNone) refers to a scenario where at least one sample description instruction exists, but there is no corresponding sample object in the sample image. The model must identify "no corresponding object" and output the label "none," thus constraining the model's over-generation and hallucination behavior. For example, as shown in Figure 2, a screenshot containing only text, icons, and greenery, without vehicles, is used as the sample image. The sample description instruction is "red car," but there is no corresponding sample object in the image, so the model should output the location result "none." This type of negative sample data prevents the model from giving arbitrary answers and improves the accuracy and controllability of the model's judgment in scenarios where the target does not exist. The broad inclusion of screen data enhances screen perception and recognition capabilities and allows irrelevant content on the screen to be filtered out. For example, adding "block icons and text" to the prompt can resolve many errors that occur in video perception and recognition. Among the four scenario datasets, the 1v1 dataset contains very rich image detail descriptions, which can improve attribute recognition capabilities; the 1vN dataset contains detailed box listings for a specific category, mostly densely detected data, used to improve the model's ability to locate objects in detail; the NvN dataset includes multiple categories with more than 20,000 category labels, enhancing fine-grained recognition and detection; the NvNone dataset requires the model to explicitly output "none" when the target does not exist, used to constrain the over-generation and hallucination behavior of large language models. This last capability is generally lacking in traditional detection data but is crucial for multimodal large language models.
By reasonably allocating the above multi-structured data during the training phase, rather than sampling from a single detection distribution, the model not only obtains higher accuracy in general object localization, but also becomes significantly more robust and controllable in complex-semantics, multi-target, and target-free scenarios.

[0058] Therefore, four core scenarios for standardized screen visual positioning can be identified, the sample annotation logic for each scenario can be clarified, the standardization of training data can be ensured, and the model can adapt to diverse screen positioning needs such as single object, multiple object, and no object.

[0059] In some embodiments of this application, to address the problem that the sample description instructions for single-object description scenarios are too simple and lack sufficient feature characterization, resulting in weak model capabilities for detailed identification and localization of single objects, this application provides a method for expanding sample description instructions for single-object scenarios. Based on this, N localization scenarios include single-object description scenarios, and the single-object description scenario corresponds to the first annotation requirement information. Step 1602 specifically includes steps 16021 to 16024.

[0060] Step 16021: Based on the first sample positioning information in the first annotation requirement information, extract the sample sub-image located by the first sample positioning information from the first sample image.

[0061] In this step, using a single-object description scenario (1v1) as the background, a screenshot displaying outfit content is selected as the first sample image. The sample object is the "white knitted cardigan" in the image. The sample description instruction is expanded using the GPT-4o multimodal language model. Based on this, the first annotation requirement information corresponding to the single-object description scenario includes the first sample description instruction "white knitted cardigan," and the first sample positioning information, which is the coordinate box of the cardigan in the first sample image: <box>(186,255),(428,592)</box>. Based on the coordinate box information, the area containing only the white knitted cardigan is precisely cropped from the first sample image displaying the outfit content, resulting in a sample sub-image. This sub-image contains no irrelevant elements such as screen text, icons, or background figures, and retains only the target sample object.

[0062] Step 16022: Input the sample sub-image and the third prompt word into the multimodal language model to obtain the sample augmentation description instruction output by the multimodal language model. The third prompt word is used to instruct the multimodal language model to generate the sample augmentation description instruction based on the features of the sample objects in the sample sub-image. The sample augmentation description instruction has more feature words for the sample objects than the first sample description instruction in the first annotation requirement information.

[0063] In this step, the cropped sample sub-image of a "white knitted cardigan" is input into the GPT-4o multimodal language model along with a third prompt, such as the preset "precisely describe the appearance details, style features, and texture of the object in the image, generating detailed natural language description instructions, with feature characterization down to dimensions such as neckline, cuffs, cut, and decorations." Constrained by the third prompt, the model extracts details and expands the description based on the visual features of the white knitted cardigan in the sample sub-image, outputting the expanded description instruction as "a round-neck, loose-fitting white knitted cardigan with tapered cuffs, vertical stripes on the body, and small tassel decorations at the hem." This expanded description instruction contains far more feature words such as round neck, loose fit, tapered cuffs, vertical stripes, and small tassel decorations than the first sample description instruction, such as white, knitted, and cardigan, achieving a refined characterization of the sample object's features.

[0064] Step 16023: Replace the first sample description instruction in the first annotation requirement information with the sample expansion description instruction to obtain the second annotation requirement information.

[0065] In this step, the original first sample description instruction in the first annotation requirement information, such as "white knitted cardigan," is replaced with the sample augmentation description instruction output by GPT-4o, such as "a round-neck, loose-fitting white knitted cardigan with tapered cuffs, vertical stripes on the body, and small tassel decorations at the hem." After the replacement, the second annotation requirement information is obtained, which retains the first sample positioning information from the first annotation requirement information, <box>(186,255),(428,592)</box>. Only the sample description instruction is updated, so that the positioning information remains unchanged while the description information is refined.

[0066] Step 16024: Based on the second annotation requirement information and the first sample image in the sample images, determine the sample data for the single object description scene.

[0067] In this step, the second annotation requirement information, i.e., the expanded sample description instruction and the original sample positioning information, is associated and bound with the corresponding first sample image, i.e., the screenshot of the outfit content, to form a complete set of single-object description scenario sample data with expanded features. The format of this sample data is: {Sample image: outfit screenshot, Sample description instruction: a round-neck, loose-fitting white knitted cardigan with tapered cuffs, vertical stripes on the body, and small tassel decorations at the hem, Sample location information: <box>(186,255),(428,592)</box>}
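Steps 16021 to 16024 can be sketched as follows. This is only an illustrative outline under stated assumptions: the image is a toy nested list of pixel values, the box is treated as (x1, y1, x2, y2) with exclusive upper bounds, and `describe` is a hypothetical callable standing in for the multimodal language model. No real model API is called.

```python
from typing import Callable, Dict, List, Tuple

Image = List[List[int]]           # toy image: rows of pixel values
Box = Tuple[int, int, int, int]   # (x1, y1, x2, y2), upper bounds exclusive


def crop(image: Image, box: Box) -> Image:
    """Step 16021: extract the sub-image located by the positioning info."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def expand_sample(image: Image,
                  annotation: Dict,
                  describe: Callable[[Image, str], str],
                  prompt: str) -> Dict:
    """Run steps 16021-16024 for one single-object sample."""
    sub_image = crop(image, annotation["box"])                 # step 16021
    expanded = describe(sub_image, prompt)                     # step 16022
    new_annotation = {**annotation, "instruction": expanded}   # step 16023
    return {"image": image, **new_annotation}                  # step 16024


if __name__ == "__main__":
    # Hypothetical stand-in for the multimodal model: returns a fixed text.
    def fake_describe(sub_image: Image, prompt: str) -> str:
        return "a round-neck, loose-fitting white knitted cardigan"

    toy_image = [[r * 10 + c for c in range(4)] for r in range(4)]
    first = {"instruction": "white knitted cardigan", "box": (1, 1, 3, 3)}
    sample = expand_sample(toy_image, first, fake_describe,
                           "describe the object in detail")
    print(sample["instruction"], sample["box"])
```

Note that the box is carried over unchanged while only the instruction is replaced, matching the description of step 16023.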

[0068] Therefore, following the steps described above, all the original sample data of the single-object description scenario are expanded with instructions and reconstructed one by one, finally yielding a set of single-object scenario sample data with rich feature information and complete detail. This type of data accounts for 27% of the overall training data. It enriches the feature information of single-object description instructions and improves the diversity of the training data, thereby enhancing the model's ability to recognize details, perform visual reasoning, and accurately locate single-target objects, and improving the positioning accuracy of 1v1 scenarios.

[0069] In some embodiments of this application, considering that the proportion of sample data across the different positioning scenarios directly affects the positioning accuracy, recall rate, and generalization ability of the model, and that an unreasonable proportion can easily lead to poor performance of the model in some scenarios, this application provides a training sample data adjustment method based on a preset proportion, by which the number of samples for each positioning scenario can be adjusted. Based on this, before step 1602, the object recognition method may also include steps 1603 and 1604.

[0070] Step 1603: Adjust the number of sample data for each positioning scenario according to the preset data matching information to obtain training sample data for N positioning scenarios. The data matching information is used to control the proportion of sample data for each positioning scenario in the total sample data.

[0071] In this step, prior to step 1603, it is necessary to determine the preset data ratio information. Based on this, the preset data ratio information can be determined through the following steps.

[0072] Specifically, after completing the production of raw sample data for four localization scenarios—single-object description, same-category multi-object description, multi-category multi-object description, and no-object description—multiple rounds of matching algorithm optimization and training testing were conducted on the sample data for the four scenarios to find the optimal data ratio that balances the model's localization accuracy and recall. Initially, multiple matching schemes were set, such as 20% single-object data, 30% same-category multi-object, 45% multi-category multi-object, and 5% no-object; or 25% single-object, 30% same-category multi-object, 40% multi-category multi-object, and 5% no-object. Sample data from each matching scheme were then used to train the large language model. Screen visual localization capability tests were performed on the models trained with each matching scheme, focusing on indicators such as single-target detail recognition accuracy, same-category multi-target detection accuracy, multi-category multi-target recall, and accuracy in judging no-target scenarios. To address issues identified during testing, adjustments were made to the data allocation. For instance, when the proportion of multi-category, multi-object data exceeded 40%, the overall localization accuracy of the model significantly decreased; when the proportion was below 40%, the recall rate for multi-object scenes was insufficient; when the proportion of multi-object data of the same category increased to 29%, the model achieved the best detection accuracy for repeated objects of the same category; when the proportion of single-object data was 27%, the model's visual reasoning and attribute recognition capabilities were strongest; and when the proportion of no-object data was 4%, it effectively constrained the model's hallucination behavior without negatively impacting the learning efficiency of positive samples due to excessive negative samples. 
After dozens of rounds of training, testing, and optimization with different allocations, the optimal data allocation for the four types of scenarios was finally determined, i.e., the preset data allocation information: 27% for single-object description scenarios, 29% for multi-object description scenarios of the same category, 40% for multi-category, multi-object description scenarios, and 4% for no-object description scenarios.

[0073] Based on this, according to the 27%:29%:40%:4% matching rule used in step 1603, the original sample data produced for the four types of positioning scenarios are filtered and adjusted in quantity, and then integrated to obtain the screen visual positioning task training sample data used to train the initial 3B multimodal large language model. The total number of original sample data across the four types of scenarios is counted first; this example assumes 1 million original sample data. The amount of sample data to be retained for each type of scenario is then calculated according to the matching ratio: 270,000 for single-object description scenarios, 290,000 for same-category multi-object description scenarios, 400,000 for multi-category multi-object description scenarios, and 40,000 for no-object description scenarios.
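The quantity calculation in this example is simple arithmetic and can be reproduced with the short sketch below. The scenario keys and the function name `scenario_quotas` are hypothetical; the percentages and the 1 million total are the ones stated above.

```python
from typing import Dict

# Preset data ratio from the description, in percent.
RATIOS = {
    "single_object": 27,
    "same_category_multi_object": 29,
    "multi_category_multi_object": 40,
    "no_object": 4,
}


def scenario_quotas(total: int, ratios: Dict[str, int]) -> Dict[str, int]:
    """Number of samples to retain per scenario for a given total."""
    if sum(ratios.values()) != 100:
        raise ValueError("ratios must sum to 100 percent")
    # Integer arithmetic; rounds down when total is not a multiple of 100.
    return {name: total * pct // 100 for name, pct in ratios.items()}


if __name__ == "__main__":
    for name, count in scenario_quotas(1_000_000, RATIOS).items():
        print(f"{name}: {count}")
```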

[0074] The raw sample data for each scenario undergoes quality screening. For single-object description scenarios, high-quality data augmented with GPT-4o details is retained to ensure rich feature representation. For same-category multi-object description scenarios, densely populated detection samples are retained to improve the model's accuracy in detecting objects of the same category. For multi-category multi-object description scenarios, fine-grained samples covering about 20,000 category labels are retained to match real-world multi-object localization applications. For no-object description scenarios, negative samples of common irrelevant elements on the screen are retained to ensure the model can accurately determine the absence of a corresponding object and output "none." The sample data for the four scenarios is extracted and integrated according to the calculated quantities, removing low-quality and duplicate samples, to obtain 1 million training samples that meet the required ratio (e.g., 270,000:290,000:400,000:40,000). The integrated sample data is then standardized according to training requirements, unifying the storage structure of the sample data, including sample images, sample description instructions, and sample localization information, and forming the final training sample data for the screen visual localization task that can be directly used to train the initial large language model.
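The extraction, de-duplication, and standardization described above might look like the following sketch. It is an assumption-laden illustration: samples are plain dictionaries, duplicates are detected by identical image identifier and instruction, the quality screening itself is not modeled, and the record fields are hypothetical.

```python
import random
from typing import Dict, List


def build_training_set(pools: Dict[str, List[dict]],
                       quotas: Dict[str, int],
                       seed: int = 0) -> List[dict]:
    """Deduplicate each scenario pool, draw its quota, unify the record layout."""
    rng = random.Random(seed)
    training_set = []
    for scenario, quota in quotas.items():
        seen = set()
        unique = []
        for sample in pools.get(scenario, []):
            key = (sample["image"], sample["instruction"])  # duplicate criterion (assumed)
            if key not in seen:
                seen.add(key)
                unique.append(sample)
        if len(unique) < quota:
            raise ValueError(f"not enough samples for scenario {scenario!r}")
        for sample in rng.sample(unique, quota):
            # Unified storage structure: image, instruction, localization info.
            training_set.append({
                "scenario": scenario,
                "image": sample["image"],
                "instruction": sample["instruction"],
                "boxes": list(sample["boxes"]),
            })
    return training_set


if __name__ == "__main__":
    pools = {
        "single_object": [
            {"image": "a.png", "instruction": "cup", "boxes": [(1, 2, 3, 4)]},
            {"image": "a.png", "instruction": "cup", "boxes": [(1, 2, 3, 4)]},
            {"image": "b.png", "instruction": "hat", "boxes": [(5, 6, 7, 8)]},
        ],
        "no_object": [{"image": "c.png", "instruction": "red car", "boxes": []}],
    }
    print(len(build_training_set(pools, {"single_object": 2, "no_object": 1})))
```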

[0075] Step 1604: Using training sample data, adjust the task parameters corresponding to the screen visual localization task in the initial large language model to obtain the adjusted large language model.

[0076] Therefore, by reasonably allocating sample data from various scenarios, the model can learn in a balanced manner across multiple scenarios, taking into account both positioning accuracy and recall, thereby improving the overall positioning performance of the model in complex screen scenarios.

[0077] In some embodiments of this application, considering the problem that training the initial model directly using only sample data without combining the user demand information corresponding to the sample leads to a disconnect between the model training and the actual user usage scenario, this application provides a model training method that combines sample user demand information. Based on this, the initial large language model is trained based on image-text pair sample data, which includes sample images and the descriptive text corresponding to the sample images. The initial large language model is used to perform cross-modal object alignment between images and text. Based on this, the above step 1604 may specifically include steps 16041 and 16042.

[0078] Step 16041: Based on the training sample data, determine the sample user demand information corresponding to each sample description instruction.

[0079] Step 16042: Train the initial large language model based on the sample user demand information and training sample data to obtain the large language model.

[0080] In this step, the initial large language model can be progressively trained in three stages: pre-training, supervised fine-tuning (SFT), and reinforcement learning, based on the training sample data, namely the training sample data of N positioning scenarios and the sample user demand information corresponding to the sample description instructions. The specific three stages will be explained in detail.

[0081] The first stage involves continued pre-training. The goal of this stage is to enable the initial large language model to learn high-quality cross-modal representations of images and text, strengthening the model's ability to extract and match image and text features, and improving the consistency and alignment accuracy between visual and textual representations. This lays the foundation for accurate inference in subsequent screen visual localization. Based on this, approximately 10 million high-quality image-text pairs are introduced, incorporating user demand information and basic sample features corresponding to each of the N localization scenarios, ensuring the pre-training data aligns with the task attributes of screen visual localization. Based on the initial large language model's basic network structure, image-text pairs are input into the model, allowing it to learn the correspondence between image content and text descriptions in large-scale data. The focus is on optimizing the model's visual encoder and text encoder to achieve efficient fusion and accurate matching of cross-modal features. After this stage, the model possesses stable cross-modal representation capabilities: it can accurately extract visual features from the sample image and textual features from the sample description instructions, achieve basic alignment of visual and textual features, and initially identify the semantic relationship between images and text.

[0082] The second stage, the SFT stage, aims to adapt the model to specific screen visual localization tasks, building upon pre-training. This stage develops advanced capabilities such as referential detection, visual question answering, and bounding box regression, enabling the model to accurately locate target objects in sample images and output coordinate information based on sample description instructions and user needs. To achieve this, training sample data for N localization scenarios is used. This data includes sample images, sample description instructions, sample localization information, and corresponding user needs, covering all scenarios: single-object 1v1, same-category multi-object 1vN, multi-category multi-object NvN, and no-object NvNone. This allows the conversational instruction data to be input into the pre-trained large language model for supervised fine-tuning training, including bounding box regression training. The model learns to predict the bounding box position of sample objects in sample images based on sample description instructions and user needs. The shared visual encoder is unfrozen and fine-tuned in conjunction with other modules, breaking the fixed parameter limitations of the visual encoder and ensuring that visual feature extraction is fully aligned with the screen visual localization task objectives. This improves the model's ability to recognize features of complex and highly interfering visual elements in screen scenes.

[0083] In this way, the model acquires a dedicated capability for screen visual localization: it can detect target objects in the first image and generate localization boxes for them, and it can produce basic sample object localization information in the four localization scenarios based on sample user demand information and sample description instructions. It also has a preliminary visual question answering capability and can respond to simple instructions related to the localization results.

[0084] The third stage, reinforcement learning, aims to structurally constrain and finely optimize the model's localization output through the trial-and-error and reward feedback mechanisms of reinforcement learning. This improves the model's localization accuracy, recall, and output format standardization in complex screen scenarios, making the model's localization behavior more aligned with actual user needs. To this end, 50,000 high-precision dialogue data items are introduced, all drawn from training samples across the N localization scenarios. These samples are bound to the user demand information corresponding to the sample description instructions, resulting in higher data quality and scenarios more closely aligned with actual screen interactions. The data also includes numerous challenging examples with highly interfering screen elements and complex multi-target localization. The semantic requirements of the user demand information can be integrated into the reward calculation to ensure that the model output matches user intent. Specifically, a localization accuracy reward is provided: Hungarian matching is used to pair the predicted and ground-truth bounding boxes based on their IoU, combined with category semantic similarity calculated by a text semantic encoder, and the reward weight of highly matching results is reinforced. Next, a retrieval completeness reward is provided, which penalizes missed detections and measures the model's coverage of target objects to ensure high recall in multi-target localization scenarios. Furthermore, an output format reward can be added, constraining the model to output standardized localization results such as <obj_ref>Category Name</obj_ref><box>(x1,y1),(x2,y2)</box>; formatting errors are penalized. Based on this, multi-round sampling inference and reward feedback can be performed. That is, inference is run on a single sample four times with different sampling parameters, yielding multiple sets of different localization outputs. The comprehensive reward value of each set of results is calculated according to the reward function, and the round with the highest reward value is selected to guide the model to update its parameters in the direction of high reward. In this way, the task parameters of the model can be iteratively optimized based on reward feedback, allowing the model to continuously adjust its localization strategy and improve its ability to resist interference, locate multiple targets, and identify targets in screen scenes.

[0085] Therefore, by training the model to fit the actual user needs and scenarios, the model's ability to align user needs with the first image across modalities is strengthened. This enhances the model's understanding and execution of user location commands in practical applications, leading to a qualitative improvement in the model's screen visual positioning capabilities. While maintaining high recall, the model also achieves high positioning accuracy, effectively filtering out irrelevant interference elements such as icons, text, and advertisements on the screen, accurately identifying complex visual objects. The model's output positioning results are in a standardized format with strong parsing capabilities, and can accurately output "nothing" in object-free scenarios, avoiding hallucination behavior. This makes the model suitable for practical application scenarios of edge-side screen visual positioning.

[0086] In this embodiment of the application, the traditional training method of updating parameters after a single inference tends to train poorly and provides neither a quantitative evaluation of localization results nor guidance for parameter optimization. To solve this problem, this embodiment provides a model training optimization method based on multi-round sampling inference guided by reward values. Accordingly, the above step 16042 may include steps 160421 to 160425, as shown below.

[0087] Step 160421: Input the sample user demand information corresponding to each sample description instruction, the training sample data of N positioning scenarios, and the fourth prompt word into the initial large language model M times. This yields the predicted information output by the initial large language model in each of the M inputs. The predicted information includes the sample predicted positioning information and the sample predicted description instruction corresponding to the sample predicted positioning information. The fourth prompt word constrains the initial large language model to identify sample objects related to the sample user description instruction in the sample image based on the sample user demand information and the training sample data, and outputs predicted information related to the sample objects. The sample predicted positioning information includes the sample predicted coordinate information of the sample object in the sample image.

[0088] In this step, taking training samples of the multi-category, multi-object scenario (NvN) as an example, M=4 sampling inferences are set, the fourth prompt word is a customized constraint instruction for the visual localization task, and the initial large language model is an edge-side large language model that has completed continued pre-training and supervised fine-tuning. The selected training sample is a screenshot of a clothing image, and the user demand information is "output the coordinates of the girl wearing a black top, cup, bag, shoes, skirt, and sweatshirt in the image". The sample description instruction is consistent with the user demand information and clearly specifies the need to locate 6 types of sample objects. The sample localization information includes the precise coordinate information of the 6 types of objects, in the format <obj_ref>category</obj_ref><box>(x1,y1),(x2,y2)</box>. The fourth prompt word is: "Based on the sample user demand information, identify the target sample objects in the sample image, and output the sample predicted description instruction and the corresponding sample predicted coordinate information. The coordinate information must be output in the format <obj_ref>category</obj_ref><box>(x1,y1),(x2,y2)</box>, ensuring that target categories are not omitted and that irrelevant objects are not output."

[0089] Step 160422: Determine the overlap of sample positioning information for each output with the sample positioning information in the training sample data of N positioning scenarios.

[0090] In this step, the sample user demand information, the aforementioned multi-category, multi-object training sample data, and the fourth prompt word are integrated into the training input. This input is fed into the initial large language model four times, with a different temperature parameter set for each sampling to control the diversity of the model's output. Each model output includes the sample predicted localization information and the sample predicted description instruction. As shown in Figure 3, the results of the four inferences are as follows. First inference (Rollout 1): the predicted description instruction was "cup, bag, shoes"; the predicted localization information consisted of the predicted coordinates of three object categories, and the format met the requirements. Second inference (Rollout 2): the predicted description instruction was "girl wearing a black top, cup, bag, skirt"; the predicted localization information consisted of the predicted coordinates of four object categories, and the format met the requirements. Third inference (Rollout 3): the predicted description instruction was "girl wearing a black top, bag, skirt"; the predicted localization information consisted of the predicted coordinates of three object categories, and the format met the requirements. Fourth inference (Rollout 4): the predicted description instruction was "girl wearing a black top, bag, skirt, sweatshirt, shoes"; the predicted localization information consisted of the predicted coordinates of five object categories, and the format met the requirements. It should be noted that none of the four predictions covered all six categories; for example, "cup" was not identified in Rollout 3 or Rollout 4, which represents a true missed detection during sample training. The categories that were identified were consistent with the sample description instruction, and no irrelevant objects were output.
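The coverage of each rollout can be checked mechanically. The following is a minimal sketch, assuming each predicted description instruction is treated as a set of category names copied from the four rollouts above; the function name coverage_report and the dictionary layout are hypothetical and are not part of the described method.

```python
# Categories requested in the sample description instruction (six in total).
REQUIRED = {"girl wearing a black top", "cup", "bag", "shoes", "skirt", "sweatshirt"}

# Categories named in each rollout's predicted description instruction.
ROLLOUTS = {
    "Rollout1": {"cup", "bag", "shoes"},
    "Rollout2": {"girl wearing a black top", "cup", "bag", "skirt"},
    "Rollout3": {"girl wearing a black top", "bag", "skirt"},
    "Rollout4": {"girl wearing a black top", "bag", "skirt", "sweatshirt", "shoes"},
}


def coverage_report(required, rollouts):
    """Return, per rollout, the missed categories, irrelevant extras, and hit count."""
    report = {}
    for name, predicted in rollouts.items():
        report[name] = {
            "missed": sorted(required - predicted),   # requested but not output
            "extra": sorted(predicted - required),    # output but not requested
            "covered": len(required & predicted),     # requested and output
        }
    return report


report = coverage_report(REQUIRED, ROLLOUTS)
for name, row in report.items():
    print(name, row)
```

Under these assumptions, Rollout 4 covers five of the six categories and misses only "cup", and no rollout outputs a category outside the requested set.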

[0091] Step 160423: Through the text semantic encoder associated with the initial large language model, determine the semantic similarity between the sample prediction description instruction for each time and the sample description instructions in the training sample data of N localization scenarios. The sample prediction description instruction for each time is the instruction describing the sample object corresponding to the sample prediction localization information output each time.

[0092] In this step, the predicted description instructions from the four inferences and the sample description instruction from the training sample are input together into the text semantic encoder to calculate their semantic similarity. This value measures the degree of semantic match between the target categories identified by the model and the sample requirements. It ranges from 0 to 1, with higher values indicating a closer match. Since the four predicted description instructions are all subsets of the sample description instruction and show no semantic deviation, the semantic similarity calculated by the encoder is 1.0 for each of the four inferences (Rollout 1 to Rollout 4). If a predicted description instruction contains a category deviation, such as recognizing "hoodie" as "sweater", the semantic similarity decreases according to the degree of deviation.

[0093] Step 160424: Determine the model reward value for each sample based on the overlap of sample location information and the semantic similarity for each sample.

[0094] In this step, the overlap of the sample localization information is the core, and the basic reward score is calculated by combining it with the semantic similarity. The output format of the predicted information is also checked; in this case, all four inferences conform to the required <obj_ref>category</obj_ref><box>(x1,y1),(x2,y2)</box> format, so each receives the maximum format reward score of 1.0. The format reward has a weight of 0.2. The final model reward value is calculated using the formula reward_combined = (1 - 0.2) × F1 + 0.2 × (format reward), which here reduces to reward_combined = (1 - 0.2) × F1 + 0.2, where F1 combines positioning accuracy and retrieval completeness and is positively correlated with the overlap of the sample localization information. After comprehensive calculation, the model reward values for the four inferences are: Rollout 1: 0.32; Rollout 2: 0.58; Rollout 3: 0.47; Rollout 4: 0.77.
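As a worked check of the arithmetic in this step, the sketch below applies the stated combination with a format weight of 0.2 and a format reward of 1.0, and inverts it to show the F1 value implied by each reported reward. The function names combined_reward and implied_f1 are hypothetical; the implied F1 values are derived from the reported rewards and are not stated in the text.

```python
FORMAT_WEIGHT = 0.2  # weight of the format reward stated in this step


def combined_reward(f1, format_reward=1.0, w=FORMAT_WEIGHT):
    """reward_combined = (1 - w) * F1 + w * format_reward."""
    return (1.0 - w) * f1 + w * format_reward


def implied_f1(reward, format_reward=1.0, w=FORMAT_WEIGHT):
    """Invert the combination to recover the F1 term from a reported reward."""
    return (reward - w * format_reward) / (1.0 - w)


# Reward values reported for the four rollouts.
REPORTED = {"Rollout1": 0.32, "Rollout2": 0.58, "Rollout3": 0.47, "Rollout4": 0.77}

implied = {name: round(implied_f1(r), 4) for name, r in REPORTED.items()}
for name, f1 in implied.items():
    print(name, "implied F1 =", f1, "-> reward =", round(combined_reward(f1), 2))
```

With a perfect format reward, the format term contributes a constant 0.2, so the ranking of the rollouts is determined entirely by the F1 term.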

[0095] Step 160425: Adjust the task parameters corresponding to the screen visual positioning task in the initial large language model according to the task parameter values corresponding to the target round, and obtain the large language model. The target round is the round with the highest model reward value among the M rounds.

[0096] In this step, the model reward values of the four inferences are ranked, and Rollout 4 (the fourth inference) is determined to be the target round, with a reward value of 0.77, the highest among the four rounds. The inference strategy of this round is treated as the optimal strategy. Using the task parameter values corresponding to the target round (Rollout 4) as the basis for optimization, the parameters related to the screen visual localization task in the initial large language model, such as the feature extraction parameters of the visual encoder (ViT), the text-visual feature fusion parameters, and the bounding box regression parameters, are updated as follows: the accurate recognition and localization parameters for "girl, bag, skirt, sweatshirt, shoes" in this round are retained, enhancing the model's ability to detect multiple categories of clothing objects; for the missed detection of "cup" in this round, the feature extraction weights for small items are optimized to reduce missed detections in subsequent training; and the format output parameters of this round are fixed so that subsequent model outputs always conform to the standardized coordinate format. After these parameter adjustments, a large language model that has completed the reinforcement learning training stage is obtained. This model shows improved localization accuracy and retrieval completeness in multi-category, multi-object description scenarios, making it better suited to the practical application needs of screen visual localization.
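The selection of the target round can be sketched as a simple ranking over the per-rollout reward values. This is only an illustration of the selection step; the function name select_target_round is hypothetical, ties are resolved in favor of the earliest rollout as an assumption, and the subsequent parameter update is not modeled here.

```python
def select_target_round(rewards):
    """Return (name, reward) of the rollout with the highest reward.

    Assumption: on a tie, the earliest rollout in insertion order wins.
    """
    best = max(rewards, key=rewards.get)
    return best, rewards[best]


# Reward values of the four rollouts in this example.
rewards = {"Rollout1": 0.32, "Rollout2": 0.58, "Rollout3": 0.47, "Rollout4": 0.77}

ranking = sorted(rewards.items(), key=lambda item: item[1], reverse=True)
target_round, target_reward = select_target_round(rewards)
print(ranking)
print(target_round, target_reward)
```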

[0097] Therefore, by obtaining diverse outputs through multi-round sampling inference, and combining the outputs of the model with the quantitative evaluation of localization overlap and semantic similarity, the optimal round for updating parameters is selected, which accurately guides the optimization of model parameters and improves the accuracy of model localization and semantic consistency.

[0098] In this embodiment, a non-standard format in the predicted information output by the model can make the results unparseable, and existing reward values do not take format into account and therefore can hardly constrain the model to output standardized results. To address this, this application embodiment provides a method for adjusting the model reward value based on format differences. Accordingly, the above step 160424 may specifically include: determining a candidate model reward value for each inference based on the overlap of the sample localization information and the semantic similarity for that inference; and using the reward function to adjust the candidate model reward value based on the difference between the information format of the predicted information output by the initial large language model and the preset format, so as to obtain the model reward value for each inference.

[0099] In this embodiment, the reward function comprehensively measures both positioning accuracy and retrieval completeness. In addition, to ensure instruction following and structured parsability, the correctness of the output format is also rewarded, as detailed below. The reward function for positioning accuracy uses the Hungarian matching algorithm, as given by formulas (1) to (4), to calculate the overlap between the sample localization information and the sample predicted localization information. The matching score takes into account the overlap of the sample localization information (IoU) and, in addition, the semantic similarity of categories calculated with a text semantic encoder. A smooth sigmoid weight function is also used in this step to strengthen high-IoU matches and avoid the discontinuous gradient caused by a hard threshold. The sigmoid has a soft IoU threshold, which can be set to 0.5, and a parameter that controls its smoothness; the resulting weight is applied after the semantic similarity is calculated and increases the contribution of high-IoU matches to the reward.

[0100] Overall, the sample localization overlap (IoU), the semantic similarity, and the smoothing weight are combined into a matching score. Then, through Hungarian matching, the sample predicted localization information and the sample localization information are matched one to one.

[0101] The matching establishes a one-to-one correspondence between the sample predicted localization information and the sample localization information, and the model reward value for positioning accuracy is then computed from the matched pairs.
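The matching described in this passage can be sketched as follows. This is a minimal illustration under stated assumptions, not the formulas of the application: the matching score is assumed to be the product of IoU, semantic similarity, and a sigmoid weight; the steepness k=10 is an assumed value; the accuracy reward is assumed to be the total matched score divided by the number of predictions; and an exhaustive search over one-to-one assignments stands in for the Hungarian algorithm (it finds the same optimum but is practical only for a handful of boxes). All function names are hypothetical.

```python
import math
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def smooth_weight(iou_value, tau=0.5, k=10.0):
    """Sigmoid weight centred on the soft threshold tau; k (assumed) sets steepness."""
    return 1.0 / (1.0 + math.exp(-k * (iou_value - tau)))


def match_score(pred, gt, sim):
    """Assumed score: IoU x category similarity x smooth weight."""
    v = iou(pred["box"], gt["box"])
    return v * sim(pred["label"], gt["label"]) * smooth_weight(v)


def best_assignment(preds, gts, sim):
    """One-to-one assignment maximising the total score, as (pred_idx, gt_idx) pairs."""
    if not preds or not gts:
        return [], 0.0
    scores = [[match_score(p, g, sim) for g in gts] for p in preds]
    swap = len(preds) > len(gts)  # always enumerate placements of the smaller side
    small, large = (len(gts), len(preds)) if swap else (len(preds), len(gts))
    best_pairs, best_total = [], -1.0
    for perm in permutations(range(large), small):
        pairs = [(l, s) if swap else (s, l) for s, l in enumerate(perm)]
        total = sum(scores[i][j] for i, j in pairs)
        if total > best_total:
            best_pairs, best_total = pairs, total
    return best_pairs, best_total


def accuracy_reward(preds, gts, sim):
    """Assumed definition: total matched score averaged over the predictions."""
    _, total = best_assignment(preds, gts, sim)
    return total / len(preds) if preds else 0.0


# Hypothetical boxes; exact-label equality stands in for the text semantic encoder.
exact_sim = lambda a, b: 1.0 if a == b else 0.0
preds = [{"label": "bag", "box": (100, 100, 200, 200)},
         {"label": "shoes", "box": (300, 300, 400, 400)}]
gts = [{"label": "shoes", "box": (300, 300, 400, 400)},
       {"label": "bag", "box": (110, 100, 210, 200)},
       {"label": "cup", "box": (0, 0, 50, 50)}]
pairs, total = best_assignment(preds, gts, exact_sim)
reward = accuracy_reward(preds, gts, exact_sim)
print(sorted(pairs), round(reward, 3))
```

Because the sigmoid is continuous around the threshold, a box with an IoU slightly below 0.5 still contributes a reduced score rather than dropping to zero, which is the behavior the passage attributes to the soft threshold.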

[0102] Based on this, the reward function for retrieval completeness measures the model's ability to cover the targets. It also uses Hungarian matching and penalizes low-matching pairs. The calculation is similar to that of the reward function for localization accuracy. The difference is that, in the final calculation, the number of successfully matched ground truths (IoU > 0.5 is considered a successful match) is divided by the total number of ground truths, which mainly penalizes missed detections.
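The final division described above can be sketched as follows, assuming the one-to-one matching has already been performed and the IoU of each matched pair is available. The function name recall_reward and the example IoU values are hypothetical, and the no-object case (zero ground truths) is deliberately left unhandled because the text does not specify it.

```python
def recall_reward(matched_ious, num_ground_truth, threshold=0.5):
    """Fraction of ground truths matched with IoU strictly above the threshold.

    matched_ious holds one IoU per matched (prediction, ground truth) pair;
    ground truths without a matched prediction simply do not appear in it.
    """
    if num_ground_truth <= 0:
        # Assumption: the no-object scenario needs a separate rule.
        raise ValueError("num_ground_truth must be positive")
    hits = sum(1 for value in matched_ious if value > threshold)
    return hits / num_ground_truth


# Hypothetical IoU values: five of six ground truths matched, one missed.
example = recall_reward([0.91, 0.84, 0.77, 0.63, 0.58], num_ground_truth=6)
print(round(example, 4))
```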

[0103] Next, the reward function for the output format checks whether the model's output format meets the requirements. The output format affects parsing, and if parsing fails, the localization result is effectively poor. Therefore, the model needs to output a clear and strictly compliant target format. In this part, a penalty is applied if the output does not match the format. The target format is "<obj_ref>xxx</obj_ref><box>(523,591),(817,829)</box>". If the output is missing </obj_ref>, or the coordinate format is incorrect, for example <box>[523,591,817,829]</box>, a penalty is imposed based on the proportion of characters with incorrect format. Finally, the above three parts are combined by weighting. First, the positioning accuracy reward and the retrieval completeness reward are combined in the manner of an F1 score, as in formula (5). The F1 score is commonly used to evaluate object detection results, so using it to calculate the reward score brings the model closer to the metrics used in human evaluation. Then the format reward is added to the F1 reward score with a weight. According to the experimental results, setting the weight of the format reward to 0.2 gives better overall model performance, as in formula (6). In summary, through continued pre-training (e.g., 10 million image-text pairs), supervised fine-tuning (e.g., 1.06 million instruction data items), and reinforcement learning (e.g., 50,000 high-precision dialogue data items), a multi-dimensional reward mechanism based on precision, recall, and output format is introduced. This allows the model to simultaneously optimize the accuracy, coverage, and parsability of object detection during reinforcement learning training. Compared to traditional methods that rely solely on supervised loss, this reward-based optimization approach more directly guides the model towards consistency with the task evaluation metrics, improving the model's robustness and generalization ability in open scenarios.
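The three-part combination can be sketched as follows. This is an illustration under assumptions: the format reward is taken as one minus the fraction of non-whitespace characters that fall outside well-formed items, so a malformed item counts as wrong in its entirety; the F1 term is the standard harmonic mean of the accuracy and completeness rewards; and an empty output receives a format reward of 0, which does not model the no-object case. The regular expression and function names are hypothetical.

```python
import re

# One well-formed item: <obj_ref>label</obj_ref><box>(x1,y1),(x2,y2)</box>
ITEM = re.compile(
    r"<obj_ref>\s*[^<>]+?\s*</obj_ref>\s*"
    r"<box>\s*\(\d+,\s*\d+\),\s*\(\d+,\s*\d+\)\s*</box>"
)


def format_reward(output):
    """1.0 minus the share of non-whitespace characters outside well-formed items."""
    total = len(re.sub(r"\s", "", output))
    if total == 0:
        return 0.0  # assumption: empty output earns no format reward
    leftover = ITEM.sub("", output)  # whatever is not part of a valid item
    bad = len(re.sub(r"\s", "", leftover))
    return 1.0 - bad / total


def f1_score(accuracy, completeness):
    """Harmonic mean of the accuracy reward and the completeness reward."""
    if accuracy + completeness == 0:
        return 0.0
    return 2 * accuracy * completeness / (accuracy + completeness)


def combined_reward(accuracy, completeness, output, w=0.2):
    """(1 - w) * F1 + w * format reward, with w = 0.2 as stated in the text."""
    return (1 - w) * f1_score(accuracy, completeness) + w * format_reward(output)


good = "<obj_ref>xxx</obj_ref><box>(523,591),(817,829)</box>"
bad = "<obj_ref>xxx</obj_ref><box>[523,591,817,829]</box>"
print(format_reward(good), format_reward(bad), round(format_reward(good + bad), 3))
print(round(combined_reward(0.5, 1.0, good), 4))
```

A character-level diff against the nearest valid string would give a finer-grained penalty than this item-level sketch, at the cost of a more involved implementation.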

[0104] Therefore, the standardization of the output format is incorporated into the reward value evaluation system, which constrains the model output to conform to the preset format, ensures the parsability of the positioning information, and avoids the problem of positioning results becoming invalid due to format errors.

[0105] Based on this, with respect to step 110, the object recognition method provided in this application embodiment can be applied to scenarios where a target is identified and product navigation is completed. Specifically, as shown in Figure 4, when a user is browsing outfit pictures and text on the first interface 401 of a lifestyle sharing application, the user sees a white canvas bag in the picture and wants to find the same bag and jump to a page to purchase it. At this time, the user long-presses the screen with two fingers. This operation is the first input, which triggers the screen visual positioning task. After receiving this input, the electronic device activates the screen visual positioning function; that is, in response to the first input of a two-finger long press, it obtains the user demand information input by voice and/or text: "Find the canvas shoes + sweatshirt + long skirt in the picture, and hide the text and icons."

[0106] In some embodiments of this application, the object recognition method provided in this application can be applied to scenarios where a target is identified and the user is redirected to a product. Specifically, the electronic device displays a first image, such as outfit pictures and text in a lifestyle sharing application; a user browsing this content sees a white canvas bag in the image and wants to find the same bag and purchase it. At this time, the user long-presses the screen with two fingers, which is the first input. The electronic device responds to the first input and obtains a preset first prompt word. The first prompt word is used to guide the large language model to determine the coordinate information of each object in the outfit pictures and text. The prompt word can be set to "identify all objects in the image, ignore text areas, application icons, floating icons and other non-subject elements, and output the coordinate information of each object". Then, the outfit pictures and text and the first prompt word are input together into the large language model on the device side. Guided by the first prompt word, the large language model comprehensively identifies each object in the first image and, after filtering out irrelevant information such as title text, the blogger's avatar, and application function icons, outputs the coordinate information of all objects in the first image, such as the coordinates of the white canvas bag, <box>(326,458),(412,596)</box>, and the coordinates of other clothing items, ornaments, and other objects in the image; these coordinates constitute the first positioning information. The electronic device then directly displays the coordinate boxes corresponding to the first positioning information within the outfit image. Each coordinate box defines the area where the corresponding object in the first image is located, allowing the user to see all positioned objects at a glance. The coordinate box defining the white canvas bag is a first-type coordinate box; clicking on this first-type coordinate box redirects the user to the product shopping interface for the white canvas bag.

[0107] In some other embodiments of this application, user demand information is in natural language form and cannot be directly used as model input, and a standardized method for constructing prompt words is lacking. To solve these problems, this application provides a method for converting user demand into a standard prompt word. Accordingly, before step 120, the input content of the first input may also include user demand information, and the first prompt word can be determined through the following steps 1501 and 1502.

[0108] Step 1501: Extract information for describing the object from the user requirement information to obtain the user description instruction corresponding to the user requirement information.

[0109] In this step, the actual interaction scenario of a user browsing outfit pictures and text in a lifestyle sharing application is taken as the background, with the user's need being "help me find the same bag". Based on the reasoning logic of a large language model, the entire process of extracting a description instruction from the user's need and generating the first prompt word by filling in the prompt word template is explained, covering core semantics such as the region masking mechanism, prompt control, and structured output. When the user browses a screenshot of the outfit pictures and text in the lifestyle sharing application, a screen visual positioning task is triggered, and the user inputs the demand information: "help me find the same bag". The electronic device performs semantic parsing on this natural language user demand information and extracts the core information describing the object: it focuses on the core demand of "same bag", filters out auxiliary semantics such as "help me find" that carry no object-descriptive attributes, and finally obtains the user description instruction "bag", which corresponds precisely to the user demand information. It should be noted that if the user's request is more complex, such as "What brand of sunglasses is the person in this picture wearing?", the electronic device extracts the core descriptive information "the sunglasses the person is wearing" as the user description instruction. If the request is "Find the small red handbag", then "the small red handbag" is extracted as the user description instruction, ensuring that the instruction retains only descriptive content related to the target object.

[0110] Step 1502: Fill the prompt word template with the user description command to obtain the first prompt word. The prompt word template is a preset prompt word template for the large language model to perform the screen visual positioning task.

[0111] In this step, the electronic device has a built-in prompt word template for the large language model to perform screen visual localization tasks. The template content is as follows: "Please perform visual understanding and target localization tasks on the image, focusing only on the main objects relevant to the user's needs {global object detection or referential detection}, ignoring the following areas: blurry or low-resolution areas in the background, text areas, floating icons, function buttons, application icons, and other non-subject elements. Based on a comprehensive understanding of visual semantics and user instructions, locate the target object most relevant to the description and output the results in the following format: <obj_ref>Category Name</obj_ref><box>(x1,y1),(x2,y2)</box>".

[0112] Based on this, the system dynamically fills the position of "{global object detection or referential detection}" in the prompt word template with the user description instruction "bag" obtained in step 1501 (adapting to the user's current "find the same bag" operation scenario and semantic input), and finally generates the first prompt word, the content of which is as follows: "Please perform visual understanding and target localization tasks on the image. Focus only on the main object relevant to the user's needs, the bag, ignoring the following areas: blurry or low-resolution areas in the background, text areas, floating icons, function buttons, application icons, and other non-subject elements. Based on a comprehensive understanding of visual semantics and user instructions, locate the target object most relevant to the description and output the results in the following format: <obj_ref>Category Name</obj_ref><box>(x1,y1),(x2,y2)</box>".
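Steps 1501 and 1502 end in a template-filling operation that can be sketched as follows. This is a minimal illustration, assuming the user description instruction has already been extracted; the template wording is lightly adapted from the template quoted above, the slot is renamed {target} for use with Python string formatting, and the fallback wording for global object detection is an assumption. The names PROMPT_TEMPLATE, DEFAULT_TARGET, and build_first_prompt are hypothetical.

```python
PROMPT_TEMPLATE = (
    "Please perform visual understanding and target localization tasks on the image, "
    "focusing only on the main objects relevant to the user's needs: {target}. "
    "Ignore the following areas: blurry or low-resolution areas in the background, "
    "text areas, floating icons, function buttons, application icons, and other "
    "non-subject elements. Based on a comprehensive understanding of visual semantics "
    "and user instructions, locate the target object most relevant to the description "
    "and output the results in the following format: "
    "<obj_ref>Category Name</obj_ref><box>(x1,y1),(x2,y2)</box>"
)

# Assumed wording used when no description is supplied (global object detection).
DEFAULT_TARGET = "all objects in the image"


def build_first_prompt(user_description_instruction=""):
    """Fill the template slot with the description instruction (referential
    detection) or with the assumed default (global object detection)."""
    target = user_description_instruction.strip() or DEFAULT_TARGET
    return PROMPT_TEMPLATE.format(target=target)


bag_prompt = build_first_prompt("bag")
sunglasses_prompt = build_first_prompt("the sunglasses the person is wearing")
print(bag_prompt)
```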

[0113] This enables structured parsing of users' natural language needs and standardized generation of prompt words, ensuring that the model can accurately understand users' location needs and improving the effectiveness of prompt words in constraining the model.

[0114] It should be noted that this first prompt word relies on the visual reasoning and instruction-following capabilities of the large language model, and can achieve region masking during the inference stage without the need for additional masking models or post-processing modules. When the model performs inference, it carries out semantic-level visual filtering based on this prompt, automatically reducing the attention weight of non-target regions such as text, icons, and blurred backgrounds in the first image and focusing only on the core target, the bag. The final output is a structured and parsable localization result (<obj_ref>Bag</obj_ref><box>(x1,y1),(x2,y2)</box>). This provides a unified data interface for subsequent retrieval of similar bag information, shopping redirection, and voice response, improving the accuracy and semantic consistency of screen visual positioning.
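Because the localization result follows a fixed tag format, downstream modules can parse it with a regular expression. The following is a minimal sketch, assuming integer pixel coordinates and tolerating optional whitespace inside the tags; the names Detection, PATTERN, and parse_detections are hypothetical, and the example coordinates are those quoted for the white canvas bag.

```python
import re
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    label: str
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2)


PATTERN = re.compile(
    r"<obj_ref>\s*(?P<label>[^<>]+?)\s*</obj_ref>\s*"
    r"<box>\s*\((?P<x1>\d+),\s*(?P<y1>\d+)\),\s*"
    r"\((?P<x2>\d+),\s*(?P<y2>\d+)\)\s*</box>"
)


def parse_detections(text: str) -> List[Detection]:
    """Extract every well-formed (label, box) item; malformed items are skipped."""
    results = []
    for match in PATTERN.finditer(text):
        box = tuple(int(match.group(key)) for key in ("x1", "y1", "x2", "y2"))
        results.append(Detection(match.group("label"), box))
    return results


output = "<obj_ref> Bag</obj_ref> <box> (326,458),(412,596)</box>"
detections = parse_detections(output)
print(detections)
```

An output that contains no well-formed item, such as the "nothing" response for object-free scenes, yields an empty list under this sketch.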

[0115] If the user's request is "What brand of sunglasses is the person in this picture wearing?", the user description instruction extracted in step 1501 is "the sunglasses the person is wearing", and the first prompt word generated after filling in the template in step 1502 is: "Please perform visual understanding and target localization tasks on the image. Focus only on the main object relevant to the user's needs: the sunglasses the person is wearing. Ignore the following areas: blurry or low-resolution areas in the background, text areas, floating icons, function buttons, application icons, and other non-subject elements. Based on a comprehensive understanding of visual semantics and user instructions, locate the target object most relevant to the description and output the results in the following format: <obj_ref>Category Name</obj_ref><box>(x1,y1),(x2,y2)</box>".

[0116] As can be seen, this first prompt word can also guide the 3B model to ignore irrelevant elements on the screen and accurately locate the sunglasses target, verifying the universality of the prompt word template in adapting to different user needs.

[0117] With respect to step 130, as shown in Figure 5, the electronic device can display a first interface 601 including the outfit pictures and text on the current screen and, at the same time, display a coordinate frame at the coordinate position corresponding to the first positioning information. The coordinate frames accurately define the area 602 where the hoodie is located, the area 603 where the bag is located, the area 604 where the long skirt is located, and the area 605 where the canvas shoes are located in the first image, while blocking text and icons.

[0118] In addition, the coordinate frame in the embodiments of this application can be at least one of the following types: a first type of coordinate frame and a second type of coordinate frame.

[0119] In some embodiments of this application, considering that screen visual positioning alone cannot meet the user's need to obtain in-depth information about the located target or to navigate to an application, this application provides an interactive navigation design for the positioning results to enrich the application value of screen visual positioning. Accordingly, the first image is displayed on the first interface, and the coordinate frame includes a first type of coordinate frame. After step 130, the object recognition method may also include steps 1401 and 1402.

[0120] Step 1401: Receive the user's second input for the first type of coordinate frame.

[0121] In this step, as shown in Figure 6, the user can click on the first type of coordinate box 601, whereupon the electronic device receives the second input.

[0122] Step 1402: In response to the second input, switch from the first interface to the second interface of the first application. The second interface includes object information corresponding to the first object defined by the first type of coordinate frame.

[0123] In this step, in response to the second input, the electronic device switches from the current interface 601 to a second interface of a shopping application, which includes shopping information for the shoes.

[0124] This enables a quick jump from the visual positioning interface to the corresponding application information interface, allowing direct access to relevant information about the target object and improving user interaction experience and operational efficiency.
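The handling of a click on a coordinate frame can be sketched as a hit test followed by a dispatch on the frame type. This is an illustration only: the names CoordinateBox, hit_test, and dispatch are hypothetical, the returned action strings are placeholders rather than real interface calls, and preferring the smallest frame when frames overlap is an assumption that the text does not state.

```python
from typing import List, NamedTuple, Optional, Tuple


class CoordinateBox(NamedTuple):
    label: str
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in screen pixels
    kind: str                       # "first" navigates, "second" opens Q&A


def hit_test(boxes: List[CoordinateBox], x: int, y: int) -> Optional[CoordinateBox]:
    """Return the smallest frame containing the tap point, or None."""
    hits = [b for b in boxes
            if b.box[0] <= x <= b.box[2] and b.box[1] <= y <= b.box[3]]
    if not hits:
        return None
    # Assumption: when frames overlap, the most specific (smallest) one wins.
    return min(hits, key=lambda b: (b.box[2] - b.box[0]) * (b.box[3] - b.box[1]))


def dispatch(box: Optional[CoordinateBox]) -> str:
    """Map the tapped frame to a placeholder action name."""
    if box is None:
        return "ignore"
    return "open_second_interface" if box.kind == "first" else "show_interactive_area"


# The bag coordinates are those quoted in the text; the second frame is hypothetical.
boxes = [
    CoordinateBox("white canvas bag", (326, 458, 412, 596), "first"),
    CoordinateBox("outfit", (100, 200, 600, 900), "second"),
]
print(dispatch(hit_test(boxes, 350, 500)))
print(dispatch(hit_test(boxes, 150, 250)))
print(dispatch(hit_test(boxes, 10, 10)))
```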

[0125] In some other embodiments of this application, considering that users may have a need to ask natural language questions about the located object and obtain personalized responses after completing the target location, this application provides an interactive question-and-answer function based on the location result. Based on this, the first image is displayed on the first interface, the coordinate frame includes a second type of coordinate frame, the first interface includes an interactive area, and after step 130, the object recognition method may further include steps 1403 to 1406.

[0126] Step 1403: Receive the user's third input for the second type of coordinate frame.

[0127] In this step, as shown in Figure 6, the user can click on the second type of coordinate box 601, which means the electronic device receives the third input.

[0128] Step 1404: In response to the third input, candidate query information of the first object defined by the second type of coordinate frame is displayed in the interactive area of the first interface.

[0129] In this step, as shown in Figure 7, an interactive area 606 can be displayed, in which a prompt message can be displayed that encourages the user to enter a question in the interactive area. Alternatively, as shown in Figure 8, an interactive area 606 can be displayed, in which candidate query information related to the first object can be displayed, such as "What brand is this shoe?".

[0130] Step 1405: Receive the user's fourth input to the interactive area.

[0131] In this step, as shown in Figure 9, if the user clicks the interactive control 901 in the interactive area 606, the electronic device receives the user's fourth input to the interactive area.

[0132] Step 1406: In response to the fourth input, input the query information, the first image, the first location information, and the second prompt word into the large language model to obtain the response information corresponding to the query information output by the large language model.

[0133] In this step, the query information, the first image, the first location information, and the second prompt word are input into the large language model to obtain the response information "This shoe is Nike" output by the large language model, which corresponds to the query information. The response information "This shoe is Nike" is then displayed in the interaction area 606.

[0134] The query information includes information entered by the user in the interaction area or candidate query information. The second prompt word is used to guide the large language model to determine the first object in the first image based on the first positioning information, and to determine the response information corresponding to the first object based on the query information.

[0135] This allows users to initiate targeted inquiries about the location target, and the model combines the location information with the first image to generate accurate responses, achieving integrated interaction between location and question answering, and improving the intelligence of screen content understanding.
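The inputs combined in step 1406 can be sketched as a single request structure. This is only an illustration: the wording of `SECOND_PROMPT_WORD`, the field names, and the function `build_qa_request` are assumptions, and the application does not specify how the request to the large language model is encoded.

```python
import json
from typing import Dict, List

# Hypothetical wording of the second prompt word.
SECOND_PROMPT_WORD = (
    "Locate the object inside the given box in the image, "
    "then answer the question about that object."
)


def build_qa_request(query: str, image_ref: str, box: List[int]) -> Dict[str, object]:
    """Bundle the query, image, positioning information, and prompt word."""
    if len(box) != 4:
        raise ValueError("box must be [x1, y1, x2, y2]")
    x1, y1, x2, y2 = box
    if x1 >= x2 or y1 >= y2:
        raise ValueError("box must have positive width and height")
    return {
        "prompt": SECOND_PROMPT_WORD,
        "image": image_ref,
        "location": {"x1": x1, "y1": y1, "x2": x2, "y2": y2},
        "query": query.strip(),
    }


request = build_qa_request(
    "What brand is this shoe? ", "first_image.png", [40, 300, 260, 520]
)
print(json.dumps(request, indent=2))
```

The query here may be either text typed by the user or a piece of candidate query information; both are handled the same way in this sketch.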

[0136] The object recognition method provided in this application can be executed by an object recognition device. This application takes an object recognition device performing the object recognition method as an example to describe the object recognition device provided in this application.

[0137] This application also provides an object recognition device, which is described in detail below with reference to Figure 10.

[0138] Figure 10 is a schematic diagram of the structure of an object recognition device provided by some embodiments of this application.

[0139] As shown in Figure 10, the object recognition device 100 can be applied to electronic devices and may specifically include: a receiving module 1001, used to receive the user's first input when the first image is displayed; an input module 1002, used to, in response to the first input, input the first image and the first prompt word into the large language model to obtain the first positioning information output by the large language model, where the first positioning information is the coordinate information of each object in the first image, and the first prompt word is used to guide the large language model to determine the coordinate information of each object in the first image; and a display module 1003, used to display a coordinate frame corresponding to the first positioning information in the first image.
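The hand-off from the input module 1002 to the display module 1003 can be sketched as parsing the model output into coordinate frames that are safe to draw. The JSON layout (a list of items with `label` and `box` fields) and the function `parse_positioning_info` are assumptions made for illustration; the application does not define the output format of the large language model.

```python
import json
from typing import List, Tuple

# (label, x1, y1, x2, y2) in pixel coordinates of the first image
Box = Tuple[str, int, int, int, int]


def parse_positioning_info(model_output: str, width: int, height: int) -> List[Box]:
    """Turn the model's (assumed) JSON output into drawable coordinate frames."""
    frames: List[Box] = []
    for item in json.loads(model_output):
        x1, y1, x2, y2 = (int(v) for v in item["box"])
        # Clamp coordinates to the image bounds before display.
        x1, x2 = max(0, min(x1, width)), max(0, min(x2, width))
        y1, y2 = max(0, min(y1, height)), max(0, min(y2, height))
        # Drop boxes that have no area after clamping.
        if x2 > x1 and y2 > y1:
            frames.append((str(item["label"]), x1, y1, x2, y2))
    return frames


sample_output = (
    '[{"label": "shoes", "box": [40, 300, 260, 520]},'
    ' {"label": "bag", "box": [380, 100, 450, 90]}]'
)
frames = parse_positioning_info(sample_output, width=400, height=800)
print(frames)
```

The second item in the sample is deliberately malformed (its bottom edge lies above its top edge), so only the first box survives.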

[0140] The object recognition device 100 in the embodiments of this application will be described in detail below.

[0141] In some embodiments of this application, the receiving module 1001 may also be used to receive a second input from the user to the first type of coordinate frame when the coordinate frame includes a first type of coordinate frame and the first image is displayed on the first interface. The display module 1003 can also be used to switch the display from the first interface to a second interface of the first application in response to a second input. The second interface includes object information corresponding to the first object defined by the first type of coordinate frame.

[0142] In some embodiments of this application, the receiving module 1001 may also be used to receive a third input from the user to the second type of coordinate frame when the coordinate frame includes a second type of coordinate frame and the first image is displayed on the first interface. The display module 1003 can also be used to display candidate query information of a first object defined by the second type of coordinate frame in the interactive area of the first interface in response to the third input. The receiving module 1001 can also be used to receive a fourth input from the user to the interactive area. The input module 1002 can also be used to, in response to the fourth input, input the query information, the first image, the first positioning information and the second prompt word into the large language model, and obtain the response information output by the large language model corresponding to the query information. The query information includes information entered by the user in the interactive area or candidate query information. The second prompt word is used to guide the large language model to determine the first object in the first image based on the first positioning information, and to determine the response information corresponding to the first object based on the query information.

[0143] In some embodiments of this application, the object recognition device 100 may further include an acquisition module for acquiring sample images of a screen visual positioning task; the screen visual positioning task corresponds to N positioning scenarios, and different positioning scenarios correspond to different annotation requirement information. The annotation requirement information is used to constrain the corresponding quantitative relationship between the sample description instruction and the sample object in the sample image. The sample description instruction is an instruction used to describe the sample object, and N is an integer greater than or equal to 2. The object recognition device 100 may further include a determination module, which is used to determine sample data for each positioning scenario based on the sample image and the annotation requirement information corresponding to each positioning scenario. The sample data is used to adjust the initial large language model, and the adjusted initial large language model is used to recognize the object in the first image based on the first prompt word. The sample data for each positioning scenario includes a first sample image in the sample images and annotation information corresponding to the positioning scenario. The annotation information includes sample description instructions and sample positioning information. The sample positioning information is used to characterize the position of the sample object in the first sample image. The sample object includes at least the interference object in the screen scene and the reference positioning object.

[0144] In some embodiments of this application, the N positioning scenarios include a single-object description scenario, a same-category multi-object description scenario, a multi-category multi-object description scenario, and a no-object description scenario. The single-object description scenario corresponds to the first annotation requirement information, which constrains one sample description instruction to correspond to one sample object in the sample image. The same-category multi-object description scenario corresponds to the second annotation requirement information, which constrains one sample description instruction to correspond to at least two sample objects of the same category in the sample image. The multi-category multi-object description scenario corresponds to the third annotation requirement information, which constrains at least two sample description instructions to correspond to at least two sample objects of different categories in the sample image. The no-object description scenario corresponds to the fourth annotation requirement information, which constrains at least one sample description instruction to have no corresponding sample object in the sample image.
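The four annotation constraints above can be expressed as a small check that assigns an annotation to a positioning scenario. This is a sketch under stated assumptions: the annotation is represented as a mapping from each sample description instruction to the categories of the objects it matches, the scenario names are illustrative, and the order in which the constraints are tested is a design choice not taken from the application.

```python
from typing import Dict, List


def classify_scenario(annotation: Dict[str, List[str]]) -> str:
    """Map an annotation to one of the four positioning scenarios.

    `annotation` maps each description instruction to the categories of the
    objects it matches, one list entry per matched object.
    """
    if not annotation:
        raise ValueError("annotation must contain at least one instruction")
    # Fourth constraint: at least one instruction matches no object.
    if any(len(objects) == 0 for objects in annotation.values()):
        return "no_object"
    if len(annotation) == 1:
        (objects,) = annotation.values()
        # First constraint: one instruction, one object.
        if len(objects) == 1:
            return "single_object"
        # Second constraint: one instruction, several objects of one category.
        if len(set(objects)) == 1:
            return "same_category_multi_object"
        raise ValueError("one instruction matches objects of several categories")
    # Third constraint: several instructions covering different categories.
    categories = {c for objects in annotation.values() for c in objects}
    if len(categories) >= 2:
        return "multi_category_multi_object"
    raise ValueError("annotation does not match any of the four scenarios")


print(classify_scenario({"the red shoe": ["shoe"]}))
print(classify_scenario({"all shoes": ["shoe", "shoe"]}))
print(classify_scenario({"the shoe": ["shoe"], "the bag": ["bag"]}))
print(classify_scenario({"the hat": []}))
```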

[0145] In some embodiments of this application, the object recognition device 100 may further include an adjustment module, which is used to adjust the number of sample data for each positioning scenario according to a preset data matching information to obtain training sample data for N positioning scenarios. The data matching information is used to control the proportion of sample data for each positioning scenario in the total sample data. By using training sample data, the task parameters corresponding to the screen visual localization task in the initial large language model are adjusted to obtain the adjusted large language model.
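One possible way to apply the data matching information is to downsample each positioning scenario so that the scenario counts follow a preset ratio. The sketch below assumes the ratio is given as integer weights and that downsampling (rather than duplication) is acceptable; the function name `rebalance`, the 4:3:2:1 weights, and the sample counts are hypothetical.

```python
import random
from typing import Dict, List


def rebalance(
    samples: Dict[str, List[str]], weights: Dict[str, int], seed: int = 0
) -> Dict[str, List[str]]:
    """Downsample each scenario so the counts follow the integer weights."""
    if samples.keys() != weights.keys():
        raise ValueError("samples and weights must cover the same scenarios")
    if any(w <= 0 for w in weights.values()):
        raise ValueError("weights must be positive integers")
    # The scarcest scenario (relative to its weight) limits the total size.
    units = min(len(samples[name]) // weights[name] for name in weights)
    rng = random.Random(seed)
    return {
        name: rng.sample(samples[name], units * weights[name]) for name in weights
    }


samples = {
    "single_object": [f"s{i}" for i in range(1000)],
    "same_category_multi_object": [f"m{i}" for i in range(300)],
    "multi_category_multi_object": [f"c{i}" for i in range(500)],
    "no_object": [f"n{i}" for i in range(100)],
}
weights = {
    "single_object": 4,
    "same_category_multi_object": 3,
    "multi_category_multi_object": 2,
    "no_object": 1,
}
balanced = rebalance(samples, weights)
print({name: len(items) for name, items in balanced.items()})
```

With these inputs the two scarcest scenarios allow 100 units each, so the balanced set holds 400, 300, 200, and 100 samples respectively.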

[0146] In some embodiments of this application, the object recognition device 100 may further include a cropping module, which is used to, when the N positioning scenarios include a single-object description scenario and the single-object description scenario corresponds to the first annotation requirement information, crop from a first sample image the sample sub-image located by the first sample positioning information in the first annotation requirement information. The input module 1002 is used to input the sample sub-image and the third prompt word into the multimodal language model to obtain the sample expansion description instruction output by the multimodal language model. The third prompt word is used to instruct the multimodal language model to generate the sample expansion description instruction based on the features of the sample object in the sample sub-image. The sample expansion description instruction contains more feature words for the sample object than the first sample description instruction in the first annotation requirement information. The object recognition device 100 may also include a replacement module, used to replace the first sample description instruction in the first annotation requirement information with the sample expansion description instruction to obtain the second annotation requirement information. The object recognition device 100 may further include a determination module for determining sample data of the single-object description scenario based on the second annotation requirement information and the first sample image in the sample images.
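The crop-then-expand flow can be sketched as follows. The image is modeled as a nested list of pixel values to keep the example dependency-free, the multimodal language model is replaced by a stub callable, and the names `crop`, `expand_annotation`, and `THIRD_PROMPT_WORD` are hypothetical. Comparing word counts is only a rough stand-in for "more feature words".

```python
from typing import Callable, Dict, List, Tuple

Image = List[List[int]]  # rows of pixel values, indexed as image[y][x]

# Hypothetical wording of the third prompt word.
THIRD_PROMPT_WORD = "Describe the object in this image with its distinguishing features."


def crop(image: Image, box: Tuple[int, int, int, int]) -> Image:
    """Return the sub-image inside box = (x1, y1, x2, y2), end-exclusive."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def expand_annotation(
    annotation: Dict[str, object],
    image: Image,
    describe: Callable[[Image, str], str],
) -> Dict[str, object]:
    """Replace the description instruction with a richer one from `describe`."""
    sub_image = crop(image, annotation["box"])
    expanded = describe(sub_image, THIRD_PROMPT_WORD)
    # Keep the original instruction if the expansion adds no words.
    if len(expanded.split()) <= len(str(annotation["instruction"]).split()):
        return dict(annotation)
    return {**annotation, "instruction": expanded}


def stub_describe(sub_image: Image, prompt: str) -> str:
    """Stand-in for the multimodal language model call."""
    return "white low-top running shoe with a red logo"


image = [[row * 4 + col for col in range(4)] for row in range(4)]
first_annotation = {"instruction": "the shoe", "box": (1, 1, 3, 3)}
second_annotation = expand_annotation(first_annotation, image, stub_describe)
print(crop(image, (1, 1, 3, 3)))
print(second_annotation["instruction"])
```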

[0147] In some embodiments of this application, the object recognition device 100 may further include a determination module, which is used to determine the sample user demand information corresponding to each sample description instruction based on the training sample data, wherein the initial large language model is trained based on image-text pair sample data, the image-text pair sample data includes sample images and descriptive text corresponding to the sample images, and the initial large language model is used to perform cross-modal object alignment of images and text. The object recognition device 100 may also include a training module for training an initial large language model based on sample user demand information and training sample data to obtain a large language model.

[0148] The object recognition device in this application embodiment can be an electronic device or a component within an electronic device, such as an integrated circuit or a chip. The electronic device can be a terminal or a device other than a terminal. For example, the electronic device can be a mobile phone, tablet computer, laptop computer, handheld computer, in-vehicle electronic device, mobile internet device (MID), augmented reality (AR) / virtual reality (VR) device, robot, wearable device, ultra-mobile personal computer (UMPC), netbook, or personal digital assistant (PDA), etc. It can also be a server, network attached storage (NAS), personal computer (PC), television (TV), ATM, or self-service machine, etc. This application embodiment does not specifically limit the device.

[0149] The object recognition device in this application embodiment can be a device with an operating system. This operating system can be Android, iOS, or other possible operating systems; this application embodiment does not specifically limit the specific operating system used.

[0150] The object recognition device provided in this application embodiment can implement the various processes of the object recognition method embodiments shown in Figures 1 to 9 and achieve the same technical effect, which will not be described again here to avoid repetition.

[0151] Based on this, the object recognition device provided in this application receives a first input from the user when displaying a first image; in response to the first input, it inputs the first image and a first prompt word into a large language model to obtain first positioning information output by the large language model; the first positioning information is the coordinate information of each object in the first image, and the first prompt word is used to guide the large language model to determine the coordinate information of each object in the first image; a coordinate box corresponding to the first positioning information is displayed in the first image. In this way, by leveraging the constraint effect of the first prompt word, the large language model is guided to comprehensively recognize each object in the first image, and the model's positioning task is made explicit as obtaining the coordinate information of all objects. This effectively distinguishes each object in the first image from irrelevant interfering elements such as icons, text, advertisements, and floating buttons, and prevents the large language model, at the level of the positioning logic itself, from misjudging interfering elements as objects. Furthermore, by using the large language model to accurately determine the coordinate information of each object in the first image based on the first prompt word, the occlusion effect of interfering elements can be overcome and the actual position of each occluded object in the first image can be accurately identified. This solves the problem of objects being unrecognizable due to occlusion by interfering elements. In addition, the coordinate boxes corresponding to the coordinate information of each object are displayed in the first image, which makes the positioning results of the large language model concrete, greatly improves the positioning accuracy of each object in the first image, and achieves accurate recognition and comprehensive positioning of all objects in the first image, thus meeting the user's needs for accurate and comprehensive positioning of multiple objects in the image.

[0152] Optionally, as shown in Figure 11, this application embodiment also provides an electronic device 110, including a processor 1101 and a memory 1102. The memory 1102 stores a program or instructions that can run on the processor 1101. When the program or instructions are executed by the processor 1101, they implement the various steps of the above-described object recognition method embodiment and can achieve the same technical effect. To avoid repetition, they will not be described again here.

[0153] It should be noted that the electronic devices in the embodiments of this application include the aforementioned mobile electronic devices and non-mobile electronic devices.

[0154] Figure 12 is a schematic diagram of the hardware structure of an electronic device provided by some embodiments of this application.

[0155] The electronic device 1200 includes, but is not limited to, the following components: radio frequency unit 1201, network module 1202, audio output unit 1203, input unit 1204, sensor 1205, display unit 1206, user input unit 1207, interface unit 1208, memory 1209, processor 1210, etc.

[0156] Those skilled in the art will understand that the electronic device 1200 may also include a power supply (such as a battery) for supplying power to various components. The power supply may be logically connected to the processor 1210 through a power management system, thereby enabling functions such as managing charging, discharging, and power consumption through the power management system. The electronic device structure shown in Figure 12 does not constitute a limitation on the electronic device. The electronic device may include more or fewer components than shown, or combine certain components, or have different component arrangements, which will not be elaborated here.

[0157] In this embodiment, the user input unit 1207 is used to receive first input from the user when the first image is displayed. The processor 1210 is used to, in response to the first input, input the first image and a first prompt word into a large language model to obtain first location information output by the large language model; the first location information is the coordinate information of each object in the first image, and the first prompt word is used to guide the large language model to determine the coordinate information of each object in the first image. The display unit 1206 is used to display a coordinate frame corresponding to the first location information in the first image.

[0158] Therefore, by leveraging the constraint of the first prompt word, the large language model is guided to comprehensively identify all objects in the first image. The model's positioning task is clearly defined as acquiring the coordinate information of all objects, effectively distinguishing objects in the first image from irrelevant interfering elements such as icons, text, advertisements, and floating buttons. This prevents the large language model, at the level of the positioning logic itself, from misclassifying interfering elements as objects. Furthermore, by accurately determining the coordinate information of each object in the first image based on the first prompt word, the large language model can overcome the occlusion effect of interfering elements and accurately identify the actual position of occluded objects in the first image. This solves the problem of objects being unrecognizable due to occlusion by interfering elements. Moreover, a coordinate box corresponding to the coordinate information of each object is displayed in the first image, which visualizes the positioning results of the large language model and significantly improves the positioning accuracy of each object in the first image. This achieves accurate identification and comprehensive positioning of all objects in the first image, meeting users' needs for accurate and comprehensive positioning of multiple objects in images.

[0159] The electronic device 1200 will be described in detail below.

[0160] In some embodiments of this application, the user input unit 1207 can also be used to receive a second input from the user to the first type of coordinate frame when the coordinate frame includes a first type of coordinate frame and the first image is displayed on the first interface. The display unit 1206 can also be used to switch from the first interface to a second interface of the first application in response to a second input. The second interface includes object information corresponding to the first object defined by the first type of coordinate frame.

[0161] In some embodiments of this application, the user input unit 1207 can also be used to receive a third input from the user to the second type of coordinate frame when the coordinate frame includes a second type of coordinate frame and the first image is displayed on the first interface. The display unit 1206 can also be used to display candidate query information of a first object defined by the second type of coordinate frame in the interactive area of the first interface in response to the third input. The user input unit 1207 can also be used to receive a fourth input from the user to the interactive area. The processor 1210 can also be used to, in response to the fourth input, input query information, the first image, the first positioning information, and a second prompt word into the large language model to obtain response information output by the large language model corresponding to the query information. The query information includes information entered by the user in the interactive area or candidate query information. The second prompt word is used to guide the large language model to determine the first object in the first image based on the first positioning information, and to determine the response information corresponding to the first object based on the query information.

[0162] In some embodiments of this application, processor 1210 is configured to extract information for describing an object from user demand information when the user demand information is information input by the user, and obtain user description instructions corresponding to the user demand information. The user description command is filled into the prompt word template to obtain the first prompt word. The prompt word template is a preset prompt word template for the large language model to perform screen visual positioning tasks.
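Filling a user description instruction into a preset prompt word template might look like the following sketch. The template wording, the list of request prefixes, and the functions `extract_description` and `build_first_prompt` are all assumptions for illustration; the application does not disclose the template text or how the description is extracted, and a real system would likely use a more robust extraction step than prefix stripping.

```python
# Hypothetical preset template for the screen visual positioning task.
PROMPT_TEMPLATE = (
    'Locate every object matching the description "{description}" in the '
    "screen image and return its coordinates as [x1, y1, x2, y2]. "
    "Ignore icons, text, advertisements, and floating buttons."
)

# Hypothetical request phrasings to strip from the user demand information.
REQUEST_PREFIXES = ("please find ", "find ", "where is ", "show me ")


def extract_description(user_demand: str) -> str:
    """Strip request phrasing so only the object description remains."""
    text = user_demand.strip().rstrip("?.!").strip()
    lowered = text.lower()
    for prefix in REQUEST_PREFIXES:
        if lowered.startswith(prefix):
            return text[len(prefix):].strip()
    return text


def build_first_prompt(user_demand: str) -> str:
    """Fill the user description instruction into the prompt template."""
    description = extract_description(user_demand)
    if not description:
        raise ValueError("user demand contains no object description")
    return PROMPT_TEMPLATE.format(description=description)


print(build_first_prompt("Please find the red shoes."))
```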

[0163] In some embodiments of this application, processor 1210 is used to acquire sample images of screen visual positioning tasks; screen visual positioning tasks correspond to N positioning scenarios, different positioning scenarios correspond to different annotation requirements, and annotation requirements are used to constrain the corresponding quantitative relationship between sample description instructions and sample objects in sample images. Sample description instructions are instructions used to describe sample objects, and N is an integer greater than or equal to 2. Based on the sample images and the annotation requirements for each positioning scenario, the sample data for each positioning scenario is determined. The sample data is used to adjust the initial large language model, and the adjusted initial large language model is used to identify objects in the first image based on the first prompt word. The sample data for each positioning scenario includes a first sample image in the sample images and annotation information corresponding to the positioning scenario. The annotation information includes sample description instructions and sample positioning information. The sample positioning information is used to characterize the position of the sample object in the first sample image. The sample object includes at least the interference object in the screen scene and the reference positioning object.

[0164] In some embodiments of this application, the N positioning scenarios include a single-object description scenario, a same-category multi-object description scenario, a multi-category multi-object description scenario, and a no-object description scenario. The single-object description scenario corresponds to the first annotation requirement information, which constrains one sample description instruction to correspond to one sample object in the sample image. The same-category multi-object description scenario corresponds to the second annotation requirement information, which constrains one sample description instruction to correspond to at least two sample objects of the same category in the sample image. The multi-category multi-object description scenario corresponds to the third annotation requirement information, which constrains at least two sample description instructions to correspond to at least two sample objects of different categories in the sample image. The no-object description scenario corresponds to the fourth annotation requirement information, which constrains at least one sample description instruction to have no corresponding sample object in the sample image.

[0165] In some embodiments of this application, the processor 1210 is used to adjust the number of sample data for each positioning scenario according to preset data matching information to obtain training sample data for the N positioning scenarios. The data matching information is used to control the proportion of sample data for each positioning scenario in the total sample data. Using the training sample data, the task parameters corresponding to the screen visual positioning task in the initial large language model are adjusted to obtain the adjusted large language model.

[0166] In some embodiments of this application, the processor 1210 is configured to, when the N positioning scenarios include a single-object description scenario and the single-object description scenario corresponds to the first annotation requirement information, crop from a first sample image the sample sub-image located by the first sample positioning information in the first annotation requirement information. The sample sub-image and the third prompt word are input into the multimodal language model to obtain the sample expansion description instruction output by the multimodal language model. The third prompt word is used to instruct the multimodal language model to generate the sample expansion description instruction based on the features of the sample object in the sample sub-image. The sample expansion description instruction contains more feature words for the sample object than the first sample description instruction in the first annotation requirement information. The first sample description instruction in the first annotation requirement information is replaced with the sample expansion description instruction to obtain the second annotation requirement information. Based on the second annotation requirement information and the first sample image in the sample images, the sample data for the single-object description scenario is determined.

[0167] In some embodiments of this application, the processor 1210 is used to determine the sample user demand information corresponding to each sample description instruction based on the training sample data, wherein the initial large language model is trained based on image-text pair sample data, the image-text pair sample data includes sample images and descriptive text corresponding to the sample images, and the initial large language model is used to perform cross-modal object alignment of images and text. Based on the sample user demand information and training sample data of N positioning scenarios, the initial large language model is trained to obtain the large language model.

[0168] It should be understood that the input unit 1204 may include a graphics processing unit (GPU) 12041 and a microphone 12042. The GPU 12041 processes image information of still images or videos acquired by an image capture device (such as a camera) in video capture mode or image capture mode. The display unit 1206 may include a display panel, which may be configured in the form of a liquid crystal display, an organic light-emitting diode, or the like. The user input unit 1207 includes at least one of a touch panel 12071 and other input devices 12072. The touch panel 12071 is also called a touch screen. The touch panel 12071 may include a touch detection device and a touch controller. Other input devices 12072 may include, but are not limited to, a physical keyboard, function keys (such as volume control buttons, power buttons, etc.), a trackball, a mouse, and a joystick, which will not be described in detail here.

[0169] The memory 1209 can be used to store software programs and various information. The memory 1209 may primarily include a first storage area for storing programs or instructions and a second storage area for storing information. The first storage area may store the operating system and the application programs or instructions required for at least one function (such as sound playback, image playback, etc.). Furthermore, the memory 1209 may include volatile memory or non-volatile memory, or both. The non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDRSDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM), or direct Rambus random access memory (DRRAM). The memory 1209 in this embodiment includes, but is not limited to, these and any other suitable types of memory.

[0170] Processor 1210 may include one or more processing units; optionally, processor 1210 integrates an application processor and a modem processor, wherein the application processor mainly handles operations involving the operating system, user interface, and applications, and the modem processor, such as a baseband processor, mainly handles wireless communication signals. It is understood that the aforementioned modem processor may also not be integrated into processor 1210.

[0171] This application also provides a readable storage medium storing a program or instructions. When the program or instructions are executed by a processor, they implement the various processes of the above-described object recognition method embodiments and achieve the same technical effect. To avoid repetition, they will not be described again here.

[0172] The processor is the processor in the electronic device described in the above embodiments. The readable storage medium includes computer-readable storage media, such as computer read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disk.

[0173] In addition, this application embodiment further provides a chip, which includes a processor and a communication interface. The communication interface and the processor are coupled. The processor is used to run programs or instructions to implement the various processes of the above-described object recognition method embodiments and can achieve the same technical effect. To avoid repetition, it will not be described again here.

[0174] It should be understood that the chip mentioned in the embodiments of this application may also be referred to as a system-level chip, system chip, chip system, or system-on-chip, etc.

[0175] This application provides a computer program product, which is stored in a storage medium and executed by at least one processor to implement the various processes of the object recognition method embodiment described above, and can achieve the same technical effect. To avoid repetition, it will not be described again here.

[0176] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.

[0177] Furthermore, it should be noted that the scope of the methods and apparatus in the embodiments of this application is not limited to performing functions in the order shown or discussed, but may also include performing functions substantially simultaneously or in the reverse order, depending on the functions involved. For example, the described methods may be performed in a different order than described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.

[0178] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a computer software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk) and includes several instructions to cause a terminal (which may be a mobile phone, computer, server, or network device, etc.) to execute the methods of the various embodiments of this application.

[0179] The embodiments of this application have been described above with reference to the accompanying drawings. However, this application is not limited to the specific embodiments described above, which are merely illustrative and not restrictive. Under the guidance of this application, those skilled in the art can devise many other forms without departing from the spirit of this application and the scope of the claims, and all of these forms fall within the protection scope of this application.

Claims

1. An object recognition method, characterized by comprising: receiving a first input from a user while a first image is displayed; in response to the first input, inputting the first image and a first prompt word into a large language model to obtain first positioning information output by the large language model, wherein the first positioning information is coordinate information of each object in the first image, and the first prompt word is used to guide the large language model to determine the coordinate information of each object in the first image; and displaying, in the first image, a coordinate frame corresponding to the first positioning information.
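
The flow of claim 1 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the prompt wording, the JSON output format, the `CoordinateFrame` type, and the `model` callable are all assumptions introduced here, since the source does not specify how the large language model is invoked or how it encodes coordinates.

```python
import json
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical wording; the source does not give the text of the first prompt word.
FIRST_PROMPT = (
    "List every object in the screenshot as JSON: "
    '[{"label": str, "box": [x1, y1, x2, y2]}], in pixel coordinates.'
)


@dataclass
class CoordinateFrame:
    label: str
    x1: int
    y1: int
    x2: int
    y2: int


def parse_positioning_info(raw: str, width: int, height: int) -> List[CoordinateFrame]:
    """Turn the model's text output (assumed JSON) into frames clamped to the image."""
    frames = []
    for item in json.loads(raw):
        x1, y1, x2, y2 = (int(round(v)) for v in item["box"])
        # Normalize corner order, then clamp to the image bounds.
        x1, x2 = sorted((x1, x2))
        y1, y2 = sorted((y1, y2))
        x1, x2 = max(0, x1), min(width, x2)
        y1, y2 = max(0, y1), min(height, y2)
        if x2 > x1 and y2 > y1:  # drop boxes with no area
            frames.append(CoordinateFrame(item["label"], x1, y1, x2, y2))
    return frames


def recognize_objects(image_bytes: bytes, width: int, height: int,
                      model: Callable[[bytes, str], str]) -> List[CoordinateFrame]:
    """Send the image and the prompt to the model; return frames ready to draw."""
    raw = model(image_bytes, FIRST_PROMPT)
    return parse_positioning_info(raw, width, height)
```

The returned frames would then be drawn over the first image by whatever display layer the device uses; that step is left out because it depends on the UI toolkit.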

2. The method according to claim 1, characterized in that the coordinate frame includes a first type of coordinate frame, and the first image is displayed on a first interface; the method further comprises: receiving a second input from the user for the first type of coordinate frame; and in response to the second input, switching the display from the first interface to a second interface of a first application, the second interface including object information corresponding to a first object defined by the first type of coordinate frame.

3. The method according to claim 1 or 2, characterized in that the coordinate frame includes a second type of coordinate frame, and the first image is displayed on a first interface; the method further comprises: receiving a third input from the user for the second type of coordinate frame; in response to the third input, displaying, in an interactive area of the first interface, candidate query information for a first object defined by the second type of coordinate frame; receiving a fourth input from the user in the interactive area; and in response to the fourth input, inputting query information, the first image, the first positioning information, and a second prompt word into the large language model to obtain response information, output by the large language model, corresponding to the query information; wherein the query information includes information entered by the user in the interactive area or the candidate query information, and the second prompt word is used to guide the large language model to determine the first object in the first image based on the first positioning information and to determine the response information corresponding to the first object based on the query information.

4. The method according to claim 1, characterized in that, before receiving the first input from the user, the method further comprises: obtaining sample images for a screen visual positioning task, wherein the screen visual positioning task corresponds to N positioning scenarios, different positioning scenarios correspond to different annotation requirement information, the annotation requirement information is used to constrain the quantitative correspondence between sample description instructions and sample objects in the sample images, the sample description instructions are instructions used to describe the sample objects, and N is an integer greater than or equal to 2; and determining sample data for each positioning scenario based on the sample images and the annotation requirement information corresponding to each positioning scenario, wherein the sample data is used to adjust an initial large language model, and the adjusted large language model is used to identify objects in the first image based on the first prompt word; wherein the sample data for each positioning scenario includes a first sample image among the sample images and annotation information corresponding to the positioning scenario, the annotation information includes a sample description instruction and sample positioning information, the sample positioning information is used to characterize the position of the sample object in the first sample image, and the sample objects include at least interference objects and reference positioning objects in the screen scene.

5. The method according to claim 4, characterized in that the N positioning scenarios include a single-object description scenario, a same-category multi-object description scenario, a multi-category multi-object description scenario, and a no-object description scenario, wherein: the single-object description scenario corresponds to first annotation requirement information, which constrains one sample description instruction to correspond to one sample object in the sample image; the same-category multi-object description scenario corresponds to second annotation requirement information, which constrains one sample description instruction to correspond to at least two sample objects of the same category in the sample image; the multi-category multi-object description scenario corresponds to third annotation requirement information, which constrains at least two sample description instructions to correspond to at least two sample objects of different categories in the sample image; and the no-object description scenario corresponds to fourth annotation requirement information, which constrains at least one sample description instruction to have no corresponding sample object in the sample image.
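
The four annotation requirements of claim 5 can be read as cardinality checks on the mapping from description instructions to sample objects. The sketch below is one possible reading, not the applicant's code: the `Scenario` enum, the `satisfies_requirement` function, and the representation of an annotation as a list of per-instruction category lists are all hypothetical.

```python
from enum import Enum
from typing import Sequence


class Scenario(Enum):
    SINGLE_OBJECT = "single_object"
    SAME_CATEGORY_MULTI = "same_category_multi_object"
    MULTI_CATEGORY_MULTI = "multi_category_multi_object"
    NO_OBJECT = "no_object"


def satisfies_requirement(scenario: Scenario,
                          matches: Sequence[Sequence[str]]) -> bool:
    """Check one annotation against the requirement of its scenario.

    matches[i] lists the categories of the sample objects that
    description instruction i refers to (empty if it refers to none).
    """
    if scenario is Scenario.SINGLE_OBJECT:
        # One instruction, one object.
        return len(matches) == 1 and len(matches[0]) == 1
    if scenario is Scenario.SAME_CATEGORY_MULTI:
        # One instruction, at least two objects, all of one category.
        return (len(matches) == 1 and len(matches[0]) >= 2
                and len(set(matches[0])) == 1)
    if scenario is Scenario.MULTI_CATEGORY_MULTI:
        # At least two instructions, each with an object, and at least
        # two distinct categories overall (an assumed interpretation).
        categories = {c for m in matches for c in m}
        return (len(matches) >= 2 and all(len(m) >= 1 for m in matches)
                and len(categories) >= 2)
    if scenario is Scenario.NO_OBJECT:
        # At least one instruction has no corresponding object.
        return any(len(m) == 0 for m in matches)
    raise ValueError(f"unknown scenario: {scenario}")
```

For example, `satisfies_requirement(Scenario.SAME_CATEGORY_MULTI, [["icon", "icon"]])` accepts an instruction such as "all share icons" that refers to two icons, while the same call with `[["icon", "text"]]` is rejected.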

6. The method according to claim 4, characterized in that the method further comprises: adjusting the amount of sample data for each positioning scenario according to preset data matching information to obtain training sample data for the N positioning scenarios, wherein the data matching information is used to control the proportion of the sample data for each positioning scenario in the total sample data; and adjusting, using the training sample data, task parameters in the initial large language model that correspond to the screen visual positioning task, to obtain the adjusted large language model.
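
The proportion control in claim 6 can be illustrated with a simple downsampling routine. This is a sketch under stated assumptions: the data matching information is modeled as positive integer weights per scenario, and the adjustment is done by random downsampling only. The source states neither the form of the data matching information nor whether samples are dropped, duplicated, or both; the `rebalance` name and the scenario keys are hypothetical.

```python
import random
from typing import Dict, List, TypeVar

T = TypeVar("T")


def rebalance(samples: Dict[str, List[T]], weights: Dict[str, int],
              seed: int = 0) -> Dict[str, List[T]]:
    """Downsample each scenario so the counts follow the integer weights exactly."""
    if set(samples) != set(weights):
        raise ValueError("samples and weights must cover the same scenarios")
    if any(w <= 0 for w in weights.values()):
        raise ValueError("weights must be positive integers")
    # Largest number of whole 'weight units' every scenario can supply.
    units = min(len(samples[s]) // weights[s] for s in samples)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    return {s: rng.sample(items, units * weights[s])
            for s, items in samples.items()}
```

With weights 4:3:2:1, the scenario that is scarcest relative to its weight limits the total, and every other scenario is trimmed to match.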

7. The method according to claim 4, characterized in that the N positioning scenarios include a single-object description scenario, and the single-object description scenario corresponds to first annotation requirement information; the step of determining sample data for each positioning scenario based on the sample images and the annotation requirement information corresponding to each positioning scenario comprises: extracting, based on first sample positioning information in the first annotation requirement information, a sample sub-image located by the first sample positioning information from the first sample image; inputting the sample sub-image and a third prompt word into a multimodal language model to obtain a sample expansion description instruction output by the multimodal language model, wherein the third prompt word is used to instruct the multimodal language model to generate the sample expansion description instruction based on features of the sample object in the sample sub-image, and the sample expansion description instruction contains more feature words for the sample object than a first sample description instruction in the first annotation requirement information; replacing the first sample description instruction in the first annotation requirement information with the sample expansion description instruction to obtain second annotation requirement information; and determining the sample data of the single-object description scenario based on the second annotation requirement information and the first sample image among the sample images.
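
The description-expansion step of claim 7 can be sketched as crop, describe, replace. Everything named below is an assumption for illustration: the image is a plain nested list of pixels, the multimodal language model is a caller-supplied `describe` callable, the prompt text is invented, and "more feature words" is approximated by a crude word count, which the source does not define.

```python
from typing import Any, Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), x2 and y2 exclusive

# Hypothetical wording; the source does not give the text of the third prompt word.
THIRD_PROMPT = "Describe the object in this crop using its visible features."


def crop(pixels: List[List[Any]], box: Box) -> List[List[Any]]:
    """Extract the sample sub-image located by the box."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in pixels[y1:y2]]


def expand_annotation(image: List[List[Any]], annotation: Dict[str, Any],
                      describe: Callable[[List[List[Any]], str], str]) -> Dict[str, Any]:
    """Return a new annotation whose instruction is replaced by a richer one."""
    sub_image = crop(image, annotation["box"])
    expanded = describe(sub_image, THIRD_PROMPT)
    # Word count stands in for 'more feature words' (an assumed proxy).
    if len(expanded.split()) > len(annotation["instruction"].split()):
        return {**annotation, "instruction": expanded}
    return annotation  # keep the original if the model added nothing
```

The function returns a new dictionary rather than mutating its input, so the original annotation remains available for comparison.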

8. The method according to claim 6, characterized in that the initial large language model is trained on image-text pair sample data, which includes sample images and corresponding descriptive text, and the initial large language model is used to perform cross-modal object alignment between images and text; the step of adjusting, using the training sample data, the task parameters in the initial large language model that correspond to the screen visual positioning task to obtain the adjusted large language model comprises: determining, based on the training sample data, sample user demand information corresponding to each sample description instruction; and training the initial large language model based on the sample user demand information and the training sample data to obtain the large language model.

9. An object recognition apparatus, characterized by comprising: a receiving module, used to receive a first input from a user while a first image is displayed; a processing module, used to input, in response to the first input, the first image and a first prompt word into a large language model to obtain first positioning information output by the large language model, wherein the first positioning information is coordinate information of each object in the first image, and the first prompt word is used to guide the large language model to determine the coordinate information of each object in the first image; and a display module, used to display, in the first image, a coordinate frame corresponding to the first positioning information.

10. An electronic device, comprising: a processor, a memory, and a program or instructions stored in the memory and executable on the processor, wherein the program or instructions, when executed by the processor, implement the steps of the object recognition method according to any one of claims 1-8.