Interaction method and apparatus, electronic device, and storage medium

The method displays the image captured by an electronic device, receives voice interaction information, and uses a neural network model for recognition and response. This addresses the insufficiently intelligent human-computer interaction of traditional technologies and enables a richer interaction method.

CN122308664A · Pending · Publication Date: 2026-06-30 · GUANGDONG OPPO MOBILE TELECOMMUNICATIONS CORP LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: GUANGDONG OPPO MOBILE TELECOMMUNICATIONS CORP LTD
Filing Date: 2024-12-31
Publication Date: 2026-06-30

Application Information

Patent Timeline

31 Dec 2024: Application
30 Jun 2026: Publication (CN122308664A)

IPC: G06F3/0481; G06F3/04842; G06V20/60; G06F40/289; G06V10/82; G10L15/26; G10L15/22

AI Tagging

Technology Topics

Computer graphics (images); Computer vision

AI Technical Summary

Technical Problem

In traditional technologies, electronic devices cannot obtain user input commands after recognizing objects in an image, resulting in insufficient intelligence in human-computer interaction.

Method used

The electronic device displays its captured image, receives voice interaction information, and highlights the relevant image information when the captured image contains something that can respond to that information. A neural network model is used for recognition and response.

Benefits of technology

It enables human-computer interaction based on real-time display of captured images, enhancing the richness and intelligence of interactive information. Users can directly obtain information from images through voice interaction.

✦ Generated by Eureka AI based on patent content.

Abstract

This application relates to an interaction method, apparatus, electronic device, and storage medium. The method includes: displaying a captured image from an electronic device; receiving voice interaction information, the voice interaction information being information addressed to the captured image; and, upon determining that the captured image contains image information capable of responding to the voice interaction information, highlighting the image information in the captured image. This method can improve the intelligence of human-computer interaction.

Description

Technical Field

[0001] This application relates to the field of Internet technology, and in particular to an interaction method, apparatus, electronic device and storage medium.

Background Technology

[0002] With the development of artificial intelligence technology, more and more electronic devices can identify and analyze objects in input images and output information about related items in the images to help users select desired items.

[0003] However, traditional technologies still suffer from insufficient intelligence in human-computer interaction.

Summary of the Invention

[0004] This application provides an interaction method, apparatus, electronic device, and storage medium that can improve the intelligence of human-computer interaction.

[0005] In a first aspect, embodiments of this application provide an interaction method, the method comprising:

[0006] displaying a captured image of an electronic device;

[0007] receiving voice interaction information, wherein the voice interaction information is information raised with respect to the captured image; and

[0008] if it is determined that the captured image contains image information that can be used to respond to the voice interaction information, highlighting the image information in the captured image.

[0009] In a second aspect, embodiments of this application provide an interaction apparatus, the apparatus comprising:

[0010] a first display module, configured to display a captured image of an electronic device;

[0011] a receiving module, configured to receive voice interaction information, wherein the voice interaction information is information raised with respect to the captured image; and

[0012] a second display module, configured to highlight image information in the captured image when it is determined that the captured image contains image information that can be used to respond to the voice interaction information.

[0013] In a third aspect, embodiments of this application provide an electronic device, including a memory and a processor, wherein the memory stores a computer program, and when the computer program is executed by the processor, the processor performs the steps of the interaction method described in the first aspect.

[0014] In a fourth aspect, embodiments of this application provide a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the method described in the first aspect.

[0015] In a fifth aspect, embodiments of this application provide a computer program product, including a computer program that, when executed by a processor, implements the steps of the method described in the first aspect.

[0016] With the above interaction method, apparatus, electronic device, and storage medium, displaying the captured image of the electronic device lets users view the captured image intuitively. The real-time display of the captured image compensates for the limitations of text descriptions and enriches the interactive information. Furthermore, users can raise voice interaction information with respect to the captured image, and during real-time display different captured images can give rise to different voice interaction information. When it is determined that the captured image contains image information that can be used to respond to the voice interaction information, that image information is highlighted in the captured image, so that human-computer interaction unfolds on the basis of the captured image displayed in real time. This makes the human-computer interaction between the electronic device and the user more intelligent.

Brief Description of the Drawings

[0017] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0018] Figure 1 is a diagram of the application environment of the interaction method in one embodiment;

[0019] Figure 2 is a flowchart of the interaction method in one embodiment;

[0020] Figure 3 is a schematic diagram of the display interface in one embodiment;

[0021] Figure 4 is a schematic diagram of the display interface in another embodiment;

[0022] Figure 5 is a schematic diagram of the display interface in another embodiment;

[0023] Figure 6 is a schematic diagram of the display interface in another embodiment;

[0024] Figure 7 is a schematic diagram of the display interface in another embodiment;

[0025] Figure 8 is a schematic diagram of the display interface in another embodiment;

[0026] Figure 9 is a schematic diagram of the display interface in another embodiment;

[0027] Figure 10 is a schematic diagram of the display interface in another embodiment;

[0028] Figure 11 is a schematic diagram of the display interface in another embodiment;

[0029] Figure 12 is a flowchart of the interaction method in another embodiment;

[0030] Figure 13 is a schematic diagram of the display interface in another embodiment;

[0031] Figure 14 is a structural block diagram of the interaction apparatus in one embodiment;

[0032] Figure 15 is a schematic diagram of the internal structure of a computer device in one embodiment.

Detailed Description

[0033] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0034] In traditional technologies, electronic devices can identify and analyze objects in an input static image and output information about those objects to the user. For example, if an electronic device is pointed at a static image containing a certain type of flower, the device can identify information about the flower. However, traditional technologies can only perform recognition on a single image, using their own recognition logic: they broadly identify the image information contained in the image and output the recognition results for the user to select from. They cannot obtain user input commands and accurately identify the image based on those commands. As a result, human-computer interaction in traditional technologies is insufficiently intelligent.

[0035] The interaction method based on the captured image provided in the embodiments of this application can be applied in the application environment shown in Figure 1. The electronic device 102 can be, but is not limited to, various personal computers, laptops, smartphones, tablets, IoT devices, and portable wearable devices. IoT devices can include smart TVs, smart in-vehicle devices, etc. Portable wearable devices can include smartwatches, smart bracelets, head-mounted devices, etc.

[0036] In one embodiment, as shown in Figure 2, an interaction method is provided. Taking its application to the electronic device in Figure 1 as an example, the method includes the following steps:

[0037] S201: Display the captured image of the electronic device.

[0038] Optionally, in this embodiment, the electronic device can be triggered by an application, causing it to display the captured image in response to a trigger operation on the application. The application in this embodiment can be a smart assistant application within the electronic device, capable of natural language recognition and reasoning interaction; for example, it could be an XX Assistant. Optionally, the application triggering method can include at least one of voice triggering, gesture triggering, and biometric information triggering. Optionally, the application triggering operation in this embodiment can be a wake-up operation or an application launch operation, etc. Optionally, in response to the application triggering operation, a control command can be sent to the camera of the electronic device, instructing the camera to capture real-time footage, acquire the captured image from the image sensor, and then display the captured image on the display interface of the electronic device.

[0039] Optionally, in this embodiment, in response to a trigger operation on the application, the captured image can be displayed full screen on the display interface of the electronic device, as shown in Figure 3. Alternatively, it can be displayed in half-screen mode, as shown in Figure 4, in which the captured image occupies only a portion of the screen. Optionally, the captured image displayed in real time can be the original image captured by the electronic device's camera, or the original image with a filter applied.

[0040] As an example, the captured images of the electronic device in this embodiment can be real-life scenes, such as real-time captured images of menus, supermarkets, landscapes, bakeries, etc. This embodiment does not limit the type of captured images.

[0041] Optionally, in this embodiment, while displaying the camera's captured image on the display interface, the electronic device can also use an AI model to identify the items in the captured image, generate descriptive information about the items in the captured image, and display the generated descriptive information on the display interface as well. For example, the descriptive information can be displayed floating on the captured image, or it can be displayed on the captured image in the form of bullet comments. Furthermore, the generated descriptive information can also include recommendation information about the captured image, which can be used to recommend items in the captured image and items related to the captured image. For example, if the captured image is a display of bread in a bakery, the generated descriptive information can include the name and calorie information of each type of bread in the captured image. As another example, if the captured image is a scenic spot, the generated descriptive information can include not only introduction and recommendation information for the scenic spot, but also introduction and recommendation information for other scenic spots in the same location.

[0042] S202: Receive voice interaction information, wherein the voice interaction information is information raised with respect to the captured image.

[0043] The voice interaction information refers to information the user raises with respect to the captured image. For example, if the captured image is a menu, the user's voice interaction information could be, "Help me see which dish on the menu is a local specialty," or "Help me find the drink with the lowest alcohol content here." Similarly, if the captured image is a scenic spot, the voice interaction information could be, "Help me identify where this scenery is." And if the captured image is of a bakery, the voice interaction information could be, "Help me find the bread with the lowest sugar content here," and so on. It should be noted that in this embodiment, when the electronic device receives the user's trigger operation on the application, it can send a control command to the microphone, instructing the microphone to pick up sound. This allows the electronic device to obtain the voice interaction information through the microphone. Consequently, the user does not need to trigger any other control before speaking and can input the voice interaction information directly, which makes the operation more concise.

[0044] S203: If it is determined that there is image information in the captured image that can be used to respond to voice interaction information, the image information is highlighted in the captured image.

[0045] Optionally, in this embodiment, both the voice interaction information and the displayed captured image can be input into a neural network model, which analyzes them to determine whether the captured image contains image information that can be used to respond to the voice interaction information. For example, the neural network model can perform word segmentation on the voice interaction information, extract keywords from the segmented result, and identify the items in the captured image. Based on the keywords and the item information, the model determines the user's intent and judges whether the captured image contains image information that can respond to the voice interaction information. If it does, that image information can be highlighted in the captured image. For example, as shown in Figure 5, the image information can be circled. Alternatively, the font color and size of the image information, and of the item information corresponding to the response, can be changed so that the image information stands out.
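The decision described in S203 can be sketched as follows. This is a minimal illustration, not the application's model: it assumes a recognition model has already produced a list of detected items with labels and bounding boxes, and it substitutes whitespace tokenization plus a stop-word list for the word segmentation and keyword extraction the paragraph mentions. The names `DetectedItem`, `extract_keywords`, and `find_responsive_items` are hypothetical. It only covers queries that name an item directly; attribute queries such as "lowest alcohol content" would need the model's reasoning.

```python
from dataclasses import dataclass

# Hypothetical stop-word list standing in for real word segmentation.
STOP_WORDS = {"help", "me", "find", "the", "with", "here", "a", "an",
              "on", "is", "which", "see"}

@dataclass(frozen=True)
class DetectedItem:
    label: str                       # e.g. "classic mojito"
    box: tuple[int, int, int, int]   # (x, y, width, height) in pixels

def extract_keywords(utterance: str) -> set[str]:
    """Lowercase each word, strip punctuation, and drop stop words."""
    words = (w.strip(".,?!\"'").lower() for w in utterance.split())
    return {w for w in words if w and w not in STOP_WORDS}

def find_responsive_items(utterance: str, items: list[DetectedItem]) -> list[DetectedItem]:
    """Return detected items whose label shares a word with the query keywords."""
    keywords = extract_keywords(utterance)
    return [it for it in items if keywords & set(it.label.lower().split())]

items = [DetectedItem("classic mojito", (40, 120, 200, 60)),
         DetectedItem("manhattan", (40, 200, 200, 60))]
hits = find_responsive_items("Which one is the mojito?", items)
boxes_to_highlight = [it.box for it in hits]  # empty list means nothing to highlight
```

An empty result corresponds to the case where the captured image contains nothing that can respond, so no highlight is drawn.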

[0046] Optionally, in this embodiment, the response to the voice interaction information can also be presented as a text description. For example, if the captured image displayed in real time is the menu shown in Figure 5, and the voice interaction information is "Help me find the drink with the lowest alcohol content," as shown in Figure 6, the text description of the response can be, as shown in Figure 5, "The menu shown in the image does not directly indicate the alcohol content of each cocktail. However, generally speaking, a mojito…" Optionally, in this embodiment, the text description of the response can be displayed in a floating manner. Furthermore, the transparency of the floating text description can be increased, and the background behind the text description can be blurred. Additionally, while the text description is being displayed, the text already shown can be overwritten with new text at preset time intervals. Overwriting in this way reduces the screen space occupied by the text description, leaving more of the screen for the captured image. As an optional implementation, while the response is displayed as text on the captured image, it can also be read aloud.
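The overwrite-at-intervals behavior can be illustrated with a small helper that picks which piece of the response is visible at a given moment. This is only a sketch: the application does not say how the text is split or what the interval is, so `chunks`, `interval_s`, and the function name `visible_caption` are assumptions.

```python
from typing import Optional, Sequence

def visible_caption(chunks: Sequence[str], interval_s: float,
                    elapsed_s: float) -> Optional[str]:
    """Return the chunk shown at elapsed_s; each chunk overwrites the previous one.

    The last chunk stays on screen once all earlier chunks have been shown.
    """
    if not chunks or interval_s <= 0 or elapsed_s < 0:
        return None
    index = min(int(elapsed_s // interval_s), len(chunks) - 1)
    return chunks[index]

# Hypothetical response split into three pieces, replaced every 2 seconds.
response = ["first part", "second part", "third part"]
shown_at_start = visible_caption(response, interval_s=2.0, elapsed_s=0.0)
shown_later = visible_caption(response, interval_s=2.0, elapsed_s=2.5)
```

Because only one chunk is returned at a time, the floating text never grows beyond a single chunk, which matches the stated goal of leaving more of the screen for the captured image.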

[0047] In this embodiment, as an optional implementation scenario, when a user makes a voice interaction request regarding the captured image while operating the electronic device, if there is image information in the current image that can be used to respond to the voice interaction request, then the image information used to respond to the voice interaction request can be highlighted in the currently displayed captured image.

[0048] In this embodiment, as another optional implementation scenario, after the user captures a scene with the electronic device, the device can keep the captured images as short-term memory data. If the user then raises voice interaction information about the captured image, and analysis determines that the current captured image contains no image information that can respond to it but a previously captured image does, the electronic device can generate location guidance information based on the location of the queried item in the images captured within the historical time period, and display that guidance on the display interface. The location guidance information guides the user to turn the camera so that the electronic device re-captures the earlier scene. Once the queried item appears in the captured image, the image information that responds to the voice interaction information is highlighted in the captured image.

[0049] Optionally, in this embodiment, the image information used to respond to the voice interaction information may include information about the item queried in the voice interaction information and/or location guidance information for the item. The location guidance information is generated from the item's location in the images captured within a historical time period, provided the queried item appears in those images. As an example, if the electronic device is shooting in a bakery, the captured image displayed in real time may be the bread display shown in Figure 7. While the user is capturing the bread display, the device receives the voice interaction information "Find the bread with the lowest sugar content here" and analyzes it together with the captured image. It determines that the current captured image contains no image information that can respond, but that an image captured before the voice interaction information was received contains a "salt bread." The device then generates location guidance information for the "salt bread" based on where it appeared in the earlier captured image, for example "Look to the right." The image information used to respond can then be as shown in Figure 8, which displays the queried item, "salt bread," together with the location guidance "Look to the right." The user can operate the device according to the displayed information, for example by rotating it so that the camera turns to the right. After the captured image displayed in real time changes, the electronic device can generate further guidance based on the real-time display, such as the prompt "Yes, it's right here" shown in Figure 9. When the device re-captures the earlier scene by following the location guidance and the queried item appears in the current captured image, the queried bread can be highlighted or darkened to emphasize the image information that responds to the voice interaction information. In essence, in this scenario the application watches a segment of the real-time footage captured by the electronic device while following the voice interaction, and stores it as short-term memory. If the short-term memory contains the queried item, location guidance for that item is displayed on the screen, guiding the user to operate the device and find the item.
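The short-term memory and location guidance described in this scenario can be sketched as below. The application does not specify how an item's past location is represented, so this sketch assumes each remembered observation carries the camera heading (yaw, in degrees) at which the item was seen, for instance from a gyroscope; it also assumes yaw increases to the right and ignores wrap-around at 360 degrees. `Observation`, `ShortTermMemory`, and `guidance` are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class Observation:
    label: str       # item recognized in a past frame
    yaw_deg: float   # assumed camera heading when the item was seen

class ShortTermMemory:
    """Bounded history of recent observations; the oldest are dropped first."""

    def __init__(self, max_items: int = 50):
        self._obs: deque = deque(maxlen=max_items)

    def remember(self, label: str, yaw_deg: float) -> None:
        self._obs.append(Observation(label, yaw_deg))

    def last_seen(self, label: str) -> Optional[Observation]:
        for obs in reversed(self._obs):
            if obs.label == label:
                return obs
        return None

def guidance(memory: ShortTermMemory, label: str, current_yaw_deg: float,
             tolerance_deg: float = 10.0) -> Optional[str]:
    """Return a guidance prompt, or None if the item was never seen."""
    obs = memory.last_seen(label)
    if obs is None:
        return None
    delta = obs.yaw_deg - current_yaw_deg
    if abs(delta) <= tolerance_deg:
        return "Yes, it's right here"
    return "Look to the right" if delta > 0 else "Look to the left"

memory = ShortTermMemory()
memory.remember("salt bread", yaw_deg=30.0)      # seen earlier, to the right
prompt = guidance(memory, "salt bread", current_yaw_deg=0.0)
```

Calling `guidance` again as the heading changes yields the follow-up prompt once the camera is back within the tolerance of where the item was seen.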

[0050] In the above interaction method, displaying the captured image lets users view it intuitively. The real-time display of the captured image compensates for the limitations of text descriptions and makes the interactive information richer. In addition, users can raise voice interaction information with respect to the captured image, and during real-time display different captured images can give rise to different voice interaction information. When it is determined that the captured image contains image information that can be used to respond to the voice interaction information, that image information is highlighted in the captured image. Human-computer interaction thus unfolds on the basis of the captured image displayed in real time, making the interaction between the electronic device and the user more intelligent.

[0051] In some scenarios, the image information used to respond to the voice interaction information may include multiple pieces of image information. The user can trigger a selection operation to choose the piece of image information they are interested in. In one embodiment, the above method further includes:

[0052] Step A: In response to a selection operation on target image information among the multiple pieces of image information, display the detailed information of the target image information in the captured image.

[0053] In some scenarios, the voice interaction information may query multiple items, so the image information used to respond may also include multiple pieces of image information. Continuing the example in which the captured image displayed in real time is the menu shown in Figure 5 and the voice interaction information is "Help me find the drink with the lowest alcohol content here," as shown in Figure 6: if the items on the menu in Figure 5 that can respond to the voice interaction information include the classic mojito and the Manhattan cocktail, the image information used to respond can include both. The user can select the desired one from the displayed options according to their preferences. For example, a user who prefers the classic mojito can select its image information. The selection can be triggered by voice, gesture, or other methods.

[0054] Optionally, in this embodiment, the detailed information of the target image information displayed in the captured image may include a detailed description of it. For example, when the user selects the classic mojito as the target image information, the detailed information may include the detailed description of the classic mojito cocktail shown in Figure 10. As an optional implementation, this description can be displayed in the captured image as a card, as shown in Figure 10.

[0055] Optionally, in this embodiment, after the user triggers a selection operation on target image information among the multiple pieces of image information, guidance information can be displayed on the captured image to help the user confirm their intent regarding the selection. As an optional implementation, the guidance information can be displayed floating on the captured image. For example, continuing the case in which the user selects the classic mojito from the menu, the guidance information could be the message shown in Figure 10: "You've selected a classic mojito, what would you like to do?" The user's intent could then be, for example, "I want to buy this classic mojito," or "I don't want to buy this classic mojito or other similar cocktails." Furthermore, after determining the user's intent, the electronic device can display more specific information, such as ingredients that should not be combined with a classic mojito, or its optimal serving temperature.

[0056] In this embodiment, the voice interaction information may query multiple items, and the image information used to respond may therefore include multiple pieces of image information. In this scenario, the user can trigger a selection operation on target image information among those displayed, and the electronic device responds by displaying the details of the target image information in the captured image. Through this interaction, the user obtains clearer and more detailed information about the selected item, and the selection step makes the human-computer interaction more intelligent when several pieces of image information respond to the same request.
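Step A can be sketched for the case where the selection is made by tapping the screen. The application only says the selection can be triggered by voice, gesture, or other methods, so the tap-based hit test here is an assumption, as are the names `select_target`, `candidates`, and `details` and the placeholder card text.

```python
from typing import Optional

Box = tuple[int, int, int, int]  # (x, y, width, height) in screen pixels

def select_target(candidates: dict[str, Box], tap_x: int, tap_y: int) -> Optional[str]:
    """Return the label of the first candidate whose box contains the tap, if any."""
    for label, (x, y, w, h) in candidates.items():
        if x <= tap_x < x + w and y <= tap_y < y + h:
            return label
    return None

# Hypothetical highlighted candidates and placeholder detail cards for the menu example.
candidates = {"classic mojito": (40, 120, 200, 60),
              "manhattan": (40, 200, 200, 60)}
details = {"classic mojito": "Detail card text for the classic mojito",
           "manhattan": "Detail card text for the manhattan"}

target = select_target(candidates, tap_x=100, tap_y=150)
card = details.get(target) if target else None  # text to show as a card, if anything was hit
```

A tap outside every highlighted box returns `None`, in which case no detail card is shown.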

[0057] In some scenarios, the display interface may include a termination control. When a user needs to terminate the analysis of voice interaction information presented in response to the captured image, they can trigger the termination control to stop the analysis of the currently triggered voice interaction information. In one embodiment, the above method further includes:

[0058] Step B: In response to the triggering operation of the termination control, stop responding to voice interaction information.

[0059] For example, the termination control can be the "X"-shaped control on the left side of the interface shown in Figure 11. If the user has already obtained the expected information and no longer needs information about the captured image displayed in real time, the user can trigger the termination control. The electronic device then responds to the trigger operation and stops responding to the voice interaction information, that is, it no longer highlights in the captured image the image information used to respond. Optionally, in this embodiment, the termination control may be triggered when the user no longer needs information about the captured image, or when the user switches to a different captured image and needs that image identified and analyzed instead. This embodiment does not limit the scenarios in which the termination control is triggered. For example, suppose the user switches the captured image displayed in real time from the menu shown in Figure 6 to a game interface being used by another user. To have the displayed game interface identified, the user can trigger the termination control to instruct the electronic device to stop responding to the voice interaction information related to the menu shown in Figure 6, and to respond instead to new voice interaction information that the user enters about the game interface.

[0060] In this embodiment, the display interface includes a termination control. When the termination control is triggered, the electronic device responds by no longer responding to the voice interaction information raised with respect to the captured image. This lets the user control the voice interaction according to real-time needs, making the human-computer interaction between the electronic device and the user more intelligent.

[0061] In some scenarios, the display interface may include an exit control, which the user can trigger to stop capturing. In one embodiment, the method further includes:

[0062] Step C: In response to a trigger operation on the exit control, stop capturing.

[0063] After the user obtains the desired item information through the application and wishes to stop capturing, as an optional implementation, the user can trigger the exit control on the display interface. The electronic device then responds to the trigger operation and stops capturing. For example, the exit control can be the control on the right side of the interface shown in Figure 11. Optionally, the user can trigger the exit control by clicking or double-clicking. After the electronic device stops capturing in response to the exit control, the application may be closed, so that the user must launch it again for future use; alternatively, the application may leave the foreground but keep running in the background, so that the user reopens it when needed.

[0064] In this embodiment, the display interface includes an exit control. When the user needs to stop shooting, the user can trigger the exit control, enabling the electronic device to respond to the trigger operation of the exit control and stop shooting, making the use of the application more flexible and convenient.
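
The difference between the termination control and the exit control described above can be illustrated with a minimal sketch. The class name `InteractionSession`, its fields, and its handler names are hypothetical and chosen only for illustration; the source does not specify any data structure or API.

```python
from dataclasses import dataclass, field


@dataclass
class InteractionSession:
    """Hypothetical per-session state; all names are illustrative assumptions."""
    capturing: bool = True      # the captured picture is being displayed
    responding: bool = False    # a voice query is currently being answered
    highlights: list = field(default_factory=list)  # highlighted screen information

    def on_voice_query(self, matches):
        # Queries are ignored once capture has stopped.
        if not self.capturing:
            return
        self.responding = True
        self.highlights = list(matches)

    def on_termination_control(self):
        # Stop responding to the current query and clear the highlights,
        # but keep capturing so that a new query can be asked.
        self.responding = False
        self.highlights.clear()

    def on_exit_control(self):
        # Stop capturing altogether; this also ends any active response.
        self.on_termination_control()
        self.capturing = False
```

In this sketch, triggering the termination control leaves `capturing` set, which matches the example of switching from the menu to the game interface and asking a new question, while triggering the exit control ends the session.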

[0065] In some scenarios, to ensure users clearly understand the content of the proposed voice interaction information, the voice interaction information can also be displayed as text on the display interface. In one embodiment, as shown in Figure 12, the above method further includes:

[0066] S301: Convert the voice interaction information into text information.

[0067] Optionally, in this embodiment, voice interaction information from the user regarding the captured image can be obtained from the microphone, and then the voice interaction information can be converted into corresponding text information. Optionally, in this embodiment, the voice interaction information can be converted using a neural network model, or voice recognition technology can be used to convert the voice interaction information into text information.

[0068] S302: Display the text information in the display bar of the display interface, where the display bar floats over the shooting screen in the display interface.

[0069] Optionally, in this embodiment, the display interface may include the display bar shown in Figure 13, which can display the text information converted from the voice interaction information, such as the text shown in Figure 13: "Help me find the drink with the lowest alcohol content here." Optionally, in this embodiment, the display bar can be displayed floating above the shooting screen. It can be understood that the shape of the display bar in Figure 13 is only an illustration; the display bar can also take other shapes, such as a circle or a rectangle. This embodiment does not limit the shape of the display bar. Optionally, in this embodiment, the border of the display bar can be thickened to highlight the display bar.

[0070] In this embodiment, by converting voice interaction information into text information, the text information corresponding to the voice interaction information can be displayed in the display bar of the display interface, so that the user can intuitively and accurately determine the content of the voice interaction information they input through the display bar; in addition, the display bar is displayed floating on the shooting screen in the display interface, and will not occupy too much display interface area.
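
Steps S301 and S302 can be sketched as follows. This is only an illustration under stated assumptions: `DisplayBar`, `layout_bar`, and `show_query_text` are hypothetical names, the bottom-anchored layout and the 8% height are arbitrary choices, and the `recognize` callable stands in for whichever neural network model or speech recognition technology is actually used.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DisplayBar:
    """Hypothetical floating bar drawn over the shooting screen."""
    x: int
    y: int
    width: int
    height: int
    text: str = ""


def layout_bar(screen_w: int, screen_h: int, height_percent: int = 8) -> DisplayBar:
    # Assumed layout: nearly full width, anchored near the bottom edge,
    # and only a small fraction of the screen height.
    margin = screen_w // 20
    height = screen_h * height_percent // 100
    return DisplayBar(x=margin, y=screen_h - height - margin,
                      width=screen_w - 2 * margin, height=height)


def show_query_text(audio: bytes,
                    recognize: Callable[[bytes], str],
                    bar: DisplayBar) -> DisplayBar:
    # S301: convert the voice interaction information into text information.
    bar.text = recognize(audio).strip()
    # S302: the caller draws `bar` floating over the shooting screen.
    return bar
```

Keeping the recognizer as an injected callable reflects that the source leaves the conversion technique open.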

[0071] To facilitate understanding by those skilled in the art, the interaction method provided in this disclosure will be described in detail below in two different scenarios:

[0072] In the first scenario, the user operates an electronic device to capture a scene (either a photo or a video), and the captured scene is displayed on the device's screen. The user provides voice interaction information about the captured scene. The electronic device analyzes the voice interaction information and the captured scene to determine whether the current frame contains any screen information that can be used to respond to the voice interaction information.

[0073] If it is determined that there is a piece of visual information in the current frame that can be used to respond to voice interaction information, the electronic device can highlight that piece of visual information in the captured image.

[0074] If it is determined that there are multiple pieces of screen information in the current frame that can be used to respond to the voice interaction information, the electronic device highlights these pieces of screen information in the shooting frame. In addition, it can also display, in the display frame, text information that can be used to respond to the voice interaction information. Based on the multiple pieces of screen information and the text information displayed, if the user selects a piece of target screen information of interest from among them, the electronic device can respond to the selection operation by displaying the details of the target screen information in the shooting frame. Furthermore, the electronic device can also display guidance information in the shooting frame to guide the user to determine their intent regarding the target screen information, such as guiding the user to decide whether to purchase the item displayed in the target screen information, or guiding the user to learn about item information related to that item. When the user wants to terminate the response to the current voice interaction information, the user can trigger the termination control in the display interface, so that the electronic device responds to the trigger operation on the termination control and stops responding to the voice interaction information. In addition, if the user wants to stop recording, they can trigger the exit control in the display interface, so that the electronic device responds to the trigger operation on the exit control and stops recording.
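
The first scenario can be sketched as a small decision flow. Everything named here is a hypothetical stand-in: `Detection` represents an item recognized in the current frame, and the category comparison in `find_responses` replaces the neural network model, whose details the source does not give.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """Hypothetical recognized item in the current frame."""
    label: str
    category: str
    box: tuple          # (x, y, w, h) in frame pixels
    details: str = ""


def find_responses(query_category: str, detections: list) -> list:
    # Stand-in for the recognition model: match on a category string.
    return [d for d in detections if d.category == query_category]


def build_overlay(query_category: str, detections: list) -> dict:
    matches = find_responses(query_category, detections)
    overlay = {"highlight": [d.box for d in matches], "text": None}
    if len(matches) > 1:
        # Several candidates: also show a text answer listing them.
        overlay["text"] = "Found: " + ", ".join(d.label for d in matches)
    return overlay


def on_select(target: Detection) -> dict:
    # Selecting a highlighted item shows its details plus guidance prompts.
    return {
        "details": target.details,
        "guidance": [f"Buy {target.label}?",
                     f"See items related to {target.label}"],
    }
```

A single match yields one highlighted box and no extra text, several matches yield highlights plus a text answer, and a selection returns the details and guidance for the chosen item.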

[0075] In the second scenario, the user operates an electronic device to capture images. The captured images are displayed on the device's screen and are stored as "short-term memory data" (temporarily stored in the device), and the device continues capturing. If, after a period of capturing, the user inputs voice interaction information, the device analyzes the voice interaction information and the captured images. If it determines that no screen information in the current frame can be used to respond to the voice interaction information, but screen information in a previous frame can, the device can generate location guidance information based on the location at which the item queried in the voice interaction information appeared within a historical time period. This location guidance information is then displayed as text on the device's screen. The displayed location guidance information, or location guidance information delivered by voice broadcast, can guide the user to rotate the camera so that the electronic device re-captures the previously captured image. If the item queried in the voice interaction information then appears in the captured image, the screen information that responds to the voice interaction information is highlighted in the captured image, and details of the queried item are displayed in the captured image. Furthermore, the electronic device can also display guidance information in the captured image to guide the user to determine their intent regarding the queried item, such as guiding the user to decide whether to purchase it, or guiding the user to learn about related item information. When the user wants to terminate the response to the current voice interaction information, the user can trigger the termination control in the display interface, so that the electronic device responds to the trigger operation on the termination control and stops responding to the voice interaction information. In addition, if the user wants to stop recording, they can trigger the exit control in the display interface, so that the electronic device responds to the trigger operation on the exit control and stops recording.
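
One possible way to realize the "short-term memory data" and the location guidance of the second scenario is sketched below. The source does not say how locations are recorded, so this sketch assumes that each stored frame carries a camera heading (`yaw_deg`), for example from an orientation sensor; that field, the bounded buffer size, and all names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FrameRecord:
    timestamp: float
    yaw_deg: float       # assumed camera heading when the frame was captured
    labels: frozenset    # items recognized in that frame


class ShortTermMemory:
    def __init__(self, max_frames: int = 300):
        # Bounded buffer: the oldest frames are dropped automatically.
        self._frames = deque(maxlen=max_frames)

    def remember(self, record: FrameRecord) -> None:
        self._frames.append(record)

    def last_seen(self, label: str) -> Optional[FrameRecord]:
        # Search newest-first for the most recent frame containing the item.
        for record in reversed(self._frames):
            if label in record.labels:
                return record
        return None


def location_guidance(label: str, current: FrameRecord,
                      memory: ShortTermMemory) -> Optional[str]:
    if label in current.labels:
        return None   # item is in view: highlight it instead of guiding
    seen = memory.last_seen(label)
    if seen is None:
        return None   # never seen: no guidance can be generated
    # Signed shortest rotation from the current heading to the remembered one.
    delta = (seen.yaw_deg - current.yaw_deg + 180.0) % 360.0 - 180.0
    direction = "right" if delta >= 0 else "left"
    return (f"Turn the camera about {abs(delta):.0f} degrees "
            f"to the {direction} to find the {label}.")
```

The returned string could be shown as text or passed to a voice broadcast; once the item reappears in the current frame, the function returns `None` and the normal highlighting path applies.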

[0076] It should be noted that the descriptions of the above steps can be found in the relevant descriptions in the above embodiments, and their effects are similar, so they will not be repeated here.

[0077] It should be understood that although the steps in the flowcharts of the above embodiments are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the above embodiments may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.

[0078] Based on the same inventive concept, this application also provides an interactive device for implementing the interactive method described above. The solution provided by this device is similar to the solution described in the above method; therefore, the specific limitations in one or more interactive device embodiments provided below can be found in the limitations of the interactive method described above, and will not be repeated here.

[0079] In one embodiment, as shown in Figure 14, an interactive device is provided, comprising: a first display module, a receiving module, and a second display module, wherein:

[0080] The first display module is used to display the images captured by the electronic device.

[0081] The receiving module is used to receive voice interaction information, which is information presented in response to the captured image.

[0082] The second display module is used to highlight the image information in the captured image when it is determined that there is image information in the captured image that can be used to respond to voice interaction information.

[0083] Optionally, the aforementioned screen information includes information about the item queried in the voice interaction information and / or location guidance information for the item.

[0084] Optionally, the location guidance information of the aforementioned item is generated based on the location at which the item queried in the voice interaction information appeared in the captured images within a historical time period.

[0085] The interactive device provided in this embodiment can execute the above method embodiment, and its implementation principle and technical effect are similar, so it will not be described again here.

[0086] Based on the above embodiments, optionally, the screen information includes multiple screen information, and the device further includes: a third display module, wherein:

[0087] The third display module is used to display detailed information of the target image in the captured image in response to the selection operation of the target image information among multiple image information.

[0088] The interactive device provided in this embodiment can execute the above method embodiment, and its implementation principle and technical effect are similar, so it will not be described again here.

[0089] Optionally, based on the above embodiments, the device further includes: a fourth display module, wherein:

[0090] The fourth display module is used to display guidance information in the captured image; the guidance information is used to guide the user to determine the intent information for the target image.

[0091] The interactive device provided in this embodiment can execute the above method embodiment, and its implementation principle and technical effect are similar, so it will not be described again here.

[0092] Optionally, based on the above embodiments, the device further includes: a fifth display module, wherein:

[0093] The fifth display module is used to, when the captured image is re-shot based on the location guidance information of the item and the item queried in the voice interaction information appears in the captured image, highlight the screen information that responds to the voice interaction information.

[0094] The interactive device provided in this embodiment can execute the above method embodiment, and its implementation principle and technical effect are similar, so it will not be described again here.

[0095] Based on the above embodiments, the display interface includes a termination control. Optionally, the device further includes a first response module, wherein:

[0096] The first response module is used to respond to a trigger operation on the termination control and stop responding to voice interaction information.

[0097] The interactive device provided in this embodiment can execute the above method embodiment, and its implementation principle and technical effect are similar, so it will not be described again here.

[0098] Based on the above embodiments, the display interface includes an exit control. Optionally, the device further includes a second response module, wherein:

[0099] The second response module is used to stop recording the screen in response to a trigger operation on the exit control.

[0100] The interactive device provided in this embodiment can execute the above method embodiment, and its implementation principle and technical effect are similar, so it will not be described again here.

[0101] Based on the above embodiments, optionally, the above device further includes: a conversion module and a fifth display module, wherein:

[0102] The conversion module is used to convert voice interaction information into text information.

[0103] The fifth display module is used to display text information in the display bar of the display interface; wherein, the display bar floats on the shooting screen in the display interface.

[0104] The interactive device provided in this embodiment can execute the above method embodiment, and its implementation principle and technical effect are similar, so it will not be described again here.

[0105] Each module in the aforementioned interactive device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the memory of a computer device as software, so that the processor can invoke and execute the operations corresponding to each module.
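
As a purely illustrative software-only reading of the module structure above, the three core modules can be written as methods of one object. The class `InteractionDevice` and the injected `find_screen_info` callable are assumptions made for this sketch; the source equally allows hardware or mixed implementations.

```python
from typing import Callable, List, Optional


class InteractionDevice:
    """Hypothetical software composition of the core modules."""

    def __init__(self, find_screen_info: Callable[[str, str], List[str]]):
        # `find_screen_info(frame, query)` stands in for the recognition model.
        self._find = find_screen_info
        self.frame: Optional[str] = None
        self.query: Optional[str] = None
        self.highlighted: List[str] = []

    def first_display_module(self, frame: str) -> None:
        # Display the captured picture.
        self.frame = frame

    def receiving_module(self, query: str) -> None:
        # Receive voice interaction information about the captured picture.
        self.query = query

    def second_display_module(self) -> List[str]:
        # Highlight screen information only when some exists for the query.
        if self.frame is None or self.query is None:
            return []
        self.highlighted = self._find(self.frame, self.query)
        return self.highlighted
```

The optional third to fifth display modules and the response modules would extend this object in the same way, each as a further method or component.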

[0106] In one embodiment, a computer device is provided, which may be an electronic device, and its internal structure may be as shown in Figure 15. The computer device includes a processor, memory, input/output interfaces, a communication interface, a display unit, and an input device. The processor, memory, and input/output interfaces are connected via a system bus, and the communication interface, display unit, and input device are connected to the system bus via the input/output interfaces. The processor provides computing and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system and computer programs. The internal memory provides an environment for running the operating system and computer programs stored in the non-volatile storage media. The input/output interfaces are used for exchanging information between the processor and external devices. The communication interface is used for wired or wireless communication with external terminals; wireless communication can be achieved through Wi-Fi, mobile cellular networks, NFC (Near Field Communication), or other technologies. When the computer program is executed by the processor, it implements an interactive method. The display unit is used to form a visible image and can be a display screen, a projection device, or a virtual reality imaging device. The display screen can be an LCD screen or an e-ink screen. The input device of the computer device can be a touch layer covering the display screen, a button, trackball, or touchpad set on the casing of the computer device, or an external keyboard, touchpad, or mouse.

[0107] Those skilled in the art will understand that the structure shown in Figure 15 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0108] This application also provides a computer-readable storage medium: one or more non-volatile computer-readable storage media containing computer-executable instructions which, when executed by one or more processors, cause the processors to perform the steps of the interactive method.

[0109] This application also provides a computer program product containing instructions that, when run on a computer, cause the computer to perform an interactive method.

[0110] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments described above. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, etc., and are not limited to these.

[0111] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0112] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.

Claims

1. An interaction method, characterized in that the method includes: displaying a captured image of an electronic device; receiving voice interaction information, wherein the voice interaction information is information presented in response to the captured image; and if it is determined that there is image information in the captured image that can be used to respond to the voice interaction information, highlighting the image information in the captured image.

2. The method according to claim 1, characterized in that the image information includes multiple pieces of image information, and the method further includes: in response to a selection operation on target image information among the multiple pieces of image information, displaying detailed information of the target image information in the captured image.

3. The method according to claim 2, characterized in that the method further includes: displaying guidance information in the captured image, wherein the guidance information is used to guide the user to determine intent information for the target image information.

4. The method according to any one of claims 1 to 3, characterized in that the image information includes information about the item queried in the voice interaction information and/or location guidance information of the item.

5. The method according to claim 4, characterized in that the location guidance information of the item is generated based on location information of the item queried in the voice interaction information as it appeared in the captured image within a historical time period.

6. The method according to claim 5, characterized in that the method further includes: based on the location guidance information of the item, guiding the user to operate the camera of the electronic device to find the item queried in the voice interaction information; and if the item queried in the voice interaction information appears in the captured image, highlighting the image information that responds to the voice interaction information.

7. The method according to claim 1, characterized in that a display interface includes a termination control, and the method further includes: in response to a trigger operation on the termination control, stopping the response to the voice interaction information.

8. The method according to claim 1, characterized in that a display interface includes an exit control, and the method further includes: in response to a trigger operation on the exit control, stopping the capture of the image.

9. The method according to claim 1, characterized in that the method further includes: converting the voice interaction information into text information; and displaying the text information in a display bar of a display interface, wherein the display bar floats over the captured image in the display interface.

10. An interactive device, characterized in that the device includes: a first display module, used to display a captured image of an electronic device; a receiving module, used to receive voice interaction information, wherein the voice interaction information is information presented in response to the captured image; and a second display module, used to highlight image information in the captured image when it is determined that there is image information in the captured image that can be used to respond to the voice interaction information.

11. An electronic device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the computer program is executed by the processor, the processor performs the steps of the method according to any one of claims 1 to 9.

12. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 9.

13. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 9.