Method and apparatus for generating reply, device, and program product

By generating text descriptions of images and combining them with document context, target prompts are determined, and a target model is used to generate rich and detailed responses. This solves the problem of insufficient utilization of image information in existing technologies and improves the accuracy and richness of responses.

WO2026123376A1 · PCT designated stage · Publication Date: 2026-06-18 · BEIJING ZITIAO NETWORK TECH CO LTD

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: BEIJING ZITIAO NETWORK TECH CO LTD
Filing Date: 2024-12-13
Publication Date: 2026-06-18

AI Technical Summary

⚠Technical Problem

Existing technologies struggle to accurately summarize image content and combine it with document text when processing image information, leading to decreased accuracy in generated responses and an inability to effectively utilize image information.

⚗Method used

By generating text descriptions of images and combining them with document context, target prompts are determined. A target model is then used to generate a visually appealing response, utilizing the original images in the document for reference and explanation.

🎯Benefits of technology

It enables accurate mining and integration of image information, generating more accurate text and image responses, and improving the accuracy and richness of the responses.

✦ Generated by Eureka AI based on patent content.


Abstract

Embodiments of the present disclosure provide a method and apparatus for generating a reply, a device, and a computer program product. The method comprises: on the basis of an image in a document and an image context, generating a textual description of the image. The method further comprises: on the basis of the image, the textual description of the image, and text in the document, determining a target prompt word, the target prompt word being used for reusing the image in an image-containing textual reply to be generated for a user input. The method further comprises: on the basis of the user input, the image, the textual description of the image, the text in the document, and the target prompt word, using a target model to generate an image-containing textual reply. According to the method in the embodiments of the present disclosure, an image context can be used as a context of an image in a document, so that information in the image can be accurately and fully mined, and integration of image and textual information is implemented. Addition of reference content related to text in the document facilitates generating a more accurate reply, and the image can be used to generate an image-containing textual reply.


Description

Method, apparatus, device, and program product for generating responses

Technical Field

[0001] This disclosure relates generally to the field of computers, and more specifically to methods, apparatus, devices, computer-readable storage media, and computer program products for generating responses.

Background Technology

[0002] In the research of deep learning technology, language models have developed rapidly. Language models can process and generate natural language text, and their applications are increasingly expanding into various fields. As part of deep learning technology, language models, through training on large-scale text datasets, capture complex patterns and structures in language, thereby achieving a high degree of understanding and generation capabilities of human language.

[0003] Language models are widely used in various fields. For example, they support natural language processing tasks such as text classification, sentiment analysis, and question answering, and they can assist in writing, generate creative copy, or automatically summarize content. In dialogue systems, they serve as the technical support behind chatbots, enabling intelligent human-computer interaction. Furthermore, they have shown great potential in areas such as code completion and debugging, and educational assistance.

Summary of the Invention

[0004] According to exemplary embodiments of this disclosure, a method, apparatus, device, computer storage medium, and computer program product for generating responses are provided.

[0005] In a first aspect of this disclosure, a method for generating a response is provided. The method comprises generating a text description of an image based on the image in a document and the image context. The method further comprises determining target prompt words based on the image, the text description of the image, and the text in the document, the target prompt words being used to reuse the image in a text-image response to be generated for a user input. The method also comprises generating the text-image response using a target model based on the user input, the image, the text description of the image, the text in the document, and the target prompt words.

[0006] In a second aspect of this disclosure, an apparatus for generating responses is provided. The apparatus includes a text description generation module configured to generate a text description of an image based on the image in a document and the image context. The apparatus also includes a prompt word determination module configured to determine target prompt words based on the image, the text description of the image, and the text in the document. The target prompt words are used to reuse the image in a text-image response to be generated for a user input. The apparatus further includes a text-image response generation module configured to generate the text-image response using a target model based on the user input, the image, the text description of the image, the text in the document, and the target prompt words.

[0007] In a third aspect of this disclosure, an electronic device is provided, comprising: at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions causing the electronic device to perform the method described in the first aspect of this disclosure when executed by the at least one processing unit.

[0008] In a fourth aspect of this disclosure, a computer-readable storage medium is provided having machine-executable instructions stored thereon, which, when executed by a device, cause the device to perform the method described in the first aspect of this disclosure.

[0009] In a fifth aspect of this disclosure, a computer program product is provided, including computer-executable instructions, wherein the computer-executable instructions, when executed by a processor, implement the method described in the first aspect of this disclosure.

[0010] The summary section is provided to introduce a series of concepts in a simplified form, which will be further described in the detailed description below. The summary section is not intended to identify key or essential features of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.

Brief Description of the Drawings

[0011] Figure 1 shows a schematic diagram of an example environment in which embodiments of the present disclosure can be implemented;

[0012] Figure 2 shows a flowchart of a method for generating a response according to an embodiment of the present disclosure;

[0013] Figure 3A shows a schematic diagram of generating a text description of an image according to an embodiment of the present disclosure;

[0014] Figure 3B shows a schematic diagram of generating a text-image response according to an embodiment of the present disclosure;

[0015] Figure 4 shows a schematic diagram of a method for generating a text-image response according to an embodiment of the present disclosure;

[0016] Figure 5 shows a schematic diagram of a method for generating a response according to an embodiment of the present disclosure;

[0017] Figure 6 shows a schematic diagram of the structure of a text-image response according to an embodiment of the present disclosure;

[0018] Figure 7 shows a schematic block diagram of an example apparatus according to some embodiments of the present disclosure;

[0019] Figure 8 shows a block diagram of an example device that can be used to implement embodiments of the present disclosure.

[0020] In all the accompanying figures, the same or similar reference numerals denote the same or similar elements.

Detailed Description

[0021] The names of messages or information exchanged between multiple devices in the embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of these messages or information. It is understood that before using the technical solutions disclosed in the embodiments of this disclosure, users should be informed of the types, scope of use, and usage scenarios of the personal information involved in this disclosure in an appropriate manner in accordance with relevant laws and regulations, and user authorization should be obtained.

[0022] For example, upon receiving a user's active request, a prompt message is sent to the user to explicitly inform them that the requested operation will require the acquisition and use of the user's personal information. This allows the user to independently choose whether to provide personal information to the software or hardware, such as the electronic device, application, server, or storage medium performing the operations of this disclosed technical solution, based on the prompt message. As an optional but non-limiting implementation, the prompt message can be sent to the user in the form of a pop-up window, where the prompt message can be presented in text format. Furthermore, the pop-up window can also include a selection control for the user to choose "agree" or "disagree" to provide personal information to the electronic device.

[0023] It is understood that the above notification and user authorization process are merely illustrative and do not constitute a limitation on the implementation of this disclosure. Other methods that comply with relevant laws and regulations may also be applied to the implementation of this disclosure.

[0024] Embodiments of this disclosure will now be described in more detail with reference to the accompanying drawings. While some embodiments of this disclosure are shown in the drawings, it should be understood that this disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided to provide a more thorough and complete understanding of this disclosure. It should be understood that the accompanying drawings and embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of protection of this disclosure.

[0025] In the description of embodiments of this disclosure, the term "comprising" and similar terms should be understood as open-ended inclusion, i.e., "including but not limited to". The term "based on" should be understood as "at least partially based on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first", "second", etc., may refer to different or the same objects unless explicitly stated. Other explicit and implicit definitions may also be included below.

[0026] When using computer technology to provide question-and-answer services, people can use computers to summarize the content of documents to provide answers. However, when documents contain images, image information typically includes many aspects such as brightness, saturation, numerous physical objects (e.g., a single image can show a beach, sun, sea, chair, ship, balloon, etc.), and colors. Therefore, existing methods often fail to adequately summarize the images to find helpful information, and image processing is usually abandoned. Allowing a language model to process the images unchecked may yield biased information, reducing the accuracy of the answer. Furthermore, abandoning image processing prevents the reuse of images, resulting in answers that are usually plain text.

[0027] In response, this disclosure proposes a method for generating responses. The method uses the image context as the context of the image within the document, so that information in the image can be accurately and fully extracted, text and image information can be integrated, and reference content related to the document's text can be added. This helps generate more accurate responses. Furthermore, since images generated by language models are not always accurate, referring to the document's images in the target prompts allows the original images to provide more accurate references and explanations for the generated responses, and the target model can be used to generate responses that combine text and images.

[0028] Embodiments of the present disclosure will now be described in further detail with reference to the accompanying drawings. Figure 1 illustrates a schematic diagram of an example environment 100 in which embodiments of the present disclosure can be implemented. The example environment 100 includes a computing device 110 and a computing device 120. The computing device 110 can process data to provide a response service to user devices (e.g., computing device 120) accessing the computing device 110. In one embodiment, a target model (e.g., a language model) may be deployed on the computing device 110 and used to automatically generate text-based responses to user input received from the computing device 120. In another embodiment, a response algorithm may be deployed on the computing device 110 and used to automatically generate text-based responses to user input received from the computing device 120. In some embodiments, the computing device 120 and the computing device 110 communicate via a network 130. The network 130 may include a wired network, a wireless network, or a combination thereof, for providing communication between the computing device 120 and the computing device 110. In some embodiments, computing device 120 may be connected to computing device 110 via a data cable. This disclosure does not limit the connection method between computing device 110 and computing device 120.

[0029] In an embodiment where computing device 110 provides a response via a language model, computing device 120 may have installed an application (e.g., a client program) or a browser capable of invoking the language model. Taking environment 100 in Figure 1 as an example, computing device 120 can communicate with computing device 110 via network 130, sending user input 122, "What types of headphones are there?", to computing device 110. In one embodiment, computing device 110 can use a search engine to search the Internet for the question, obtain multiple reference web pages, and prepare a response based on these reference web pages. In another embodiment, a local database of computing device 110 may store a large number of reference documents, and computing device 110 can recall associated reference documents based on the user input to prepare a text-image response.

[0030] After acquiring document 112, computing device 110 can generate a text description 118 for image 116 based on image 116 and image context 114 in document 112. Computing device 110 can use image processing tools to comprehensively process image 116 and image context 114. For example, it can use a language model to comprehensively understand image 116 and image context 114, thereby generating the text description 118 for the image. For instance, image 116 and image context 114 can be input into a language model, which will then generate the text description 118 for image 116. Computing device 110 not only mines image information but also uses image context 114 as a reference for mining image 116, thus enabling it to accurately summarize image content within the context of image 116 to generate the text description 118 for the image, without being interfered with by other information in image 116.

[0031] The computing device 110 can determine the target prompt 134 based on image 116, its text description 118, and the text 132 in document 112. The target prompt 134 is used to reuse image 116 in the text-image response to be generated for the user input. With the text description 118 of the image, the computing device 110 has more relevant reference material and can answer the user's question based on both the text description 118 and the text 132. The target model on the computing device 110 can follow the target prompt 134 and use image 116 from document 112 to help present the text-image response 124.

[0032] The computing device 110 can generate a text-image response 124 based on user input 122, image 116, text description 118 of image 116, text 132 in document 112, and target prompt words 134, using a target model. This target model can be a language model deployed on the computing device 110, such as a large language model. The target model can generate relevant prompts for the text-image response based on the target prompt words 134, providing an accurate text-image response to the user input.

[0033] According to the method of the embodiments of this disclosure, the computing device 110 can use the context of the image as the context of the image in the document, thereby accurately and fully mining the information in the image, realizing the integration of text and image information, and adding reference content related to the text of the document. This helps to generate more accurate responses. Furthermore, since the image generated by the language model is not accurate enough, by using the image in the document in the target prompt words, the original image in the document can be used to provide a more accurate reference and explanation for the generated response, and the target model can be used to generate a response with both text and images.

[0034] In one embodiment, computing device 110 can send the text-image response 124 to computing device 120 via network 130 and render the text-image response 124 in the application of computing device 120. In another embodiment, the text-image response 124 is rendered in a multi-level manner. For example, the text-image response 124 could be “#Headphones come in the following types##Over-ear headphones…”, where “#” represents a level, and the image 116 from the document is embedded at an appropriate position within the text-image response 124. For example, an image of over-ear headphones could be embedded within a text description of over-ear headphones.
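The multi-level format can be illustrated with a minimal Python sketch. It assumes that a run of "#" characters opens a heading whose level equals the length of the run, and that an embedded image is referenced by a hypothetical `[[image:N]]` placeholder. Neither the placeholder syntax nor the function name comes from the disclosure.

```python
import re

# Hypothetical placeholder syntax for an image embedded in the reply.
IMAGE_TOKEN = re.compile(r"\[\[image:(\d+)\]\]")
# Assumption: a run of '#' opens a heading and the run length is its level.
HEADING = re.compile(r"(#+)([^#]+)")

def parse_levels(reply: str) -> list[tuple[int, str, list[int]]]:
    """Split a '#'-marked reply into (level, text, image_ids) entries."""
    entries = []
    for marks, body in HEADING.findall(reply):
        image_ids = [int(n) for n in IMAGE_TOKEN.findall(body)]
        text = IMAGE_TOKEN.sub("", body).strip()
        entries.append((len(marks), text, image_ids))
    return entries

reply = "#Headphones come in the following types##Over-ear headphones[[image:116]]"
for level, text, image_ids in parse_levels(reply):
    print(level, text, image_ids)
```

A renderer on computing device 120 could map each level to a heading size and replace each image id with the stored image.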

[0035] As shown in Figure 1, in environment 100, network 130 can be used to transmit data between computing device 110 and computing device 120. Network 130 has a theoretical bandwidth, which refers to the maximum transmission speed supported by network 130. It represents the maximum amount of data that network 130 can transmit under ideal conditions, usually measured in bits per second (bps). For example, if the theoretical bandwidth of network 130 is 100 Mbps, it means that under ideal conditions it can transmit 100 megabits of data per second. However, in reality, due to other factors that may exist in the network (e.g., signal interference, bandwidth sharing, transmission delay, etc.), the actual transmission speed may not reach 100 Mbps.
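As a worked illustration of the bandwidth arithmetic, the short Python sketch below computes the ideal transfer time for a payload at the theoretical bandwidth. The 25 MB payload size is a hypothetical value chosen for the example, and the result is only a lower bound, since interference, sharing, and delay are ignored.

```python
def ideal_transfer_seconds(size_bytes: int, bandwidth_mbps: float) -> float:
    """Lower bound on transfer time at the theoretical bandwidth."""
    bits = size_bytes * 8                      # bandwidth is measured in bits
    return bits / (bandwidth_mbps * 1_000_000)

# Hypothetical payload: 25,000,000 bytes over a 100 Mbps link.
print(ideal_transfer_seconds(25_000_000, 100))  # 2.0
```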

[0036] As understood by those skilled in the art, instances of computing device 110 can be independent physical servers, server clusters or distributed systems composed of multiple physical servers, or cloud servers providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. Servers can be connected directly or indirectly via wired or wireless communication, and this application does not impose any limitations on this.

[0037] The computing device 120 can be any type of mobile computing device, including mobile computers (e.g., personal digital assistants (PDAs), laptops, notebook computers, tablet computers, netbooks, etc.), mobile phones (e.g., cellular phones, smartphones, etc.), wearable computing devices (e.g., smartwatches, head-mounted devices, including smart glasses, etc.) or other types of mobile devices. In some embodiments, the computing device 120 can also be a fixed computing device, such as a desktop computer, game console, smart TV, etc.

[0038] It should be understood that the architecture and functionality in example environment 100 are described for illustrative purposes only and do not imply any limitation on the scope of this disclosure. Embodiments of this disclosure can also be applied to other environments with different structures and / or functionalities.

[0039] The processes according to embodiments of this disclosure will be described in detail below with reference to other accompanying drawings. For ease of understanding, the specific data mentioned in the following description are exemplary and not intended to limit the scope of this disclosure. It will be understood that the embodiments described below may also include additional actions not shown and / or actions shown may be omitted, and the scope of this disclosure is not limited in this respect.

[0040] Figure 2 illustrates a flowchart of a method 200 for generating a response according to certain embodiments of the present disclosure. In this embodiment, the method may be performed by a computing device 110. At block 202, a text description of an image is generated based on the image in the document and the image context. The document may be a webpage retrieved by the computing device 110 based on user input, or a document in some format stored locally, containing images and text. In the document, images are often embedded in different paragraphs of the text as illustrations, serving a supplementary explanation function. In this operation, the image context refers to other images or text near the location of a specific image, used to help interpret the content of that specific image. For example, the image context may be text above or below the image. By combining and analyzing the image and its image context, the computing device 110 can interpret the image within its context, which is beneficial for accurately extracting information from the image. Therefore, the generated text description of the image has high accuracy, which lays a good foundation for generating an accurate response.

[0041] At block 204, target prompts are determined based on the image, its text description, and the text in the document. These prompts are used to reuse the image in the text-image response to be generated for the user input. The target prompt instructs the target model to complete the response task according to the given information. For example, the prompt might tell the target model that it will receive multiple images and that, when generating the response, it needs to place these images at appropriate positions in the text so that they visually illustrate the text. As another example, the sentence "The following is real user input" can be added after the task requirements for the response. This tells the target model to generate a text-image response that meets the task requirements based on the user input and the target prompt.

[0042] At block 206, a target model is used to generate a text-image response based on the user input, the image, the image's text description, the text in the document, and the target prompts. The user input can be an instruction entered by the user after launching the language model application on computing device 120. For example, the user input could be "What types of headphones are there?" or "What's the difference between a manta ray and a ray?" Based on the task requirements in the target prompts, the image's text description can be used as reference material for generating the response. Computing device 110 can combine the document text and the image's text description, along with the image, to generate a more accurate response and can insert the image at an appropriate position to obtain a more informative text-image response. In this embodiment, to improve the accuracy of the text-image response and ensure that the image can be accurately inserted into the response, a language model, such as a large language model (LLM), can be used to generate the text-image response. An LLM is a deep learning model trained on a large amount of text data that can generate natural language text or understand the meaning of language text. Therefore, a language model can be used to generate a text-image response based on the user input, the image, the image's text description, and the text in the document.
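The three blocks of method 200 can be sketched as one function, shown below in Python. This is an illustrative outline only: `DocImage`, `generate_reply`, and the three callables are hypothetical names, the models are passed in as plain functions rather than real model clients, and the sketch folds the descriptions and document text into the prompt as a simplification.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DocImage:
    image_id: int
    data: bytes    # raw image bytes taken from the document
    context: str   # text near the image in the document

def generate_reply(
    user_input: str,
    text: str,
    images: list[DocImage],
    describe: Callable[[DocImage], str],
    build_prompt: Callable[[dict[int, str], str], str],
    target_model: Callable[[str, str, list[DocImage]], str],
) -> str:
    # Block 202: one text description per image, using its context.
    descriptions = {img.image_id: describe(img) for img in images}
    # Block 204: determine the target prompt from descriptions and text.
    target_prompt = build_prompt(descriptions, text)
    # Block 206: the target model produces the text-image response.
    return target_model(target_prompt, user_input, images)
```

Passing the models as callables keeps the control flow testable with stubs and leaves open which multimodal model or language model is actually deployed.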

[0043] The method according to the embodiments of this disclosure can use the image context as the image's context in the document, thereby accurately and fully mining the information in the image, realizing the integration of image and text information, and adding reference content related to the document's text. This helps to generate more accurate responses. Furthermore, since the images generated by the language model are not accurate enough, by using images from the document in the target prompts, the original images in the document can be used to provide more accurate references and explanations for the generated responses, and the target model can be used to generate responses with both images and text.

[0044] Regarding block 202, Figure 3A illustrates a schematic diagram of generating a text description of an image according to an embodiment of the present disclosure. In this embodiment, after the document is obtained, the image is obtained at 302, and the context of the image is obtained at 304. For example, the context of the image may be about 200 characters of text above the image. At 306, the semantic features of the image are mined to read the various kinds of information included in the image. For example, by analyzing the image, it can be determined what type of headphones are in the image, what color the headphones are, who is wearing them, what the environment in the image is like, and so on. Information unrelated to headphones, such as the brightness of the image, can also be obtained. As an example, the image shows a person wearing headphones listening to music in a public place. At 308, the semantic features of the image and the context of the image can be combined for semantic analysis to describe the relevant information in the image with reference to the context. For example, the context of the image may be "The image below shows a pair of headphones with good sound insulation..." At 310, semantic analysis of the context and the semantic features of the image can be performed to obtain a text description of the image. For example, the text description could be, "This is a pair of headphones that cover the ears, allowing people to enjoy music undisturbed in noisy public environments." In this embodiment, the semantic features of the image are combined with its context, allowing detailed information to be extracted from the image within a specific context.
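The context step at 304 can be sketched in Python as a simple window over the document text. The sketch assumes the document has already been flattened to a string and that the image position is known as a character offset; the function name and the offset representation are illustrative, and the 200-character default follows the example above.

```python
def context_above(document_text: str, image_offset: int, window: int = 200) -> str:
    """Return up to `window` characters of text immediately before the image."""
    start = max(0, image_offset - window)
    return document_text[start:image_offset].strip()

text = "The image below shows a pair of headphones with good sound insulation."
print(context_above(text, len(text)))
```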

[0045] In this embodiment, the document can be a webpage retrieved by a search engine based on user input by the computing device 110. Here, the image is a picture within the webpage, and the image context can be information surrounding the image, such as adjacent text, tags, etc. In this embodiment, the webpage's source code is parsed based on multiple target keywords to obtain the image and its context. This step involves using programming techniques to parse the webpage's source code and find preset target keywords. These keywords are image attributes, surrounding text content, or other textual information that helps identify the image and its associated elements. For example, webpages are typically written using Hypertext Markup Language (HTML), a standard markup language used to create webpages. It is one of the fundamental technologies constituting the World Wide Web. HTML uses predefined tags to annotate different parts of a document, such as headings, paragraphs, links, and images. In webpages written in HTML, images and text can be easily mixed and arranged, making the webpage not only user-friendly for human readers but also easy for search engines and other software to parse.
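A minimal sketch of this kind of parsing, using Python's standard `html.parser` module, is shown below. The class name, the choice of `src` and `href` as target keywords, and the rule that the context is the text seen since the previous image are all assumptions made for illustration, not details taken from the disclosure.

```python
from html.parser import HTMLParser

class ImageContextParser(HTMLParser):
    """Collect each image address together with the text that precedes it."""

    def __init__(self, keywords=("src", "href")):
        super().__init__()
        self.keywords = keywords  # attribute names treated as target keywords (assumed)
        self.buffer = []          # text seen since the previous image
        self.images = []          # list of (address, context) pairs

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        for name, value in attrs:
            if name in self.keywords and value:
                # Collapse whitespace so the context reads as one line.
                context = " ".join("".join(self.buffer).split())
                self.images.append((value, context))
                self.buffer = []
                break

    def handle_data(self, data):
        self.buffer.append(data)

page = ('<p>The image below shows headphones with good sound insulation.</p>'
        '<img src="a.png" alt="x">')
parser = ImageContextParser()
parser.feed(page)
parser.close()
print(parser.images)
```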

[0046] When a webpage is written in HTML, the target keyword can be "href". The "href" attribute can include a resource link, which often points to the storage address of an image. The target keywords are selected so that the images and their context on the webpage can be found and understood more accurately, ensuring the accuracy of subsequent steps. In this embodiment, a text description of the image is generated based on the image and its context. After the image is downloaded, semantic analysis can be performed on the image and its context to generate the text description. In this embodiment, after the image is downloaded using the address found in the webpage's source code, the image can be stored in memory and read from memory when the text-image response is generated. The text-image response is then generated based on the read image and other elements. The original address of the image is unreliable because the image can be deleted or replaced by a third party. Storing the image ensures that it is not lost due to changes at its original address when the text-image response is generated, resulting in a stable text-image response.
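The idea of keeping a downloaded copy can be sketched as a small in-memory store in Python. `ImageStore` and the injected `fetch` callable are hypothetical; in practice `fetch` would wrap whatever download routine the deployment uses.

```python
from typing import Callable

class ImageStore:
    """Keep downloaded image bytes in memory, keyed by original address."""

    def __init__(self, fetch: Callable[[str], bytes]):
        self._fetch = fetch                 # download routine, supplied by the caller
        self._blobs: dict[str, bytes] = {}

    def get(self, address: str) -> bytes:
        # Download only on first use; later reads come from memory, so a
        # deleted or replaced original no longer affects the response.
        if address not in self._blobs:
            self._blobs[address] = self._fetch(address)
        return self._blobs[address]
```

Because later reads never go back to the original address, the image embedded in the response is the one that was described earlier.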

[0047] In this embodiment, not all images on the webpage are relevant; some may be target-type images, such as announcement or notification images. If the multiple images include a target-type image, that image is removed. This improves the accuracy of the text-image response.

[0048] Figure 3B illustrates a schematic diagram of generating a text-image response according to an embodiment of the present disclosure. In this embodiment, the text description of the image includes a title 320. Before the text-image response is generated, the title 320, text 321, and image 326 are integrated into a target prompt 330. The target prompt 330 is intended for the target model and may include detailed task requirements, image conversion instructions, scene descriptions, and the like. The computing device 110 uses the target model 324 to perform comprehensive summarization and analysis based on the target prompt 330, user input 322, title 320, text 321, and image 326 to generate a text-image response 328. These elements together constitute the core inputs for the text-image response to be generated. In this process, the title, as supplementary reference information, can improve the accuracy of the text-image response. Although the image itself is one of the key information sources, the computing device 110 may choose not to perform detailed semantic or content analysis on the image again, to avoid duplicated effort and redundant information. Instead, computing device 110 can rely on the existing title and text information to determine the content of the response to the user input, and embed image 326 at an appropriate position according to the theme of the image to make the response more intuitive and persuasive.

[0049] In this embodiment, to ensure that the generated text and image response is both coherent and logical, the title 320 can be integrated into the text 321. This approach not only maintains the consistency and fluency of information delivery but also makes the overall structure more compact. The method in this embodiment achieves an efficient text and image response generation mechanism by optimizing the information processing flow and combining multiple input sources (such as titles, text, user input, and images). This method not only improves the accuracy of automated response services but also ensures that the final output content is rich in dimensions, meeting the user's query needs.

[0050] In this embodiment, the operation at block 202 can be implemented using a lightweight multimodal model. Specifically, if the computing device 110 receives an image, it uses the multimodal model to generate a text description of the image based on the image and its context. This provides users with a fast-responding text-image response service. In this embodiment, the lightweight multimodal model can be trained using a large language model.

[0051] In this embodiment, during the training phase, a first text description is determined using a multimodal model based on a first image and its context, and a second text description is determined using a first language model based on the first image and its context. The first language model can be a pre-trained language model, such as a large language model. The first language model has a greater number of parameters than the multimodal model, and its network structure is typically much more complex. Because it has more parameters and a more complex network structure, this first language model can capture more nuanced linguistic features and generate high-quality text descriptions.

[0052] In this embodiment, a multimodal model is trained based on text description labels and a second text description. The second text description typically closely approximates the text description labels and serves as a valuable reference, preventing overfitting in the multimodal model. Large language models possess relatively accurate predictive capabilities; therefore, leveraging their predictions to train the multimodal model allows for knowledge transfer, improving accuracy while maintaining a smaller size, fewer parameters, and a simpler network structure.
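The training described in paragraphs [0051] and [0052] resembles knowledge distillation. The following is a minimal sketch under stated assumptions, not the disclosed training procedure: the source does not specify a loss function, so a weighted sum of two negative log-likelihood terms is assumed here, one against the text description label and one against the second text description produced by the larger language model. The names `sequence_nll`, `distillation_loss`, and the weight `alpha` are hypothetical, and the student's output is simplified to one probability distribution per output token, with all sequences assumed to have the same length.

```python
import math
from typing import Dict, List

# Hypothetical simplification: for each output position, the student
# (the lightweight multimodal model) emits a distribution over tokens.
Distribution = Dict[str, float]


def sequence_nll(predicted: List[Distribution], target_tokens: List[str]) -> float:
    """Average negative log-likelihood of target_tokens under the predictions."""
    eps = 1e-12  # avoids log(0) when the student gives a token no probability
    total = 0.0
    for dist, token in zip(predicted, target_tokens):
        total += -math.log(dist.get(token, 0.0) + eps)
    return total / max(len(target_tokens), 1)


def distillation_loss(
    predicted: List[Distribution],
    label_tokens: List[str],
    teacher_tokens: List[str],
    alpha: float = 0.5,
) -> float:
    """Assumed loss: alpha * loss vs. label + (1 - alpha) * loss vs. teacher output."""
    label_loss = sequence_nll(predicted, label_tokens)
    teacher_loss = sequence_nll(predicted, teacher_tokens)
    return alpha * label_loss + (1.0 - alpha) * teacher_loss
```

With `alpha` set to 1.0 the sketch reduces to ordinary supervised training on the labels; lower values give the second text description more influence.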

[0053] Figure 4 illustrates a schematic diagram of a method for generating a text-image response according to an embodiment of the present disclosure. In this embodiment, the document may include multiple sub-documents, each including an image and text. In this embodiment, a title for the image is generated using a multimodal model, and a text-image response is generated using a language model. At 402, the computing device 110 receives user input. For example, the user input is "How many types of headphones are there?" If the document is a webpage, after receiving the user input at 402, the computing device 110 can invoke a search engine to retrieve multiple webpages via the Internet as multiple sub-documents. These webpages may be webpages introducing various types of headphones, or webpages introducing only one type of headphone. For example, one of the multiple webpages may be about over-ear headphones, one about in-ear headphones, and another about semi-in-ear headphones. At 404, the computing device 110 can extract the text from each of the multiple webpages. At 406, the computing device 110 can extract the image from each of the multiple webpages. At 408, the computing device 110 can extract the image context from the multiple webpages.

[0054] In this embodiment, multiple images from multiple webpages can be stored as an image list, and multiple image contexts can also be stored as an image context list. Then, at 410, the image list and image context list are input into a multimodal model. At 414, the multimodal model can generate a title for each image, resulting in a title list. At 412, target prompts are determined based on the title list, the text list of the multiple webpages, and the image list. The user input, the title list, the image list, the text list of the multiple webpages, and the target prompts are then input into a language model. The language model performs comprehensive analysis based on the target prompts to generate a text-image response 416. For example, the language model can comprehensively analyze the text, images, and image titles from a webpage about over-ear headphones; the text, images, and image titles from a webpage about semi-in-ear headphones; and the text, images, and image titles from a webpage about in-ear headphones, generating a text-image response introducing these three types of headphones. In this embodiment, comprehensive semantic analysis can be performed on multiple webpages (or multiple sub-documents), which helps generate a more comprehensive text-image response.
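The flow at 404 to 414 can be sketched as follows. This is an illustrative sketch only: `SubDocument`, `build_model_inputs`, and `stub_title` are hypothetical names, the multimodal model is represented by a caller-supplied function, and the wording of the target prompt is an assumption, since the source does not give the prompt text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SubDocument:
    """One retrieved webpage: its text, one image, and the text around that image."""
    text: str
    image_url: str
    image_context: str


def build_model_inputs(
    user_input: str,
    sub_documents: List[SubDocument],
    generate_title: Callable[[str, str], str],
) -> Dict[str, object]:
    """Collect the lists from Figure 4 and assemble the language-model inputs."""
    text_list = [doc.text for doc in sub_documents]              # 404
    image_list = [doc.image_url for doc in sub_documents]        # 406
    context_list = [doc.image_context for doc in sub_documents]  # 408
    # 410/414: the multimodal model titles each image from the image and its context.
    title_list = [generate_title(img, ctx) for img, ctx in zip(image_list, context_list)]
    # 412: hypothetical prompt wording, not taken from the source.
    target_prompt = (
        "Answer the user input using the texts below. "
        "Reuse the listed images where they fit, referring to each by its title.\n"
        + "\n".join(f"- {title}: {url}" for title, url in zip(title_list, image_list))
    )
    return {
        "user_input": user_input,
        "title_list": title_list,
        "image_list": image_list,
        "text_list": text_list,
        "target_prompt": target_prompt,
    }


def stub_title(image_url: str, image_context: str) -> str:
    """Stand-in for the multimodal model: uses the first sentence of the context."""
    return image_context.split(".")[0].strip()
```

The returned dictionary holds the five inputs that paragraph [0054] says are passed to the language model; the call to that model is left out because the source does not name one.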

[0055] Language models can generate response text word by word when processing text, but they are often slow when processing images. Compared to language models, multimodal models are significantly faster at processing images and can respond quickly. Therefore, using multimodal models to process images and their context to generate titles, and using language models to process titles, user input, text, images, and target prompts to generate image-text responses, can achieve both fast processing speed and high accuracy.

[0056] In this embodiment, the computing device 110 can determine structural features based on the source code of a webpage. For webpages written in Hypertext Markup Language (HTML), this can be achieved by parsing tags at various levels (such as ..., <section>, and <article>). By analyzing the positions of these tags, the overall layout and hierarchical structure of a page can be understood. These tags define different parts of a webpage, such as headings, paragraphs, lists, and links, which together constitute the information architecture of the webpage. By analyzing these tags, the computing device 110 can identify the specific area where an image is located and its relationship with other elements.
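As a rough illustration of this kind of structural parsing, the sketch below uses Python's standard `html.parser` module to record, for each image, the chain of enclosing tags. The class `ImageLocator`, the function `locate_images`, and the choice of a slash-separated tag path as the "structural feature" are assumptions made for illustration; the source does not state how structural features are represented.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}


class ImageLocator(HTMLParser):
    """Tracks the stack of open tags and records where each <img> appears."""

    def __init__(self) -> None:
        super().__init__()
        self.stack: List[str] = []
        self.images: List[Dict[str, Optional[str]]] = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images.append(
                {"src": dict(attrs).get("src"), "path": "/".join(self.stack)}
            )
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag, tolerating unclosed children.
            while self.stack and self.stack.pop() != tag:
                pass


def locate_images(html: str) -> List[Dict[str, Optional[str]]]:
    """Return each image's src together with the path of its enclosing tags."""
    parser = ImageLocator()
    parser.feed(html)
    parser.close()
    return parser.images
```

For a page fragment such as `<section><article><p>Intro</p><img src="over.png"></article></section>`, the image would be reported with the path `section/article`, which places it inside a specific article of a specific section.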

[0057] In this embodiment, the computing device 110 can determine the location information of an image based on the structural features. Similarly, this can be done by determining the image's position within the overall document structure based on the position of a tag or attribute (e.g., "href"). In this embodiment, the computing device 110 can embed the image's text description into the text based on the image's location information to obtain updated text. This means that instead of simply appending the description to the end of the text, a position that best reflects the image's relevance to its surrounding content is chosen. For example, if the image is at the beginning of a paragraph, the description can be placed at the beginning of the paragraph as an introduction. Similarly, if the image is located in the middle of the text, the image's text description can be embedded at that position. This not only maintains the natural flow of information but also enhances the relevance and readability of the text description. In this embodiment, the computing device 110 can generate target prompts based on the image and the updated text. By further clarifying the position of the image and its text description within the document, this embodiment improves the accuracy of the analysis and of the contextual understanding, which also helps improve the quality of the generated text-image response.
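The position-aware embedding described above can be sketched as follows, assuming the document has already been split into an ordered sequence of text and image blocks. The block representation, the function name `embed_descriptions`, and the bracketed marker format are all hypothetical; the source only states that the description is placed where the image occurs rather than at the end of the text.

```python
from typing import Dict, List, Tuple

# Hypothetical document representation: ("text", content) or ("image", image_id),
# kept in the order in which the blocks appear in the source document.
Block = Tuple[str, str]


def embed_descriptions(blocks: List[Block], descriptions: Dict[str, str]) -> str:
    """Build updated text with each image's description at the image's position."""
    parts: List[str] = []
    for kind, value in blocks:
        if kind == "text":
            parts.append(value)
        elif kind == "image":
            description = descriptions.get(value)
            if description:  # images without a description are skipped
                parts.append(f"[Image {value}: {description}]")
    return "\n".join(parts)
```

Because the marker carries the image identifier, a downstream model can still refer back to the original image when it decides where to place it in the response.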

[0058] Figure 5 illustrates a schematic diagram of a method for generating a response according to an embodiment of the present disclosure. In this embodiment, images of documents corresponding to high-frequency inputs can be pre-processed, and a text description of the image can be obtained using a multimodal model and stored locally. This way, when the same user input is received again, there is no need to calculate the text description of the image, which helps to achieve a faster response.

[0059] At 502, the computing device 110 can receive user input. The computing device 110 can record the frequency of each user input. After each user input is received, the frequency of that user input is updated. At 504, a threshold is used to determine whether the user input is a high-frequency input. For example, popular news items can be considered high-frequency input. If it is a high-frequency input, meaning the frequency of the user input is higher than the threshold, then at 506, the text description (such as the title) of the image is read from the offline processing data. For example, the corresponding document is found based on the user input, and the text description of the image within that document is directly read. After reading the offline processing data, the process proceeds directly to 510, completing the determination of the text description. If it is not a high-frequency input, meaning the frequency of the user input is less than or equal to the threshold, then at 508, the document is processed online, and the text, image, and image context are determined. At 510, a multimodal model is invoked to generate the text description of the image based on the image and its context. At 512, the image's text description can be embedded into the text based on its position or distribution within the document, generating updated text that offers richer references and is more readable. At 514, the computing device 110 can store the image in memory to ensure it is not affected by updates to the source address. At 516, the computing device 110 can determine target prompts based on the image, its text description, and the text in the document, and then invoke a language model to generate a text-image response based on the user input, the image, the updated text, and the target prompts. In this embodiment, user input is handled through separate online and offline paths, which improves the processing efficiency, stability, and cross-platform display capabilities of the image's text description, and is particularly effective for popular user input types.
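The branch at 504 to 510 can be sketched as a frequency counter combined with a local store. This is a minimal sketch under assumptions: the class name `DescriptionService`, the in-memory dictionary used as the offline store, and the decision to populate the store the first time an input crosses the threshold are all hypothetical, and `describe_online` stands in for the online processing and multimodal model call.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple


class DescriptionService:
    """Serves image text descriptions, reusing stored ones for frequent inputs."""

    def __init__(self, threshold: int, describe_online: Callable[[str], List[str]]):
        self.threshold = threshold
        self.describe_online = describe_online
        self.frequency: Counter = Counter()
        self.offline_store: Dict[str, List[str]] = {}

    def get_descriptions(self, user_input: str) -> Tuple[List[str], str]:
        self.frequency[user_input] += 1                              # 502
        is_high_frequency = self.frequency[user_input] > self.threshold  # 504
        if is_high_frequency and user_input in self.offline_store:
            return self.offline_store[user_input], "offline"         # 506
        descriptions = self.describe_online(user_input)              # 508 and 510
        if is_high_frequency:
            # Assumed policy: keep the result so later requests skip the model call.
            self.offline_store[user_input] = descriptions
        return descriptions, "online"
```

With a threshold of 2, the first three requests for the same input go through the online path, and the fourth is answered from the store without invoking the model.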

[0060] Figure 6 illustrates a schematic diagram of the structure of a text-image response according to an embodiment of the present disclosure. In this embodiment, a user can use device 610 to access the response service. In the interface of device 610, the user can launch a target application and enter user input 620, "What types of headphones are there?", into the target application. The target application can provide a text-image response 630 according to the method in the above embodiments. In this embodiment, the target application can determine a target prompt word, which also instructs that the text-image response be generated according to a target format. The target application can generate the text-image response according to the target format using a language model based on the user input, the images, the text descriptions of the images, the text in the document, and the target prompt word. The target format has multiple regions 632, 634, 636, and 638, at least one of which has multiple levels, which can be distinguished, for example, by "#". One of the levels includes an image link indicating the location of an image. In this embodiment, the text-image response is generated in Markdown format. Region 638 includes a first-level heading "# What types of headphones are there?". Region 632 includes a second-level heading "## Over-ear Headphones", followed by the main text, which includes an image of over-ear headphones. Region 634 includes a second-level heading "## In-ear Headphones", followed by the main text, which includes an image of in-ear headphones. Region 636 includes a second-level heading "## Semi-in-ear Headphones", followed by the main text, which includes an image of semi-in-ear headphones.
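The target format described above can be illustrated with a small renderer that produces a first-level heading, one second-level heading per region, and a Markdown image link in each region. In the disclosure the language model itself is instructed to emit this format; the function below only shows what such output might look like. The function name `render_markdown`, the dictionary keys, and the example URL are hypothetical.

```python
from typing import Dict, List


def render_markdown(question: str, sections: List[Dict[str, str]]) -> str:
    """Render a text-image response: '#' title, '##' regions, and image links."""
    lines = [f"# {question}", ""]
    for section in sections:
        lines += [
            f"## {section['heading']}",
            "",
            section["body"],
            "",
            # Markdown image syntax: ![alt text](link to the image location)
            f"![{section['image_title']}]({section['image_url']})",
            "",
        ]
    return "\n".join(lines).rstrip() + "\n"


example = render_markdown(
    "What types of headphones are there?",
    [
        {
            "heading": "Over-ear Headphones",
            "body": "Over-ear headphones enclose the whole ear.",
            "image_title": "Over-ear headphones",
            "image_url": "https://example.com/over-ear.png",  # placeholder URL
        }
    ],
)
```

Because the image is referenced by link rather than embedded as binary data, any client that renders Markdown can display the same layout.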

[0061] This method of rendering text and images not only maintains the readability of the text but also achieves a consistent display effect across different platforms. The language model can intelligently adjust the placement and order of inserted images based on the document's native layout information and the semantic relevance of the images, thereby providing accurate and vivid responses.

[0062] Figure 7 shows a schematic block diagram of an example device 700 according to some embodiments of the present disclosure. Device 700 can be implemented by software, hardware, or a combination of both. As shown in Figure 7, device 700 includes a text description generation module 710, a prompt word determination module 720, and a text-image response generation module 730.

[0063] In some embodiments, the text description generation module 710 can be configured to generate a text description of an image based on the image in the document and the image context. The prompt word determination module 720 can be configured to determine target prompt words based on the image, the image's text description, and the text in the document. The target prompt words are used to reuse the image in a text-image response to be generated in response to user input. The text-image response generation module 730 can be configured to generate the text-image response using a target model based on the user input, the image, the image's text description, the text in the document, and the target prompt words.

[0064] In some embodiments, the document includes a webpage, and the text description generation module 710 includes a first parsing module configured to parse the source code of the webpage based on multiple target keywords to obtain an image and an image context. The text description generation module 710 also includes a first generation module configured to generate a text description of the image based on the image and the image context.

[0065] In some embodiments, the text description includes a title, and the device 700 further includes an offline processing module configured to store the image title in memory in response to user input occurring more frequently than a threshold. The device 700 also includes a first acquisition module configured to acquire the image title from memory in response to receiving user input.

[0066] In some embodiments, the text description generation module 710 includes a first model invocation module configured to generate a text description of an image using a multimodal model based on the image and the image context in response to receiving an image.

[0067] In some embodiments, the apparatus 700 further includes a first prediction module configured to determine a first text description using a multimodal model based on a first image and a first image context. The apparatus 700 also includes a second prediction module configured to determine a second text description using a first language model based on the first image and a first image context, wherein the first language model is a trained language model with a greater number of parameters than the multimodal model. The apparatus 700 further includes a training module configured to train the multimodal model based on text description labels and the second text description.

[0068] In some embodiments, the document includes multiple images, and the apparatus 700 further includes a filtering module configured to remove the target image in response to the target image being a target type image among the multiple images.

[0069] In some embodiments, the prompt word determination module 720 includes a structure determination module configured to determine structural features based on the source code of a webpage. The prompt word determination module 720 also includes a location determination module configured to determine the location information of an image based on the structural features. The prompt word determination module 720 further includes an embedding module configured to embed the image's text description into text based on the image's location information to obtain updated text. The prompt word determination module 720 also includes a second generation module configured to generate target prompt words using a target model based on the image and the updated text.

[0070] In some embodiments, the target prompt also instructs the generation of a text-image response according to a target format, and the text-image response generation module 730 includes a format processing module configured to generate a text-image response according to a target format using a target model based on user input, an image, a text description of the image, text in a document, and the target prompt, wherein the target format has multiple regions, at least one of the multiple regions has multiple levels, and one of the multiple levels includes an image link indicating the location of an image.

[0071] In some embodiments, the document includes multiple sub-documents, each sub-document including multiple images and multiple texts. The text description generation module 710 includes a third generation module configured to generate text descriptions for the multiple images based on the multiple images and their contexts. The prompt word determination module 720 includes a fourth generation module configured to generate target prompt words based on the multiple images included in the multiple sub-documents, the multiple text descriptions of the multiple images, and the multiple sub-texts included in the multiple sub-documents.

[0072] The division of modules or units in the embodiments of this disclosure is illustrative and only represents one logical functional division. In actual implementation, there may be other division methods. Furthermore, the functional units in the disclosed embodiments may be integrated into one unit, exist as separate physical entities, or two or more units may be integrated into one unit. The integrated unit described above can be implemented in hardware or as a software functional unit.

[0073] Figure 8 shows a block diagram of an example device 800 that can be used to implement embodiments of the present disclosure. It should be understood that the device 800 shown in Figure 8 is merely exemplary and should not be construed as limiting the functionality and scope of the implementations described herein. For example, device 800 can be used to implement the computing device 110 described herein in conjunction with Figure 1 and can be used to perform the processes of Figures 1 to 6 described above.

[0074] As shown in Figure 8, device 800 is in the form of a general-purpose computing device. Components of computing device 800 may include, but are not limited to, one or more processors or processing units 810, memory 820, storage devices 830, one or more communication units 840, one or more input devices 850, and one or more output devices 860. Processing unit 810 may be a physical or virtual processor and is capable of performing various processes according to programs stored in memory 820. In a multiprocessor system, multiple processing units execute computer-executable instructions in parallel to improve the parallel processing capability of computing device 800.

[0075] Computing device 800 typically includes multiple computer storage media. Such media can be any available media accessible to computing device 800, including but not limited to volatile and non-volatile media, removable and non-removable media. Memory 820 can be volatile memory (e.g., registers, cache, random access memory (RAM)), non-volatile memory (e.g., read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof. Storage device 830 can be removable or non-removable media and may include machine-readable media, such as flash drives, disks, or any other media capable of storing information and/or data (e.g., training data for training) and accessible within computing device 800.

[0076] The computing device 800 may further include additional removable/non-removable, volatile/non-volatile storage media. Although not shown in Figure 8, disk drives for reading from or writing to removable, non-volatile disks (e.g., "floppy disks") and optical disk drives for reading from or writing to removable, non-volatile optical disks may be provided. In these cases, each drive may be connected to a bus (not shown) via one or more data media interfaces. The memory 820 may include a computer program product 825 having one or more program modules configured to perform various methods or actions of various implementations of the present disclosure.

[0077] The communication unit 840 enables communication with other computing devices via a communication medium. Additionally, the components of the computing device 800 can function as a single computing cluster or multiple computing machines capable of communicating via communication connections. Therefore, the computing device 800 can operate in a networked environment using logical connections to one or more other servers, network personal computers (PCs), or another network node.

[0078] Input device 850 can be one or more input devices, such as a mouse, keyboard, trackball, etc. Output device 860 can be one or more output devices, such as a monitor, speaker, printer, etc. As needed, computing device 800 can also communicate via communication unit 840 with one or more external devices (not shown), such as storage devices and display devices, with one or more devices that enable user interaction with computing device 800, or with any device (e.g., network card, modem, etc.) that enables computing device 800 to communicate with one or more other computing devices. Such communication can be performed via an input/output (I/O) interface (not shown).

[0079] According to an exemplary implementation of this disclosure, a computer-readable storage medium is provided that stores computer-executable instructions thereon, wherein the computer-executable instructions are executed by a processor to implement the methods described above. According to an exemplary implementation of this disclosure, a computer program product is also provided, which is tangibly stored on a non-transitory computer-readable medium and includes computer-executable instructions, which are executed by a processor to implement the methods described above. According to an exemplary implementation of this disclosure, a computer program product is provided that stores a computer program thereon, which, when executed by a processor, implements the methods described above.

[0080] Various aspects of this disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatuses, devices, and computer program products implemented according to this disclosure. It should be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

[0081] These computer-readable program instructions can be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine such that, when executed by the processing unit of the computer or other programmable data processing apparatus, they create means for implementing the functions/actions specified in one or more blocks of the flowchart and/or block diagram. These computer-readable program instructions can also be stored in a computer-readable storage medium that causes a computer, programmable data processing apparatus, and/or other device to operate in a particular manner. Thus, the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions for implementing aspects of the functions/actions specified in one or more blocks of the flowchart and/or block diagram.

[0082] Computer-readable program instructions can be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, thereby causing the instructions that execute on the computer, other programmable data processing apparatus, or other device to perform the functions/actions specified in one or more blocks of a flowchart and/or block diagram.

[0083] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of an instruction, which contains one or more executable instructions for implementing the specified logical function. In some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.

[0084] Various implementations of this disclosure have been described above. The foregoing description is exemplary and not exhaustive, nor is it limited to the disclosed implementations. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described implementations. The terminology used herein is chosen to best explain the principles, practical applications, or improvements to technology in the market, or to enable others skilled in the art to understand the various implementations disclosed herein.

Claims

1. A method for generating a response, comprising: generating a text description of an image based on the image in a document and an image context; determining a target prompt word based on the image, the text description of the image, and text in the document, the target prompt word being used to reuse the image in a text-image response to be generated for a user input; and generating the text-image response using a target model based on the user input, the image, the text description of the image, the text in the document, and the target prompt word.

2. The method according to claim 1, wherein the document includes a webpage, and generating the text description of the image based on the image in the document and the image context comprises: parsing source code of the webpage based on multiple target keywords to obtain the image and the image context; and generating the text description of the image based on the image and the image context.

3. The method according to claim 2, wherein the text description includes a title, and the method further comprises: in response to a frequency of the user input exceeding a threshold, storing the title of the image in a memory; and in response to receiving the user input, obtaining the title of the image from the memory.

4. The method according to claim 2, wherein generating the text description of the image based on the image and the image context comprises: in response to receiving the image, generating the text description of the image using a multimodal model based on the image and the image context.

5. The method according to claim 4, further comprising: determining a first text description using the multimodal model based on a first image and a first image context; determining a second text description using a first language model based on the first image and the first image context, wherein the first language model is a trained language model and the number of parameters of the first language model is greater than the number of parameters of the multimodal model; and training the multimodal model based on a text description label and the second text description.

6. The method according to claim 1, wherein the document comprises a plurality of images, and the method further comprises: in response to a target image among the plurality of images being an image of a target type, removing the target image.

7. The method according to claim 2, wherein determining the target prompt word based on the image, the text description of the image, and the text in the document comprises: determining structural features based on the source code of the webpage; determining location information of the image based on the structural features; embedding the text description of the image into the text based on the location information of the image to obtain updated text; and generating the target prompt word based on the image and the updated text.

8. The method according to claim 2, wherein the target prompt word further instructs that the text-image response be generated according to a target format, and generating the text-image response using the target model based on the user input, the image, the text description of the image, the text in the document, and the target prompt word comprises: generating the text-image response according to the target format using the target model based on the user input, the image, the text description of the image, the text in the document, and the target prompt word, wherein the target format has multiple regions, at least one of the multiple regions has multiple levels, and one of the multiple levels includes an image link indicating the location of the image.

9. The method according to claim 1, wherein the document includes multiple sub-documents, the multiple sub-documents include multiple images and multiple texts, and generating the text description of the image based on the image in the document and the image context comprises: generating multiple text descriptions of the multiple images based on the multiple images and multiple image contexts; and wherein determining the target prompt word based on the image, the text description of the image, and the text in the document comprises: generating the target prompt word based on the multiple images included in the multiple sub-documents, the multiple text descriptions of the multiple images, and the multiple sub-texts included in the multiple sub-documents.

10. An apparatus for generating a response, comprising: a text description generation module configured to generate a text description of an image based on the image in a document and an image context; a prompt word determination module configured to determine a target prompt word based on the image, the text description of the image, and text in the document, the target prompt word being used to reuse the image in a text-image response to be generated for a user input; and a text-image response generation module configured to generate the text-image response using a target model based on the user input, the image, the text description of the image, the text in the document, and the target prompt word.

11. An electronic device, comprising: at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the electronic device to perform the method according to any one of claims 1 to 9.

12. A computer program product having a computer program stored thereon, which, when executed by a processor, implements the method according to any one of claims 1 to 9.