Text generation method and apparatus, and cloud server, electronic device and storage medium

By automatically performing text and image recognition upon detecting and confirming operations on the target interface of an electronic device, and generating text using a large model algorithm, the problem of cumbersome text generation operations in existing technologies is solved, improving convenience and user experience.

WO2026138130A1 · PCT designated stage · Publication Date: 2026-07-02 · GUANG DONG MING CHUANG SOFTWARE TECH CORP

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
GUANG DONG MING CHUANG SOFTWARE TECH CORP
Filing Date
2025-10-28
Publication Date
2026-07-02

AI Technical Summary

Technical Problem

In existing technologies, users need to perform cumbersome operations to generate text corresponding to content through electronic devices, resulting in low convenience and reduced user experience.

Method used

Upon detecting a confirmation action on the target interface, text recognition and image recognition operations are automatically performed. Semantic analysis is performed using large model algorithms to generate target text information, and interactive controls are displayed when a virtual keyboard is detected to improve convenience.

Benefits of technology

It simplifies the process of users generating text, improves convenience, reduces the steps and learning costs of manual input, and enhances the user experience.



Abstract

The present application relates to the technical field of artificial intelligence and is applied to an electronic device. Disclosed are a text generation method and apparatus, and a cloud server, an electronic device and a storage medium. The method comprises: when it is detected that an electronic device displays a target interface, in response to a confirmation operation performed by a target user for a target control, performing a text recognition operation and an image recognition operation on display content in the target interface, so as to obtain data content to be analyzed, wherein the target interface is used for receiving input content from the target user, and said data content comprises at least one of text content and image content; and performing semantic analysis on said data content by means of a large model algorithm, so as to obtain target text information, which serves as the input content from the target user. The convenience for a target user to acquire target text information by means of a large model algorithm is improved.

Description

Text generation method and apparatus, cloud server, electronic device, and storage medium

[0001] Cross-reference to related applications

[0002] This application claims priority to Chinese Patent Application No. 2024119684980, filed on December 27, 2024, entitled "Text Generation Method, Apparatus, Cloud Server, Electronic Device and Storage Medium", the entire contents of which are incorporated herein by reference.

Technical Field

[0003] This application relates to the field of artificial intelligence technology, and more specifically, to a text generation method, apparatus, cloud server, electronic device, and storage medium.

Background Technology

[0004] Currently, with the development of electronic information technology, users can generate text corresponding to content using electronic devices. However, generating text using electronic devices currently requires users to perform relatively cumbersome operations, resulting in low convenience.

Summary of the Invention

[0005] This application proposes a text generation method, apparatus, cloud server, electronic device, and storage medium to remedy the above-mentioned deficiencies.

[0006] In a first aspect, embodiments of this application provide a text generation method applied to an electronic device. The method includes: upon detecting that the electronic device displays a target interface, in response to a confirmation operation by a target user on a target control, performing text recognition and image recognition operations on the displayed content of the target interface to obtain data content to be analyzed, wherein the target interface is used to receive input content from the target user, and the data content to be analyzed includes at least one of text content and image content; and performing semantic analysis on the data content to be analyzed using a large model algorithm to obtain target text information as the input content of the target user.

[0007] In a second aspect, embodiments of this application further provide a text generation apparatus applied to an electronic device. The apparatus includes a recognition unit and a text generation unit. The recognition unit is configured to, upon detecting that the electronic device displays a target interface, perform text recognition and image recognition operations on the displayed content of the target interface in response to a confirmation operation by a target user on a target control, thereby obtaining data content to be analyzed. The target interface is used to receive input content from the target user, and the data content to be analyzed includes at least one of text content and image content. The text generation unit is configured to perform semantic analysis on the data content to be analyzed using a large model algorithm to obtain target text information as the input content of the target user.

[0008] In a third aspect, embodiments of this application further provide a cloud server configured to obtain the data content to be analyzed that the electronic device acquired using the method described in the first aspect, and to perform semantic analysis on that data content using a large model algorithm to obtain target text information as the input content of the target user.

[0009] In a fourth aspect, embodiments of this application further provide an electronic device, including: one or more processors; a memory; and one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, and the one or more application programs are configured to perform the method described in the first aspect.

[0010] In a fifth aspect, embodiments of this application further provide a computer-readable storage medium storing processor-executable program code, which, when executed by the processor, causes the processor to perform the above-described method.

[0011] Other features and advantages of the embodiments of this application will be set forth in the following description, and will be apparent in part from the description, or may be learned by practicing the embodiments of this application. The objects and other advantages of the embodiments of this application may be realized and obtained by means of the structures particularly pointed out in the written description, claims, and drawings.

Brief Description of the Drawings

[0012] To illustrate the technical solutions in the embodiments of this application more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Obviously, the accompanying drawings described below show only some embodiments of this application. Those skilled in the art can obtain other drawings based on these drawings without creative effort.

[0013] Figure 1 shows a flowchart of the text generation method provided in an embodiment of this application;

[0014] Figure 2 shows a schematic diagram of the target interface provided in an embodiment of this application;

[0015] Figure 3 shows a schematic diagram of the target interface provided in another embodiment of this application;

[0016] Figure 4 illustrates a schematic diagram of the target text information provided in an embodiment of this application;

[0017] Figure 5 shows a flowchart of a text generation method provided in another embodiment of this application;

[0018] Figure 6 shows a flowchart of a text generation method provided in another embodiment of this application;

[0019] Figure 7 shows a flowchart of a text generation method provided in another embodiment of this application;

[0020] Figure 8 shows a flowchart of a text generation method provided in another embodiment of this application;

[0021] Figure 9 shows a flowchart of the implementation of each step in Figure 8;

[0022] Figure 10 shows a schematic diagram of the interaction between the scene intelligence application and the screen recognition application provided in the embodiments of this application;

[0023] Figure 11 shows a flowchart of the interaction between the scene intelligence application and the screen recognition application provided in the embodiment of this application;

[0024] Figure 12 shows a structural block diagram of the method for acquiring data content to be analyzed according to an embodiment of this application;

[0025] Figure 13 shows a structural block diagram of the text generation apparatus provided in an embodiment of this application;

[0026] Figure 14 shows a schematic diagram of the cloud server provided in an embodiment of this application;

[0027] Figure 15 shows a structural block diagram of the electronic device provided in an embodiment of this application;

[0028] Figure 16 shows a structural block diagram of a computer-readable storage medium provided in an embodiment of this application.

Detailed Description

[0029] To enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present application, and not all of them. The components of the embodiments of the present application described and shown in the accompanying drawings can generally be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of the present application provided in the accompanying drawings is not intended to limit the scope of the claimed application, but merely represents selected embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort are within the scope of protection of the present application.

[0030] It should be noted that similar reference numerals and letters in the following figures indicate similar items; therefore, once an item is defined in one figure, it does not need to be further defined and explained in subsequent figures. Furthermore, in the description of this application, terms such as "first," "second," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.

[0031] Currently, with the development of electronic information technology, users can generate text corresponding to content using electronic devices. However, generating text using electronic devices currently requires users to perform relatively cumbersome operations, resulting in low convenience. How to improve the convenience of generating text using electronic devices is an urgent problem to be solved.

[0032] Currently, to generate text corresponding to content using electronic devices, users can manually control their devices to run a designated application that uses artificial intelligence (AI) technology to generate text. Then, the content to be analyzed by AI is input into the application; this content can be either an image or text. After receiving the user's input, the application can analyze it using AI technology to obtain the corresponding text.

[0033] However, the inventors found in their research that this approach requires users to manually launch the specified application and then input the content into it. These operations are relatively complex and cumbersome, which is inconvenient and degrades the user experience.

[0034] Therefore, in order to overcome the above-mentioned defects, this application provides a text generation method, apparatus, cloud server, electronic device and storage medium.

[0035] Please refer to Figure 1, which illustrates a text generation method provided in an embodiment of this application. This text generation method can be applied to an electronic device; specifically, the processor of the electronic device can serve as the execution entity of the method. The text generation method may include steps S110 and S120.

[0036] The text generation method provided in this application can be applied to electronic devices, such as smartphones, smart tablets, laptops, and servers.

[0037] Step S110: When the target interface is detected to be displayed on the electronic device, in response to the confirmation operation of the target user on the target control, text recognition operation and image recognition operation are performed on the displayed content of the target interface to obtain the data content to be analyzed. The target interface is used to receive the input content of the target user, and the data content to be analyzed includes at least one of text content and image content.

[0038] Electronic devices can run an operating system, which in turn allows them to run various applications to perform their corresponding functions. Each application can have a different display interface, and the target user can view the content displayed on that interface. In some implementations, certain display interfaces can also be used to receive input from the target user. For example, the display interface may include an input box, which is used to receive input from the target user.

[0039] Therefore, to improve the convenience for target users to generate text corresponding to content using electronic devices, the display interface of the electronic device can be detected. For a display interface that is used to receive input from the target user (that is, a target interface), the electronic device can further detect whether the target user has performed a confirmation operation on the target control. The user's input can be text information.

[0040] The target control can be a control that a target user can interact with, allowing the target user to input a confirmation operation through the target control. For example, the target control can be a first interactive control displayed on the target interface of the electronic device; or, for example, a physical button located on the side of the electronic device. Thus, the target user can interact with the displayed first interactive control, such as clicking it; or interact with the physical button, such as pressing it, to input a confirmation operation through the target control. Detailed descriptions can be found in subsequent embodiments.

[0041] For example, please refer to Figure 2, which shows a schematic diagram of a target interface provided in an embodiment of this application. In the target interface 200 shown in Figure 2, the target interface 200 is a sharing interface of a social application. The target interface 200 includes an input box 210, in which the target user can enter text information and then share the entered text information.

[0042] As another example, please refer to Figure 3, which shows a schematic diagram of a target interface provided in another embodiment of this application. In the target interface 300 shown in Figure 3, the target interface 300 is a sharing interface of a community-type application. The target interface 300 includes an input box 310, in which the target user can enter text information and then share the entered text information.

[0043] As described above, the target user can input a confirmation operation through the target control, and the electronic device can respond to the target user's confirmation operation on the target control by performing text recognition and image recognition operations on the displayed content of the target interface to obtain the data content to be analyzed.

[0044] The content displayed on the target interface may include both image content and text content.

[0045] In other words, after detecting a user's confirmation action on the target control, the electronic device can automatically perform text recognition and image recognition operations on the displayed content of the target interface to obtain the data content to be analyzed. The data content to be analyzed may include at least one of text content and image content.
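Step S110 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the `DataContent` type, the `acquire_data_content` function, and the two stub recognizers are hypothetical names introduced here, and returning nothing when both recognizers come back empty is an assumption based on the requirement that the data content include at least one of text content and image content.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class DataContent:
    """Data content to be analyzed: text content, image content, or both."""
    text: List[str] = field(default_factory=list)
    images: List[str] = field(default_factory=list)

    def is_empty(self) -> bool:
        return not self.text and not self.images


def acquire_data_content(
    is_target_interface: bool,
    confirmed: bool,
    screen: bytes,
    recognize_text: Callable[[bytes], List[str]],
    recognize_images: Callable[[bytes], List[str]],
) -> Optional[DataContent]:
    """Run both recognizers only after a confirmation on a target interface."""
    if not (is_target_interface and confirmed):
        return None
    content = DataContent(
        text=recognize_text(screen),
        images=recognize_images(screen),
    )
    # Assumption: with neither text nor image content there is nothing to analyze.
    return None if content.is_empty() else content


# Hypothetical stand-ins for real text and image recognition engines.
def fake_text_recognizer(screen: bytes) -> List[str]:
    return ["Sunset at the beach"]


def fake_image_recognizer(screen: bytes) -> List[str]:
    return ["photo: beach, sunset"]
```

Passing the recognizers in as callables keeps the gate logic independent of whichever recognition engine a device actually uses.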

[0046] Step S120: Perform semantic analysis on the data content to be analyzed using a large model algorithm to obtain the target text information, which serves as the input content for the target user.

[0047] Furthermore, after obtaining the data content to be analyzed, semantic analysis can be performed on the data content to obtain the target text information corresponding to the data content to be analyzed, and the target text information can be used as the input content of the target user in the target interface.

[0048] In some implementations, large model algorithms can be invoked to perform semantic analysis on the data to be analyzed. These large model algorithms can call upon multimodal models to understand and process multimodal content (such as text and image content) to generate corresponding text content. This large model algorithm is an algorithm based on artificial intelligence technology.

[0049] Optionally, the large model algorithm can be deployed locally on the electronic device, so that after the electronic device obtains the data content to be analyzed, it can directly call the large model algorithm locally to perform semantic analysis on the data content to be analyzed and obtain the target text information as the input content of the target user.

[0050] Optionally, the large model algorithm can also be deployed on a cloud server, with the electronic device establishing a communication connection with the cloud server. After acquiring the data content to be analyzed, the electronic device can upload it to the cloud server, where the large model algorithm can be invoked to perform semantic analysis on it, obtaining the target text information, which serves as the input content of the target user. After obtaining the target text information, the cloud server can send it to the electronic device via the communication connection. Deploying the large model algorithm on a cloud server saves storage space and computing resources on the electronic device.
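The two deployment options can be sketched as a simple dispatch. The `generate_target_text` function and the `Analyzer` alias are hypothetical, and trying the local model before the cloud server is an assumption made only for this illustration; the application presents the two deployments as alternatives without ranking them.

```python
from typing import Callable, Optional

# An analyzer takes the data content to be analyzed and returns target text information.
Analyzer = Callable[[str], str]


def generate_target_text(
    data_content: str,
    local_model: Optional[Analyzer] = None,
    cloud_model: Optional[Analyzer] = None,
) -> str:
    """Perform semantic analysis with whichever deployment is available."""
    if local_model is not None:
        # Large model algorithm deployed locally on the electronic device.
        return local_model(data_content)
    if cloud_model is not None:
        # Large model algorithm deployed on a cloud server; the callable is
        # assumed to upload the data and return the server's response.
        return cloud_model(data_content)
    raise RuntimeError("no large model deployment is available")
```

In a real system the cloud callable would wrap the communication connection described above; here it is left abstract so the sketch stays self-contained.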

[0051] In some implementations, the acquired target text information can be displayed on the target interface. For example, please refer to Figure 4, which shows a schematic diagram of the target text information provided in the embodiments of this application. The target interface 400 shown in Figure 4 includes target text information 410, which is obtained by semantic analysis of the data content to be analyzed obtained by performing text recognition and image recognition operations on the displayed content of the target interface 200 shown in Figure 2 using a large model algorithm.

[0052] The text generation method provided in this application, upon detecting that the electronic device displays a target interface, responds to a confirmation operation by the target user on a target control, performs text recognition and image recognition operations on the displayed content of the target interface to obtain data content to be analyzed. The target interface is used to receive input content from the target user, and the data content to be analyzed includes at least one of text content and image content. Semantic analysis is then performed on the data content to be analyzed using a large model algorithm to obtain target text information, which is the input content of the target user. In this solution, upon detecting that the electronic device displays a target interface, the target text information corresponding to the displayed content of the target interface can be obtained in response to a confirmation operation by the target user through the target control, improving the convenience for the target user to obtain the target text information.

[0053] Please refer to Figure 5, which illustrates a text generation method provided in another embodiment of this application. This text generation method can be applied to an electronic device; specifically, the processor of the electronic device can serve as the execution entity of the method. The text generation method may include steps S210 to S230.

[0054] Step S210: When the electronic device is detected to be displaying a target interface, a first interactive control is displayed at a designated location on the target interface.

[0055] In the text generation method provided in this application embodiment, the target control can be a first interactive control. Therefore, when the electronic device detects that a target interface is displayed, the first interactive control can be displayed at a designated location on the target interface. The target interface is used to receive input content from the target user.

[0056] The first interactive control can be a control that the target user can interact with, allowing the target user to input a confirmation operation through the first interactive control. For example, if the electronic device is equipped with a touch screen, the target user can interact with the first interactive control through the touch screen. Specifically, the target user can click the first interactive control to input a confirmation operation; or, the target user can click the first interactive control and then slide it to input a confirmation operation.

[0057] Optionally, the electronic device may also be equipped with a voice input unit, such as a microphone, so that the target user can emit voice to interact with the first interactive control. For example, the target user can emit the voice "click the first interactive control", and after the electronic device acquires the voice emitted by the target user through the microphone, it can perform voice recognition on the voice and then confirm that the target user has input a confirmation operation through the first interactive control.

[0058] It should be noted that the target user's interaction with the first interactive control can be regarded as a confirmation operation.
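The interaction types described above (a click, a click followed by a slide, and a voice command) can be normalized into a single confirmation check. The sketch below is illustrative only: the `InputEvent` type, the event kind labels, and the exact-match comparison against the spoken phrase are assumptions, not details given in the application.

```python
from dataclasses import dataclass

FIRST_CONTROL = "first_interactive_control"  # hypothetical control identifier
VOICE_COMMAND = "click the first interactive control"  # example phrase from the text


@dataclass(frozen=True)
class InputEvent:
    kind: str             # "tap", "tap_then_slide", or "voice" (hypothetical labels)
    target: str = ""      # control hit by a touch gesture
    transcript: str = ""  # recognized speech for voice events


def is_confirmation(event: InputEvent) -> bool:
    """Treat any supported interaction with the first interactive control as a confirmation."""
    if event.kind in ("tap", "tap_then_slide"):
        return event.target == FIRST_CONTROL
    if event.kind == "voice":
        # Assumption: the voice recognition result is compared as normalized text.
        return event.transcript.strip().lower() == VOICE_COMMAND
    return False
```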

[0059] Furthermore, please refer to Figure 2, which also shows a first interactive control 220. This first interactive control 220 is located above and adjacent to the display area of the virtual keyboard 230 in the target interface 200; that is, the designated position is above and adjacent to the display area of the virtual keyboard 230. The first interactive control 220 shown in Figure 2 is an icon with a text label that reads "Ai Help Writing".

[0060] Additionally, please refer to Figure 3, which also shows the first interactive control 320. This first interactive control 320 is located above all the other interactive controls in the target interface 300; that is, the designated position is above all the interactive controls. The first interactive control 320 shown in Figure 3 is an icon with a text label that reads "Ai Copywriting Assistant".

[0061] It should be noted that the appearance, text labels, and designated positions of the first interactive controls shown in the above diagrams are merely examples and do not constitute a specific limitation on this application. In practical applications, adjustments can be made flexibly according to requirements.

[0062] The content displayed on the target interface may include image content and text content. For example, please refer to Figure 2, where the target interface 200 includes image content 280 and text content 290.

[0063] In some implementations, when the target interface is detected to be displayed on the electronic device, the display status of the virtual keyboard can be further obtained to determine whether the first interactive control should be displayed. Specifically, step S210 may also include steps S211 to S214.

[0064] Step S211: When the electronic device is detected to be displaying a target interface, the display status of the virtual keyboard in the target interface is obtained.

[0065] Step S212: Determine whether the virtual keyboard is displayed.

[0066] If the target interface is detected to be displayed on the electronic device, the display status of the virtual keyboard on the target interface can be further obtained.

[0067] Understandably, if a target user wants to input content on the target interface, they generally need to use a virtual keyboard. Therefore, if a virtual keyboard is detected on the target interface, it can be determined that the target user is highly likely to need to input content. At this point, a first interactive control can be displayed at a designated location on the target interface.

[0068] In other words, if the virtual keyboard is detected to be displayed, the process can proceed to step S213 to display the first interactive control at a specified location on the target interface; if the virtual keyboard is not detected to be displayed, the process can proceed to step S214 to not display the first interactive control.

[0069] Step S213: If the virtual keyboard is detected to be displayed, display the first interactive control at a specified location on the target interface.

[0070] Optionally, while the first interactive control is displayed at the specified location on the target interface, the electronic device can continue to monitor whether the virtual keyboard is still displayed. If the virtual keyboard is detected to switch from being displayed to not being displayed, the first interactive control can be hidden.

[0071] In some implementations, the designated location may include a position above and adjacent to the display area of the virtual keyboard. For example, referring to Figure 2, the first interactive control 220 shown in Figure 2 is located above and adjacent to the display area of the virtual keyboard 230 in the target interface 200, which is the designated location.

[0072] Step S214: If the virtual keyboard is not detected, the first interactive control may not be displayed.

[0073] Not displaying the first interactive control can be understood as hiding the first interactive control.
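Steps S211 to S214 amount to a small decision function. The sketch below assumes screen coordinates in which y grows downward and represents a hidden virtual keyboard as `None`; the `Rect` type and `first_control_position` function are hypothetical names, and the designated location used is the "above and adjacent to the keyboard" variant.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rect:
    left: int
    top: int
    right: int
    bottom: int


def first_control_position(
    is_target_interface: bool,
    keyboard_rect: Optional[Rect],
    control_height: int,
) -> Optional[Rect]:
    """Return where to draw the first interactive control, or None to hide it."""
    # S211/S212: obtain the keyboard display status (None means not displayed).
    if not is_target_interface or keyboard_rect is None:
        return None  # S214: do not display (hide) the control
    # S213: place the control directly above and adjacent to the keyboard area.
    return Rect(
        left=keyboard_rect.left,
        top=keyboard_rect.top - control_height,
        right=keyboard_rect.right,
        bottom=keyboard_rect.top,
    )
```

Calling this function again whenever the keyboard's display status changes also covers the optional monitoring behavior: once the keyboard disappears, the function returns `None` and the control is hidden.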

[0074] Step S220: In response to the target user's confirmation operation on the first interactive control, perform text recognition and image recognition operations on the displayed content in the target interface to obtain the data content to be analyzed.

[0075] After detecting a user's confirmation action on the first interactive control, the electronic device can automatically perform text recognition and image recognition operations on the displayed content of the target interface to obtain the data content to be analyzed. The data content to be analyzed may include at least one of text content and image content. For example, taking the target interface 200 shown in Figure 2 as an example, performing text recognition and image recognition operations on the displayed content of the target interface 200 yields data content to be analyzed that includes both text content and image content.

[0076] Step S230: Perform semantic analysis on the data content to be analyzed using a large model algorithm to obtain the target text information, which serves as the input content of the target user.

[0077] In the text generation method provided in this embodiment of the application, whether to display the first interactive control can be determined based on the display status of the virtual keyboard. When the virtual keyboard is detected as displayed, the first interactive control is displayed at a designated location on the target interface. The target user can then perform a confirmation operation directly through the first interactive control, which triggers text recognition and image recognition operations on the displayed content of the target interface to obtain the data content to be analyzed. Semantic analysis is then performed on that data content using a large model algorithm to obtain the target text information, which serves as the input content of the target user. In other words, the target user no longer needs to manually find and launch a relevant application and then manually input the displayed content of the target interface into it, which greatly improves the convenience for users to obtain target text information.

[0078] Please refer to Figure 6, which illustrates a text generation method provided in another embodiment of this application. This text generation method can be applied to an electronic device; specifically, the processor of the electronic device can serve as the execution entity of the method. The text generation method may include steps S310 to S350.

[0079] Step S310: When the target interface is detected to be displayed on the electronic device, in response to the confirmation operation of the target user on the target control, text recognition operation and image recognition operation are performed on the displayed content in the target interface to obtain the data content to be analyzed. The target interface is used to receive the input content of the target user, and the data content to be analyzed includes at least one of text content and image content.

[0080] Step S320: Perform semantic analysis on the data content to be analyzed using a large model algorithm to obtain the target text information, which serves as the input content for the target user.

[0081] Steps S310 and S320 have been described in detail in the foregoing embodiments and will not be repeated here.

[0082] Step S330: Display the second interactive control.

[0083] Step S340: In response to the target user's confirmation operation on the second interactive control, the target text information is filled into the input box in the target interface for receiving the target user's input content.

[0084] In some implementations, after obtaining the target text information, a second interactive control may also be displayed. Similar to the first interactive control, this second interactive control is also a control that the target user can interact with, allowing the target user to input a confirmation operation through the second interactive control. The description of the target user inputting a confirmation operation through the second interactive control is similar to that of the target user inputting a confirmation operation through the first interactive control, and will not be repeated here.

[0085] Therefore, when the electronic device detects a confirmation operation by the target user on the second interactive control, it can respond to the confirmation operation by filling the target text information into the input box in the target interface for receiving the input content of the target user.

[0086] In other words, in the text generation method provided in this application embodiment, after obtaining the target text information, a second interactive control can also be displayed. Thus, when the target user's confirmation operation on the second interactive control is detected, the target text information can be automatically filled into the input box in the target interface for receiving the target user's input content, so as to further improve the convenience for the target user to obtain the target text information using artificial intelligence technology.

[0087] For example, please refer to Figure 4, which shows the target text information 410 and a second interactive control 420. The target user can perform a confirmation operation on the second interactive control 420 to fill the target text information 410 into the input box 210 of the target interface 200 shown in Figure 2.

[0088] It should be noted that filling the input box with target text information essentially means that the electronic device copies the target text information and pastes it into the input box. In other words, the electronic device automatically completes the operation of entering text information that is identical to the target text information into the input box.

[0089] In some implementations, electronic devices can locate the corresponding text editing (EditText) control by traversing the target interface. For example, the operating system of the electronic device may be configured with accessibility features, allowing it to traverse the view tree nodes of the target interface to determine if a text editing control exists. If a text editing control is found, it can be determined that the text editing control includes an input box for obtaining input from the target user. Therefore, the data interface provided by the accessibility feature can be directly invoked to fill the input box with the target text information obtained through the aforementioned steps.

[0090] Optionally, if it is determined that there are multiple text editing controls corresponding to the target interface, the target text information can be filled into the input box corresponding to the last detected text editing control.
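The traversal described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiments: `ViewNode`, `find_text_edit_controls`, and `fill_target_text` are hypothetical names, and the view tree is modeled as plain Python objects rather than the nodes exposed by an accessibility feature.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ViewNode:
    """Hypothetical stand-in for one node of the target interface's view tree."""
    class_name: str
    text: str = ""
    children: List["ViewNode"] = field(default_factory=list)


def find_text_edit_controls(root: ViewNode) -> List[ViewNode]:
    """Depth-first traversal that collects text editing controls in visit order."""
    found: List[ViewNode] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.class_name == "EditText":
            found.append(node)
        # Push children reversed so they are visited left to right.
        stack.extend(reversed(node.children))
    return found


def fill_target_text(root: ViewNode, target_text: str) -> bool:
    """Fill the last detected text editing control; return False if none exists."""
    controls = find_text_edit_controls(root)
    if not controls:
        return False
    controls[-1].text = target_text
    return True
```

A `False` return value corresponds to the case in which no text editing control is found in the view tree.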

[0091] Step S350: If the target text information is not successfully filled into the input box in the target interface for receiving the input content of the target user, the target text information is stored and a prompt message is generated.

[0092] Furthermore, even if the target user performs the confirmation operation on the second interactive control, the electronic device may fail to fill the target text information into the input box in the target interface for receiving the target user's input content. In that case, the electronic device may store the target text information and generate a prompt message.

[0093] For example, if an electronic device fails to find a text editing control by traversing the view tree nodes of the target interface, it can determine that the target text information has not been successfully filled into the input box in the target interface used to receive the input content of the target user.

[0094] For example, if the electronic device fails to activate its accessibility function, for instance because activation does not complete within a specified duration (for example, 3 seconds), it can be determined that the target text information was not successfully filled into the input box in the target interface used to receive the input content of the target user.

[0095] The electronic device may be equipped with a storage unit to store the target text information. Optionally, the storage unit can generally be divided into volatile storage and non-volatile storage. For example, the electronic device can store the target text information in the operating system's clipboard, which corresponds to a storage area in volatile storage. Alternatively, the electronic device can also store the target text information in a storage area corresponding to non-volatile storage.

[0096] Additionally, a prompt message can be generated. This prompt message can instruct the target user to manually paste the target text information into the input box of the target interface. Since the target text information has already been stored, the target user can, after seeing the prompt message, directly paste the target text information into the input box of the target interface without first having to copy it to the clipboard, which greatly improves convenience.
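The fallback of step S350 can be sketched as follows, under stated assumptions: `fill_or_store`, `FillResult`, and the prompt wording are hypothetical, the clipboard is modeled as a plain dictionary, and a failed or timed-out fill attempt is modeled as a callable that returns `False` or raises `TimeoutError`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class FillResult:
    filled: bool
    prompt: Optional[str] = None


def fill_or_store(
    target_text: str,
    try_fill: Callable[[str], bool],
    clipboard: Dict[str, str],
) -> FillResult:
    """Try to fill the input box; on failure, store the text and build a prompt."""
    try:
        succeeded = try_fill(target_text)
    except TimeoutError:
        # Assumption: a timed-out accessibility activation surfaces as TimeoutError.
        succeeded = False
    if succeeded:
        return FillResult(filled=True)
    # Store the text so that the user only has to paste it.
    clipboard["text"] = target_text
    return FillResult(
        filled=False,
        prompt="Automatic filling failed. The text has been saved; "
               "please paste it into the input box.",
    )
```

Only the failure path writes to the clipboard, so content the user copied earlier is left untouched when the fill succeeds.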

[0097] The text generation method provided in this application, after performing semantic analysis on the data content to be analyzed using a large model algorithm to obtain target text information as input content for the target user, can also display a second interactive control. In response to the target user's confirmation operation on the second interactive control, the target text information is filled into the input box in the target interface used to receive the target user's input content. That is, the target text information can be automatically filled into the input box in the target interface used to receive the target user's input content, without requiring the target user to manually input the obtained target text information into the input box in the target interface. This reduces the target user's operation steps and learning cost, further improving the convenience for the target user to obtain target text information using artificial intelligence technology.

[0098] Please refer to Figure 7, which illustrates a text generation method provided in an embodiment of this application. This text generation method can be applied to an electronic device; specifically, the processor of the electronic device can serve as the execution entity of the text generation method. The text generation method may include steps S410 to S470.

[0099] Step S410: Detect whether the interface displayed by the electronic device matches a preset reference interface.

[0100] Step S420: If the interface displayed by the electronic device matches the reference interface, it is determined that the target interface displayed by the electronic device is detected.

[0101] In some implementations, a reference interface can be pre-defined; for example, this reference interface can be pre-defined by the application developer. The reference interface may include multiple interfaces. It is understood that the reference interface is the pre-defined interface that needs to be monitored.

[0102] It should be noted that the reference interface is a display interface of an application or mini program. In the embodiments provided in this application, the application corresponding to the reference interface is not limited to first-party applications of the electronic device; it can also be a suitable third-party application, which greatly improves the adaptability and flexibility of the text generation method provided in the embodiments of this application.

[0103] Therefore, the interface displayed by the electronic device can be detected, and if the interface displayed by the electronic device matches the reference interface, it can be determined that the target interface displayed by the electronic device has been detected.

[0104] In some implementations, the detection of the interface displayed on an electronic device can be achieved through scene intelligence services. These scene intelligence services can be provided by scene intelligence applications or by the operating system; this application does not impose specific limitations on this.

[0105] Optionally, the scene intelligence service can listen for the displayed interface through a specified service in the operating system. For example, the specified service can be the OplusAppSwitchManager service.

[0106] Optionally, the scene intelligence service can obtain pre-defined reference interfaces to facilitate subsequent detection of whether the interface displayed by the electronic device matches the pre-defined reference interfaces. The scene intelligence service can obtain these reference interfaces through the SceneServiceProvider interface.

[0107] When the scene intelligence service detects that the interface displayed by the electronic device matches the reference interface, it can be determined that the target interface displayed by the electronic device has been detected. Therefore, the screen recognition application can be activated to perform subsequent steps to obtain the data content to be analyzed. For example, the scene intelligence service can activate the screen recognition application through the scene entry service (SceneEntryService).

[0108] Optionally, reference text can be preset, and then text displayed on the interface of the electronic device can be acquired, and it can be determined whether the text contains the reference text. If the reference text is present, it can be determined that the target interface displayed by the electronic device has been detected.
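The two detection options described above (matching a reference interface, or finding reference text in the displayed text) can be sketched as follows. This is a simplified illustration: interfaces are represented by hypothetical string identifiers, and `is_target_interface` is an assumed name rather than part of any scene intelligence service.

```python
from typing import Iterable, Set


def is_target_interface(
    displayed_interface: str,
    displayed_text: str,
    reference_interfaces: Set[str],
    reference_texts: Iterable[str],
) -> bool:
    """Return True if the displayed interface counts as the target interface."""
    # Option 1: the displayed interface matches a preset reference interface.
    if displayed_interface in reference_interfaces:
        return True
    # Option 2: the displayed text contains a preset reference text.
    return any(reference in displayed_text for reference in reference_texts)
```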

[0109] It is understood that the reference interface may change in different scenarios. Therefore, in some implementations, step S410 may also include steps S411 to S413.

[0110] Step S411: Identify the scene information corresponding to the interface displayed by the electronic device.

[0111] Step S412: Based on the pre-set scene interface comparison data, determine the reference interface corresponding to the scene information, wherein the scene interface comparison data includes at least one standard scene information and a standard interface corresponding to each standard scene information.

[0112] First, the scene information corresponding to the interface displayed on the electronic device can be identified. For example, if the displayed interface is a chat interface of an instant messaging software, the corresponding scene information can be determined to be a chat scene; if the displayed interface is a sharing interface of a social networking software, the corresponding scene information can be determined to be a sharing scene.

[0113] In some implementations, the interface displayed on an electronic device can be analyzed using a pre-trained model to determine the corresponding scene information. For example, the pre-trained model can be a neural network model.

[0114] Therefore, after obtaining the scene information, a reference interface corresponding to that scene information can be further determined. Specifically, the reference interface corresponding to the scene information can be determined based on pre-set scene interface comparison data. The scene interface comparison data includes at least one standard scene information and a standard interface corresponding to each standard scene information.

[0115] For example, the scene interface comparison data can be a scene interface comparison table. This allows for the search of a standard scene information that matches the scene information within the scene interface comparison data. The standard interface corresponding to this matching standard scene information is then used as a reference interface for the interface displayed on the electronic device. Subsequently, it is possible to detect whether the interface displayed on the electronic device matches the reference interface. In some embodiments, after obtaining the scene interface comparison data, the target user can also update the scene interface comparison data. Specifically, step S412 may further include steps S413 to S416.

[0116] Step S413: Obtain scene interface comparison data.

[0117] Step S414: In response to the target user's update operation on the scene interface comparison data, obtain the specified scene information determined by the target user and the specified interface corresponding to the specified scene information.

[0118] Step S415: Add the specified scene information and the specified interface corresponding to the specified scene information as new standard scene information and a new standard interface matching the new standard scene information to the scene interface comparison data to update the scene interface comparison data.

[0119] Step S416: Based on the updated scene interface comparison data, determine the reference interface that matches the scene information.

[0120] First, the scene interface comparison data can be obtained. It is understood that, in certain scenarios, the target user may need additional interfaces to serve as standard interfaces for existing scene information, based on their own needs. Alternatively, the target user may need to add new scene information together with the standard interfaces corresponding to that new scene information.

[0121] Therefore, users can update the scene interface comparison data; for example, users can perform the update operation on the scene interface comparison data through an electronic device. Thus, the electronic device can respond to the target user's update operation on the scene interface comparison data, obtaining the specified scene information determined by the target user and the specified interface corresponding to that specified scene information.

[0122] It is understood that the target user can determine the specified scene information and the corresponding specified interface through the electronic device.

[0123] For example, a target user can select existing standard scene information from the scene interface comparison data as the specified scene information, and then set the corresponding specified interface for the specified scene information.

[0124] For example, the target user can also set new scene information in addition to the existing standard scene information in the scene interface comparison data as the specified scene information, and then set the corresponding specified interface for the specified scene information.

[0125] Furthermore, the specified scene information and the specified interface corresponding to the specified scene information can be added to the scene interface comparison data as new standard scene information and new standard interface matching the new standard scene information, respectively, to update the scene interface comparison data.

[0126] It should be noted that when updating the scene interface comparison data, if the scene interface comparison data already contains standard scene information consistent with the new standard scene information, the new standard interface matching the new standard scene information can be added directly to the scene interface comparison data, so that the new standard interface is also used as a standard interface corresponding to that consistent standard scene information.

[0127] If no standard scene information matches the new standard scene information in the scene interface comparison data, the new standard scene information and the new standard interface that matches the new standard scene information can be directly added to the scene interface comparison data to update the scene interface comparison data.

[0128] Then, based on the updated scene interface comparison data, a reference interface matching the scene information can be determined.
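The lookup and update of the scene interface comparison data described above can be sketched as follows. This is a minimal sketch that assumes the comparison table is a mapping from standard scene information to a set of standard interfaces. The names `SceneTable`, `reference_interfaces_for`, and `update_table`, and the string identifiers used for scenes and interfaces, are hypothetical.

```python
from typing import Dict, Set

# Standard scene information -> standard interfaces for that scene.
SceneTable = Dict[str, Set[str]]


def reference_interfaces_for(table: SceneTable, scene_info: str) -> Set[str]:
    """Return the reference interfaces for the identified scene (empty if unknown)."""
    return set(table.get(scene_info, set()))


def update_table(table: SceneTable, specified_scene: str, specified_interface: str) -> None:
    """Add a user-specified scene/interface pair to the comparison table.

    If the scene already exists, the interface is added alongside the existing
    standard interfaces; otherwise a new scene entry is created.
    """
    table.setdefault(specified_scene, set()).add(specified_interface)
```

Using a set per scene means that adding an interface that is already present leaves the table unchanged.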

[0129] In some implementations, scene interface comparison data can be included in the configuration information, allowing the scene intelligence application to subsequently obtain the configuration information and thus the scene interface comparison data. For example, the screen recognition application may include configuration information, which can then be sent to the scene intelligence application. Additionally, the target user can update the configuration information via the screen recognition application, specifically updating the scene interface comparison data within the configuration information.

[0130] Step S413: Detect whether the interface displayed by the electronic device matches the reference interface.

[0131] Then, it is checked whether the interface displayed on the electronic device matches the reference interface.

[0132] Therefore, target users can determine the specified scene information and the corresponding specified interface based on their own needs, and then add them to the scene interface comparison data to update it. This ensures that when the reference interface corresponding to the interface displayed by the electronic device is subsequently determined using the scene interface comparison data, the reference interface can include the interface pre-added by the target user according to their own needs, achieving a certain degree of customization of the reference interface.

[0133] Step S430: When the target interface is detected to be displayed on the electronic device, in response to the confirmation operation of the target user on the target control, text recognition operation and image recognition operation are performed on the displayed content of the target interface to obtain the data content to be analyzed.

[0134] After obtaining the target user's confirmation action on the target control, text recognition and image recognition operations can be performed on the displayed content of the target interface to obtain the data content to be analyzed. In some embodiments, the screen recognition application may also include a content extraction service, which can then be used to perform text recognition and image recognition operations on the displayed content of the interface.

[0135] Please refer to Figure 12, which shows a structural block diagram of the acquisition of data content to be analyzed provided in an embodiment of this application. Figure 12 illustrates a screen recognition application 1110, which includes an entry service 1111, an extraction service 1112, and an artificial intelligence business function module 1113. Specifically, after the screen recognition application obtains the target user's confirmation operation on the target control through the entry service 1111, it can perform text recognition and image recognition operations on the displayed content in the target interface through the extraction service 1112 to obtain the data content to be analyzed.

[0136] Specifically, the extraction service 1112 can perform text recognition and image recognition operations on the displayed content of the target interface through the accessibility service 11121, Webview 11122, the image and text extraction framework 11123 corresponding to the H5 page, the image and text extraction framework 11124 corresponding to the Native page, the screenshot service 11125, and optical character recognition 11126 to obtain the data content to be analyzed. Detailed descriptions can be found in subsequent sections.

[0137] Then, the data content to be analyzed is sent to the Aiunit algorithm interface 1120 through the artificial intelligence business function module 1113. The data content to be analyzed is uploaded to the cloud server 1130 through the Aiunit algorithm interface 1120, and the target text information corresponding to the data content to be analyzed is obtained through the cloud server 1130. In addition, during the process of performing text recognition and image recognition operations on the displayed content in the target interface through the extraction service 1112 to obtain the data content to be analyzed, the screen recognition application 1110 can also call the system framework service 1140, which includes accessibility service 1141, image and text extraction framework service 1142, and screenshot service 1143, etc.

[0138] In some implementations, text recognition and image recognition operations can be performed on the displayed content of the target interface based on accessibility services to obtain the data content to be analyzed. That is, the data content to be analyzed can be obtained through the accessibility service 11121 shown in Figure 12.

[0139] In some implementations, the data content to be analyzed can also be obtained through a Uniform Resource Locator (URL). Specifically, step S430 may include steps S4301 to S4305.

[0140] Step S4301: When the electronic device is detected to be displaying a target interface, in response to the target user's confirmation operation on the target control, the first Uniform Resource Locator corresponding to the target interface is obtained.

[0141] Step S4302: Obtain the Hypertext Markup Language used to describe the target page based on the first Uniform Resource Locator.

[0142] Step S4303: Extract the second Uniform Resource Locator corresponding to the first identifier from the Hypertext Markup Language, wherein the first identifier serves as the identity information of the first image, and the first image includes the image content in the display content.

[0143] Step S4304: Obtain the first image based on the second Uniform Resource Locator.

[0144] Step S4305: Extract the text content of the content displayed on the target page from the hypertext markup language as the first text content, and use the first image and the first text content as the data content to be analyzed.

[0145] When the target interface is a page in a browser application, upon detecting that the electronic device is displaying the target interface, in response to the target user's confirmation operation on the target control, the first Uniform Resource Locator corresponding to the target interface can be obtained first.

[0146] For example, the first Uniform Resource Locator (URL) of the browser's target interface can be extracted using a text extraction framework. Specifically, the first URL can be obtained by searching the view tree nodes for the text editing control that contains it.

[0147] Then, the Hypertext Markup Language (HTML) used to describe the target page can be obtained based on the first Uniform Resource Locator (URL). Specifically, the webpage content corresponding to the first URL can be loaded through the WebView, and the HTML of the target page can be obtained from the webpage content. Alternatively, the HTML of the target page can be obtained from the webpage content after the WebView has finished loading it (that is, once onPageFinished is detected).

[0148] Optionally, the WebView can be made transparent, so that the process of loading the web page content corresponding to the first Uniform Resource Locator is not visible to the target user, thereby improving the user experience.

[0149] Furthermore, after obtaining the Hypertext Markup Language, a second Uniform Resource Locator (URL) corresponding to the first identifier can be extracted from the Hypertext Markup Language. The first identifier serves as the identity information of the first image, which includes the image content in the displayed content.

[0150] For example, the Hypertext Markup Language can be filtered to obtain text containing... The tag represents the first identifier. Additionally, the tag includes the second Uniform Resource Locator (URL). Therefore, the first image can be obtained based on the second URL.

[0151] Then, the text content of the content displayed on the target page is extracted from the hypertext markup language as the first text content, and the first image and the first text content are used as the data content to be analyzed. Thus, the data content to be analyzed corresponding to the displayed content of the target interface is obtained. That is, the data content to be analyzed can be obtained through the Webview 11122 shown in Figure 12.
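Steps S4303 and S4305 can be sketched with Python's standard `html.parser` module, as follows. The sketch makes an explicit assumption that the source text does not confirm: the `img` tag is treated as the first identifier and its `src` attribute as the second URL. `PageContentExtractor` and `extract_content` are hypothetical names, and downloading the image (step S4304) is omitted.

```python
from html.parser import HTMLParser
from typing import List, Tuple
from urllib.parse import urljoin


class PageContentExtractor(HTMLParser):
    """Collects image URLs and visible text from an HTML document."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.image_urls: List[str] = []
        self.text_parts: List[str] = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img":  # assumption: img is the tag acting as the first identifier
            src = dict(attrs).get("src")
            if src:
                # Resolve relative URLs against the first URL.
                self.image_urls.append(urljoin(self.base_url, src))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract_content(html: str, first_url: str) -> Tuple[List[str], str]:
    """Return (second URLs of images, first text content) for the page."""
    parser = PageContentExtractor(first_url)
    parser.feed(html)
    parser.close()
    return parser.image_urls, " ".join(parser.text_parts)
```

Script and style content is skipped so that only text a user would actually see is passed on for semantic analysis.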

[0152] In some implementations, the data content to be analyzed can also be obtained directly through an image and text extraction framework. Specifically, step S430 may also include steps S4306 and S4307.

[0153] Step S4306: When the target interface is detected to be displayed on the electronic device, in response to the confirmation operation of the target user on the target control, the image content of the content displayed on the target interface is obtained as the second image through the image and text extraction framework, and the text content of the content displayed on the target interface is obtained as the second text content.

[0154] Step S4307: Use the second image and the second text content as the data content to be analyzed.

[0155] When the target interface is a page in a non-browser application, upon detecting that the electronic device is displaying the target interface, in response to the target user's confirmation operation on the target control, the image and text content of the content displayed in the target interface can be obtained through the image and text extraction framework.

[0156] For example, a non-browser application may include H5 pages or native pages. An H5 page is a web page generated based on HTML5 technology, while a native page is a native application page built using a specific platform's development language and framework (e.g., Android, iOS). Therefore, when the page in the non-browser application is an H5 page, the image content displayed on the target interface can be obtained as a second image, and the text content displayed on the target interface can be obtained as a second text content, using the corresponding image and text extraction framework. Similarly, when the page in the non-browser application is a native page, the image content displayed on the target interface can be obtained as a second image, and the text content displayed on the target interface can be obtained as a second text content, using the corresponding image and text extraction framework.

[0157] Furthermore, the second image and the second text content are used as the data content to be analyzed. Thus, the data content to be analyzed corresponding to the display content of the target interface is obtained. That is, the data content to be analyzed can be obtained through the image and text extraction framework 11123 corresponding to the H5 page and the image and text extraction framework 11124 corresponding to the Native page shown in Figure 12.

[0158] In some implementations, the data content to be analyzed corresponding to the displayed content of the target interface can also be obtained by taking a screenshot. Specifically, step S430 may also include steps S4308 and S4309.

[0159] Step S4308: When the electronic device is detected to be displaying a target interface, in response to the target user's confirmation operation on the target control, a screenshot operation is performed on the content displayed on the target interface to obtain a third image.

[0160] Step S4309: Use the third image as the data content to be analyzed.

[0161] When the target interface is a page of a specific application, upon detecting that the electronic device is displaying the target interface, in response to the target user's confirmation operation on the target control, a screenshot can be directly taken of the displayed content on the target interface to obtain a third image. This third image is then used as the data content to be analyzed. In other words, the data content to be analyzed can be obtained through the screenshot service 11125 shown in Figure 12.

[0162] The screenshot operation can be implemented through the screenshot function built into the operating system of the electronic device. It generates an image file containing the content displayed on the target interface.

[0163] The specific application can be an application pre-selected by the user.

[0164] In some implementations, optical character recognition (OCR) can also be used to obtain the data content to be analyzed corresponding to the displayed content of the target interface. Specifically, step S430 may also include steps S4310 to S4312.

[0165] Step S4310: When the electronic device is detected to be displaying a target interface, in response to the target user's confirmation operation on the target control, the displayed content on the target interface is text-recognized by optical character recognition to obtain third text content.

[0166] Step S4311: Take a screenshot of the image content displayed on the target interface to generate a fourth image.

[0167] Step S4312: Use the third text content and the fourth image as the data content to be analyzed.

[0168] When the target interface is a page belonging to a mini program, upon detecting that the electronic device is displaying the target interface, in response to the target user's confirmation operation on the target control, optical character recognition (OCR) is used to perform text recognition on the displayed content of the target interface to obtain the third text content. Additionally, a screenshot is taken of the image content displayed on the target interface to generate a fourth image. The screenshot operation on the image content displayed on the target interface can be implemented using a region-based screenshot method.

[0169] Therefore, the third text content and the fourth image are taken as the data content to be analyzed. That is, the data content to be analyzed can be obtained through optical character recognition 11126 shown in Figure 12.

[0170] Mini programs are lightweight applications. Unlike traditional mobile applications, mini programs do not need to be downloaded and installed. Users can run them directly within other applications that provide a mini program interface, which saves users' time and storage space.
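The four ways of obtaining the data content to be analyzed in step S430 depend on the kind of page being displayed. The following sketch summarizes that correspondence as a lookup table. `PageKind`, `AcquisitionRoute`, and `select_route` are hypothetical names introduced for illustration, and how the page kind is determined is outside the sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Tuple


class PageKind(Enum):
    BROWSER_PAGE = "page in a browser application"
    NON_BROWSER_PAGE = "H5 or native page in a non-browser application"
    SPECIFIC_APPLICATION_PAGE = "page of a specific application"
    MINI_PROGRAM_PAGE = "page of a mini program"


@dataclass(frozen=True)
class AcquisitionRoute:
    mechanism: str
    steps: Tuple[str, ...]


ROUTES: Dict[PageKind, AcquisitionRoute] = {
    PageKind.BROWSER_PAGE: AcquisitionRoute(
        "URL and WebView", ("S4301", "S4302", "S4303", "S4304", "S4305")),
    PageKind.NON_BROWSER_PAGE: AcquisitionRoute(
        "image and text extraction framework", ("S4306", "S4307")),
    PageKind.SPECIFIC_APPLICATION_PAGE: AcquisitionRoute(
        "screenshot", ("S4308", "S4309")),
    PageKind.MINI_PROGRAM_PAGE: AcquisitionRoute(
        "optical character recognition and region screenshot",
        ("S4310", "S4311", "S4312")),
}


def select_route(kind: PageKind) -> AcquisitionRoute:
    """Return the acquisition route described for the given page kind."""
    return ROUTES[kind]
```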

[0171] Step S440: Perform semantic analysis on the data content to be analyzed using a large model algorithm to obtain the target text information, which serves as the input content for the target user.

[0172] Step S440 has been described in detail in the foregoing embodiments and will not be repeated here.

[0173] In the text generation method provided in this application embodiment, it is possible to detect whether the interface displayed by the electronic device matches a pre-set reference interface; if the interface displayed by the electronic device matches the reference interface, it is determined that the target interface displayed by the electronic device has been detected; when the target interface is detected, in response to the confirmation operation of the target user on the target control, text recognition and image recognition operations are performed on the displayed content of the target interface to obtain the data content to be analyzed. In addition, the target user can determine the specified scene information and the corresponding specified interface according to their own needs, and then add them to the scene interface comparison data to update it. This ensures that when the reference interface corresponding to the interface displayed by the electronic device is subsequently determined through the scene interface comparison data, the reference interface can include the interface pre-added by the target user according to their own needs, achieving a certain degree of customization of the reference interface.

[0174] Please refer to Figure 8, which illustrates a text generation method provided in an embodiment of this application. This text generation method can be applied to an electronic device; specifically, the processor of the electronic device can serve as the execution entity of the text generation method. The text generation method may include steps S510 to S550.

[0175] Step S510: The scene intelligence application detects whether the interface displayed by the electronic device matches the preset reference interface.

[0176] Specifically, a scene intelligence application can be used to detect whether the interface displayed by the electronic device matches a pre-set reference interface.

[0177] Specifically, please refer to Figure 9, which shows a flowchart of the implementation of each step in Figure 8. Step S510 shown in Figure 8 may include steps S511 and S512 in Figure 9.

[0178] Step S511: Detect whether the interface displayed by the electronic device matches a preset reference interface.

[0179] Step S512: Activate the screen recognition application.

[0180] Specifically, the scene intelligence application can activate the screen recognition application when the interface displayed on the electronic device matches the reference interface.

[0181] Step S520: The screen recognition application calls the content extraction service to obtain the data content to be analyzed.

[0182] In the process of obtaining the data content to be analyzed, the screen recognition application can also call up system framework services, which include accessibility services, image and text extraction framework services, and screenshot services.

[0183] Step S520 shown in Figure 8 may include steps S521 to S525 in Figure 9.

[0184] Step S521: Launch the screen recognition application.

[0185] Step S522: Detect whether the virtual keyboard has been displayed.

[0186] Step S523: If the virtual keyboard is detected to be displayed, display the first interactive control at a specified location on the target interface.

[0187] Step S524: The target user performs a confirmation operation on the first interactive control.

[0188] Step S525: Obtain the data to be analyzed.

[0189] Optionally, the shape of the first interactive control can be a bubble.

[0190] In response to the target user's confirmation operation on the first interactive control, text recognition and image recognition operations can be performed on the content displayed on the target interface to obtain the data to be analyzed. After the data to be analyzed is obtained, it can be uploaded to the cloud server through the algorithm interface.
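
Steps S522 to S525 can be read as a small event-driven flow. The following is a minimal sketch under stated assumptions, not the implementation of this application: the class name `ScreenRecognitionFlow`, its method names, and the two callables standing in for content extraction and for the algorithm interface are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ScreenRecognitionFlow:
    """Hypothetical model of steps S522 to S525."""
    extract_content: Callable[[], dict]   # stand-in for text + image recognition
    upload: Callable[[dict], None]        # stand-in for the algorithm interface
    bubble_visible: bool = False

    def on_keyboard_state(self, keyboard_shown: bool) -> None:
        # S522/S523: show the first interactive control only while the
        # virtual keyboard is displayed.
        self.bubble_visible = keyboard_shown

    def on_bubble_confirmed(self) -> Optional[dict]:
        # S524/S525: a confirmation is only possible if the control is shown.
        if not self.bubble_visible:
            return None
        data = self.extract_content()
        self.upload(data)  # hand the data to be analyzed to the uploader
        return data
```

In this sketch a confirmation received while the control is hidden is simply ignored, which is one possible design choice; the text does not specify that case.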

[0191] Additionally, please refer to Figure 10, which illustrates the interaction between the scene intelligence application and the screen recognition application provided in this embodiment of the application. Figure 10 includes the scene intelligence application 910 and the screen recognition application 920.

[0192] The scene intelligence application 910 can listen to the reference interface, and then wake up the screen recognition application 920 when the interface displayed on the electronic device matches the reference interface.

[0193] The screen recognition application 920 may include a configuration query unit 921, a configuration manager 922, and a configuration data update unit 923. The configuration query unit 921 can obtain configuration information from the configuration manager 922 and the configuration data update unit 923 through a scene service provider interface. This configuration information includes a preset reference interface. The screen recognition application 920 also includes a database 926. Specifically, the configuration information can be obtained by querying the database 926 through the configuration manager 922 and the configuration data update unit 923. The configuration manager 922 corresponds to a scene rule manager interface; the configuration data update unit 923 corresponds to a scene rule data source interface; and the database 926 corresponds to a scene rule entityDao interface.

[0194] The screen recognition application 920 may also include a configuration synchronization unit 924, which sends configuration information to the scene intelligence application 910 to update the reference interface pre-acquired in the scene intelligence application 910. The configuration information may include scene interface comparison data, allowing the scene intelligence application 910 to determine the reference interface matching the scene information after receiving the configuration information. The configuration synchronization unit 924 corresponds to a SceneServiceManager interface.

[0195] Additionally, the screen recognition application 920 may also include a listening unit 925. This listening unit 925 can monitor the display status of the virtual keyboard via the SceneEntryService interface. Upon detecting that the virtual keyboard is displayed, a first interactive control is displayed at a designated location on the target interface.

[0196] Additionally, please refer to Figures 10 and 11, where Figure 11 shows a flowchart of the interaction between the scene intelligence application and the screen recognition application provided in the embodiments of this application.

[0197] Figure 11 illustrates the steps performed by the scene intelligence application and by the screen recognition application. As shown in Figure 11, the scene intelligence application can wake up the screen recognition application, thereby launching it. Optionally, after launch, the screen recognition application can update its configuration information via a configuration synchronization unit.

[0198] The scene intelligence application can also query configuration information from the screen recognition application, which includes multiple reference interfaces. After obtaining the queried configuration information, the screen recognition application can return updated configuration information to the scene intelligence application through the scene rule manager and scene rule data source. Specifically, it can query the database through the scene rule data source. If the database is empty, it can return the pre-set default configuration information to the scene intelligence application; if the database is not empty, it can trigger a data update and send the updated configuration information to the scene intelligence application.
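
The empty-database fallback described above can be sketched as follows. This is an illustrative sketch only: `DEFAULT_CONFIG`, `query_configuration`, and the `refresh` callback are hypothetical names, and the dictionary layout of the configuration is an assumption.

```python
from typing import Callable, Dict, List

Config = Dict[str, List[str]]

# Hypothetical pre-set default configuration (contents are placeholders).
DEFAULT_CONFIG: Config = {"reference_interfaces": ["default_input_page"]}


def query_configuration(database: Config,
                        refresh: Callable[[Config], Config]) -> Config:
    """Return the configuration to send back to the scene intelligence application."""
    if not database:
        # Empty database: fall back to the pre-set default configuration.
        return {key: list(value) for key, value in DEFAULT_CONFIG.items()}
    # Non-empty database: trigger a data update and return the updated result.
    return refresh(database)
```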

[0199] Furthermore, when the scene intelligence application detects that the electronic device is displaying a target interface, it can determine that the interface has been entered, thereby triggering a listening service in the screen recognition application to monitor the display of the virtual keyboard. Optionally, the screen recognition application can also obtain relevant data from various reference interfaces from the scene rule manager to more accurately monitor the display of the virtual keyboard in the target interface. Thus, when the virtual keyboard is detected to be displayed, a first interactive control can be displayed at a designated location on the target interface. In some embodiments, the first interactive control can also be called a bubble service. The scene intelligence application can call Bindservice to the screen recognition application to trigger the listening service.

[0200] Furthermore, the scene intelligence application can also monitor whether the reference interface has been exited. For example, if the display switches from the reference interface to the interface of another application, it can be determined that the reference interface has been exited. At this point the listening service can be stopped, specifically by no longer listening for the display status of the virtual keyboard, and the first interactive control can be hidden. The scene intelligence application can call UnBindservice on the screen recognition application, causing the screen recognition application to stop the listening service.

[0201] Optionally, the scene intelligence application might fail to properly call UnBindservice to the screen recognition application, causing the screen recognition application to continuously monitor the virtual keyboard's display. Therefore, a scheduled task can be set within the screen recognition application. This task can be configured to stop monitoring the virtual keyboard's display and the reference interface's exit after a specified duration, and then terminate the relevant processes of the screen recognition application. For example, the specified duration could be 2 hours.
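
The scheduled task that guards against a missed UnBindservice call can be sketched as a watchdog with a deadline. The class below is a hypothetical illustration: the name `KeyboardListenerWatchdog`, the periodic `tick` method, and the injectable clock are assumptions made so the behavior can be exercised without waiting two hours.

```python
import time
from typing import Callable, Optional


class KeyboardListenerWatchdog:
    """Stops keyboard monitoring if no unbind arrives within max_duration_s."""

    def __init__(self, max_duration_s: float = 2 * 60 * 60,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_duration_s = max_duration_s
        self.clock = clock
        self.started_at: Optional[float] = None

    @property
    def listening(self) -> bool:
        return self.started_at is not None

    def bind(self) -> None:
        # Target interface entered: start monitoring and start the deadline.
        self.started_at = self.clock()

    def unbind(self) -> None:
        # Normal exit path: the reference interface was exited.
        self.started_at = None

    def tick(self) -> bool:
        """Call periodically; returns True if this call forced a stop."""
        if (self.started_at is not None
                and self.clock() - self.started_at >= self.max_duration_s):
            self.unbind()
            return True
        return False
```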

[0202] In some implementations, the screen recognition application can be Ai Master.

[0203] Step S530: Upload the data to be analyzed to the cloud server via the algorithm interface.

[0204] The algorithm interface can be provided by Aiunit.

[0205] Optionally, step S530 shown in Figure 8 may include step S531 in Figure 9.

[0206] Step S531: After preprocessing the data to be analyzed, upload it to the cloud server.

[0207] Optionally, after acquiring the data to be analyzed, Aiunit can preprocess the image content within the data, such as by performing image enhancement. The preprocessed data is then uploaded to a cloud server.

[0208] Step S540: The cloud server calls the large model algorithm to perform semantic analysis on the data content to be analyzed, and obtains the target text information as the input content of the target user.

[0209] The cloud server can call large model algorithms to perform semantic analysis on the data content to be analyzed, and obtain the target text information as the input content of the target user.

[0210] Furthermore, step S530 shown in Figure 8 may also include step S532 in Figure 9.

[0211] Step S532: Return the target text information to the screen recognition application.

[0212] Specifically, the cloud server can return the target text information to the screen recognition application through an algorithm interface.

[0213] Furthermore, step S520 shown in Figure 8 may also include steps S526 to S528 in Figure 9.

[0214] Step S526: Display the target text information.

[0215] Step S527: Display the second interactive control.

[0216] Step S528: Fill the input box with the target text information.

[0217] In response to the target user's confirmation operation on the second interactive control, the target text information is filled into the input box in the target interface for receiving the target user's input content.

[0218] After acquiring the target text information, the electronic device can display it on its screen using a screen recognition application. For example, the target text information can be displayed as a layer overlaid on the target interface.

[0219] For a detailed description of each of the above steps, please refer to the foregoing embodiments; they will not be repeated here.

[0220] Please refer to Figure 13, which shows a structural block diagram of a text generation device provided in an embodiment of this application. The text generation device 1300 includes: a recognition unit 1310 and a text generation unit 1320.

[0221] The recognition unit 1310 is configured to, upon detecting that the electronic device displays a target interface, perform text recognition and image recognition operations on the displayed content of the target interface in response to a confirmation operation by a target user on a target control, to obtain data content to be analyzed. The target interface is used to receive input content from the target user, and the data content to be analyzed includes at least one of text content and image content.

[0222] Optionally, the recognition unit 1310 can also be used to display a first interactive control at a specified position on the target interface when the electronic device is detected to be displaying a target interface; and to perform text recognition and image recognition operations on the displayed content in the target interface in response to the confirmation operation of the target user on the first interactive control, so as to obtain the data content to be analyzed.

[0223] Optionally, the recognition unit 1310 can also be used to obtain the display status of the virtual keyboard in the target interface when it is detected that the electronic device displays the target interface; if the virtual keyboard is detected to be displayed, a first interactive control is displayed at a designated location on the target interface. The designated location includes a position above and adjacent to the display area of the virtual keyboard.
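
As a worked illustration of "above and adjacent to the display area of the virtual keyboard", the sketch below computes a rectangle for the first interactive control from the keyboard rectangle. The `Rect` type, the right alignment, and the 8-pixel margin are assumptions chosen for the example; the text only requires that the position be above and adjacent to the keyboard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    left: int
    top: int
    width: int
    height: int


def bubble_position(keyboard: Rect, bubble_w: int, bubble_h: int,
                    margin: int = 8) -> Rect:
    """Place the control just above the keyboard's upper edge (assumed layout)."""
    # Right-align with the keyboard, leaving a small margin (an assumption).
    left = keyboard.left + keyboard.width - bubble_w - margin
    # Sit directly above the keyboard, never above the top of the screen.
    top = max(0, keyboard.top - bubble_h - margin)
    return Rect(left, top, bubble_w, bubble_h)
```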

[0224] Optionally, the recognition unit 1310 can also be used to, upon detecting that the electronic device displays a target interface and in response to a confirmation operation by a target user on a target control, obtain a first Uniform Resource Locator (URL) corresponding to the target interface; obtain, based on the first URL, the Hypertext Markup Language (HTML) describing the target page; extract from the HTML a second URL corresponding to a first identifier, wherein the first identifier serves as the identity information of a first image, and the first image includes image content in the displayed content; obtain the first image based on the second URL; extract the text content of the displayed content in the target page from the HTML as the first text content; and use the first image and the first text content as the data content to be analyzed.
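
The extraction of the second URL and the first text content from the page markup can be sketched with Python's standard `html.parser`. This is a sketch under assumptions: it treats the first identifier as the `id` attribute of an `img` element, which the text does not specify, and it omits the network requests that fetch the page and the image. The names `PageContentParser` and `extract_from_html` are hypothetical.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple
from urllib.parse import urljoin


class PageContentParser(HTMLParser):
    """Collects visible text and the src of the image with a given id."""

    def __init__(self, image_id: str) -> None:
        super().__init__()
        self.image_id = image_id
        self.image_src: Optional[str] = None
        self.text_parts: List[str] = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img" and attr.get("id") == self.image_id:
            self.image_src = attr.get("src")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def extract_from_html(page_url: str, html: str,
                      image_id: str) -> Tuple[Optional[str], str]:
    """Return (second URL of the identified image, text content of the page)."""
    parser = PageContentParser(image_id)
    parser.feed(html)
    parser.close()
    # Resolve a relative src against the first URL of the page.
    second_url = urljoin(page_url, parser.image_src) if parser.image_src else None
    return second_url, " ".join(parser.text_parts)
```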

[0225] Optionally, the recognition unit 1310 can also be used to, in response to a confirmation operation by a target user on a target control when the electronic device displays a target interface, obtain the image content of the content displayed on the target interface as a second image through an image and text extraction framework, and obtain the text content of the content displayed on the target interface as a second text content; and use the second image and the second text content as the data content to be analyzed.

[0226] Optionally, the recognition unit 1310 can also be used to take a screenshot of the content displayed on the target interface, in response to the confirmation operation of the target user on the target control when the target interface is detected to be displayed on the electronic device, to obtain a third image; and to use the third image as the data content to be analyzed.

[0227] Optionally, the recognition unit 1310 can also be used to, when the electronic device is detected to display a target interface and in response to a confirmation operation by a target user on a target control, perform text recognition on the displayed content of the target interface through optical character recognition to obtain third text content; perform a screenshot operation on the image content of the displayed content of the target interface to generate a fourth image; and use the third text content and the fourth image as the data content to be analyzed.

[0228] Optionally, the recognition unit 1310 can also be used to detect whether the interface displayed by the electronic device matches a preset reference interface; if the interface displayed by the electronic device matches the reference interface, it is determined that the target interface displayed by the electronic device is detected; if the target interface is detected, in response to the confirmation operation of the target user on the target control, text recognition operation and image recognition operation are performed on the displayed content in the target interface to obtain the data content to be analyzed.

[0229] Optionally, the recognition unit 1310 can also be used to identify scene information corresponding to the interface displayed by the electronic device; determine a reference interface corresponding to the scene information based on preset scene interface comparison data, wherein the scene interface comparison data includes at least one piece of standard scene information and a standard interface corresponding to each piece of standard scene information; and detect whether the interface displayed by the electronic device matches the reference interface.

[0230] Optionally, the recognition unit 1310 can also be used to acquire scene interface comparison data;

[0231] In response to the target user's update operation on the scene interface comparison data, the specified scene information determined by the target user and the specified interface corresponding to the specified scene information are obtained; the specified scene information and the specified interface corresponding to the specified scene information are respectively used as new standard scene information and new standard interface matched by the new standard scene information, and added to the scene interface comparison data to update the scene interface comparison data; based on the updated scene interface comparison data, the reference interface matched by the scene information is determined.
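
The scene interface comparison data and its user-driven update can be sketched as a mapping from scene information to a set of interfaces. This is a minimal sketch, assuming scenes and interfaces are identified by plain strings; the type alias `SceneData`, the function names, and the example identifiers are hypothetical.

```python
from typing import Dict, Set

# Scene information -> standard interfaces registered for that scene.
SceneData = Dict[str, Set[str]]


def add_scene_entry(data: SceneData, scene: str, interface: str) -> SceneData:
    """Return updated comparison data with the user-specified pair added."""
    updated = {key: set(value) for key, value in data.items()}
    # If the scene already exists, the new interface joins its existing set.
    updated.setdefault(scene, set()).add(interface)
    return updated


def matches_reference(data: SceneData, scene: str,
                      displayed_interface: str) -> bool:
    """Check the displayed interface against the references for its scene."""
    return displayed_interface in data.get(scene, set())
```

For example, after `add_scene_entry(data, "review", "com.example.shop/ReviewActivity")`, a later display of that interface in the "review" scene would be treated as a match.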

[0232] The text generation unit 1320 is used to perform semantic analysis on the data content to be analyzed using a large model algorithm to obtain target text information as the input content of the target user.

[0233] Optionally, the text generation unit 1320 can also be used to display a second interactive control; in response to the target user's confirmation operation on the second interactive control, the target text information is filled into the input box in the target interface for receiving the target user's input content.

[0234] Optionally, the text generation unit 1320 can also be used to store the target text information and generate a prompt message if the target text information is not successfully filled into the input box in the target interface for receiving the input content of the target user.
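
The fill-or-store behavior, together with the view tree traversal mentioned in the claims, can be sketched as below. This is an illustrative model rather than the accessibility API itself: `ViewNode`, the `"edit_text"` kind, `FillResult`, and the prompt wording are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ViewNode:
    kind: str
    text: str = ""
    children: List["ViewNode"] = field(default_factory=list)


@dataclass
class FillResult:
    filled: bool
    stored_text: Optional[str] = None
    prompt: Optional[str] = None


def find_text_editor(node: ViewNode) -> Optional[ViewNode]:
    """Depth-first search of the view tree for a text editing control."""
    if node.kind == "edit_text":
        return node
    for child in node.children:
        found = find_text_editor(child)
        if found is not None:
            return found
    return None


def fill_or_store(root: ViewNode, target_text: str,
                  accessibility_ok: bool = True) -> FillResult:
    """Fill the input box, or store the text and produce a prompt on failure."""
    editor = find_text_editor(root) if accessibility_ok else None
    if editor is None:
        # No text editing control found, or the accessibility function failed.
        return FillResult(False, stored_text=target_text,
                          prompt="Could not fill the input box; the text was saved.")
    editor.text = target_text
    return FillResult(True)
```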

[0235] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working process of the above-described apparatus and unit can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.

[0236] In the several embodiments provided in this application, the coupling between the units can be electrical, mechanical or other forms of coupling.

[0237] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0238] Please refer to Figure 14, which shows a cloud server 1400 provided in an embodiment of this application, which is connected to an electronic device 110.

[0239] The cloud server 1400 can be used to acquire the data content to be analyzed obtained by the electronic device 110 based on the text generation methods shown in the foregoing embodiments, and to perform semantic analysis on the data content to be analyzed using a large model algorithm to obtain the target text information as the input content of the target user. Detailed descriptions can be found in the foregoing method embodiments, and will not be repeated here.

[0240] Large model algorithms can be deployed on cloud servers.

[0241] Please refer to Figure 15, which shows a structural block diagram of an electronic device provided in an embodiment of this application. The electronic device 110 may be a smartphone or tablet computer, etc. The electronic device 110 in this application may include one or more of the following components: a processor 111, a memory 112, and one or more application programs, wherein the processor 111 is electrically connected to the memory 112, and the one or more programs are configured to execute the methods described in the foregoing embodiments.

[0242] Processor 111 may include one or more processing cores. Processor 111 connects to various parts within the electronic device 110 using various interfaces and lines, and performs various functions and processes data of the electronic device 110 by running or executing instructions, programs, code sets, or instruction sets stored in memory 112, and by calling data stored in memory 112. Optionally, processor 111 may be implemented using at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), or Programmable Logic Array (PLA). Processor 111 may integrate one or more of a central processing unit (CPU), a graphics processing unit (GPU), and a modem. Specifically, the CPU mainly handles the operating system, user interface, and application programs; the GPU is responsible for rendering and drawing the displayed content; and the modem is used for wireless communication. It is understood that the modem may also not be integrated into processor 111 and may instead be implemented separately through a communication chip. Specifically, the methods described in the foregoing embodiments can be executed by one or more processors 111.

[0243] In some implementations, memory 112 may include random access memory (RAM) or read-only memory (ROM). Memory 112 can be used to store instructions, programs, code, code sets, or instruction sets. Memory 112 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function, instructions for implementing the various method embodiments described above, etc. The data storage area may store data created by the electronic device 110 during use.

[0244] Please refer to Figure 16, which shows a structural block diagram of a computer-readable storage medium provided in an embodiment of this application. The computer-readable storage medium 1600 stores program code that can be called by a processor to execute the methods described in the above method embodiments.

[0245] The computer-readable storage medium 1600 may be an electronic memory such as flash memory, EEPROM (Electrically Erasable Programmable Read-Only Memory), EPROM, hard disk, or ROM. Optionally, the computer-readable storage medium 1600 includes a non-transitory computer-readable storage medium. The computer-readable storage medium 1600 has storage space for program code 1610 that performs any of the method steps described above. This program code can be read from or written to one or more computer program products. The program code 1610 may, for example, be compressed in a suitable form.

[0246] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.

Claims

1. A text generation method, applied to an electronic device, characterized by comprising: when it is detected that the electronic device displays a target interface, in response to a confirmation operation of a target user on a target control, performing text recognition and image recognition on display content in the target interface to obtain data content to be analyzed, wherein the target interface is used to receive input content of the target user, and the data content to be analyzed comprises at least one of text content and image content; and performing semantic analysis on the data content to be analyzed by a large model algorithm to obtain target text information as the input content of the target user.
2. The method of claim 1, wherein the target control comprises a first interactive control, and the step of performing, when it is detected that the electronic device displays the target interface and in response to the confirmation operation of the target user on the target control, text recognition and image recognition on the display content in the target interface to obtain the data content to be analyzed comprises: when it is detected that the electronic device displays the target interface, displaying the first interactive control at a specified position of the target interface; and in response to a confirmation operation of the target user on the first interactive control, performing text recognition and image recognition on the display content in the target interface to obtain the data content to be analyzed.
3. The method according to claim 2, characterized in that the step of displaying, when it is detected that the electronic device displays the target interface, the first interactive control at a specified position of the target interface comprises: when it is detected that the electronic device displays the target interface, acquiring the display status of a virtual keyboard in the target interface; and if it is detected that the virtual keyboard has been displayed, displaying the first interactive control at the specified position of the target interface.
4. The method according to claim 3, characterized in that the specified position comprises a position above a display area of the virtual keyboard and adjacent to the virtual keyboard.
5. The method of claim 1, wherein, after the step of performing semantic analysis on the data content to be analyzed by a large model algorithm to obtain target text information as the input content of the target user, the method further comprises: displaying a second interactive control; and in response to a confirmation operation of the target user on the second interactive control, filling the target text information into an input box in the target interface for receiving input content of the target user.
6. The method according to claim 5, characterized in that the step of filling, in response to the confirmation operation of the target user on the second interactive control, the target text information into the input box in the target interface for receiving input content of the target user comprises: in response to the confirmation operation of the target user on the second interactive control, traversing the target interface to determine a text editing control corresponding to the target interface, wherein the text editing control comprises an input box for acquiring input content of the target user; and calling a data interface provided by an accessibility function to fill the target text information into the input box.
7. The method according to claim 5, characterized in that the method further comprises: in a case where the target text information is not successfully filled into the input box in the target interface for receiving input content of the target user, storing the target text information and generating prompt information.
8. The method of claim 7, wherein storing the target text information and generating prompt information, in the case where the target text information is not successfully filled into the input box in the target interface for receiving the input content of the target user, comprises: in the case where the text editing control is not found by traversing the view tree nodes of the target interface, or the accessibility function fails to start, determining that the target text information is not successfully filled into the input box in the target interface for receiving the input content of the target user; and in the case where the target text information is not successfully filled into the input box in the target interface for receiving the input content of the target user, storing the target text information and generating prompt information.
9. The method of claim 1, wherein the case that the target text information is stored and prompt information is generated in the case that the target text information is not successfully filled into the input box in the target interface for receiving the input content of the target user, comprising: in the case that the target text information is not successfully filled into the input box in the target interface for receiving the input content of the target user, the target text information is stored and prompt information is generated. In the case that the electronic device displays the target interface, the text recognition operation and the image recognition operation are performed on the display content in the target interface in response to the confirmation operation of the target user on the target control, and the data content to be analyzed is obtained.
10. The method of claim 1, wherein, in the case that the electronic device displays the target interface, the text recognition operation and the image recognition operation are performed on the display content in the target interface in response to the confirmation operation of the target user on the target control, and the data content to be analyzed is obtained.
11. The method of claim 1, wherein
12. The method of claim 1, wherein performing, in a case where it is detected that the electronic device displays the target interface and in response to the confirmation operation of the target user on the target control, the text recognition operation and the image recognition operation on the display content in the target interface to obtain the data content to be analyzed comprises: in a case where it is detected that the electronic device displays the target interface, in response to the confirmation operation of the target user on the target control, performing text recognition on the display content in the target interface through optical character recognition to obtain third text content; performing a screenshot operation on image content in the display content in the target interface to generate a fourth image; and taking the third text content and the fourth image as the data content to be analyzed.
13. The method of claim 1, wherein performing, in a case where it is detected that the electronic device displays the target interface and in response to the confirmation operation of the target user on the target control, the text recognition operation and the image recognition operation on the display content in the target interface to obtain the data content to be analyzed comprises: detecting whether an interface displayed by the electronic device matches a reference interface set in advance; in a case where the interface displayed by the electronic device matches the reference interface, determining that the electronic device displays the target interface; and in a case where it is detected that the electronic device displays the target interface, performing the text recognition operation and the image recognition operation on the display content in the target interface in response to the confirmation operation of the target user on the target control, to obtain the data content to be analyzed.
14. The method of claim 13, wherein performing, in a case where it is detected that the electronic device displays the target interface and in response to the confirmation operation of the target user on the target control, the text recognition operation and the image recognition operation on the display content in the target interface to obtain the data content to be analyzed comprises: recognizing scene information corresponding to the interface displayed by the electronic device; determining, based on pre-set scene interface matching data, a reference interface corresponding to the scene information, wherein the scene interface matching data includes at least one piece of standard scene information and a standard interface corresponding to each piece of standard scene information; and detecting whether the interface displayed by the electronic device matches the reference interface.
The method of claim 14, wherein performing, in a case where it is detected that the electronic device displays a target interface, the text recognition operation and the image recognition operation on the display content in the target interface in response to a confirmation operation of a target user on a target control, to obtain the data content to be analyzed, comprises: obtaining the scene interface matching data; in response to an update operation of the target user on the scene interface matching data, obtaining specified scene information determined by the target user and a specified interface corresponding to the specified scene information; adding the specified scene information and the specified interface to the scene interface matching data as new standard scene information and a new standard interface matching the new standard scene information, respectively, to update the scene interface matching data; and determining, based on the updated scene interface matching data, a reference interface matching the scene information.
The method of claim 15, wherein performing, in a case where it is detected that the electronic device displays a target interface, the text recognition operation and the image recognition operation on the display content in the target interface, to obtain the data content to be analyzed, comprises: in a case where standard scene information in the scene interface matching data is consistent with the new standard scene information, adding the new standard interface matching the new standard scene information to the scene interface matching data, and taking it as the standard interface corresponding to the consistent standard scene information.
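
The update of the scene interface matching data can be sketched as below. The sketch assumes the matching data is a dictionary from scene information to a single interface identifier, and that when the specified scene already exists the newly specified interface becomes its standard interface; that second point is only one possible reading of the preceding claim, whose wording is ambiguous. The function and identifier names are hypothetical.

```python
from typing import Dict


def update_matching_data(
    matching: Dict[str, str],
    specified_scene: str,
    specified_interface: str,
) -> Dict[str, str]:
    """Return updated matching data with the user-specified entry added.

    If the scene already exists, the specified interface becomes its
    standard interface (an assumption of this sketch).
    """
    updated = dict(matching)  # copy so the caller's data is left unchanged
    updated[specified_scene] = specified_interface
    return updated


matching = {"chat": "chat_input_interface"}

# The user adds a new scene together with its interface.
updated = update_matching_data(matching, "review", "review_form_interface")

# The reference interface is then determined from the updated data.
reference = updated.get("review")
```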
A text generation device, applied to an electronic device, the device comprising: an identification unit configured to, in a case where the electronic device displays a target interface, perform a text recognition operation and an image recognition operation on display content in the target interface in response to a confirmation operation of a target user on a target control, to obtain data content to be analyzed, wherein the target interface is configured to receive input content of the target user, and the data content to be analyzed comprises at least one of text content and image content; and a text generation unit configured to perform semantic analysis on the data content to be analyzed by using a large model algorithm, to obtain target text information as the input content of the target user.
A cloud server, wherein the cloud server is configured to obtain data content to be analyzed that is obtained by an electronic device based on the method of any one of claims 1-16, and to perform semantic analysis on the data content to be analyzed by using a large model algorithm, to obtain target text information as the input content of the target user.
An electronic device, comprising: one or more processors; a memory; and one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, and the one or more application programs are configured to execute the method of any one of claims 1-16.
A computer-readable storage medium, wherein the computer-readable storage medium stores program code, and the program code can be called and executed by a processor to execute the method of any one of claims 1-16.