Methods, apparatus, electronic devices, media, and computer programs for acquiring video-related content.
By converting video content into natural language text using a pre-trained model, the method enhances user intent understanding and content comprehensiveness, addressing the limitations of conventional methods in accurately recommending related content.
Patent Information
- Authority / Receiving Office: JP · JP
- Patent Type: Applications
- Current Assignee / Owner: BEIJING ZITIAO NETWORK TECH CO LTD
- Filing Date: 2024-09-29
- Publication Date: 2026-06-19
AI Technical Summary
Conventional content acquisition methods fail to accurately capture user intent due to limited information extraction from images and text, leading to insufficient accuracy in recommending related content, especially when videos are involved.
A method and apparatus that converts video content into natural language text using a pre-trained generative model, enabling the acquisition of video-related content including generated and retrieved videos and images, thereby enhancing user intent understanding and content comprehensiveness.
Improves the accuracy and comprehensiveness of content recommendation by leveraging video information, ensuring that the acquired content better matches user expectations and provides a wider range of relevant content options.
Description
Technical Field
Cross-reference to Related Applications
[0001] This application claims priority to a Chinese patent application filed with the China National Intellectual Property Administration on October 13, 2023, with application number 202311329594.6 and invention title "Method, Apparatus, Electronic Device, and Medium for Obtaining Video-Related Content", the entire content of which is incorporated herein by reference.
[0002] Embodiments of the present disclosure relate to the field of computers, and more specifically, to a method, apparatus, electronic device, and medium for obtaining video-related content.
Background Art
[0003] Content acquisition refers to identifying a piece of text or an image and automatically obtaining content related to that text or image as required. On social platforms, content generation and search are widely applied in the field of product recommendation: a user uploads a piece of text or an image to the platform, and the platform obtains content related to the uploaded text or image and recommends it to the user.
[0004] Currently, many related-content acquisition methods identify text characters or physical objects in images and then search for content related to them. The information contained in text and / or images is generally limited and cannot fully represent the user's intention, so the related content recommended by the platform is insufficiently accurate.
Summary of the Invention
[0005] Embodiments of the present disclosure provide a method, apparatus, electronic device, and computer-readable storage medium for obtaining video-related content.
[0006] A first aspect of this disclosure provides a method for obtaining video-related content. The method includes generating video-related natural language text based on a video. The method further includes obtaining video-related content based on the natural language text, the video-related content including at least one of the generated video, the retrieved video, and the retrieved image text. The method further includes displaying thumbnails corresponding to the natural language text and the video-related content.
[0007] A second aspect of this disclosure provides an apparatus for acquiring video-related content. The apparatus includes a natural language text generation module configured to generate video-related natural language text based on a video. The apparatus further includes a video-related content acquisition module configured to acquire video-related content based on the natural language text, wherein the video-related content includes at least one of the generated video, the retrieved video, and the retrieved image text. The apparatus further includes a video-related content display module configured to display thumbnails corresponding to the natural language text and the video-related content.
[0008] A third aspect of this disclosure provides an electronic device, which includes a processor and a memory coupled to the processor and storing instructions that, when executed by the processor, cause the electronic device to perform the method described in the first aspect.
[0009] A fourth aspect of this disclosure provides a computer-readable storage medium that stores one or more computer instructions that are executed by a processor to implement the method described in the first aspect.
[0010] This Summary is provided to introduce, in simplified form, a selection of concepts that are further described in the Modes for Carrying Out the Invention below. This Summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to limit the scope of the claimed subject matter.
Brief Description of the Drawings
[0011] The above and other features, advantages, and aspects of each embodiment of the present disclosure will become clearer with reference to the following detailed description taken in conjunction with the drawings. In the drawings, the same or similar reference numerals denote the same or similar elements.
[Figure 1] A schematic diagram of an exemplary environment in which a method for obtaining video-related content according to some embodiments of this disclosure can be implemented.
[Figure 2] A flowchart of a method for obtaining video-related content according to some embodiments of this disclosure.
[Figure 3] A flowchart of a method for obtaining video-related content according to some embodiments of this disclosure.
[Figure 4] A schematic diagram of a process for generating multiple natural language texts according to some embodiments of this disclosure.
[Figure 5] A flowchart of a process for generating a first extended word according to some embodiments of this disclosure.
[Figure 6A] A schematic diagram of displaying video-related content according to some embodiments of this disclosure.
[Figure 6B] A schematic diagram of an interface for the combined distribution of a video and video-related content according to some embodiments of this disclosure.
[Figure 7] A block diagram of an apparatus for acquiring video-related content according to some embodiments of this disclosure.
[Figure 8] A block diagram of an electronic device according to some embodiments of this disclosure.
[0012] In all drawings, the same or similar reference numerals represent the same or similar elements.
Modes for Carrying Out the Invention
[0013] The embodiments of this disclosure will be described in more detail below with reference to the drawings. While the drawings show several embodiments of this disclosure, this disclosure can be realized in various forms and should not be construed as being limited to the embodiments described herein. Rather, these embodiments are provided so that this disclosure can be understood more clearly and completely. The drawings and embodiments of this disclosure are illustrative only and should not be understood as limiting the scope of protection of this disclosure.
[0014] In the descriptions of the embodiments of this disclosure, the term “including” and its synonyms should be understood as open inclusion, i.e., “including, but not limited to.” The term “based on” should be understood as “based at least in part.” The term “one embodiment” or “the embodiment” should be understood as “at least one embodiment.” Unless explicitly stated, the terms “first,” “second,” etc., may refer to different or the same object. Other explicit and implicit definitions may also be included below.
[0015] Most content retrieval methods involve using text and / or images entered by the user. When a user enters text and / or images, the client sends the entered text and / or images to the server. The server identifies the characters and objects in the text and / or images, retrieves the content associated with the identified characters and objects from the database, and then sends the retrieved content to the client for user viewing.
[0016] However, conventional content acquisition methods cannot identify videos, and the information that images and text can carry is limited. Videos, on the other hand, can contain information such as real content, specific scenarios, and audio, and the amount of information in a video can be adjusted by changing the length of the video. Therefore, compared to text and images, videos can more accurately embody the user's intent. Moreover, most of the related content acquired by platforms consists of images and text, and related content in video format is scarce, so users may be unable to find the information they most want in the acquired related content.
[0017] To solve the above problems, embodiments of this disclosure provide a method for acquiring video-related content. The method converts the content contained in a video into input text, uses a model (e.g., a pre-trained generative model) to understand the input text and generate output text, and then acquires video-related content based on the output text, the video-related content including at least one of a generated video, a retrieved video, and retrieved image text. By using a model to understand the text derived from the video, the method improves the accuracy of video understanding and, in turn, the accuracy of understanding user intent. The video-related content acquired based on the output text can therefore better match user expectations and provide the answers the user desires. The acquired video-related content is not limited to image text but can also include a generated video and / or a retrieved video, which improves the comprehensiveness of the related content and thereby the user experience.
[0018] Figure 1 shows a schematic diagram of an exemplary environment 100 in which a related content acquisition method according to several embodiments of the present disclosure can be implemented. As shown in Figure 1, the exemplary environment 100 may include a server 102, which may be an individual server or a centralized or distributed server cluster (e.g., a cloud). In embodiments of the present disclosure, the video related content may be acquired by the server or by any other device having computing power, and the architecture and functionality of the server 102 described herein are for illustrative purposes only and should be understood not to imply any limitation of the scope of the present disclosure.
[0019] The exemplary environment 100 may further include a client 101, which may be a user terminal, a mobile device, a computer, etc. In an embodiment of the present disclosure, after obtaining video-related content, the server 102 transmits it to the client 101. The client 101 is a device for displaying video-related content, which may include a display screen. The user can view the video-related content transmitted from the server 102 through the display screen.
[0020] Referring to Figure 1, the server 102 may include a video processing module 103, a pre-trained generative model 104, and a related content acquisition module 105. The video processing module 103 extracts information from the video input by the user and converts the extracted information into text format to generate the input text for the pre-trained generative model 104. In an embodiment of the present disclosure, the pre-trained generative model 104 may be any model that acquires natural language processing capabilities through pre-training, for example, a large language model. The pre-trained generative model 104 processes the text supplied by the video processing module 103: it analyzes and understands the input text and outputs the understood text, thereby achieving a deep interpretation of the video information. The output text can accurately represent the user's intention. The related content acquisition module 105 receives the text output by the pre-trained generative model 104 and obtains video-related content based on that output text.
[0021] In embodiments of the present disclosure, the acquired video-related content includes at least one of the generated video, the retrieved image text, and the retrieved video. The diversified types can improve the probability that a user obtains target content by improving the comprehensiveness of the related content. For example, the video-related content may include a video generated by a model based on text, a video retrieved from a video website, or image text obtained from web pages and applications.
[0022] It should be understood that the architecture and functions in the exemplary environment 100 are described only for exemplary purposes and do not imply any limitation on the scope of the present disclosure. Embodiments of the present disclosure may be used in other environments having different structures and / or functions.
[0023] Hereinafter, the process according to embodiments of the present disclosure will be described in detail in conjunction with Figures 2 to 8. For ease of understanding, all the specific data mentioned in the following description are exemplary and not intended to limit the protection scope of the present disclosure. The embodiments described below may further include additional operations not shown and / or omit operations that are shown, and the scope of the present disclosure is not limited in this respect.
[0024] Figure 2 shows a flowchart of a method 200 for obtaining video-related content according to some embodiments of the present disclosure. In block 202, natural language text related to the video is generated based on the video. For example, as shown in Figure 1, the video processing module 103 may extract information from a user-input video, convert the extracted information into text format, and generate a first text. In embodiments of the present disclosure, the extracted video information may include objects and people present in the video, as well as text and / or audio information in the video. The more comprehensive the extracted video information, the more accurately the generated first text can express the meaning contained in the video. The information extracted by the video processing module 103 often consists of discrete phrases and text paragraphs and may not fully express what the user means to convey in the video. The pre-trained generative model 104 can be used to understand the extracted information and generate natural language text. In some embodiments, the first text may be converted into a second text in natural language format, i.e., the discrete phrases and text paragraphs in the first text may be connected into complete sentences that describe the user's intent.
[0025] To facilitate understanding, consider an example. If a user wants to know how to purchase a ticket at a train station, the first text may contain "train station" and "ticket purchase." The second text that the pre-trained generative model 104 outputs after understanding the first text may be "How to use an automatic ticket machine?" or "How to purchase a ticket online?". This example is illustrative and should not be understood as limiting the scope of protection of this disclosure.
[0026] In block 204, video-related content is obtained based on the natural language text, and the video-related content includes at least one of a generated video, a retrieved video, and retrieved image text. For example, referring to Figure 1, the related content acquisition module 105, after receiving the natural language text, acquires video-related content based on the meaning of the natural language text. In embodiments of this disclosure, the related content may be obtained by searching the platform resource library; specifically, keywords may be extracted from the natural language text and the platform resource library may be searched based on those keywords. The related content may also be a video that is automatically generated based on the second text.
[0027] In block 206, thumbnails corresponding to natural language text and video-related content are displayed. For example, referring to Figure 1, both natural language text and video-related content may be displayed on the display screen of client 101, and a thumbnail of the video-related content related to the natural language text is displayed at the position corresponding to the natural language text. Each thumbnail corresponds to one video-related content, and the thumbnail may be a representative image of the video-related content. The user can view the related content corresponding to the thumbnail by selecting the thumbnail.
[0028] Therefore, method 200 according to embodiments of this disclosure makes it possible to acquire video-related content, and employing a pre-trained generative model makes it possible to achieve a deep interpretation of video information and improve the understanding of user intent. The acquired video-related content includes at least one of a generated video, retrieved image text, and a retrieved video, and this diversity of types improves the comprehensiveness of the related content and thus the probability that the user obtains the target content.
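As an illustrative, non-limiting sketch, the three blocks of method 200 can be chained as follows. The names `RelatedContent`, `run_method_200`, `generate_text`, and `acquire_content` are hypothetical and do not appear in this disclosure; the text generator and the content source are passed in as plain callables so that the sketch makes no assumption about how either is implemented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RelatedContent:
    # One of "generated_video", "retrieved_video", "retrieved_image_text".
    kind: str
    title: str
    # Hypothetical path or URL of a representative image.
    thumbnail: str


def run_method_200(
    video: bytes,
    generate_text: Callable[[bytes], str],
    acquire_content: Callable[[str], List[RelatedContent]],
) -> Dict[str, object]:
    """Chain blocks 202, 204, and 206 of method 200."""
    # Block 202: generate natural language text from the video.
    text = generate_text(video)
    # Block 204: obtain video-related content based on that text.
    content = acquire_content(text)
    # Block 206: return what the client would display, i.e. the text
    # together with one thumbnail per piece of related content.
    return {"text": text, "thumbnails": [c.thumbnail for c in content]}
```

In practice, `generate_text` would wrap the video processing module 103 and the pre-trained generative model 104, and `acquire_content` would wrap the related content acquisition module 105.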
[0029] The process of acquiring video-related content is explained in detail below with reference to Figures 3 to 5. Figure 3 shows the specific steps for acquiring video-related content and gives an example of processing a video and acquiring related content based on a second text. Figure 4 shows an example that includes multiple pre-trained generative models, which are used to expand the second text in multiple ways so that the acquired related content is more comprehensive. Figure 5 shows the process of generating the first extended word, which is used to expand the first text so that the extracted video information is more comprehensive.
[0030] Figure 3 shows a flowchart of a method 300 for acquiring video-related content according to some embodiments of the present disclosure. In block 301, for example, as shown in Figure 1, client 101 receives video sent from the user and transmits it to server 102, where video processing module 103 extracts text from the video and generates first text.
[0031] Blocks 302 to 310 illustrate an exemplary video processing process. In block 302, the video processing module 103 may first identify and analyze frames of the video, and the frames may be sampled at different time intervals as needed. The higher the frequency of frame identification, the more images are obtained, so more information can be extracted from the video and the video can be described more accurately.
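The trade-off between sampling frequency and the number of frames obtained can be illustrated with a minimal sketch. The function name `sample_frame_indices` and its parameters are assumptions made for illustration only; the disclosure does not prescribe a particular sampling scheme.

```python
from typing import List


def sample_frame_indices(total_frames: int, fps: float, interval_s: float) -> List[int]:
    """Return the indices of the frames to identify, one every `interval_s` seconds."""
    if total_frames < 0 or fps <= 0 or interval_s <= 0:
        raise ValueError("total_frames must be >= 0; fps and interval_s must be > 0")
    # Number of source frames between two sampled frames (at least 1).
    step = max(1, round(fps * interval_s))
    return list(range(0, total_frames, step))
```

For a 10-second clip at 30 frames per second, sampling once per second yields 10 frames, while sampling every half second yields 20 frames: a shorter interval gives more images to analyze at a higher processing cost.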
[0032] In block 303, deduplication is performed on the identified video frames. In embodiments of this disclosure, deduplication may employ image deduplication techniques, for example, by comparing the similarity of two adjacent video frames (an earlier frame and a later frame) in a set of video frames. If the similarity is higher than a first threshold, the later video frame is deleted from the set. Deduplicating the video and removing redundant video frames reduces the amount of data to be processed in the video content identification step and improves processing efficiency by avoiding repeated content identification of the same video frame.
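A minimal sketch of threshold-based deduplication follows. It assumes that each frame is a flat sequence of grayscale pixel values in the range 0 to 255 and that similarity is one minus the normalized mean absolute pixel difference; both are illustrative choices, not the similarity measure of this disclosure. Each frame is compared with the most recently kept frame, and the names `similarity`, `deduplicate`, and `first_threshold` are hypothetical.

```python
from typing import List, Sequence


def similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Similarity in [0, 1]: 1.0 for identical frames, 0.0 for maximally different ones."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("frames must be non-empty and of equal size")
    total_diff = sum(abs(x - y) for x, y in zip(a, b))
    return 1.0 - total_diff / (255.0 * len(a))


def deduplicate(frames: List[Sequence[int]], first_threshold: float = 0.95) -> List[Sequence[int]]:
    """Drop every later frame that is too similar to the previously kept frame."""
    kept: List[Sequence[int]] = []
    for frame in frames:
        if kept and similarity(kept[-1], frame) > first_threshold:
            # Similarity exceeds the first threshold: delete the later frame.
            continue
        kept.append(frame)
    return kept
```

Only the frames that survive this pass are forwarded to content identification, which is where the saving in processing comes from.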
[0033] In block 304, the content contained in the multiple deduplicated frames obtained after deduplication may be identified. In embodiments of this disclosure, the identified video content may include image text in the video, and the image text may be titles, subtitles, etc., in the deduplicated frames. Text identification may employ an Optical Character Recognition (OCR) service to identify character information in the image, including titles, subtitles, point descriptions, etc. In block 305, the identified video content may further include objects in the video, which may be people, houses, cars, and other objects present in the deduplicated frame images, as well as various actions, scenarios, etc.
[0034] In block 306, after the objects and image text in the video have been identified, the identified objects are converted into text (the subject text), keywords are extracted from the subject text and the image text, and a key phrase is obtained. In generating the key phrase, the subject text may be produced using a conventional text conversion method, and the key phrase may be produced using a conventional keyword extraction method.
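As a non-limiting sketch of block 306, the following shows one very simple keyword extraction approach: tokenize, remove stop words, and keep each remaining word once in order of first appearance. The stop-word list and the name `extract_keywords` are assumptions for illustration; any conventional keyword extraction method could be substituted.

```python
import re
from typing import Iterable, List

# Small illustrative stop-word list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "at", "in", "on", "of", "to", "and", "is", "are", "for", "with"}


def extract_keywords(texts: Iterable[str]) -> List[str]:
    """Collect unique, non-stop-word tokens from subject text and image text."""
    seen = set()
    keywords: List[str] = []
    for text in texts:
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            if token in STOPWORDS or token in seen:
                continue
            seen.add(token)
            keywords.append(token)
    return keywords
```

Calling `extract_keywords` with a subject text such as "a person at a ticket machine" and an image text such as "Ticket Office" returns a single de-duplicated keyword list drawn from both sources.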
[0035] In block 307, in some embodiments, the method for generating the first text further includes extracting audio from the video, i.e., separating the audio portion from the video. In embodiments of the present disclosure, long audio may be split into segments, and in some embodiments a time-division method is employed for the split. The time-division method simply splits the audio in the video at predetermined intervals. For example, the audio may be split every 30 seconds, so that each 30 seconds of audio forms one segment. The advantage of the time-division method is that it is simple and quick to perform.
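The time-division method can be sketched as computing segment boundaries at a fixed interval. The function name `split_by_time` and the representation of segments as (start, end) pairs in seconds are assumptions for illustration.

```python
from typing import List, Tuple


def split_by_time(duration_s: float, segment_s: float = 30.0) -> List[Tuple[float, float]]:
    """Split audio of length `duration_s` into consecutive segments of `segment_s` seconds."""
    if duration_s < 0 or segment_s <= 0:
        raise ValueError("duration_s must be >= 0 and segment_s must be > 0")
    segments: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration_s:
        # The final segment may be shorter than segment_s.
        end = min(start + segment_s, duration_s)
        segments.append((start, end))
        start = end
    return segments
```

A 75-second audio track is split into segments covering 0 to 30, 30 to 60, and 60 to 75 seconds. Because the boundaries ignore the audio content, a boundary may fall in the middle of a sentence.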
[0036] In some embodiments, the spectral method may be used for segmentation. The spectral method is a method for determining signal performance by observing the change in the amplitude of a signal along with its frequency characteristics (amplitude-frequency response). In some cases, it further includes the change in the phase of the signal along with its frequency characteristics (phase-frequency response). Therefore, in some embodiments, by using the spectral method to observe and segment the characteristics of the audio signal, it is possible to ensure that each segmented audio is a complete sentence.
[0037] In block 308, speech recognition is performed on the extracted audio. In embodiments of this disclosure, speech recognition may be performed using Automatic Speech Recognition (ASR) technology or another common speech recognition technology. The technology may be chosen as needed, provided that it can recognize the audio of the video. Speech recognition may also be performed using a machine learning model. In block 309, the recognized audio is converted into text to generate a text paragraph 310. In embodiments of this disclosure, speech-to-text conversion may be performed using any speech-to-text conversion algorithm or technology in the art, whether currently known or developed in the future. The algorithm may be chosen as needed, provided that it achieves speech-to-text conversion.
[0038] In block 311, the key phrases and text paragraphs are input to a pre-trained generative model as the first text. The pre-trained generative model understands the first text (i.e., the text input) and converts it into a second text in natural language form (i.e., the text output); that is, it connects the discrete phrases and text paragraphs in the first text into complete sentences that describe the user's intent. Using the pre-trained generative model to understand the text makes it possible to generate natural language text that is richer in content and more semantically accurate for subsequent video generation or search.
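The following sketch shows one possible way to assemble the first text and hand it to a model. The prompt wording, the "Key phrases" and "Transcript" labels, and the names `build_first_text` and `generate_second_text` are all hypothetical; the model is represented as an arbitrary callable from string to string, so no particular model API is assumed.

```python
from typing import Callable, List


def build_first_text(key_phrases: List[str], text_paragraphs: List[str]) -> str:
    """Combine key phrases (from frames) and text paragraphs (from audio) into the first text."""
    lines = ["Key phrases: " + ", ".join(key_phrases)]
    lines += ["Transcript: " + paragraph for paragraph in text_paragraphs]
    return "\n".join(lines)


def generate_second_text(first_text: str, model: Callable[[str], str]) -> str:
    """Ask the model to turn discrete fragments into a complete sentence."""
    prompt = (
        "Rewrite the following fragments extracted from a video as one complete "
        "sentence describing what the user wants to know.\n" + first_text
    )
    # `model` stands in for a pre-trained generative model (assumption).
    return model(prompt).strip()
```

With key phrases such as "train station" and "ticket purchase", a suitable model might return a question like "How do I use an automatic ticket machine?", which then serves as the second text.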
[0039] In block 312, keywords are extracted from the second text to generate search terms. For example, the second text could be "How do I use an automatic ticket machine?", and the extracted keywords could be "ticket" and "automatic ticket machine". In block 313, the video library is searched based on the search terms extracted from the second text, and the retrieved video 314 is obtained.
[0040] Alternatively or additionally, in block 315, an image text library is searched based on the search terms extracted from the second text. The image texts in the image text library may be historical image texts stored on the platform, or image texts stored in an external resource library connected to the platform. Searching the image text library based on the search terms yields the retrieved image text 316.
[0041] In block 317, the video-related content obtained in this disclosure further includes a generated video 318, which may be automatically generated by a content generation model that takes the second text as input. The content generation model may also be a model or algorithm that generates related content based on other input conditions or instructions, and may be selected as needed.
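Searching a library with the extracted search terms can be sketched as a case-insensitive substring match over item descriptions. The library layout (a mapping from item identifier to description) and the name `search_library` are assumptions for illustration; the same function can stand in for either the video library or the image text library.

```python
from typing import Dict, List


def search_library(library: Dict[str, str], search_terms: List[str]) -> List[str]:
    """Return the identifiers of items whose description contains any search term."""
    terms = [term.lower() for term in search_terms]
    hits: List[str] = []
    for item_id, description in library.items():
        text = description.lower()
        if any(term in text for term in terms):
            hits.append(item_id)
    return hits
```

A production system would more likely use an inverted index or a vector search service; the linear scan here only illustrates the relationship between search terms and retrieved items.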
[0042] In block 319, a recommendation list is calculated and related content is returned. If the retrieved video-related content includes multiple pieces of content, the relevance of each piece of video-related content to the video is determined, and the ranking of each piece of content in the recommendation list is determined based on the relevance. For example, if the video-related content includes two searched videos, the relevance of each of the two searched videos to the video entered by the user is calculated, and the ranking of each piece of video in the recommendation list is determined based on the relevance, with those with a higher relevance ranking higher in the recommendation list and those with a lower relevance ranking lower in the recommendation list.
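A minimal sketch of the ranking in block 319 follows. The disclosure does not specify how relevance is computed, so the sketch assumes, purely for illustration, the Jaccard overlap between the keyword set of the input video and the keyword set of each candidate. The names `relevance` and `build_recommendation_list` are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def relevance(query_terms: Set[str], item_terms: Set[str]) -> float:
    """Jaccard overlap between two keyword sets (illustrative relevance measure)."""
    union = query_terms | item_terms
    if not union:
        return 0.0
    return len(query_terms & item_terms) / len(union)


def build_recommendation_list(
    query_terms: Set[str], candidates: Dict[str, Set[str]]
) -> List[Tuple[str, float]]:
    """Score every candidate and sort so that higher relevance ranks higher."""
    scored = [(cid, relevance(query_terms, terms)) for cid, terms in candidates.items()]
    # sorted() is stable, so candidates with equal relevance keep their input order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Any other relevance score, for example one produced by an embedding model, could be substituted without changing the ranking step.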
[0043] Figure 4 shows a schematic diagram of a process 400 for generating multiple natural language texts according to some embodiments of the present disclosure. Embodiments of the present disclosure may include multiple pre-trained generative models, for example a first pre-trained generative model 401, a second pre-trained generative model 402, and a third pre-trained generative model 403; the specific number of pre-trained generative models may be selected as needed at the time of implementation. The input to each of the three pre-trained generative models is the first text, and different models may understand the first text differently. Accordingly, the first pre-trained generative model 401 outputs the first natural language text 404 after understanding the first text, the second pre-trained generative model 402 outputs the second natural language text 405, and the third pre-trained generative model 403 outputs the third natural language text 406. For example, if the first text includes "train station" and "ticket purchase," then the first natural language text 404 could be "How do you buy a ticket at a train station?", the second natural language text 405 could be "How do you use an automatic ticket machine?", and the third natural language text 406 could be "How do you buy a ticket online?".
[0044] Setting up multiple pre-trained generative models allows the first text to be understood in different ways, producing multiple versions of the second text with different sentences and thereby broadening the range of acquired video-related content. In Figure 4, the first video-related content 407 is acquired based on the first natural language text 404, the second video-related content 408 is acquired based on the second natural language text 405, and the third video-related content 409 is acquired based on the third natural language text 406. Expanding the range of video-related content gives users a wider range of choices and increases the probability that users will find the information they are looking for.
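The fan-out of process 400 can be sketched as follows, with each model represented as a callable from first text to natural language text and content acquisition represented as a callable from text to a list of content identifiers. The name `fan_out` and both callable signatures are assumptions for illustration.

```python
from typing import Callable, Dict, List


def fan_out(
    first_text: str,
    models: List[Callable[[str], str]],
    acquire: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Run every model on the same first text and acquire content per distinct output."""
    results: Dict[str, List[str]] = {}
    for model in models:
        natural_language_text = model(first_text)
        # Two models may produce the same sentence; acquire content only once.
        if natural_language_text not in results:
            results[natural_language_text] = acquire(natural_language_text)
    return results
```

The returned mapping pairs each natural language text (such as texts 404 to 406) with its own video-related content (such as content 407 to 409), which is also the grouping needed when thumbnails are displayed per output text.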
[0045] Figure 5 is a flowchart of a process 500 for generating a first extended word according to some embodiments of the present disclosure. In block 501, a first extended word semantically related to the target text (i.e., the subject text described above) and / or the image text is determined based on a predetermined template. In embodiments of the present disclosure, an extended word for a number may include one or more other numbers, an extended word for a letter may include one or more other letters, and an extended word for a word may include one or more other words. The number or type of extended words can be determined by experience or as needed. An extended word may include at least one of antonyms, synonyms, superordinates, and subordinates.
[0046] In block 502, a phrase is generated based on the target text, the image text, and the first extended word. After the first extended word is generated, keywords may be extracted from the target text, the image text, and the first extended word to generate a key phrase. According to embodiments of this disclosure, keyword extraction can be performed using any known or otherwise suitable word segmentation algorithm in the art. The first extended word expands the information content of the first text, which is the input data for the pre-trained generative model. Enriching this input data can improve the accuracy of the text output by the pre-trained generative model and thus the accuracy of understanding the user's intent.
[0047] In some embodiments of this disclosure, a second extended word that is semantically related to the text paragraph is determined based on a pre-configured semantic library. This increases the amount of information in the text obtained from speech, improving the accuracy of the text output by the pre-trained generative model and thus the accuracy of understanding the user's intent.
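The expansion step can be sketched with a lookup table standing in for the predetermined template. The table contents, its layout (word, then relation, then related words), and the name `extended_words` are hypothetical and only illustrate how synonyms, antonyms, superordinates, and subordinates might be added to the extracted words.

```python
from typing import Dict, List, Sequence

# Hypothetical template: word -> relation -> related words.
TEMPLATE: Dict[str, Dict[str, List[str]]] = {
    "ticket": {"synonyms": ["fare"], "superordinates": ["travel document"]},
    "train": {"subordinates": ["high-speed train"]},
}

RELATIONS = ("synonyms", "antonyms", "superordinates", "subordinates")


def extended_words(
    words: Sequence[str],
    template: Dict[str, Dict[str, List[str]]],
    relations: Sequence[str] = RELATIONS,
) -> List[str]:
    """Look up each word in the template and collect its related words without duplicates."""
    expanded: List[str] = []
    for word in words:
        entry = template.get(word.lower(), {})
        for relation in relations:
            for candidate in entry.get(relation, []):
                if candidate not in expanded and candidate not in words:
                    expanded.append(candidate)
    return expanded
```

The extended words are then merged with the original words before keyword extraction, so the first text passed to the model carries more context than the raw extraction alone.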
[0048] Figure 6A illustrates a schematic diagram of displaying video-related content according to some embodiments of the present disclosure. As shown in Figure 6A, the client device can display a user interface 601 on which a shooting control 602 is displayed, and in response to a user touch operation on the shooting control 602, the client starts shooting. After shooting is completed, the client uploads the recorded video to the server, the server retrieves related content based on the video and returns it to the client, and the video-related content and video playback control 613 are displayed on the user interface 601. In embodiments of the present disclosure, the video-related content includes a first output text (e.g., first question) and a second output text (e.g., second question), and the specific number of output texts may be selected as needed in practice. The first and second output texts represent different meanings. The position corresponding to the first output text displays the first set of thumbnails related to the first output text, and the position corresponding to the second output text displays the second set of thumbnails related to the second output text. Each thumbnail corresponds to one video-related content, and each thumbnail specifically represents a representative image of the video-related content. Based on the meanings expressed in the first and second output texts, the user determines which set of thumbnails best represents the video-related content they intend to view. By selecting the closest thumbnail, the user can view the related content corresponding to that thumbnail.
[0049] In the embodiments of this disclosure, if the user's intent in inputting the video is to ask a question, the first output text may be a first question such as "How do I buy a ticket at a train station?", and the second output text may be a second question such as "How do I buy a ticket online?". The two texts represent two questions with different meanings. The related content represented by the first set of thumbnails is the answer to the first question, and the related content represented by the second set of thumbnails is the answer to the second question. Because the video-related content is separated by intent, users can intuitively browse the information contained in the related content, which improves browsing and search efficiency and enhances the user experience.
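One way to picture the per-question grouping described above is the following minimal sketch. The class names `Thumbnail` and `OutputTextGroup`, the function `build_groups`, and the sample identifiers are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class Thumbnail:
    content_id: str  # hypothetical identifier of one piece of related content
    title: str


@dataclass
class OutputTextGroup:
    question: str  # an output text, e.g. the first or second question
    thumbnails: List[Thumbnail] = field(default_factory=list)


def build_groups(results: Iterable[Tuple[str, str, str]]) -> List[OutputTextGroup]:
    """Group (question, content_id, title) rows into one thumbnail set per question."""
    groups: Dict[str, OutputTextGroup] = {}
    for question, content_id, title in results:
        group = groups.setdefault(question, OutputTextGroup(question))
        group.thumbnails.append(Thumbnail(content_id, title))
    # dict preserves insertion order, so groups appear in first-seen order.
    return list(groups.values())


rows = [
    ("How do I buy a ticket at a train station?", "v1", "Using the ticket machine"),
    ("How do I buy a ticket online?", "v2", "Booking in the app"),
    ("How do I buy a ticket at a train station?", "v3", "Buying at the counter"),
]
for g in build_groups(rows):
    print(g.question, [t.content_id for t in g.thumbnails])
```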
[0050] In some embodiments of this disclosure, at least one of a first classification control 604, a second classification control 605, and a third classification control 606 is displayed on the user interface 601. The thumbnails displayed in correspondence with the first classification control 604 represent generated videos in the related content, the thumbnails displayed in correspondence with the second classification control 605 represent retrieved videos in the related content, and the thumbnails displayed in correspondence with the third classification control 606 represent retrieved image text in the related content. For example, in response to a user's touch operation on the first classification control 604, thumbnails of generated videos are displayed on the user interface 601. The classification controls separate the video-related content by type, further improving the efficiency with which users can find target content.
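The behavior of the classification controls can be sketched as a simple filter over typed content items. The enum `ContentKind`, the tuple `Item`, and the function `filter_by_kind` are illustrative names, not part of the disclosure; the mapping of enum members to controls 604 to 606 follows the paragraph above.

```python
from enum import Enum
from typing import List, NamedTuple


class ContentKind(Enum):
    GENERATED_VIDEO = "generated_video"            # shown via classification control 604
    RETRIEVED_VIDEO = "retrieved_video"            # shown via classification control 605
    RETRIEVED_IMAGE_TEXT = "retrieved_image_text"  # shown via classification control 606


class Item(NamedTuple):
    content_id: str  # hypothetical identifier
    kind: ContentKind


def filter_by_kind(items: List[Item], kind: ContentKind) -> List[Item]:
    """Return only the items whose type matches the touched classification control."""
    return [item for item in items if item.kind is kind]


items = [
    Item("g1", ContentKind.GENERATED_VIDEO),
    Item("r1", ContentKind.RETRIEVED_VIDEO),
    Item("t1", ContentKind.RETRIEVED_IMAGE_TEXT),
    Item("g2", ContentKind.GENERATED_VIDEO),
]
print([i.content_id for i in filter_by_kind(items, ContentKind.GENERATED_VIDEO)])
```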
[0051] In some embodiments of the present disclosure, the thumbnails of the video-related content are sorted according to the degree of relevance between the corresponding content and the video. For example, the degree of relevance between the video and the content corresponding to the first thumbnail 607 is greater than the degree of relevance between the video and the content corresponding to the second thumbnail 608.
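A minimal sketch of this ordering follows, assuming each piece of content already carries a numeric relevance score. How the score is computed is not specified in this passage, so the scores and the function name `order_by_relevance` are placeholders.

```python
from typing import List, Tuple


def order_by_relevance(scored: List[Tuple[str, float]]) -> List[str]:
    """Return content ids sorted from most to least relevant to the input video."""
    # sorted() is stable, so items with equal scores keep their original order.
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [content_id for content_id, _ in ranked]


# Hypothetical (content_id, relevance) pairs.
print(order_by_relevance([("a", 0.2), ("b", 0.9), ("c", 0.5)]))
```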
[0052] In some embodiments of this disclosure, the user interface 601 further displays a sixth control 603, a seventh control 609, an eighth control 610, a first control 611, and a re-recording control 612. In response to the user's touch operation on the sixth control 603, the video input by the user is distributed. In response to the user's touch operation on the seventh control 609, more video-related content is displayed. In response to the user's touch operation on the eighth control 610, a human customer service interface is displayed. In response to the user's touch operation on the first control 611, a second control, a third control, and / or a fourth control are displayed. The second control is configured to trigger the display of a text question interface when touched, the third control is configured to trigger the display of an image text question interface when touched, and the fourth control is configured to trigger the display of a video question interface when touched, allowing the user to ask questions by text or image text, or to ask again using video. In response to the user's touch operation on the re-recording control 612, the interface returns to the video recording page, allowing the user to re-record the video.
[0053] Figure 6B shows a schematic diagram of the interface for combined distribution of the video and video-related content according to some embodiments of the present disclosure. After the user selects a first thumbnail, the user interface 601 displays the interface shown in Figure 6B, which displays a video thumbnail 614 and another thumbnail 615. In response to the user's touch operation on the fifth control 616, the video and the related content corresponding to the other thumbnail 615 can be merged into a single piece of content and distributed together. In this manner, answer videos related to a question video can be retrieved, and a single new, complete video can be quickly generated and distributed.
[0054] Figure 7 shows a block diagram of a device 700 for acquiring video-related content according to some embodiments of the present disclosure. As shown in Figure 7, the device 700 includes a natural language text generation module 702, a video-related content acquisition module 704, and a video-related content display module 706. The natural language text generation module 702 is configured to generate natural language text related to the video based on the video. The video-related content acquisition module 704 is configured to acquire video-related content based on the natural language text, the video-related content including at least one of a generated video, a retrieved video, and retrieved image text. The video-related content display module 706 is configured to display the natural language text and thumbnails corresponding to the video-related content.
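The three-module arrangement can be sketched as a small pipeline in which each module is a pluggable callable. Everything below, including the class name `Device700`, the stub functions, and their return values, is a hypothetical illustration of the data flow between modules 702, 704, and 706, not the implementation of the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RelatedContent:
    kind: str   # e.g. "generated_video", "retrieved_video", "retrieved_image_text"
    title: str


@dataclass
class Device700:
    generate_text: Callable[[bytes], str]                      # stands in for module 702
    acquire_content: Callable[[str], List[RelatedContent]]     # stands in for module 704
    display: Callable[[str, List[RelatedContent]], List[str]]  # stands in for module 706

    def run(self, video: bytes) -> List[str]:
        text = self.generate_text(video)      # video -> natural language text
        content = self.acquire_content(text)  # text -> related content
        return self.display(text, content)    # text + thumbnails for the UI


# Stub modules with fixed outputs, used only to show the wiring.
def stub_generate_text(video: bytes) -> str:
    return "How do I buy a ticket at a train station?"


def stub_acquire(text: str) -> List[RelatedContent]:
    return [RelatedContent("retrieved_video", "Ticket machine walkthrough")]


def stub_display(text: str, content: List[RelatedContent]) -> List[str]:
    return [text] + [f"[{c.kind}] {c.title}" for c in content]


device = Device700(stub_generate_text, stub_acquire, stub_display)
print(device.run(b"raw video bytes"))
```

Keeping the modules as injected callables makes it straightforward to swap a stub for a real model or search backend without changing the pipeline.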
[0055] Figure 8 shows a block diagram of an electronic device 800 according to some embodiments of the present disclosure, and the device 800 may be the device or apparatus described in the embodiments of the present disclosure. As shown in Figure 8, the device 800 includes a central processing unit (CPU) and / or a graphics processing unit (GPU) 801, which can perform various appropriate operations and processes based on computer program instructions stored in read-only memory (ROM) 802 or computer program instructions loaded from storage unit 808 into random access memory (RAM) 803. Various programs and data required for the operation of the device 800 may be further stored in RAM 803. The CPU / GPU 801, ROM 802 and RAM 803 are connected to each other via a bus 804. An input / output (I / O) interface 805 is also connected to the bus 804. Although not shown in Figure 8, the device 800 may further include a coprocessor.
[0056] Multiple components in the device 800 are connected to the I / O interface 805 and include, for example, an input unit 806 such as a keyboard or mouse; an output unit 807 such as various types of displays or speakers; a storage unit 808 such as a magnetic disk or optical disk; and a communication unit 809 such as a network card, modem, or wireless communication transceiver. The communication unit 809 enables the device 800 to exchange information / data with other devices via computer networks such as the Internet and / or various telecommunication networks.
[0057] Each of the methods or processes described above may be executed by the CPU / GPU 801. For example, in some embodiments, the method may be implemented as a computer software program tangibly contained in a machine-readable medium, such as a storage unit 808. In some embodiments, part or all of the computer program may be loaded and / or installed into the device 800 via the ROM 802 and / or communication unit 809. Once the computer program is loaded into the RAM 803 and executed by the CPU / GPU 801, one or more steps or operations in the method or process described above can be performed.
[0058] In some embodiments, the methods and processes described above may be implemented as a computer program product. The computer program product may include a computer-readable storage medium on which computer-readable program instructions for performing each aspect of the present disclosure are contained.
[0059] A computer-readable storage medium may be a tangible device capable of holding and storing instructions used by an instruction execution device. A computer-readable storage medium may be, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disk read-only memory (CD-ROM), digital versatile disks (DVDs), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove on which instructions are stored, and any suitable combination of the above. The computer-readable storage medium used herein should not be interpreted as a transitory signal per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagated through waveguides or other transmission media (e.g., optical pulses through optical fiber cables), or electrical signals transmitted through wires.
[0060] The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to each computing / processing device, or downloaded to an external computer or external storage device via a network, such as the Internet, a local area network, a wide area network, and / or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and / or edge servers. A network adapter card or network interface in each computing / processing device receives computer-readable program instructions from the network, transfers the computer-readable program instructions, and stores them in a computer-readable storage medium in each computing / processing device.
[0061] Computer program instructions for performing the operations of the present disclosure may be assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, the programming languages including object-oriented programming languages and conventional procedural programming languages. Computer-readable program instructions may be executed entirely on a user computer, partially on a user computer as a stand-alone software package, partially on a user computer and partially on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user computer by any type of network, including a local area network (LAN) or wide area network (WAN), or it may be connected to an external computer (for example, via the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), can be customized by utilizing state information of the computer-readable program instructions, and the electronic circuit can realize each aspect of the present disclosure by executing the computer-readable program instructions.
[0062] These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, a dedicated computer, or other programmable data processing device to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing device, create a device that implements the functions / operations defined in one or more blocks in a flowchart and / or block diagram. These computer-readable program instructions may also be stored in a computer-readable storage medium and cause a computer, programmable data processing device, and / or other device to operate in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture containing instructions that implement each aspect of the functions / operations defined in one or more blocks in a flowchart and / or block diagram.
[0063] Computer-readable program instructions may also be loaded into a computer, another programmable data processing device, or other device, so that a series of operational steps is performed on the computer, other programmable data processing device, or other device to produce a computer-implemented process, whereby the instructions executed on the computer, other programmable data processing device, or other device realize the functions / operations defined in one or more blocks in a flowchart and / or block diagram.
[0064] The flowcharts and block diagrams in the drawings illustrate the architectures, functions, and operations of possible implementations of the devices, methods, and computer program products according to several embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or part of an instruction, which includes one or more executable instructions for implementing a defined logic function. In some alternative implementations, the functions noted in the blocks may occur in a different order than noted in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, or in some cases in reverse order, depending on the functions involved. Note that each block in the block diagram and / or flowchart, and combinations of blocks in the block diagram and / or flowchart, may be implemented by a dedicated hardware-based system that performs the defined function or operation, or by a combination of dedicated hardware and computer instructions.
[0065] While the embodiments of this disclosure have been described above, these descriptions are illustrative, not exhaustive, and are not limited to the embodiments disclosed. Many modifications and changes will be obvious to those skilled in the art without departing from the scope and spirit of the embodiments described. The terms used herein were chosen to best explain the principles of each embodiment, its practical applications, or its technical improvements over technologies found in the marketplace, or to enable those skilled in the art to understand the embodiments disclosed herein.
[0066] The following are some examples of implementations of this disclosure.
[0067] [Example 1] A method for obtaining video-related content, comprising: generating, based on a video, natural language text related to the video; obtaining, based on the natural language text, video-related content associated with the video, the video-related content including at least one of a generated video, a retrieved video, and retrieved image text; and displaying the natural language text and thumbnails corresponding to the video-related content.
[0068] [Example 2] The method according to Example 1, wherein generating the natural language text related to the video comprises: generating, based on the video, a first text containing a phrase related to the video; and generating, based on the first text, a second text that is the natural language text.
[0069] [Example 3] The method according to Example 1 or 2, wherein generating the first text related to the video comprises: extracting a plurality of video frames from the video; generating a plurality of deduplicated frames by removing duplicates from the plurality of video frames; and generating the first text related to the video by identifying at least one deduplicated frame among the plurality of deduplicated frames.
[0070] [Example 4] The method according to any one of Examples 1 to 3, wherein generating the first text related to the video comprises: identifying a target and image text included in the deduplicated frame; determining target text related to the target; and generating a phrase related to the video based on the target text and the image text.
[0071] [Example 5] The method according to any one of Examples 1 to 4, wherein generating the first text related to the video further comprises: extracting audio from the video; generating speech-to-text by performing speech recognition on the audio; and generating the first text related to the video based on the phrase and the speech-to-text.
[0072] [Example 6] The method according to any one of Examples 1 to 5, wherein generating the phrase related to the video comprises: determining, based on a predetermined template, a first extended word that is semantically related to the target text and / or the image text; and generating the phrase based on the target text, the image text, and the first extended word.
[0073] [Example 7] The method according to any one of Examples 1 to 6, wherein generating the speech-to-text comprises: converting recognized speech into a text paragraph; determining, based on a pre-configured semantic library, a second extended word that is semantically related to the text paragraph; and generating the speech-to-text based on the text paragraph and the second extended word.
[0074] [Example 8] The method according to any one of Examples 1 to 7, wherein generating the natural language text comprises: generating a first output text based on an understanding of the first text by a first pre-trained generative model; generating a second output text based on an understanding of the first text by a second pre-trained generative model; and generating the second text based on the first output text and the second output text.
[0075] [Example 9] The method according to any one of Examples 1 to 8, wherein obtaining the video-related content associated with the video comprises: obtaining the generated video from a content generation model based on the natural language text.
[0076] [Example 10] The method according to any one of Examples 1 to 9, wherein obtaining the video-related content associated with the video further comprises: determining, based on the meaning of the natural language text, a search term related to the natural language text; and obtaining the retrieved video from a video library based on the search term.
[0077] [Example 11] The method according to any one of Examples 1 to 10, wherein obtaining the video-related content associated with the video further comprises: obtaining the retrieved image text from an image text library based on the search term.
[0078] [Example 12] The method according to any one of Examples 1 to 11, further comprising: determining a degree of relevance between each piece of video-related content and the video; and determining an ordering of the pieces of video-related content based on the degree of relevance.
[0079] [Example 13] The method according to any one of Examples 1 to 12, wherein the natural language text includes at least a first question, and the method further comprises: displaying a first set of thumbnails of video-related content related to the first question; and displaying, in response to a user selecting a thumbnail from the first set of thumbnails, the video-related content corresponding to the selected thumbnail.
[0080] [Example 14] The method according to any one of Examples 1 to 13, wherein displaying the first set of thumbnails of video-related content related to the first question comprises: displaying a plurality of classifications related to the video-related content; and displaying, in response to a user selecting a first classification from among the plurality of classifications, thumbnails of video-related content associated with the first classification.
[0081] [Example 15] The method according to any one of Examples 1 to 14, further comprising displaying a video recording page in response to a user touch operation on a control for re-recording.
[0082] [Example 16] The method according to any one of Examples 1 to 15, further comprising displaying a second control, a third control and / or a fourth control in response to a user touch operation on a first control, wherein the second control is configured to trigger the display of a text question interface when touched, the third control is configured to trigger the display of an image text question interface when touched, and the fourth control is configured to trigger the display of a video question interface when touched.
[0083] [Example 17] The method according to any one of Examples 1 to 16, further comprising distributing a combined video including the video and the video-related content in response to a user touch operation on a fifth control.
[0084] [Example 18] An apparatus for acquiring video-related content, comprising: a natural language text generation module configured to generate, based on a video, natural language text related to the video; a video-related content acquisition module configured to acquire, based on the natural language text, video-related content associated with the video, the video-related content including at least one of a generated video, a retrieved video, and retrieved image text; and a video-related content display module configured to display the natural language text and thumbnails corresponding to the video-related content.
[0085] [Example 19] The apparatus according to Example 18, wherein the natural language text generation module comprises: a first text generation module configured to generate, based on the video, a first text containing a phrase related to the video; and a second text generation module configured to generate, based on the first text, a second text that is the natural language text.
[0086] [Example 20] The apparatus according to Example 18 or 19, wherein the natural language text generation module comprises: a video frame extraction module configured to extract a plurality of video frames from the video; a video frame deduplication module configured to generate a plurality of deduplicated frames by removing duplicates from the plurality of video frames; and a deduplicated frame identification module configured to generate the first text related to the video by identifying at least one deduplicated frame among the plurality of deduplicated frames.
[0087] [Example 21] The apparatus according to any one of Examples 18 to 20, wherein the first text generation module comprises: a target and text identification module configured to identify a target and image text included in the deduplicated frame; a target text conversion module configured to determine target text related to the target; and a phrase generation module configured to generate a phrase related to the video based on the target text and the image text.
[0088] [Example 22] The apparatus according to any one of Examples 18 to 21, wherein the first text generation module further comprises: an audio extraction module configured to extract audio from the video; a speech-to-text generation module configured to generate speech-to-text by performing speech recognition on the audio; and an associated text generation module configured to generate the first text related to the video based on the phrase and the speech-to-text.
[0089] [Example 23] The apparatus according to any one of Examples 18 to 22, wherein the phrase generation module comprises: a first extended word determination module configured to determine, based on a predetermined template, a first extended word that is semantically related to the target text and / or the image text; and a first phrase generation module configured to generate the phrase based on the target text, the image text, and the first extended word.
[0090] [Example 24] The apparatus according to any one of Examples 18 to 23, wherein the speech-to-text generation module comprises: a text paragraph conversion module configured to convert recognized speech into a text paragraph; a second extended word generation module configured to determine, based on a pre-configured semantic library, a second extended word that is semantically related to the text paragraph; and a first speech-to-text generation module configured to generate the speech-to-text based on the text paragraph and the second extended word.
[0091] [Example 25] The apparatus according to any one of Examples 18 to 24, wherein the natural language text generation module comprises: a first output text generation module configured to generate a first output text based on an understanding of the first text by a first pre-trained generative model; a second output text generation module configured to generate a second output text based on an understanding of the first text by a second pre-trained generative model; and a natural language text determination module configured to generate the second text based on the first output text and the second output text.
[0092] [Example 26] The apparatus according to any one of Examples 18 to 25, wherein the video-related content acquisition module comprises: a generated video acquisition module configured to obtain the generated video from a content generation model based on the natural language text.
[0093] [Example 27] The apparatus according to any one of Examples 18 to 26, wherein the video-related content acquisition module further comprises: a search term generation module configured to determine, based on the meaning of the natural language text, a search term related to the natural language text; and a retrieved video acquisition module configured to obtain the retrieved video from a video library based on the search term.
[0094] [Example 28] The apparatus according to any one of Examples 18 to 27, wherein the video-related content acquisition module further comprises: a retrieved image text module configured to obtain the retrieved image text from an image text library based on the search term.
[0095] [Example 29] The apparatus according to any one of Examples 18 to 28, further comprising: a relevance determination module configured to determine a degree of relevance between each piece of video-related content and the video; and an ordering determination module configured to determine an ordering of the pieces of video-related content based on the degree of relevance.
[0096] [Example 30] The apparatus according to any one of Examples 18 to 29, wherein the natural language text includes at least a first question, and the apparatus further comprises: a first thumbnail set display module configured to display a first set of thumbnails of video-related content related to the first question; and a video-related content display module configured to display, in response to a user selecting a thumbnail from the first set of thumbnails, the video-related content corresponding to the selected thumbnail.
[0097] [Example 31] The apparatus according to any one of Examples 18 to 30, wherein the first thumbnail set display module comprises: a classification display module configured to display a plurality of classifications related to the video-related content; and a thumbnail display module configured to display, in response to a user selecting a first classification from among the plurality of classifications, thumbnails of video-related content associated with the first classification.
[0098] [Example 32] The apparatus according to any one of Examples 18 to 31, further comprising a video recording page display module configured to display a video recording page in response to a user touch operation on a control for re-recording.
[0099] [Example 33] The apparatus according to any one of Examples 18 to 32, further comprising a control display module configured to display a second control, a third control and / or a fourth control in response to a user's touch operation on a first control, wherein the second control is configured to trigger the display of a text question interface when touched, the third control is configured to trigger the display of an image text question interface when touched, and the fourth control is configured to trigger the display of a video question interface when touched.
[0100] [Example 34] The apparatus according to any one of Examples 18 to 33, further comprising a combined video distribution module configured to distribute a combined video including the video and the video-related content in response to a user touch operation on a fifth control.
[0101] [Example 35] An electronic device, comprising: a processor; and a memory coupled to the processor, the memory storing instructions that, when executed by the processor, cause the electronic device to perform operations for acquiring video-related content, the operations comprising: generating, based on a video, natural language text related to the video; obtaining, based on the natural language text, video-related content associated with the video, the video-related content including at least one of a generated video, a retrieved video, and retrieved image text; and displaying the natural language text and thumbnails corresponding to the video-related content.
[0102] [Example 36] The electronic device according to Example 35, wherein generating the natural language text related to the video comprises: generating, based on the video, a first text containing a phrase related to the video; and generating, based on the first text, a second text that is the natural language text.
[0103] [Example 37] The electronic device according to Example 35 or 36, wherein generating the first text related to the video comprises: extracting a plurality of video frames from the video; generating a plurality of deduplicated frames by removing duplicates from the plurality of video frames; and generating the first text related to the video by identifying at least one deduplicated frame among the plurality of deduplicated frames.
[0104] [Example 38] The electronic device according to any one of Examples 35 to 37, wherein generating the first text related to the video comprises: identifying a target and image text included in the deduplicated frame; determining target text related to the target; and generating a phrase related to the video based on the target text and the image text.
[0105] [Example 39] The electronic device according to any one of Examples 35 to 38, wherein generating the first text related to the video further comprises: extracting audio from the video; generating speech-to-text by performing speech recognition on the audio; and generating the first text related to the video based on the phrase and the speech-to-text.
[0106] [Example 40] The electronic device according to any one of Examples 35 to 39, wherein generating the phrase related to the video comprises: determining, based on a predetermined template, a first extended word that is semantically related to the target text and / or the image text; and generating the phrase based on the target text, the image text, and the first extended word.
[0107] [Example 41] The electronic device according to any one of Examples 35 to 40, wherein generating the speech-to-text comprises: converting recognized speech into a text paragraph; determining, based on a pre-configured semantic library, a second extended word that is semantically related to the text paragraph; and generating the speech-to-text based on the text paragraph and the second extended word.
[0108] [Example 42] The electronic device according to any one of Examples 35 to 41, wherein generating the natural language text comprises: generating a first output text based on an understanding of the first text by a first pre-trained generative model; generating a second output text based on an understanding of the first text by a second pre-trained generative model; and generating the second text based on the first output text and the second output text.
[0109] [Example 43] The electronic device according to any one of Examples 35 to 42, wherein obtaining the video-related content associated with the video comprises: obtaining the generated video from a content generation model based on the natural language text.
[0110] [Example 44] The electronic device according to any one of Examples 35 to 43, wherein obtaining the video-related content associated with the video further comprises: determining, based on the meaning of the natural language text, a search term related to the natural language text; and obtaining the retrieved video from a video library based on the search term.
[0111] [Example 45] The electronic device according to any one of Examples 35 to 44, wherein obtaining the video-related content related to the video further comprises: obtaining the retrieved image text from an image text library based on the search term.
[0112] [Example 46] The electronic device according to any one of Examples 35 to 45, wherein the operations further comprise: determining a degree of relevance between each piece of the video-related content and the video; and determining an ordering of the pieces of video-related content based on the degree of relevance.
[0113] [Example 47] The electronic device according to any one of Examples 35 to 46, wherein the natural language text includes at least a first question, and the operations further comprise: displaying a first set of thumbnails of video-related content related to the first question; and in response to a user selecting a thumbnail from the first set of thumbnails, displaying the video-related content corresponding to the selected thumbnail.
[0114] [Example 48] The electronic device according to any one of Examples 35 to 47, wherein displaying the first set of thumbnails of video-related content related to the first question comprises: displaying a plurality of categories related to the video-related content; and in response to the user selecting a first category from among the plurality of categories, displaying thumbnails of the video-related content associated with the first category.
[0115] [Example 49] The electronic device according to any one of Examples 35 to 48, wherein the operations further comprise: displaying a video recording page in response to a user touch operation on a control for re-recording.
[0116] [Example 50] The electronic device according to any one of Examples 35 to 49, wherein the operations further comprise: displaying a second control, a third control, and/or a fourth control in response to a user touch operation on a first control, wherein the second control is configured to trigger the display of a text question interface when touched, the third control is configured to trigger the display of an image text question interface when touched, and the fourth control is configured to trigger the display of a video question interface when touched.
[0117] [Example 51] The electronic device according to any one of Examples 35 to 50, wherein the operations further comprise: delivering a combined video including the video and the video-related content in response to a user touch operation on a fifth control.
[0118] [Example 52] A computer-readable storage medium storing one or more computer instructions which, when executed by a processor, implement the method according to any one of Examples 1 to 17.
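The relevance-based ordering described in Example 46 can be illustrated with a minimal Python sketch. The disclosure does not state how the degree of relevance is computed, so this sketch assumes, purely for illustration, that each piece of video-related content carries a short textual description and that relevance is the Jaccard overlap between the words of that description and the words of the natural language text generated for the video. The names `ContentItem`, `tokens`, `jaccard`, and `rank_by_relevance` are hypothetical and do not appear in the disclosure.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class ContentItem:
    """Hypothetical record for one piece of video-related content."""
    kind: str         # e.g. "generated_video", "retrieved_video", "retrieved_image_text"
    description: str  # assumed textual description used only for scoring


def tokens(text: str) -> Set[str]:
    """Lower-case the text and split it on whitespace into a set of words."""
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Assumed relevance measure: shared words divided by all distinct words."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def rank_by_relevance(video_text: str, items: List[ContentItem]) -> List[ContentItem]:
    """Order items from most to least relevant; ties keep their input order."""
    return sorted(items, key=lambda item: jaccard(video_text, item.description), reverse=True)


candidates = [
    ContentItem("retrieved_video", "cooking pasta at home"),
    ContentItem("generated_video", "repot a cactus step by step"),
    ContentItem("retrieved_image_text", "cactus care"),
]
ranked = rank_by_relevance("how to repot a cactus", candidates)
print([item.kind for item in ranked])
```

Any other relevance measure, such as an embedding similarity, could be substituted for `jaccard` without changing the ordering step itself.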
[0119] Although the present disclosure has been described in language specific to structural features and/or methodological logical acts, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.
Claims
1. A method for obtaining video-related content, comprising: generating, based on a video, natural language text related to the video; obtaining, based on the natural language text, video-related content associated with the video, the video-related content including at least one of a generated video, a retrieved video, and retrieved image text; and displaying the natural language text and thumbnails corresponding to the video-related content.
2. The method according to claim 1, wherein generating the natural language text related to the video comprises: generating, based on the video, a first text containing a phrase related to the video; and generating, based on the first text, a second text that is the natural language text.
3. The method according to claim 2, wherein generating the first text related to the video comprises: extracting a plurality of video frames from the video; generating a plurality of deduplicated frames by removing duplicates from the plurality of video frames; and generating the first text related to the video by recognizing at least one deduplicated frame among the plurality of deduplicated frames.
4. The method according to claim 3, wherein generating the first text related to the video comprises: identifying a target and image text included in the deduplicated frame; determining target text related to the target; and generating the phrase related to the video based on the target text and the image text.
5. The method according to claim 4, wherein generating the first text related to the video further comprises: extracting audio from the video; performing speech recognition on the audio to generate speech text; and generating the first text related to the video based on the phrase and the speech text.
6. The method according to claim 4, wherein generating the phrase related to the video comprises: determining, based on a predetermined template, a first extended word that is semantically related to the target text and/or the image text; and generating the phrase based on the target text, the image text, and the first extended word.
7. The method according to claim 5, wherein generating the speech text comprises: converting the recognized speech into a text paragraph; determining, based on a pre-configured semantic library, a second extended word that is semantically related to the text paragraph; and generating the speech text based on the text paragraph and the second extended word.
8. The method according to claim 2, wherein generating the natural language text related to the video comprises: generating a first output text based on an understanding of the first text by a first pre-trained generative model; generating a second output text based on an understanding of the first text by a second pre-trained generative model; and generating the second text based on the first output text and the second output text.
9. The method according to claim 1, wherein obtaining the video-related content associated with the video comprises: obtaining the generated video by means of a content generation model based on the natural language text.
10. The method according to claim 9, wherein obtaining the video-related content associated with the video further comprises: determining, based on the meaning of the natural language text, a search term related to the natural language text; and obtaining the retrieved video from a video library based on the search term.
11. The method according to claim 10, wherein obtaining the video-related content associated with the video further comprises: obtaining the retrieved image text from an image text library based on the search term.
12. The method according to claim 1, further comprising: determining a degree of relevance between each piece of the video-related content and the video; and determining an ordering of the pieces of video-related content based on the degree of relevance.
13. The method according to claim 1, wherein the natural language text includes at least a first question, and the method further comprises: displaying a first set of thumbnails of video-related content related to the first question; and in response to a user selecting a thumbnail from the first set of thumbnails, displaying the video-related content corresponding to the selected thumbnail.
14. The method according to claim 13, wherein displaying the first set of thumbnails of video-related content related to the first question comprises: displaying a plurality of categories related to the video-related content; and in response to the user selecting a first category from among the plurality of categories, displaying thumbnails of the video-related content associated with the first category.
15. The method according to claim 13, further comprising displaying a video recording page in response to a user touch operation on a control for re-recording.
16. The method according to claim 13, further comprising displaying a second control, a third control, and/or a fourth control in response to a user touch operation on a first control, wherein the second control is configured to trigger the display of a text question interface when touched, the third control is configured to trigger the display of an image text question interface when touched, and the fourth control is configured to trigger the display of a video question interface when touched.
17. The method according to claim 13, further comprising delivering a combined video including the video and the video-related content in response to a user touch operation on a fifth control.
18. A device for acquiring video-related content, comprising: a natural language text generation module configured to generate, based on a video, natural language text related to the video; a video-related content acquisition module configured to acquire, based on the natural language text, video-related content associated with the video, the video-related content including at least one of a generated video, a retrieved video, and retrieved image text; and a video-related content display module configured to display the natural language text and thumbnails corresponding to the video-related content.
19. An electronic device, comprising: a processor; and a memory coupled to the processor and storing instructions which, when executed by the processor, cause the electronic device to perform the method according to any one of claims 1 to 17.
20. A computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the method described in any one of claims 1 to 17.
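As a non-limiting illustration of the frame deduplication step recited in claim 3, the following Python sketch removes near-identical frames before any recognition is performed. The claims do not specify how duplicates are detected; this sketch assumes that frames are non-empty grayscale pixel grids of equal size and that two frames are duplicates when their average-hash bit patterns differ in at most `max_distance` positions. The names `Frame`, `average_hash`, `hamming`, `deduplicate`, and `max_distance` are hypothetical.

```python
from typing import List

Frame = List[List[int]]  # assumed representation: rows of grayscale values, 0 to 255


def average_hash(frame: Frame) -> List[int]:
    """Return one bit per pixel: 1 if the pixel is brighter than the frame mean."""
    pixels = [p for row in frame for p in row]
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]


def hamming(a: List[int], b: List[int]) -> int:
    """Count the positions at which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def deduplicate(frames: List[Frame], max_distance: int = 0) -> List[Frame]:
    """Keep a frame only if its hash is farther than max_distance from every kept hash."""
    kept: List[Frame] = []
    kept_hashes: List[List[int]] = []
    for frame in frames:
        h = average_hash(frame)
        if all(hamming(h, k) > max_distance for k in kept_hashes):
            kept.append(frame)
            kept_hashes.append(h)
    return kept


frame_a = [[0, 255], [0, 255]]
frame_b = [[10, 250], [5, 245]]   # slightly different pixels, same bright/dark layout
frame_c = [[255, 0], [255, 0]]    # mirrored layout, so a distinct frame
unique_frames = deduplicate([frame_a, frame_b, frame_c])
print(len(unique_frames))
```

Raising `max_distance` makes the comparison more tolerant, so fewer frames survive and less recognition work remains, at the cost of possibly discarding frames that differ in small but meaningful ways.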