Video content organizing method and apparatus

By automatically acquiring key image frames from videos using electronic devices and generating video notes, the problem of time-consuming manual video content organization in existing technologies is solved, achieving efficient video content organization.

CN122269005A · Pending · Publication Date: 2026-06-23 · VIVO MOBILE COMM CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
VIVO MOBILE COMM CO LTD
Filing Date
2026-03-13
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

In existing technologies, users need to manually organize video content, which is cumbersome and time-consuming. In addition, traditional OCR technology has low recognition accuracy and cannot efficiently organize handwritten content in meeting or teaching videos.

Method used

By automatically acquiring key image frames from videos using electronic devices, corresponding video notes are generated based on these key image frames. The layout information of the note elements matches the layout information of the video content, reducing manual organization steps.

Benefits of technology

It eliminates the need for users to manually organize video notes, improving the efficiency of video content organization, reducing time consumption, and enhancing the efficiency of electronic devices in organizing video file content.

✦ Generated by Eureka AI based on patent content.


Abstract

This application discloses a video content organization method and apparatus, belonging to the technical field of electronic devices. The method comprises: receiving a first input on a to-be-processed video; in response to the first input, acquiring N key image frames from the to-be-processed video, where the key image frames are image frames in the to-be-processed video that include valid content, and N is a positive integer; and, based on the N key image frames, generating and displaying a video note corresponding to the to-be-processed video, where the layout information of the note elements in the video note matches the layout information of the valid content in the N key image frames.

Description

Technical Field

[0001] This application belongs to the field of electronic equipment technology, specifically relating to a method and apparatus for organizing video content.

Background Technology

[0002] In modern enterprise offices and remote collaboration, meetings are an important form of information exchange and decision-making. Participants typically use electronic devices such as mobile phones or cameras to record the whiteboard, presentation slides (PowerPoint), handwritten notes, and other materials during the meeting. Then, after the meeting, users can review the recorded video and manually organize the complete meeting content.

[0003] However, in the above methods, users usually need to repeatedly play the meeting video and manually pause it at key points to input the text, charts, hand-drawn content, etc. included in the video into note-taking software or documents one by one. This makes the process of organizing video content cumbersome and time-consuming, resulting in poor efficiency in organizing video files through electronic devices.

Summary of the Invention

[0004] The purpose of this application is to provide a video content organization method and apparatus that can improve the efficiency of electronic devices in organizing video files.

[0005] In a first aspect, embodiments of this application provide a video content organization method, which includes: receiving a first input to a video to be processed; in response to the first input, acquiring N key image frames in the video to be processed, wherein the key image frames are image frames in the video to be processed that include valid content, and N is a positive integer; and generating and displaying video notes corresponding to the video to be processed based on the N key image frames, wherein the layout information of the note elements in the video notes matches the layout information of the valid content in the N key image frames.

[0006] In a second aspect, embodiments of this application provide a video content organization device, which includes: a receiving module, an acquisition module, and a display module; the receiving module is used to receive a first input to a video to be processed; the acquisition module is used to acquire N key image frames in the video to be processed in response to the first input received by the receiving module, wherein the key image frames are image frames in the video to be processed that include valid content, and N is a positive integer; the display module is used to generate and display video notes corresponding to the video to be processed based on the N key image frames acquired by the acquisition module, wherein the layout information of the note elements in the video notes matches the layout information of the valid content in the N key image frames.

[0007] In a third aspect, embodiments of this application provide an electronic device including a processor and a memory, the memory storing programs or instructions executable on the processor, the programs or instructions, when executed by the processor, implementing the steps of the method described in the first aspect.

[0008] In a fourth aspect, embodiments of this application provide a readable storage medium on which a program or instructions are stored, which, when executed by a processor, implement the steps of the method described in the first aspect.

[0009] In a fifth aspect, embodiments of this application provide a chip, the chip including a processor and a communication interface, the communication interface being coupled to the processor, the processor being used to run programs or instructions to implement the method as described in the first aspect.

[0010] In a sixth aspect, embodiments of this application provide a computer program / program product stored in a storage medium, which is executed by at least one processor to implement the method described in the first aspect.

[0011] In the embodiments of this application, a first input to the video to be processed is received; in response to the first input, N key image frames in the video to be processed are acquired, where each key image frame contains valid content and N is a positive integer; based on the N key image frames, video notes corresponding to the video to be processed are generated and displayed, wherein the layout information of the note elements in the video notes matches the layout information of the valid content in the N key image frames. In this solution, when a user needs to obtain video information from the video to be processed, i.e., needs to obtain video notes corresponding to the video to be processed, the user can select the video to be processed through an electronic device, enabling the electronic device to acquire key image frames containing valid content from the video to be processed, and directly generate and display the video notes corresponding to the video to be processed based on these key image frames. That is, the electronic device can automatically generate corresponding video notes based on the video, thus eliminating the need for the user to manually organize the video notes, reducing the time spent organizing video content, and thereby improving the efficiency of the electronic device in organizing video files.

Attached Figure Description

[0012] Figure 1 is the first flowchart of a video content organization method provided in the embodiments of this application;

[0013] Figure 2A is a schematic diagram of an example of a video recording interface provided in an embodiment of this application;

[0014] Figure 2B is a schematic diagram of an example of a scanning function interface provided in an embodiment of this application;

[0015] Figure 3 is the second flowchart of a video content organization method provided in the embodiments of this application;

[0016] Figure 4 is the first example diagram of a video preview interface provided in the embodiments of this application;

[0017] Figure 5A is the second example diagram of a video preview interface provided in the embodiments of this application;

[0018] Figure 5B is the third example diagram of a video preview interface provided in the embodiments of this application;

[0019] Figure 5C is the fourth example diagram of a video preview interface provided in the embodiments of this application;

[0020] Figure 6 is an example diagram of a video notes preview interface provided in an embodiment of this application;

[0021] Figure 7 is the third flowchart of a video content organization method provided in the embodiments of this application;

[0022] Figure 8 is a schematic diagram of the structure of a video content organization device provided in an embodiment of this application;

[0023] Figure 9 is the first schematic diagram of the hardware structure of an electronic device provided in the embodiments of this application;

[0024] Figure 10 is the second schematic diagram of the hardware structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0025] The technical solutions of the embodiments of this application will be clearly described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those skilled in the art based on the embodiments of this application are within the scope of protection of this application.

[0026] The terms "first," "second," etc., used in this application's specification are used to distinguish similar objects and not to describe a specific order or sequence. It should be understood that such terms can be used interchangeably where appropriate so that embodiments of this application can be implemented in orders other than those illustrated or described herein, and the objects distinguished by "first," "second," etc., are generally of the same class, without limiting the number of objects. For example, a first object can be one or more, where "more" means at least two. Furthermore, in the specification, "and / or" indicates at least one of the connected objects, and the character " / " generally indicates that the preceding and following objects are in an "or" relationship.

[0027] The terms "at least one" and "at least one of" in this application's specification refer to any one, any two, or a combination of two or more of the included objects. For example, "at least one of a, b, and c" can mean "a", "b", "c", "a and b", "a and c", "b and c", and "a, b, and c", where a, b, and c can be single or multiple, and multiple means at least two. Similarly, "at least two" means two or more, and its meaning is similar to "at least one". The identifiers in this application are text, symbols, images, etc., used to indicate information, and can use controls or other containers as carriers for displaying information, including but not limited to text identifiers, symbol identifiers, and image identifiers.

[0028] The terminology used in the implementation section of this application is only for explaining specific embodiments of this application and is not intended to limit this application. The terminology involved in the embodiments of this application is explained below.

[0029] Controls: Elements in a graphical user interface that can receive user input to perform corresponding processing or display relevant data. Controls can include, but are not limited to, virtual buttons, sliders, progress bars, and checkboxes.

[0030] Interface: Refers to the medium through which users interact with electronic devices. The interface allows users to send commands to the system via input devices and receive feedback information via output devices. Input devices can be keyboards, mice, touchscreens, etc.; output devices can be monitors, speakers, etc.

[0031] Optical Character Recognition (OCR) is a technology that converts text information in images into editable and searchable electronic text. It automatically recognizes printed or handwritten characters by simulating the human visual system and combining image processing, pattern recognition, and machine learning algorithms.

[0032] In modern enterprise offices and remote collaboration, meetings are a crucial form of information exchange and decision-making. Participants typically use electronic devices such as mobile phones or cameras to record whiteboards, presentation slides, handwritten notes, and other materials during the meeting. Compared to photos, videos can completely record the dynamic process of the meeting, including the speaker's thought process, the evolution of whiteboard content, the order in which slides are turned, and other continuous information, thus preserving a more comprehensive picture of the meeting's information flow. After the meeting, users can review the recorded video to manually organize the complete meeting content.

[0033] However, the methods described above typically require users to repeatedly play the meeting video and manually pause at key points to input text, charts, hand-drawn content, etc., into note-taking software or documents one by one through screenshots or manual editing. This process of organizing video content is time-consuming. Furthermore, because videos are usually long, users find it difficult to quickly locate important information and must review frame by frame or rely on memory to drag the progress bar, making it easy to miss key content or repeatedly view irrelevant scenes. This is especially true for meeting videos containing a large amount of handwritten whiteboard content, which require frequent pausing, replaying, and inputting, potentially taking several hours to organize for a single meeting.

[0034] Furthermore, handwritten text and hand-drawn graphics on the whiteboard in meeting videos often suffer from poor image clarity due to factors such as shooting angle, lighting changes, and camera shake. Traditional OCR technology has low recognition accuracy, forcing users to rely on manual input. This results in poor efficiency for electronic devices in organizing video files.

[0035] Furthermore, in related technologies, note-taking applications and office systems are independent of each other and usually lack unified intelligent video content parsing capabilities. In other words, users cannot directly call the video-to-note function in their commonly used work platforms, so they need to switch frequently between multiple applications, which will further reduce work efficiency.

[0036] The video content organization method provided in this application will be described in detail below with reference to the accompanying drawings, through specific embodiments and application scenarios.

[0037] The video content organization method provided in this application can be applied to scenarios where users need to obtain video notes corresponding to videos.

[0038] The following examples illustrate the video content organization method provided in this application.

[0039] Scenario 1: If a user needs to organize meeting notes, they can record the meeting video in real-time using an electronic device during the meeting. The device will then respond to the user's input and automatically acquire N keyframes containing relevant meeting content from the video. Based on these N keyframes, it will automatically generate meeting video notes and display them to the user. The meeting video notes can retain the layout information of each relevant piece of content from the video.

[0040] Scenario 2: While watching an instructional video, if the user wants to summarize or organize class notes based on the video, they can perform an input on the video. This triggers the electronic device to automatically acquire N key image frames containing relevant learning content from the video. Based on these N key image frames, it automatically generates corresponding class notes and displays them to the user. The class notes can retain the layout information of the various key points from the video.

[0041] It should be noted that the above scenarios 1 and 2 are merely exemplary examples of some scenarios that may be applied to the embodiments of this application. In actual implementation, the embodiments of this application can also be applied to any possible scenarios such as needing to obtain corresponding video notes through video. The embodiments of this application are not limited here.

[0042] Based on the scenarios described in the embodiments of this application, the video content organization method provided in this application receives a first input to a video to be processed; in response to the first input, it acquires N key image frames in the video to be processed, where each key image frame contains valid content and N is a positive integer; based on the N key image frames, it generates and displays video notes corresponding to the video to be processed, where the layout information of the note elements matches the layout information of the valid content in the N key image frames. In this solution, when a user needs to acquire video information from a video to be processed, i.e., needs to acquire video notes corresponding to the video to be processed, the user can select the video to be processed through an electronic device, enabling the electronic device to acquire key image frames containing valid content from the video to be processed, and directly generate and display the video notes corresponding to the video to be processed based on these key image frames. That is, the electronic device can automatically generate corresponding video notes based on the video, thus eliminating the need for the user to manually organize the video notes, reducing the time spent organizing video content, and thereby improving the efficiency of the electronic device in organizing video files.

[0043] The video content organization method provided in this application is executed by a video content organization device, which can be an electronic device, or a functional module or entity within an electronic device. This application does not limit the specific implementation of this method. The following will use an electronic device as an example to illustrate the video content organization method provided in this application.

[0044] This application provides a method for organizing video content. Figure 1 shows a flowchart of a video content organization method provided in an embodiment of this application. As shown in Figure 1, the video content organization method provided in this embodiment may include the following steps 201 to 203.

[0045] Step 201: The electronic device receives the first input of the video to be processed.

[0046] In some embodiments of this application, the first input described above can be used to select the video to be processed, thereby triggering the electronic device to perform subsequent steps based on the selected video.

[0047] In some embodiments of this application, the first input mentioned above may include, but is not limited to, any of the following: click input, long press input, voice input, gesture input, or other feasible inputs. This application does not limit this type of input.

[0048] In some embodiments of this application, the video to be processed may be a video recorded in real time, or a video received or stored by an electronic device.

[0049] In some embodiments of this application, the electronic device can record the aforementioned video to be processed in real time through the shooting function of the system application or a third-party application.

[0050] In some embodiments of this application, when a shooting preview interface is displayed, the electronic device can receive click input for the shooting controls.

[0051] In some embodiments of this application, the above-mentioned shooting preview interface can be a shooting preview interface under a special shooting mode, such as the "Instant Notes" shooting mode.

[0052] Understandably, once the electronic device receives a click input to the shooting control, it can begin recording the aforementioned video to be processed.

[0053] For example, take a mobile phone as the electronic device. If a user needs to take notes based on the content of a meeting or class, then during the meeting or class, as shown in Figure 2A, the user can trigger the phone to run the camera application and display the shooting preview interface 20 corresponding to the "Instant Notes" shooting mode. The user can then click the shooting control to trigger the phone to start recording video in real time; this video is the video to be processed mentioned above.

[0054] In some embodiments of this application, when displaying a photo album interface, the electronic device can receive a first input to the video to be processed included in the photo album interface.

[0055] In some embodiments of this application, the album interface described above may include at least one video.

[0056] In some embodiments of this application, the electronic device may first display a functional interface based on user input, such as an interface displaying a scanning function. Then, the electronic device may display the aforementioned album interface based on the user's selection of a specific function within the functional interface, so that the user can select the video to be processed based on the album interface. For example, the specific function may be an "instant note-taking" function.

[0057] For example, as shown in Figure 2B, the mobile phone can display a scanning function interface 22, which may include at least one scanning-related function. When a user wants to obtain video notes corresponding to a specific video, the user can click the "Instant Notes" function included in the scanning function interface 22 to trigger the mobile phone to display a photo album interface. Then, based on the user's selection input for the video to be processed in the photo album interface (i.e., the aforementioned first input), subsequent operations can be performed on the video to be processed.

[0058] It is understandable that the "Instant Notes" shooting mode control in the aforementioned camera application and the "Instant Notes" function control in the scanning function interface are entry points that the electronic device provides to the user for triggering the video-to-notes function.

[0059] Step 202: The electronic device responds to the first input and acquires N key image frames from the video to be processed.

[0060] In some embodiments of this application, the aforementioned key image frames may be image frames in the video to be processed that include valid content.

[0061] Where N is a positive integer.

[0062] In some embodiments of this application, the above-mentioned effective content can be understood as the recordable, identifiable, and information-valuable content in the video to be processed.

[0063] In some embodiments of this application, the aforementioned effective content may include, but is not limited to, at least one of the following: whiteboard content, PPT content, and handwritten notes content.

[0064] It is understood that the aforementioned valid content may include specific video content corresponding to at least one of the following types: whiteboard content, PPT content, and handwritten notes content.

[0065] In some embodiments of this application, the content of the writing board can be handwritten text, charts, formulas, or other content included on a writing board such as a whiteboard or blackboard.

[0066] In some embodiments of this application, the content of the writing board may include, but is not limited to, at least one of the following: flowchart, mind map, and calculation formula.

[0067] In some embodiments of this application, the PPT content can be a PPT page displayed on a projection screen.

[0068] In some embodiments of this application, the content of the PPT may include, but is not limited to, at least one of the following: title page, data charts, and list of key points.

[0069] In some embodiments of this application, the content of the handwritten notes can be handwritten content on a paper notebook or an electronic tablet.

[0070] In some embodiments of this application, the content of the handwritten notes may include, but is not limited to, at least one of the following: sketches, meeting minutes, and class notes.

[0071] Understandably, the aforementioned effective content can help users recapture the key points of the video, understand the video content, or acquire the knowledge and information contained in the video.

[0072] In some embodiments of this application, with reference to Figure 1 and as shown in Figure 3, step 202 above can be specifically implemented through steps 202a to 202d below.

[0073] Step 202a: The electronic device responds to the first input and decodes the video to be processed to obtain the first image frame sequence corresponding to the video to be processed.

[0074] In some embodiments of this application, the first image frame sequence described above includes N key image frames.

[0075] In some embodiments of this application, the electronic device may, in response to the first input described above, perform metadata parsing on the video to be processed to extract the basic video information of the video to be processed, and then perform decoding processing on the video to be processed based on the basic video information to split the video to be processed into an independent sequence of image frames, namely the first image frame sequence described above.

[0076] In some embodiments of this application, the aforementioned basic video information may include, but is not limited to, at least one of the following: resolution, frame rate, duration, encoding format, and bit rate.

[0077] In some embodiments of this application, when the video to be processed is a video obtained from a photo album, the electronic device can calculate the total number of frames based on the video duration and frame rate in order to decode the video to be processed and obtain a first image frame sequence.
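The frame-count calculation mentioned above can be illustrated with a minimal sketch. It assumes a constant frame rate, which the source does not state; the function names `total_frames` and `frame_timestamp` are hypothetical and chosen only for illustration.

```python
def total_frames(duration_s: float, fps: float) -> int:
    """Estimate the number of frames from duration and frame rate.

    Assumes a constant frame rate; variable-frame-rate videos would need
    the per-frame timestamps stored in the container instead.
    """
    if duration_s < 0 or fps <= 0:
        raise ValueError("duration must be non-negative and fps positive")
    return round(duration_s * fps)


def frame_timestamp(frame_index: int, fps: float) -> float:
    """Timestamp in seconds of a zero-based frame index at a constant rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frame_index / fps


if __name__ == "__main__":
    # A 60-second clip at 30 frames per second.
    print(total_frames(60, 30))
    print(frame_timestamp(45, 30))
```

The same relationship, read in the other direction, gives the timestamp that can accompany a frame number when a key image frame is reported.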

[0078] In some embodiments of this application, when the video to be processed is a real-time recorded video, the electronic device can obtain the first image frame sequence based on the real-time captured image frames; or, the electronic device can perform the above-mentioned metadata parsing on the recorded video once recording has run for a preset duration, in order to decode the video to be processed and obtain the first image frame sequence.

[0079] It should be noted that the preset duration can be determined according to actual needs, and this application embodiment does not limit it. For example, the preset duration can be any one of 1 second, 3 seconds, or 5 seconds.

[0080] Step 202b: The electronic device performs feature recognition on each frame of the first image frame sequence based on the first image frame sequence to determine M key image frames containing valid content.

[0081] Where M is a positive integer greater than or equal to N.

[0082] In some embodiments of this application, the electronic device can perform intelligent analysis on the decoded first image frame sequence based on video content understanding technology to automatically detect image frames containing effective content such as meeting whiteboards, handwritten notes, and PPT content, namely the aforementioned M key image frames.

[0083] In some embodiments of this application, the electronic device can determine whether each image frame contains valid content by identifying whether the image frame includes special areas such as a writing board area, a handwritten text area, or a projection screen area.

[0084] In some embodiments of this application, the electronic device may first determine the image frames containing valid content, and then perform image filtering on those frames to filter out invalid image frames, thereby using the remaining image frames as the aforementioned N key image frames.

[0085] In some embodiments of this application, the aforementioned invalid image frames may include, but are not limited to, at least one of the following: close-up of a person, blank screen, blurry or jittery image, or scene switching.
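As a rough illustration of filtering out two of the invalid categories above (blank screens and blurry images), the following sketch uses simple stand-in heuristics. The source does not specify how invalid frames are detected, so the intensity-variance test, the Laplacian-variance sharpness score, both thresholds, and all function names are assumptions. Frames are modeled as row-major grayscale grids of integers in 0 to 255.

```python
from statistics import pvariance

Frame = list[list[int]]  # grayscale pixels, 0-255, row-major (assumed model)


def is_blank(frame: Frame, var_threshold: float = 25.0) -> bool:
    """Treat a frame with almost no intensity variation as blank."""
    pixels = [p for row in frame for p in row]
    return pvariance(pixels) < var_threshold


def sharpness(frame: Frame) -> float:
    """Variance of a 4-neighbour Laplacian response; low values suggest blur.

    Requires a frame of at least 3x3 pixels so that interior pixels exist.
    """
    h, w = len(frame), len(frame[0])
    responses = [
        frame[y - 1][x] + frame[y + 1][x] + frame[y][x - 1] + frame[y][x + 1]
        - 4 * frame[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)


def filter_invalid(frames: list[Frame], blur_threshold: float = 100.0) -> list[Frame]:
    """Keep frames that are neither blank nor below the sharpness threshold."""
    return [
        f for f in frames
        if not is_blank(f) and sharpness(f) >= blur_threshold
    ]
```

Close-ups of people and scene switches would need other detectors, which this sketch does not attempt.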

[0086] In some embodiments of this application, when multiple consecutive image frames have similar frame content, the electronic device can employ an inter-frame difference analysis method to detect the degree of change in frame content.

[0087] For example, suppose the video to be processed is a meeting video. When new writing is added to the whiteboard or when slides are turned on the projected screen, the electronic device can detect changes in the content of the video frame and mark it as a keyframe, thus obtaining M key image frames corresponding to the video to be processed.
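One simple way to realize an inter-frame difference check is sketched below. The mean absolute pixel difference, the threshold value, and the function names are assumptions made for illustration; the source names the technique but not a specific metric. Frames are modeled as equally sized grayscale grids.

```python
Frame = list[list[int]]  # grayscale pixels, 0-255, row-major (assumed model)


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average absolute per-pixel difference between two same-sized frames."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("frames must have the same dimensions")
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    count = sum(len(row) for row in a)
    return total / count


def select_changed_frames(frames: list[Frame], threshold: float = 10.0) -> list[int]:
    """Return indices of frames whose content changed noticeably.

    Each frame is compared with the most recently kept frame, so slow,
    incremental writing still accumulates into a detectable change.
    """
    if not frames:
        return []
    kept = [0]
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[kept[-1]], frames[i]) > threshold:
            kept.append(i)
    return kept
```

Comparing against the last kept frame rather than the immediate predecessor is a design choice of this sketch: it avoids missing content that is added a few strokes at a time.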

[0088] In some embodiments of this application, while the electronic device determines a key image frame, the electronic device can output information such as the timestamp and frame number corresponding to that key image frame.

[0089] In some embodiments of this application, during the process of an electronic device playing the video to be processed, the electronic device may mark the above-mentioned M key image frames in the video to be processed.

[0090] For example, suppose the video to be processed is video A, and video A includes four key image frames: key image frame 1, key image frame 2, key image frame 3, and key image frame 4. As shown in Figure 4, the mobile phone displays a video preview interface 23, in which key image frame 2 is being displayed. The mobile phone can display thumbnail images corresponding to the four key image frames above the video playback progress bar, and display the timestamp information t and frame sequence number f corresponding to each key image frame. Specifically, key image frame 1 can correspond to (t1, f1), key image frame 2 can correspond to (t2, f2), key image frame 3 can correspond to (t3, f3), and key image frame 4 can correspond to (t4, f4). Simultaneously, the mobile phone can also display four marker points on the video playback progress bar, each corresponding to one key image frame.

[0091] In some embodiments of this application, while the electronic device determines a key image frame, it can also output a content type tag corresponding to that key image frame. Examples include: whiteboard, PowerPoint presentation, handwritten notes, etc.

[0092] It should be noted that, because the video preview interface 23 shown in Figure 4 is playing key image frame 2, the slider in the video playback progress bar is displayed on the marker point corresponding to key image frame 2.
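The per-frame information described above (timestamp t, frame sequence number f, and an optional content type tag) and the placement of marker points on the progress bar can be sketched as follows. The `KeyFrame` record, the `marker_positions` function, and the linear mapping from timestamp to pixel offset are assumptions for illustration, not details given in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyFrame:
    timestamp_s: float      # t: position of the frame in the video, in seconds
    frame_number: int       # f: sequence number of the frame
    content_type: str = ""  # optional tag, e.g. "whiteboard"


def marker_positions(key_frames: list[KeyFrame], duration_s: float,
                     bar_width_px: int) -> list[int]:
    """Map each key frame timestamp to a pixel offset on the progress bar."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    positions = []
    for kf in key_frames:
        # Clamp to [0, 1] so out-of-range timestamps stay on the bar.
        fraction = min(max(kf.timestamp_s / duration_s, 0.0), 1.0)
        positions.append(round(fraction * bar_width_px))
    return positions
```

With this mapping, the slider and the marker for the frame currently being played coincide whenever playback sits exactly on that frame's timestamp.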

[0093] In some embodiments of this application, when the electronic device acquires the above-mentioned M key image frames, the electronic device can perform image enhancement processing on the M image frames to improve the image quality of the key frames.

[0094] In some embodiments of this application, the above-described image enhancement processing may include, but is not limited to, at least one of the following: deblurring processing, perspective correction processing, illumination equalization processing, and super-resolution reconstruction processing.

[0095] It should be noted that the above deblurring process can be used to restore sharpness for blur caused by shooting shake or inaccurate focus.

[0096] The aforementioned perspective correction process can automatically detect rectangular boundaries for writing boards or screens photographed at an angle, so as to correct the tilted image to a normal viewing angle.

[0097] The above-mentioned illumination equalization processing can address issues such as uneven lighting, reflections, and shadows, thereby enhancing the contrast of text and graphics.

[0098] The super-resolution reconstruction process described above can improve the image resolution of low-resolution or long-distance images, making handwritten text clearer.

[0099] Understandably, the electronic device can save the enhanced key image frames in memory in a high-quality standard format, so that subsequent content recognition can be performed on the processed images.
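
As a minimal illustrative sketch of one way illumination equalization could raise contrast, the following applies global histogram equalization to a grayscale image. The application does not specify an algorithm; the choice of histogram equalization, the function name `equalize_histogram`, and the representation of an image as a list of rows of integers are all assumptions made for illustration.

```python
def equalize_histogram(gray, levels=256):
    """Global histogram equalization on a 2D list of integer gray levels in [0, levels)."""
    flat = [p for row in gray for p in row]
    if not flat:
        return []
    # Count how often each gray level occurs.
    hist = [0] * levels
    for p in flat:
        hist[p] += 1
    # Cumulative distribution of gray levels.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    total = len(flat)
    if total == cdf_min:
        # Constant image: nothing to stretch, return a copy.
        return [row[:] for row in gray]
    scale = (levels - 1) / (total - cdf_min)
    # Lookup table that spreads the occupied levels over the full range.
    lut = [max(0, round((c - cdf_min) * scale)) for c in cdf]
    return [[lut[p] for p in row] for row in gray]
```

A low-contrast patch such as `[[100, 100], [101, 102]]` is stretched so that its darkest level maps to 0 and its brightest to 255, which is the kind of contrast gain for text and graphics described above.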

[0100] Step 202c: The electronic device compares every two adjacent key image frames based on the timestamp information of the M key image frames to obtain the similarity information corresponding to the M key image frames.

[0101] Understandably, when the video to be processed is a meeting video or a classroom video, since the effective content presented in the video, such as whiteboard content, is usually written or displayed step by step, the same whiteboard area may appear continuously in multiple key image frames. Therefore, electronic devices can identify the relationship between these continuous image frames through image registration and content comparison methods.

[0102] In some embodiments of this application, the specific implementation method for how electronic devices obtain similarity information of adjacent key image frames can be found in the specific implementation methods in related technologies, and will not be repeated here.

[0103] In some embodiments of this application, the electronic device can perform comparison processing based on the valid information included in two key image frames to determine the similarity information corresponding to the two key image frames.

[0104] In some embodiments of this application, the above-mentioned similarity information can be represented by numerical values or by the degree of similarity.

[0105] For example, the degree of similarity mentioned above may include, but is not limited to, at least one of the following: completely identical, similar, completely different.

[0106] It is understandable that when a speaker pauses to explain in a video, or when there is no writing for a long time, the effective content included in the image frame will be exactly the same within a certain period of time.
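
The adjacent-frame comparison of step 202c can be sketched as follows, under stated assumptions. The application leaves the similarity measure to related technologies; here the valid content of each frame is assumed to be already reduced to a collection of recognized items, Jaccard similarity is used as a stand-in measure, and the `similar_threshold` value and all function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two item collections (1.0 when both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify(score, similar_threshold=0.3):
    """Map a numeric score to the three degrees named in the text."""
    if score == 1.0:
        return "identical"
    if score >= similar_threshold:
        return "similar"
    return "different"


def adjacent_similarity(frames):
    """frames: list of (timestamp, items). Compares each pair of frames
    that are adjacent in timestamp order and returns
    (t_prev, t_next, score, label) tuples."""
    ordered = sorted(frames, key=lambda f: f[0])
    result = []
    for (t0, c0), (t1, c1) in zip(ordered, ordered[1:]):
        score = jaccard(c0, c1)
        result.append((t0, t1, score, classify(score)))
    return result
```

Both representations mentioned above are produced: a numerical value (`score`) and a degree of similarity (`label`).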

[0107] Step 202d: Based on similarity information, the electronic device performs deduplication on the M key image frames to obtain N key image frames.

[0108] In some embodiments of this application, when multiple key image frames with identical valid content exist among M key image frames, the electronic device can obtain the image sharpness corresponding to these multiple key image frames to perform deduplication processing on the multiple key image frames based on image detail. For example, the electronic device can retain the key image frame with the best image display effect and discard other blurry or duplicate image frames.

[0109] For example, suppose that in the video to be processed, a speaker is sketching a "user login flowchart" on a whiteboard. In this video, the electronic device identifies key image frame A, which corresponds to the 10th second of the video, where the speaker has completed the flowchart and is explaining it to the audience. Key image frame B, which corresponds to the 12th second of the video, shows the same flowchart on the whiteboard, albeit with slightly different lighting. Key image frame C, which corresponds to the 14th second of the video, also shows the same flowchart on the whiteboard. Therefore, the electronic device can determine that key image frames A, B, and C contain identical content. It can then determine which of these key image frames has the best display quality (e.g., key image frame B has the most accurate focus and the least noise). Therefore, the electronic device can retain key image frame B and delete key image frames A and C.
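
The retain-the-best-frame rule in this example can be sketched as follows. This is an illustration only: `KeyFrame`, `content_id` (a stand-in for "identical valid content"), and the numeric `sharpness` score are hypothetical names, and the application does not state how display quality is measured.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class KeyFrame:
    name: str
    timestamp_s: float
    content_id: str   # hypothetical fingerprint of the valid content
    sharpness: float  # hypothetical quality score, higher is better


def keep_sharpest(frames):
    """For each run of consecutive frames with identical content,
    keep only the frame with the highest sharpness score."""
    ordered = sorted(frames, key=lambda f: f.timestamp_s)
    kept = []
    for _, run in groupby(ordered, key=lambda f: f.content_id):
        kept.append(max(run, key=lambda f: f.sharpness))
    return kept
```

With frames A (10 s), B (12 s), and C (14 s) sharing one `content_id` and B having the highest score, `keep_sharpest` returns only B, matching the outcome described above.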

[0110] In some embodiments of this application, when there are multiple key image frames with similar valid content among M key image frames, the electronic device can display a prompt message to the user, so that, according to the user's choice, the key image frame with the most valid content is retained and the other image frames are discarded; or, all key image frames can be retained.

[0111] In some embodiments of this application, when there are multiple key image frames with similar effective content among M key image frames, the electronic device can track the change process of the effective content in the multiple key image frames, extract each newly added content, and record the time sequence of its appearance.
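
One possible way to track the change process across similar frames is sketched below. It assumes each key image frame has already been reduced to a list of recognized items (for example, lines of handwriting); that representation and the function name `track_additions` are assumptions for illustration.

```python
def track_additions(frames):
    """frames: list of (timestamp_s, items). Returns a timeline of
    (timestamp_s, newly_added_items), in order of first appearance."""
    seen = set()
    timeline = []
    for timestamp, items in sorted(frames, key=lambda f: f[0]):
        new_items = []
        for item in items:
            if item not in seen:
                seen.add(item)
                new_items.append(item)
        if new_items:
            timeline.append((timestamp, new_items))
    return timeline
```

Frames that add nothing new contribute no timeline entry, so the result records only when each piece of content first appeared.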

[0112] In this embodiment, the electronic device can automatically identify key image frames containing valid content in the conference video and filter out redundant frames and invalid content from the initially determined key image frames, thereby accurately locating the handwritten whiteboard frames and hand-drawn picture frames that need to be converted. In this way, the video notes subsequently generated from the deduplicated key image frames can be kept concise and accurate.

[0113] In some embodiments of this application, when the electronic device determines the above-mentioned N key image frames, the electronic device can present the N key image frames to the user and adjust the display order, number, and other information of the N key image frames based on the user's input on the N key image frames.

[0114] Understandably, the key image frames used by electronic devices to ultimately generate video notes can be adjusted according to the user's actual needs. This allows the video notes to more accurately meet the user's requirements and improve the user experience.

[0115] In some embodiments of this application, the electronic device can determine the initial presentation order of the N key image frames based on at least one of the timestamp information t and the frame sequence number f corresponding to the N key image frames.

[0116] Step 203: The electronic device generates and displays video notes corresponding to the video to be processed based on N key image frames.

[0117] In some embodiments of this application, the layout information of the note elements in the video notes can be matched with the layout information of the effective content in N key image frames.

[0118] In some embodiments of this application, step 203 can be specifically implemented by the following steps 203a and 203b.

[0119] Step 203a: The electronic device performs layout analysis on N key image frames to obtain the layout information of the effective content in each key image frame.

[0120] In some embodiments of this application, the above-mentioned layout information may include, but is not limited to, at least one of the following: spatial location information and hierarchical structure information.

[0121] In some embodiments of this application, the aforementioned spatial location information may include, but is not limited to, at least one of the following: up and down, left and right, front and back.

[0122] In some embodiments of this application, the above-mentioned hierarchical structure information may include, but is not limited to, at least one of the following: parallel, inclusion, causal.

[0123] In some embodiments of this application, electronic devices can use multimodal large language models to comprehensively analyze multi-dimensional information such as text content, font attributes, positional relationships, graphic elements, color annotations, and time sequence in order to understand the semantic hierarchy and logical relationships of the content.

[0124] In some embodiments of this application, the electronic device can automatically identify the hierarchical structure of the above-mentioned valid content, such as the title, first-level points, second-level points, and body description, and extract key information.

[0125] In some embodiments of this application, the electronic device can perform layout analysis on the above-mentioned N key image frames, automatically identify different content regions in the image, segment the key image frames into regions, and analyze the spatial positional relationship and hierarchical structure between each region to obtain the above-mentioned layout information.

[0126] In some embodiments of this application, the regions corresponding to the above-mentioned region segmentation may include, but are not limited to, at least one of the following: title area, text area, chart area, hand-drawn graphic area, and annotation area.

[0127] In some embodiments of this application, when the colors corresponding to the effective content in key video frames are different, the electronic device can understand the intent of the color markings, record information such as the coordinates, size, content type, and color attributes of each area, construct a note structure tree, and realize the above-mentioned layout analysis and processing of image frames.

[0128] For example, when a speaker writes on a whiteboard in a video using markers of different colors, the electronic device can identify the areas written with different colored markers and understand the emphasis intended by the color markings, such as red for emphasis and blue for supplementary explanations.
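
The region records and note structure tree described above might be represented as in the following sketch. The `Region` fields mirror the recorded information (coordinates, size, content type, color), and the red and blue meanings come from the example above; the class and function names, and the simple rule of attaching each region to the nearest title above it, are assumptions rather than the method of this application.

```python
from dataclasses import dataclass, field

# Intent of color markings, taken from the example in the text;
# any other color is treated as "normal" in this sketch.
COLOR_INTENT = {"red": "emphasis", "blue": "supplementary"}


@dataclass
class Region:
    kind: str      # e.g. "title", "text", "chart", "hand_drawn", "annotation"
    x: int
    y: int
    width: int
    height: int
    color: str = "black"
    text: str = ""
    children: list = field(default_factory=list)

    @property
    def intent(self):
        return COLOR_INTENT.get(self.color, "normal")


def build_structure_tree(regions):
    """Scan regions top to bottom; titles hang off the root and every
    other region hangs off the most recent title above it."""
    root = Region(kind="root", x=0, y=0, width=0, height=0)
    current_title = None
    for region in sorted(regions, key=lambda r: (r.y, r.x)):
        if region.kind == "title":
            root.children.append(region)
            current_title = region
        elif current_title is not None:
            current_title.children.append(region)
        else:
            root.children.append(region)
    return root
```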

[0129] In some embodiments of this application, during the process of layout analysis and processing of N key image frames by the electronic device, the writing information corresponding to the handwritten text can also be identified.

[0130] In some embodiments of this application, the above-mentioned writing information may include, but is not limited to, at least one of the following: writing style, degree of illegibility of handwriting, and writing angle.

[0131] In some embodiments of this application, the electronic device can recognize attributes such as the size, color, and thickness of text to understand the importance of the text.

[0132] In some embodiments of this application, the electronic device can use graphic recognition to identify hand-drawn geometric shapes, flowcharts, mind maps, tables, and other content in the hand-drawn graphic area. The geometric shapes may include, but are not limited to, at least one of the following: rectangles, circles, and arrows.

[0133] In some embodiments of this application, electronic devices can understand semantic information such as graphic type, connection relationship, and spatial layout expressed by hand-drawn lines in order to construct a complete content semantic structure tree.

[0134] In some embodiments of this application, the above-mentioned content semantic structure tree may include multi-level elements such as chapters, paragraphs, lists, and charts.

[0135] For example, an electronic device can identify the content connected by arrows as a causal relationship or a process relationship.
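
As a hedged sketch of how arrow connections could be turned into a process order, the following treats each recognized arrow as a (source, target) pair and orders the nodes with the Python standard library's `graphlib.TopologicalSorter` (Python 3.9+). The function name `process_order` and the pair representation are assumptions; a flowchart containing loops would raise `graphlib.CycleError` and would need different handling.

```python
from graphlib import TopologicalSorter


def process_order(arrows):
    """arrows: iterable of (source_label, target_label) pairs.
    Returns the labels in an order consistent with every arrow."""
    sorter = TopologicalSorter()
    for source, target in arrows:
        # The target of an arrow depends on (comes after) its source.
        sorter.add(target, source)
    return list(sorter.static_order())
```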

[0136] In some embodiments of this application, electronic devices can use high-precision OCR to recognize printed text in PPT presentations.

[0137] In some embodiments of this application, during the process of the electronic device recognizing key content in N key image frames, the electronic device can evaluate the accuracy of each recognition result, mark low-accuracy content as pending confirmation, and display it in the converted preview interface to prompt the user to confirm or edit.

[0138] In some embodiments of this application, low-accuracy content can be content with an accuracy less than a preset accuracy threshold.

[0139] It should be noted that the aforementioned preset accuracy threshold can be set according to actual needs, and this application embodiment does not limit it. For example, the preset accuracy threshold can be 5%.

[0140] For example, the mobile phone can perform accuracy evaluation on key image frame 1 shown in Figure 5A. If the accuracy of the "Print 'You won!'" part of flowchart 21 is less than the preset accuracy threshold, the mobile phone can display a marker box 25 in flowchart 21 to mark the low-accuracy part and show it to the user. Then, as shown in Figure 5B, the user can tap or long-press the marker box 25 to trigger the phone to display "Keep," "Edit," and "Discard" controls. If the user taps the "Keep" control, the phone can determine that the user has checked the content in the marker box 25 and does not need to modify it; if the user taps the "Edit" control, the phone can determine that the user needs to modify the content in the marker box 25, and will retain the modified content; if the user taps the "Discard" control, the phone can determine that the user does not need to retain the content in the marker box 25, in which case the phone retains flowchart 21 as shown in Figure 5C.
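
The flag-and-confirm flow described above can be sketched as follows. The names `Recognized`, `flag_low_accuracy`, and `resolve` are hypothetical, accuracy is assumed to be a value in [0, 1], and the threshold is passed in by the caller because the application leaves it configurable.

```python
from dataclasses import dataclass


@dataclass
class Recognized:
    text: str
    accuracy: float      # assumed to lie in [0, 1]
    pending: bool = False


def flag_low_accuracy(items, threshold):
    """Mark every item whose accuracy is below the preset threshold
    as pending user confirmation."""
    for item in items:
        item.pending = item.accuracy < threshold
    return items


def resolve(items, index, action, new_text=None):
    """Apply the user's Keep / Edit / Discard choice to items[index]
    and return a new list; the input list is left unchanged."""
    if action not in {"keep", "edit", "discard"}:
        raise ValueError(f"unknown action: {action}")
    result = list(items)
    if action == "discard":
        del result[index]
        return result
    text = items[index].text
    if action == "edit":
        if new_text is None:
            raise ValueError("edit requires new_text")
        text = new_text
    result[index] = Recognized(text, items[index].accuracy, pending=False)
    return result
```

In the Figure 5A to 5C example, the "Print 'You won!'" element would be the flagged item, and choosing "discard" yields the flowchart without it.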

[0141] Step 203b: The electronic device generates and displays video notes corresponding to the video to be processed based on the layout information of the effective content in N key image frames.

[0142] In some embodiments of this application, the electronic device can organize the content of the N key image frames into a coherent note structure based on the temporal order and content correlation of the N key image frames to obtain the aforementioned video notes.

[0143] In some embodiments of this application, the electronic device can intelligently match a suitable note template based on the content type.

[0144] In some embodiments of this application, the above note templates may include, but are not limited to, at least one of the following: meeting minutes template, mind map template, flowchart template.

[0145] In some embodiments of this application, video notes generated by electronic devices can maintain the visual hierarchy of the original content.

[0146] For example, electronic devices can convert large headings written on a meeting whiteboard into first-level headings in meeting notes, convert a list of key points on a meeting whiteboard into a bulleted list, and convert hand-drawn flowcharts into standardized flowchart elements. For instance, the flowchart 21 shown in Figure 2A is converted into the flowchart 21 shown in Figure 6.

[0147] Understandably, electronic devices can automatically generate editable video notes with a beautiful format and clear structure based on semantic structure tree and layout analysis results.

[0148] In some embodiments of this application, the electronic device may retain the color emphasis information of the original content. For example, the electronic device may convert red markings into a highlighted or bold style.

[0149] In some embodiments of this application, the text content included in the above video notes is editable text.

[0150] In some embodiments of this application, the image elements included in the video notes are vector graphics or editable objects.

[0151] In some embodiments of this application, the storage format of the above-mentioned video notes may include, but is not limited to, at least one of the following: Markdown, Word, HTML, and PDF.
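
Taking Markdown as the storage format, the conversions described above (headings to first-level headings, key points to a bulleted list, red markings to bold) could be sketched as follows. The node dictionary layout (`kind`, `level`, `text`, `color`) and the function name `render_markdown` are assumptions for illustration only.

```python
def render_markdown(nodes):
    """nodes: list of dicts with 'kind' ('heading', 'bullet', or other),
    'text', and optional 'level' and 'color'. Returns Markdown text."""
    lines = []
    for node in nodes:
        text = node["text"]
        # Preserve color emphasis: red markings become bold text.
        if node.get("color") == "red":
            text = f"**{text}**"
        if node["kind"] == "heading":
            lines.append("#" * node.get("level", 1) + " " + text)
        elif node["kind"] == "bullet":
            lines.append("- " + text)
        else:
            lines.append(text)
    return "\n".join(lines)
```

Because the output is plain Markdown text, it remains editable, consistent with the editable-text requirement above.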

[0152] In this embodiment, the electronic device can analyze key image frames to accurately identify handwritten text, hand-drawn graphics and their logical relationships, and can intelligently restore them into standardized and editable electronic notes according to the original layout structure. This can reduce the time spent organizing video content, greatly reduce the burden on users, and improve the efficiency of electronic devices in organizing video files.

[0153] In the video content organization method provided in this application embodiment, a first input of a video to be processed is received; in response to the first input, N key image frames in the video to be processed are obtained, where each key image frame is an image frame in the video to be processed containing valid content, and N is a positive integer; based on the N key image frames, video notes corresponding to the video to be processed are generated and displayed, where the layout information of the note elements in the video notes matches the layout information of the valid content in the N key image frames. In this solution, when a user needs to obtain the information in a video to be processed, i.e., needs video notes corresponding to that video, the user can select the video through an electronic device. The electronic device can then obtain the key image frames containing valid content in the video and directly generate and display the corresponding video notes based on those key image frames. That is, the electronic device can automatically generate the video notes corresponding to the video, thereby eliminating the need for the user to manually organize video notes, reducing the time spent organizing video content, and thus improving the efficiency of the electronic device in organizing video files.

[0154] It is understood that in the video content organization method provided in the embodiments of this application, users do not need to repeatedly play, pause to take screenshots, or manually input. They only need to upload the video or shoot in real time, and the electronic device can automatically complete the entire process of key image frame extraction, content recognition, and structured conversion, thus realizing one-click generation of video notes.

[0155] In some embodiments of this application, the ability to generate corresponding video notes based on video can be encapsulated into a standardized software development kit (SDK) to provide a unified SDK interface for electronic devices and third-party applications to easily integrate the video-to-note function, thereby achieving one-stop intelligent processing from video shooting to note generation.

[0156] Specifically, electronic devices can encapsulate the entire process capabilities described in the above embodiments, such as video parsing, key image frame extraction, content recognition, and note generation, into standardized SDK components to support both synchronous calls and asynchronous callback modes. These SDK components can include built-in caching mechanisms and breakpoint resume functionality, and support segmented uploading and processing of large video files.
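
A minimal sketch of an SDK facade that supports both a synchronous call and an asynchronous callback mode, with a simple in-memory cache, is given below. The class name `VideoNoteSDK`, its methods, and the `pipeline` callable that stands in for the full video-to-note process are all hypothetical; breakpoint resume and segmented upload are not modeled.

```python
from concurrent.futures import ThreadPoolExecutor


class VideoNoteSDK:
    """Hypothetical facade over a video-to-note pipeline."""

    def __init__(self, pipeline, max_workers=2):
        self._pipeline = pipeline   # callable: video_id -> notes
        self._cache = {}            # video_id -> notes already generated
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def generate(self, video_id):
        """Synchronous call: blocks until the notes are available.
        Concurrent calls for the same video may compute the notes twice;
        a production SDK would guard the cache with a lock."""
        if video_id not in self._cache:
            self._cache[video_id] = self._pipeline(video_id)
        return self._cache[video_id]

    def generate_async(self, video_id, callback):
        """Asynchronous mode: returns a Future at once and invokes
        callback(notes) when processing finishes."""
        future = self._pool.submit(self.generate, video_id)
        future.add_done_callback(lambda f: callback(f.result()))
        return future

    def close(self):
        """Wait for outstanding work and release the worker threads."""
        self._pool.shutdown(wait=True)
```

A caller would typically use `generate` for short clips and `generate_async` for long recordings so that the user interface is not blocked.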

[0157] In some embodiments of this application, the caller can integrate the SDK into various scenarios such as note-taking applications, conferencing systems, collaboration platforms, and educational software, providing users with a one-stop video-to-note service to meet the practical needs of different fields. In other words, this application can achieve wide application through its open SDK.

[0158] In some embodiments of this application, the aforementioned video notes include N timestamp links. With reference to Figure 1, as shown in Figure 7, after step 203 above, the video content organization method provided in this application embodiment may further include the following steps 301 and 302.

[0159] Step 301: The electronic device receives a second input for the first timestamp link among N timestamp links.

[0160] In some embodiments of this application, a timestamp link corresponds to a key image frame.

[0161] In some embodiments of this application, the first timestamp link can be any one of the N timestamp links.

[0162] In some embodiments of this application, the second input described above can be used to select the first timestamp link described above.

[0163] In some embodiments of this application, the second input may include, but is not limited to, any of the following: click input, long press input, voice input, gesture input, or other feasible inputs. This application does not limit this type of input.

[0164] Step 302: The electronic device responds to the second input and displays the video segment corresponding to the first timestamp link in the video to be processed.

[0165] In some embodiments of this application, the electronic device may overlay a video playback window on the currently displayed interface to display the video segment corresponding to the first timestamp link in the video to be processed; or, the electronic device may jump from the currently displayed interface to a video playback interface to display the video segment corresponding to the first timestamp link in the video to be processed.

[0166] In some embodiments of this application, the electronic device can first determine the key image frame corresponding to the first timestamp link, and determine the position of the key image frame in the video to be processed, such as at least one of the timestamp information t and the frame sequence number f. Then, the electronic device can further determine the corresponding video segment based on the key image frame.

[0167] For example, a video segment of a preset duration starting with the key image frame; or a video segment consisting of a sequence of image frames with a similarity to the key image frame greater than or equal to a preset threshold; or any other feasible video segment that is associated with the key image frame.
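
The first two options for determining the video segment can be sketched as follows. The function names are hypothetical, and the second function additionally assumes that the segment is the contiguous run of frames around the key image frame, which the text does not state.

```python
def segment_by_duration(key_timestamp_s, preset_duration_s, video_duration_s):
    """Segment of a preset duration starting at the key image frame,
    clipped to the end of the video. Returns (start_s, end_s)."""
    end = min(key_timestamp_s + preset_duration_s, video_duration_s)
    return (key_timestamp_s, end)


def segment_by_similarity(frame_times_s, similarity_to_key, key_index, threshold):
    """Grow the segment outward from the key frame while neighboring
    frames stay at or above the similarity threshold.
    Returns (start_s, end_s)."""
    start = end = key_index
    while start > 0 and similarity_to_key[start - 1] >= threshold:
        start -= 1
    while end < len(similarity_to_key) - 1 and similarity_to_key[end + 1] >= threshold:
        end += 1
    return (frame_times_s[start], frame_times_s[end])
```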

[0168] In this embodiment, the electronic device can embed a timestamp link of the original video into the video notes. When a user clicks on a timestamp link, the electronic device can be triggered to jump to the video segment corresponding to that timestamp link, thereby realizing bidirectional indexing between the notes and the original video. This improves the convenience and flexibility for users to obtain information included in the video notes.

[0169] To illustrate the various scenarios in which the embodiments of this application can be applied, and in conjunction with the various implementation schemes of the embodiments of this application described above, specific examples are given below to explain the implementation process of the embodiments of this application in various scenarios. A mobile phone is used as an example for illustration.

[0170] Scenario 1: If a user needs to organize meeting notes, then during the meeting, as shown in Figure 2A, the user can trigger the phone to run the camera app and display the shooting preview interface 20 corresponding to the "Instant Notes" shooting mode. Then, the user can click the shooting control to trigger the phone to start real-time video recording; the recorded video is the video to be processed mentioned above.

[0171] The mobile phone can respond to the user's input to start recording the meeting video, automatically performing metadata parsing on the recorded meeting video to extract the basic video information. Based on this information, the phone can then decode the video, breaking it down into independent image frame sequences, i.e., the meeting image frame sequence. Alternatively, the mobile phone can automatically perform metadata parsing on the completed meeting video after recording ends to obtain the corresponding meeting image frame sequence.

[0172] Therefore, the mobile phone performs feature recognition frame by frame on the conference image frame sequence to identify the image frames that include the conference whiteboard, i.e., the image frames containing valid content, as M key image frames, such as 10 key image frames. Among them, key image frame 1 includes 1 line of notes; key image frame 2 includes 2 lines of notes, i.e., 1 line of notes added to key image frame 1; key image frame 3 adds another line of notes; the notes in key image frames 4 to 8 remain unchanged; key image frame 9 corresponds to new notes, such as a flowchart; and the notes in key image frame 10 correspond to a table.

[0173] The mobile phone can compare each pair of adjacent key image frames based on the timestamp information of the 10 key image frames to obtain the similarity information corresponding to the 10 key image frames. For example, key image frame 2 is similar to key image frame 1, key image frame 3 is similar to key image frame 2, key image frames 3 to 8 are completely the same, key image frame 9 is completely different from key image frame 8, and key image frame 10 is completely different from key image frame 9.

[0174] Therefore, for similar key image frames 1 to 3, since additional notes are added to these three key image frames based on the original notes, the phone can include key image frames 1 to 3, or only key image frame 3 with the most complete notes, according to the user's actual needs. Then, for identical key image frames 3 to 8, the phone can obtain the clarity corresponding to each of key image frames 3 to 8, retaining the key image frame with the best display effect, such as key image frame 6, and discarding the other key image frames. Then, for key image frames 9 and 10 that are completely different from other key image frames, the phone can directly retain them. In other words, the phone can perform deduplication on 10 key image frames to retain 5 key image frames, such as key image frame 1, key image frame 2, key image frame 6, key image frame 9, and key image frame 10; or it can retain 3 key image frames, such as key image frame 6, key image frame 9, and key image frame 10.
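
The two outcomes of this scenario (5 retained frames or 3) can be reproduced with the sketch below. It assumes each frame's notes are modeled as a set of items, that "similar" here means an earlier frame's notes are a strict subset of a later frame's, and that sharpness is a numeric score; the function name and the sample data are hypothetical.

```python
from itertools import groupby


def deduplicate_scenario(contents, sharpness, keep_partial=True):
    """contents[i] and sharpness[i] describe key image frame i + 1.
    Returns the 1-based numbers of the retained frames."""
    frame_numbers = range(1, len(contents) + 1)
    # Step 1: within each run of identical frames keep the sharpest one.
    representatives = [
        max(run, key=lambda n: sharpness[n - 1])
        for _, run in groupby(frame_numbers, key=lambda n: contents[n - 1])
    ]
    if keep_partial:
        return representatives
    # Step 2: drop a frame when the next retained frame contains all of
    # its notes plus more, i.e. the later frame is more complete.
    kept = []
    for position, number in enumerate(representatives):
        is_last = position + 1 == len(representatives)
        if not is_last and contents[number - 1] < contents[representatives[position + 1] - 1]:
            continue
        kept.append(number)
    return kept


# Sample data for the 10 key image frames of Scenario 1.
three_lines = frozenset({"line 1", "line 2", "line 3"})
contents = (
    [frozenset({"line 1"}), frozenset({"line 1", "line 2"})]
    + [three_lines] * 6
    + [frozenset({"flowchart"}), frozenset({"table"})]
)
sharpness = [0.7, 0.7, 0.6, 0.65, 0.7, 0.9, 0.8, 0.6, 0.8, 0.8]
print(deduplicate_scenario(contents, sharpness, keep_partial=True))
print(deduplicate_scenario(contents, sharpness, keep_partial=False))
```

With this sample data, where frame 6 is given the highest score among frames 3 to 8, the first call retains frames 1, 2, 6, 9, and 10 and the second retains frames 6, 9, and 10.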

[0175] Taking the key image frames 6, 9, and 10 retained by the mobile phone as an example, the mobile phone can perform layout analysis on these three key image frames respectively to obtain the layout information of the effective content in each key image frame. For example, key image frame 6 includes the hierarchical structure information between the three lines of handwritten notes, key image frame 9 includes the connection relationship between the various sets of images in the flowchart, and key image frame 10 includes the correspondence relationship of each data in the table.

[0176] Therefore, the mobile phone can organize the content of the three key image frames into a coherent note structure based on their temporal order and content relevance, thus obtaining the meeting video notes corresponding to the meeting video. These meeting video notes can include three timestamp links: timestamp link 1 corresponding to key image frame 6, timestamp link 2 corresponding to key image frame 9, and timestamp link 3 corresponding to key image frame 10.

[0177] Then, while the user is viewing the meeting video notes, the user can click on any timestamp link included in the meeting video notes, such as timestamp link 2. The phone can then overlay a video playback window on the currently displayed meeting video notes interface to play video content related to key image frame 9.

[0178] It should be noted that at each stage of generating meeting notes on the mobile phone, the phone can show the processing progress to the user. For example, after extracting the effective content based on key image frames, the phone can first show the user a preview of the notes so that the user can check, modify, and edit them. Then, based on the user's edited content, the phone can generate the corresponding meeting video notes.

[0179] It should be noted that the above-described method embodiments, or the various possible implementations of the method embodiments, can be executed individually, or, provided there are no contradictions, they can be combined with each other. The specific implementation can be determined according to actual usage requirements, and this application embodiment does not impose any restrictions on this.

[0180] It should be noted that the video content organization method provided in this application embodiment can be executed by a video content organization device. This application embodiment uses a video content organization device executing the video content organization method as an example to illustrate the video content organization device provided in this application embodiment.

[0181] Figure 8 shows a schematic diagram of a possible structure of the video content organization device involved in an embodiment of this application. As shown in Figure 8, the video content organization device 80 may include: a receiving module 81, an acquisition module 82, and a display module 83;

[0182] The receiving module 81 is used to receive the first input of the video to be processed;

[0183] The acquisition module 82 is used to acquire N key image frames in the video to be processed in response to the first input received by the receiving module 81. The key image frames are image frames in the video to be processed that include valid content, and N is a positive integer.

[0184] The display module 83 is used to generate and display video notes corresponding to the video to be processed based on the N key image frames acquired by the acquisition module 82. The layout information of the note elements in the video notes is matched with the layout information of the effective content in the N key image frames.

[0185] In one possible implementation, the video content organization device 80 provided in this application embodiment may further include a determining module. The acquisition module 82 is specifically used to decode the video to be processed to obtain a first image frame sequence corresponding to the video to be processed, the first image frame sequence including the N key image frames; the determining module is used to perform feature recognition frame by frame on the first image frame sequence obtained by the acquisition module 82 to determine M key image frames including valid content, where M is a positive integer greater than or equal to N; the acquisition module 82 is also used to compare every two adjacent key image frames based on the timestamp information of the M key image frames determined by the determining module to obtain similarity information corresponding to the M key image frames, and to perform deduplication processing on the M key image frames based on the similarity information to obtain the N key image frames.

[0186] In one possible implementation, the video content organization device 80 provided in this application embodiment may further include: a processing module; the processing module is used to perform layout analysis processing on N key image frames to obtain layout information of the effective content in each key image frame, the layout information including at least one of the following: spatial location information, hierarchical structure information; the aforementioned display module 83 is specifically used to generate and display video notes corresponding to the video to be processed based on the layout information of the effective content in the N key image frames obtained by the processing module.

[0187] In one possible implementation, the video notes include N timestamp links; the receiving module 81 is further configured to receive a second input to the first timestamp link among the N timestamp links, where one timestamp link corresponds to one key image frame; the display module 83 is further configured to display the video segment corresponding to the first timestamp link in the video to be processed in response to the second input received by the receiving module 81.

[0188] In one possible implementation, the aforementioned valid content includes at least one of the following: whiteboard content, presentation content, and handwritten notes content.

[0189] In the video content organization device provided in this application embodiment, when a user needs to obtain the information in a video to be processed, i.e., needs video notes corresponding to that video, the user can select the video through the video content organization device. The video content organization device can then obtain the key image frames containing valid content in the video and directly generate and display the corresponding video notes based on those key image frames. That is, the video content organization device can automatically generate the corresponding video notes based on the video, thereby eliminating the need for the user to manually organize video notes, reducing the time spent organizing video content, and thus improving the efficiency of the video content organization device in organizing video files.

[0190] The video content organization device in this application embodiment can be an electronic device or a component of an electronic device, such as an integrated circuit or a chip. The electronic device can be a terminal or other devices besides a terminal. For example, the electronic device can be a mobile phone, tablet computer, laptop computer, handheld computer, in-vehicle electronic device, mobile internet device (MID), augmented reality (AR) / virtual reality (VR) device, robot, wearable device, ultra-mobile personal computer (UMPC), netbook, or personal digital assistant (PDA), etc. It can also be a server, network attached storage (NAS), personal computer (PC), television (TV), ATM, or self-service machine, etc. This application embodiment does not specifically limit the device.

[0191] The video content organization device in this application embodiment can be a device with an operating system. This operating system can be Android, iOS, or other possible operating systems; this application embodiment does not specifically limit the specific operating system used.

[0192] The video content organization device provided in this application embodiment can realize the various processes implemented in the above method embodiments, and will not be described again here to avoid repetition.

[0193] Optionally, as shown in Figure 9, this application embodiment also provides an electronic device 90, including a processor 91 and a memory 92. The memory 92 stores a program or instructions that can run on the processor 91. When the program or instructions are executed by the processor 91, they implement the various steps of the above-described video content organization method embodiment and can achieve the same technical effect. To avoid repetition, they will not be described again here.

[0194] It should be noted that the electronic devices in the embodiments of this application include the mobile electronic devices and non-mobile electronic devices described above.

[0195] Figure 10 is a schematic diagram of the hardware structure of an electronic device that implements an embodiment of this application.

[0196] The electronic device 100 includes, but is not limited to, components such as: radio frequency unit 101, network module 102, audio output unit 103, input unit 104, sensor 105, display unit 106, user input unit 107, interface unit 108, memory 109, and processor 110.

[0197] Those skilled in the art will understand that the electronic device 100 may also include a power supply (such as a battery) for supplying power to various components. The power supply may be logically connected to the processor 110 through a power management system, thereby enabling functions such as managing charging, discharging, and power consumption through the power management system. The electronic device structure shown in Figure 10 does not constitute a limitation on the electronic device. The electronic device may include more or fewer components than shown, or combine certain components, or have different component arrangements, which will not be elaborated here.

[0198] The user input unit 107 is used to receive the first input of the video to be processed;

[0199] Processor 110 is configured to, in response to a first input, acquire N key image frames in the video to be processed, wherein the key image frames are image frames in the video to be processed that contain valid content, and N is a positive integer;

[0200] Display unit 106 is used to generate and display video notes corresponding to the video to be processed based on N key image frames, wherein the layout information of the note elements in the video notes matches the layout information of the effective content in the N key image frames.

[0201] Optionally, the processor 110 is specifically configured to decode the video to be processed to obtain a first image frame sequence corresponding to the video to be processed, the first image frame sequence including N key image frames; and based on the first image frame sequence, perform feature recognition on each frame of the first image frame sequence to determine M key image frames containing valid content, where M is a positive integer greater than or equal to N; and based on the timestamp information of the M key image frames, perform comparison processing on every two adjacent key image frames to obtain similarity information corresponding to the M key image frames; and based on the similarity information, perform deduplication processing on the M key image frames to obtain N key image frames.
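
The selection steps described above (feature recognition, adjacent-frame comparison in timestamp order, deduplication) can be sketched as follows. This is a toy illustration, not the applicant's implementation: frames are modelled as flattened grayscale pixel lists rather than decoded video, the `has_valid_content` predicate stands in for whatever feature recognition is actually used, and the mean-absolute-difference metric and the 0.9 threshold are assumptions, since the text names neither a similarity measure nor a threshold.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Frame:
    timestamp_s: float
    pixels: Sequence[int]  # flattened grayscale values in 0..255 (toy stand-in)


def similarity(a: Frame, b: Frame) -> float:
    """Return 1.0 for identical frames and 0.0 for maximally different ones.

    Assumption: mean absolute pixel difference; the source names no metric.
    """
    diff = sum(abs(x - y) for x, y in zip(a.pixels, b.pixels))
    return 1.0 - diff / (255.0 * len(a.pixels))


def select_key_frames(
    frames: List[Frame],
    has_valid_content: Callable[[Frame], bool],
    threshold: float = 0.9,
) -> List[Frame]:
    # Feature recognition: keep the M frames that contain valid content,
    # ordered by timestamp so that "adjacent" is well defined.
    candidates = sorted(
        (f for f in frames if has_valid_content(f)), key=lambda f: f.timestamp_s
    )
    if not candidates:
        return []
    # Similarity information for every pair of adjacent candidates.
    sims = [similarity(a, b) for a, b in zip(candidates, candidates[1:])]
    # Deduplication: drop a candidate that is too similar to its predecessor,
    # leaving the N key image frames (N <= M).
    kept = [candidates[0]]
    kept += [f for f, s in zip(candidates[1:], sims) if s < threshold]
    return kept


frames = [
    Frame(0.0, [0, 0, 0, 0]),        # blank frame: no valid content
    Frame(1.0, [255, 0, 0, 0]),      # first frame with content
    Frame(2.0, [255, 0, 0, 0]),      # duplicate of the previous frame
    Frame(3.0, [255, 255, 255, 0]),  # content changed noticeably
]
key_frames = select_key_frames(frames, lambda f: any(p > 0 for p in f.pixels))
print([f.timestamp_s for f in key_frames])
```

This sketch keeps the first frame of each run of near-duplicates. For whiteboard or handwritten content that accumulates over time, keeping the last frame of a run may be preferable; the text does not say which is intended.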

[0202] Optionally, the processor 110 is specifically used to perform layout analysis processing on N key image frames to obtain layout information of the effective content in each key image frame. The layout information includes at least one of the following: spatial location information and hierarchical structure information. The display unit 106 is specifically used to generate and display video notes corresponding to the video to be processed based on the layout information of the effective content in the N key image frames.

[0203] Optionally, the video notes include N timestamp links; the user input unit 107 is further configured to receive a second input to the first timestamp link among the N timestamp links, where one timestamp link corresponds to one key image frame; the display unit 106 is further configured to display the video segment corresponding to the first timestamp link in the video to be processed in response to the second input.

[0204] Optionally, the above-mentioned valid content includes at least one of the following: whiteboard content, presentation content, and handwritten notes.

[0205] In the electronic device provided in this application embodiment, when a user needs to obtain video information from a video to be processed, that is, to obtain video notes corresponding to the video to be processed, the user can select the video to be processed through the electronic device, so that the electronic device obtains key image frames containing valid content in the video to be processed and directly generates and displays the video notes corresponding to the video to be processed based on those key image frames. That is, the electronic device can automatically generate video notes corresponding to the video, eliminating the need for the user to manually organize video notes, reducing the time spent organizing video content, and thus improving the efficiency of the electronic device in organizing video file content.

[0206] The electronic device provided in this application embodiment can implement the various processes implemented in the above method embodiments and achieve the same technical effect. To avoid repetition, it will not be described again here.

[0207] For details on the beneficial effects of the various implementation methods in this embodiment, please refer to the beneficial effects of the corresponding implementation methods in the above method embodiments. To avoid repetition, these will not be repeated here.

[0208] It should be understood that, in this embodiment, the input unit 104 may include a graphics processing unit (GPU) 1041 and a microphone 1042. The GPU 1041 processes image data of still images or videos obtained by an image capture device (such as a camera) in video capture mode or image capture mode. The display unit 106 may include a display panel 1061, which may be configured in the form of a liquid crystal display, an organic light-emitting diode, or the like. The user input unit 107 includes at least one of a touch panel 1071 and other input devices 1072. The touch panel 1071 is also called a touch screen. The touch panel 1071 may include a touch detection device and a touch controller. Other input devices 1072 may include, but are not limited to, physical keyboards, function keys (such as volume control buttons, power buttons, etc.), trackballs, mice, and joysticks, which will not be described in detail here.

[0209] The memory 109 can be used to store software programs and various data. The memory 109 may primarily include a first storage area for storing programs or instructions and a second storage area for storing data. The first storage area may store the operating system and the application programs or instructions required for at least one function (such as sound playback, image playback, etc.). Furthermore, the memory 109 may include volatile memory or non-volatile memory, or both. The non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), Synchlink dynamic random access memory (SLDRAM), or direct Rambus random access memory (DRRAM). The memory 109 in the embodiments of this application includes, but is not limited to, these and any other suitable types of memory.

[0210] Processor 110 may include one or more processing units. Optionally, processor 110 integrates an application processor and a modem processor, where the application processor mainly handles operations involving the operating system, user interface, and applications, and the modem processor, such as a baseband processor, mainly handles wireless communication signals. It is understood that the modem processor may alternatively not be integrated into processor 110.

[0211] This application also provides a readable storage medium storing a program or instructions. When the program or instructions are executed by a processor, they implement the various processes of the above method embodiments and achieve the same technical effect. To avoid repetition, they will not be described again here.

[0212] The processor is the processor in the electronic device described in the above embodiments. The readable storage medium includes computer-readable storage media, such as computer read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disk.

[0213] This application embodiment also provides a chip, which includes a processor and a communication interface. The communication interface is coupled to the processor. The processor is used to run programs or instructions to implement the various processes of the above method embodiments and achieve the same technical effect. To avoid repetition, it will not be described again here.

[0214] It should be understood that the chip mentioned in the embodiments of this application may also be referred to as a system-level chip, system chip, chip system, or system-on-a-chip, etc.

[0215] This application provides a computer program product, which is stored in a storage medium and executed by at least one processor to implement the various processes of the above method embodiments and achieve the same technical effects. To avoid repetition, it will not be described again here.

[0216] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element. Furthermore, it should be noted that the scope of the methods and apparatuses in the embodiments of this application is not limited to performing functions in the order shown or discussed, but may also include performing functions substantially simultaneously or in the reverse order, depending on the functions involved. For example, the described methods may be performed in a different order than described, and various steps may be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.

[0217] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a computer software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk) and includes several instructions to cause a terminal (which may be a mobile phone, computer, server, or network device, etc.) to execute the methods described in the various embodiments of this application.

[0218] The embodiments of this application have been described above with reference to the accompanying drawings. However, this application is not limited to the specific embodiments described above, which are merely illustrative and not restrictive. Under the guidance of this application, those skilled in the art can devise many other forms without departing from the spirit of this application and the scope protected by the claims, and all of these forms fall within the protection scope of this application.

Claims

1. A method for organizing video content, characterized in that the method includes: receiving a first input to a video to be processed; in response to the first input, obtaining N key image frames in the video to be processed, where the key image frames are image frames in the video to be processed that include valid content, and N is a positive integer; and based on the N key image frames, generating and displaying video notes corresponding to the video to be processed, where the layout information of the note elements in the video notes matches the layout information of the valid content in the N key image frames.

2. The method according to claim 1, characterized in that obtaining the N key image frames in the video to be processed includes: decoding the video to be processed to obtain a first image frame sequence corresponding to the video to be processed, where the first image frame sequence includes the N key image frames; based on the first image frame sequence, performing feature recognition frame by frame on the first image frame sequence to determine M key image frames that include the valid content, where M is a positive integer greater than or equal to N; based on the timestamp information of the M key image frames, comparing every two adjacent key image frames to obtain similarity information corresponding to the M key image frames; and based on the similarity information, deduplicating the M key image frames to obtain the N key image frames.

3. The method according to claim 1, characterized in that generating and displaying the video notes corresponding to the video to be processed based on the N key image frames includes: performing layout analysis processing on the N key image frames to obtain the layout information of the valid content in each key image frame, where the layout information includes at least one of the following: spatial location information and hierarchical structure information; and based on the layout information of the valid content in the N key image frames, generating and displaying the video notes corresponding to the video to be processed.

4. The method according to claim 1, characterized in that the video notes include N timestamp links, and the method further includes: receiving a second input to a first timestamp link among the N timestamp links, where one timestamp link corresponds to one key image frame; and in response to the second input, displaying the video segment corresponding to the first timestamp link in the video to be processed.

5. The method according to claim 1, characterized in that the valid content includes at least one of the following: whiteboard content, presentation content, and handwritten notes.

6. A video content organization device, characterized in that the video content organization device includes: a receiving module, an acquisition module, and a display module; the receiving module is used to receive a first input to a video to be processed; the acquisition module is configured to acquire N key image frames in the video to be processed in response to the first input received by the receiving module, where the key image frames are image frames in the video to be processed that include valid content, and N is a positive integer; and the display module is used to generate and display video notes corresponding to the video to be processed based on the N key image frames acquired by the acquisition module, where the layout information of the note elements in the video notes matches the layout information of the valid content in the N key image frames.

7. The device according to claim 6, characterized in that the video content organization device further includes: a determination module; the acquisition module is specifically used to decode the video to be processed to obtain a first image frame sequence corresponding to the video to be processed, where the first image frame sequence includes the N key image frames; the determination module is used to perform feature recognition frame by frame on the first image frame sequence obtained by the acquisition module, and to determine M key image frames that include the valid content, where M is a positive integer greater than or equal to N; and the acquisition module is further configured to compare every two adjacent key image frames based on the timestamp information of the M key image frames determined by the determination module to obtain similarity information corresponding to the M key image frames, and to perform deduplication processing on the M key image frames based on the similarity information to obtain the N key image frames.

8. The device according to claim 6, characterized in that the video content organization device further includes: a processing module; the processing module is used to perform layout analysis processing on the N key image frames to obtain layout information of the valid content in each key image frame, where the layout information includes at least one of the following: spatial location information and hierarchical structure information; and the display module is specifically used to generate and display the video notes corresponding to the video to be processed based on the layout information of the valid content in the N key image frames obtained by the processing module.

9. The device according to claim 6, characterized in that the video notes include N timestamp links; the receiving module is further used to receive a second input to a first timestamp link among the N timestamp links, where one timestamp link corresponds to one key image frame; and the display module is further configured to, in response to the second input received by the receiving module, display the video segment corresponding to the first timestamp link in the video to be processed.

10. The device according to claim 6, characterized in that the valid content includes at least one of the following: whiteboard content, presentation content, and handwritten notes.