An intelligent content analysis and recommendation system based on cross-modal learning

A cross-modal learning system acquires image, text, and speech information, forms synchronized content groups, extracts state offset features, identifies modal collaborative changes, and generates recommendation structures. This addresses the loss of semantic continuity caused by processing each modality independently and enables collaborative analysis and recommendation of multimodal content.

CN122196265A · Pending · Publication Date: 2026-06-12 · ANHUI CHENYAN INFORMATION TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
ANHUI CHENYAN INFORMATION TECHNOLOGY CO LTD
Filing Date
2026-03-04
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

In existing intelligent content analysis and recommendation systems, image, text, and speech modalities are processed independently, lacking a segment-level synchronization mechanism. This results in the inability to establish temporal and positional correspondences between modalities, a lack of methods for comparing change trends, and difficulty in supporting content evolution analysis. Recommendation generation relies on behavior-driven approaches, neglecting the structural synergy of multimodal content, leading to a lack of continuity and integration in the semantic expression of recommendation outputs.

Method used

By constructing a cross-modal learning system, we can acquire image edge regions, text phrase positions, and speech frequency changes to form cross-modal synchronized content groups. We can then extract state offset features, identify modal collaborative changes, parse semantic nodes, and generate a cross-modal content analysis and recommendation structure.

Benefits of technology

It establishes a unified reference basis for information from different modalities, supports the evolution tracking of multimodal content, enhances intermodal synergy, supports the integrated construction of content semantic structure, and promotes the coordinated coverage of recommendation results at the structural level.

✦ Generated by Eureka AI based on patent content.


Abstract

The application relates to the technical field of cross-modal learning, in particular to an intelligent content analysis and recommendation system based on cross-modal learning, which comprises the following steps: acquiring an image edge region, a text phrase position and a voice frequency change, sequentially connecting synchronous groups in source order, extracting a segment displacement relationship to generate a displacement mark, comparing trends to output a collaborative matching result, analyzing semantic nodes to construct an identification structure, and screening matching content to generate a recommendation structure. The application organizes image edges, text phrases and voice frequency change points in sequence, constructs a multi-modal segment corresponding relationship, unifies an information reference basis, extracts a state mark through segment displacement, tracks a content evolution path, screens synchronous segments through trend comparison, improves modality collaboration, generates semantic nodes through semantic position joint pointing, constructs a semantic structure, screens content resources through consistency, and realizes structure linkage coverage of a recommendation result.

Description

Technical Field

[0001] This invention relates to the field of cross-modal learning technology, and in particular to an intelligent content analysis and recommendation system based on cross-modal learning.

Background Technology

[0002] Cross-modal learning technology involves the alignment and fusion of information between heterogeneous data, with core components including multimodal feature extraction, semantic mapping, unified embedding space construction, and intermodal correlation modeling. A common approach is to extract representation vectors from modalities such as text, images, and audio; then, use deep learning models to achieve modal alignment and fusion; this is applied to tasks such as multimodal retrieval, recognition, or recommendation. Coordinating collaborative modeling and joint reasoning between different modalities has been widely applied in content understanding and interactive systems.

[0003] Traditional intelligent content analysis and recommendation systems, in multi-source information scenarios, construct content models based on user behavior records, text semantic features, and contextual conditions, and then generate recommendation results using collaborative filtering or content-based methods. The processing flow typically involves: extracting texture and edge features from image modalities, encoding word vector representations for text modalities, and establishing separate content libraries; associating data from each modality based on manual rules or simple mapping functions; and combining user clicks, ratings, and other behavioral indicators to rank and recommend content using similarity. This approach relies on independent processing and rule mapping between modalities, lacks a unified semantic space, and limits modal collaboration capabilities.

[0004] Existing methods often process images, text, and speech independently, lacking segment-level synchronization mechanisms and failing to establish temporal and positional correspondences between modalities. Images and text only have coarse-grained mappings, lacking methods for comparing change trends, making it difficult to support content evolution analysis. Speech processing fails to build structured connections with other modalities, limiting semantic coverage. Recommendation generation relies on behavior-driven approaches, ignoring the structural synergy of multimodal content itself, resulting in a lack of continuity and integration in the semantic expression of matched content. Independent modal modeling and the lack of semantic mapping create gaps in the consistency of recommendation output across multiple modalities.

Summary of the Invention

[0005] The purpose of this invention is to address the shortcomings of existing technologies by proposing an intelligent content analysis and recommendation system based on cross-modal learning.

[0006] To achieve the above objectives, the present invention adopts the following technical solution: an intelligent content analysis and recommendation system based on cross-modal learning, the system comprising:

The modal data acquisition module acquires edge information from image sources and extracts edge region distribution segments; it acquires phrase expressions from text and extracts the position of language segments within the text; it extracts speech frequency changes and obtains the starting point and mode. The three types of source content are then sequentially concatenated to form a cross-modal synchronized content group.

The state offset extraction module extracts the segment movement mode based on the image change path, text phrase order, and speech frequency trend in the cross-modal synchronized content group, determines whether the process order has been adjusted, extracts displacement features from each source according to the relationship between the preceding and following parts, and merges them into the multi-source state offset tag set.

The co-evolution recognition module extracts the continuous stage change process based on the images and text in the state offset marker set, compares the change trends of the two sources in the same stage, identifies consistent segments, outputs the comparison results in parallel, and integrates them into the modal co-evolution matching results.

The semantic node parsing module uses the image edge positions and the text regions where phrases are located in the modal collaborative change matching results to extract corresponding edges from the image, extract sentence segments from the text, and extract similar segments from the speech, matching the three types of source content to form a cross-modal semantic recognition content structure.

The content recommendation generation module calls the cross-modal semantic recognition content structure to extract image fragments, language structures and frequency expressions, filters relevant content blocks from resources based on source consistency, locates the matching items in each type of content block, and outputs the paragraphs they belong to, forming a cross-modal content analysis and recommendation structure.

[0007] As a further aspect of the present invention, the cross-modal synchronized content group includes image edge position index, text phrase block annotation, and speech frequency change point identifier; the multi-source state offset tag set includes image region displacement label, text phrase temporal offset label, and speech change path annotation; the modal collaborative change matching result includes image segment trend consistency unit, text paragraph synchronized change unit, and cross-modal matching index relationship; the cross-modal semantic recognition content structure includes image edge semantic segments, text sentence corresponding phrase blocks, and speech signal semantic segments; and the cross-modal content analysis and recommendation structure includes image expression segment set, language structure expression set, and speech frequency feature set.

[0008] As a further aspect of the present invention, the modal data acquisition module includes: The edge extraction submodule acquires image frame data from the image source, judges based on pixel grayscale changes and region boundary continuity, identifies edge regions with coherent structures in the image, groups the identified edge regions according to image coordinate information, extracts their spatial location range in the image, and generates edge region distribution intervals. The language segment localization submodule calls the edge region distribution interval to obtain the sentence content and phrase division position in the text source. Based on the division position and semantic intensity change features, the phrase expression content is filtered. The character intervals corresponding to the filtered phrases in the text are marked with the spatial range of the image edge region to obtain the cross-domain position matching interval. The frequency change analysis submodule acquires the spectral signal and its time index sequence from the speech source based on the cross-domain position matching interval, identifies the start position and trend of frequency change, and performs unified alignment processing between the identified time position and the corresponding position in the image and text to establish a cross-modal synchronized content group.

[0009] As a further aspect of the present invention, the state offset extraction module includes: The path extraction submodule obtains the image region change path, text phrase appearance order and speech frequency change trend in the cross-modal synchronized content group, extracts the continuous movement direction of the region in the image, identifies the sequential position of the text phrase in the content, extracts the start and end trend of the frequency trend in the speech, and compares the three types of sources in the connection order in time to obtain the source path order structure. The sequence judgment submodule calls the source path sequence structure and, based on the relative relationships of the image path position numbers, adjacent segments between text phrases, and the start and end points of the speech change trend, identifies whether the order of the three types of sources has been adjusted in the process, extracts the corresponding change status, and obtains the source order change performance. The feature aggregation submodule extracts the starting point, ending point and movement direction of each segment of image, text and speech according to the source order change performance. It extracts the change performance in the corresponding movement process separately according to the source type, and arranges their temporal relationship and direction according to the sequential structure to obtain a multi-source state offset tag set.

[0010] As a further aspect of the present invention, the cooperative evolution identification module includes: The stage extraction submodule obtains the image source and text source in the multi-source state offset marker set, extracts the segment sequence corresponding to the continuous movement process in the image, extracts the continuous flow order of phrase positions in the text content, and aligns the image and text according to the connection correspondence in the processing stage to obtain the cross-source stage corresponding sequence. The trend comparison submodule calls the corresponding sequence of the cross-source stage, and judges whether the change trend of adjacent segments in the image and text is consistent according to the movement direction in the image segment and the order change in the text phrase. The change trend is segmented and compared to obtain the stage trend comparison segment. The collaborative recognition submodule, based on the phase trend comparison segments, filters image segments and text segments with consistent change trends, outputs segments with consistent sources in parallel, and compares the change rhythms under different sources to pair and organize them, thereby obtaining modal collaborative change matching results.

[0011] As a further aspect of the present invention, the semantic node parsing module includes: The image and text extraction submodule obtains the image edge position and the text region where the phrase is located in the modal cooperative change matching result, extracts the edge fragments at the corresponding positions from the image, extracts the complete sentence where the phrase is located from the text, and performs the image edge and text sentence placement processing according to the corresponding region order to obtain the image and text corresponding fragment combination. The voice association submodule calls the combination of corresponding image and text segments, collects the audio content of the time period adjacent to the image position in the voice source, extracts the voice part near the start position of the text sentence, compares the continuous sound segments in the voice with the corresponding positions of the aforementioned image and text areas, and obtains a set of voice proximity segments. The semantic fusion submodule compares the temporal correspondence between image edge segments, text sentences, and speech segments based on the set of speech segments that are close to each other. It then organizes and classifies the three types of segments in the order of consistent sources, and performs cross-source connection on the image, text, and speech content that are close to each other to obtain the cross-modal semantic recognition content structure.

[0012] As a further aspect of the present invention, the content recommendation generation module includes: The segment filtering submodule acquires image segments, language structures, and frequency expressions extracted from the cross-modal semantic recognition content structure, extracts continuous content segments containing all three types of expressions from the content resources, identifies the text and speech content corresponding to the structural positions in the image, and merges and organizes them according to the source category to obtain cross-source expression combination units. The content localization submodule calls the cross-source expression combination unit to identify the layer area where the image edge is located, the position of the sentence in the text and the time period annotation of the audio segment, and performs localization comparison in the corresponding position in the content resource according to the consistent source method, extracts content information with similar positions in various sources, and obtains multi-source corresponding localization results; The paragraph output submodule extracts the paragraphs containing the matching content of images, text, and audio from the resource content involved, based on the multi-source corresponding positioning results. These paragraphs are then integrated into the same content sequence according to the order of the segments. The overlapping descriptions in the continuous paragraphs are extracted and grouped to obtain the cross-modal content analysis and recommendation structure.

[0013] Compared with the prior art, the advantages and positive effects of the present invention are as follows: In this invention, cross-modal segment synchronization relationships are constructed by sequentially organizing image edge regions, text phrase positions, and speech frequency changes, providing a unified reference basis for information from different sources. State markers are formed by the relative displacement changes of segments in the process, supporting the tracking of multimodal content evolution. By comparing the corresponding trends of image and text segment changes, combinations of semantic segments with synchronous evolution characteristics are selected, enhancing intermodal synergy. A set of cross-modal semantic nodes is formed by the joint indication of edge positions, sentence segments, and frequency changes, supporting the integrated construction of content semantic structures. Multimodal content resources are screened using consistency conditions, promoting the interconnected coverage of recommendation results at the structural level.

Attached Figure Description

[0014] Figure 1 is a flowchart of the method of the present invention; Figure 2 is a flowchart of the modal data acquisition module of the present invention; Figure 3 is a flowchart of the state offset extraction module of the present invention; Figure 4 is a flowchart of the collaborative evolution identification module of the present invention; Figure 5 is a flowchart of the semantic node parsing module of the present invention; Figure 6 is a flowchart of the content recommendation generation module of the present invention.

Detailed Implementation

[0015] The technical solution of the present invention will now be described with reference to the accompanying drawings.

[0016] In this embodiment of the invention, a subscripted symbol such as W₁ may sometimes be written in non-subscript form as W1. Where the difference is not emphasized, the two forms have the same meaning.

[0017] To make the technical problems, technical solutions and advantages of the present invention clearer, a detailed description will be given below in conjunction with the accompanying drawings and specific embodiments.

[0018] Referring to Figure 1, this invention provides a technical solution: an intelligent content analysis and recommendation system based on cross-modal learning, the system comprising:

The modal data acquisition module acquires edge information from the image source, extracts the distribution segments of the edge region, acquires the phrase expression part from the text source, extracts the position expression of the language segment in the text, acquires the frequency change part from the speech source, extracts the start position and change mode of the change, and concatenates the information extracted from the three sources according to the source order to form a cross-modal synchronous content group.

The state offset extraction module extracts the movement mode of the source segments in the current stage according to the image region change path, text phrase appearance order and speech frequency change trend in the cross-modal synchronized content group, determines whether it has been adjusted in the sequential process, extracts the displacement features of each type of source according to the front-back relationship between positions, and merges them into the multi-source state offset tag set.

The co-evolution recognition module extracts the segment change process of images and text sources in the continuous processing stage based on the multi-source state offset marker set. It compares the change trends of two sources in the same stage segment by segment, identifies source segments with consistent trends, and outputs the comparison results in parallel, which are then incorporated into the modal co-evolution matching results.

The semantic node parsing module uses the image edge positions and the text regions where the phrases appear in the modal cooperative change matching results to extract the edge parts corresponding to the positions from the image content, extract the sentence segments to which the corresponding phrases belong from the text, and extract the speech parts that are close to the positions involved in these two sources from the speech source. The three types of segments are matched with each other to form a cross-modal semantic recognition content structure.

The content recommendation generation module calls upon image fragments, language structures, and frequency expressions extracted from the cross-modal semantic recognition content structure. Based on source consistency, it filters out content blocks that simultaneously involve the three types of expressions from the content resources. It then compares and locates the matching items in each type of content block and outputs them after combining them into the paragraphs in the resource content, forming a cross-modal content analysis and recommendation structure.

[0019] The cross-modal synchronous content group includes image edge location index, text phrase block annotation, and speech frequency change point identifier. The multi-source state offset tag set includes image region displacement label, text phrase temporal offset label, and speech change path annotation. The modal collaborative change matching results include image segment trend consistency unit, text paragraph synchronous change unit, and cross-modal matching index relationship. The cross-modal semantic recognition content structure includes image edge semantic segments, text sentence corresponding phrase blocks, and speech signal semantic segments. The cross-modal content analysis recommendation structure includes image expression segment set, language structure expression set, and speech frequency feature set.

[0020] Referring to Figure 2, the modal data acquisition module includes: The edge extraction submodule acquires image frame data from the image source, judges based on pixel grayscale changes and region boundary continuity, identifies edge regions with coherent structures in the image, groups the identified edge regions according to image coordinate information, extracts their spatial location range in the image, and generates edge region distribution intervals. The image preprocessing program is initiated to establish a grayscale matrix mapping for high-definition video frame data. Taking a teaching video frame with a resolution of 1920 x 1080 as an example, the processor traverses each pixel in the video frame row by row and column by column, reading its values in the red, green, and blue channels. To eliminate the interference of color information on geometric structure recognition and reduce the computational load, the system weights the channel values according to the luminance conversion weights of the International Commission on Illumination (CIE) standard to obtain the corresponding grayscale values. First-order differences are used to compute the gradient of pixel grayscale change in the horizontal and vertical directions, and the absolute values of the two gradients are summed to reflect the overall degree of grayscale change. The resulting combined gradient value is then compared point by point with the system's preset judgment threshold of 45. This threshold was set based on tests on 5,000 images of teaching scenarios including complex blackboard writing and PPT projection; at this value, the noise false recognition rate and the effective edge missed detection rate were best balanced. The system then marks all pixels with gradient values exceeding 45 as edge candidate points and runs a connected component analysis to remove isolated noise areas of fewer than 15 pixels that cannot form a continuous line. For the clearly structured edge areas that are retained, the system calculates their extreme value range in the image coordinate system and extracts the coordinates of the bounding rectangle. For example, the system identifies the upper left corner of a geometric edge on the left side of the blackboard as (320, 450) and the lower right corner as (580, 610). The system encapsulates this set of coordinate data to generate the edge region distribution interval.
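The edge extraction steps in this paragraph can be sketched in pure Python. This is a minimal illustration under stated assumptions, not the applicant's implementation: the BT.601 luma weights, the use of forward differences as the first-order difference, 8-connectivity, and all function names are assumptions. Only the gradient threshold of 45 and the 15-pixel minimum region size come from the text.

```python
from collections import deque

# Assumed luma weights (ITU-R BT.601); the text does not give numeric weights.
W_R, W_G, W_B = 0.299, 0.587, 0.114
GRADIENT_THRESHOLD = 45   # judgment threshold from the embodiment
MIN_REGION_PIXELS = 15    # smaller connected regions are treated as noise


def to_gray(rgb):
    """Convert a 2-D grid of (r, g, b) tuples to a grid of grayscale values."""
    return [[W_R * r + W_G * g + W_B * b for (r, g, b) in row] for row in rgb]


def edge_mask(gray):
    """Mark pixels whose |dx| + |dy| forward difference exceeds the threshold."""
    h, w = len(gray), len(gray[0])
    mask = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx = gray[y][x + 1] - gray[y][x] if x + 1 < w else 0.0
            dy = gray[y + 1][x] - gray[y][x] if y + 1 < h else 0.0
            mask[y][x] = abs(dx) + abs(dy) > GRADIENT_THRESHOLD
    return mask


def edge_regions(mask):
    """Return bounding boxes (x_min, y_min, x_max, y_max) of 8-connected
    candidate regions that contain at least MIN_REGION_PIXELS pixels."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y0 in range(h):
        for x0 in range(w):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first flood fill over one connected component.
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            pixels = []
            while queue:
                x, y = queue.popleft()
                pixels.append((x, y))
                for ny in range(max(0, y - 1), min(h, y + 2)):
                    for nx in range(max(0, x - 1), min(w, x + 2)):
                        if mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            if len(pixels) >= MIN_REGION_PIXELS:
                xs = [p[0] for p in pixels]
                ys = [p[1] for p in pixels]
                boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes
```

A production system would more likely use an image library for these steps; the pure-Python form is chosen here only so that each stage (grayscale, gradient, thresholding, component filtering, bounding box) is visible.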

[0021] The language fragment localization submodule calls the edge region distribution interval to obtain the sentence content and phrase division position in the text source. Based on the division position and semantic intensity change features, the phrase expression content is filtered. The character intervals corresponding to the filtered phrases in the text are marked with the spatial range of the image edge region to obtain the cross-domain position matching interval. Using the edge region distribution intervals, the system applies a pre-trained convolutional recurrent neural network model to recognize text content in video frames, extracting the source text and its phrase segmentation positions within the image. Semantic strength analysis is then performed on the recognized text stream: fine-grained part-of-speech tagging is conducted using a word segmentation tool, and term frequency-inverse document frequency (TF-IDF) analysis against a large-scale corpus is used to assess lexical importance. The system sets a semantic strength filtering threshold of 0.6. For the words "triangle," "hypotenuse," and "5" identified in the text, the system calculates TF-IDF values of 0.72, 0.68, and 0.81, respectively. All three exceed the filtering threshold, so these words are identified as core phrases carrying key information. The system further obtains the character bounding-box coordinates of these core phrases in the original coordinate system of the image and compares them with the spatial range of the aforementioned edge region using a geometric inclusion coefficient. When the "triangle" character region is detected to lie completely inside the geometric edge region, the calculated overlap coefficient is 1.0, which exceeds the preset inclusion judgment threshold of 0.9. Based on this, the system determines that there is a strong correlation between the text content and the image region and marks them accordingly, thereby obtaining the cross-domain position matching interval.
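The two threshold checks in this paragraph can be illustrated as follows. This is a sketch under assumptions: the inclusion coefficient is taken to be the fraction of the character box area that lies inside the edge region, which the text does not define exactly. The character box coordinates, the extra word "the" with its score, and the function names are hypothetical; the thresholds 0.6 and 0.9, the three word scores, and the edge region coordinates come from the embodiment.

```python
SEMANTIC_THRESHOLD = 0.6     # semantic strength filtering threshold
CONTAINMENT_THRESHOLD = 0.9  # inclusion judgment threshold


def filter_core_phrases(scores):
    """Keep words whose lexical importance score exceeds the threshold."""
    return [word for word, score in scores.items() if score > SEMANTIC_THRESHOLD]


def containment(inner, outer):
    """Fraction of the inner box's area that lies inside the outer box.

    Boxes are (x_min, y_min, x_max, y_max). This definition of the
    inclusion coefficient is an assumption.
    """
    ix = max(0, min(inner[2], outer[2]) - max(inner[0], outer[0]))
    iy = max(0, min(inner[3], outer[3]) - max(inner[1], outer[1]))
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    return ix * iy / area if area > 0 else 0.0


def is_matched(char_box, edge_box):
    """True when the character region is judged to lie inside the edge region."""
    return containment(char_box, edge_box) > CONTAINMENT_THRESHOLD


# Scores for the three words are from the embodiment; "the" is a hypothetical filler.
scores = {"triangle": 0.72, "hypotenuse": 0.68, "5": 0.81, "the": 0.12}
core = filter_core_phrases(scores)

edge_box = (320, 450, 580, 610)      # edge region from the embodiment
triangle_box = (400, 500, 460, 530)  # hypothetical character box inside it
matched = is_matched(triangle_box, edge_box)
```

With these inputs the three core words pass the filter and the hypothetical character box is fully contained, giving a coefficient of 1.0.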

[0022] The frequency change analysis submodule obtains the spectral signal and its time index sequence from the speech source based on the cross-domain position matching interval, identifies the start position and trend of frequency change, and performs unified alignment processing between the identified time position and the corresponding position in the image and text to establish a cross-modal synchronized content group.

[0023] Based on the cross-domain location matching interval and synchronously loading the audio stream spectrum signal corresponding to the video content, the system uses short-time spectrum analysis technology with an analysis frame length of 20 milliseconds and a frame shift step of 10 milliseconds to extract high-density frequency domain features from the audio signal. In order to accurately capture the start time of the voice command and filter the background noise, the system sets the energy threshold for voice start recognition to -40 dB. When the system detects that the energy level of a certain frequency band significantly jumps from the background noise level to above -35 dB within a time window of 3 consecutive frames, it determines that the moment is the valid start point of the voice signal. Based on this, the system identifies the time index sequence of reciting specific teaching content in the voice stream as precisely from 12.50 seconds to 13.20 seconds. In order to achieve spatiotemporal unification of multimodal data, the system performs unified alignment processing on the voice time period, the image frame index of the corresponding time point of 12.50 seconds (375th image frame), and the spatial coordinate position of the text content in the image, anchoring the data scattered in different modalities on the same spatiotemporal reference and establishing a cross-modal synchronized content group.
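The onset rule and the time-to-frame mapping described here can be sketched as below. This is a simplified reading, not the described system: it uses full-band frame energy rather than per-band spectral energy, interprets the rule as "three consecutive frames above -35 dB", and assumes a video rate of 30 frames per second (which is consistent with 12.50 seconds mapping to frame 375). Function names are hypothetical.

```python
import math

FRAME_MS = 20         # analysis frame length from the embodiment
HOP_MS = 10           # frame shift from the embodiment
ONSET_DB = -35.0      # level that must be exceeded at speech onset
CONSECUTIVE = 3       # number of consecutive frames required


def frame_energy_db(samples, sample_rate):
    """Mean-square energy in dB for each 20 ms frame with a 10 ms hop.

    Samples are assumed to be floats normalised to the range [-1, 1].
    """
    frame = sample_rate * FRAME_MS // 1000
    hop = sample_rate * HOP_MS // 1000
    energies = []
    for start in range(0, len(samples) - frame + 1, hop):
        chunk = samples[start:start + frame]
        mean_sq = sum(s * s for s in chunk) / frame
        energies.append(10 * math.log10(mean_sq) if mean_sq > 0 else float("-inf"))
    return energies


def onset_time(energies_db):
    """Time in seconds of the first frame that starts a run of CONSECUTIVE
    frames above ONSET_DB, or None if no such run exists."""
    run = 0
    for i, energy in enumerate(energies_db):
        run = run + 1 if energy > ONSET_DB else 0
        if run == CONSECUTIVE:
            return (i - CONSECUTIVE + 1) * HOP_MS / 1000.0
    return None


def frame_index(time_s, fps=30):
    """Map a time in seconds to a video frame index (fps is an assumption)."""
    return round(time_s * fps)
```

For example, `frame_index(12.50)` gives 375, matching the image frame cited in the embodiment.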

[0024] Referring to Figure 3, the state offset extraction module includes: The path extraction submodule obtains the image region change path, text phrase appearance order and speech frequency change trend in the cross-modal synchronized content group, extracts the continuous movement direction of the region in the image, identifies the sequential position of the text phrase in the content, extracts the start and end of the frequency trend in the speech, and compares the connection order of the three types of sources in time to obtain the source path order structure. The system acquires image region information from the cross-modal synchronized content group. Focusing on the time interval from 12.50 to 13.50 seconds, it tracks the 30 consecutive frames played during this period, specifically the center point coordinates of objects such as a teacher's finger or pointer. Through frame-by-frame comparison, the system detects that this center point moves smoothly from (350, 480) in the starting frame to (450, 480) in the ending frame. From this it calculates the displacement vector of the action: a horizontal movement of 100 pixels to the right, at an angle of 0 degrees. Simultaneously, the system parses the text modality and indexes the core phrases "triangle," "hypotenuse," and "length" as 1, 2, and 3, respectively, by their order of appearance in the complete sentence structure, clarifying the logical flow of the text information. For the speech modality, the system extracts the fundamental frequency trajectory of the audio signal. Analysis of the frequency change curve shows that the speech pitch rises linearly from 200 Hz to 280 Hz starting at 12.50 seconds, a clear rising pitch trend. Because the image movement start time, text appearance time, and speech start time are all 12.50 seconds, the three sources are confirmed to have a strictly consistent connection order at the starting point. This multi-dimensional spatiotemporal trajectory information is integrated to generate the source path sequence structure.
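The arithmetic in this example (displacement vector, phrase order index, and pitch trend) can be checked with a short sketch. Function names are hypothetical, and the angle is measured in image coordinates with x to the right, which is an assumption about the convention.

```python
import math


def displacement(start, end):
    """Return (magnitude in pixels, direction angle in degrees) from start to end."""
    dx = end[0] - start[0]
    dy = end[1] - start[1]
    return math.hypot(dx, dy), math.degrees(math.atan2(dy, dx))


def phrase_order(phrases):
    """Index phrases 1, 2, 3, ... by their order of appearance."""
    return {phrase: i + 1 for i, phrase in enumerate(phrases)}


def pitch_trend(f_start_hz, f_end_hz):
    """Classify the fundamental-frequency change as rising, falling, or flat."""
    if f_end_hz > f_start_hz:
        return "rising"
    if f_end_hz < f_start_hz:
        return "falling"
    return "flat"


# Values from the embodiment.
magnitude, angle = displacement((350, 480), (450, 480))
order = phrase_order(["triangle", "hypotenuse", "length"])
trend = pitch_trend(200.0, 280.0)
```

Here `magnitude` is 100.0 pixels and `angle` is 0.0 degrees, matching the horizontal rightward movement described in the text.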

[0025] The sequence judgment submodule calls the source path sequence structure and, based on the relative relationships of the image path position numbers, adjacent segments between text phrases, and the start and end points of the speech change trend, identifies whether the order of the three types of sources has been adjusted in the process, extracts the corresponding change status, and obtains the source order change performance. The purpose of calling the source path sequence structure is to verify whether the information flow direction is consistent across modalities. The system compares the changes in the position numbers of the image path with the order of adjacent segments between text phrases. It applies strict order consistency rules and analyzes the difference between the direction angle of the image displacement vector and the standard text reading direction angle. In a Chinese-language environment, the reading direction is usually defined as horizontal to the right, i.e., 0 degrees. When the absolute angle difference between the image movement direction and the text reading direction is less than the allowable error of 30 degrees, and the speech timestamps and text index numbers are both strictly increasing, the system determines that the sequential logic of the three sources in the teaching process has not been adjusted and that they are advancing in coordination. In this embodiment, the image movement angle is detected as 0 degrees, which matches the text reading direction exactly, and the speech content and text vocabulary unfold synchronously, so the three are consistent in spatiotemporal logic. Based on this, the system identifies the current change state as synchronous and forward, defines this state of full cooperation between the modalities as linear synchronous evolution, and outputs the source order change performance.
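The order consistency rule can be expressed as a small predicate. This is a sketch under assumptions: the rule is read as requiring both the speech timestamps and the text index numbers to be strictly increasing, the label "order adjusted" for the failing case is hypothetical, and the sample timestamps are illustrative values inside the 12.50 to 13.20 second range rather than data from the text.

```python
READING_DIRECTION_DEG = 0.0   # horizontal, left to right
ANGLE_TOLERANCE_DEG = 30.0    # allowable error from the embodiment


def angle_difference(a_deg, b_deg):
    """Smallest absolute difference between two directions, in degrees."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def strictly_increasing(values):
    """True when every value is greater than the one before it."""
    return all(x < y for x, y in zip(values, values[1:]))


def order_state(move_angle_deg, speech_timestamps, text_indices):
    """Classify whether the three sources advance in the same order."""
    aligned = angle_difference(move_angle_deg, READING_DIRECTION_DEG) < ANGLE_TOLERANCE_DEG
    if aligned and strictly_increasing(speech_timestamps) and strictly_increasing(text_indices):
        return "linear synchronous evolution"
    return "order adjusted"  # hypothetical label for the non-synchronous case


# Illustrative inputs: 0-degree movement, increasing timestamps and indices.
state = order_state(0.0, [12.50, 12.80, 13.10], [1, 2, 3])
```

Wrapping the angle difference into the range 0 to 180 degrees avoids treating 350 degrees and 0 degrees as far apart, which a plain subtraction would do.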

[0026] The feature aggregation submodule extracts the starting point, ending point and movement direction of each segment of image, text and speech based on the source order change performance. It extracts the corresponding change performance in the movement process separately according to the source type, arranges their temporal relationship and direction according to the sequential structure, and obtains a multi-source state offset label set.

[0027] Based on the source order change performance, the feature data of each segment of image, text, and speech are extracted and encapsulated in a refined manner. For the image modality, the precise starting coordinates (350, 480), the ending coordinates (450, 480), and the displacement vector values describing the nature of the action segment are extracted. For the text modality, the character index range (5 to 8) of the core phrase in the sentence is extracted to clarify its position in the semantic structure. For the speech modality, the precise time range (12.50 seconds to 13.20 seconds) of the speech instruction is extracted. The system no longer treats these data as isolated information points, but classifies them according to source type. Strictly following the previously determined "linear synchronous evolution" sequence structure, these feature data are arranged and combined according to temporal relationship and direction to construct a structured dataset containing multi-dimensional feature vectors. This ensures that subsequent modules can directly call this spatiotemporally calibrated feature information, and the multi-source state offset label set is thereby obtained.
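One possible in-memory layout for such a label set is sketched below. All class and field names are hypothetical and chosen only for illustration; the numeric values are those given in this embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageOffset:
    start_xy: tuple  # starting pixel coordinates (x, y)
    end_xy: tuple    # ending pixel coordinates (x, y)

    @property
    def displacement(self) -> tuple:
        """Displacement vector from start to end, in pixels."""
        return (self.end_xy[0] - self.start_xy[0],
                self.end_xy[1] - self.start_xy[1])


@dataclass(frozen=True)
class TextOffset:
    char_start: int  # first character index of the core phrase
    char_end: int    # last character index of the core phrase


@dataclass(frozen=True)
class SpeechOffset:
    start_s: float   # start of the speech instruction, in seconds
    end_s: float     # end of the speech instruction, in seconds


@dataclass(frozen=True)
class StateOffsetLabelSet:
    order_state: str
    image: ImageOffset
    text: TextOffset
    speech: SpeechOffset


label_set = StateOffsetLabelSet(
    order_state="linear synchronous evolution",
    image=ImageOffset(start_xy=(350, 480), end_xy=(450, 480)),
    text=TextOffset(char_start=5, char_end=8),
    speech=SpeechOffset(start_s=12.50, end_s=13.20),
)
```

With these values the displacement vector is (100, 0), i.e., a purely horizontal movement to the right.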

[0028] Please see Figure 4. The co-evolution recognition module includes the following submodules. The stage extraction submodule obtains the image source and text source from the multi-source state offset label set, extracts the segment sequence corresponding to the continuous movement process in the image, extracts the continuous flow order of phrase positions in the text content, and aligns the image and text according to their connection correspondence in the processing stage to obtain the cross-source stage corresponding sequence. Specifically, the system acquires image and text source data from the multi-source state offset label set. Based on a velocity threshold, it segments continuous actions in the image, identifying 10 consecutive static segments with a velocity of less than 5 pixels per frame as action breakpoints. Using these breakpoints as boundaries, the image stream is divided into several independent action units. On this basis, a sequence of continuous movement segments spanning 12.5 to 15.0 seconds is extracted from the image; this sequence corresponds to one complete teaching instruction action. Simultaneously, the system extracts the continuous flow order of phrase positions in the text content and divides the text stream into corresponding semantic units. To verify whether the visual action and the text semantics belong to the same teaching stage, the system calculates the overlap rate on the time axis between the duration of the image segment and the corresponding time of the text segment. In this embodiment the overlap rate is 0.92, which exceeds the system's stage determination threshold of 0.85, so the system confirms that the two pieces of information from different modalities describe the same teaching step and belong to the same processing stage. The two are then aligned along the time axis to obtain the cross-source stage corresponding sequence.
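The overlap check may be illustrated by the following sketch. The embodiment does not state how the overlap rate is normalized; intersection over union is assumed here. The text segment interval of 12.5 to 14.8 seconds is a hypothetical value chosen so that the result matches the 0.92 stated above.

```python
def overlap_rate(a, b) -> float:
    """Intersection-over-union of two (start, end) intervals given in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


STAGE_THRESHOLD = 0.85          # stage determination threshold from the embodiment

image_segment = (12.5, 15.0)    # continuous movement segment from the embodiment
text_segment = (12.5, 14.8)     # hypothetical text interval, assumed for illustration

rate = overlap_rate(image_segment, text_segment)
same_stage = rate > STAGE_THRESHOLD
```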

[0029] The trend comparison submodule calls the cross-source stage corresponding sequence. Based on the movement direction in the image segment and the order change in the text phrases, it determines whether the change trends of adjacent segments in the image and text are consistent within the same stage, and compares the change trends segment by segment to obtain the stage trend comparison segment. Specifically, the system analyzes the cross-source stage corresponding sequence to determine whether the change trends of the visual and textual elements within the same stage are isomorphic. It compares the direction of movement in the image segment with the direction of sequential change in the text phrases. To quantify this relationship, the system calculates a trend consistency coefficient, determined by combining the sign of the image displacement direction with the sign of the text index change direction. In this embodiment, the image shows a positive horizontal displacement, meaning that the visual focus moves to the right over time, while the text index also increases, meaning that the semantic logic advances to later positions over time. Because the signs of the two directions agree, the consistency coefficient is 1, indicating that the visual guidance and the textual logic are fully consistent in direction. The system performs a high-density segmented comparison across all sampling points within the same stage and monitors the coefficient in real time, confirming that it remains stable at 1 throughout the stage. This result numerically confirms that the visual trajectory and the text flow maintain a strict linear progressive relationship throughout the entire action cycle, without non-monotonic changes such as visual regression, action erasure, or jumps or disorder in the text index. Strict synchronization between the two at the micro-temporal level is thereby ensured, and the stage trend comparison segment is obtained.
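A minimal sketch of the coefficient follows, assuming that "combining" the two signs means multiplying them, which yields 1 when the directions agree. The intermediate sample positions (400 pixels, index 6) are hypothetical; only the endpoints 350 to 450 and 5 to 8 come from the embodiment.

```python
def sign(x) -> int:
    """Return 1 for positive values, -1 for negative values, and 0 for zero."""
    return (x > 0) - (x < 0)


def trend_consistency(dx, d_index) -> int:
    """Product of the displacement sign and the text index change sign."""
    return sign(dx) * sign(d_index)


def stage_is_stable(x_positions, text_indices) -> bool:
    """True when the coefficient equals 1 between every pair of adjacent samples."""
    x_steps = zip(x_positions, x_positions[1:])
    i_steps = zip(text_indices, text_indices[1:])
    return all(
        trend_consistency(x1 - x0, i1 - i0) == 1
        for (x0, x1), (i0, i1) in zip(x_steps, i_steps)
    )


coefficient = trend_consistency(450 - 350, 8 - 5)
stable = stage_is_stable([350, 400, 450], [5, 6, 8])
```

A visual regression, such as the x position dropping from 400 back to 380 while the text index keeps increasing, produces a coefficient of -1 for that step and the stage is no longer reported as stable.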

[0030] The collaborative recognition submodule filters image and text segments with consistent change trends based on stage trend comparison segments, arranges and outputs segments with consistent sources side by side, and compares the change rhythm under different sources to pair and organize them to obtain modal collaborative change matching results.

[0031] Based on the stage trend comparison segments, image segments and text segments with consistent change trends are selected and arranged side by side for output. To further evaluate the quality of collaboration, the system compares the change rhythm under the different sources and calculates the ratio of the image motion rate to the text presentation rate, that is, the ratio of image displacement to text information per unit time. In this example, the calculated image rate is 120 pixels per second and the text rate is 1.2 phrases per second. The ratio remains stable at around 100, which reflects the information density relationship under the current teaching mode. The system continuously monitors this ratio and confirms that there are no sudden changes exceeding 20% of the baseline value, that is, no abnormal situations such as sudden fast-forwarding of the picture or prolonged stagnation of the text. On this basis, the system determines that the image and text are well matched in change rhythm. After pairing and sorting, this set of highly collaborative multimodal data is confirmed as an effective teaching unit, and the modal collaborative change matching results are obtained.
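The rhythm check may be sketched as follows. It is assumed here that a "sudden change" means a monitored ratio deviating from the baseline ratio by more than 20% of that baseline; the monitored sample values are hypothetical.

```python
def rhythm_ratio(image_rate_px_s: float, text_rate_phrases_s: float) -> float:
    """Ratio of image motion rate to text presentation rate."""
    return image_rate_px_s / text_rate_phrases_s


def rhythm_is_stable(ratios, baseline: float, max_deviation: float = 0.20) -> bool:
    """True when no monitored ratio deviates from the baseline by more than 20%."""
    return all(abs(r - baseline) <= max_deviation * baseline for r in ratios)


# Rates from the embodiment: 120 pixels per second, 1.2 phrases per second.
baseline = rhythm_ratio(120.0, 1.2)

# Hypothetical monitored ratios over the stage.
matched = rhythm_is_stable([98.0, 105.0, 110.0], baseline)
```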

[0032] Please see Figure 5. The semantic node parsing module includes the following submodules. The image and text extraction submodule obtains the image edge positions and the text regions where the phrases are located from the modal collaborative change matching results. It extracts the edge fragments at the corresponding positions from the image, extracts the complete sentence containing each phrase from the text, and places the image edges and text sentences according to the corresponding region order to obtain the image and text corresponding fragment combination. Specifically, the system obtains the image edge positions and the text regions containing phrases from the modal collaborative change matching results. To obtain the clearest visual features, the system goes back to the original high-resolution image data and crops the corresponding edge fragments according to the precise coordinate range so that no detail is lost. At the same time, the complete sentence containing the phrase is extracted from the text source to preserve the context. The system then reconstructs the layout according to the spatial arrangement of the image and text on the screen: the image edge fragments are placed at the left layout node of a virtual canvas, and the text sentences are placed at the right layout node. This placement not only restores the original spatial relationship but also provides a standardized spatial structure for subsequent semantic fusion. After this processing, the image and text corresponding fragment combination is obtained.

[0033] The speech association submodule calls the image and text corresponding fragment combination, collects the audio content of the time period adjacent to the image position in the speech source, extracts the speech part near the beginning of the text sentence, compares the continuous sound segments in the speech with the corresponding positions of the aforementioned image and text regions, and obtains a set of speech segments that are close to each other. Specifically, the system calls the image and text corresponding fragment combination in order to supplement it with auditory information, and collects audio content from the time period adjacent to the image position in the speech source. To prevent truncation caused by speech that begins early or ends late, the system locks the core time window of 12.50 seconds to 13.20 seconds and extends it forward and backward by 0.2 seconds each. It extracts the complete speech segment from 12.30 seconds to 13.40 seconds and applies phoneme forced alignment to analyze this segment at a fine granularity, identifying the start and end times of the pronunciation of the specific teaching words as 12.55 seconds to 13.15 seconds. The system then compares this result on the time axis with the corresponding positions of the aforementioned image and text regions to ensure that the sound signal and the visual image are accurately matched in time, yielding the set of speech segments that are close to each other.
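The window extension can be illustrated with a short sketch. The helper names are hypothetical, rounding to two decimals is an assumption made to keep the boundaries readable, and the forced alignment step itself is not shown; its result is simply entered as the interval reported in the embodiment.

```python
def extend_window(start_s: float, end_s: float, margin_s: float = 0.2) -> tuple:
    """Extend a time window by margin_s on both sides, clamped at 0 seconds."""
    return (round(max(0.0, start_s - margin_s), 2), round(end_s + margin_s, 2))


def contains(outer: tuple, inner: tuple) -> bool:
    """True when the inner interval lies entirely within the outer interval."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


core_window = (12.50, 13.20)            # core time window from the embodiment
window = extend_window(*core_window)    # extended by 0.2 seconds on each side

aligned_word = (12.55, 13.15)           # pronunciation interval reported by alignment
word_is_covered = contains(window, aligned_word)
```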

[0034] The semantic fusion submodule compares the temporal correspondence between image edge segments, text sentences and speech segments based on the set of speech segments that are close together. It then organizes and classifies the three types of segments in the order of consistent sources, and performs cross-source connection on the image, text and speech content that are close to each other to obtain the cross-modal semantic recognition content structure.

[0035] Based on the set of speech segments that are close to each other, the system comprehensively compares the temporal correspondence between image edge segments, text sentences, and speech segments, and introduces a multimodal fusion degree to quantify the degree of correlation among the three. The fusion degree is calculated as the ratio of the intersection duration to the union duration of the three segments on the time axis. In this embodiment, the calculated fusion degree is 0.60, which is higher than the system's preset fusion threshold of 0.5. This indicates that the image, text, and speech overlap substantially in time and point to the same semantic entity in content. The system therefore determines that the three are highly semantically related and organizes and classifies the three types of segments according to their consistent source order. It then connects the closely related image, text, and speech content across sources via a data chain to construct a complete semantic node containing visual, textual, and auditory information, thus obtaining the cross-modal semantic recognition content structure.
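The fusion degree described above, the intersection duration divided by the union duration of three intervals, may be sketched as follows. The speech interval is the aligned interval from this embodiment; the image and text intervals are hypothetical values chosen so that the result matches the stated 0.60.

```python
def intersection_length(intervals) -> float:
    """Duration shared by all intervals, or 0.0 when they have no common part."""
    start = max(s for s, _ in intervals)
    end = min(e for _, e in intervals)
    return max(0.0, end - start)


def union_length(intervals) -> float:
    """Total duration covered by at least one interval (gaps are not counted)."""
    total = 0.0
    cur_start = cur_end = None
    for s, e in sorted(intervals):
        if cur_end is None or s > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start  # close the previous merged run
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)         # extend the current merged run
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def fusion_degree(image, text, speech) -> float:
    """Intersection duration over union duration of the three intervals."""
    intervals = [image, text, speech]
    union = union_length(intervals)
    return intersection_length(intervals) / union if union > 0 else 0.0


FUSION_THRESHOLD = 0.5

degree = fusion_degree(
    image=(12.40, 13.40),   # hypothetical image segment interval
    text=(12.50, 13.30),    # hypothetical text display interval
    speech=(12.55, 13.15),  # aligned pronunciation interval from the embodiment
)
is_semantic_node = degree > FUSION_THRESHOLD
```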

[0036] Please see Figure 6. The content recommendation generation module includes the following submodules. The segment filtering submodule acquires the image segments, language structures, and frequency expressions extracted from the cross-modal semantic recognition content structure. It extracts from the content resources the continuous content segments that simultaneously contain all three types of expressions, identifies the text and speech content corresponding to the structural positions in the image, and merges and organizes them according to source category to obtain cross-source expression combination units. Specifically, the system extracts image segments, language structures, and frequency expressions from the cross-modal semantic recognition content structure. Using these features as retrieval probes, it extracts from a large content resource library the continuous content segments that simultaneously contain geometric visual features, specific keyword text features, and specific tone frequency features. To evaluate the accuracy of the retrieval results, the system computes a weighted sum of the matching degrees of the three feature types according to preset weights. For example, when a candidate video segment has an image feature matching degree of 0.85, a text feature matching degree of 0.90, and a speech feature matching degree of 0.60, the system calculates a comprehensive matching score of 0.82. This score is higher than the system's recommendation screening threshold of 0.75, indicating that the segment is highly similar and relevant to the source content. The system then identifies the text and speech content corresponding to the structural positions in the image and categorizes them into standard teaching knowledge point units according to source category, thus obtaining the cross-source expression combination units.
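The weighted summation can be illustrated as below. The embodiment does not disclose the preset weights; the weights 0.4, 0.4, and 0.2 used here are an assumption, chosen because they reproduce the stated score of 0.82 from the stated matching degrees.

```python
# Assumed weights (not given in the embodiment); they sum to 1.
WEIGHTS = {"image": 0.4, "text": 0.4, "speech": 0.2}
SCREENING_THRESHOLD = 0.75  # recommendation screening threshold from the embodiment


def composite_score(matches: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of the per-modality matching degrees."""
    return sum(weights[name] * matches[name] for name in weights)


# Matching degrees of the candidate video segment from the embodiment.
candidate = {"image": 0.85, "text": 0.90, "speech": 0.60}

score = composite_score(candidate)
recommended = score > SCREENING_THRESHOLD
```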

[0037] The content localization submodule calls the cross-source expression combination unit to identify the layer area where the image edge is located, the position of the sentence in the text, and the time period annotation of the audio segment; performs localization comparison at the corresponding positions in the content resource in a source-consistent manner; extracts content information with similar positions in the various sources; and obtains multi-source corresponding localization results. Specifically, the system uses the cross-source expression combination units to identify the layer regions where image edges are located, the positions of sentences in the text, and the time periods of audio segments. It then performs global localization comparisons against other videos in the content resource library in a source-consistent manner to discover knowledge associations across videos. For example, at 112 seconds in another teaching video, the system locates a structure in which a highly similar graphic and definition appear synchronously, as in the current unit, and the comprehensive matching score at that point reaches 0.80. This means that the two videos explain the same knowledge point at different times. The system extracts content information with similar positions from the various sources, associates and marks these related segments scattered across different videos, and obtains the multi-source corresponding localization results.

[0038] The paragraph output submodule extracts the paragraphs containing matching content of three categories (image, text, and audio) from the resource content involved based on the multi-source corresponding positioning results. These paragraphs are then integrated into the same content sequence according to the order of the segments. The overlapping descriptions in consecutive paragraphs are extracted and grouped to obtain the cross-modal content analysis and recommendation structure.

[0039] Based on the multi-source corresponding positioning results, the system intelligently extracts the complete segments containing the matching content of images, text, and audio from the relevant resource content. Specifically, this includes the 40-60 second segment of the first video and the 110-130 second segment of the second video. Instead of playing the videos according to the timeline of a single video, the system integrates these segments into the same content sequence according to the progressive principle of knowledge logic to form a new logical playlist. In addition, the system detects the existence of overlapping descriptions of the concept of "right-angled side" in consecutive segments, identifies them as key knowledge points, and treats them as knowledge reinforcement points for special grouping processing to obtain a cross-modal content analysis and recommendation structure containing precise time jump links.
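A simplified sketch of how such a playlist might be assembled is given below. The `step` field standing for a segment's position in the knowledge progression, the video identifiers, the "hypotenuse" and "area" concept labels, and the link format are all hypothetical; only the two time ranges and the "right-angled side" concept come from this embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    video_id: str
    start_s: int
    end_s: int
    step: int            # hypothetical position in the knowledge progression
    concepts: frozenset  # concepts described in the segment


def build_playlist(segments):
    """Order segments by knowledge step rather than by any single video timeline."""
    return sorted(segments, key=lambda seg: seg.step)


def reinforcement_points(playlist):
    """Concepts described in both members of any pair of consecutive segments."""
    shared = set()
    for current, following in zip(playlist, playlist[1:]):
        shared |= current.concepts & following.concepts
    return shared


def jump_link(seg: Segment) -> str:
    """Hypothetical time jump link pointing at the start of a segment."""
    return f"{seg.video_id}?t={seg.start_s}"


segments = [
    Segment("video_2", 110, 130, step=2,
            concepts=frozenset({"right-angled side", "area"})),
    Segment("video_1", 40, 60, step=1,
            concepts=frozenset({"right-angled side", "hypotenuse"})),
]

playlist = build_playlist(segments)
key_points = reinforcement_points(playlist)
links = [jump_link(seg) for seg in playlist]
```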

[0040] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in the present invention should be included within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.

Claims

1. An intelligent content analysis and recommendation system based on cross-modal learning, characterized in that the system includes: a modal data acquisition module, which acquires edge information of image sources and extracts edge region distribution segments, acquires phrase expressions in text and extracts the positions of language segments in the text, extracts speech frequency changes and acquires their starting point and mode, and concatenates the three types of source content in sequence to form a cross-modal synchronized content group; a state offset extraction module, which extracts the segment movement mode based on the image change path, text phrase order, and speech frequency trend in the cross-modal synchronized content group, determines whether the process order has been adjusted, extracts displacement features from each source according to the relationship between the preceding and following parts, and merges them into a multi-source state offset label set; a co-evolution recognition module, which extracts the continuous stage change process based on the images and text in the multi-source state offset label set, compares the change trends of the two sources in the same stage, identifies consistent segments, outputs the comparison results in parallel, and integrates them into modal collaborative change matching results; and a semantic node parsing module, which uses the image edge positions and the text regions where phrases are located in the modal collaborative change matching results to extract corresponding edges from the image, extract sentence segments from the text, and extract similar segments from the speech, matching the three types of source content to form a cross-modal semantic recognition content structure.

2. The intelligent content analysis and recommendation system based on cross-modal learning according to claim 1, characterized in that: the cross-modal synchronized content group includes an image edge position index, a text phrase block annotation, and a speech frequency change point identifier; the multi-source state offset label set includes an image region displacement label, a text phrase temporal offset label, and a speech change path annotation; the modal collaborative change matching results include an image segment trend consistency unit, a text paragraph synchronization change unit, and a cross-modal matching index relationship; and the cross-modal semantic recognition content structure includes image edge semantic segments, phrase blocks corresponding to text sentences, and speech signal semantic segments.

3. The intelligent content analysis and recommendation system based on cross-modal learning according to claim 1, characterized in that the modal data acquisition module includes: an edge extraction submodule, which acquires image frame data from the image source, judges based on pixel grayscale changes and region boundary continuity, identifies edge regions with coherent structures in the image, groups the identified edge regions according to image coordinate information, extracts their spatial location range in the image, and generates edge region distribution intervals; a language segment localization submodule, which calls the edge region distribution intervals to obtain the sentence content and phrase division positions in the text source, filters the phrase expression content based on the division positions and semantic intensity change features, and marks the character intervals corresponding to the filtered phrases in the text with the spatial range of the image edge region to obtain the cross-domain position matching interval; and a frequency change analysis submodule, which, based on the cross-domain position matching interval, acquires the spectral signal and its time index sequence from the speech source, identifies the start position and trend of frequency change, and performs unified alignment between the identified time position and the corresponding positions in the image and text to establish the cross-modal synchronized content group.

4. The intelligent content analysis and recommendation system based on cross-modal learning according to claim 1, characterized in that the state offset extraction module includes: a path extraction submodule, which obtains the image region change path, text phrase appearance order, and speech frequency change trend in the cross-modal synchronized content group, extracts the continuous movement direction of the region in the image, identifies the sequential position of the text phrase in the content, extracts the start and end of the frequency trend in the speech, and compares the connection order of the three types of sources in time to obtain the source path sequence structure; a sequence judgment submodule, which calls the source path sequence structure and, based on the relative relationships among the image path position numbers, the adjacency of text phrase segments, and the start and end points of the speech change trend, identifies whether the order of the three types of sources has been adjusted during the process, extracts the corresponding change status, and obtains the source order change performance; and a feature aggregation submodule, which extracts the starting point, ending point, and movement direction of each segment of image, text, and speech according to the source order change performance, extracts the change performance in the corresponding movement process separately according to source type, and arranges their temporal relationship and direction according to the sequential structure to obtain the multi-source state offset label set.

5. The intelligent content analysis and recommendation system based on cross-modal learning according to claim 1, characterized in that the co-evolution recognition module includes: a stage extraction submodule, which obtains the image source and text source from the multi-source state offset label set, extracts the segment sequence corresponding to the continuous movement process in the image, extracts the continuous flow order of phrase positions in the text content, and aligns the image and text according to their connection correspondence in the processing stage to obtain the cross-source stage corresponding sequence; a trend comparison submodule, which calls the cross-source stage corresponding sequence, judges whether the change trends of adjacent segments in the image and text are consistent according to the movement direction in the image segment and the order change in the text phrases, and compares the change trends segment by segment to obtain the stage trend comparison segment; and a collaborative recognition submodule, which, based on the stage trend comparison segments, filters image segments and text segments with consistent change trends, outputs segments with consistent sources in parallel, and compares the change rhythms under the different sources to pair and organize them, thereby obtaining the modal collaborative change matching results.

6. The intelligent content analysis and recommendation system based on cross-modal learning according to claim 1, characterized in that the semantic node parsing module includes: an image and text extraction submodule, which obtains the image edge positions and the text regions where the phrases are located from the modal collaborative change matching results, extracts the edge fragments at the corresponding positions from the image, extracts the complete sentence containing each phrase from the text, and places the image edges and text sentences according to the corresponding region order to obtain the image and text corresponding fragment combination; a speech association submodule, which calls the image and text corresponding fragment combination, collects the audio content of the time period adjacent to the image position in the speech source, extracts the speech part near the start position of the text sentence, compares the continuous sound segments in the speech with the corresponding positions of the aforementioned image and text regions, and obtains a set of speech segments that are close to each other; and a semantic fusion submodule, which compares the temporal correspondence between image edge segments, text sentences, and speech segments based on the set of speech segments that are close to each other, organizes and classifies the three types of segments in the order of consistent sources, and performs cross-source connection on the image, text, and speech content that are close to each other to obtain the cross-modal semantic recognition content structure.

7. The intelligent content analysis and recommendation system based on cross-modal learning according to claim 1, characterized in that the system further includes a content recommendation generation module, which calls the cross-modal semantic recognition content structure to extract image fragments, language structures, and frequency expressions, filters relevant content blocks from resources based on source consistency, locates the matching items in each type of content block, and outputs the paragraphs to which they belong to form a cross-modal content analysis and recommendation structure; wherein the cross-modal content analysis and recommendation structure includes a set of image expression fragments, a set of language structure expressions, and a set of speech frequency features.

8. The intelligent content analysis and recommendation system based on cross-modal learning according to claim 7, characterized in that the content recommendation generation module includes: a segment filtering submodule, which acquires the image segments, language structures, and frequency expressions extracted from the cross-modal semantic recognition content structure, extracts continuous content segments containing all three types of expressions from the content resources, identifies the text and speech content corresponding to the structural positions in the image, and merges and organizes them according to source category to obtain cross-source expression combination units; a content localization submodule, which calls the cross-source expression combination unit to identify the layer area where the image edge is located, the position of the sentence in the text, and the time period annotation of the audio segment, performs localization comparison at the corresponding positions in the content resource in a source-consistent manner, extracts content information with similar positions in the various sources, and obtains multi-source corresponding localization results; and a paragraph output submodule, which, based on the multi-source corresponding localization results, extracts the paragraphs containing the matching content of images, text, and audio from the resource content involved, integrates these paragraphs into the same content sequence according to the order of the segments, and extracts and groups the overlapping descriptions in consecutive paragraphs to obtain the cross-modal content analysis and recommendation structure.