A method for detecting and identifying illegal content on advertising screens based on a deep graph convolutional network

The advertising screen interface image is segmented and recognized, a heterogeneous relationship graph between text and images is constructed, and the graph is processed by a deep graph convolutional network, enabling joint analysis of multimodal content on advertising screens. This addresses the failure of existing technologies to account for the correlation between text and images and improves the accuracy of illegal content detection.

CN122289780A · Pending · Publication Date: 2026-06-26 · GUIZHOU HIGH-SPEED DATA OPERATION CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
GUIZHOU HIGH-SPEED DATA OPERATION CO LTD
Filing Date
2026-03-31
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing technologies for detecting illegal content on advertising screens cannot effectively integrate the spatial layout correlation between text and image content, leading to missed detections and false detections during multimodal content detection, thus reducing the accuracy of illegal content identification.

Method used

The method is built on a deep graph convolutional network. The advertising screen interface image is segmented into text and image regions. Text and image features are obtained using optical character recognition and visual feature extraction. A heterogeneous relationship graph is constructed, node features are propagated and aggregated through multiple graph convolutional layers, and global graph pooling and classification are then performed to identify illegal content.

Benefits of technology

It improves the accuracy of detecting illegal content on advertising screens, effectively integrates the spatial layout correlation of text and images, and enhances the accuracy of multimodal content detection.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention relates to the field of advertising screen content detection technology, and particularly to a method for detecting and identifying illegal content on advertising screens based on a deep graph convolutional network. The method includes: segmenting the display area of the advertising screen playback interface image; performing optical character recognition (OCR) processing on the text display area image and extracting features from the image display area image; constructing a heterogeneous relationship graph based on text node feature vectors and visual node feature vectors; inputting the heterogeneous relationship graph into a deep graph convolutional network to propagate and aggregate the text node feature vectors and visual node feature vectors; performing global graph pooling processing on the updated text node feature vectors and updated visual node feature vectors; and inputting the global graph representation vector into a classification layer for illegal content category determination. This invention can jointly analyze the text and image content simultaneously contained in the advertising screen playback interface, improving the accuracy of illegal content identification.

Description

Technical Field

[0001] This invention relates to the field of advertising screen content detection technology, and in particular to a method for detecting and identifying illegal content on advertising screens based on a deep graph convolutional network.

Background Technology

[0002] Advertising screens, as an important carrier of information dissemination, are widely used in various scenarios. Ensuring the compliance of the content displayed on these screens is crucial for guaranteeing the quality of information dissemination. With the diversification of advertising formats, advertising screen interfaces often simultaneously contain both text and images, making joint detection of multimodal content a core requirement for illegal content identification. Currently, existing illegal content detection technologies for advertising screens mostly employ single-modal detection methods, processing text and image content independently. Text content detection primarily relies on optical character recognition (OCR) technology to extract text information, then uses keyword matching to determine the presence of illegal content. Image content detection often uses conventional convolutional neural networks to extract image features for illegal image identification. However, existing technologies cannot perform joint analysis of text and image content, neglecting the spatial layout and correlation of text and image content within the advertising screen interface. Text and images on advertising screens typically exhibit specific positional pairings and semantic relationships; this correlation is crucial for the accurate identification of illegal content. Existing technologies, lacking consideration of this correlation, are prone to missed detections and false detections during multimodal content detection, significantly reducing the accuracy of illegal content identification and failing to meet the actual needs of real-time and accurate detection of illegal content on advertising screens.

[0003] Chinese patent publication CN110851590A discloses a method for text classification through sensitive word detection and illegal content identification, including: Step 1: Obtain the text to be tested, and then simultaneously execute steps 2 and 3; Step 2: Perform sensitive word detection using an AC automaton, and then execute step 4; Step 3: Perform illegal content identification using a recurrent neural network model, and then execute step 6; Step 4: Determine whether the text contains sensitive words. If so, execute step 5; otherwise, return to step 3; Step 5: If the text contains sensitive words, determine the text category based on the sensitive words; Step 6: Determine whether the text contains illegal content. If so, execute step 7; otherwise, execute step 8; Step 7: If the text contains illegal content, determine the text category based on the illegal content; Step 8: If the text does not contain illegal content; Step 9: End the current processing logic. This scheme only performs sensitive word detection and illegal content identification on text content alone and cannot jointly analyze the text and image content simultaneously contained in the advertising screen playback interface. As a result, the spatial layout correlation between text and images is neglected in multimodal content detection, reducing the accuracy of illegal content identification.

Summary of the Invention

[0004] To address this issue, the present invention provides a method for detecting and identifying illegal content on advertising screens based on a deep graph convolutional network. The method overcomes the inability of existing technologies to jointly analyze the text and image content simultaneously contained in the advertising screen playback interface, an inability that causes the spatial layout correlation between text and images to be neglected during multimodal content detection and thus reduces the accuracy of illegal content identification.

[0005] To achieve the above objectives, this invention provides a method for detecting and identifying illegal content on advertising screens based on a deep graph convolutional network, comprising the following steps:
S1. Obtain the image of the advertising screen playback interface, and perform text region segmentation and image region segmentation on it to obtain the text display area image and the image display area image.
S2. Perform optical character recognition processing on the text display area image to obtain a text content sequence, and extract features from the image display area image according to the visual feature extraction model to obtain a visual feature map.
S3. Process the text content sequence by word vector embedding to obtain text node feature vectors, and process the visual feature map by region division and feature aggregation to obtain visual node feature vectors. Construct a heterogeneous relationship graph based on the text node feature vectors and the visual node feature vectors. The connection edges between text nodes and visual nodes in the heterogeneous relationship graph are established through the spatial positional relationship between the text display area and the image display area in the advertising screen playback interface image.
S4. Construct a deep graph convolutional network containing multiple graph convolutional layers and a classification layer. Input the heterogeneous relationship graph into the deep graph convolutional network. Propagate and aggregate the text node feature vectors and visual node feature vectors through the graph convolutional layers to obtain updated text node feature vectors and updated visual node feature vectors.
S5. Perform global graph pooling on the updated text node feature vectors and the updated visual node feature vectors to obtain the global graph representation vector.
S6. Input the global graph representation vector into the classification layer to determine the category of illegal content, and obtain the illegal content detection and recognition result corresponding to the advertising screen playback interface image.

[0006] The technical principle of this application is as follows: An image of the advertising screen playback interface is acquired, and text region segmentation and image region segmentation are performed on it to separate the text display area image from the image display area image, so that the content played on the advertising screen is captured comprehensively. Optical character recognition is performed on the text display area image to obtain a text content sequence, and a visual feature extraction model extracts a visual feature map from the image display area image, providing the basic information at the text and visual levels. Word vector embedding is performed on the text content sequence to obtain text node feature vectors, and region division and feature aggregation are performed on the visual feature map to obtain visual node feature vectors. A heterogeneous relationship graph is then constructed from these nodes based on the spatial positional relationship between the text display areas and the image display areas in the advertising screen interface. For example, the elements of an advertising screen interface are divided into text nodes and visual nodes: a text node (the advertising slogan) is centered directly below the visual node (the main product image), side text nodes sit adjacent to the visual node on its left and right, and contextual text nodes are arranged vertically along the edge of the visual node, which establishes the connection between text and visual features. A deep graph convolutional network containing multiple graph convolutional layers and a classification layer is constructed, and the heterogeneous relationship graph is input into it. The graph convolutional layers propagate and aggregate the text and visual node features, updating the node features. The updated node features are then subjected to global graph pooling to obtain a global graph representation vector. Finally, the global graph representation vector is input into the classification layer to determine the illegal content category. For example, an advertising screen plays an illegal financial advertisement whose text contains false claims such as "guaranteed principal and high returns, risk-free investment" and whose images show fake returns. The combination of text and images constitutes illegal financial fraud advertising content, and the method thereby detects the illegal content played on the advertising screen.

[0007] Compared with existing technologies, the beneficial effects of this application are as follows: By constructing a heterogeneous relationship graph containing text node feature vectors and visual node feature vectors, and establishing connection edges between nodes based on the actual spatial position relationship between the text display area and the image display area in the advertising screen playback interface image, the deep graph convolutional network can simultaneously integrate text semantic information and visual semantic information during propagation and aggregation, while preserving the spatial layout correlation between the text display area and the image display area; by performing global graph pooling processing on the updated text node feature vectors and the updated visual node feature vectors, a global graph representation vector that can comprehensively reflect the text and visual content and their spatial relationship is obtained, and the global graph representation vector is input into the classification layer for illegal content category determination, realizing the joint analysis and recognition of multimodal content in the advertising screen playback interface image; by propagating and aggregating the text node feature vectors and visual node feature vectors in the heterogeneous relationship graph, the deep graph convolutional network can capture the interaction features between text content and visual content, thereby considering both text semantic information and visual semantic information during the illegal content detection process, effectively improving the accuracy of illegal content recognition.

[0008] Furthermore, S1 includes the following steps: S11. The image of the advertising screen playback interface is converted to grayscale according to the color space conversion model to obtain a grayscale image, and the edge is extracted from the grayscale image according to the Canny edge detection operator to obtain an edge intensity map. S12. Divide the edge intensity map into regions according to the connected component labeling algorithm to obtain candidate connected component masks, and filter the candidate connected component masks to obtain text candidate regions and image candidate regions. S13. The text candidate region and the image candidate region are cropped at the pixel level according to the mask generation algorithm to obtain the text display area image and the image display area image.

[0009] In this solution, grayscale processing is performed through a color space conversion model, edge intensity is extracted using the Canny edge detection operator, and then candidate regions are divided and filtered by a connected component labeling algorithm. Finally, a mask generation algorithm is used to achieve pixel-level cropping, which can accurately separate the text display area and image display area of the advertising screen playback interface, improve the accuracy and refinement of interface area division, and achieve pixel-level precise segmentation.

[0010] Furthermore, S2 includes the following steps: S21. The text display area image is binarized and segmented according to the adaptive binarization algorithm to obtain a binarized text image, and the binarized text image is segmented into characters based on the vertical projection segmentation method to obtain a sequence of single character images. S22. Based on a pre-trained convolutional recurrent neural network, sequence recognition is performed on the single character image sequence to obtain a text content sequence. Multi-layer convolutional feature extraction is performed on the image display area to obtain a multi-scale feature map. The multi-scale feature map is then weighted and fused according to the channel attention mechanism to obtain a visual feature map.

[0011] In this scheme, an adaptive binarization algorithm and a vertical projection segmentation method are used to accurately segment text regions and divide characters, ensuring the integrity of individual characters. A pre-trained convolutional recurrent neural network is used to achieve high-precision sequence recognition. At the same time, multi-layer convolutional feature extraction is combined with a channel attention mechanism that fuses multi-scale visual features by weighting, which improves feature expression ability and thus enhances the accuracy of content recognition.
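The binarization and character splitting of S21 can be sketched as follows. This is a minimal illustration rather than the patented implementation: it assumes dark text on a light background, uses the mean gray level of each block as that block's threshold, and the names `block_mean_binarize` and `vertical_projection_split` are hypothetical.

```python
import numpy as np

def block_mean_binarize(gray, block=16):
    """Binarize with one threshold per block: the block's mean gray level.
    Pixels darker than the local mean are marked as text (1)."""
    h, w = gray.shape
    out = np.zeros((h, w), dtype=np.uint8)
    for y in range(0, h, block):
        for x in range(0, w, block):
            tile = gray[y:y + block, x:x + block]
            out[y:y + block, x:x + block] = tile < tile.mean()
    return out

def vertical_projection_split(binary):
    """Return (start, end) column ranges of characters: maximal runs of
    columns that contain at least one text pixel; end is exclusive."""
    ink = binary.sum(axis=0)          # text pixels per column
    spans, start = [], None
    for x, count in enumerate(ink):
        if count > 0 and start is None:
            start = x                 # a character begins
        elif count == 0 and start is not None:
            spans.append((start, x))  # a blank column ends it
            start = None
    if start is not None:
        spans.append((start, len(ink)))
    return spans

# Toy 16x16 strip: light background (200) with two dark strokes (20).
gray = np.full((16, 16), 200.0)
gray[4:12, 2:5] = 20.0
gray[4:12, 9:13] = 20.0
binary = block_mean_binarize(gray)
spans = vertical_projection_split(binary)
```

Each span can then be cropped from the binarized image to form the single character image sequence passed to the recognizer. Characters that touch each other would need an additional splitting rule that this sketch does not cover.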

[0012] Furthermore, in S22, the mathematical expression for the channel attention mechanism is:

$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_c(i, j), \qquad s_c = \Big[\sigma\big(\mathbf{W}_2\, \delta(\mathbf{W}_1 \mathbf{z})\big)\Big]_c$$

In the formula, $s_c$ indicates the attention weight of the $c$-th channel, $F_c(i, j)$ represents the value of the multi-scale feature map at the $c$-th channel and spatial location $(i, j)$, $\mathbf{z}$ is the vector of per-channel averages $z_c$, $H$ indicates the height of the feature map, $W$ indicates the width of the feature map, $\mathbf{W}_1$ represents the weight matrix of the first fully connected layer, $\mathbf{W}_2$ represents the weight matrix of the second fully connected layer, $\sigma$ represents the Sigmoid activation function, and $\delta$ represents the ReLU activation function.

[0013] In this scheme, by introducing a channel attention mechanism, and with the synergistic effect of two fully connected layers and ReLU and Sigmoid activation functions, the channel dimensions of multi-scale feature maps are weighted and learned. This can accurately enhance the expressive power of important feature channels, effectively suppress interference information from irrelevant channels, and improve the discriminativeness and robustness of feature extraction.
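The channel attention described above (global average per channel, two fully connected layers, ReLU then Sigmoid) can be sketched in NumPy. The reduction ratio `r`, the absence of bias terms, the random weights, and the function name `channel_attention` are assumptions made for illustration. Applying the weights by rescaling each channel is one plausible reading of "weighted fusion".

```python
import numpy as np

def channel_attention(feat, w1, w2):
    """feat: feature map of shape (C, H, W).
    Returns the per-channel weights and the reweighted feature map."""
    z = feat.mean(axis=(1, 2))                 # average over H x W per channel
    hidden = np.maximum(w1 @ z, 0.0)           # first fully connected layer + ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # second fully connected layer + Sigmoid
    return s, feat * s[:, None, None]          # scale every channel by its weight

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2                        # r: assumed channel reduction ratio
feat = rng.normal(size=(C, H, W))
w1 = rng.normal(size=(C // r, C))              # stands in for trained weights
w2 = rng.normal(size=(C, C // r))
s, out = channel_attention(feat, w1, w2)
```

Because the Sigmoid keeps every weight between 0 and 1, channels are attenuated rather than amplified; channels with weights near 1 pass through almost unchanged.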

[0014] Furthermore, S3 includes the following steps: S31. Perform word embedding encoding on each text term in the text content sequence to obtain an initial text node feature vector; S32. The visual feature map is divided into grids to obtain multiple visual feature map blocks, and each visual feature map block is subjected to global average pooling to obtain the initial visual node feature vector corresponding to each visual feature map block. S33. Construct a spatial proximity matrix between text nodes and visual nodes based on the Euclidean distance between the center point coordinates of the text display area and the center point coordinates of the image display area in the advertising screen playback interface image; S34. Based on the spatial proximity matrix, establish connection edges between each text node and visual nodes that exceed a preset distance threshold, and initialize the weight values of the connection edges to obtain the heterogeneous relationship graph.

[0015] In this scheme, word embedding encoding is performed on text entries, and visual feature maps are divided into blocks and global average pooling is performed to obtain corresponding initial node feature vectors. Then, a spatial proximity matrix is constructed by combining the spatial positional relationship between text and visual display areas. Connection edges between text and visual nodes are established according to preset distance thresholds and weights are initialized. This can accurately construct heterogeneous relationship graphs, fully integrate the feature information and spatial associations of text and vision, and improve the accuracy of heterogeneous information modeling.
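Steps S33 and S34 can be illustrated with a small sketch. The text does not give the proximity function, so this example assumes proximity = 1 / (1 + Euclidean distance), keeps an edge when the proximity exceeds a threshold, and uses the proximity as the initial edge weight. The function name `build_hetero_adjacency` and the coordinates are hypothetical.

```python
import numpy as np

def build_hetero_adjacency(text_centers, visual_centers, min_proximity):
    """Adjacency matrix over [text nodes..., visual nodes...].
    Only text-visual pairs are connected; weights are proximity scores."""
    t = np.asarray(text_centers, dtype=float)      # (T, 2) region center points
    v = np.asarray(visual_centers, dtype=float)    # (V, 2) region center points
    dist = np.linalg.norm(t[:, None, :] - v[None, :, :], axis=-1)  # (T, V)
    proximity = 1.0 / (1.0 + dist)                 # assumed proximity function
    cross = np.where(proximity > min_proximity, proximity, 0.0)
    n_t, n_v = len(t), len(v)
    adj = np.zeros((n_t + n_v, n_t + n_v))
    adj[:n_t, n_t:] = cross                        # text -> visual edges
    adj[n_t:, :n_t] = cross.T                      # visual -> text edges
    return adj

# Two text regions and one image region (pixel coordinates of the centers).
adj = build_hetero_adjacency([(100, 300), (900, 50)], [(100, 100)],
                             min_proximity=0.002)
```

In this toy layout the slogan 200 pixels below the image is linked to it, while the text about 800 pixels away is not.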

[0016] Furthermore, S4 includes the following steps: S41. Input the text node feature vectors and visual node feature vectors in the heterogeneous relationship graph into the first graph convolutional layer of the deep graph convolutional network, and sample and aggregate the neighbor node features of each node according to the adjacency matrix to obtain the node feature vectors updated by the first graph convolutional layer. S42. Input the node feature vectors updated by the first graph convolutional layer into the second graph convolutional layer, and perform weighted aggregation processing on the neighbor node features according to the adjacency matrix and attention coefficients to obtain the node feature vectors updated by the second graph convolutional layer. S43. Input the node feature vectors updated by the second graph convolutional layer into the residual connection layer for feature enhancement processing to obtain the enhanced node feature vectors. S44. Input the enhanced node feature vectors into the layer normalization unit for normalization processing to obtain the updated text node feature vectors and the updated visual node feature vectors.

[0017] In this scheme, text and visual node features from the heterogeneous relationship graph are input into a deep graph convolutional network. The features of neighboring nodes are sampled and aggregated by the first graph convolutional layer, and the features are weighted and aggregated by the attention coefficient in the second graph convolutional layer. Then, the features are enhanced by the residual connection layer and normalized by the layer normalization unit. This process can optimize the node feature representation layer by layer and enhance the multimodal feature fusion effect and stability.

[0018] Furthermore, in S42, the mathematical expression for the weighted aggregation process is:

$$\mathbf{h}_i^{(2)} = \sigma\Big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, \mathbf{W}^{(1)} \mathbf{h}_j^{(1)}\Big)$$

In the formula, $\mathbf{h}_i^{(2)}$ represents the feature vector of node $i$ updated by the second graph convolutional layer, $\sigma$ represents the activation function, $\mathcal{N}(i)$ represents the set of neighboring nodes of node $i$, $j$ indicates the neighbor node index, $\alpha_{ij}$ represents the attention coefficient between node $i$ and neighboring node $j$, $\mathbf{W}^{(1)}$ represents the learnable parameter matrix of the first convolutional layer, and $\mathbf{h}_j^{(1)}$ represents the feature vector of neighboring node $j$ updated by the first graph convolutional layer.

[0019] In this scheme, by employing weighted aggregation processing with attention coefficients in graph convolution feature update, and combining activation functions and learnable parameter matrices to accurately aggregate neighbor node features, the accuracy and relevance of node feature update can be improved, the effectiveness of feature transfer between graph convolution layers can be enhanced, and the node feature representation capability can be optimized.
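Steps S41 to S44 can be sketched as one forward pass. Several details are assumptions, since the text does not fix them: the first layer averages over all neighbors instead of sampling, attention coefficients are a softmax over dot-product scores between connected nodes, the activation is ReLU, the residual branch adds the first-layer output, and every node has at least one edge. Node features are stored as rows, so `h @ w` plays the role of the parameter matrix applied to each node vector.

```python
import numpy as np

def mean_aggregate(h, adj, w):
    """S41 (simplified): average each node with its neighbors, project, ReLU."""
    a = (adj > 0).astype(float) + np.eye(len(adj))   # add self-loops
    a /= a.sum(axis=1, keepdims=True)                # row-normalize
    return np.maximum(a @ h @ w, 0.0)

def attention_aggregate(h, adj, w):
    """S42: weighted sum of projected neighbor features, then ReLU."""
    scores = h @ h.T                                 # assumed scoring function
    scores = np.where(adj > 0, scores, -np.inf)      # mask non-neighbors
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)        # coefficients sum to 1 per node
    return np.maximum(alpha @ (h @ w), 0.0), alpha

def residual_layer_norm(h_new, h_skip, eps=1e-5):
    """S43 + S44: add the skip branch, then normalize each node vector."""
    x = h_new + h_skip
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
# Nodes 0 and 1 are text nodes, node 2 is a visual node linked to both.
adj = np.array([[0, 0, 1], [0, 0, 1], [1, 1, 0]], dtype=float)
h0 = rng.normal(size=(3, 4))
w_first = rng.normal(size=(4, 4))                    # stands in for trained weights
w_agg = rng.normal(size=(4, 4))
h1 = mean_aggregate(h0, adj, w_first)
h2, alpha = attention_aggregate(h1, adj, w_agg)
out = residual_layer_norm(h2, h1)
```

In this toy graph each text node has a single neighbor, so its attention coefficient for the visual node is 1, while the visual node splits its attention between the two text nodes.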

[0020] Furthermore, S5 includes the following steps: S51. Concatenate all the updated text node feature vectors to obtain the text node feature matrix, and concatenate all the updated visual node feature vectors to obtain the visual node feature matrix. S52. Perform horizontal concatenation on the text node feature matrix and the visual node feature matrix to obtain the full node feature matrix, and perform global max pooling on the full node feature matrix to obtain the global max pooling vector. S53. Perform global average pooling on the full node feature matrix to obtain a global average pooling vector, and concatenate the global max pooling vector with the global average pooling vector to obtain a global graph representation vector.

[0021] In this scheme, text and visual node features are concatenated into matrices and then merged horizontally. Global max pooling and global average pooling are then performed on the full node feature matrix, and the two pooling results are concatenated to obtain a global graph representation vector. This approach can capture both global salient features and overall distribution information, thereby more comprehensively integrating multimodal semantics and enhancing the richness and robustness of graph representation.
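The pooling in S51 to S53 reduces to a few array operations. This sketch assumes node feature vectors are stored as rows, so the text and visual matrices are stacked along the node axis before pooling; the function name `global_graph_vector` and the numbers are illustrative.

```python
import numpy as np

def global_graph_vector(text_nodes, visual_nodes):
    """Concatenate global max pooling and global average pooling over all nodes."""
    nodes = np.concatenate([text_nodes, visual_nodes], axis=0)       # (N, d)
    return np.concatenate([nodes.max(axis=0), nodes.mean(axis=0)])   # (2d,)

text_nodes = np.array([[1.0, 4.0], [3.0, 0.0]])   # two text nodes, d = 2
visual_nodes = np.array([[2.0, 2.0]])             # one visual node
g = global_graph_vector(text_nodes, visual_nodes)
```

The result has twice the node feature dimension regardless of how many nodes the graph contains, which is what allows a fixed-size classification layer to follow.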

[0022] Furthermore, S6 includes the following steps: S61. Input the global graph representation vector into the first fully connected unit of the classification layer for linear transformation processing to obtain the first transformation vector, and perform random deactivation (dropout) processing on the first transformation vector to obtain the regularized feature vector. S62. Input the regularized feature vector into the second fully connected unit for dimensionality compression processing to obtain a category score vector whose dimension equals the number of illegal content categories. S63. Input the category score vector into the normalized exponential function (Softmax) unit for probability mapping processing to obtain the predicted probability value of each illegal content category; S64. Based on the predicted probability value of each illegal content category, select the category with the highest probability value as the illegal content detection and recognition result corresponding to the advertising screen playback interface image.
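The classification layer of S61 to S64 can be sketched as below. The layer sizes, the number of categories (3), the random weights, and the name `classify` are assumptions for illustration. Dropout is applied only when a rate and a random generator are supplied, as would be done during training.

```python
import numpy as np

def classify(g, w1, b1, w2, b2, drop_p=0.0, rng=None):
    """Return (category probabilities, index of the most probable category)."""
    hidden = g @ w1 + b1                          # S61: first fully connected unit
    if drop_p > 0.0 and rng is not None:          # S61: dropout, training only
        keep = rng.random(hidden.shape) >= drop_p
        hidden = hidden * keep / (1.0 - drop_p)   # rescale the kept activations
    scores = hidden @ w2 + b2                     # S62: one score per category
    exp = np.exp(scores - scores.max())           # S63: numerically stable Softmax
    probs = exp / exp.sum()
    return probs, int(np.argmax(probs))           # S64: highest probability wins

rng = np.random.default_rng(0)
g = rng.normal(size=8)                            # global graph representation vector
w1, b1 = rng.normal(size=(8, 16)), np.zeros(16)   # stand in for trained parameters
w2, b2 = rng.normal(size=(16, 3)), np.zeros(3)
probs, label = classify(g, w1, b1, w2, b2)
```

At inference time dropout is disabled, so the same input always yields the same category.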

[0023] In this scheme, by sequentially performing linear transformation and random deactivation regularization on the global graph representation vector through the first fully connected unit, and then dimensional compression by the second fully connected unit, and combining the normalized exponential function to complete the probability mapping and select the class with the highest probability, the feature expression effect can be improved, the risk of model overfitting can be reduced, and the illegal content category can be accurately output, making the detection and recognition of illegal content in the advertising screen playback interface image more accurate.

Attached Figure Description

[0024] Figure 1 is a flowchart of a method for detecting and identifying illegal content on advertising screens based on a deep graph convolutional network, according to an embodiment of the present invention.

Detailed Implementation

[0025] The specific implementation is described in detail below. As shown in Figure 1, a method for detecting and identifying illegal content on advertising screens based on a deep graph convolutional network according to an embodiment of the present invention includes the following steps:
S1. Obtain the image of the advertising screen playback interface, and perform text region segmentation and image region segmentation on it to obtain the text display area image and the image display area image.
S2. Perform optical character recognition processing on the text display area image to obtain a text content sequence, and extract features from the image display area image according to the visual feature extraction model to obtain a visual feature map.
S3. Process the text content sequence by word vector embedding to obtain text node feature vectors, and process the visual feature map by region division and feature aggregation to obtain visual node feature vectors. Construct a heterogeneous relationship graph based on the text node feature vectors and the visual node feature vectors. The connection edges between text nodes and visual nodes in the heterogeneous relationship graph are established through the spatial positional relationship between the text display area and the image display area in the advertising screen playback interface image.
S4. Construct a deep graph convolutional network containing multiple graph convolutional layers and a classification layer. Input the heterogeneous relationship graph into the deep graph convolutional network. Propagate and aggregate the text node feature vectors and visual node feature vectors through the graph convolutional layers to obtain updated text node feature vectors and updated visual node feature vectors.
S5. Perform global graph pooling on the updated text node feature vectors and the updated visual node feature vectors to obtain the global graph representation vector.
S6. Input the global graph representation vector into the classification layer to determine the category of illegal content, and obtain the illegal content detection and recognition result corresponding to the advertising screen playback interface image.

[0026] Specifically, S1 includes the following steps: S11. The image of the advertising screen playback interface is converted to grayscale according to the color space conversion model to obtain a grayscale image, and the edge is extracted from the grayscale image according to the Canny edge detection operator to obtain an edge intensity map. S12. Divide the edge intensity map into regions according to the connected component labeling algorithm to obtain candidate connected component masks, and filter the candidate connected component masks to obtain text candidate regions and image candidate regions. S13. The text candidate region and the image candidate region are cropped at the pixel level according to the mask generation algorithm to obtain the text display area image and the image display area image.

[0027] In this embodiment, the color space conversion model is determined based on the core requirement of grayscale processing, and an RGB-to-grayscale model is selected in conjunction with the color characteristics of the advertising screen playback interface image. The connected component labeling algorithm is determined based on the region division requirements of the edge intensity map, and an 8-neighborhood-based connected component labeling algorithm is selected. The mask generation algorithm is determined around the pixel-level cropping requirements of the text candidate region and the image candidate region, and a binary mask generation algorithm is selected. This binary mask generation algorithm can generate a binary mask with the same size as the advertising screen playback interface image based on the pixel coordinate range of the text candidate region and the image candidate region. Within the mask, the pixel values corresponding to the candidate regions are set to 1, and the non-candidate regions are set to 0, which can quickly obtain the text display area image and the image display area image.

[0028] The process acquires an image of the advertising screen's playback interface and converts it to grayscale using an RGB-to-grayscale model: a weighted summation formula converts the three RGB channel values of each pixel to a single-channel grayscale value, and repeating this conversion for all pixels yields the grayscale image. The Canny edge detection operator then extracts edges from the grayscale image. First, Gaussian filtering is applied to remove noise. Then, the gradient magnitude and direction of each pixel are calculated. Non-maximum suppression is applied to the gradient image to refine the edges, resulting in an edge intensity map in which higher pixel values indicate a more pronounced edge at that location. An 8-neighborhood-based connected component labeling algorithm divides the edge intensity map into regions. First, each pixel in the edge intensity map is traversed and an edge pixel threshold is set; pixels with values higher than the edge pixel threshold are considered edge pixels. Then, adjacent edge pixels are labeled using an 8-neighborhood scanning method, and connected regions are labeled with the same number to generate candidate connected component masks. The candidate connected component masks are then filtered. Based on the regional features of text and images, a connected component area threshold and a connected component aspect ratio threshold are set. Connected components with too small an area or an abnormal aspect ratio are removed, and connected components that match the features of text and of images are retained as candidate text regions and candidate image regions, respectively. A binary mask generation algorithm is then used to process the text and image candidate regions.
First, the pixel coordinate ranges of the text and image candidate regions are obtained, and a binary mask with the same size as the advertising screen playback interface image is generated. The mask pixels corresponding to the text and image candidate regions are set to 1, while the remaining pixels are set to 0. This yields the text candidate region mask and the image candidate region mask. The two masks are then superimposed pixel by pixel on the original advertising screen playback interface image, retaining pixels whose mask value is 1 and discarding pixels whose mask value is 0. After this pixel-wise cropping, the text display area image and the image display area image are obtained. The edge pixel threshold is used to select true edge pixels from the edge intensity map. In Canny edge detection, it usually corresponds to the lower of the two hysteresis thresholds. For images with grayscale values ranging from 0 to 255, a common empirical value is 50 to 100. An adaptive method (such as the Otsu algorithm) can also be used to determine the threshold automatically, or it can be set according to the statistical characteristics of the gradient magnitude (such as a multiple of the average gradient magnitude). The connected component area threshold is used to remove noise regions that are too small. Its specific value depends on the image size and the minimum target size. For example, for a 1080p resolution advertising screen image, a text region may contain at least tens to hundreds of pixels, so the lower limit of the area threshold can be set to 30 to 50 pixels.
If the image resolution is low, the threshold is reduced accordingly. For the aspect ratio threshold of connected components, text regions usually have a specific aspect ratio (e.g., a single character may be narrow and tall, while a line of text may be wide and flat), while image regions may be close to a square or have any proportion. For example, the aspect ratio of text regions is between 0.2 and 5, and anything outside this range is considered abnormal; for image regions the range can be relaxed to 0.1 to 10 or wider, depending on the actual layout.
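The region-splitting steps of this paragraph can be illustrated with a minimal sketch in plain Python. It assumes the edge intensity map has already been produced by an edge detector, and it only shows the weighted grayscale conversion, the 8-neighborhood labeling, and the area and aspect ratio filtering. The function names, the grayscale weights (0.299, 0.587, 0.114), the use of pixel count as "area", and width divided by height as "aspect ratio" are illustrative assumptions that the text does not specify.

```python
from collections import deque


def to_gray(rgb):
    """Weighted-sum grayscale conversion; the weights are an assumed common choice."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in rgb]


def label_components(edge_map, edge_threshold):
    """Label 8-connected groups of pixels whose value exceeds edge_threshold.

    Returns (labels, boxes), where boxes maps each label to
    [min_x, min_y, max_x, max_y, pixel_count].
    """
    h, w = len(edge_map), len(edge_map[0])
    labels = [[0] * w for _ in range(h)]
    boxes = {}
    next_label = 0
    for y in range(h):
        for x in range(w):
            if edge_map[y][x] <= edge_threshold or labels[y][x]:
                continue
            next_label += 1
            labels[y][x] = next_label
            box = [x, y, x, y, 0]
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                box[0], box[1] = min(box[0], cx), min(box[1], cy)
                box[2], box[3] = max(box[2], cx), max(box[3], cy)
                box[4] += 1
                # Visit the 8 neighbours of the current pixel.
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w and not labels[ny][nx]
                                and edge_map[ny][nx] > edge_threshold):
                            labels[ny][nx] = next_label
                            queue.append((ny, nx))
            boxes[next_label] = box
    return labels, boxes


def filter_components(boxes, min_area=30, min_ratio=0.2, max_ratio=5.0):
    """Keep components that pass the area and aspect ratio thresholds."""
    kept = {}
    for label, (x0, y0, x1, y1, area) in boxes.items():
        ratio = (x1 - x0 + 1) / (y1 - y0 + 1)  # width / height (assumed definition)
        if area >= min_area and min_ratio <= ratio <= max_ratio:
            kept[label] = (x0, y0, x1, y1)
    return kept
```

The default thresholds mirror the example values given for text regions (area of at least 30 pixels, aspect ratio between 0.2 and 5); for image regions the ratio bounds would be relaxed as described.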

[0029] Specifically, S2 includes the following steps: S21. The text display area image is binarized and segmented according to the adaptive binarization algorithm to obtain a binarized text image, and the binarized text image is segmented into characters based on the vertical projection segmentation method to obtain a sequence of single character images. S22. Based on a pre-trained convolutional recurrent neural network, sequence recognition is performed on the single character image sequence to obtain a text content sequence. Multi-layer convolutional feature extraction is performed on the image display area to obtain a multi-scale feature map. The multi-scale feature map is then weighted and fused according to the channel attention mechanism to obtain a visual feature map.

[0030] Specifically, in S22, the mathematical expression for the channel attention mechanism is:

$$w_c = \sigma\left(W_2\,\delta\left(W_1 \cdot \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} F_c(i,j)\right)\right)$$

In the formula, $w_c$ indicates the attention weight of the $c$-th channel; $F_c(i,j)$ represents the value of the multi-scale feature map at the $c$-th channel and spatial location $(i,j)$; $H$ indicates the height of the feature map; $W$ indicates the width of the feature map; $W_1$ represents the weight matrix of the first fully connected layer; $W_2$ represents the weight matrix of the second fully connected layer; $\sigma$ represents the Sigmoid activation function; and $\delta$ represents the ReLU activation function.
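As an illustration of the channel attention mechanism in S22, the following minimal sketch averages each channel over its spatial positions, passes the resulting vector through two fully connected layers, and rescales the channels by the resulting weights. The ordering (first layer, ReLU, second layer, Sigmoid) follows the common squeeze-and-excitation pattern and is an assumption, as are the function name and the nested-list data layout.

```python
import math


def channel_attention(feature_map, w1, w2):
    """Compute per-channel attention weights and the reweighted feature map.

    feature_map: list of C channels, each an H x W nested list.
    w1: weight matrix of the first fully connected layer (rows = hidden units).
    w2: weight matrix of the second fully connected layer (rows = C channels).
    """
    # Global average over the H x W positions of each channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    # First fully connected layer followed by ReLU.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, pooled))) for row in w1]
    # Second fully connected layer followed by Sigmoid gives weights in (0, 1).
    weights = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, hidden))))
               for row in w2]
    # Scale every value of channel c by its attention weight.
    scaled = [[[weights[c] * v for v in row] for row in ch]
              for c, ch in enumerate(feature_map)]
    return weights, scaled
```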

[0031] In this embodiment, the determination of the adaptive binarization algorithm needs to be combined with the grayscale distribution characteristics of the text display area image. The core is to dynamically adjust the binarization threshold to adapt to the local differences in the image. Specifically, the text display area image is first grayscale processed, and then the image is divided into multiple uniformly sized local sub-blocks (the sub-block size is preset to 16×16 pixels). The grayscale mean of each sub-block is calculated as the binarization threshold of that sub-block. The value range of the binarization threshold is controlled between 0 and 255 to ensure that the grayscale difference between the text area and the background area in each sub-block is maximized. For pixels at the edge of the sub-block, the weighted average of the thresholds of adjacent sub-blocks is used for supplementation. Finally, the text display area image is binarized pixel by pixel through dynamic thresholding to obtain a binarized text image. The pre-trained convolutional recurrent neural network is determined based on the requirement of recognizing single character image sequences. The network structure consists of convolutional layers, recurrent layers, and an output layer. The pre-training process requires the use of a standard dataset containing various common characters (such as the MNIST character dataset). The standard dataset contains labeled single character image samples, covering characters with different fonts, sizes, and slight distortions. During training, the spatial features of a single character image are first extracted through the convolutional layer, and then the temporal correlation of the single character image sequence is captured through the recurrent layer (using an LSTM structure). The output layer uses the Softmax activation function to output the character category probability. 
The pre-training is set to 100 iterations, with a preset learning rate of 0.001. The network parameters are optimized through backpropagation until the recognition accuracy reaches more than 99%, thus completing the pre-training.

[0032] An adaptive binarization algorithm is used to process the text display area image. The image is divided into 16×16 pixel local sub-blocks, and the grayscale mean of each sub-block is calculated as the binarization threshold (in the range 0 to 255) for that sub-block. The pixels in each sub-block are then binarized: pixels with grayscale values higher than the binarization threshold are set to 255 (background), and pixels with grayscale values lower than the binarization threshold are set to 0 (text), resulting in a binarized text image. Then, based on the vertical projection segmentation method, the vertical projection value of the binarized text image is calculated. The preset projection threshold is 5; where the vertical projection value is lower than the preset projection threshold, the position is determined to be a character interval. Based on this, the binarized text image is segmented vertically, and the region corresponding to each character is extracted in turn to obtain an ordered sequence of single character images. This sequence is input into the pre-trained convolutional recurrent neural network. The spatial features of each single character image are extracted through the network's convolutional layers, and the temporal relationship of the single character image sequence is captured through the recurrent layers. The output layer outputs the corresponding character category, and the characters are concatenated in order to obtain the text content sequence. At the same time, multi-layer convolutional feature extraction is performed on the image display area. Three convolutional layers are used, each with a preset kernel size of 3×3, a stride of 1, and a padding of 1. Feature information at different scales is extracted sequentially to obtain a multi-scale feature map.
Then, according to the channel attention mechanism, feature statistics and weighted calculations are performed on each channel of the multi-scale feature map. The channel features are processed through two fully connected layers, and the weights are adjusted by combining the Sigmoid activation function and the ReLU activation function. The channels of the multi-scale feature map are then weighted and fused to finally obtain the visual feature map.
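The binarization and character segmentation steps of this paragraph can be sketched as follows. This is a simplified illustration: the blending of thresholds at sub-block edges described in [0031] is omitted, pixels exactly equal to the block mean are treated as text (the text does not say how ties are handled), and the vertical projection value is assumed to be the number of text pixels in each column. Function names are hypothetical.

```python
def binarize_by_block_mean(gray, block=16):
    """Binarize each block x block sub-block against its own grayscale mean."""
    h, w = len(gray), len(gray[0])
    out = [[0] * w for _ in range(h)]
    for by in range(0, h, block):
        for bx in range(0, w, block):
            ys = range(by, min(by + block, h))
            xs = range(bx, min(bx + block, w))
            mean = sum(gray[y][x] for y in ys for x in xs) / (len(ys) * len(xs))
            for y in ys:
                for x in xs:
                    # Above the threshold: background (255); otherwise text (0).
                    out[y][x] = 255 if gray[y][x] > mean else 0
    return out


def split_characters(binary, projection_threshold=5):
    """Return (start, end) column spans of characters, end exclusive.

    A column whose count of text pixels (value 0) is below the threshold
    is treated as a gap between characters.
    """
    h, w = len(binary), len(binary[0])
    projection = [sum(1 for y in range(h) if binary[y][x] == 0) for x in range(w)]
    spans, start = [], None
    for x, value in enumerate(projection):
        if value >= projection_threshold and start is None:
            start = x
        elif value < projection_threshold and start is not None:
            spans.append((start, x))
            start = None
    if start is not None:
        spans.append((start, w))
    return spans
```

Each returned span can then be used to crop one single character image from the binarized text image, preserving left-to-right order.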

[0033] Specifically, S3 includes the following steps: S31. Perform word embedding encoding on each text term in the text content sequence to obtain an initial text node feature vector; S32. The visual feature map is divided into grids to obtain multiple visual feature map blocks, and each visual feature map block is subjected to global average pooling to obtain the initial visual node feature vector corresponding to each visual feature map block. S33. Construct a spatial proximity matrix between text nodes and visual nodes based on the Euclidean distance between the center point coordinates of the text display area and the center point coordinates of the image display area in the advertising screen playback interface image; S34. Based on the spatial proximity matrix, establish connection edges between each text node and visual nodes that exceed a preset distance threshold, and initialize the weight values of the connection edges to obtain the heterogeneous relationship graph.

[0034] In this embodiment, the visual feature map is divided into grids with a preset grid size of 16×16 pixels, so that it is evenly divided into multiple visual feature map blocks of the same size. Global average pooling is then performed on each visual feature map block to calculate the feature mean of all pixels in the block; this feature mean vector is the initial visual node feature vector of the corresponding block. Next, the center coordinates of the text display area and of the image display area are obtained. Both are determined in the pixel coordinate system of the advertising screen playback interface image, with the upper left corner of the image as the origin, the horizontal axis as the x-axis, and the vertical axis as the y-axis. The Euclidean distance between the two center coordinates is then calculated, and a spatial proximity matrix between text nodes and visual nodes is constructed from it. The number of rows in the spatial proximity matrix is the same as the number of text nodes, and the number of columns is the same as the number of visual nodes. Each element of the matrix is the Euclidean distance between a text node and a visual node; the smaller the value, the higher the spatial proximity between the two, so that the matrix reflects the spatial relationship between text nodes and visual nodes. The spatial proximity matrix and a preset distance threshold are then obtained. The preset distance threshold is set to 50 pixels.
This preset distance threshold is set based on the size of the advertising screen playback interface image (preset to 1920×1080 pixels) so that it can distinguish between neighboring and non-neighboring nodes. Each element in the spatial proximity matrix is then traversed to determine whether the Euclidean distance between each text node and each visual node exceeds the preset distance threshold. If it does, a connection edge is established between the text node and the visual node, and the weight value of the connection edge is initialized. The preset initial weight value is 0.1, within the range 0 to 1. Finally, all text nodes, visual nodes, connection edges, and initial weight values constitute the heterogeneous relationship graph, which represents the relationship between text nodes and visual nodes.
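The graph construction in S32 to S34 can be sketched as below. The sketch assumes that each text node and each visual node has its own center coordinate, which is one possible reading of the description. The text states that an edge is created when the distance exceeds the threshold; the comparison is isolated behind the hypothetical `connect_when_exceeds` flag so that the alternative reading (connect nodes within the threshold) can be tried as well. All function names are illustrative.

```python
import math


def grid_pool(feature_map, grid=16):
    """Average each grid x grid block; returns one C-dimensional vector per block.

    feature_map: list of C channels, each an H x W nested list.
    Blocks are visited in row-major order.
    """
    h, w = len(feature_map[0]), len(feature_map[0][0])
    nodes = []
    for gy in range(0, h, grid):
        for gx in range(0, w, grid):
            ys = range(gy, min(gy + grid, h))
            xs = range(gx, min(gx + grid, w))
            n = len(ys) * len(xs)
            nodes.append([sum(ch[y][x] for y in ys for x in xs) / n
                          for ch in feature_map])
    return nodes


def proximity_matrix(text_centers, visual_centers):
    """Rows are text nodes, columns are visual nodes, entries are Euclidean distances."""
    return [[math.dist(t, v) for v in visual_centers] for t in text_centers]


def build_edges(distances, threshold=50.0, init_weight=0.1, connect_when_exceeds=True):
    """Return {(text_index, visual_index): weight} for every connected pair."""
    edges = {}
    for i, row in enumerate(distances):
        for j, d in enumerate(row):
            linked = d > threshold if connect_when_exceeds else d <= threshold
            if linked:
                edges[(i, j)] = init_weight
    return edges
```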

[0035] Specifically, S4 includes the following steps: S41. Input the text node feature vector and visual node feature vector in the heterogeneous relation graph into the first graph convolutional layer of the deep graph convolutional network, and sample and aggregate the neighbor node features of each node according to the adjacency matrix to obtain the node feature vector updated by the first graph convolutional layer. S42. Input the updated node feature vector of the first graph convolutional layer into the second graph convolutional layer, and perform weighted aggregation processing on the neighbor node features according to the adjacency matrix and attention coefficient to obtain the updated node feature vector of the second graph convolutional layer. S43. Input the updated node feature vectors from the second graph convolutional layer into the residual connection layer for feature enhancement processing to obtain the enhanced node feature vectors. S44. Input the enhanced node feature vector into the layer normalization unit for normalization processing to obtain the updated text node feature vector and the updated visual node feature vector.

[0036] Specifically, in S42, the mathematical expression for the weighted aggregation process is:

$$h_i^{(2)} = \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, W^{(1)}\, h_j^{(1)}\right)$$

In the formula, $h_i^{(2)}$ represents the feature vector of node $i$ updated by the second graph convolutional layer; $\sigma$ represents the activation function; $\mathcal{N}(i)$ represents the set of neighboring nodes of node $i$; $j$ indicates the neighbor node index; $\alpha_{ij}$ represents the attention coefficient between node $i$ and neighboring node $j$; $W^{(1)}$ represents the learnable parameter matrix of the first convolutional layer; and $h_j^{(1)}$ represents the feature vector of neighboring node $j$ updated by the first graph convolutional layer.

[0037] In this embodiment, text node feature vectors and visual node feature vectors are obtained from the heterogeneous relation graph and input into the first graph convolutional layer of the deep graph convolutional network. First, the adjacency matrix of the first graph convolutional layer is determined. This matrix is constructed from the connection edges in the heterogeneous relation graph: each matrix element is the weight value of the corresponding connection edge, and elements at positions without a connection edge are set to 0. The features of neighboring nodes are then sampled according to the adjacency matrix, with the number of neighboring nodes sampled for each node preset to 10. After sampling, the neighboring node features are aggregated by averaging, and the aggregated features are fused with the node's own features, with the fusion weights preset to 0.6 and 0.4. This yields the node feature vector updated by the first graph convolutional layer, which contains the association information of neighboring nodes. The node feature vector updated by the first graph convolutional layer is then input into the second graph convolutional layer, together with the adjacency matrix of the heterogeneous relation graph and the preset attention coefficient. The attention coefficient has a value range of 0 to 1 and is calculated based on node feature similarity; the higher the similarity, the larger the attention coefficient. The set of neighboring nodes of each node is determined from the adjacency matrix, and the neighbor node features are weighted and aggregated using the attention coefficient. The learnable parameter matrix of the first convolutional layer is preset to 128×128 dimensions.
The weighted neighbor node features are then processed with the learnable parameter matrix, and a non-linear transformation is performed using an activation function (ReLU activation function is selected). Finally, the node feature vector updated by the second convolutional layer is obtained.
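The two graph convolutional layers described above can be sketched as follows, with the adjacency matrix reduced to per-node neighbor lists. Several details are assumptions made only for illustration: the fusion weight 0.6 is applied to the node's own features and 0.4 to the neighbor mean (the text does not say which is which), random sampling is used when a node has more than 10 neighbors, and the attention coefficients are obtained by a softmax over dot-product similarity, which is one way to get similarity-based values in the 0 to 1 range.

```python
import math
import random


def first_layer(features, neighbors, k=10, self_w=0.6, neigh_w=0.4, rng=None):
    """Sample up to k neighbours, average them, and fuse with the node itself."""
    rng = rng or random.Random(0)
    out = []
    for i, h in enumerate(features):
        nbrs = neighbors[i]
        if len(nbrs) > k:
            nbrs = rng.sample(nbrs, k)
        if not nbrs:
            out.append(list(h))  # isolated node: keep its own features
            continue
        mean = [sum(features[j][d] for j in nbrs) / len(nbrs) for d in range(len(h))]
        out.append([self_w * a + neigh_w * b for a, b in zip(h, mean)])
    return out


def attention(features, i, nbrs):
    """Softmax over dot-product similarity between node i and its neighbours."""
    scores = [sum(a * b for a, b in zip(features[i], features[j])) for j in nbrs]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def second_layer(features, neighbors, weight):
    """ReLU of the attention-weighted neighbour sum multiplied by the weight matrix."""
    out = []
    for i, h in enumerate(features):
        nbrs = neighbors[i]
        alphas = attention(features, i, nbrs) if nbrs else []
        agg = [sum(a * features[j][d] for a, j in zip(alphas, nbrs))
               for d in range(len(h))]
        out.append([max(0.0, sum(w * v for w, v in zip(row, agg))) for row in weight])
    return out
```

Because the weight matrix is applied linearly, multiplying it once by the weighted sum of neighbor features gives the same result as multiplying each neighbor vector before summing.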

[0038] The node feature vectors updated by the second graph convolutional layer are input into the residual connection layer for feature enhancement. The residual connection layer is preset to contain one fully connected layer whose weight matrix dimension matches the node feature vector dimension (128×128), with the bias term preset to 0. The node feature vectors updated by the second graph convolutional layer are input into the fully connected layer for linear transformation, and the transformed feature vectors are then residually connected to the original node feature vectors updated by the second graph convolutional layer (i.e., added element-wise). A preset regularization term (value 0.001) is added to prevent overfitting, resulting in the enhanced node feature vectors and improving the representational ability of the node features. The enhanced node feature vectors are then input into the layer normalization unit for normalization. The layer normalization unit is preset to normalize the feature vectors along their channel dimension, using a small preset constant to avoid a denominator of 0. The process first calculates the mean and variance of the enhanced node feature vector along the channel dimension, then subtracts the mean from each feature value and divides by the square root of the variance plus the small constant. Finally, the result is adjusted by the preset scaling factor (value 1.0) and offset factor (value 0.0), and the feature vectors corresponding to the text nodes and the visual nodes are separated to obtain the updated text node feature vectors and the updated visual node feature vectors, ensuring that the feature vector values are stable.
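A minimal sketch of the residual connection and layer normalization for a single node feature vector is given below. The value of the small stabilizing constant (`eps=1e-5`) and its placement inside the square root are assumptions, and the regularization term mentioned in the text is left out because it is not specified how it enters the computation. Function names are illustrative.

```python
import math


def residual_block(h, weight, bias=None):
    """Fully connected transform of h added element-wise to h itself."""
    bias = bias or [0.0] * len(weight)
    transformed = [sum(w * v for w, v in zip(row, h)) + b
                   for row, b in zip(weight, bias)]
    return [t + v for t, v in zip(transformed, h)]


def layer_norm(h, eps=1e-5, gamma=1.0, beta=0.0):
    """Normalize one feature vector along its channel dimension."""
    mean = sum(h) / len(h)
    var = sum((v - mean) ** 2 for v in h) / len(h)
    # eps keeps the denominator away from zero when all values are equal.
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in h]
```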

[0039] Specifically, S5 includes the following steps: S51. Concatenate all the updated text node feature vectors to obtain the text node feature matrix, and concatenate all the updated visual node feature vectors to obtain the visual node feature matrix. S52. Perform horizontal concatenation on the text node feature matrix and the visual node feature matrix to obtain the full node feature matrix, and perform global max pooling on the full node feature matrix to obtain the global max pooling vector. S53. Perform global average pooling on the full node feature matrix to obtain a global average pooling vector, and concatenate the global max pooling vector with the global average pooling vector to obtain a global graph representation vector.

[0040] In this embodiment, all updated text node feature vectors and all updated visual node feature vectors are obtained. Both types of feature vectors are 128-dimensional, so their dimensions are consistent. The updated text node feature vectors are concatenated vertically, in the original order of the text nodes, to form the text node feature matrix, whose number of rows is equal to the number of updated text node feature vectors and whose number of columns is 128. The same vertical concatenation is performed on all updated visual node feature vectors, in the original order of the visual nodes, to obtain the visual node feature matrix, whose number of rows is equal to the number of updated visual node feature vectors and whose number of columns is 128. The numbers of rows of the two feature matrices must be consistent; if they are not, a preset zero-padding rule pads the feature matrix with fewer rows to the same number of rows as the other feature matrix. The text node feature matrix and the visual node feature matrix are then concatenated horizontally, joining the columns of the visual node feature matrix with the columns of the text node feature matrix in sequence, to obtain the full node feature matrix, which has the same number of rows as the two feature matrices and 256 columns. Global max pooling is then performed on the full node feature matrix, with the preset pooling dimension being the channel dimension of the feature matrix.
All elements of each channel of the full node feature matrix are traversed and the maximum value of each channel is extracted; the vector composed of the maximum values of all channels is the global max pooling vector, with a dimension of 256. Global average pooling is performed on the full node feature matrix along the same channel dimension: all elements of each channel are traversed and their average value is calculated, and the vector composed of the average values of all channels is the global average pooling vector. Its dimension is 256, consistent with the dimension of the global max pooling vector. The global max pooling vector and the global average pooling vector are then concatenated horizontally, with all elements of the global average pooling vector appended in order to the global max pooling vector. The concatenated vector is the global graph representation vector, with a dimension of 512, which integrates the global feature information of the full node feature matrix.
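The pooling procedure in S51 to S53 can be illustrated with a short sketch. It assumes both node lists are non-empty and follows the description literally, including the zero-padding of the shorter matrix, which means padded rows take part in both the maximum and the average. Function names are illustrative.

```python
def pad_rows(matrix, rows, width):
    """Append all-zero rows until the matrix has the requested number of rows."""
    return matrix + [[0.0] * width for _ in range(rows - len(matrix))]


def global_graph_vector(text_nodes, visual_nodes):
    """Concatenate column-wise max pooling and average pooling of the full matrix."""
    rows = max(len(text_nodes), len(visual_nodes))
    text = pad_rows(text_nodes, rows, len(text_nodes[0]))
    visual = pad_rows(visual_nodes, rows, len(visual_nodes[0]))
    # Horizontal concatenation: each row holds text features then visual features.
    full = [t + v for t, v in zip(text, visual)]
    columns = list(zip(*full))
    max_vec = [max(col) for col in columns]
    avg_vec = [sum(col) / rows for col in columns]
    return max_vec + avg_vec
```

With 128-dimensional node vectors, the full matrix has 256 columns and the returned vector has 512 entries, matching the dimensions stated in the text.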

[0041] Specifically, S6 includes the following steps: S61. Input the global graph representation vector into the first fully connected unit of the classification layer for linear transformation processing to obtain the first transformation vector, and perform random deactivation processing on the first transformation vector to obtain the regularized feature vector. S62. Input the regularized feature vector into the second fully connected unit for dimensionality compression processing to obtain a category score vector whose dimension equals the number of illegal content categories. S63. Input the category score vector into the normalized exponential function unit for probability mapping processing to obtain the predicted probability value of each illegal content category; S64. Select the category corresponding to the highest probability value as the illegal content detection and recognition result corresponding to the advertising screen playback interface image based on the predicted probability value of each illegal content category.

[0042] In this embodiment, the global graph representation vector has a dimension of 512. It is input into the first fully connected unit of the classification layer for linear transformation. The first fully connected unit has a preset weight matrix dimension of 512×256 and a preset bias term of 0. During the linear transformation, the global graph representation vector is multiplied by the weight matrix of the first fully connected unit and the bias term is added to obtain the first transformation vector, which has a dimension of 256. The first transformation vector is then subjected to random deactivation (dropout) with a preset probability of 0.5; that is, 50% of the feature values in the first transformation vector are randomly set to 0 and the remaining 50% are retained, to avoid model overfitting. This yields the regularized feature vector, which has a dimension of 256. The regularized feature vector is input into the second fully connected unit for dimension compression. The number of illegal content categories is preset to 10 (determined according to relevant standards), so the second fully connected unit compresses the dimension of the regularized feature vector from 256 to 10. The weight matrix of the second fully connected unit is preset to 256×10, and the bias term is preset to 0. The regularized feature vector is multiplied by the weight matrix of the second fully connected unit and the bias term is added, which completes the dimension compression. The resulting vector is the category score vector, whose dimension equals the number of illegal content categories. Each element of the category score vector is the preliminary score of one illegal content category.
The category score vector (10-dimensional in this example, corresponding to 10 illegal content categories) is input into the normalized exponential function (Softmax) unit for probability mapping. Each element of the category score vector is converted into a value between 0 and 1, with the sum of all values being 1. Each converted value is the predicted probability value of the corresponding illegal content category. The predicted probability values of all illegal content categories are then compared one by one, and the illegal content category corresponding to the largest predicted probability value is selected. If multiple predicted probability values are equal and all are the maximum, a preset priority rule selects the illegal content category with the smallest category number. The selected category is used as the illegal content detection and recognition result corresponding to the advertising screen playback interface image. Illegal content categories include violence and gore, gambling and fraud, politically sensitive content, and false advertising.
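The classification steps S61 to S64 can be sketched as follows. The weight matrices are stored as input dimension by output dimension, matching the 512×256 and 256×10 shapes given in the text. The `training` flag and the seeded random generator are assumptions added for illustration; the sketch zeroes values without rescaling the survivors, as the text describes, whereas common deep learning frameworks also rescale the retained values.

```python
import math
import random


def linear(x, weight, bias):
    """Multiply vector x by an (in_dim x out_dim) weight matrix and add the bias."""
    return [sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j]
            for j in range(len(bias))]


def dropout(x, p=0.5, rng=None, training=True):
    """Randomly set each value to 0 with probability p (no rescaling)."""
    if not training:
        return list(x)
    rng = rng or random.Random(0)
    return [0.0 if rng.random() < p else v for v in x]


def softmax(scores):
    """Map scores to probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(graph_vec, w1, b1, w2, b2, training=False):
    """Return (category index, probabilities) for one global graph vector."""
    hidden = dropout(linear(graph_vec, w1, b1), training=training)
    probs = softmax(linear(hidden, w2, b2))
    # list.index returns the first match, so ties go to the smallest category number.
    return probs.index(max(probs)), probs
```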

[0043] The above are merely embodiments of the present invention. Commonly known structures and characteristics are not described in detail here. Those skilled in the art are aware of all common technical knowledge in the field prior to the application date or priority date, are aware of all existing technologies in that field, and have the ability to apply conventional experimental methods prior to that date. Those skilled in the art can, under the guidance of this application, improve and implement this solution in combination with their own capabilities. Some typical known structures or methods should not be obstacles for those skilled in the art to implement this application. It should be noted that those skilled in the art can make several modifications and improvements without departing from the structure of the present invention. These should also be considered within the scope of protection of the present invention, and will not affect the effectiveness of the implementation of the present invention or the practicality of the patent. The scope of protection claimed in this application should be determined by the content of its claims, and the specific embodiments described in the specification can be used to interpret the content of the claims.

Claims

1. A method for detecting and identifying illegal content of an advertising screen based on a deep graph convolutional network, characterized in that: Includes the following steps: S1. Obtain the image of the advertising screen playback interface, and perform text region segmentation and image region segmentation on the image of the advertising screen playback interface to obtain the text display area image and the image display area image. S2. Perform optical character recognition processing on the text display area image to obtain a text content sequence, and extract features from the image display area image according to the visual feature extraction model to obtain a visual feature map; S3. The text content sequence is processed by word vector embedding to obtain text node feature vectors, and the visual feature map is processed by region division and feature aggregation to obtain visual node feature vectors. A heterogeneous relationship graph is constructed based on the text node feature vectors and the visual node feature vectors. The connection edges between text nodes and visual nodes in the heterogeneous relationship graph are established through the spatial positional relationship between the text display area and the image display area in the advertising screen playback interface image. S4. Construct a deep graph convolutional network containing multiple graph convolutional layers and classification layers. Input the heterogeneous relationship graph into the deep graph convolutional network. Propagate and aggregate the text node feature vectors and visual node feature vectors through the graph convolutional layers to obtain updated text node feature vectors and updated visual node feature vectors. S5. Perform global graph pooling on the updated text node feature vector and the updated visual node feature vector to obtain the global graph representation vector. S6. 
Input the global graph representation vector into the classification layer to determine the category of illegal content, and obtain the illegal content detection and recognition result corresponding to the advertising screen playback interface image.

2. The method of claim 1, wherein S1 includes the following steps: S11. The image of the advertising screen playback interface is converted to grayscale according to the color space conversion model to obtain a grayscale image, and the edge is extracted from the grayscale image according to the Canny edge detection operator to obtain an edge intensity map. S12. Divide the edge intensity map into regions according to the connected component labeling algorithm to obtain candidate connected component masks, and filter the candidate connected component masks to obtain text candidate regions and image candidate regions. S13. The text candidate region and the image candidate region are cropped at the pixel level according to the mask generation algorithm to obtain the text display area image and the image display area image.

3. The method of claim 1, wherein S2 includes the following steps: S21. The text display area image is binarized and segmented according to the adaptive binarization algorithm to obtain a binarized text image, and the binarized text image is segmented into characters based on the vertical projection segmentation method to obtain a sequence of single character images. S22. Based on a pre-trained convolutional recurrent neural network, sequence recognition is performed on the single character image sequence to obtain a text content sequence. Multi-layer convolutional feature extraction is performed on the image display area to obtain a multi-scale feature map. The multi-scale feature map is then weighted and fused according to the channel attention mechanism to obtain a visual feature map.

4. The method for detecting and identifying illegal content on advertising screens based on a deep graph convolutional network according to claim 3, characterized in that: In S22, the mathematical expression for the channel attention mechanism is:

$$w_c = \sigma\left(W_2\,\delta\left(W_1 \cdot \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} F_c(i,j)\right)\right)$$

In the formula, $w_c$ indicates the attention weight of the $c$-th channel; $F_c(i,j)$ represents the value of the multi-scale feature map at the $c$-th channel and spatial location $(i,j)$; $H$ indicates the height of the feature map; $W$ indicates the width of the feature map; $W_1$ represents the weight matrix of the first fully connected layer; $W_2$ represents the weight matrix of the second fully connected layer; $\sigma$ represents the Sigmoid activation function; and $\delta$ represents the ReLU activation function.

5. The method for detecting and identifying illegal content on advertising screens based on a deep graph convolutional network according to claim 1, characterized in that: S3 includes the following steps: S31. Perform word embedding encoding on each text term in the text content sequence to obtain an initial text node feature vector; S32. The visual feature map is divided into grids to obtain multiple visual feature map blocks, and each visual feature map block is subjected to global average pooling to obtain the initial visual node feature vector corresponding to each visual feature map block. S33. Construct a spatial proximity matrix between text nodes and visual nodes based on the Euclidean distance between the center point coordinates of the text display area and the center point coordinates of the image display area in the advertising screen playback interface image; S34. Based on the spatial proximity matrix, establish connection edges between each text node and visual nodes that exceed a preset distance threshold, and initialize the weight values of the connection edges to obtain the heterogeneous relationship graph.

6. The method for detecting and identifying illegal content on advertising screens based on a deep graph convolutional network according to claim 1, characterized in that: S4 includes the following steps: S41. Input the text node feature vector and visual node feature vector in the heterogeneous relation graph into the first graph convolutional layer of the deep graph convolutional network, and sample and aggregate the neighbor node features of each node according to the adjacency matrix to obtain the node feature vector updated by the first graph convolutional layer. S42. Input the updated node feature vector of the first graph convolutional layer into the second graph convolutional layer, and perform weighted aggregation processing on the neighbor node features according to the adjacency matrix and attention coefficient to obtain the updated node feature vector of the second graph convolutional layer. S43. Input the updated node feature vectors from the second graph convolutional layer into the residual connection layer for feature enhancement processing to obtain the enhanced node feature vectors. S44. Input the enhanced node feature vector into the layer normalization unit for normalization processing to obtain the updated text node feature vector and the updated visual node feature vector.

7. The method for detecting and identifying illegal content on advertising screens based on a deep graph convolutional network according to claim 6, characterized in that, in step S42, the mathematical expression for the weighted aggregation process is:

$$\mathbf{h}_i^{(2)} = \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, \mathbf{W}\, \mathbf{h}_j^{(1)}\right)$$

In the formula, $\mathbf{h}_i^{(2)}$ represents the feature vector of node $i$ updated by the second graph convolutional layer; $\sigma(\cdot)$ represents the activation function; $\mathcal{N}(i)$ represents the set of neighbor nodes of node $i$; $j$ indicates the neighbor node index; $\alpha_{ij}$ represents the attention coefficient between node $i$ and neighbor node $j$; $\mathbf{W}$ represents the learnable parameter matrix of the first convolutional layer; and $\mathbf{h}_j^{(1)}$ represents the feature vector of neighbor node $j$ updated by the first graph convolutional layer.
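As a worked illustration of the weighted aggregation in S42, write $\mathbf{h}_j^{(1)}$ for the first-layer feature vector of neighbor $j$, $\alpha_{ij}$ for the attention coefficient, $\mathbf{W}$ for the learnable parameter matrix, and $\sigma$ for the activation function. The numbers below are made up, and taking $\sigma$ to be ReLU is an assumption, since the claim only names "the activation function". Suppose node $i$ has two neighbors with equal attention coefficients:

$$\mathbf{h}_1^{(1)} = \begin{pmatrix} 1 \\ 2 \end{pmatrix},\quad \mathbf{h}_2^{(1)} = \begin{pmatrix} 3 \\ -4 \end{pmatrix},\quad \mathbf{W} = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix},\quad \alpha_{i1} = \alpha_{i2} = 0.5$$

$$\mathbf{W}\mathbf{h}_1^{(1)} = \begin{pmatrix} 3 \\ 2 \end{pmatrix},\qquad \mathbf{W}\mathbf{h}_2^{(1)} = \begin{pmatrix} -1 \\ -4 \end{pmatrix}$$

$$\sum_{j \in \{1,2\}} \alpha_{ij}\, \mathbf{W}\mathbf{h}_j^{(1)} = 0.5\begin{pmatrix} 3 \\ 2 \end{pmatrix} + 0.5\begin{pmatrix} -1 \\ -4 \end{pmatrix} = \begin{pmatrix} 1 \\ -1 \end{pmatrix},\qquad \mathbf{h}_i^{(2)} = \mathrm{ReLU}\begin{pmatrix} 1 \\ -1 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$$

A larger $\alpha_{ij}$ pulls the result toward the transformed features of neighbor $j$, which is how the layer can weight a nearby slogan more heavily than an unrelated image block.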

8. The method for detecting and identifying illegal content on advertising screens based on a deep graph convolutional network according to claim 1, characterized in that S5 includes the following steps: S51. Concatenate all the updated text node feature vectors to obtain the text node feature matrix, and concatenate all the updated visual node feature vectors to obtain the visual node feature matrix; S52. Horizontally concatenate the text node feature matrix and the visual node feature matrix to obtain the full node feature matrix, and perform global max pooling on the full node feature matrix to obtain the global max pooling vector; S53. Perform global average pooling on the full node feature matrix to obtain the global average pooling vector, and concatenate the global max pooling vector with the global average pooling vector to obtain the global graph representation vector.
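The readout in S51 to S53 can be sketched in a few lines of plain Python. The sketch assumes that text and visual node vectors share the same dimension and that the full node feature matrix stacks them along the node axis, so that pooling runs over nodes for each feature dimension. The function name and the toy values are hypothetical.

```python
def global_graph_vector(text_nodes, visual_nodes):
    """Pool all node vectors into one graph-level vector (max pooling + average pooling)."""
    all_nodes = text_nodes + visual_nodes  # S51/S52: full node feature matrix
    dim = len(all_nodes[0])
    # S52: global max pooling over nodes, one value per feature dimension.
    max_pool = [max(n[d] for n in all_nodes) for d in range(dim)]
    # S53: global average pooling over nodes, one value per feature dimension.
    avg_pool = [sum(n[d] for n in all_nodes) / len(all_nodes) for d in range(dim)]
    # S53: concatenate both pooled vectors into the global graph representation.
    return max_pool + avg_pool

text_nodes = [[1.0, -2.0], [3.0, 0.0]]
visual_nodes = [[-1.0, 5.0]]
graph_vec = global_graph_vector(text_nodes, visual_nodes)
# graph_vec == [3.0, 5.0, 1.0, 1.0]
```

The output length is twice the node feature dimension and does not depend on the number of nodes, which lets screens with different amounts of text and imagery feed the same classifier.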

9. The method for detecting and identifying illegal content on advertising screens based on a deep graph convolutional network according to claim 1, characterized in that S6 includes the following steps: S61. Input the global graph representation vector into the first fully connected unit of the classification layer for linear transformation to obtain the first transformation vector, and apply dropout (random deactivation) to the first transformation vector to obtain the regularized feature vector; S62. Input the regularized feature vector into the second fully connected unit for dimensionality compression to obtain a category score vector whose dimension equals the number of illegal content categories; S63. Input the category score vector into the softmax (normalized exponential function) unit for probability mapping to obtain the predicted probability value of each illegal content category; S64. Based on the predicted probability values of the illegal content categories, select the category with the highest probability value as the illegal content detection and recognition result for the advertising screen playback interface image.
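The classification head in S61 to S64 can be sketched as follows in plain Python. The weights, the dropout rate, and the two-category setup are made-up values for illustration, and the use of inverted dropout that is switched off at inference time is an assumption rather than something the claim specifies.

```python
import math
import random

def linear(x, weights, bias):
    """Fully connected unit: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def dropout(x, rate, training, rng):
    """S61: inverted dropout; acts as the identity when not training."""
    if not training or rate == 0.0:
        return list(x)
    keep = 1.0 - rate
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]

def softmax(scores):
    """S63: normalized exponential function, shifted by the max for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(graph_vec, w1, b1, w2, b2, rate=0.5, training=False, rng=None):
    rng = rng or random.Random(0)
    hidden = dropout(linear(graph_vec, w1, b1), rate, training, rng)  # S61
    scores = linear(hidden, w2, b2)                                   # S62
    probs = softmax(scores)                                           # S63
    label = max(range(len(probs)), key=probs.__getitem__)             # S64
    return label, probs

# Hypothetical parameters: 2-dim graph vector -> 3 hidden units -> 2 categories.
w1, b1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]
w2, b2 = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0]
label, probs = classify([1.0, 2.0], w1, b1, w2, b2)
```

With these toy weights the category scores are 1 and 3, so the second category receives the higher probability and is returned as the recognition result.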