Content detection method and system

By dividing the content image into multiple subspaces and extracting and fusing features based on connectivity, the problem of low detection accuracy in existing technologies is solved, achieving more efficient identification and detection of fake content.

CN116246152B · Active · Publication Date: 2026-06-26 · ALIPAY (HANGZHOU) INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
ALIPAY (HANGZHOU) INFORMATION TECH CO LTD
Filing Date
2022-12-23
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing content detection methods perform uniform feature modeling across all regions of the content image, resulting in low accuracy in detecting fake or forged content and an inability to effectively identify the possibility of tampering and clues in different regions.

Method used

The target content image is divided into multiple continuous subspaces, including contour regions and key point regions. Feature extraction is performed based on the connectivity between subspaces. A content detection model is used for feature fusion and purification to generate target fusion features and spatial features to determine the risk detection results.

Benefits of technology

It improves the accuracy of content detection, reduces computational load, enhances the accuracy of feature extraction, and can more effectively identify traces of fake or forged content.

✦ Generated by Eureka AI based on patent content.


Abstract

This specification provides a content detection method and system. After a target content image of target content is obtained, spatial images corresponding to a plurality of continuous subspaces are extracted from it. The plurality of continuous subspaces include a contour region and at least one key point region corresponding to a key point, and the contour region includes the region of the target content image other than the key point region. Based on the connection relationship between the plurality of continuous subspaces, feature extraction is then performed on the spatial images to obtain target fusion features and spatial features of each subspace in the plurality of continuous subspaces. Based on the target fusion features and the spatial features, a risk detection result of the target content is determined and output. The scheme can improve the accuracy of content detection.

Description

Technical Field

[0001] This specification relates to the field of content detection, and in particular to a content detection method and system.

Background Technology

[0002] In recent years, with the rapid development of internet technology, content creation has become increasingly convenient, resulting in a surge of online content. However, this content often includes risky material, such as false or fabricated information. This risky content can significantly mislead public opinion, necessitating content risk detection. Current content detection methods primarily rely on cloud-based or edge-based content detection models.

[0003] In the process of researching and practicing existing technologies, the inventors of this application found that existing content detection methods often perform uniform feature modeling in all areas of the content image. In reality, however, when fake or forged content is generated, the probability of tampering differs from area to area, and so do the clues left after tampering. As a result, the features extracted for risk detection are inaccurate, and the accuracy of content detection is low.

Summary of the Invention

[0004] This specification provides a more accurate content detection method and system.

[0005] In a first aspect, this specification provides a content detection method, comprising: acquiring a target content image corresponding to target content, and extracting spatial images corresponding to multiple continuous subspaces from the target content image, wherein the multiple continuous subspaces include a contour region and a keypoint region corresponding to at least one keypoint, and the contour region includes the region in the target content image other than the keypoint region; performing feature extraction on the spatial images based on the connection relationship between the multiple continuous subspaces to obtain a target fusion feature and spatial features of each subspace in the multiple continuous subspaces, wherein the target fusion feature is the feature corresponding to the fusion of the spatial features; and determining a risk detection result of the target content based on the target fusion feature and the spatial features, and outputting the risk detection result.

[0006] In some embodiments, the at least one key point includes a facial key point corresponding to at least one part of the facial region.

[0007] In some embodiments, obtaining the target content image corresponding to the target content includes: obtaining the original content image corresponding to the target content; and performing facial alignment on the original content image to obtain the target content image.

[0008] In some embodiments, the step of performing facial alignment on the original content image to obtain the target content image includes: performing facial detection on the original content image; when the facial detection of the original content image passes, extracting at least one initial key point corresponding to the facial region in the original content image; and aligning the facial region in the original content image with a preset facial template based on the position information of the at least one initial key point to obtain the target content image.

[0009] In some embodiments, extracting spatial images of multiple continuous subspaces from the target content image includes: acquiring at least one key point in the target content image; segmenting at least one region image corresponding to a facial part in the target content image based on the at least one key point to obtain at least one facial part image; and using the region image in the target content image other than the facial part image as a facial contour image, and using the facial part image and the facial contour image as spatial images of the multiple continuous subspaces.
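The split described in [0009] can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the image is a 2-D grid of pixel values and that each facial part is given as a hypothetical axis-aligned box already derived from its key points (the specification does not fix the region shape). The function name `split_subspaces`, the box format, and the fill value 0 are assumptions made for the example.

```python
def split_subspaces(image, part_boxes, fill=0):
    """Split an image into facial part images and one facial contour image.

    image: list of rows, each a list of pixel values.
    part_boxes: dict name -> (top, left, bottom, right); bottom/right exclusive.
    The box representation is an assumption of this sketch.
    """
    parts = {}
    # Start from a copy; part regions are blanked out to leave the contour.
    contour = [row[:] for row in image]
    for name, (top, left, bottom, right) in part_boxes.items():
        # Crop the facial part image.
        parts[name] = [row[left:right] for row in image[top:bottom]]
        # Remove the same region from the contour image.
        for r in range(top, bottom):
            for c in range(left, right):
                contour[r][c] = fill
    return parts, contour


# Toy 6x6 image whose pixel value encodes its position (row * 10 + column).
image = [[r * 10 + c for c in range(6)] for r in range(6)]
boxes = {"left_eye": (1, 1, 2, 3), "mouth": (4, 2, 5, 5)}
parts, contour = split_subspaces(image, boxes)
```

The part images and the contour image together form the spatial images of the continuous subspaces; a real pipeline would likely derive the boxes (or masks) from the detected key points rather than hard-code them.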

[0010] In some embodiments, the step of extracting features from the spatial image to obtain target fusion features and spatial features corresponding to each of the plurality of continuous subspaces includes: determining the connection relationship between the plurality of continuous subspaces based on the spatial type of each subspace; and extracting features from the spatial image using a content detection model based on the connection relationship to obtain target fusion features and spatial features of each of the plurality of continuous subspaces.

[0011] In some embodiments, the content detection model includes a content detection network and a graph network; and the step of using the content detection model to extract features from the spatial image to obtain target fusion features and spatial features of each subspace in the plurality of continuous subspaces includes: using the content detection network to extract features from the spatial image to obtain initial spatial features of each subspace, fusing the initial spatial features to obtain initial fusion features, constructing a connection network graph based on the connection relationship using the initial spatial features and the initial fusion features as nodes, and refining the initial spatial features and the initial fusion features using the graph network based on the connection network graph to obtain the target fusion features and spatial features of each subspace.
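To make the connection network graph of [0011] concrete, the sketch below builds an adjacency map with one node per subspace and one node for the fused feature. The edge rule is an assumption of this example, since the specification only says the connection relationship follows from the spatial type of each subspace: here every key point region is linked to every contour region, and the fusion node is linked to all subspace nodes. The name `build_connection_graph` is hypothetical.

```python
def build_connection_graph(subspace_types, fusion_node="fusion"):
    """Return an undirected adjacency map over subspace nodes plus a fusion node.

    subspace_types: dict name -> "contour" or "keypoint".
    Assumed rule (not taken from the specification): key point regions are
    connected to contour regions, and the fusion node is connected to all.
    """
    graph = {name: set() for name in subspace_types}
    graph[fusion_node] = set()
    contours = [n for n, t in subspace_types.items() if t == "contour"]
    keypoints = [n for n, t in subspace_types.items() if t == "keypoint"]
    # Link each key point region with each contour region.
    for k in keypoints:
        for c in contours:
            graph[k].add(c)
            graph[c].add(k)
    # Link the fusion node with every subspace node.
    for name in subspace_types:
        graph[name].add(fusion_node)
        graph[fusion_node].add(name)
    return graph


types = {"contour": "contour", "left_eye": "keypoint", "mouth": "keypoint"}
graph = build_connection_graph(types)
```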

[0012] In some embodiments, the connection network graph includes a first node corresponding to the initial spatial feature and a second node corresponding to the initial fusion feature; and the step of using the graph network to refine the initial spatial feature and the initial fusion feature to obtain the target fusion feature and the spatial feature corresponding to each subspace includes: selecting neighboring nodes corresponding to each node in the connection network graph; refining the initial spatial feature based on a preset information transfer function corresponding to the graph network and the neighboring nodes corresponding to the first node to obtain the spatial feature of each subspace; and refining the initial fusion feature based on the preset information transfer function and the neighboring nodes corresponding to the second node to obtain the target fusion feature.
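The refinement step in [0012] can be illustrated with one round of message passing. The specification does not define the preset information transfer function, so this sketch substitutes a simple assumed rule: each node keeps a share `1 - alpha` of its own feature and takes a share `alpha` of the mean of its neighbours' features. The function name `refine_features` and the parameter `alpha` are hypothetical.

```python
def refine_features(features, graph, alpha=0.5):
    """One round of message passing over the connection network graph.

    features: dict node -> list of floats (same length for every node).
    graph: dict node -> iterable of neighbouring nodes.
    Assumed transfer function: (1 - alpha) * own + alpha * mean(neighbours).
    """
    refined = {}
    for node, own in features.items():
        neighbours = [features[n] for n in graph[node]]
        if not neighbours:
            # Isolated node: nothing to aggregate, keep the feature as is.
            refined[node] = list(own)
            continue
        mean = [sum(vals) / len(neighbours) for vals in zip(*neighbours)]
        refined[node] = [(1 - alpha) * o + alpha * m for o, m in zip(own, mean)]
    return refined


# Two subspace nodes (first nodes) and one fusion node (second node).
features = {"contour": [0.0, 2.0], "left_eye": [4.0, 0.0], "fusion": [2.0, 2.0]}
graph = {
    "contour": ["left_eye", "fusion"],
    "left_eye": ["contour", "fusion"],
    "fusion": ["contour", "left_eye"],
}
refined = refine_features(features, graph)
```

In this toy run the refined subspace nodes play the role of the spatial features and the refined fusion node plays the role of the target fusion feature; a trained graph network would use learned weights instead of a fixed `alpha`.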

[0013] In some embodiments, training the content detection model includes the following steps: training a preset content detection network to obtain an initial content detection network, and dynamically compressing the initial content detection network to obtain the content detection network; and training a preset graph network to obtain the graph network, and using the content detection network and the graph network as the content detection model.

[0014] In some embodiments, the preset content detection network includes a feature extraction subnetwork, a feature fusion subnetwork, and a risk classification subnetwork; and training the preset content detection network to obtain an initial content detection network includes: acquiring a first spatial image sample of each subspace, and using the feature extraction subnetwork to extract features from the first spatial image sample to obtain a first sample spatial feature; using the feature fusion subnetwork to fuse the first sample spatial feature to obtain a first sample fusion feature; using the risk classification subnetwork to determine a first predicted risk category corresponding to the first sample spatial feature and a second predicted risk category corresponding to the first sample fusion feature; and based on the first predicted risk category and the second predicted risk category, converging the preset content detection network to obtain the initial content detection network.

[0015] In some embodiments, the step of converging the preset content detection network to obtain the initial content detection network includes: acquiring a first labeled risk category of the first spatial image sample, and comparing the first labeled risk category with the first predicted risk category to obtain first independent classification loss information; comparing the first labeled risk category with the second predicted risk category to obtain first fused classification loss information; and fusing the first independent classification loss information and the first fused classification loss information, and converging the preset content detection network based on the fused target classification loss information to obtain the initial content detection network.
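As a rough illustration of how the two losses in [0015] might be fused, the sketch below assumes binary labels (1 for risky, 0 for normal), binary cross-entropy as the comparison, an average over subspaces for the independent loss, and a weighted sum for the fusion. None of these choices is stated in the specification; `target_classification_loss` and `fused_weight` are hypothetical names.

```python
import math


def binary_cross_entropy(p_attack, label, eps=1e-12):
    """Cross-entropy between a predicted attack probability and a 0/1 label."""
    p = min(max(p_attack, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def target_classification_loss(subspace_probs, fused_prob, label, fused_weight=1.0):
    """Fuse the independent and fused classification losses (assumed weighted sum)."""
    # Independent classification loss: one prediction per subspace, averaged.
    independent = sum(binary_cross_entropy(p, label) for p in subspace_probs)
    independent /= len(subspace_probs)
    # Fused classification loss: prediction made from the fused feature.
    fused = binary_cross_entropy(fused_prob, label)
    return independent + fused_weight * fused


loss = target_classification_loss([0.5, 0.5], 0.5, label=1)
```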

[0016] In some embodiments, dynamically compressing the initial content detection network to obtain the content detection network includes: adding a compression layer after each network layer in the initial content detection network to obtain a candidate content detection network, wherein the compression layer includes a batch normalization layer and a target convolutional layer of a preset size; acquiring second spatial image samples of each subspace and training the candidate content detection network based on the second spatial image samples to obtain a trained current content detection network; acquiring the feature channel weights of each batch normalization layer in the current content detection network and compressing the current content detection network based on the feature channel weights; and linearly superimposing the target convolutional layer with the convolutional layer in the compressed content detection network to obtain the content detection network.
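The final step of [0016], linearly superimposing the target convolutional layer with a convolutional layer, relies on the fact that two consecutive linear maps compose into one. The sketch below shows this for a 1×1 convolution applied after a k×k convolution. It is a simplified assumption-laden illustration: it ignores the batch normalization layer, biases, and any nonlinearity between the two layers, and the layer ordering is assumed. The helper names are hypothetical.

```python
def merge_pointwise_into_conv(pointwise, conv):
    """Fold a 1x1 convolution into the convolution that precedes it.

    pointwise[o][m]: 1x1 kernel mapping mid channel m to output channel o.
    conv[m][i][a][b]: kernel mapping input channel i to mid channel m.
    Returns merged[o][i][a][b] = sum over m of pointwise[o][m] * conv[m][i][a][b].
    """
    mid = len(conv)
    n_in, k_h, k_w = len(conv[0]), len(conv[0][0]), len(conv[0][0][0])
    return [[[[sum(pointwise[o][m] * conv[m][i][a][b] for m in range(mid))
               for b in range(k_w)] for a in range(k_h)] for i in range(n_in)]
            for o in range(len(pointwise))]


def apply_kernel(kernel, patch):
    """Response of each output channel at a single location (patch[i][a][b])."""
    return [sum(k_o[i][a][b] * patch[i][a][b]
                for i in range(len(patch))
                for a in range(len(patch[0]))
                for b in range(len(patch[0][0])))
            for k_o in kernel]


conv = [[[[1, 0], [2, 1]]], [[[0, 3], [1, 1]]]]  # 2 mid channels, 1 input channel, 2x2
pointwise = [[1, 2], [0, -1], [3, 1]]            # 3 output channels from 2 mid channels
patch = [[[1, 2], [3, 4]]]                       # one 2x2 input patch

# Two-step path: k x k convolution, then 1x1 convolution.
mid_response = apply_kernel(conv, patch)
two_step = [sum(w * y for w, y in zip(row, mid_response)) for row in pointwise]
# One-step path: a single merged convolution.
one_step = apply_kernel(merge_pointwise_into_conv(pointwise, conv), patch)
```

Because both paths give the same response, the extra 1×1 layer adds no inference cost once merged, which is consistent with the stated goal of compression.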

[0017] In some embodiments, training the candidate content detection network based on the second spatial image samples to obtain the trained current content detection network includes: using the candidate content detection network to extract features from the second spatial image samples to obtain second sample spatial features and initial feature channel weights of the batch normalization layer corresponding to the second sample spatial features; fusing the second sample spatial features and determining the target compression loss information of the second spatial image samples based on the fused second sample features, the second sample spatial features, and the initial feature channel weights; and converging the compression layer in the candidate content detection network based on the target compression loss information to obtain the current content detection network.

[0018] In some embodiments, determining the target compression loss information of the second spatial image sample includes: determining second fusion classification loss information based on the second sample fusion features, and determining second independent classification loss information based on the second sample spatial features; determining weighted sparsity loss information based on the initial feature channel weights, wherein the constraint condition of the weighted sparsity loss information is that the initial feature channel weights corresponding to a preset number of feature channels are less than a preset weight threshold; and fusing the second fusion classification loss information, the second independent classification loss information, and the weighted sparsity loss information to obtain the target compression loss information.
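The loss in [0018] can be sketched as below. The specification does not name the sparsity penalty, so this example assumes an L1 penalty on the batch normalization channel weights (a common choice in channel-pruning methods), a plain sum as the fusion, and a hypothetical coefficient `sparsity_coeff`. The constraint check mirrors the stated condition that a preset number of channels have weights below a preset threshold; comparing magnitudes is an assumption.

```python
def weight_sparsity_loss(channel_weights):
    """Assumed L1 penalty pushing batch-normalization channel weights to zero."""
    return sum(abs(w) for w in channel_weights)


def sparsity_constraint_met(channel_weights, weight_threshold, min_channels):
    """True when at least min_channels weights fall below the threshold."""
    small = sum(1 for w in channel_weights if abs(w) < weight_threshold)
    return small >= min_channels


def target_compression_loss(fused_cls_loss, independent_cls_loss,
                            channel_weights, sparsity_coeff=0.01):
    """Fuse the two classification losses with the weighted sparsity loss."""
    sparsity = weight_sparsity_loss(channel_weights)
    return fused_cls_loss + independent_cls_loss + sparsity_coeff * sparsity


weights = [0.5, -0.25, 0.0, 1.25]
loss = target_compression_loss(1.0, 2.0, weights, sparsity_coeff=0.5)
```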

[0019] In some embodiments, compressing the current content detection network based on the feature channel weights includes: selecting at least one feature channel weight less than a preset weight threshold from the feature channel weights to obtain a target feature channel weight; identifying a target feature channel corresponding to the target feature channel weight in each network layer of the current content detection network; and pruning the target feature channel to obtain the content detection compression network.
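The selection and pruning steps of [0019] reduce to a threshold filter over channel weights. In this minimal sketch, filters are represented by placeholder strings, and the comparison uses the weight magnitude, which is an assumption (the specification only says "less than a preset weight threshold"). Both function names are hypothetical.

```python
def select_prunable_channels(channel_weights, weight_threshold):
    """Indices of feature channels whose weight magnitude is below the threshold."""
    return [i for i, w in enumerate(channel_weights) if abs(w) < weight_threshold]


def prune_channels(layer_filters, prunable):
    """Drop the filters of the selected channels, keeping the original order."""
    drop = set(prunable)
    return [f for i, f in enumerate(layer_filters) if i not in drop]


bn_weights = [0.9, 0.01, 0.4, 0.002]   # one weight per feature channel
filters = ["f0", "f1", "f2", "f3"]     # placeholder for per-channel filters
prunable = select_prunable_channels(bn_weights, weight_threshold=0.05)
kept = prune_channels(filters, prunable)
```

In a real network, removing an output channel from one layer also requires removing the matching input channel from the next layer; that bookkeeping is omitted here.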

[0020] In some embodiments, training a preset graph network to obtain the graph network includes: acquiring connection network graph samples, wherein the connection network graph samples include nodes corresponding to third sample spatial features and nodes corresponding to third sample fusion features, and the third sample fusion features are features obtained by fusing the third sample spatial features; based on the connection network graph samples, using the preset graph network to refine the third sample spatial features and the third sample fusion features respectively, to obtain updated spatial features corresponding to the third sample spatial features and updated fusion features corresponding to the third sample fusion features; and based on the updated spatial features and the updated fusion features, converging the preset graph network to obtain the trained graph network.

[0021] In some embodiments, the step of converging the preset graph network to obtain the trained graph network includes: obtaining a second labeled risk category corresponding to the connection network graph sample; determining a third predicted risk category of the connection network graph sample based on the updated spatial features, and comparing the second labeled risk category with the third predicted risk category to obtain third independent classification loss information; determining a fourth predicted risk category of the connection network graph sample based on the updated fusion features, and comparing the second labeled risk category with the fourth predicted risk category to obtain third fused classification loss information; and fusing the third independent classification loss information and the third fused classification loss information, and converging the preset graph network based on the fused graph network loss information to obtain the trained graph network.

[0022] In some embodiments, the content detection model is deployed on a client or terminal.

[0023] In some embodiments, determining the risk detection result of the target content based on the target fusion feature and the spatial feature includes: determining the fusion attack probability of the target content based on the target fusion feature; determining the independent attack probability of the target content in each subspace based on the spatial feature; and determining the risk detection result of the target content based on the fusion attack probability and the independent attack probability.

[0024] In some embodiments, the risk detection result includes either risky content or normal content; and determining the risk detection result of the target content includes: determining the average probability of the independent attack probabilities, and determining the risk detection result of the target content as risky content when the fusion attack probability is greater than a preset first probability threshold or the average probability is greater than a preset second probability threshold.

[0025] In some embodiments, the method further includes: when the fusion attack probability is less than the preset first probability threshold and the average probability is less than the preset second probability threshold, determining the risk detection result of the target content as normal content.
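The decision rule of [0023] to [0025] can be written as a short function. The threshold defaults of 0.5 are placeholders, not values from the specification, and the name `detect_risk` is hypothetical. The specification does not say what happens when a probability exactly equals its threshold; this sketch treats that case as normal content, which is an assumption.

```python
def detect_risk(fusion_attack_prob, independent_attack_probs,
                first_threshold=0.5, second_threshold=0.5):
    """Combine the fusion attack probability with the per-subspace probabilities."""
    # Average of the independent attack probabilities over all subspaces.
    average_prob = sum(independent_attack_probs) / len(independent_attack_probs)
    # Risky if either signal exceeds its threshold.
    if fusion_attack_prob > first_threshold or average_prob > second_threshold:
        return "risky content"
    return "normal content"


result_fusion_high = detect_risk(0.9, [0.1, 0.2])
result_average_high = detect_risk(0.2, [0.7, 0.9])
result_both_low = detect_risk(0.2, [0.1, 0.3])
```

Using an OR of the two conditions means a strong signal from either the fused feature or the subspaces as a group is enough to flag the content.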

[0026] In a second aspect, this specification also provides a content detection system, comprising: at least one storage medium storing at least one instruction set for performing content detection; and at least one processor communicatively connected to the at least one storage medium, wherein, when the content detection system is running, the at least one processor reads the at least one instruction set and executes the content detection method described in the first aspect of this specification according to the instructions of the at least one instruction set.

[0027] As can be seen from the above technical solutions, the content detection method and system provided in this specification acquire the target content image and extract from it spatial images corresponding to multiple continuous subspaces, wherein the multiple continuous subspaces include contour regions and key point regions corresponding to at least one key point, and the contour regions include the regions of the target content image other than the key point regions. Then, based on the connection relationship between the multiple continuous subspaces, feature extraction is performed on the spatial images to obtain target fusion features and spatial features of each subspace, and based on the target fusion features and spatial features, the risk detection result of the target content is determined and output. Because this solution extracts spatial images corresponding to multiple continuous subspaces from the target content image and uses different weights to extract features from different regions based on the connection relationship between the subspaces, it can extract features from the regions where false or forged traces are concentrated, thereby improving the accuracy of feature extraction. Moreover, dividing the target content image into multiple regions reduces the input resolution and greatly reduces the computational load during risk detection. The accuracy of content detection is thereby improved.

[0028] Other functions of the content detection method and system provided in this specification will be partially listed in the following description. The figures and examples described below will be readily apparent to those skilled in the art. The inventive aspects of the content detection method and system provided in this specification can be fully understood through practice or use of the methods, apparatus, and combinations described in the detailed examples below.

Description of the Drawings

[0029] To more clearly illustrate the technical solutions in the embodiments of this specification, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this specification. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0030] Figure 1 shows a schematic diagram of an application scenario of a content detection system provided according to an embodiment of this specification;

[0031] Figure 2 shows a schematic diagram of the hardware structure of a computing device provided according to an embodiment of this specification;

[0032] Figure 3 shows a schematic flowchart of a content detection method provided according to an embodiment of this specification;

[0033] Figure 4 shows a schematic diagram of the overall process of content detection in a deepfakes detection scenario according to embodiments of this specification; and

[0034] Figure 5 shows a schematic diagram of a deepfakes content detection process provided according to an embodiment of this specification.

Detailed Implementation

[0035] The following description provides specific application scenarios and requirements for this specification, intended to enable those skilled in the art to make and use the contents of this specification. Various local modifications to the disclosed embodiments will be apparent to those skilled in the art, and the general principles defined herein can be applied to other embodiments and applications without departing from the spirit and scope of this specification. Therefore, this specification is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the claims.

[0036] The terminology used herein is for the purpose of describing particular exemplary embodiments only and is not restrictive. For example, unless the context clearly indicates otherwise, the singular forms “a,” “an,” and “the” used herein may also include the plural forms. When used in this specification, the terms “comprising,” “including,” and / or “containing” mean that the associated integers, steps, operations, elements, and / or components are present, but do not exclude the presence of one or more other features, integers, steps, operations, elements, components, and / or groups, or that other features, integers, steps, operations, elements, components, and / or groups may be added to the system / method.

[0037] These and other features of this specification, as well as the operation and function of the related elements of the structure, the combination of parts, and the economy of manufacture, may become more apparent upon consideration of the following description with reference to the accompanying drawings, all of which form part of this specification. However, it should be clearly understood that the drawings are for illustrative and descriptive purposes only and are not intended to limit the scope of this specification. It should also be understood that the drawings are not drawn to scale.

[0038] The flowcharts used in this specification illustrate operations implemented according to some embodiments of this specification. It should be clearly understood that the operations in the flowcharts need not be implemented in the order shown. Instead, the operations may be implemented in reverse order or simultaneously. Furthermore, one or more operations may be added to the flowcharts, and one or more operations may be removed from them.

[0039] For ease of description, the terms that will appear in the following descriptions will be explained as follows:

[0040] Continuous subspace learning: After aligning the facial region, it is divided into several key continuous subspaces; feature learning is performed for each subspace, and then the continuous relationship between the subspaces is learned to perform risk detection learning for deepfakes.

[0041] Dynamic compression: The model parameters are compressed by first changing the network structure (introducing 1×1 convolutional layers), then pruning unnecessary parameters using the parameters of the convolutional layers, and finally integrating the convolutional layers into the network structure to complete the compression.

[0042] Deepfakes (generated false / fake content): refers to the creation of false and untrue text, image, or video content using text or image / video generation methods, and then publishing it on content platforms. Multimodal false / fake content can have a significant misleading effect on public opinion.

[0043] Deepfakes detection refers to the use of deep learning, machine learning, and other technologies to detect whether images / text or other content were generated by deepfakes models. It can also be understood as detecting whether the generated content is fake / forged.

[0044] Before describing the specific embodiments in this specification, the application scenarios of this specification will be introduced as follows:

[0045] The content detection method provided in this specification can be applied to any content detection scenario. For example, in content review scenarios, it can be used to detect content awaiting review on a content platform; in content publishing scenarios, it can be used to detect content to be published; and in content sharing scenarios, it can be used to detect content to be shared. It can also be applied to any other content detection scenario, which will not be elaborated on here.

[0046] Those skilled in the art should understand that the content detection methods and systems described in this specification are also within the scope of protection of this specification when applied to other use scenarios.

[0047] Figure 1 illustrates an application scenario of a content detection system 001 provided according to an embodiment of this specification. The content detection system 001 (hereinafter referred to as system 001) can be applied to content detection in any scenario, such as content detection in content review scenarios, content detection in content publishing scenarios, content detection in content sharing scenarios, etc. As shown in Figure 1, system 001 may include target user 100, client 200, server 300, and network 400 within the target space.

[0048] Target user 100 can be the user who triggers content detection on the target content. Target user 100 can perform content detection operations on client 200.

[0049] Client 200 can be a device for detecting target content in response to a content detection operation by target user 100. In some embodiments, the content detection method can be executed on client 200. In this case, client 200 may store data or instructions for executing the content detection method described herein, and may execute or be used to execute said data or instructions. In some embodiments, client 200 may include a hardware device with data information processing capabilities and the necessary programs required to drive the hardware device to operate. As shown in Figure 1, client 200 can communicate with server 300. In some embodiments, server 300 can communicate with multiple clients 200. In some embodiments, client 200 can interact with server 300 through network 400 to receive or send messages, such as receiving target content or target content images. In some embodiments, client 200 may include mobile devices, tablets, laptops, built-in devices in motor vehicles, or similar devices, or any combination thereof. In some embodiments, the mobile device may include smart home devices, smart mobile devices, virtual reality devices, augmented reality devices, or similar devices, or any combination thereof. In some embodiments, the smart home device may include smart TVs, desktop computers, or any combination thereof. In some embodiments, the smart mobile device may include smartphones, personal digital assistants, gaming devices, navigation devices, or any combination thereof. In some embodiments, the virtual reality device or augmented reality device may include virtual reality headsets, virtual reality glasses, virtual reality controllers, augmented reality headsets, augmented reality glasses, augmented reality controllers, or similar devices, or any combination thereof. For example, the virtual reality device or the augmented reality device may include Google Glass, head-mounted displays, VR devices, etc. In some embodiments, the built-in device in the motor vehicle may include an in-vehicle computer, an in-vehicle TV, etc. In some embodiments, the client 200 may include an image acquisition device for acquiring target content or an image of the target content, thereby obtaining a target content image. In some embodiments, the image acquisition device may be a two-dimensional image acquisition device (such as an RGB camera), or a combination of a two-dimensional image acquisition device (such as an RGB camera) and a depth image acquisition device (such as a 3D structured light camera, a laser detector, etc.). In some embodiments, the client 200 may be a device with positioning technology for locating the position of the client 200.

[0050] In some embodiments, the client 200 may have one or more applications (APPs) installed. The APPs provide the target user 100 with the ability and interface to interact with the outside world via the network 400. The APPs include, but are not limited to: web browser APPs, search APPs, chat APPs, shopping APPs, video APPs, financial management APPs, instant messaging tools, email clients, social media platform software, etc. In some embodiments, the client 200 may have a target APP installed. The target APP can capture target content or images of target content for the client 200, thereby obtaining a target content image. In some embodiments, the target user 100 can also trigger a content detection request through the target APP. The target APP can respond to the content detection request by executing the content detection method described in this specification. The content detection method will be described in detail later.

[0051] Server 300 may be a server providing various services, such as a backend server supporting the acquisition of target content images on client 200 or the performance of content detection on target content images. In some embodiments, the content detection method may be executed on server 300. In this case, server 300 may store data or instructions for executing the content detection method described herein, and may execute or be used to execute said data or instructions. In some embodiments, server 300 may include hardware devices with data processing capabilities and necessary programs for driving the hardware devices. Server 300 may communicate with multiple clients 200 and receive data sent by clients 200.

[0052] Network 400 serves as a medium to provide a communication connection between client 200 and server 300. Network 400 facilitates the exchange of information or data. For example, as shown in Figure 1, client 200 and server 300 can connect to network 400 and transmit information or data to each other through network 400. In some embodiments, network 400 can be any type of wired or wireless network, or a combination thereof. For example, network 400 may include cable networks, wired networks, fiber optic networks, telecommunications networks, intranets, the Internet, local area networks (LANs), wide area networks (WANs), wireless local area networks (WLANs), metropolitan area networks (MANs), public switched telephone networks (PSTNs), Bluetooth™ networks, ZigBee™ networks, near-field communication (NFC) networks, or similar networks. In some embodiments, network 400 may include one or more network access points. For example, network 400 may include wired or wireless network access points, such as base stations or internet exchange points, through which one or more components of client 200 and server 300 can connect to network 400 to exchange data or information.

[0053] It should be understood that the number of clients 200, servers 300, and networks 400 shown in Figure 1 is merely illustrative. Depending on implementation needs, there can be any number of clients 200, servers 300, and networks 400.

[0054] It should be noted that the content detection method can be executed entirely on the client 200, entirely on the server 300, or partially on the client 200 and partially on the server 300.

[0055] Figure 2 shows a hardware structure diagram of a computing device 600 according to an embodiment of this specification. The computing device 600 can execute the content detection method described in this specification. The content detection method is described in other parts of this specification. When the content detection method is executed on a client 200, the computing device 600 can be the client 200. When the content detection method is executed on a server 300, the computing device 600 can be the server 300. When the content detection method is executed partly on the client 200 and partly on the server 300, the computing device 600 can be both the client 200 and the server 300.

[0056] As shown in Figure 2, the computing device 600 may include at least one storage medium 630 and at least one processor 620. In some embodiments, the computing device 600 may also include a communication port 650 and an internal communication bus 610. Additionally, the computing device 600 may include I/O components 660.

[0057] The internal communication bus 610 can connect different system components, including storage medium 630, processor 620 and communication port 650.

[0058] I / O component 660 supports input / output between computing device 600 and other components.

[0059] Communication port 650 is used for data communication between computing device 600 and external sources. For example, communication port 650 can be used for data communication between computing device 600 and network 400. Communication port 650 can be a wired communication port or a wireless communication port.

[0060] Storage medium 630 may include a data storage device. The data storage device may be a non-transitory storage medium or a temporary storage medium. For example, the data storage device may include one or more of a disk 632, a read-only storage medium (ROM) 634, or a random access storage medium (RAM) 636. Storage medium 630 also includes at least one instruction set stored in the data storage device. The instructions are computer program code, which may include programs, routines, objects, components, data structures, procedures, modules, etc., that execute the content detection methods provided in this specification.

[0061] At least one processor 620 can be communicatively connected to at least one storage medium 630 and a communication port 650 via an internal communication bus 610. The at least one processor 620 is used to execute the at least one instruction set described above. When the computing device 600 is running, the at least one processor 620 reads the at least one instruction set and, according to the instructions of the at least one instruction set, executes the content detection method provided in this specification. The processor 620 can execute all the steps included in the content detection method. The processor 620 can be in the form of one or more processors. In some embodiments, the processor 620 may include one or more hardware processors, such as a microcontroller, microprocessor, reduced instruction set computer (RISC), application-specific integrated circuit (ASIC), application-specific instruction set processor (ASIP), central processing unit (CPU), graphics processing unit (GPU), physical processing unit (PPU), microcontroller unit, digital signal processor (DSP), field-programmable gate array (FPGA), advanced RISC machine (ARM), programmable logic device (PLD), any circuit or processor capable of performing one or more functions, or any combination thereof. For illustrative purposes only, only one processor 620 is described in this specification for the computing device 600. However, it should be noted that the computing device 600 in this specification may also include multiple processors. Therefore, the operation and / or method steps disclosed in this specification may be executed by one processor as described in this specification, or they may be executed jointly by multiple processors. 
For example, if the processor 620 of the computing device 600 in this specification executes steps A and B, it should be understood that steps A and B may also be executed jointly or separately by two different processors 620 (e.g., the first processor executes step A, the second processor executes step B, or the first and second processors jointly execute steps A and B).

[0062] Figure 3 shows a flowchart of a content detection method P100 provided according to an embodiment of this specification. As previously described, the computing device 600 can execute the content detection method P100 of this specification. Specifically, the processor 620 can read an instruction set stored in its local storage medium and then execute the content detection method P100 of this specification according to the instructions in the instruction set. As shown in Figure 3, method P100 may include:

[0063] S110: Obtain the target content image corresponding to the target content, and extract spatial images corresponding to multiple continuous subspaces from the target content image.

[0064] The target content refers to the content that needs to be detected. The content can be of various types, such as video content, image content, image or video content generated from text or audio, and so on.

[0065] The target content image can be an image corresponding to the target content. For example, if the target content is an image, then the target content image can be the target content itself. Or, if the target content is video content, then the target content image can be one or more video frames in the video content.

[0066] The multiple contiguous subspaces include a contour region and a keypoint region corresponding to at least one keypoint. The contour region includes the area in the target content image other than the keypoint region. The at least one keypoint may include facial keypoints corresponding to at least one facial feature, which may include at least one of the following: eyes (left / right), nose, ears (left / right), mouth, eyebrows, forehead, or other facial features. The keypoint region may include facial features or other facial regions. The multiple contiguous subspaces may include a facial features subspace, other facial region subspaces, and a contour region subspace, etc.

[0067] There are several ways to obtain the target content image corresponding to the target content and extract spatial images corresponding to multiple continuous subspaces from the target content image, as follows:

[0068] S111: Obtain the target content image corresponding to the target content.

[0069] For example, the processor 620 can acquire the original content image corresponding to the target content, perform facial alignment on the original content image, and obtain the target content image.

[0070] The original content image can be the original image corresponding to the target content. There are several ways to obtain the original content image corresponding to the target content. For example, the processor 620 can directly receive the original content image of the target content uploaded by the target user 100 through the client 200 or the terminal; or, it can extract the content image from the target content to obtain the original content image; or, it can perform image acquisition on the target content to obtain the original content image; or, it can directly acquire the target content through an image device to obtain the original content image corresponding to the target content; or, it can also receive a content detection request, which carries the storage address of the target content or the original content image of the target content, and obtain the original content image corresponding to the target content based on the storage address, and so on.

[0071] After acquiring the original content image, the processor 620 can perform face alignment on the original content image to obtain the target content image. There are several ways to perform face alignment on the original image. For example, the processor 620 can perform face detection on the original content image. When the face detection of the original content image passes, it can extract at least one initial key point corresponding to the face region in the original content image, and based on the position information of the at least one initial key point, align the face region in the original content image with a preset face template to obtain the target content image.

[0072] There are several ways to perform face detection on the original content image. For example, the processor 620 can detect whether the original content image contains a facial region. If it does, the face detection of the original content image can be determined to have passed. If it does not, the face detection of the original content image can be determined to have failed. In that case, the processor 620 returns to the step of obtaining the original content image corresponding to the target content and repeats it until the face detection of the original content image passes.

[0073] When face detection passes in the original content image, the processor 620 can extract at least one initial keypoint corresponding to the facial region in the original content image. Then, based on the position information of the at least one initial keypoint, the facial region in the original content image is aligned with a preset facial template to obtain the target content image. The preset facial template can be a pre-defined template with labeled facial regions. There are several ways to align the facial region in the original content image with the preset facial template. For example, the processor 620 can obtain the standard position information of the corresponding standard keypoints in the preset facial template, and perform an affine transformation on at least one initial keypoint based on the position information and the corresponding standard position information to complete the alignment operation and obtain the target content image.
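
The alignment step above can be illustrated with a minimal sketch that estimates an affine transform mapping the initial key points onto the standard key points of the preset face template by least squares. The function names, the pure-Python solver, and the sample coordinates are assumptions made for illustration only; the specification does not prescribe how the affine transform is estimated, and warping the image pixels with the resulting transform is not shown.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def solve3(m, v):
    """Solve the 3x3 linear system m * p = v with Cramer's rule."""
    d = det3(m)
    if abs(d) < 1e-12:
        raise ValueError("key points are collinear; affine transform is undefined")
    solution = []
    for col in range(3):
        replaced = [row[:] for row in m]
        for r in range(3):
            replaced[r][col] = v[r]
        solution.append(det3(replaced) / d)
    return solution


def estimate_affine(initial_keypoints, standard_keypoints):
    """Least-squares affine transform [[a, b, tx], [c, d, ty]] (hypothetical helper).

    Solves the normal equations (X^T X) p = X^T y, where each row of X is [x, y, 1].
    """
    rows = [[x, y, 1.0] for x, y in initial_keypoints]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    params = []
    for k in range(2):  # k = 0 solves for the x' row, k = 1 for the y' row
        xty = [sum(r[i] * t[k] for r, t in zip(rows, standard_keypoints))
               for i in range(3)]
        params.append(solve3(xtx, xty))
    return params


def apply_affine(params, point):
    """Map one (x, y) point through the estimated transform."""
    x, y = point
    return tuple(a * x + b * y + t for a, b, t in params)


# Made-up coordinates: the template is the input scaled by 2 and shifted by (10, 20).
initial = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
standard = [(10.0, 20.0), (12.0, 20.0), (10.0, 22.0), (12.0, 22.0)]
affine = estimate_affine(initial, standard)
aligned = [apply_affine(affine, p) for p in initial]
```

With at least three non-collinear key points the transform is fully determined; additional key points are averaged in the least-squares sense.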

[0074] The size of the target content image after alignment can be a preset size, or it can be the same as the size of the preset face template, such as 256*256, or any other size, etc.

[0075] S112: Extract spatial images corresponding to multiple continuous subspaces from the target content image.

[0076] For example, the processor 620 can acquire at least one key point in the target content image, segment at least one region image corresponding to a facial part from the target content image based on the at least one key point to obtain at least one facial part image, take the region image other than the facial part image in the target content image as a facial contour image, and take the facial part image and the facial contour image as the spatial images corresponding to the multiple continuous subspaces.

[0077] It should be noted that a subspace can be equivalent to an image region in the target content image. Multiple consecutive subspaces can represent multiple image regions with interconnected relationships. For example, taking the face as an example, at least one facial feature image can include at least one region image corresponding to the ears (left / right), eyes (left / right), nose, mouth, etc. There are various ways to segment at least one facial feature region image from the target content image. For example, taking the nose as an example, the processor 620 can select at least one target key point corresponding to the nose from at least one key point. Based on at least one target key point, the nose region corresponding to the nose is identified in the target content image, and the image of the nose region is segmented to obtain the facial feature image corresponding to the nose. The segmentation method for other facial features is similar to obtain at least one facial feature image. The facial contour image can be the region image corresponding to the contour region in the target content image, and the contour region can be the region in the target content image other than the facial feature image. Therefore, the spatial image corresponding to multiple consecutive subspaces can include at least one facial feature image and a facial contour image.
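
As a rough illustration of step S112, the sketch below crops one rectangular region per facial part from its key points and treats everything outside those regions as the facial contour image. The rectangular bounding box, the `margin` parameter, the zero-filling of removed regions, and all names are assumptions for illustration; the specification does not fix the shape of a key point region.

```python
def bounding_box(points, margin, width, height):
    """Axis-aligned box around (x, y) key points, clipped to the image bounds."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0 = max(min(xs) - margin, 0)
    y0 = max(min(ys) - margin, 0)
    x1 = min(max(xs) + margin + 1, width)
    y1 = min(max(ys) + margin + 1, height)
    return x0, y0, x1, y1


def split_subspaces(image, keypoints_by_part, margin=2):
    """Return ({part: cropped image}, contour image) for a 2-D list image.

    The contour image is a copy of the input with every facial part region
    set to 0, so it keeps only the area outside the key point regions.
    """
    height, width = len(image), len(image[0])
    contour = [row[:] for row in image]
    part_images = {}
    for part, points in keypoints_by_part.items():
        x0, y0, x1, y1 = bounding_box(points, margin, width, height)
        part_images[part] = [row[x0:x1] for row in image[y0:y1]]
        for y in range(y0, y1):
            for x in range(x0, x1):
                contour[y][x] = 0
    return part_images, contour


# Toy 8x8 single-channel image with hypothetical nose key points.
image = [[1] * 8 for _ in range(8)]
part_images, contour = split_subspaces(image, {"nose": [(3, 3), (4, 4)]}, margin=1)
```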

[0078] S120: Based on the connection relationship of multiple continuous subspaces, feature extraction is performed on the spatial image to obtain the target fusion feature and the spatial features of each subspace in multiple continuous subspaces.

[0079] The target fusion feature is the feature obtained by fusing the spatial features.

[0080] The connection relationships among the multiple continuous subspaces can include the connection relationships between facial parts and the connection relationships between facial parts and the facial contour. For example, when the facial parts are the five facial features, the connection relationships can include the connection between the left and right eyes, the connections between the left and right eyes and the nose, the connections between the left and right eyes and the left and right ears, respectively, the connection between the nose and the mouth, the connections between all facial feature regions and the contour region, and so on.

[0081] There are various ways to extract features from the spatial images based on the connection relationships of the multiple continuous subspaces, as follows:

[0082] For example, the processor 620 can determine the connection relationship between multiple consecutive subspaces based on the spatial type of each subspace, and based on the connection relationship, use a content detection model to extract features from the spatial image to obtain target fusion features and spatial features of each subspace in multiple consecutive subspaces.

[0083] There are several ways to determine the connection relationships between multiple continuous subspaces. For example, the processor 620 can obtain the space type of each subspace in the multiple continuous subspaces, select the connection relationship corresponding to the space type from the preset connection relationship set, and thus determine the connection relationship between each subspace. For example, taking the facial features as an example, the connection relationship can include the left and right eyes being connected, the left and right eyes being connected to the nose, the left and right eyes being connected to the left and right ears respectively; the nose being connected to the mouth; all facial feature areas being connected to the contour area, and so on.
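
The lookup described above can be sketched as a preset connection set keyed by space type. The dictionary contents mirror the facial example in the text; the type names, the `CONTOUR` constant, and the function name are hypothetical.

```python
# Preset connection relationship set (hypothetical names, following the example above).
PRESET_CONNECTIONS = {
    "left_eye": {"right_eye", "nose", "left_ear"},
    "right_eye": {"left_eye", "nose", "right_ear"},
    "nose": {"left_eye", "right_eye", "mouth"},
    "mouth": {"nose"},
    "left_ear": {"left_eye"},
    "right_ear": {"right_eye"},
}
CONTOUR = "contour"


def connection_relationships(space_types):
    """Select the undirected connections that apply to the subspaces present."""
    present = set(space_types)
    edges = set()
    for space in present:
        if space == CONTOUR:
            continue
        for neighbour in PRESET_CONNECTIONS.get(space, ()):
            if neighbour in present:
                edges.add(frozenset((space, neighbour)))
        # Every facial feature region is connected to the contour region.
        if CONTOUR in present:
            edges.add(frozenset((space, CONTOUR)))
    return edges


edges = connection_relationships(["left_eye", "right_eye", "nose", "mouth", "contour"])
```

Storing each connection as a `frozenset` makes the relationship undirected, so "left eye to nose" and "nose to left eye" are counted once.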

[0084] After determining the connection relationships between multiple consecutive subspaces, the processor 620 can extract features from the spatial image using a content detection model based on these connections, thereby obtaining the target fusion feature and the spatial features of each subspace in the multiple consecutive subspaces. The content detection model includes a content detection network and a graph network. The content detection network is used for content detection, and the graph network is used to refine or update the extracted features. There are various ways to extract features from the spatial image using the content detection model. For example, the processor 620 can use a content detection network to extract features from the spatial image, obtain initial spatial features for each subspace, fuse the initial spatial features to obtain initial fusion features, construct a connection network graph based on the connection relationships, and then use a graph network to refine the initial spatial features and initial fusion features respectively, to obtain the target fusion feature and the spatial features of each subspace.

[0085] In this context, the spatial image can be viewed as a region image cropped or extracted from the target content image. Therefore, the size of the spatial image can be smaller than that of a traditional input image; for example, the size of the spatial image can be 64*64, only 1/4 of the traditional input. This reduces the computational cost of feature extraction by approximately two orders of magnitude, significantly improving the efficiency of feature extraction. The initial spatial features corresponding to the spatial image can be the unrefined spatial features extracted by the feature extraction subnetwork in the content detection network. There are several ways to fuse the initial spatial features. For example, the processor 620 can directly concatenate the initial spatial features of each subspace to obtain the initial fused features. Alternatively, it can obtain the spatial weight corresponding to each subspace, weight the initial spatial features based on the spatial weights, and then concatenate the weighted initial spatial features to obtain the initial fused features, and so on.
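
Both fusion options described above (plain concatenation and weighted concatenation) can be sketched in a few lines. The feature vectors, the weights, and the function name are made up for illustration.

```python
def fuse_features(spatial_features, spatial_weights=None):
    """Concatenate per-subspace feature vectors, optionally weighting each one.

    spatial_features: {subspace name: list of floats}
    spatial_weights:  {subspace name: float}, or None for plain concatenation
    """
    fused = []
    for name, feature in spatial_features.items():
        weight = 1.0 if spatial_weights is None else spatial_weights[name]
        fused.extend(weight * value for value in feature)
    return fused


initial_spatial = {"nose": [1.0, 2.0], "mouth": [3.0]}
plain = fuse_features(initial_spatial)
weighted = fuse_features(initial_spatial, {"nose": 0.5, "mouth": 2.0})
```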

[0086] After extracting the initial spatial features and initial fused features, the processor 620 can construct a connection network graph based on the connection relationships, using the initial spatial features and initial fused features as nodes respectively. The connection network graph represents the relationship between the initial spatial features and the initial fused features. There are several ways to construct the connection network graph. For example, the processor 620 can determine the edge information between nodes corresponding to the initial spatial features based on the connection relationships, construct an initial connection network graph based on the edge information and the initial spatial features, then add nodes corresponding to the initial fused features to the initial connection network graph, and then connect the nodes corresponding to the initial fused features to all nodes corresponding to the initial spatial features to obtain the connection network graph. Alternatively, it can add target connection relationships between the initial fused features and all initial spatial features based on the connection relationships to obtain updated connection relationships, and then construct connection network graphs based on the updated connection relationships, using the initial fused features and initial spatial features as nodes respectively, and so on.
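
The first construction described above (build the graph over the spatial-feature nodes, then add the fused-feature node and connect it to every spatial-feature node) might look like the following sketch. The adjacency-dictionary representation and the node name `"fusion"` are assumptions.

```python
def build_connection_graph(subspaces, edges, fusion_node="fusion"):
    """Return an adjacency dict {node: set of neighbouring nodes}.

    subspaces: names of the nodes holding initial spatial features
    edges:     iterable of (a, b) pairs taken from the connection relationships
    """
    # Initial connection network graph over the spatial-feature nodes.
    adjacency = {name: set() for name in subspaces}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    # Add the fused-feature node and connect it to all spatial-feature nodes.
    adjacency[fusion_node] = set(subspaces)
    for name in subspaces:
        adjacency[name].add(fusion_node)
    return adjacency


graph = build_connection_graph(
    ["left_eye", "right_eye", "nose"],
    [("left_eye", "right_eye"), ("left_eye", "nose")],
)
```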

[0087] After constructing the connection network graph, the processor 620 can use the graph network to refine the initial spatial features and the initial fused features to obtain the target fused features and the spatial features of each subspace. The graph network can include a first node corresponding to the initial spatial features and a second node corresponding to the initial fused features. There are several ways to refine the initial spatial features and the initial fused features using the graph network. For example, the processor 620 can select the neighboring nodes corresponding to each node in the connection network graph, refine the initial spatial features based on the preset information transfer function corresponding to the graph network and the neighboring nodes corresponding to the first node to obtain the spatial features of each subspace, and refine the initial fused features based on the preset information transfer function and the neighboring nodes corresponding to the second node to obtain the target fused features.

[0088] The preset information transfer function corresponding to the graph network can be a function that transfers information between nodes in the graph network. The preset information transfer function can be viewed as a fully connected layer (FC layer). By inputting the initial spatial features corresponding to the first node and the features of the first node's neighboring nodes (other initial spatial features and / or initial fused features) into this FC layer, the purified spatial features can be obtained. The purification method for the initial fused features is similar to that for the initial spatial features, also using the preset information transfer function.
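
A minimal sketch of one refinement pass is given below, treating the preset information transfer function as a single FC layer. Several details are assumptions that the text does not specify: all node features are taken to share one dimension D (for example after a projection), neighbour features are aggregated by averaging, the FC layer sees the node's own feature concatenated with that average, no activation is applied, and the weights are hand-picked rather than learned.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def fc_layer(inputs, weight, bias):
    """Fully connected layer: weight (out x in) times inputs, plus bias."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weight, bias)]


def refine(features, adjacency, weight, bias):
    """One message-passing step over the connection network graph.

    Every node must have at least one neighbour, which holds here because
    each spatial-feature node is connected to the fused-feature node.
    """
    refined = {}
    for node, own in features.items():
        neighbours = [features[n] for n in sorted(adjacency[node])]
        message = mean_vector(neighbours)
        refined[node] = fc_layer(own + message, weight, bias)
    return refined


features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "fusion": [1.0, 1.0]}
adjacency = {"a": {"b", "fusion"}, "b": {"a", "fusion"}, "fusion": {"a", "b"}}
# Hand-picked weights that simply add the neighbour average to the node feature.
weight = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
bias = [0.0, 0.0]
refined = refine(features, adjacency, weight, bias)
```

In this toy setup the refined features of nodes `a` and `b` play the role of the spatial features of each subspace, and the refined feature of `fusion` plays the role of the target fusion feature.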

[0089] The training of the content detection model may include the following steps: the processor 620 can train a preset content detection network to obtain an initial content detection network, dynamically compress the initial content detection network to obtain a new content detection network, and train a preset graph network to obtain a graph network. The content detection network and the graph network are then used as the content detection model, specifically as follows:

[0090] (1) Training of the content detection network

[0091] The preset content detection network includes a feature extraction subnetwork, a feature fusion subnetwork, and a risk classification subnetwork. There are various ways to train the preset content detection network. For example, the processor 620 can acquire first spatial image samples for each subspace, and use the feature extraction subnetwork to extract features from the first spatial image samples to obtain first sample spatial features. Then, the feature fusion subnetwork can fuse the first sample spatial features to obtain first sample fused features. Finally, the risk classification subnetwork can determine the first predicted risk category corresponding to the first sample spatial features and the second predicted risk category corresponding to the first sample fused features. Based on the first and second predicted risk categories, the preset content detection network can be converged to obtain the initial content detection network.

[0092] There are several ways to obtain the first spatial image sample of each subspace. For example, the processor 620 can directly obtain the first spatial image sample of each subspace, or it can obtain the first content image sample and extract the first spatial image sample corresponding to each subspace from the first content image sample, and so on.

[0093] The predicted risk category can be the risk category corresponding to the first spatial image sample predicted by the risk classification subnetwork. The risk category can include risky content and normal content. Risky content can be fake or forged content generated by deepfake technology (models). There are multiple ways to converge the preset content detection network based on the first and second predicted risk categories. For example, the processor 620 can obtain the first labeled risk category of the first spatial image sample, compare the first labeled risk category with the first predicted risk category to obtain first independent classification loss information, compare the first labeled risk category with the second predicted risk category to obtain first fused classification loss information, fuse the first independent classification loss information and the first fused classification loss information, and converge the preset content detection network based on the fused target classification loss information to obtain the initial content detection network.

[0094] The first labeled risk category can be a risk category labeled in the first spatial image sample, or it can be a risk category labeled in the first content image sample corresponding to the first spatial image sample. The first independent classification loss information can be the classification loss generated by risk classification of the first sample space features corresponding to each subspace. There are multiple ways to compare the first labeled risk category with the first predicted risk category. For example, the processor 620 can use the cross-entropy loss function to compare the first labeled risk category with the first predicted risk category to obtain the initial independent classification loss information corresponding to each subspace, and then accumulate the initial independent classification loss information to obtain the first independent classification loss information. Alternatively, other types of comparison functions can be used to compare the first labeled risk category with the first predicted risk category to obtain the initial independent classification loss information corresponding to each subspace, and then accumulate the initial independent classification loss information to obtain the first independent classification loss information, and so on.

[0095] The first fused classification loss information can be the classification loss generated when the first sample fused features are used for risk classification. The method of comparing the first labeled risk category with the second predicted risk category is similar to the method of comparing the first labeled risk category with the first predicted risk category, as described above, and will not be repeated here.

[0096] After determining the first independent classification loss information and the first fused classification loss information, the processor 620 can fuse the first independent classification loss information and the first fused classification loss information. There are various ways to fuse them. For example, the processor 620 can directly add the first independent classification loss information and the first fused classification loss information to obtain the target classification loss information, as shown in formula (1):

[0097] $\mathrm{Loss}_{\text{total1}} = \mathrm{Loss}_{\text{subspace-cls1}} + \mathrm{Loss}_{\text{fusion-cls1}} \qquad (1)$

[0098] Here, $\mathrm{Loss}_{\text{total1}}$ is the target classification loss information, $\mathrm{Loss}_{\text{subspace-cls1}}$ is the first independent classification loss information, and $\mathrm{Loss}_{\text{fusion-cls1}}$ is the first fused classification loss information.

[0099] In some embodiments, the processor 620 may also obtain a first classification weight, and weight the first independent classification loss information and the first fused classification loss information based on the first classification weight, and add the weighted first independent classification loss information and the weighted first fused classification loss information to obtain the target classification loss information.
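
The loss computation in formula (1) and its weighted variant can be sketched as follows, using cross-entropy on predicted class probabilities. The two separate weights `w_subspace` and `w_fusion`, the class indexing (0 for normal content, 1 for risky content), and the sample probabilities are assumptions for illustration; with both weights equal to 1 the function reduces to formula (1).

```python
import math


def cross_entropy(probabilities, label):
    """Cross-entropy of one prediction against an integer class label."""
    return -math.log(max(probabilities[label], 1e-12))


def target_classification_loss(subspace_probs, fusion_probs, label,
                               w_subspace=1.0, w_fusion=1.0):
    """Weighted sum of the independent and fused classification losses."""
    # Independent loss: accumulate the per-subspace classification losses.
    independent = sum(cross_entropy(p, label) for p in subspace_probs.values())
    # Fused loss: classification loss of the fused feature's prediction.
    fused = cross_entropy(fusion_probs, label)
    return w_subspace * independent + w_fusion * fused


# Hypothetical predictions; index 0 = normal content, index 1 = risky content.
subspace_probs = {"nose": [0.5, 0.5], "mouth": [0.25, 0.75]}
fusion_probs = [0.5, 0.5]
loss_total = target_classification_loss(subspace_probs, fusion_probs, label=1)
```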

[0100] After fusing the first independent classification loss information and the first fused classification loss information, the processor 620 can converge the preset content detection network based on the fused target classification loss information. There are several ways to converge the preset content detection network. For example, the processor 620 can update the network parameters of the preset content detection network based on the target classification loss information using a gradient descent algorithm, and then return to the step of obtaining the first spatial image sample of each subspace until the preset content detection network converges, thus obtaining the trained initial content detection network. Alternatively, it can update the network parameters of the preset content detection network based on the target classification loss information using other network parameter update algorithms, and then return to the step of obtaining the first spatial image sample of each subspace until the preset content detection network converges, thus obtaining the trained initial content detection network.

[0101] It should be noted that, in the process of updating the network parameters of the preset content detection network based on the target classification loss information, the target classification loss information can be used for global updates, or the first independent classification loss information and the first fusion classification loss information contained in the target classification loss information can be used to update the corresponding network parameters respectively.

[0102] After training the preset content detection network to obtain the initial content detection network, the processor 620 can dynamically compress the initial content detection network to obtain the final content detection network. There are several ways to dynamically compress the initial content detection network. For example, the processor 620 can add a compression layer after each layer of the initial content detection network to obtain a candidate content detection network, where the compression layer includes a batch normalization layer (BN layer) and a target convolutional layer of a preset size. The processor 620 can then acquire second spatial image samples for each subspace and train the candidate content detection network based on the second spatial image samples to obtain the trained current content detection network; acquire the feature channel weights of each batch normalization layer in the current content detection network and compress the current content detection network based on these feature channel weights; and finally linearly superimpose the target convolutional layer with the convolutional layers in the compressed content detection network to obtain the final content detection network.

[0103] The compression layer is a network layer that dynamically compresses the initial content detection network. The number of compression layers is the same as the number of layers in the initial content detection network, and the compression layer is located after each layer in the initial content detection network. Therefore, the structure of the candidate content detection network is that each layer has an additional compression layer compared to the initial content detection network. The target convolutional layer is a linear convolutional layer with a preset size, which can be set according to the actual application, such as 1*1 or any other arbitrary size. The batch normalization layer (BN layer) is a network layer that performs batch normalization. In this scheme, it is mainly used to determine the feature channel weights corresponding to each feature channel when each network layer extracts features.
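
One plausible reading of "linearly superimposing" the target convolutional layer is that, because it is linear, it can be folded into the convolutional layer it follows so that the final network carries no extra layer at inference time. The sketch below shows this for the simplest case, assuming both layers are 1*1 convolutions (which act as per-pixel linear maps over channels) and omitting the BN layer between them; the function names and numbers are hypothetical, and the specification does not state the exact merging procedure.

```python
def matvec(matrix, vector):
    """Matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Matrix-matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def pointwise_conv(weight, bias, pixel):
    """Apply a 1*1 convolution to the channel vector of a single pixel."""
    return [s + b for s, b in zip(matvec(weight, pixel), bias)]


def merge_pointwise_convs(w1, b1, w2, b2):
    """Fold conv2(conv1(x)) into a single 1*1 convolution.

    conv2(conv1(x)) = w2 (w1 x + b1) + b2 = (w2 w1) x + (w2 b1 + b2)
    """
    merged_weight = matmul(w2, w1)
    merged_bias = [x + y for x, y in zip(matvec(w2, b1), b2)]
    return merged_weight, merged_bias


w1, b1 = [[1.0, 2.0], [3.0, 4.0]], [1.0, 0.0]   # existing convolutional layer
w2, b2 = [[1.0, 0.0], [1.0, 1.0]], [0.0, 5.0]   # target (1*1, linear) convolutional layer
pixel = [1.0, 1.0]

sequential = pointwise_conv(w2, b2, pointwise_conv(w1, b1, pixel))
merged_w, merged_b = merge_pointwise_convs(w1, b1, w2, b2)
merged = pointwise_conv(merged_w, merged_b, pixel)
```

The merged layer produces the same output as the two layers applied in sequence.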

[0104] The method for obtaining the second spatial image sample of each subspace is similar to the method for obtaining the first spatial image sample, as detailed above, and will not be repeated here.

[0105] After acquiring the second spatial image samples, the processor 620 can train the candidate content detection network based on the second spatial image samples. There are several ways to train the candidate content detection network based on the second spatial image samples. For example, the processor 620 can use the candidate content detection network to extract features from the second spatial image samples, obtain the second sample spatial features and the initial feature channel weights of the batch normalization layer corresponding to the second sample spatial features, fuse the second sample spatial features, and determine the target compression loss information of the second spatial image samples based on the fused second sample features, the second sample spatial features, and the initial feature channel weights. Then, based on the target compression loss information, the processor 620 can converge the compression layer in the candidate content detection network to obtain the current content detection network.

[0106] The initial feature channel weights can be the weights corresponding to each feature channel during feature extraction by the feature extraction subnetwork in the candidate content detection network. These initial feature channel weights are output through the batch normalization layer (BN layer). The process of feature extraction for the second spatial image samples is similar to that for the first spatial image samples, as detailed above; the only difference is that, during feature extraction, the initial feature channel weights are output through the BN layer in addition to the second sample spatial features. Further details are not repeated here.

[0107] After extracting the second sample spatial features, the processor 620 can fuse the second sample spatial features to obtain the second sample fused features. The process of fusing the second sample spatial features is similar to the process of fusing the first sample spatial features, as detailed above, and will not be repeated here.

[0108] After fusing the second sample spatial features, the processor 620 can determine the target compression loss information of the second spatial image samples based on the second sample fused features, the second sample spatial features, and the initial feature channel weights. The target compression loss information can be the loss information generated during the dynamic compression of the initial content detection network. There are several ways to determine the target compression loss information. For example, the processor 620 can determine the second fused classification loss information based on the second sample fused features, determine the second independent classification loss information based on the second sample spatial features, determine the weight sparsity loss information based on the initial feature channel weights, and fuse the second fused classification loss information, the second independent classification loss information, and the weight sparsity loss information to obtain the target compression loss information.

[0109] The method for determining the second fusion classification loss information is similar to the method for determining the first fusion classification loss information, and the method for determining the second independent classification loss information is similar to the method for determining the first independent classification loss information, as detailed above, and will not be repeated here.

[0110] The weight sparsity loss information is loss information characterizing the sparsity of the initial feature channel weights. Its constraint condition is that the initial feature channel weights corresponding to a certain number of feature channels are less than a preset weight threshold. During the training of the candidate content detection network, the weight sparsity loss information is used to make the number of target feature channels whose weights are less than the preset weight threshold as large as possible while ensuring classification accuracy. For example, taking a preset weight threshold of 0, a feature channel whose weight is 0 means that the parameters corresponding to that feature channel are not calculated during feature extraction or processing, which reduces the number of parameters that need to be calculated and thereby compresses the initial content detection model. Moreover, for different samples, the number or type of parameters corresponding to the uncalculated feature channels also differs, thus achieving dynamic compression of the initial content detection model. There are multiple ways to determine the weight sparsity loss information. For example, the processor 620 can select a target number of initial feature channel weights that are less than the preset weight threshold and determine the weight sparsity loss information based on the target number.
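As a minimal sketch of the count-based variant described above, the following Python function turns the number of channel weights below the threshold into a loss value. The function names, the use of absolute values, the small positive default threshold, and the "1 minus fraction" form are all assumptions made for illustration; the specification does not fix a formula. Because a raw count is not differentiable, the second function shows the L1 norm of the channel weights, a commonly used differentiable surrogate that is swapped in here and is not stated in the specification.

```python
def weight_sparsity_loss(channel_weights, threshold=1e-3):
    """Hypothetical count-based sparsity loss.

    Returns the fraction of channels whose absolute weight is NOT below
    the threshold, so the loss falls as more channels become prunable.
    """
    if not channel_weights:
        raise ValueError("channel_weights must not be empty")
    below = sum(1 for w in channel_weights if abs(w) < threshold)
    return 1.0 - below / len(channel_weights)


def l1_surrogate(channel_weights):
    """Differentiable stand-in (assumption): L1 norm of the channel weights."""
    return sum(abs(w) for w in channel_weights)


# Two of the four channels are below the threshold.
print(weight_sparsity_loss([0.0, 0.5, 0.0002, 0.9], threshold=1e-3))
```

Minimizing either quantity together with the classification losses pushes more channel weights toward zero, which is the behavior the constraint condition asks for.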

[0111] After determining the second fusion classification loss information, the second independent classification loss information, and the weight sparsity loss information, the processor 620 can fuse them. There are various fusion methods. For example, the processor 620 can directly sum the second fusion classification loss information, the second independent classification loss information, and the weight sparsity loss information to obtain the target compression loss information, as shown in formula (2).

[0112] $\mathrm{Loss}_{\text{total2}} = \mathrm{Loss}_{\text{subspace-cls2}} + \mathrm{Loss}_{\text{fusion-cls}} + \mathrm{Loss}_{\text{sparse}} \quad (2)$

[0113] Here, $\mathrm{Loss}_{\text{total2}}$ is the target compression loss information, $\mathrm{Loss}_{\text{subspace-cls2}}$ is the second independent classification loss information, $\mathrm{Loss}_{\text{fusion-cls}}$ is the second fusion classification loss information, and $\mathrm{Loss}_{\text{sparse}}$ is the weight sparsity loss information.

[0114] In some embodiments, the processor 620 may also obtain compression weights, weight the second fusion classification loss information, the second independent classification loss information, and the weight sparsity loss information respectively based on the compression weights, and accumulate the three weighted loss terms to obtain the target compression loss information.
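The direct sum of formula (2) and the weighted variant can be sketched in one small Python function. The function name, argument names, and the tuple of compression weights are hypothetical; with the default weights of 1.0 the function reduces to the plain sum of formula (2).

```python
def target_compression_loss(loss_subspace_cls, loss_fusion_cls, loss_sparse,
                            compression_weights=(1.0, 1.0, 1.0)):
    """Combine the three loss terms into the target compression loss.

    compression_weights is a hypothetical (independent, fusion, sparsity)
    triple; (1.0, 1.0, 1.0) gives the unweighted sum of formula (2).
    """
    w_subspace, w_fusion, w_sparse = compression_weights
    return (w_subspace * loss_subspace_cls
            + w_fusion * loss_fusion_cls
            + w_sparse * loss_sparse)


# Plain sum, then a variant that halves the influence of the sparsity term.
print(target_compression_loss(0.5, 0.25, 0.25))
print(target_compression_loss(0.5, 0.25, 0.25, compression_weights=(1.0, 1.0, 0.5)))
```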

[0115] After determining the target compression loss information, the processor 620 can converge the compression layer in the candidate content detection network based on the target compression loss information. There are several convergence methods. For example, the processor 620 can use a gradient descent algorithm to update the network parameters of the compression layer in the candidate content detection network based on the target compression loss information, obtaining an updated candidate content detection network, and then return to the step of acquiring the second spatial image samples of each subspace until the compression layer in the candidate content detection network converges, thus obtaining the trained current content detection network. Alternatively, the processor 620 can use another network parameter update algorithm in the same loop, and so on.
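A single gradient descent step that touches only the compression layer can be sketched as follows. The parameter names, the flat dictionary layout, and the learning rate are assumptions for illustration; a real network would hold tensors and would obtain the gradients by backpropagating the target compression loss.

```python
def sgd_step(params, grads, trainable, lr=0.01):
    """Return updated parameters, changing only the names listed in `trainable`.

    params and grads map hypothetical parameter names to scalar values.
    Parameters outside `trainable` (the original network) stay frozen.
    """
    return {
        name: (value - lr * grads[name]) if name in trainable else value
        for name, value in params.items()
    }


params = {"backbone.w": 1.0, "compress.w": 2.0}
grads = {"backbone.w": 0.5, "compress.w": 0.5}
# Only the compression layer parameter moves; the backbone is untouched.
print(sgd_step(params, grads, trainable={"compress.w"}, lr=0.5))
```

Repeating this step over successive batches of second spatial image samples until the loss stops improving corresponds to the convergence loop described above.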

[0116] After training the candidate content detection network, the processor 620 can obtain the feature channel weights of each batch normalization layer in the trained current content detection network and then compress the current content detection network based on the feature channel weights to obtain a compressed content detection network. There are several ways to compress the current content detection network. For example, the processor 620 can select at least one feature channel weight that is less than a preset weight threshold to obtain the target feature channel weight, identify the target feature channel corresponding to the target feature channel weight in each network layer of the current content detection network, and prune the target feature channel to obtain the compressed content detection network.

[0117] Pruning the target feature channel involves trimming the parameters corresponding to that channel. This eliminates the need to calculate these parameters during feature extraction, thus dynamically compressing the content detection network and resulting in a compressed content detection network. It's important to note that this pruning does not modify the network structure; rather, it suppresses the calculation of the parameters for that channel. Therefore, it does not disrupt the model's structure or the trained network parameters, and consequently, it does not affect the performance of the original model (the initial content detection model).
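The idea of pruning by suppressing computation, rather than by changing the network structure, can be sketched in Python as below. The function names and the nested-list feature layout are hypothetical, and the per-channel scaling stands in for whatever computation the real layer performs on each channel.

```python
def select_pruned_channels(bn_weights, threshold):
    """Indices of channels whose absolute BN weight is below the threshold."""
    return [i for i, w in enumerate(bn_weights) if abs(w) < threshold]


def masked_channel_scale(features, bn_weights, threshold):
    """Scale each channel by its BN weight, skipping pruned channels.

    features: one list of values per channel (hypothetical layout).
    Pruned channels keep their slot in the output (structure unchanged)
    but their values are never computed and are emitted as zeros.
    """
    pruned = set(select_pruned_channels(bn_weights, threshold))
    output = []
    for index, (channel, weight) in enumerate(zip(features, bn_weights)):
        if index in pruned:
            output.append([0.0] * len(channel))  # computation suppressed
        else:
            output.append([weight * value for value in channel])
    return output


features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(masked_channel_scale(features, bn_weights=[0.5, 0.0, 2.0], threshold=0.01))
```

Because the channel count and the stored parameters are left as they are, the mask can be removed or recomputed without retraining, which matches the statement that the original model is not disturbed.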

[0118] After compressing the current content detection network, the processor 620 can linearly superimpose the target convolutional layer with the convolutional layer in the compressed content detection network to obtain the content detection network.
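One way to read the linear superposition above is as the composition of two linear maps: if the preceding layer computes W1·x + b1 and the linear compression layer computes W2·y + b2 with no nonlinearity in between, the pair equals a single layer with weight W2·W1 and bias W2·b1 + b2. The sketch below assumes exactly that, treats both layers as per-pixel channel-mixing matrices, and uses hypothetical names; a batch normalization layer with fixed statistics is also a per-channel affine map and could be folded in the same way, which is not shown here.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(a, v):
    """Multiply a matrix by a vector."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]


def merge_linear_layers(w_prev, b_prev, w_comp, b_comp):
    """Fold a linear compression layer into the layer that precedes it.

    Assumes no nonlinearity sits between the two layers.
    """
    w_merged = matmul(w_comp, w_prev)
    b_merged = [wb + b for wb, b in zip(matvec(w_comp, b_prev), b_comp)]
    return w_merged, b_merged


w_prev, b_prev = [[1.0, 2.0], [3.0, 4.0]], [1.0, 0.0]
w_comp, b_comp = [[1.0, 0.0], [1.0, 1.0]], [0.0, 1.0]
w, b = merge_linear_layers(w_prev, b_prev, w_comp, b_comp)

x = [1.0, 1.0]
two_step = [y + c for y, c in zip(
    matvec(w_comp, [p + q for p, q in zip(matvec(w_prev, x), b_prev)]), b_comp)]
one_step = [y + c for y, c in zip(matvec(w, x), b)]
print(two_step == one_step)
```

After merging, the extra layer adds no inference cost, since only the merged weight and bias need to be stored and evaluated.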

[0119] (2) Training of graph networks

[0120] For example, the processor 620 can acquire a connected network graph sample, which includes nodes corresponding to the third sample spatial features and nodes corresponding to the third sample fusion features. The third sample fusion features are features obtained by fusing the third sample spatial features. Based on the connected network graph sample, a preset graph network is used to refine the third sample spatial features and the third sample fusion features respectively to obtain the updated spatial features corresponding to the third sample spatial features and the updated fusion features corresponding to the third sample fusion features. Based on the updated spatial features and the updated fusion features, the preset graph network is converged to obtain the trained graph network.

[0121] There are several ways to obtain connection network graph samples. For example, the processor 620 can directly obtain connection network graph samples, or it can obtain second content image samples and construct connection network graph samples based on the second content image samples, and so on.

[0122] After acquiring the connected network graph samples, the processor 620 can use the preset graph network to refine the third sample spatial features and the third sample fusion features respectively. The refinement method is described above and will not be repeated here.

[0123] After refining the third sample spatial features and the third sample fusion features, the processor 620 can converge the preset graph network based on the refined updated spatial features and updated fusion features, thereby obtaining the trained graph network. There are several ways to converge the preset graph network. For example, the processor 620 can obtain the second labeled risk category corresponding to the connected network graph, determine the third predicted risk category of the connected network graph samples based on the updated spatial features, and compare the second labeled risk category with the third predicted risk category to obtain the third independent classification loss information. Based on the updated fusion features, it can determine the fourth predicted risk category of the connected network graph samples, compare the second labeled risk category with the fourth predicted risk category to obtain the third fusion classification loss information, and fuse the third independent classification loss information and the third fusion classification loss information. Finally, it can converge the preset graph network based on the fused graph network loss information to obtain the trained graph network.

[0124] The method for determining the third independent classification loss information is similar to that for determining the first independent classification loss information, and the method for determining the third fused classification loss information is similar to that for determining the first fused classification loss information, as detailed above. They will not be repeated here.

[0125] After obtaining the third independent classification loss information and the third fused classification loss information, the processor 620 can fuse the third independent classification loss information and the third fused classification loss information to obtain the graph network loss information. The fusion method is similar to the method of fusing the first independent classification loss information and the first fused classification loss information, as shown in formula (3):

[0126] $\mathrm{Loss}_{\text{total3}} = \mathrm{Loss}_{\text{subspace-cls3}} + \mathrm{Loss}_{\text{fusion-cls3}} \quad (3)$

[0127] Here, $\mathrm{Loss}_{\text{total3}}$ is the graph network loss information, $\mathrm{Loss}_{\text{subspace-cls3}}$ is the third independent classification loss information, and $\mathrm{Loss}_{\text{fusion-cls3}}$ is the third fusion classification loss information.
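A minimal Python sketch of formula (3) follows. It assumes that "comparing" the labeled and predicted risk categories means binary cross-entropy and that the independent loss is averaged over the subspaces; the specification fixes neither choice, and all names are hypothetical.

```python
import math


def binary_cross_entropy(label, prob, eps=1e-12):
    """Binary cross-entropy for one prediction; label is 0 (normal) or 1 (risky)."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def graph_network_loss(label, subspace_probs, fusion_prob):
    """Sum of the independent and fusion classification losses, as in formula (3)."""
    if not subspace_probs:
        raise ValueError("subspace_probs must not be empty")
    loss_subspace = sum(binary_cross_entropy(label, p)
                        for p in subspace_probs) / len(subspace_probs)
    loss_fusion = binary_cross_entropy(label, fusion_prob)
    return loss_subspace + loss_fusion


# A risky sample for which every head is maximally uncertain.
print(graph_network_loss(1, [0.5, 0.5], 0.5))
```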

[0128] After obtaining the graph network loss information, the processor 620 can converge the preset graph network based on the graph network loss information to obtain the trained graph network. The method of converging the preset graph network is similar to the method of converging the preset content detection network, as described above, and will not be repeated here.

[0129] After training the preset content detection network and the preset graph network, the processor 620 can use the trained content detection network and graph network as a content detection model.

[0130] It should be noted that the content detection model can be trained on the device side (terminal or client) or on the cloud side (server / remote server). The content detection model is then deployed on the client or terminal after training.

[0131] S130: Based on target fusion features and spatial features, determine the risk detection results of the target content and output the risk detection results.

[0132] The risk detection result can be either risky content or normal content. Risky content can be fake or forged content generated by deepfake models (algorithms) or other content forgery algorithms, for example, face-swapped image or video content, or fake image or video content, and so on.

[0133] There are several ways to determine the risk detection results of target content based on target fusion features and spatial features, as follows:

[0134] For example, the processor 620 can determine the fusion attack probability of the target content based on the target fusion features, determine the independent attack probability of the target content in each subspace based on the spatial features, and determine the risk detection result of the target content based on the fusion attack probability and the independent attack probability.

[0135] The fusion attack probability can be defined as the probability that the target content is classified as risky content after risk classification based on the target fusion features. There are various ways to determine the fusion attack probability of target content based on the target fusion features. For example, the processor 620 can use the risk classification subnetwork in the content detection network of the content detection model to classify the target fusion features for risk, thereby obtaining the fusion attack probability of the target content.

[0136] The independent attack probability can be defined as the probability that the target content is classified as risky content after risk classification based on spatial features. The method for determining the independent attack probability of the target content in each subspace is similar to the method for determining the fused attack probability, as detailed above, and will not be repeated here.

[0137] After determining the fusion attack probability and the independent attack probabilities, the processor 620 can determine the risk detection result of the target content based on these probabilities. There are several ways to determine the risk detection result of the target content. For example, the processor 620 can determine the average probability of the independent attack probabilities, and determine the target content as risky content when the fusion attack probability is greater than a preset first probability threshold or when the average probability is greater than a preset second probability threshold; or, when the fusion attack probability is less than the preset first probability threshold and the average probability is less than the preset second probability threshold, determine the target content as normal content.
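The decision rule above can be sketched in Python as follows. The function name, the string labels, and the default threshold values of 0.5 are assumptions; the text does not say how a probability exactly equal to a threshold is handled, and this sketch treats that case as normal content.

```python
def detect_risk(fusion_prob, subspace_probs,
                first_threshold=0.5, second_threshold=0.5):
    """Classify target content as "risky" or "normal".

    fusion_prob: fusion attack probability.
    subspace_probs: independent attack probabilities, one per subspace.
    Threshold defaults are placeholders, not values from the specification.
    """
    if not subspace_probs:
        raise ValueError("subspace_probs must not be empty")
    average_prob = sum(subspace_probs) / len(subspace_probs)
    if fusion_prob > first_threshold or average_prob > second_threshold:
        return "risky"
    return "normal"


print(detect_risk(0.9, [0.1] * 6))                        # fusion head fires
print(detect_risk(0.2, [0.1, 0.2, 0.3, 0.1, 0.2, 0.3]))   # neither head fires
```

Using an OR of the two conditions means that either the fused view or the per-subspace view alone is enough to flag the content.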

[0138] After determining the risk detection result of the target content, the processor 620 can output the risk detection result of the target content. There are several ways to output the risk detection result of the target content. For example, the processor 620 can directly send the risk detection result to at least one of the client 200, terminal, or server corresponding to the target user 100, so that the client 200, terminal, or server can respond to the target content or the business request corresponding to the target content based on the risk detection result. Alternatively, the risk detection result can be directly visualized, and so on.

[0139] There are various ways to visualize risk detection results. For example, the processor 620 can directly display the risk detection result, or it can display the risk detection result through sound and light (for example, by broadcasting the risk detection result by voice, or by displaying different types of risk detection results by displaying different colored lights, or by displaying the risk detection result through sound and light linkage), or it can display the risk detection result for specific types of risk detection results (for example, only displaying risk detection results for risky content, or only displaying risk detection results for normal content, etc.), and so on.

[0140] In some embodiments, after determining or outputting the risk detection result of the target content, the processor 620 may respond to the target content or the business request corresponding to the target content based on the risk detection result. There may be various ways to respond. For example, the processor 620 may directly intercept the target content or the business request corresponding to the target content. Alternatively, the processor 620 may directly perform secondary or multiple verifications on the target content and, based on the secondary verification results, provide a final response to the target content or the business request corresponding to the target content, and so on.

[0141] In the scenario of target content detection for face images or videos, this solution can employ an efficient edge-side content detection method based on continuous subspaces to detect deepfakes. The overall content detection process can be as shown in Figure 4 and can include four parts: subspace definition and acquisition, training of the content detection model based on continuous subspace learning, model compression based on dynamic compression, and risky content detection. Specifically, it can be as follows:

[0142] (1) Definition and Acquisition of Subspaces: Traditional deepfake detection often treats the entire face image as a whole, modeling all regions with uniform features. In reality, however, the probability of tampering varies across regions during deepfake generation, and the clues left after tampering also differ. Therefore, applying the same weight to feature extraction for all regions is inappropriate. This scheme adopts feature extraction and deepfake content detection based on continuous subspaces. The subspaces mainly include the facial feature region subspaces and the contour region subspace. Subspace connectivity graphs can also be obtained, as detailed above, and will not be elaborated further here.

[0143] (2) Training of Content Detection Model Based on Continuous Subspace Learning: Traditional deepfake detection uses high-resolution, complex feature extraction models for multiple facial regions to improve performance, but this leads to a linear increase in computation, making it unsuitable for on-device deployment and application. The content detection model in this solution is a low-resolution, multi-expert fusion model. This model improves efficiency by reducing the input resolution (each subspace only requires a lower resolution to extract high-performance features) and increasing model compactness. The training of the content detection model mainly includes training the content detection network and the graph network. The specific training process can be found above and will not be elaborated further here.

[0144] (3) Model Compression Based on Dynamic Compression: Traditional model compression usually uses static compression methods, in which the parameters of the original model are iterated continuously during the training phase, which may damage the performance of the original model. Therefore, this scheme introduces a dynamic model compression method. A linear convolutional layer (a 1×1 convolutional layer) and a batch normalization layer (BN layer) are introduced as compression layers after each network layer. Then, during the training phase, only the introduced compression layers are updated. Finally, the feature channels are pruned according to the feature channel weights of the BN layer, and the 1×1 convolutional layer is merged into the parameters of the previous layer to complete the model compression. The specific compression process can be found above and will not be repeated here.

[0145] (4) Risky Content Detection: The lightweight content detection network and graph network trained in the above three steps and deployed on the edge are used as the content detection model. The original content image of the target content is obtained and aligned, and the spatial image of each subspace is input into the feature extraction subnetwork and feature fusion subnetwork of the content detection network to obtain the initial spatial features of each subspace and the initial fused features. Then, the initial spatial features and initial fused features are input into the graph network to obtain the refined spatial features and target fusion features. The spatial features and target fusion features are input into the risk classification subnetwork of the content detection model to obtain the deepfake attack probabilities [p10, p11, p12, p13, p14, p15] of the subspaces (the attack probabilities corresponding to the facial feature region subspaces and the contour region subspace, respectively) and the fusion attack probability p2 corresponding to the fusion feature. The mean probability p1 = (p10 + p11 + p12 + p13 + p14 + p15) / 6 is then calculated. For preset thresholds T1 and T2, if p1 is greater than T1 or p2 is greater than T2, the target content is judged to be a deepfake attack; otherwise, it is judged to be a normal sample. After the risk detection result of the target content is determined, the risk detection result can be output, as shown in Figure 5.

[0146] This solution aligns the facial regions in the acquired face images or videos to a standard space and then segments the standard space into several representative continuous subspaces (where deepfake traces are typically concentrated). After the subspaces are defined, features are extracted for each subspace, and the relationships between the subspaces are learned to obtain a high-performance initial deepfake content detection model. This initial model is then dynamically compressed to obtain a lightweight content detection model. The lightweight model is deployed on the device for purely local deepfake detection, thereby improving the accuracy and efficiency of content detection.

[0147] In summary, the content detection method P100 and system 001 provided in this specification acquire the target content image and extract spatial images corresponding to multiple continuous subspaces from the target content image. These multiple continuous subspaces include contour regions and keypoint regions corresponding to at least one keypoint. The contour regions include all regions in the target content image except for the keypoint regions. Then, based on the connection relationships between the multiple continuous subspaces, feature extraction is performed on the spatial images to obtain target fusion features and spatial features of each subspace within the multiple continuous subspaces. Based on the target fusion features and spatial features, the risk detection result of the target content is determined and output. Since this scheme can extract spatial images corresponding to multiple continuous subspaces from the target content image and use different weights to extract features from different regions based on the connection relationships between the multiple continuous subspaces, it can extract features from regions where false or forged traces are concentrated, thereby improving the accuracy of feature extraction. Furthermore, dividing the target content image into multiple regions reduces the input resolution and significantly reduces the computational load during risk detection, thus improving the accuracy of content detection.

[0148] This specification, in another aspect, provides a non-transitory storage medium storing at least one set of executable instructions for performing content detection. When the executable instructions are executed by a processor, they instruct the processor to implement the steps of the content detection method P100 described herein. In some possible embodiments, various aspects of this specification can also be implemented as a program product comprising program code. When the program product is run on a computing device 600, the program code causes the computing device 600 to perform the steps of the content detection method P100 described herein. The program product for implementing the above method may employ a portable compact disc read-only memory (CD-ROM) containing program code and may run on the computing device 600. However, the program product of this specification is not limited thereto. In this specification, a readable storage medium may be any tangible medium containing or storing a program that may be used by or in conjunction with an instruction execution system. The program product may employ any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of readable storage media include: electrical connections having one or more wires, portable disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof. 
A readable signal medium may include data signals propagated in baseband or as part of a carrier wave, carrying readable program code. Such propagated data signals may take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate, or transmit programs for use by or in connection with an instruction execution system, apparatus, or device. Program code contained on a readable medium may be transmitted using any suitable medium, including but not limited to wireless, wired, optical fiber, RF, etc., or any suitable combination thereof. Program code for performing the operations described herein can be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as C or similar languages. The program code can be executed entirely on computing device 600, partially on computing device 600, as a standalone software package, partially on computing device 600 and partially on a remote computing device, or entirely on a remote computing device.

[0149] The foregoing has described specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than that shown in the embodiments and may still achieve the desired result. Furthermore, the processes depicted in the drawings do not necessarily require a specific or sequential order to achieve the desired result. In some embodiments, multitasking and parallel processing are possible or may be advantageous.

[0150] In summary, after reading this detailed disclosure, those skilled in the art will understand that the foregoing detailed disclosure is presented by way of example only and is not restrictive. Although not explicitly stated herein, those skilled in the art will understand that this specification is intended to cover various reasonable changes, improvements, and modifications to the embodiments. These changes, improvements, and modifications are intended to be suggested by this specification and are within the spirit and scope of the exemplary embodiments described herein.

[0151] Furthermore, certain terms in this specification have been used to describe embodiments of this specification. For example, "one embodiment," "an embodiment," and / or "some embodiments" mean that a particular feature, structure, or characteristic described in connection with that embodiment may be included in at least one embodiment of this specification. Therefore, it is to be emphasized and understood that two or more references to "one embodiment," "an embodiment," or "an alternative embodiment" in various parts of this specification do not necessarily refer to the same embodiment. Moreover, specific features, structures, or characteristics may be suitably combined in one or more embodiments of this specification.

[0152] It should be understood that in the foregoing description of the embodiments in this specification, various features are combined in a single embodiment, drawing, or description for the purpose of simplifying the description and aiding in the understanding of a feature. However, this does not mean that the combination of these features is necessary, and those skilled in the art may readily identify some of the features as separate embodiments when reading this specification. That is, the embodiments in this specification can also be understood as an integration of multiple secondary embodiments, and each secondary embodiment remains valid when it contains fewer than all the features of a single foregoing disclosed embodiment.

[0153] Each patent, patent application, publication of a patent application, and other material cited herein, such as articles, books, specifications, publications, and documents, may be incorporated herein by reference in its entirety for all purposes, except for any prosecution file history associated with it, any of it that is inconsistent with or in conflict with this document, or any of it that may have a limiting effect on the broadest scope of the claims now or later associated with this document. For example, in the event of any inconsistency or conflict between the description, definition, and / or use of a term associated with any of the incorporated materials and that associated with this document, the description, definition, and / or use of the term in this document shall prevail.

[0154] Finally, it should be understood that the embodiments disclosed herein are illustrative of the principles of the embodiments described in this specification. Other modified embodiments are also within the scope of this specification. Therefore, the embodiments disclosed in this specification are merely examples and not limitations. Those skilled in the art can implement the applications described in this specification using alternative configurations based on the embodiments in this specification. Therefore, the embodiments in this specification are not limited to the embodiments precisely described in the applications.

Claims

1. A content detection method, comprising: Obtain the target content image corresponding to the target content, and extract the spatial images corresponding to multiple continuous subspaces in the target content image. The multiple continuous subspaces include a contour region and a key point region corresponding to at least one key point. The at least one key point includes a facial key point corresponding to at least one part of the facial parts. The contour region includes the region in the target content image other than the key point region. Based on the connection relationships between the multiple continuous subspaces, feature extraction is performed on the spatial image to construct a connection network graph; Based on the connection network graph, the target fusion feature and the spatial feature of each subspace in the plurality of continuous subspaces are obtained, and the target fusion feature is the feature corresponding to the fusion of the spatial features; as well as Based on the target fusion features and the spatial features, the risk detection result of the target content is determined and the risk detection result is output.

2. The content detection method according to claim 1, wherein obtaining the target content image corresponding to the target content includes: obtaining an original content image corresponding to the target content; and performing facial alignment on the original content image to obtain the target content image.

3. The content detection method according to claim 2, wherein performing facial alignment on the original content image to obtain the target content image includes: performing face detection on the original content image; when the original content image passes the face detection, extracting at least one initial key point corresponding to a facial region from the original content image; and aligning, based on location information of the at least one initial key point, the facial region in the original content image with a preset facial template to obtain the target content image.

4. The content detection method according to claim 1, wherein extracting the spatial images corresponding to the plurality of continuous subspaces from the target content image includes: obtaining the at least one key point in the target content image; segmenting, based on the at least one key point, a region image corresponding to at least one facial part from the target content image to obtain at least one facial part image; and using the region image of the target content image other than the facial part image as a facial contour image, and using the facial part image and the facial contour image as the spatial images of the plurality of continuous subspaces.
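
For illustration only, and not as part of the claims, the segmentation recited in claim 4 might be sketched as follows. The sketch assumes a grayscale image stored as nested lists, one square patch per key point, and zero-filling to exclude key point regions from the contour image. The function name `split_subspaces`, the `half_size` parameter, and the part names are hypothetical and do not come from the specification.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]  # grayscale image as rows of pixel values


def split_subspaces(
    image: Image,
    key_points: Dict[str, Tuple[int, int]],
    half_size: int = 1,
) -> Dict[str, Image]:
    """Crop one square patch per key point; keep the remainder as the contour image."""
    height, width = len(image), len(image[0])
    contour = [row[:] for row in image]  # copy, so the input stays untouched
    spatial_images: Dict[str, Image] = {}
    for name, (row, col) in key_points.items():
        # Clamp the patch to the image borders.
        top, bottom = max(0, row - half_size), min(height, row + half_size + 1)
        left, right = max(0, col - half_size), min(width, col + half_size + 1)
        spatial_images[name] = [image[r][left:right] for r in range(top, bottom)]
        # Assumption: zero-filling marks the key point region as excluded
        # from the contour image.
        for r in range(top, bottom):
            for c in range(left, right):
                contour[r][c] = 0
    spatial_images["contour"] = contour
    return spatial_images
```

Calling `split_subspaces(image, {"left_eye": (1, 1), "mouth": (3, 3)})` would return one patch per named key point plus a `"contour"` entry, which together play the role of the spatial images of the subspaces.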

5. The content detection method according to claim 1, wherein performing feature extraction on the spatial images based on the connection relationships between the plurality of continuous subspaces to construct the connection network graph includes: determining the connection relationships between the plurality of continuous subspaces based on a spatial type of each subspace; and performing, based on the connection relationships, feature extraction on the spatial images using a content detection model to construct the connection network graph.

6. The content detection method according to claim 5, wherein the content detection model includes a content detection network and a graph network; and performing feature extraction on the spatial images using the content detection model based on the connection relationships to construct the connection network graph includes: performing feature extraction on the spatial images using the content detection network to obtain initial spatial features of each subspace, and fusing the initial spatial features to obtain initial fusion features; and constructing the connection network graph, based on the connection relationships, with the initial spatial features and the initial fusion features each serving as nodes.

7. The content detection method according to claim 6, wherein obtaining, based on the connection network graph, the target fusion features and the spatial features of each subspace in the plurality of continuous subspaces includes: refining, based on the connection network graph, the initial spatial features and the initial fusion features using the graph network to obtain the target fusion features and the spatial features of each subspace.

8. The content detection method according to claim 7, wherein the connection network graph includes a first node corresponding to the initial spatial features and a second node corresponding to the initial fusion features; and refining the initial spatial features and the initial fusion features using the graph network to obtain the target fusion features and the spatial features of each subspace includes: selecting, in the connection network graph, the neighboring nodes corresponding to each node; refining the initial spatial features based on a preset information transfer function corresponding to the graph network and the neighboring nodes corresponding to the first node, to obtain the spatial features of each subspace; and refining the initial fusion features based on the preset information transfer function and the neighboring nodes corresponding to the second node, to obtain the target fusion features.
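
For illustration only, and not as part of the claims, one round of the neighbor-based refinement recited in claim 8 might be sketched as follows. The claims do not fix the form of the preset information transfer function; this sketch assumes mean aggregation over neighboring nodes blended with the node's own feature. The function name `refine_features`, the `self_weight` parameter, and the example neighbor lists are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def refine_features(
    features: Dict[str, Vector],
    neighbors: Dict[str, List[str]],
    self_weight: float = 0.5,
) -> Dict[str, Vector]:
    """One synchronous round of mean-aggregation message passing."""
    refined: Dict[str, Vector] = {}
    for node, own in features.items():
        adjacent = neighbors.get(node, [])
        if not adjacent:
            refined[node] = list(own)  # isolated node: nothing to aggregate
            continue
        dim = len(own)
        # Average the (unrefined) features of the selected neighboring nodes.
        mean = [sum(features[n][i] for n in adjacent) / len(adjacent) for i in range(dim)]
        # Assumed transfer function: blend own feature with the neighbor mean.
        refined[node] = [self_weight * own[i] + (1.0 - self_weight) * mean[i] for i in range(dim)]
    return refined


# Hypothetical graph: two key point nodes and a contour node (first nodes)
# plus one node holding the initial fusion feature (second node).
features = {"eye": [1.0, 0.0], "mouth": [0.0, 1.0], "contour": [1.0, 1.0], "fused": [2.0, 2.0]}
neighbors = {
    "eye": ["contour", "fused"],
    "mouth": ["contour", "fused"],
    "contour": ["eye", "mouth", "fused"],
    "fused": ["eye", "mouth", "contour"],
}
refined = refine_features(features, neighbors)
```

In this sketch, the refined value of the `"fused"` node stands in for the target fusion feature, and the refined values of the other nodes stand in for the spatial features of each subspace. A trained graph network would use learned parameters rather than a fixed blend.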

9. The content detection method according to claim 5, wherein training the content detection model includes: training a preset content detection network to obtain an initial content detection network, and dynamically compressing the initial content detection network to obtain the content detection network; and training a preset graph network to obtain the graph network, and using the content detection network and the graph network as the content detection model.

10. The content detection method according to claim 9, wherein the preset content detection network includes a feature extraction subnetwork, a feature fusion subnetwork, and a risk classification subnetwork; and training the preset content detection network to obtain the initial content detection network includes: obtaining a first spatial image sample of each subspace, and performing feature extraction on the first spatial image sample using the feature extraction subnetwork to obtain first sample spatial features; fusing the first sample spatial features using the feature fusion subnetwork to obtain first sample fusion features; determining, using the risk classification subnetwork, a first predicted risk category corresponding to the first sample spatial features and a second predicted risk category corresponding to the first sample fusion features; and converging the preset content detection network based on the first predicted risk category and the second predicted risk category to obtain the initial content detection network.

11. The content detection method according to claim 10, wherein converging the preset content detection network to obtain the initial content detection network includes: obtaining a first labeled risk category of the first spatial image sample, and comparing the first labeled risk category with the first predicted risk category to obtain first independent classification loss information; comparing the first labeled risk category with the second predicted risk category to obtain first fusion classification loss information; and fusing the first independent classification loss information with the first fusion classification loss information, and converging the preset content detection network based on the resulting target classification loss information to obtain the initial content detection network.
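
For illustration only, and not as part of the claims, the loss fusion recited in claim 11 might be sketched as follows. The claims do not specify the loss function or the fusion rule; this sketch assumes binary cross-entropy for each comparison, an average over subspaces for the independent term, and a weighted sum as the fusion. All function and parameter names are hypothetical.

```python
import math
from typing import List


def binary_cross_entropy(prob_risky: float, label: int, eps: float = 1e-7) -> float:
    """Compare a predicted risk probability with a 0/1 labeled risk category."""
    p = min(max(prob_risky, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def target_classification_loss(
    subspace_probs: List[float],
    fused_prob: float,
    label: int,
    independent_weight: float = 1.0,
    fused_weight: float = 1.0,
) -> float:
    """Fuse the independent and fusion classification losses by a weighted sum."""
    # Independent term: one prediction per subspace, averaged (an assumption).
    independent = sum(binary_cross_entropy(p, label) for p in subspace_probs) / len(subspace_probs)
    # Fusion term: a single prediction made from the fused feature.
    fused = binary_cross_entropy(fused_prob, label)
    return independent_weight * independent + fused_weight * fused
```

With this shape, a training loop would minimize the value returned by `target_classification_loss` until the network converges.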

12. The content detection method according to claim 9, wherein dynamically compressing the initial content detection network to obtain the content detection network includes: adding a compression layer after each network layer in the initial content detection network to obtain a candidate content detection network, the compression layer including a batch normalization layer and a target convolutional layer of a preset size; obtaining second spatial image samples of each subspace, and training the candidate content detection network based on the second spatial image samples to obtain a trained current content detection network; obtaining feature channel weights of each batch normalization layer in the current content detection network, and compressing the current content detection network based on the feature channel weights to obtain a content detection compression network; and linearly superimposing the target convolutional layer with the convolutional layer in the content detection compression network to obtain the content detection network.

13. The content detection method according to claim 12, wherein training the candidate content detection network based on the second spatial image samples to obtain the trained current content detection network includes: performing feature extraction on the second spatial image samples using the candidate content detection network to obtain second sample spatial features and initial feature channel weights of the batch normalization layer corresponding to the second sample spatial features; fusing the second sample spatial features to obtain second sample fusion features, and determining target compression loss information of the second spatial image samples based on the second sample fusion features, the second sample spatial features, and the initial feature channel weights; and converging the compression layer in the candidate content detection network based on the target compression loss information to obtain the current content detection network.

14. The content detection method according to claim 13, wherein determining the target compression loss information of the second spatial image samples includes: determining second fusion classification loss information based on the second sample fusion features, and determining second independent classification loss information based on the second sample spatial features; determining weight sparsity loss information based on the initial feature channel weights, wherein the constraint condition of the weight sparsity loss information is that the initial feature channel weights corresponding to a preset number of feature channels are less than a preset weight threshold; and fusing the second fusion classification loss information, the second independent classification loss information, and the weight sparsity loss information to obtain the target compression loss information.
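
For illustration only, and not as part of the claims, the compression loss recited in claim 14 might be sketched as follows. The claims do not give the form of the sparsity term; this sketch assumes an L1 penalty on the batch normalization channel weights, compares weight magnitudes against the threshold, and fuses the three terms by a weighted sum. All function and parameter names are hypothetical.

```python
from typing import List


def sparsity_loss(channel_weights: List[float]) -> float:
    """Assumed L1 penalty that pushes channel weights toward zero."""
    return sum(abs(w) for w in channel_weights)


def constraint_met(channel_weights: List[float], threshold: float, required_count: int) -> bool:
    """Check that at least `required_count` channel weights fall below the threshold."""
    below = sum(1 for w in channel_weights if abs(w) < threshold)
    return below >= required_count


def target_compression_loss(
    fused_loss: float,
    independent_loss: float,
    channel_weights: List[float],
    sparsity_weight: float = 0.01,
) -> float:
    """Fuse the two classification losses with the sparsity term (weighted sum)."""
    return fused_loss + independent_loss + sparsity_weight * sparsity_loss(channel_weights)
```

In such a setup, training would continue until `constraint_met` holds, so that enough channels have small weights to be removed afterward.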

15. The content detection method according to claim 12, wherein compressing the current content detection network based on the feature channel weights includes: selecting, from the feature channel weights, at least one feature channel weight that is less than a preset weight threshold to obtain target feature channel weights; identifying, in each network layer of the current content detection network, the target feature channels corresponding to the target feature channel weights; and pruning the target feature channels to obtain the content detection compression network.
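
For illustration only, and not as part of the claims, the channel selection and pruning recited in claim 15 might be sketched as follows for a single layer. The sketch assumes the layer is represented as a list with one entry per feature channel and that weight magnitudes are compared against the threshold. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def select_target_channels(channel_weights: Sequence[float], threshold: float) -> List[int]:
    """Indices of channels whose weight magnitude is below the threshold."""
    return [i for i, w in enumerate(channel_weights) if abs(w) < threshold]


def prune_layer(
    channels: Sequence[List[float]],
    channel_weights: Sequence[float],
    threshold: float,
) -> Tuple[List[List[float]], List[float]]:
    """Drop the target channels and return the remaining channels and weights."""
    targets = set(select_target_channels(channel_weights, threshold))
    kept = [i for i in range(len(channels)) if i not in targets]
    return [channels[i] for i in kept], [channel_weights[i] for i in kept]


# Hypothetical layer with four channels, two of which have near-zero weights.
layer = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
weights = [0.9, 0.01, 0.4, 0.05]
pruned_layer, pruned_weights = prune_layer(layer, weights, threshold=0.1)
```

Applying the same step to every network layer would yield the smaller network that the claim calls the content detection compression network.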

16. The content detection method according to claim 9, wherein training the preset graph network to obtain the graph network includes: obtaining a connection network graph sample, which includes nodes corresponding to third sample spatial features and nodes corresponding to third sample fusion features, wherein the third sample fusion features are features obtained by fusing the third sample spatial features; refining, based on the connection network graph sample, the third sample spatial features and the third sample fusion features respectively using the preset graph network, to obtain updated spatial features corresponding to the third sample spatial features and updated fusion features corresponding to the third sample fusion features; and converging the preset graph network based on the updated spatial features and the updated fusion features to obtain the trained graph network.

17. The content detection method according to claim 16, wherein converging the preset graph network to obtain the trained graph network includes: obtaining a second labeled risk category corresponding to the connection network graph sample; determining a third predicted risk category of the connection network graph sample based on the updated spatial features, and comparing the second labeled risk category with the third predicted risk category to obtain third independent classification loss information; determining a fourth predicted risk category of the connection network graph sample based on the updated fusion features, and comparing the second labeled risk category with the fourth predicted risk category to obtain third fusion classification loss information; and fusing the third independent classification loss information with the third fusion classification loss information, and converging the preset graph network based on the resulting graph network loss information to obtain the trained graph network.

18. The content detection method according to claim 5, wherein the content detection model is deployed on a client or a terminal.

19. The content detection method according to claim 1, wherein determining the risk detection result of the target content based on the target fusion features and the spatial features includes: determining a fusion attack probability of the target content based on the target fusion features; determining an independent attack probability of the target content in each subspace based on the spatial features; and determining the risk detection result of the target content based on the fusion attack probability and the independent attack probabilities.

20. The content detection method according to claim 19, wherein the risk detection result includes one of risky content and normal content; and determining the risk detection result of the target content includes: determining a probability mean of the independent attack probabilities; and when the fusion attack probability is greater than a preset first probability threshold or the probability mean is greater than a preset second probability threshold, determining that the risk detection result of the target content is risky content.

21. The content detection method according to claim 20, further comprising: when the fusion attack probability is less than the preset first probability threshold and the probability mean is less than the preset second probability threshold, determining that the risk detection result of the target content is normal content.
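
For illustration only, and not as part of the claims, the decision rule recited in claims 19 to 21 might be sketched as follows. The threshold values are placeholders. The claims do not say how a probability exactly equal to a threshold is handled; this sketch treats that case as normal content, which is an assumption. The function name `detect_risk` and the returned labels are hypothetical.

```python
from typing import List


def detect_risk(
    fusion_attack_prob: float,
    independent_attack_probs: List[float],
    first_threshold: float = 0.5,
    second_threshold: float = 0.5,
) -> str:
    """Return "risky" or "normal" from the fusion and per-subspace attack probabilities."""
    # Probability mean of the independent attack probabilities (claim 20).
    mean_prob = sum(independent_attack_probs) / len(independent_attack_probs)
    # Either signal exceeding its threshold is enough to flag the content.
    if fusion_attack_prob > first_threshold or mean_prob > second_threshold:
        return "risky"
    # Assumption: values equal to a threshold fall through to "normal".
    return "normal"
```

Because the two conditions are joined by a logical OR, content can be flagged when only a few subspaces look forged, provided their average still exceeds the second threshold, even if the fused feature appears benign.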

22. A content detection system, comprising: at least one storage medium storing at least one instruction set for content detection; and at least one processor communicatively connected to the at least one storage medium, wherein, when the content detection system is running, the at least one processor reads the at least one instruction set and, as instructed by the at least one instruction set, executes the content detection method according to any one of claims 1-21.