An end-to-end virtual social robot detection method based on multimodal information fusion

By constructing an end-to-end virtual social robot detection method based on multimodal information fusion, and utilizing various attention mechanisms and feature representation models, the method addresses the issues of subjectivity and insufficient generalization performance of existing methods, and achieves efficient identification and cross-platform adaptation of complex virtual social robots.

CN117494036B · Active · Publication Date: 2026-06-30 · SICHUAN UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SICHUAN UNIV
Filing Date
2023-09-22
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing methods for detecting virtual social robots rely on feature engineering, which suffers from subjectivity and insufficient generalization performance, making it difficult to effectively identify complex robots. Furthermore, Chinese platform datasets are scarce and lack multimodal information fusion, resulting in poor detection performance.

Method used

We construct an end-to-end virtual social robot detection method based on multimodal information fusion. By expanding the dataset and extracting modal features using TabNet, CNNnet, RoBERTa, and PyTorch-BigGraph models, we combine a cross-fusion attention mechanism, a channel attention mechanism, and a spatial attention mechanism to explore the correlation and importance between multimodal features and achieve the classification of virtual social robots and humans.

Benefits of technology

It improves the generalization and evolutionary adaptation capabilities of virtual social robot detection, effectively identifies complex virtual social robots, has better cross-robot category transfer capabilities, and outperforms existing methods.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention relates to the field of virtual social robot detection technology, and proposes an end-to-end virtual social robot detection method based on multimodal information fusion. First, based on a publicly available dataset, a large-scale virtual social robot detection research dataset is constructed by expanding the data scale and information dimensions. Then, diverse end-to-end feature representation learning models are built based on RoBERTa, PBG, TabNet, and CNN to extract low-dimensional feature representations from various modal data, including user text, attributes, and heterogeneous relationships. Finally, based on channel attention, spatial attention, gating, and cross-fusion mechanisms, the correlations and importance among the low-dimensional representations of multimodal features are mined, and multimodal information is fused to achieve the classification of virtual social robots and humans. The detection method of this invention has better adaptability to evolving virtual social robots, cross-class transferability, and good generalization ability.

Description

Technical Field

[0001] This invention relates to the field of virtual social robot detection technology, specifically to an end-to-end virtual social robot detection method based on multimodal information fusion.

Background Technology

[0002] In the internet age, with the rapid development of information technology, people's social interactions have undergone significant changes. Online social networks have become an indispensable method of social participation in public life. People are keen to obtain real-time information through social networks, express their opinions and stances on hot topics, and create personal influence.

[0003] However, due to the openness of social networks, they are rife with malicious content such as fake news, offensive statements, and misleading opinions. Numerous studies have shown that virtual social bots play a crucial role in the spread of this malicious content. Malicious organizations exploit the automated nature of virtual social bots to manipulate trending topics, control online public opinion, spread false information, disrupt the online ecosystem, influence group viewpoints, and wage cyber warfare. Without regulation and control, malicious virtual social bots will seriously undermine social justice and threaten national security. Therefore, a new method is urgently needed to accurately identify virtual social bots and maintain legitimate order in cyberspace.

[0004] Existing researchers are working to identify virtual social bots by analyzing information such as tweets, attributes, and social relationships of social users. However, these studies generally suffer from the following problems: (1) Most detection schemes are based on feature engineering methods, which are subjective and can only identify virtual social bots with certain characteristics; (2) Some end-to-end methods ignore effective modal information fusion mechanisms, which cannot improve the recognition ability of complex bots and have insufficient generalization performance; (3) Existing research is generally concentrated on English platforms, resulting in a scarcity of high-quality Chinese platform research datasets. Existing Chinese research datasets generally suffer from problems such as missing attributes, few tweets, and missing relationships.

[0005] Numerous studies have been conducted on virtual social robot detection. However, these studies generally rely on feature engineering and neglect the organic connections between different modal information, making them unable to identify increasingly complex robots that are constantly evolving. The task of virtual social robot detection still faces challenges in the effective fusion of multimodal features and generalization ability. The challenge of effective multimodal feature fusion requires detection methods to effectively integrate the multidimensional information generated by virtual social robots to achieve detection, while the challenge of generalization performance requires detection methods to adapt to virtual social robots that have become increasingly complex due to evolution and are distributed across different domains.

Summary of the Invention

[0006] To address the aforementioned problems, the present invention aims to provide an end-to-end virtual social robot detection method based on multimodal information fusion. This method possesses better adaptability to evolving virtual social robots and cross-robot category transfer capabilities, as well as better generalization ability, effectively overcoming the shortcomings of existing virtual social robot detection methods. The technical solution is as follows:

[0007] An end-to-end virtual social robot detection method based on multimodal information fusion includes the following steps:

[0008] Step 1: Based on the publicly available dataset, construct a virtual social robot detection research dataset by expanding the data scale and information dimensions;

[0009] Step 2: Construct a virtual social robot detection model. Based on TabNet, CNNnet, RoBERTa and PyTorch-BigGraph models, extract modal feature vectors from heterogeneous relationship data, user attribute data and text content data respectively to form a low-dimensional feature representation set of user data.

[0010] Step 3: Propose a cross-fusion attention mechanism, which generates relation weights by calculating the correlation between pairwise modal feature vectors, and uses the relation weights to fuse multimodal fusion features in pairs to generate a preliminary overall semantic representation of the user;

[0011] Step 4: Based on multiple attention mechanisms, including channel attention mechanism, spatial attention mechanism, gating mechanism and the cross-fusion attention mechanism, the correlation and importance between the features of each modality are explored, and effective features are selected from the preliminary semantic representation of the user to realize the classification of virtual social robots and humans.

[0012] Furthermore, step 1 specifically includes:

[0013] Step 1.1: Expand the missing information dimensions of the existing Sina Weibo research dataset SWBD-20K to include some user attributes, historical tweets, and social and interaction relationship data;

[0014] Step 1.2: Supplement the dataset with the latest virtual social robot data based on current hot topics, and expand the labeled scale of the dataset through data annotation;

[0015] Step 1.3: Construct the Weibo23 virtual social robot detection dataset.

[0016] Furthermore, step 2 specifically includes:

[0017] Step 2.1: Define the user's original input data as a triple User = <Topology, Property, Content>; where Topology represents heterogeneous relational data, Property is user attribute data, and Content is text content data; the initial data structure for each data type is as follows:

[0018] Heterogeneous relational data:

[0019] The heterogeneous relational data represents the user's social and interactive relationships, represented as a triple Topology=<N,L,R>, where N represents a node in the network, L represents a connection between two nodes, and R represents the relational type to which the connection belongs.

[0020] User attribute data:

[0021] Two encoding methods are used for the profile data in the user attribute data: count data is encoded as counts, and boolean data is encoded as 0/1 values. The user attribute data is represented as follows:

[0022] $Property = [property_1, property_2, \ldots, property_n]$ (1)

[0023] where $property_i$ denotes a specific profile attribute;
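The two encoding rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the field names (`tweet_count`, `is_verified`, and so on) and the helper `encode_property` are hypothetical, chosen to mirror the kinds of count and boolean attributes the text describes.

```python
from typing import Dict, List, Union

# Hypothetical field names; the source only says attributes are counts or booleans.
COUNT_FIELDS = ["tweet_count", "following_count", "follower_count"]
BOOL_FIELDS = ["is_verified", "uses_default_name", "uses_default_avatar"]


def encode_property(profile: Dict[str, Union[int, bool]]) -> List[float]:
    """Return [property_1, ..., property_n]: counts kept as-is, booleans mapped to 0/1."""
    counts = [float(profile[name]) for name in COUNT_FIELDS]
    flags = [1.0 if profile[name] else 0.0 for name in BOOL_FIELDS]
    return counts + flags


example_profile = {
    "tweet_count": 120,
    "following_count": 35,
    "follower_count": 8,
    "is_verified": False,
    "uses_default_name": True,
    "uses_default_avatar": True,
}
property_vector = encode_property(example_profile)
```

Keeping the raw counts and flags, without hand-built ratios, matches the stated goal of avoiding subjective feature engineering; any scaling is left to the downstream model.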

[0024] Text content data:

[0025] All tweets from each user in the dataset are sorted in ascending order of posting time to form a tweet content sequence, which is used for subsequent content language feature extraction; the text content data is represented as:

[0026] $Content = [Content_1, Content_2, \ldots, Content_n]$ (2)

[0027] where $Content_i$ denotes a tweet that the user has posted;
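Building the tweet content sequence amounts to a sort by posting time. The sketch below assumes a hypothetical `Tweet` record with a `posted_at` timestamp and a `text` field; the actual storage format of the dataset is not given in the source.

```python
from datetime import datetime
from typing import List, NamedTuple


class Tweet(NamedTuple):
    # Hypothetical record layout, for illustration only.
    posted_at: datetime
    text: str


def build_content_sequence(tweets: List[Tweet]) -> List[str]:
    """Sort one user's tweets in ascending order of posting time and keep the text."""
    ordered = sorted(tweets, key=lambda t: t.posted_at)
    return [t.text for t in ordered]


raw_tweets = [
    Tweet(datetime(2023, 5, 3, 9, 0), "third"),
    Tweet(datetime(2023, 5, 1, 8, 30), "first"),
    Tweet(datetime(2023, 5, 2, 21, 15), "second"),
]
content = build_content_sequence(raw_tweets)
```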

[0028] Step 2.2: For the original user input data User=<Topology,Property,Content>, use different feature representation models to extract heterogeneous relational feature representation T, user attribute low-dimensional feature representation P, and text content low-dimensional feature representation C, forming the low-dimensional feature representation set U0=[T,P,C] of the user data.

[0029] Furthermore, step 2.2 specifically includes:

[0030] Step 2.2.1: Heterogeneous Relation Embedding

[0031] After receiving the user's network topology, the graph embedding algorithm PyTorch-BigGraph generates a vector representation of each node based on the current structure, where the heterogeneous relation dimension feature representation is defined as T; the entire embedding process is defined as follows:

[0032] T = PBG(Topology) (3)

[0033] In the process of graph embedding, a parameter vector is used to represent each node and its relationship. During training, the node representation vector is updated using stochastic gradient descent. At the same time, uniform negative sampling and data distribution negative sampling are used to generate relationships between unconnected nodes.

[0034] Step 2.2.2: User Attribute Data Mining

[0035] For user attribute data (Property), TabNet is used to mine linear relationships. After TabNet generates the account data representation features, the low-dimensional feature representation P of user attributes is obtained by passing them through a fully connected layer and an activation function. The calculation method is as follows:

[0036] P=ReLU(FC(TabNet(Property))) (4)

[0037] Step 2.2.3: Text Content Representation Learning

[0038] After receiving the text content data $Content = [Content_1, Content_2, \ldots, Content_n]$, RoBERTa is used for word embedding to mine the semantics of each tweet; the embedding process is as follows:

[0039] $Embedding_i = \mathrm{RoBERTa}(Content_i)$ (5)

[0040] After word embedding, the sequence E of word vector matrices arranged chronologically is defined as follows:

[0041] $E = [Embedding_1, Embedding_2, \ldots, Embedding_n]$ (6)

[0042] where $Embedding_i$ is the word embedding matrix generated for the i-th tweet. Each $Embedding_i$ is a 100×768 matrix; 768 is the fixed embedding vector size of the RoBERTa model, and 100 is the value chosen based on the average length of user tweets;

[0043] All word embedding matrices in the word vector matrix sequence E are subjected to average pooling column by column, compressing the content word vector matrix in each user text into a 1*768 text content vector, and maintaining the temporal arrangement of the text content vectors to form a text content semantic vector sequence.

[0044] A simplified CNNnet is used to mine the contextual semantics between sequences of semantic vectors in the text content, extracting feature representations of the overall semantics of the user's context, i.e., low-dimensional feature representations of the text content, thus completing the feature extraction at the semantic level of the user's tweet content; the calculation method is as follows:

[0045] C=avgpool(Conv2d(avgpool(concatenation(E)))) (7)

[0046] Here, concatenation(·) means concatenating vectors, and Conv2d and avgpool represent convolution and average pooling operations, respectively.
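The column-wise average pooling described in [0043] can be illustrated without any deep learning library. The sketch below covers only that pooling step, not the convolution in formula (7), and uses tiny 2×2 matrices in place of the 100×768 embeddings; the function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # rows are tokens, columns are embedding dimensions


def column_mean(matrix: Matrix) -> List[float]:
    """Average each column, turning a (tokens x dims) matrix into a single dims-vector."""
    rows = len(matrix)
    return [sum(column) / rows for column in zip(*matrix)]


def tweet_vectors(embeddings: List[Matrix]) -> List[List[float]]:
    """Pool every tweet matrix in E while preserving the chronological order."""
    return [column_mean(m) for m in embeddings]


# Toy stand-in for E: two tweets, each 2 tokens x 2 dimensions.
E = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [2.0, 6.0]],
]
sequence = tweet_vectors(E)
```

With real data each pooled vector would have 768 entries, and the resulting sequence is what the simplified CNN then processes.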

[0047] Furthermore, the specific process of pairwise multimodal fusion in step 3 is as follows:

[0048] Step 3.1: Apply a linear transformation to the heterogeneous relation-dimensional feature representation T and to the low-dimensional feature representation P of user attributes, then multiply the results to generate the interaction space matrix for the pair (T, P):

[0049]

[0050] where $w_P$ and $w_N$ are learnable parameters, and the operator in the formula denotes the dot product;

[0051] Step 3.2: Calculate the correlation coefficients $W_{TP}$ and $W_{PT}$ between the original features and the transformed features, respectively. The calculation formulas are as follows:

[0052]

[0053]

[0054] Step 3.3: Input $W_{TP}$ and $W_{PT}$ into a softmax function to obtain the final relation weights $W'_{TP}$ and $W'_{PT}$, and generate the fused feature representation TP based on these weights:

[0055] $W'_{TP}, W'_{PT} = \mathrm{softmax}(W_{TP}, W_{PT})$ (11)

[0056] $TP = W'_{TP} \cdot T + W'_{PT} \cdot P$ (12)
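Formulas (11) and (12) can be traced with a small numeric sketch. It assumes, purely for illustration, that the two correlation coefficients are scalars and that T and P have the same length; how the coefficients themselves are computed is not reproduced here.

```python
import math
from typing import List, Tuple


def softmax_pair(w_tp: float, w_pt: float) -> Tuple[float, float]:
    """Normalize two correlation scores into relation weights that sum to 1."""
    shift = max(w_tp, w_pt)  # subtract the max for numerical stability
    e_tp, e_pt = math.exp(w_tp - shift), math.exp(w_pt - shift)
    total = e_tp + e_pt
    return e_tp / total, e_pt / total


def fuse(T: List[float], P: List[float], w_tp: float, w_pt: float) -> List[float]:
    """Weighted sum of the two modal vectors, following formula (12)."""
    weight_t, weight_p = softmax_pair(w_tp, w_pt)
    return [weight_t * t + weight_p * p for t, p in zip(T, P)]


# Equal scores give equal weights, so the fused vector is the element-wise mean.
TP = fuse([1.0, 1.0], [3.0, 3.0], w_tp=0.0, w_pt=0.0)
```

A larger score for one modality shifts the fused representation toward that modality, which is how the correlation drives the pairwise fusion.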

[0057] Step 3.4: Perform further linear transformations on TP to extract deeper implicit correlations from the fused features.

[0058] To prevent gradient vanishing, a residual structure is used to combine the two, resulting in the structure-data fusion mode TP′:

[0059] TP′=TP+BatchNorm1D(Dropout(ReLU(TP))) (13)

[0060] where ReLU is the rectified linear unit, which serves as the activation function; Dropout is the regularizer; and BatchNorm1D is the normalization function.

[0061] Step 3.5: Similarly, generate modal feature representations of pairwise fusion of three modalities, including text-structure fusion modality CT′ and data-text fusion modality PC′. Then, concatenate the three to obtain the preliminary overall semantic fusion representation U of the user considering pairwise correlations.

[0062] U=concatenation(TP′; PC′; CT′) (14)

[0063] Furthermore, step 4 specifically includes:

[0064] Step 4.1: For the preliminary overall semantic fusion representation U of the user, which contains the cross-fusion modal information, treat each fusion modality as a channel and feed it into channel attention, so that the model focuses on the fusion modalities that contribute most to detection performance:

[0065] CAM(U)=σ(MLP(avgpool(U))+MLP(maxpool(U))) (15)

[0066] $U_1 = \mathrm{CAM}(U) \otimes U$ (16)

[0067] Wherein, CAM(·) is the channel attention mechanism; avgpool(·) is average pooling, maxpool(·) is max pooling, MLP(·) is multilayer perceptron, σ(·) is the activation function Sigmoid; U1 represents the channel attention output, which is the product of the channel attention matrix CAM(U) and the user's overall semantic preliminary fusion representation U;
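The channel attention step can be sketched in plain Python. The example treats U as three channels (one per fused modality), and assumes a two-layer shared MLP with a ReLU in between; the layer sizes and the identity weights used here are illustrative assumptions, not values from the source.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def shared_mlp(x: Vector, w1: List[Vector], w2: List[Vector]) -> Vector:
    """Two-layer perceptron with ReLU (the architecture is an assumption)."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def channel_attention(U: List[Vector], w1: List[Vector], w2: List[Vector]) -> Vector:
    """CAM(U) = sigmoid(MLP(avgpool(U)) + MLP(maxpool(U))), one weight per channel."""
    avg = [sum(channel) / len(channel) for channel in U]
    mx = [max(channel) for channel in U]
    from_avg = shared_mlp(avg, w1, w2)
    from_max = shared_mlp(mx, w1, w2)
    return [sigmoid(a + m) for a, m in zip(from_avg, from_max)]


def apply_channel_attention(U: List[Vector], weights: Vector) -> List[Vector]:
    """U1: scale every channel of U by its attention weight."""
    return [[w * v for v in channel] for channel, w in zip(U, weights)]


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
U = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0], [-1.0, -2.0, -3.0]]  # stand-ins for TP', PC', CT'
cam = channel_attention(U, identity, identity)
U1 = apply_channel_attention(U, cam)
```

In this toy run the first channel receives a weight close to 1 while the other two receive 0.5, so the output emphasizes the channel with the strongest pooled response.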

[0068] Step 4.2: Use spatial attention mechanism to identify features in each fusion modality that contribute to improving detection performance:

[0069] $\mathrm{SAM}(U_1) = \sigma(f^{7 \times 7}([\mathrm{avgpool}(U_1), \mathrm{maxpool}(U_1)]))$ (17)

[0070] $U_2 = \mathrm{SAM}(U_1) \otimes U_1$ (18)

[0071] where U2 represents the spatial attention output, which is the product of the spatial attention matrix SAM(U1) and U1; $f^{7 \times 7}$ is a convolution with a 7×7 kernel.

[0072] Step 4.3: Feed the original features, i.e., the preliminary overall semantic fusion representation U of the user, together with the spatial attention output U2 into the residual mechanism, which selects the features that are more conducive to detection after feature fusion and yields the final fused overall semantic feature U' of the user. The formulas are as follows:

[0073] Z = sigmoid(W·[U,U2]+b) (19)

[0074] U'=tanh(U)⊙Z+U2⊙(1-Z) (20)

[0075] where ⊙ denotes the Hadamard product, Z is the user semantic vector after nonlinear transformation by the sigmoid function, sigmoid(·) is the activation function, W is the learnable weight, b is the learnable bias, and tanh(·) is the hyperbolic tangent function.
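Formulas (19) and (20) describe an element-wise gate between U and U2. The sketch below works on flat vectors and uses zero weights so the gate is exactly 0.5; the shapes (W with one row per output element and 2d columns) are an assumption made for illustration.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(U: Vector, U2: Vector, W: List[Vector], b: Vector) -> Vector:
    """Z = sigmoid(W.[U, U2] + b); U' = tanh(U) * Z + U2 * (1 - Z), element-wise."""
    concat = U + U2  # [U, U2] as a single vector of length 2d
    Z = [sigmoid(sum(w * x for w, x in zip(row, concat)) + bias)
         for row, bias in zip(W, b)]
    return [math.tanh(u) * z + u2 * (1.0 - z) for u, u2, z in zip(U, U2, Z)]


d = 2
W = [[0.0] * (2 * d) for _ in range(d)]  # zero weights make every gate value 0.5
b = [0.0] * d
U_final = gated_fusion([0.0, 0.0], [2.0, 4.0], W, b)
```

When a gate value approaches 1 the output follows tanh(U), and when it approaches 0 the output follows the attention-refined U2, so the learned W and b decide per element which source to trust.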

[0076] Step 4.4: Model Output

[0077] The final fused overall semantic feature U' of the user is input into a fully connected layer to achieve binary classification detection:

[0078] $y_{out} = \mathrm{Dropout}(\mathrm{FC}(U'))$ (21)

[0079]

[0080] where $y_{out}$ is the final overall semantic vector of the user, and the quantity given by the formula in [0079] is the predicted user category;

[0081] The optimization objective of the model is to minimize the cross-entropy loss function. The classification loss function is defined by the negative log-likelihood of the correct label, and its definition is as follows:

[0082]

[0083] where ε is the cross-entropy loss and y is the user's true label, which is 1 when the user is human and 0 otherwise.
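As a point of reference, the standard binary cross-entropy (the negative log-likelihood of the correct label) can be computed as follows. This is the textbook form, offered as an assumption about what the loss looks like; the exact formula in the source is not reproduced here. The argument `p_human` is a hypothetical name for the predicted probability that a user is human, matching the label convention above.

```python
import math
from typing import Sequence


def cross_entropy(labels: Sequence[int], p_human: Sequence[float],
                  eps: float = 1e-12) -> float:
    """Mean negative log-likelihood; label 1 means human, 0 means robot."""
    total = 0.0
    for y, p in zip(labels, p_human):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# An uninformative prediction of 0.5 for every user costs log(2) per sample.
loss = cross_entropy([1, 0], [0.5, 0.5])
```

Confident correct predictions drive the loss toward zero, while confident wrong predictions are penalized heavily, which is the behaviour the optimization objective relies on.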

[0084] The beneficial effects of this invention are:

[0085] 1) This invention constructs a high-performance crawler to collect comprehensive and diverse user data from Sina Weibo. It features comprehensive information dimensions, large data scale, and includes the latest virtual social robots, which can provide data support for this invention.

[0086] 2) This invention achieves high-generalization performance in virtual social robot detection by extracting and fusing multimodal user feature representations. The proposed virtual social robot detection model outperforms most current advanced virtual social robot detection methods, exhibits better adaptability to evolving virtual social robots, cross-robot category transfer capabilities, and good generalization ability. Furthermore, the selection of various components and modal features in the model contributes to the final detection performance.

Attached Figure Description

[0087] Figure 1 is the overall framework of the end-to-end virtual social robot detection method based on multimodal information fusion of the present invention.

[0088] Figure 2 is a schematic diagram of the virtual social robot detection model structure of the present invention.

[0089] Figure 3(a) is a schematic diagram of the Feature Transformer.

[0090] Figure 3(b) is a schematic diagram of the Attentive Transformer.

[0091] Figure 4(a) shows the capture of virtual social robots within a 3-month user registration period: prediction results from the Weibo23 dataset.

[0092] Figure 4(b) shows the capture of virtual social robots within a 3-month user registration period: prediction results from the Twibot20 dataset.

[0093] Figure 5(a) shows the performance comparison of cross-domain transfer experiments with a competitive baseline: training the model on commercial domain data and transferring it to other domains for testing.

[0094] Figure 5(b) shows a comparison of cross-domain transfer experiment performance with a competitive baseline: the model was trained on entertainment domain data and then transferred to other domains for testing.

[0095] Figure 5(c) shows the performance comparison of cross-domain transfer experiments with a competitive baseline: the model was trained on political domain data and then transferred to other domains for testing.

[0096] Figure 5(d) shows the performance comparison of cross-domain transfer experiments with a competitive baseline: the model was trained on sports domain data and then transferred to other domains for testing.

Detailed Implementation

[0097] The present invention will now be described in further detail with reference to the accompanying drawings and specific embodiments.

[0098] This invention proposes an end-to-end virtual social robot detection method based on multimodal information fusion.

[0099] First, based on the publicly available dataset, a large-scale virtual social robot detection research dataset with richer information dimensions was constructed by expanding the data scale and information dimensions.

[0100] Then, based on RoBERTa (A Robustly Optimized BERT), PBG (PyTorch-BigGraph), TabNet, and CNN (Convolutional Neural Network), we constructed diverse end-to-end feature representation learning models to extract low-dimensional feature representations from various modalities such as user text, attributes, and heterogeneous relationships.

[0101] Ultimately, based on multiple attention mechanisms, namely channel attention, spatial attention, gating, and the cross-fusion mechanism designed and implemented in this invention, the correlation and importance among these low-dimensional feature representations of multimodal features are mined, and multimodal information is organically fused to achieve the classification of virtual social robots and humans.

[0102] The overall framework of this invention mainly consists of two parts: a data collection and expansion module and a virtual social robot detection module, as shown in Figure 1.

[0103] (1) Data Collection and Expansion Module: In this module, a high-performance web crawler is constructed to collect diverse user data from Sina Weibo. First, the missing information dimensions of the existing Sina Weibo research dataset SWBD-20K are expanded to include some user attributes, historical tweets, and social and interaction relationship data. Then, the latest virtual social robot data is supplemented based on current hot topics. Finally, the Weibo23 dataset is constructed, which has the characteristics of comprehensive information dimensions, large data scale, and inclusion of the latest virtual social robots, and can provide data support for this invention.

[0104] (2) Virtual Social Robot Detection Module: The core function of this module is to achieve high-generalization performance in virtual social robot detection by extracting and fusing multimodal user feature representations. First, diverse feature representation learning models are constructed based on TabNet, CNNnet, RoBERTa, and PyTorch-BigGraph to extract feature representations from various modalities, including heterogeneous relationship data, user attribute data, and text content data. Then, a cross-fusion attention mechanism is proposed to fuse the multimodal features obtained in the previous step based on the correlation between multimodal features, generating a holistic user semantic graph. Finally, spatial attention, channel attention, and gating mechanisms are used to select more effective features from the user semantic graph based on feature importance, thereby achieving virtual social robot detection.

[0105] 1. Data Collection and Expansion Module

[0106] Currently, there are relatively few high-quality research datasets related to Chinese social networks, with the SWBD-20K dataset being a representative example. However, this dataset suffers from shortcomings such as missing user attribute data, scarce historical tweet data, and a lack of heterogeneous user relationships, all of which are crucial for the model's generalization ability. Meanwhile, virtual social bots are evolving and becoming increasingly diverse. The goal of this invention is to propose an evolutionarily adaptive method for detecting virtual social bots, enabling timely discovery of new types. This requires a dataset that reflects the latest characteristics of virtual social bots. Therefore, the data dimensions and scale of SWBD-20K were expanded to construct the virtual social bot dataset of this invention.

[0107] 1.1 Data Collection Methods

[0108] First, addressing the issue of missing user data dimensions in the SWBD-20K dataset, information dimensions were expanded. Specifically, user data in SWBD-20K was first cleaned by removing accounts suspended by Weibo, resulting in a seed user set. Then, the Sina Weibo API was used to comprehensively supplement the seed users with three key pieces of information:

[0109] (1) Attribute information. Almost all available attributes of Sina Weibo users were obtained using the Weibo API, thereby obtaining attributes such as account creation time and account credit rating that were missing in the previous dataset.

[0110] (2) Text semantic information. We collected approximately 150 tweets recently posted by seed users, as well as tweet metadata information such as the number of likes, comments, retweets, and tweet posting time.

[0111] (3) Social and Interaction Relationship Information. Within the limits imposed by the Weibo platform, we collected as much as possible of each user's available following and follower lists. At the same time, we collected the available likes, reposts, and comments on these users' first 150 tweets.

[0112] Then, in order to better reflect the characteristics of the latest virtual social robots on the Sina Weibo platform, user data from the most popular topics on the Weibo platform were collected, and the scale of the dataset was expanded through data annotation.

[0113] 1.2 Data Labeling Methods

[0114] During the manual annotation phase of data scaling, the differences between virtual social bots and humans in SWBD-20K were referenced, and the annotation methods used in the SWBD-20K construction process were adopted, including six annotation metrics for virtual social bot data: completeness of user account information, reasonableness of user social relationships, frequency of interaction with other users, originality of user posts, regularity of posting time, and quality of user tweet content. Based on these metrics, labels were assigned by comparing profiles, tweet content, and social relationships across multiple accounts, ensuring annotation quality.

[0115] 2. Virtual Social Robot Detection Model

[0116] This invention proposes a virtual social robot detection model capable of effectively extracting and fusing multimodal features. The model extracts modal feature vectors from heterogeneous relationship data, user attribute data, and text content data based on TabNet, CNNnet, RoBERTa, and PyTorch-BigGraph models, respectively. Then, it mines the correlation and importance between modalities using cross-fusion attention, spatial attention, channel attention, and gating mechanisms, organically fusing multimodal features to achieve virtual social robot detection with high generalization ability. The structure of the model is shown in Figure 2.

[0117] 2.1 Model Input Construction

[0118] The model proposed in this invention takes as its initial input the raw, multimodal data of a specific social user. This data is defined as a triple User = <Topology, Property, Content>. Here, Topology represents heterogeneous relational data, Property represents user attribute data, and Content represents text content data. The initial data construction details for each data type are as follows:

[0119] Heterogeneous Relationship Data: The heterogeneous relationship data used in this invention includes five types of social and interactive relationships: following, fans, likes, reposts, and comments. This invention constructs a large-scale heterogeneous relationship network of Sina Weibo users, represented as a triple of Topology = <N, L, R>, where N represents a node in the network, L represents a connection between two nodes, and R represents the relationship type of that connection. Research has confirmed that virtual social robots can work collaboratively, exhibiting closer relationships and unique characteristics in the network results. Therefore, the heterogeneous relationship data in this invention is input into the model as a feature dimension.

[0120] Table 1 User Personal Attributes Table

[0121]
Attribute Name | Type
Number of tweets posted by users | Count
Number of users following | Count
User followers | Count
User Level | Count
User name length | Count
Is the user officially verified? | Bool
Does the user use the default name? | Bool
Is the user using the default profile picture? | Bool
User gender | Bool
Is the user's profile length appropriate? | Count
Does the user use the default background image? | Bool
Average number of likes on user tweets | Count
Average number of retweets per user tweet | Count
Average number of comments on user tweets | Count

[0122] User attribute data: Existing research typically uses feature engineering to mine relationships between features in account attribute data, such as the fan-to-follow ratio, which introduces the subjectivity inherent in feature engineering. Therefore, this invention selects only user attribute data that can be obtained directly from the Weibo API, as shown in Table 1. Two encoding methods are adopted for this profile data: (1) count data, such as the number of fans of a user or the number of accounts a user follows, is encoded as counts; (2) 0-1 type data, such as whether the user uses the default name, whether the user uses the default background image, and gender, is encoded as Booleans. Finally, the user attribute data Property is defined as:

[0123] $Property = [property_1, property_2, \ldots, property_n]$ (1)

[0124] where $property_i$ denotes a specific profile attribute.

[0125] Text content data: In online social networks, virtual social bots often scrape tweets from legitimate users to circumvent platform detection mechanisms. This results in tweets that are semantically incoherent and lack contextual coherence. Therefore, all tweets from each user in the dataset are sorted in ascending order of posting time to form a tweet content sequence for subsequent content language feature extraction. This sequence is defined as follows:

[0126] $Content = [Content_1, Content_2, \ldots, Content_n]$ (2)

[0127] where $Content_i$ denotes a tweet that the user has posted.

[0128] 2.2 Multimodal Feature Representation Learning

[0129] Then, as shown in Figure 2, for the original user input User=<Topology,Property,Content>, different feature representation models are used to extract the feature representations T, P, and C, forming a low-dimensional feature representation set U0=[T,P,C] for the user data. Implementation details are as follows:

[0130] 2.2.1 Heterogeneous Relation Embedding

[0131] In heterogeneous relationship embedding, the model learns the structural features of users based on the relationships and interactions between them. Due to the large scale of the heterogeneous relationship graph of Weibo users, the large-scale graph embedding algorithm PBG (PyTorch-BigGraph), developed by Facebook AI Research, is used for embedding. PBG allows for graph embedding with relatively few computational resources, so it can be effectively applied to large-scale heterogeneous relationship network data from Weibo. After receiving the user's network topology, PBG generates a vector representation of each node based on the current structure, where the heterogeneous relationship structural feature representation is defined as T. The entire embedding process can be defined as follows:

[0132] T = PBG(Topology) (3)

[0133] In the graph embedding process of PBG, a parameter vector is used to represent each node and its relationships. During training, stochastic gradient descent is used to update the node representation vectors. Nodes with connections will have the highest evaluation score, while unconnected nodes will have the lowest. Simultaneously, PBG introduces negative sampling techniques to generate relationships between unconnected nodes through two methods (uniform negative sampling and data distribution negative sampling). These negatively sampled edges also participate in the model optimization.
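The two negative sampling strategies can be illustrated with a small stand-alone sketch. It does not call PyTorch-BigGraph; `sample_negatives` is a hypothetical helper, and rejecting candidates that are already known edges is an assumption made here for clarity. Uniform sampling draws the corrupted destination from all nodes, while data-distribution sampling draws it from the destinations that appear in the observed edges, so frequently linked nodes are picked more often.

```python
import random
from typing import List, Sequence, Set, Tuple

Edge = Tuple[str, str, str]  # (source node, relation type, destination node)


def sample_negatives(edge: Edge, pool: Sequence[str], known: Set[Edge], k: int,
                     rng: random.Random, max_tries: int = 1000) -> List[Edge]:
    """Corrupt the destination of `edge` with nodes drawn from `pool`."""
    src, rel, _ = edge
    negatives: List[Edge] = []
    tries = 0
    while len(negatives) < k and tries < max_tries:
        tries += 1
        candidate = (src, rel, rng.choice(pool))
        # Skip self-loops and edges that actually exist (assumption of this sketch).
        if candidate[2] != src and candidate not in known:
            negatives.append(candidate)
    return negatives


edges = [("u1", "follow", "u2"), ("u2", "follow", "u3"), ("u3", "like", "u2")]
nodes = ["u1", "u2", "u3", "u4"]
known = set(edges)
rng = random.Random(0)

# Uniform negative sampling: every node is equally likely.
uniform_neg = sample_negatives(edges[0], nodes, known, k=2, rng=rng)

# Data-distribution negative sampling: draw from observed destinations.
dst_pool = [dst for _, _, dst in edges]
data_neg = sample_negatives(edges[0], dst_pool, known, k=2, rng=rng)
```

During training, such corrupted edges would be scored lower than the true edges, which is what pushes connected nodes toward high scores and unconnected nodes toward low ones.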

[0134] 2.2.2 User Attribute Data Mining

[0135] For user attribute data, TabNet is used to mine linear relationships. User account information is essentially heterogeneous tabular data, characterized by dense numerical features and sparse categorical features. Although DNN architectures have been successful on image and language tasks, they have proved less effective at extracting features from heterogeneous tabular data and require significant computational and storage resources. TabNet is a high-performance, interpretable, canonical deep learning architecture for tabular data that uses sequential attention to select the important features at each decision step. Furthermore, TabNet can process raw tabular data without any preprocessing and is trained with a gradient-descent-based optimization strategy, which allows flexible integration into end-to-end learning and effective feature extraction from heterogeneous data.

[0136] Specifically, as shown in Figure 3, in TabNet, the Sequential Attention in each decision step includes two important operations: (1) using the Attentive Transformer to select the most important features for processing in the next step; and (2) using the Feature Transformer to process the features into more useful representations, iterating sequentially. Finally, the output of the Feature Transformer is used as the final output for prediction. After the account data representation features are mined and generated by TabNet, the low-dimensional feature representation P of user attributes is obtained after passing through a fully connected layer and activation function. The calculation method is as follows:

[0137] P=ReLU(FC(TabNet(Property))) (4)
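The projection in Formula (4) can be sketched in NumPy as shown below. TabNet itself is not implemented here: the array tabnet_out is a random stand-in for TabNet(Property), and the batch size, the 64-dimensional TabNet output, and the 32-dimensional projection are assumptions chosen only for illustration.

```python
import numpy as np

def project_attributes(tabnet_out, weight, bias):
    """Apply the fully connected layer and ReLU: P = ReLU(FC(tabnet_out))."""
    return np.maximum(tabnet_out @ weight + bias, 0.0)

rng = np.random.default_rng(0)
tabnet_out = rng.normal(size=(4, 64))          # stand-in for TabNet(Property): 4 users, 64 dims (assumed)
weight = rng.normal(scale=0.1, size=(64, 32))  # FC weight; the 32-dim output is an assumption
bias = np.zeros(32)
P = project_attributes(tabnet_out, weight, bias)
```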

[0138] 2.2.3 Text Content Representation Learning

[0139] After the text content sequence Content = [Content_1, Content_2, ..., Content_n] is received, RoBERTa is used for word embedding to extract the semantics of each tweet. RoBERTa is an enhanced, more carefully tuned version of BERT and has been applied effectively to numerous natural language processing tasks. The embedding process is as follows:

[0140] Embedding_i = RoBERTa(Content_i) (5)

[0141] After word embedding, the sequence E of word vector matrices arranged chronologically is defined as follows:

[0142] E = [Embedding_1, Embedding_2, ..., Embedding_n] (6)

[0143] Each Embedding_i is a 100*768 matrix. The present invention then performs column-wise average pooling on every word embedding matrix in the sequence E, compressing the word vector matrix of each tweet text into a 1*768 text content vector while maintaining the temporal order of these vectors, thus forming a text content semantic vector sequence. Finally, a simplified CNNnet is used to mine the contextual semantics across the text content semantic vector sequence and to extract the overall text semantic feature representation C of the user context, completing feature extraction at the semantic level of the user's tweet content. The calculation method is as follows:

[0144] C=avgpool(Conv2d(avgpool(concatenation(E)))) (7)

[0145] Here, concatenation(•) means concatenating vectors, and Conv2d and avgpool represent convolution and average pooling operations, respectively.
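The pipeline in Formula (7) can be sketched in NumPy as follows. This is only one possible reading: the convolution is written as a set of kernels that each span k consecutive tweets and the full 768-dimensional width, and the kernel count, the kernel height k, and the function name are assumptions, since the text does not give the CNNnet hyperparameters.

```python
import numpy as np

def text_feature(E, kernels):
    """Sketch of C = avgpool(Conv2d(avgpool(concatenation(E)))).

    E       : list of n word-embedding matrices, each (tokens, hidden)
    kernels : (filters, k, hidden) kernels spanning k consecutive tweets
    """
    # Column-wise average pooling: each (tokens, hidden) matrix -> one hidden-dim vector.
    seq = np.stack([m.mean(axis=0) for m in E])   # (n, hidden), time order kept
    k = kernels.shape[1]
    n = seq.shape[0]
    # Valid convolution along the time axis; each kernel covers the full hidden width.
    conv = np.stack([
        np.tensordot(kernels, seq[i:i + k], axes=([1, 2], [0, 1]))
        for i in range(n - k + 1)
    ])                                            # (n - k + 1, filters)
    return conv.mean(axis=0)                      # average pooling over time -> (filters,)

rng = np.random.default_rng(0)
E = [rng.normal(size=(100, 768)) for _ in range(5)]   # 5 tweets, 100 tokens each
kernels = rng.normal(scale=0.01, size=(8, 3, 768))    # 8 filters over 3 tweets (assumed)
C = text_feature(E, kernels)
```

The first pooling removes the token axis, so the convolution only has to model how consecutive tweets relate to each other.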

[0146] 2.3 Multimodal Feature Fusion

[0147] After the low-dimensional feature representations (T, P, C) of the user data are obtained from the different modalities, they must be fused effectively so that the heterogeneous information is complementary and the overall semantic differences between human users and complex virtual social robots can be captured comprehensively, which improves detection performance. Therefore, as shown in Figure 2, a high-quality overall semantic representation of the user is generated by fusing the low-dimensional feature representations according to the correlation between modalities and the importance of modalities at different scales. Specific implementation details are as follows:

[0148] 2.3.1 Cross-Fusion Attention Mechanism

[0149] The cross-fusion attention mechanism proposed in this invention generates relation weights by calculating the correlation between pairwise modal feature vectors. These relation weights are then used to fuse the multimodal features pairwise, thereby generating a more closely related multimodal semantic representation of users. The fusion of the social structure modality T and the attribute modality P is used below as an example of the pairwise fusion process.

[0150] First, the features T and P are each linearly transformed by a linear function and then multiplied to generate an interaction matrix space for the pair (T, P). Then, the correlation coefficients W_TP and W_PT between the original features and the transformed features are calculated. The calculation formulas are as follows:

[0151]

[0152]

[0153]

[0154] Here, w_P and w_N are learnable parameters, and · denotes the dot product. Then W_TP and W_PT are input into a softmax function to obtain the final relation weights W′_TP and W′_PT, and the fused feature representation TP is generated from these weights:

[0155] W′_TP, W′_PT = softmax(W_TP, W_PT) (11)

[0156] TP = W′_TP · T + W′_PT · P (12)
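Formulas (11) and (12) can be sketched in NumPy as follows. The sketch assumes that the two correlation coefficients are already available as scalars for one user and that T and P have the same dimension; how the coefficients are computed is not reproduced here, and the function names are hypothetical.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def cross_fuse(T, P, w_tp, w_pt):
    """Normalise the two correlation scores with softmax, then mix T and P."""
    w_tp_n, w_pt_n = softmax(np.array([w_tp, w_pt]))
    return w_tp_n * T + w_pt_n * P

T = np.array([1.0, 0.0, 2.0])
P = np.array([0.0, 4.0, 2.0])
TP = cross_fuse(T, P, w_tp=0.3, w_pt=0.3)   # equal scores give equal weights of 0.5
```

Because the softmax weights sum to 1, the fused vector is a convex combination of the two modality vectors, so a modality with a much higher correlation score dominates the result.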

[0157] Then, to extract deeper implicit correlations from the fused features, TP is further transformed. At the same time, to prevent the vanishing gradient problem, a residual structure combines the transformed features with TP:

[0158] TP′ = TP + BatchNorm1D(Dropout(ReLU(TP))) (13)

[0159] Here, ReLU is the rectified linear unit, which serves as the activation function; Dropout is a regularizer that prevents overfitting; and BatchNorm1D is a normalization function that mitigates the gradient explosion problem.
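The residual structure in Formula (13) can be sketched in NumPy as follows. This is a simplified stand-in for the corresponding deep learning layers: batch normalisation here always uses the statistics of the current batch and has no learnable scale or shift, and the dropout rate of 0.5 is taken from the training settings given later in the text.

```python
import numpy as np

def residual_block(tp, p_drop=0.5, training=False, rng=None, eps=1e-5):
    """TP' = TP + BatchNorm1D(Dropout(ReLU(TP))) for a batch of shape (users, dims)."""
    h = np.maximum(tp, 0.0)                              # ReLU
    if training:
        if rng is None:
            rng = np.random.default_rng()
        keep = rng.random(h.shape) >= p_drop             # inverted dropout mask
        h = h * keep / (1.0 - p_drop)
    # Batch normalisation with batch statistics (no learnable scale or shift).
    h = (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)
    return tp + h                                        # residual connection

rng = np.random.default_rng(0)
batch = rng.normal(size=(8, 16))
tp_prime = residual_block(batch)
```

The skip connection passes TP through unchanged, so gradients can reach earlier layers even when the transformed branch saturates.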

[0160] Finally, after the cross-fusion module, three modal feature representations were generated, namely, text-structure fusion modality CT′, structure-data fusion modality TP′, and data-text fusion modality PC′. These three representations were then concatenated to obtain a preliminary overall semantic fusion representation U for the user, considering pairwise correlations.

[0161] U=concatenation(TP′; PC′; CT′) (14)

[0162] 2.3.2 Attention Module

[0163] After the preliminary semantic representation U of the user is obtained, two attention mechanisms, channel attention and spatial attention, are applied at two scales, namely between fusion modalities and within fusion modalities, to identify the features that help detection performance in the fusion modalities generated by pairwise fusion (i.e., the text-structure, structure-data, and data-text fusion modalities).

[0164] First, in the preliminary fusion representation U of the overall user semantics, which contains the cross-fusion modal information, each fusion modality is treated as a channel and input into the channel attention function to focus on the fusion modalities that are more helpful to detection performance.

[0165] CAM(U)=σ(MLP(avgpool(U))+MLP(maxpool(U))) (15)

[0166] U1 = CAM(U) × U (16)

[0167] Wherein, CAM(·) is the channel attention mechanism; avgpool(·) is average pooling, maxpool(·) is max pooling, MLP(·) is multilayer perceptron, σ(·) is the activation function Sigmoid; U1 represents the channel attention output, which is the product of the channel attention matrix CAM(U) and the user's overall semantic preliminary fusion representation U.
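Channel attention over the three fusion modalities can be sketched in NumPy as follows. The sketch assumes U is arranged as a (channels, dims) array with one row per fusion modality and that the shared multilayer perceptron has one hidden layer of width 2; both are assumptions, since the text does not state the layer sizes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(U, W1, W2):
    """CAM(U) = sigmoid(MLP(avgpool(U)) + MLP(maxpool(U))), then re-weight U.

    U is (channels, dims): one row per fusion modality.
    """
    def mlp(v):
        return np.maximum(v @ W1, 0.0) @ W2        # shared two-layer perceptron

    cam = sigmoid(mlp(U.mean(axis=1)) + mlp(U.max(axis=1)))   # one weight per channel
    return cam, cam[:, None] * U                   # attention weights and output U1

rng = np.random.default_rng(0)
U = rng.normal(size=(3, 16))                       # 3 fusion modalities, 16 dims (assumed)
W1 = rng.normal(scale=0.1, size=(3, 2))
W2 = rng.normal(scale=0.1, size=(2, 3))
cam, U1 = channel_attention(U, W1, W2)
```

Each channel receives a single weight between 0 and 1, so the mechanism scales whole fusion modalities up or down rather than individual features.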

[0168] Then, a spatial attention mechanism is used to identify features in each fusion modality that contribute to improving detection performance:

[0169] SAM(U1) = σ(f_7*7([avgpool(U1), maxpool(U1)])) (17)

[0170] U2 = SAM(U1) × U1 (18)

[0171] Here, U1 is the channel attention output, i.e., the product of the channel attention matrix CAM(U) and the preliminary fusion representation U of the user's overall semantics; U2 is the spatial attention output, i.e., the product of the spatial attention matrix SAM(U1) and U1. In both channel and spatial attention, σ is the Sigmoid activation function, avgpool and maxpool are average pooling and max pooling respectively, MLP is a multilayer perceptron, and f_7*7 is a convolution with a 7*7 kernel.
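The spatial attention step can be sketched in NumPy as follows. Several points are assumptions: each fusion modality is taken to be reshaped into an H x W map so that a 7*7 kernel applies, both pooled maps are computed from the channel attention output U1 (as in the standard convolutional block attention design), and zero padding keeps the map size unchanged.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(U1, kernel):
    """Spatial attention map from channel-wise average and max pooling.

    U1     : (channels, H, W) feature map
    kernel : (2, k, k) convolution weights over the two stacked pooled maps
    """
    pooled = np.stack([U1.mean(axis=0), U1.max(axis=0)])        # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))   # zero padding keeps H x W
    H, W = U1.shape[1:]
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    sam = sigmoid(out)
    return sam, sam[None] * U1                                  # attention map and output U2

rng = np.random.default_rng(0)
U1 = rng.normal(size=(3, 8, 8))                  # 3 fusion modalities as 8x8 maps (assumed)
kernel = rng.normal(scale=0.05, size=(2, 7, 7))
sam, U2 = spatial_attention(U1, kernel)
```

The attention map has one value per position and is shared by all channels, which complements the per-channel weights from channel attention.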

[0172] After the attention calculation is completed, to prevent network degradation as the number of layers increases, the original features, i.e., the preliminary fusion representation U of the user's overall semantics, and the spatial attention output U2 are input into a gated residual mechanism, which selects the features that are more conducive to detection after feature fusion. The result is used as the final overall semantic representation U' of the user, and its formula is as follows:

[0173] Z = sigmoid(W[U,U2] + b) (19)

[0174] U'=tanh(U)⊙Z+U2⊙(1-Z) (20)

[0175] Where ⊙ represents the Hadamard product operation. Z is the user semantic vector after nonlinear transformation by the sigmoid function, where sigmoid(·) is the activation function, W is the learnable weight, b is the learnable parameter, and tanh(·) is the hyperbolic tangent function.
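Formulas (19) and (20) can be sketched in NumPy as follows. The sketch assumes U and U2 are flattened to vectors of the same length d and that [U, U2] denotes their concatenation, so that W has shape (d, 2d); these shapes are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(U, U2, W, b):
    """Z = sigmoid(W[U, U2] + b); U' = tanh(U) * Z + U2 * (1 - Z)."""
    Z = sigmoid(W @ np.concatenate([U, U2]) + b)   # gate with one value per feature
    return np.tanh(U) * Z + U2 * (1.0 - Z)         # element-wise (Hadamard) mixing

rng = np.random.default_rng(0)
d = 6
U = rng.normal(size=d)
U2 = rng.normal(size=d)
W = rng.normal(scale=0.1, size=(d, 2 * d))
b = np.zeros(d)
U_final = gated_fusion(U, U2, W, b)
```

When a gate value is close to 1 the output follows the original features, and when it is close to 0 the output follows the attention-refined features.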

[0176] 2.4 Model Output

[0177] The final user's fused overall semantic features U' are input into a fully connected layer to achieve binary classification detection. Dropout prevents overfitting, and softmax is used as the activation function to achieve the final classification. The calculation formula is as follows:

[0178] y_out = Dropout(FC(U')) (21)

[0179] ŷ = softmax(y_out) (22)

[0180] Here, y_out is the final overall semantic vector of the user, and ŷ is the prediction result for the user category.

[0181] The optimization objective of the model is to minimize the cross-entropy loss function. The classification loss function is defined by the negative log-likelihood of the correct label, and its definition is as follows:

[0182]

[0183] Here, ε is the cross-entropy loss, and y is the user's true label, which is 1 when the user is human and 0 otherwise.
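The output layer and the loss can be sketched together in NumPy as follows. The sketch shows inference mode, so dropout is disabled, and it averages the negative log-likelihood over the batch; the averaging and the function name are assumptions, since the loss formula itself is not reproduced in this text.

```python
import numpy as np

def predict_and_loss(U_final, W_fc, b_fc, y):
    """Fully connected layer, softmax, and cross-entropy for labels y (1 = human, 0 = bot)."""
    y_out = U_final @ W_fc + b_fc                           # two logits per user
    y_out = y_out - y_out.max(axis=1, keepdims=True)        # numerical stability
    y_hat = np.exp(y_out) / np.exp(y_out).sum(axis=1, keepdims=True)
    loss = -np.log(y_hat[np.arange(len(y)), y]).mean()      # negative log-likelihood of the true label
    return y_hat, loss

rng = np.random.default_rng(0)
U_final = rng.normal(size=(3, 6))                 # 3 users, 6-dim fused features (assumed)
W_fc = rng.normal(scale=0.1, size=(6, 2))
b_fc = np.zeros(2)
y = np.array([1, 0, 1])
y_hat, loss = predict_and_loss(U_final, W_fc, b_fc, y)
```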

[0184] 2.5 Model Training Process

[0185] This invention uses the constructed Weibo23 virtual social bot detection dataset and the Twibot20 dataset to train the parameters of the Weibo version and the Twitter version of the model, respectively. Specifically, this invention first uses the RoBERTa-wwm-ext model, released by the Joint Laboratory of HIT and iFLYTEK Research (HFL), to extract the initial semantic vectors of user text. Then, it uses PyTorch-BigGraph, released by Facebook AI Research, to extract the vectors of heterogeneous relationship nodes, while encoding the initial user attributes into vector form. The generated user vectors are then input into the other components of the model for parameter tuning. During tuning, the learning rate is set to 1e-5, the weight decay to 1e-5, Dropout to 0.5, the number of training epochs to 64, and the batch size to 32. After training, the model with the highest accuracy on the validation set is selected as the final tuned version.
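The training settings and the model selection rule can be captured in a short sketch. The hyperparameter values are those stated above; the class name, the function name, and the validation curve are hypothetical and serve only to show the selection rule.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    learning_rate: float = 1e-5
    weight_decay: float = 1e-5
    dropout: float = 0.5
    epochs: int = 64
    batch_size: int = 32

def select_best_epoch(val_accuracy_per_epoch):
    """Return (epoch_index, accuracy) of the checkpoint with the highest validation accuracy."""
    best = max(range(len(val_accuracy_per_epoch)),
               key=val_accuracy_per_epoch.__getitem__)
    return best, val_accuracy_per_epoch[best]

cfg = TrainConfig()
# Hypothetical validation curve, for illustration only.
history = [0.81, 0.86, 0.90, 0.89]
best_epoch, best_acc = select_best_epoch(history)
```

If several epochs tie, this version keeps the earliest one; the text does not say how ties are resolved.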

[0186] 3. Experiment

[0187] This invention designed five experiments to evaluate the effectiveness of the proposed virtual social robot detection method. All experiments were conducted on a server equipped with an Intel(R) Xeon(R) E5410 CPU, three NVIDIA Tesla V100 GPUs with 32GB of VRAM, and 128GB of RAM. The datasets used were the Weibo23 dataset and the Twibot20 dataset collected in this project. In the experiments, the Weibo23 dataset was divided into 60% training set, 20% validation set, and 20% test set. Following the suggestion of the Twibot20 publisher, the Twibot20 dataset was divided into 70% training set, 20% validation set, and 10% test set. Each experiment was repeated 10 times, and the average value was used as the final result.
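The split proportions can be sketched as follows. The sketch assumes a seeded random shuffle and a toy dataset of 1000 users; the text does not describe how users were assigned to the splits, so both are assumptions.

```python
import random

def split_indices(n, train, val, seed=0):
    """Shuffle range(n) and cut it into train/validation/test parts by fraction."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = round(n * train)
    n_val = round(n * val)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# Toy size of 1000 users; the real dataset sizes are not restated here.
weibo_train, weibo_val, weibo_test = split_indices(1000, 0.6, 0.2)   # 60/20/20
twi_train, twi_val, twi_test = split_indices(1000, 0.7, 0.2)         # 70/20/10
```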

[0188] 3.1 Performance Analysis

[0189] Nine advanced virtual social robot detection methods proposed in recent years were selected as benchmark methods and compared with the detection method proposed in this invention through different experiments. Their characteristics are as follows:

[0190] (1) Yang et al. A highly generalizable and scalable method for detecting virtual social bots based on eight types of user metadata and twelve derived handcrafted features.

[0191] (2) Abreu et al. A virtual social bot detection model based on reduced feature sets, using only user profile attribute data.

[0192] (3) Rodríguez-Ruiz et al. A one-class classification method for detecting virtual social robots that extracts features from content, time, and attributes.

[0193] (4) BotHunter. A method for detecting virtual social robots based on user attributes, relationships, content and time information, using random forest as a classifier.

[0194] (5) LOBO. LOBO uses feature engineering to extract 19 features from user attributes and tweet content, which can effectively capture virtual social bots in botnets.

[0195] (6) Miller et al. A method that uses 107 handcrafted features and improved stream clustering algorithms to detect virtual social bots.

[0196] (7) DeeProBot. A social robot detection model based on the LSTM-MLP architecture that simultaneously captures the differences in profile attributes and content between human users and virtual robot users.

[0197] (8) SATAR. A self-supervised virtual social robot representation learning framework that combines semantic, attribute, and relational information and uses a collaborative attention mechanism to aggregate this information.

[0198] (9) RGA. A Sina Weibo virtual robot detection model based on the ResNet-BiGRU-Attention architecture that extracts the differences between virtual social robots and humans in user attributes, time, relationships, and content.

[0199] The experimental results are shown in Table 2. From the table, we can see that:

[0200] (1) The detection method proposed in this invention outperforms most current state-of-the-art virtual social robot detection methods on both the Weibo23 and Twibot20 datasets, demonstrating its superior generalization ability. Specifically, on Weibo23, the accuracy and F1 score of this invention are 1.8% and 2% higher, respectively, than those of the second-best method. On Twibot20, they are 3% and 2% higher, respectively, than those of the second-best method, indicating that the model of this invention performs better in both accuracy and balance. Furthermore, although the methods of Miller et al. and Rodríguez-Ruiz et al. achieve better recall than the model of this invention, their accuracy and F1 score are significantly lower.

[0201] (2) Methods using more comprehensive modal information, such as RGA, SATAR, and the detection method proposed in this invention, outperform other methods with relatively simple modal information in terms of detection accuracy and balance. Furthermore, the method of this invention performs better than SATAR and RGA, which also use rich modal information. This indicates that the method of multimodal feature representation learning and fusion can better combine modal information, thereby comprehensively capturing the semantic differences between virtual social robots and humans.

[0202] Table 2. Performance of different virtual social bot detection methods on the Weibo23 and Twibot20 datasets.

[0203]

[0204]

[0205] In summary, the model of this invention can better integrate information from different modalities and has excellent generalization ability and detection accuracy.

[0206] 3.2 Feature Validity Analysis

[0207] To evaluate the contribution of the three categories of modal information used in this invention (user attribute features, content features, and social relationship structure features) to the proposed detection model, feature ablation experiments were conducted on seven feature sets, namely the full feature set and its six subsets, as shown in Table 3.

[0208] Table 3 Feature Set Description

[0209]
Feature set | Included feature categories
Full feature set | Structure, content, attributes
Full feature set \ attributes | Structure, content
Full feature set \ content | Structure, attributes
Full feature set \ structure | Content, attributes
Full feature set \ attributes and content | Structure
Full feature set \ attributes and structure | Content
Full feature set \ content and structure | Attributes
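The feature sets in Table 3 are exactly the non-empty combinations of the three modalities, which can be enumerated with a short sketch. The function name and the ordering (largest sets first) are choices made here for illustration.

```python
from itertools import combinations

MODALITIES = ("structure", "content", "attributes")

def ablation_feature_sets(modalities=MODALITIES):
    """All non-empty modality combinations, largest first."""
    sets = []
    for size in range(len(modalities), 0, -1):
        sets.extend(combinations(modalities, size))
    return sets

feature_sets = ablation_feature_sets()
```

The list starts with the full feature set, followed by the three pairs and the three single modalities.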

[0210] Table 4. Experimental results of modal feature ablation using the method of the present invention.

[0211]

[0212] Table 4 presents the feature ablation results on two benchmark datasets. The results show that, on both datasets, combining all three modalities simultaneously outperforms single or pairwise combinations in both accuracy and F1 score. This indicates that combining multiple modalities can effectively improve the performance of virtual social bot detection methods. Furthermore, it can be seen that the method of this invention experienced the greatest performance drop after text information ablation on Weibo23, and the greatest drop after account information ablation on Twibot20. This also demonstrates that different modalities have varying importance in identifying virtual social bots in different datasets, thus proving the effectiveness of combining different modalities.

[0213] 3.3 Structural Effectiveness Analysis

[0214] The detection method proposed in this invention uses three representation learning modules to extract low-dimensional feature vectors from the text, attribute data, and social structure modalities, respectively. It then applies the cross-fusion and attention mechanisms in the feature fusion module to obtain the overall semantic representation of the user, effectively improving the detection performance for virtual social robots. Specifically, the heterogeneous data mining module uses the TabNet component, the text embedding module uses a simplified CNNnet to learn from the RoBERTa-embedded tweet vector sequence, and the feature fusion module uses the cross-fusion and attention modules. To further illustrate the necessity of each component, different components of the detection model were removed and the ablation results were observed. Since the social relationship graph embedding module involves no other structures after the node representation vectors are obtained, its role has already been verified in the feature ablation. Therefore, the ablation settings of the key components explored, shown in Table 5, mainly cover TabNet, CNNnet, the cross-fusion module, and the attention module.

[0215] Table 5 Ablation experiment results of the detection model architecture proposed in this invention.

[0216]

[0217] The ablation experiment results for each component of the detection model proposed in this invention are shown in Table 5. First, after TabNet was removed from the heterogeneous data mining module, the overall performance on the Weibo23 dataset decreased, while the accuracy and F1 score on the Twibot20 dataset decreased significantly, by approximately 7%, demonstrating the effectiveness of TabNet in the heterogeneous data mining module. Second, removing the simplified CNNnet from the text data mining module caused a certain decrease in all metrics on both datasets, indicating that CNNnet can effectively mine the contextual semantic consistency of tweet embedding sequences. Then, for the cross-fusion module and the attention module, the absence of either caused a more significant decrease in precision and recall, whereas their simultaneous presence improved all metrics. This indicates that combining the two modules makes better use of the correlations between modalities to mine the differences between virtual social robots and humans, thereby improving detection performance. In summary, each component of the detection method proposed in this invention contributes to the final detection performance.

[0218] 3.4 Evolutionary Adaptability Analysis

[0219] Studies have shown that the ever-evolving, complex virtual social bots that seek to evade detection cause generalization performance crises in most detection methods. This necessitates that detection methods maintain ideal performance across different timeframes and keep pace with the rapid evolution of bots. Therefore, to demonstrate the adaptability of the proposed method to the evolutionary phenomenon of virtual social bots and thus better prove the generalization ability of the detection method, this study references the research methods of Feng et al. on the evolutionary adaptability of virtual social bots and investigates the prediction results of the proposed method on the validation set and test set of two datasets. Figures 4(a) and 4(b) show the detection results of the proposed method for bot users on different datasets, as well as the detection accuracy within a three-month user registration period. (a) shows the prediction results on the Weibo23 dataset, and (b) shows the prediction results on the Twibot20 dataset. The scatter plot represents the prediction results for bot users, and the line represents the accuracy of bot capture within a three-month user registration period.
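The per-period capture accuracy described above can be sketched as follows. The sketch assumes that a three-month registration period means a calendar quarter and that each record pairs a true bot account's registration date with whether the model flagged it; the sample records are hypothetical.

```python
from collections import defaultdict
from datetime import date

def quarterly_capture_accuracy(records):
    """Group bot accounts by registration quarter and return the share detected in each."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for registered, predicted_bot in records:
        key = (registered.year, (registered.month - 1) // 3 + 1)
        totals[key] += 1
        hits[key] += int(predicted_bot)
    return {k: hits[k] / totals[k] for k in sorted(totals)}

# Hypothetical (registration date, predicted-as-bot) pairs for true bot accounts.
records = [
    (date(2019, 1, 5), True),
    (date(2019, 2, 17), False),
    (date(2019, 5, 9), True),
    (date(2020, 12, 31), True),
]
acc = quarterly_capture_accuracy(records)
```

Plotting these values against the quarter gives a line comparable to the accuracy curve described for Figure 4.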

[0220] The results show that the detection method proposed in this invention can effectively capture most robot users from different registration periods in the Weibo23 and Twibot20 datasets, and maintains a relatively stable detection accuracy over a three-month user registration period. Specifically, for all robot users registered in Weibo23 between 2009 and 2023, the capture accuracy of the method consistently remained around 0.9 with fluctuations within 0.05 for most of the period. For all robot users registered in Twibot20 between 2006 and 2020, the capture accuracy of the method also remained above 0.9 for most of the period. These results further demonstrate the adaptability of the proposed method to evolving virtual social robots and further prove the generalization ability of the detection method.

[0221] 3.5 Transferability Analysis

[0222] Research indicates that different types of virtual social bots distributed across different user domains exhibit distinct behavioral patterns and characteristics. This necessitates that virtual social bot detection methods effectively capture different types of virtual social bots across various domains, rather than targeting a specific type within a particular domain. Therefore, to demonstrate that the detection method proposed in this invention can effectively capture different types of virtual social bots, thereby further proving its generalization ability, this embodiment conducts cross-domain transfer experiments on the Twibot20 dataset, which has four user domain labels. Specifically, the method is trained on data from one user domain and then tested in other domains. In this experiment, three baseline methods with competitive performance on Twibot20—Yang et al.'s method, RGA, and SATAR—along with the method proposed in this invention, were selected. Cross-domain transfer experiments were conducted in the political, business, entertainment, and sports domains, respectively. The experimental results are shown in Figure 5.
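The cross-domain protocol (train on one user domain, test on each of the others) can be sketched as a small driver loop. The callables train_fn and eval_fn and the domain keys are placeholders supplied by the caller; they are hypothetical and stand in for whatever training and evaluation code is used.

```python
from itertools import permutations

DOMAINS = ("politics", "business", "entertainment", "sports")

def transfer_pairs(domains=DOMAINS):
    """Ordered (train_domain, test_domain) pairs with distinct domains."""
    return list(permutations(domains, 2))

def run_transfer(train_fn, eval_fn, data_by_domain):
    """Train once per source domain and evaluate on every other domain."""
    results = {}
    for source in data_by_domain:
        model = train_fn(data_by_domain[source])
        for target in data_by_domain:
            if target != source:
                results[(source, target)] = eval_fn(model, data_by_domain[target])
    return results

pairs = transfer_pairs()
```

With four domains this yields twelve directed transfer settings, and each source model is trained only once and reused for its three targets.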

[0223] Experimental results show that, compared to the competitive baseline methods, the detection method proposed in this invention maintains better detection accuracy when transferring between different domains. Specifically, the proposed method achieves the highest detection accuracy when transferring from each of the four domains to the other domains, and is at least 3% more accurate than the second-best method. Furthermore, the performance of a detection method varies significantly across domain pairs. For example, RGA's accuracy in the business-to-political transfer is approximately 4% lower than its accuracy in the business-to-entertainment transfer. This demonstrates that the generalization ability of virtual social bot detection methods is affected by differences in the categories of virtual social bots. In summary, the above results further demonstrate that the detection method of this invention has better generalization ability and stability, and can effectively capture virtual social bots distributed across different user domains.

[0224] Based on the above experiments, the virtual social robot detection model proposed in this invention outperforms most current advanced virtual social robot detection methods in terms of performance, and exhibits better adaptability to evolving virtual social robots and cross-robot category transfer capabilities, i.e., good generalization ability. Furthermore, the selection of various components and modal features in the model contributes to the final detection performance. Therefore, the virtual social robot detection model of this invention has achieved excellent results in the problem of virtual social robot detection.

Claims

1. An end-to-end virtual social robot detection method based on multimodal information fusion, characterized in that it includes the following steps:
Step 1: Based on a publicly available dataset, construct a virtual social robot detection research dataset by expanding the data scale and information dimensions;
Step 2: Construct a virtual social robot detection model; based on the TabNet, CNNnet, RoBERTa and PyTorch-BigGraph models, extract modal feature vectors from heterogeneous relationship data, user attribute data and text content data respectively to form a low-dimensional feature representation set of user data;
Step 3: Propose a cross-fusion attention mechanism, which generates relation weights by calculating the correlation between pairwise modal feature vectors, and uses the relation weights to fuse the multimodal features in pairs to generate a preliminary overall semantic representation of the user;
Step 4: Based on multiple attention mechanisms, including the channel attention mechanism, the spatial attention mechanism, the gating mechanism and the cross-fusion attention mechanism, explore the correlation and importance between the features of each modality, and select effective features from the preliminary semantic representation of the user to realize the classification of virtual social robots and humans;
Step 2 specifically involves:
Step 2.1: Define the user original input data as a triple User = <Topology, Property, Content>, wherein Topology represents the heterogeneous relationship data, Property represents the user attribute data, and Content represents the text content data; the initial construction of each type of data is as follows:
Heterogeneous relationship data: the heterogeneous relationship data represents users' social and interactive relationships and is denoted as a triple whose elements represent, respectively, the nodes in the network, the existence of a connection between two nodes, and the type of relationship to which the connection belongs;
User attribute data: two encoding methods are used for the profile data in the user attribute data: count data is encoded in count form, and 0-1 type data is encoded in Boolean form; the user attribute data is represented as a sequence in which each element represents a specific profile attribute;
Text content data: all tweets from each user in the dataset are sorted in ascending order of posting time to form a tweet content sequence, which is used for subsequent content language feature extraction; the text content data is represented as Content = [Content_1, Content_2, ..., Content_n], wherein Content_i denotes a tweet that the user has posted;
Step 2.2: For the user original input data User, different feature representation models are used to extract the heterogeneous relation dimension feature representation T, the low-dimensional feature representation P of user attributes, and the low-dimensional feature representation C of text content, which make up the low-dimensional feature representation set U0 = [T, P, C] of the user data;
The specific process of pairwise multimodal fusion in Step 3 is as follows:
Step 3.1: Using linear functions, linearly transform the heterogeneous relation dimension feature representation T and the low-dimensional feature representation P of user attributes, then multiply them to generate an interaction space, wherein w_P and w_N are learnable parameters and · represents the dot product;
Step 3.2: Calculate the correlation coefficients W_TP and W_PT between the original features and the transformed features, respectively;
Step 3.3: Input W_TP and W_PT into a softmax function to obtain the final relation weights W′_TP and W′_PT, and generate the fused feature representation TP based on these weights: W′_TP, W′_PT = softmax(W_TP, W_PT); TP = W′_TP · T + W′_PT · P; wherein softmax is the activation function;
Step 3.4: Further transform TP to extract deeper implicit correlations from the fused features; to prevent gradient vanishing, a residual structure is used to combine the two, resulting in the structure-data fusion modality TP′: TP′ = TP + BatchNorm1D(Dropout(ReLU(TP))); wherein ReLU is the rectified linear function, Dropout is the regularizer, and BatchNorm1D is the normalization function;
Step 3.5: Similarly, generate the text-structure fusion modality CT′ and the data-text fusion modality PC′, so that the feature representations of the three modalities are fused pairwise, and then concatenate the three to obtain a preliminary fusion representation U of the overall user semantics that takes pairwise correlations into account: U = concatenation(TP′; PC′; CT′).

2. The end-to-end virtual social robot detection method based on multimodal information fusion according to claim 1, characterized in that Step 1 specifically involves:
Step 1.1: Expand the missing information dimensions of the existing Sina Weibo research dataset SWBD-20K to include some user attributes, historical tweets, and social and interaction relationship data;
Step 1.2: Supplement the dataset with the latest virtual social robot data based on current hot topics, and expand the labeled scale of the dataset through data annotation;
Step 1.3: Construct the Weibo23 virtual social robot detection dataset.

3. The end-to-end virtual social robot detection method based on multimodal information fusion according to claim 1, characterized in that Step 2.2 specifically includes:
Step 2.2.1: Heterogeneous relation embedding. After receiving the user's network topology, the graph embedding algorithm PyTorch-BigGraph generates a vector representation for each node based on the current structure, where the heterogeneous relation dimension feature representation is defined as... The entire embedding process is defined as follows: [formula]; During graph embedding, a parameter vector represents each node and each relation. During training, the node representation vectors are updated using stochastic gradient descent, while uniform negative sampling and data-distribution negative sampling are used to generate relations between unconnected nodes.
Step 2.2.2: User attribute data mining. TabNet is used to mine the linear relationships in the user attribute data. The account data representation features produced by TabNet are passed through a fully connected layer and an activation function to obtain the low-dimensional feature representation of the user attributes. The calculation is as follows: [formula]; where the activation is a linear rectification function (ReLU), which performs the linear rectification;
Step 2.2.3: Text content representation learning. Upon receiving the text content data, RoBERTa is used for word embedding to mine the semantics of each tweet; the embedding process is as follows: [formula]; After word embedding, the word vector matrices are arranged chronologically into a sequence, defined as: [formula]; where each element is the word vector matrix embedded for one tweet. Every word embedding matrix in the sequence is average-pooled column by column, compressing the word vector matrix of each user text into a 1×768 text content vector while preserving the temporal order of the text content vectors, to form a text content semantic vector sequence.
A simplified CNNnet is then used to mine the contextual semantics across the sequence of text content semantic vectors, extracting the feature representation of the user's overall contextual semantics, i.e., the low-dimensional feature representation of the text content, which completes feature extraction at the semantic level of the user's tweet content; the calculation is as follows: [formula]; where the formula uses vector concatenation together with convolution and average pooling operations.
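The column-wise average pooling in Step 2.2.3 can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the function names are hypothetical, the toy matrices use 3 columns in place of the 768-dimensional RoBERTa embeddings, and each tweet matrix is assumed to be non-empty.

```python
def mean_pool_columns(word_matrix):
    # Average a (num_words x dim) matrix column by column into one vector of length dim.
    num_words = len(word_matrix)
    return [sum(column) / num_words for column in zip(*word_matrix)]

def build_text_sequence(tweet_matrices):
    # Pool every tweet matrix; list order (assumed chronological) is preserved.
    return [mean_pool_columns(matrix) for matrix in tweet_matrices]

# Hypothetical toy data: two tweets, embedding dimension 3 instead of 768.
tweets = [
    [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]],                   # tweet 1: 2 words
    [[0.0, 0.0, 6.0], [3.0, 3.0, 0.0], [3.0, 0.0, 0.0]],  # tweet 2: 3 words
]
sequence = build_text_sequence(tweets)
```

Each tweet collapses to a single fixed-length vector regardless of its word count, so the resulting sequence can be passed to a convolutional layer of fixed input width.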

4. The end-to-end virtual social robot detection method based on multimodal information fusion according to claim 1, characterized in that Step 4 specifically includes:
Step 4.1: The preliminary fusion representation of the user's overall semantics contains the cross-fusion modal information. Each fusion modality is treated as a channel and fed into channel attention to focus on the fusion modalities that contribute more to detection performance: [formula]; [formula]; where the channel attention mechanism applies average pooling and max pooling followed by a multilayer perceptron, with Sigmoid as the activation function; the channel attention output is the product of the channel attention matrix and the preliminary fusion representation of the user's overall semantics;
Step 4.2: Use the spatial attention mechanism to identify the features within each fusion modality that help improve detection performance: [formula]; [formula]; where the spatial attention output is the product of the spatial attention matrix and its input features, and the convolution uses a 7×7 kernel.
Step 4.3: The original features, i.e., the preliminary fusion representation of the user's overall semantics, and the spatial attention output are fed into a residual mechanism that selects the features more conducive to detection after feature fusion; these are used as the user's final fused overall semantic features. The formulas are as follows: [formula]; [formula]; where the formulas use the Hadamard product, the user semantic vector after nonlinear transformation by the activation function, learnable weights, learnable parameters, and the hyperbolic tangent function;
Step 4.4: Model output. The user's final fused overall semantic features are fed into a fully connected layer to perform binary classification detection: [formula]; [formula]; where the input is the user's final overall semantic vector and the output is the predicted user category. The optimization objective of the model is to minimize the cross-entropy loss function.
The classification loss function is defined by the negative log-likelihood of the correct label, as follows: [formula]; where the loss is the cross-entropy loss and y is the user's true label, which is 1 if the user is human and 0 otherwise.
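The channel attention of Step 4.1 and the cross-entropy objective of Step 4.4 can be sketched as below. The formulas themselves are not reproduced in this text, so the sketch assumes the common form in which a shared multilayer perceptron is applied to the average-pooled and max-pooled channel descriptors, the two results are summed, and a Sigmoid produces one weight per channel. The function names, the identity MLP weights, and the toy channel values are hypothetical; the label convention (1 for human, 0 otherwise) follows the claim.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mlp(descriptor, w1, w2):
    # Two-layer perceptron with a ReLU hidden layer (biases omitted for brevity).
    hidden = [max(0.0, sum(w * x for w, x in zip(row, descriptor))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def channel_attention(channels, w1, w2):
    # One descriptor value per channel from average pooling and from max pooling.
    avg_desc = [sum(c) / len(c) for c in channels]
    max_desc = [max(c) for c in channels]
    # Shared MLP on both descriptors, summed, then squashed to (0, 1).
    logits = [a + m for a, m in zip(mlp(avg_desc, w1, w2), mlp(max_desc, w1, w2))]
    weights = [sigmoid(z) for z in logits]
    # Scale every feature of a channel by that channel's weight.
    scaled = [[w * x for x in c] for w, c in zip(weights, channels)]
    return weights, scaled

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Negative log-likelihood of the correct label; y_true is 1 for human, 0 otherwise.
    p = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

# Hypothetical toy data: three fusion modalities (channels), two features each.
identity3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
channels = [[1.0, 3.0], [0.0, 0.0], [-2.0, -4.0]]
weights, scaled = channel_attention(channels, identity3, identity3)
loss_human = binary_cross_entropy(1, 0.9)
```

With these toy weights the first channel receives a weight close to 1 and the other two a weight of 0.5, showing how the gate emphasizes one fusion modality over the others. The loss shrinks as the predicted probability of the true class approaches 1.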