A method, apparatus, and computer device for identifying software development assistance bots

By acquiring data from the open-source community and building a binary classification model, SoftBot accounts can be accurately identified, solving the problem of distinguishing SoftBot accounts from ordinary user accounts and maintaining the fairness and healthy development of the community.

CN118395300B · Active · Publication Date: 2026-06-19 · NAT UNIV OF DEFENSE TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: NAT UNIV OF DEFENSE TECH
Filing Date: 2024-04-19
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

In open-source communities, the increased difficulty in distinguishing between SoftBot accounts and regular user accounts leads to information interference and unfair competition, affecting community order and vitality.

Method used

By acquiring data from the open-source community, including account metadata, profile pictures, comment text, and behavioral data, and utilizing the social interaction balance index, a binary classification model is constructed. Representation vectors are extracted and fused, and the best-performing binary classification model is trained and selected to achieve accurate identification of SoftBot.

Benefits of technology

Accurately identifying SoftBot accounts helps community administrators maintain community order and vitality, prevents the abuse and misuse of SoftBots, and ensures the fairness, impartiality, and healthy development of the community.



Abstract

This invention relates to a method, apparatus, and computer device for identifying software development assistance bots. The method includes: acquiring data from software development assistance bots and human users in an open-source community, including: account metadata, avatar images, comment text, and behavioral data; introducing a social interaction balance index into the numerical features of the account metadata; calculating the social interaction balance index using the number of followers and those followed by the user; extracting account metadata vectors, avatar image vectors, comment text vectors, and behavioral data vectors based on the data to obtain a comprehensive representation vector, which is then split into a training set and a test set; learning the representation vectors in the training set to construct a binary classification model; evaluating the constructed binary classification model in the test set; and selecting the best-performing binary classification model as the identification model for account recognition. This method can accurately identify software development assistance bot accounts.

Description

Technical Field

[0001] This invention relates to the field of robot account recognition technology, and in particular to a software development-assisted robot recognition method, apparatus, and computer equipment.

Background Technology

[0002] In today's digital age, software development bots (SoftBots) are playing an increasingly important role in the software development ecosystem. A SoftBot is an automated program designed to assist software developers in completing development tasks, resolving problems, and providing automated support. It can perform various tasks such as code review, automated testing, issue tracking and management, and real-time communication with developers. The use of SoftBots has become widespread in open-source communities and commercial software development, playing a significant role in improving software development efficiency, promoting collaboration, and enhancing code quality.

[0003] However, with the widespread use of SoftBots, the importance of distinguishing SoftBot accounts from ordinary user accounts has become increasingly prominent. This is because the existence of SoftBots may lead to several problems, such as: (1) Information interference: The large number of automated comments, issue tracking, and merge requests published by SoftBots may interfere with the activities of real users, affecting communication and cooperation within the community. (2) Unfair competition: If SoftBots cannot be effectively distinguished from real users, it may lead to unfair competition in the open-source community, affecting the participation enthusiasm of real users. Therefore, designing an accurate method for identifying SoftBot accounts is of great significance for maintaining the order and vitality of the open-source community. This method can help the open-source community and related platforms effectively manage the activities of SoftBot accounts and take corresponding measures to prevent the abuse and misuse of SoftBots, ensuring the fair, just, and healthy development of the community.

Summary of the Invention

[0004] Therefore, it is necessary to provide a software development-assisted robot recognition method, device, and computer equipment to address the aforementioned technical problems.

[0005] A software-assisted robot recognition method, the method comprising:

[0006] Acquire data from software development bots and human users in the open-source community; the data specifically includes: account metadata, profile pictures, comment text, and behavioral data.

[0007] Account metadata vectors, profile picture vectors, comment text vectors, and behavioral data vectors are extracted based on the data; a social interaction balance index is introduced into the numerical features of the account metadata; the social interaction balance index is calculated using the number of followers and the number of accounts the user follows.

[0008] A comprehensive representation vector is obtained based on account metadata vector, profile picture vector, comment text vector, and behavioral data vector, and then split into training set and test set;

[0009] The representation vectors of the training set are learned to learn the feature representations of software development assistance robots and human users, and a binary classification model is constructed.

[0010] The binary classification models with different hyperparameter settings are trained on the training set, and the constructed binary classification models are evaluated on the test set. The performance of the binary classification models with different hyperparameter settings is compared, and the binary classification model with the best performance is selected as the recognition model.

[0011] The recognition model is used to identify the account to be identified, and the recognition result is output.

[0012] A software development-assisted robot recognition device, the device comprising:

[0013] The data acquisition module is used to obtain data related to the account to be identified through the API of the open-source community platform. The data specifically includes: account metadata, profile picture, comment text, and behavioral data. A social interaction balance index is introduced into the numerical features of the account metadata. The social interaction balance index is calculated based on the number of followers and the number of accounts the user follows.

[0014] The vector extraction module is used to extract account metadata vectors, avatar image vectors, comment text vectors, and behavioral data vectors based on data.

[0015] The vector fusion module is used to obtain a comprehensive representation vector based on account metadata vectors, avatar image vectors, comment text vectors, and behavioral data vectors, and then split it into training and test sets.

[0016] The model construction module is used to learn the representation vectors of the training set to learn the feature representations of software development assistance robots and human users, and construct a binary classification model.

[0017] The model selection module is used to train binary classification models with different hyperparameter settings on the training set, evaluate the constructed binary classification models on the test set, compare the performance of binary classification models with different hyperparameter settings, and select the binary classification model with the best performance as the recognition model.

[0018] The account recognition module is used to identify the account to be identified using a recognition model and output the recognition result.

[0019] A computer device includes a memory and a processor, the memory storing a computer program, and the processor executing the computer program performing the following steps:

[0020] Data is acquired from software development bots and human users in the open-source community. This data includes account metadata, profile pictures, comment text, and behavioral data. A social interaction balance index is introduced into the numerical features of the account metadata; it is calculated using the number of followers and the number of accounts the user follows.

[0021] Account metadata vectors, profile picture vectors, comment text vectors, and behavioral data vectors are extracted based on the data;

[0022] A comprehensive representation vector is obtained based on account metadata vector, profile picture vector, comment text vector, and behavioral data vector, and then split into training set and test set;

[0023] The representation vectors of the training set are learned to learn the feature representations of software development assistance robots and human users, and a binary classification model is constructed.

[0024] The binary classification models with different hyperparameter settings are trained on the training set, and the constructed binary classification models are evaluated on the test set. The performance of the binary classification models with different hyperparameter settings is compared, and the binary classification model with the best performance is selected as the recognition model.

[0025] The recognition model is used to identify the account to be identified, and the recognition result is output.

[0026] The aforementioned method, apparatus, and computer equipment for identifying software development bots comprehensively consider multi-dimensional data from both software development bots and real human users in the open-source community, including account metadata, avatar images, comment text, and behavioral data. The approach extracts and fuses representation vectors, performs feature learning, constructs a binary classification model, and then uses the trained recognition model for account identification. This proposed method can accurately identify software development bot accounts, helping community administrators better maintain the order and vitality of the open-source community.

Attached Figure Description

[0027] Figure 1 is a flowchart illustrating a software-assisted robot recognition method in one embodiment;

[0028] Figure 2 is an internal structural diagram of a computer device in one embodiment.

Detailed Implementation

[0029] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention.

[0030] In one embodiment, as shown in Figure 1, a software-assisted robot recognition method is provided, including the following steps:

[0031] Step 102: Obtain data from software development assistance robots and human users in the open-source community.

[0032] The aforementioned data specifically includes: account metadata, profile picture, comment text, and behavioral data. This data can be obtained through the API of an open-source community platform, such as GitHub.

[0033] The account metadata includes the registered geographical location, account type (user or organization), registration year, number of followers, number of followed projects, personal profile, etc. The comment text includes comment data for issues, commits, and pull requests. The behavioral data includes adding new issues, adding new comments, adding new pull requests, etc.
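As an illustration of the data acquisition step, the sketch below maps a GitHub user record to the account metadata fields listed above. It assumes the public GitHub REST endpoint `GET /users/{username}` and its documented fields (`type`, `location`, `created_at`, `followers`, `following`, `public_repos`, `bio`, `avatar_url`); the function names, the output keys, and the sample record are hypothetical and are not taken from the patent.

```python
import json
from urllib.request import Request, urlopen

API_ROOT = "https://api.github.com"


def fetch_user(login: str) -> dict:
    """Fetch the public profile of one account (performs a network call)."""
    req = Request(f"{API_ROOT}/users/{login}",
                  headers={"Accept": "application/vnd.github+json"})
    with urlopen(req, timeout=10) as resp:
        return json.load(resp)


def to_account_metadata(user: dict) -> dict:
    """Map a GitHub user object to the metadata fields described in the text."""
    created = user.get("created_at") or ""
    return {
        "location": user.get("location"),
        "account_type": user.get("type"),  # e.g. "User" or "Organization"
        "year": int(created[:4]) if created else None,
        "followers": user.get("followers", 0),
        "following": user.get("following", 0),
        "repos": user.get("public_repos", 0),
        "profile": user.get("bio") or "",
        "avatar_url": user.get("avatar_url"),
    }


# Made-up record in the shape of an API response, used instead of a live call.
sample = {"login": "example-user", "type": "User", "location": "Changsha",
          "created_at": "2015-03-02T08:00:00Z", "followers": 40,
          "following": 10, "public_repos": 7, "bio": None,
          "avatar_url": "https://example.invalid/avatar.png"}
metadata = to_account_metadata(sample)
```

Comment text and behavioral data would come from other endpoints (issues, commits, pull requests), which this sketch does not cover.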

[0034] Human users refer to real, ordinary human users, as opposed to bot accounts.

[0035] The acquired profile picture can be saved as a file in a specified directory and stored using a file system. The acquired account metadata, comment text, and behavioral data can be stored in a structured form in a database.

[0036] Furthermore, a social interaction balance index is introduced into the numerical features of the account metadata. It is calculated using the number of followers and the number of accounts the user follows. For example, the balance of social interaction can be measured by calculating the geometric mean of the two counts and comparing it with their sum. The value range is [0,1]. A social interaction balance index value close to 0.5 indicates that the number of followers and the number of accounts followed are roughly equal, while values far from 0.5 indicate an imbalance.
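A minimal sketch of one possible reading of this index: the geometric mean of the two counts divided by their sum. This interpretation is an assumption drawn from the wording above, as is the handling of accounts with no followers and no followees; the function name `sibi` is hypothetical.

```python
import math


def sibi(follow: int, followed: int) -> float:
    """Geometric mean of the two counts divided by their sum (assumed reading)."""
    total = follow + followed
    if total == 0:
        return 0.0  # assumption: an account with no social links scores 0
    return math.sqrt(follow * followed) / total


balanced = sibi(100, 100)    # equal counts give 0.5
one_sided = sibi(1000, 0)    # a fully one-sided account gives 0.0
skewed = sibi(900, 100)      # sqrt(90000) / 1000 = 0.3
```

Under this reading the index peaks at 0.5 for perfectly balanced accounts and falls toward 0 as the two counts diverge.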

[0037] Step 104: Extract account metadata vectors, profile picture vectors, comment text vectors, and behavioral data vectors based on the data.

[0038] Specifically, account metadata vectors are extracted based on account metadata, profile picture vectors are extracted based on profile picture images, comment text vectors are extracted based on comment text, and behavioral data vectors are extracted based on behavioral data.

[0039] Step 106: Obtain a comprehensive representation vector based on the account metadata vector, avatar image vector, comment text vector, and behavioral data vector, and split it into a training set and a test set.

[0040] The above four types of vectors can be combined using the embedding method to form a comprehensive representation vector.

[0041] It should be understood that each account corresponds to exactly one comprehensive representation vector.

[0042] Step 108: Learn the representation vectors of the training set to learn the feature representations of the software development assistance robot and human users, and construct a binary classification model.

[0043] Specifically, the Transformer architecture is used to learn the representation vectors of the training set.

[0044] Step 110: Train binary classification models with different hyperparameter settings on the training set, evaluate the constructed binary classification models on the test set, compare the performance of binary classification models with different hyperparameter settings, and select the binary classification model with the best performance as the recognition model.

[0045] Step 112: Use the recognition model to identify the account to be identified and output the recognition result.

[0046] It should be understood that, although the steps in the flowchart of Figure 1 are shown sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless otherwise specified herein, there is no strict order in which these steps are executed, and they can be performed in other orders. At least some of the steps in Figure 1 may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these sub-steps or stages is not necessarily sequential, but can be executed in turn or alternately with other steps or at least some of the sub-steps or stages of other steps.

[0047] In one embodiment, the method of extracting account metadata vectors, profile picture vectors, comment text vectors, and behavioral data vectors based on data includes:

[0048] 1) The categorical features, numerical features incorporating a social interaction balance index, and textual features integrating words and parts of speech in the account metadata are processed and combined to form an account metadata vector. Specifically, this includes:

[0049] Categorical metadata (including user geolocation and account type) is encoded into numerical form. One-hot encoding is used to map each category to a vector in which only one element is 1 and the rest are 0, resulting in the geolocation vector $V_{\text{location}}$ and the account type vector $V_{\text{account}}$;

[0050] For numerical metadata (including registration year, number of followers, number of followed items, and number of repo items), Min-Max Normalization is used to scale each feature data to the range of [0,1] to ensure uniform scale for different features.

[0051] The Social Interaction Balance Index (SIBI) is calculated by combining the number of followers and the number of accounts the user follows. The formula can be expressed as:

[0052] $\text{SIBI} = \dfrac{\sqrt{follow \times followed}}{follow + followed}$

[0053] Here, follow and followed represent the number of followers and the number of users who are followed, respectively.

[0054] Based on the above feature index values, the numerical metadata vector is obtained:

[0055] $V_{\text{data}} = [year, follow, followed, repo, SIBI]$;

[0056] For text-based metadata (such as personal profiles), the Word2Vec word embedding model using Continuous Bag-of-Words (CBOW) is used to convert the text into a vector. Considering that personal profiles of open-source community users are usually short texts, part-of-speech tagging is introduced to provide richer feature representations. In specific implementations, the part-of-speech tagger of NLTK (Natural Language Toolkit) can be used to tag each word in the personal profile text with the corresponding part of speech (noun, verb, adjective, etc.), and the part-of-speech tag is added as an additional feature to the vector, forming a text-based metadata vector $V_{\text{profile}} = [w_1, c_1, w_2, c_2, \ldots, w_n, c_n]$, where $w$ and $c$ represent the word and its corresponding part of speech, respectively;

[0057] Different metadata feature vectors are concatenated and combined to form an account metadata vector, where ⊕ represents the concatenation operation:

[0058] $V_{\text{meta}} = V_{\text{location}} \oplus V_{\text{account}} \oplus V_{\text{data}} \oplus V_{\text{profile}}$;
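The assembly of the account metadata vector can be sketched as follows: one-hot encoding for categorical fields, Min-Max scaling for numerical fields, and list concatenation for ⊕. The category lists, the value ranges, and the sample account are hypothetical. The SIBI value and the profile text vector are taken as precomputed inputs, since the Word2Vec and part-of-speech steps are not reproduced here.

```python
LOCATIONS = ["CN", "US", "other"]            # hypothetical category list
ACCOUNT_TYPES = ["User", "Organization"]
RANGES = {                                   # hypothetical (min, max) per feature
    "year": (2008, 2024), "follow": (0, 1000),
    "followed": (0, 1000), "repo": (0, 200),
}


def one_hot(value, categories):
    """Vector with a single 1 at the position of the matching category."""
    return [1.0 if value == c else 0.0 for c in categories]


def min_max(value, lo, hi):
    """Scale a value into [0, 1] given the feature's minimum and maximum."""
    return 0.0 if hi == lo else (value - lo) / (hi - lo)


def account_metadata_vector(account, v_profile):
    v_location = one_hot(account["location"], LOCATIONS)
    v_account = one_hot(account["account_type"], ACCOUNT_TYPES)
    v_data = [min_max(account[k], *RANGES[k])
              for k in ("year", "follow", "followed", "repo")]
    v_data.append(account["sibi"])           # SIBI is appended without rescaling
    # Concatenation in the order location, account, data, profile.
    return v_location + v_account + v_data + list(v_profile)


account = {"location": "CN", "account_type": "User", "year": 2016,
           "follow": 250, "followed": 500, "repo": 50, "sibi": 0.47}
v_meta = account_metadata_vector(account, v_profile=[0.1, -0.2])
```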

[0059] 2) Load the profile picture image as a pixel matrix and, after preprocessing such as resizing and normalization, obtain the profile picture image vector $V_{\text{image}}$. The formula for size adjustment can be expressed as:

[0060] new_image=resize(image,new_width,new_height);

[0061] Where image is the original avatar image, new_width and new_height are the expected new dimensions, and new_image is the adjusted avatar image;

[0062] The normalization formula can be expressed as:

[0063] normalized_image=(image-mean) / std;

[0064] Where image is the adjusted profile picture, mean is the average pixel value, std is the standard deviation of pixel values, and normalized_image is the normalized profile picture.
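The two preprocessing formulas can be illustrated on a small grayscale pixel matrix. The text does not name an interpolation method, so nearest-neighbour sampling is an assumption here, as is flattening the result row by row to obtain the image vector; a real pipeline would more likely rely on an image library.

```python
import math


def resize(image, new_width, new_height):
    """Nearest-neighbour resize of a 2-D pixel matrix (assumed method)."""
    old_h, old_w = len(image), len(image[0])
    return [[image[r * old_h // new_height][c * old_w // new_width]
             for c in range(new_width)]
            for r in range(new_height)]


def normalize(image):
    """Apply (image - mean) / std over all pixels of the image."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    std = math.sqrt(sum((p - mean) ** 2 for p in pixels) / len(pixels))
    std = std or 1.0  # guard against a constant image
    return [[(p - mean) / std for p in row] for row in image]


image = [[0, 0, 255, 255],
         [0, 0, 255, 255],
         [255, 255, 0, 0],
         [255, 255, 0, 0]]
new_image = resize(image, 2, 2)
normalized_image = normalize(new_image)
v_image = [p for row in normalized_image for p in row]  # flattened vector
```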

[0065] 3) Perform word segmentation on the natural language text and code text in the comment text, convert the segmented text into word vector representations, and then concatenate and combine them to form the comment text vector.

[0066] Regular expressions can be used to identify natural language text and code text in comment text. Code text is usually surrounded by code block tags (such as ```), and these tags can be matched to extract the code text.

[0067] The Natural Language Toolkit (NLTK) can be used to segment the natural language text in the comments, and the Word2Vec word embedding model based on Continuous Bag-of-Words (CBOW) can be used to convert the segmented text into the word vector representation $V_{\text{comment}}$;

[0068] The CodeTokenizer code segmenter is used to segment the code text in the comments, and the Code2Vec code embedding model is used to convert the segmented code into the code vector representation $V_{\text{code}}$;

[0069] The comment text vector is formed by concatenating and combining word vectors and code vectors.

[0070] $V_{\text{text}} = V_{\text{comment}} \oplus V_{\text{code}}$;

[0071] For short comment text vectors, they can be padded to the same length by adding a special padding symbol (e.g., -100) at the end. This ensures uniform vector length and facilitates batch processing.
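The separation of natural language from code in a comment, described above as a regular-expression match on code block tags, might look like the following sketch. The pattern and the function name `split_comment` are illustrative assumptions; tokenization and embedding of the two parts are not shown.

```python
import re

FENCE = "`" * 3  # the three-backtick code block tag, built programmatically
CODE_BLOCK = re.compile(
    re.escape(FENCE) + r"[^\n]*\n(.*?)" + re.escape(FENCE), re.DOTALL)


def split_comment(comment):
    """Return (natural_language_text, list_of_code_snippets)."""
    code_parts = [m.strip() for m in CODE_BLOCK.findall(comment)]
    natural = CODE_BLOCK.sub(" ", comment)   # drop the fenced blocks
    natural = " ".join(natural.split())      # collapse leftover whitespace
    return natural, code_parts


comment = ("Thanks for the report. Please try:\n"
           + FENCE + "python\nprint('hello')\n" + FENCE
           + "\nand let us know.")
text, code = split_comment(comment)
```

Inline code marked with single backticks is left in the natural language part by this pattern.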

[0072] 4) Extract behavioral sequences from the behavioral data, such as adding an issue, adding a comment, or adding a pull request. Arrange these behaviors in chronological order and encode the different behavior types as integers (0, 1, ..., n) to obtain the behavioral data vector $V_{\text{behavior}}$.

[0073] For each user's behavior sequence, since the number and frequency of behaviors may differ among different users, it is necessary to align the behavior sequences so that they have the same length: pad the end of the behavior sequence so that all sequences have the same length. Padding can be done using special flags such as 0.
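Encoding and aligning behavior sequences can be sketched as below. The event format, the code table, and the choice to start behavior codes at 1 so that 0 stays free for padding are assumptions made for this example; truncating sequences longer than the target length is likewise not specified in the text.

```python
BEHAVIOR_CODES = {"issue": 1, "comment": 2, "pull_request": 3}  # hypothetical


def encode_behaviors(events):
    """Sort events by timestamp and map each behavior type to its code."""
    ordered = sorted(events, key=lambda e: e["time"])
    return [BEHAVIOR_CODES[e["kind"]] for e in ordered]


def pad(seq, length, pad_value):
    """Right-pad (or truncate) a sequence to a fixed length."""
    seq = list(seq[:length])
    return seq + [pad_value] * (length - len(seq))


events = [
    {"time": "2024-01-03T10:00:00Z", "kind": "comment"},
    {"time": "2024-01-01T09:00:00Z", "kind": "issue"},
    {"time": "2024-01-05T12:00:00Z", "kind": "pull_request"},
]
v_behavior = pad(encode_behaviors(events), length=5, pad_value=0)

# The same helper covers comment text vectors padded with -100.
v_short_text = pad([0.2, 0.4], length=4, pad_value=-100)
```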

[0074] In one embodiment, a comprehensive representation vector is obtained based on account metadata vectors, profile picture vectors, comment text vectors, and behavioral data vectors, and then split into training and testing sets, including:

[0075] The account metadata vector, profile picture vector, comment text vector, and behavioral data vector are adjusted in dimension and weighted before being concatenated to form a complete representation vector. With the account metadata vector denoted $V_{\text{meta}}$, the profile picture vector $V_{\text{image}}$, the comment text vector $V_{\text{text}}$, and the behavioral data vector $V_{\text{behavior}}$, the concatenated representation vector $V_{\text{combined}}$ can be expressed as:

[0076] $V_{\text{combined}} = \bigoplus_{i \in \{\text{meta},\, \text{image},\, \text{text},\, \text{behavior}\}} W_i \cdot f(V_i, \max(\{D_i\}))$;

[0077] where $D_i$ is the dimension of vector $V_i$ ($i \in \{\text{meta}, \text{image}, \text{text}, \text{behavior}\}$); the function $f$ performs zero-padding so that the dimension of every vector $V_i$ matches the largest $D_i$; and $W_i$ is the weight of vector $V_i$ (the sum of the weights is less than or equal to 1). The four weighted vectors are concatenated in order to form the complete representation vector.
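A small sketch of this fusion step: each vector is zero-padded to the largest dimension, scaled by its weight, and the four results are concatenated in the order meta, image, text, behavior. The toy vectors and the equal weights of 0.25 are placeholders, not values from the patent.

```python
ORDER = ("meta", "image", "text", "behavior")


def zero_pad(vec, dim):
    """The function f: append zeros until the vector has `dim` entries."""
    return list(vec) + [0.0] * (dim - len(vec))


def fuse(vectors, weights):
    """Weighted, zero-padded concatenation of the four modality vectors."""
    max_dim = max(len(vectors[k]) for k in ORDER)
    combined = []
    for k in ORDER:
        combined.extend(weights[k] * x for x in zero_pad(vectors[k], max_dim))
    return combined


vectors = {"meta": [1.0, 2.0], "image": [4.0],
           "text": [2.0, 2.0, 2.0], "behavior": [1.0]}
weights = {"meta": 0.25, "image": 0.25, "text": 0.25, "behavior": 0.25}
v_combined = fuse(vectors, weights)
```

The result always has four times the largest input dimension, so every account yields a vector of the same length once the per-modality lengths are fixed.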

[0078] The mean-variance normalization method is used to normalize the concatenated representation vector to obtain the corresponding dataset. For the concatenated representation vector $V_{\text{combined}}$, the normalization formula is as follows:

[0079] $V_{\text{normalized}} = (V_{\text{combined}} - \mu) / \sigma$;

[0080] Here, μ represents the mean vector of the representation vectors, and σ represents the standard deviation vector of the representation vectors. Normalization involves subtracting the mean from the value of each dimension and then dividing by the standard deviation, so that each dimension has zero mean and unit variance. This ensures that the different dimensions have relatively balanced numerical importance, which is beneficial for the training and convergence of the binary classification model.
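The per-dimension normalization can be sketched as follows, with μ and σ computed over all representation vectors in the dataset. Replacing a zero standard deviation with 1 (for dimensions that are constant, such as padding positions) is an assumption added to avoid division by zero.

```python
import math


def fit_standardizer(rows):
    """Compute the mean and standard deviation of every dimension."""
    n, dims = len(rows), len(rows[0])
    mu = [sum(r[d] for r in rows) / n for d in range(dims)]
    sigma = [math.sqrt(sum((r[d] - mu[d]) ** 2 for r in rows) / n)
             for d in range(dims)]
    sigma = [s if s > 0 else 1.0 for s in sigma]  # constant dimension guard
    return mu, sigma


def standardize(row, mu, sigma):
    """Apply (V - mu) / sigma element-wise."""
    return [(x - m) / s for x, m, s in zip(row, mu, sigma)]


rows = [[1.0, 10.0], [3.0, 10.0]]
mu, sigma = fit_standardizer(rows)
normalized = [standardize(r, mu, sigma) for r in rows]
```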

[0081] Split the entire dataset into training and testing sets at a specific ratio (e.g., 70% training and 30% testing). Randomness should be maintained during the split to avoid bias between the training and testing sets; this can be ensured by randomly shuffling the order of the dataset before splitting it proportionally.
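A shuffled 70/30 split might be implemented as below. The fixed seed is an assumption added for reproducibility, and the toy dataset of (vector, label) pairs is made up.

```python
import random


def train_test_split(dataset, train_ratio=0.7, seed=42):
    """Shuffle a copy of the dataset, then cut it at the given ratio."""
    items = list(dataset)
    random.Random(seed).shuffle(items)
    cut = round(len(items) * train_ratio)
    return items[:cut], items[cut:]


# Ten toy samples: (representation vector, label) with 1 = SoftBot, 0 = human.
dataset = [([float(i)], i % 2) for i in range(10)]
train_set, test_set = train_test_split(dataset)
```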

[0082] In one embodiment, the representation vectors of the training set are learned to learn feature representations of the software development assistance robot and human users, constructing a binary classification model, including:

[0083] The Transformer architecture is used to learn representation vectors from the training set. The concatenated representation vectors are then used as input to the model. A self-attention mechanism is employed to learn the relationships and importance within the input data, encoding this information into the feature vector. The self-attention mechanism in the Transformer architecture can be expressed as the following formula:

[0084] $\text{Attention}(Q, K, V) = \text{softmax}\left(\dfrac{QK^{T}}{\sqrt{d_k}}\right)V$

[0085] where $Q$, $K$, and $V$ represent the vector representations of query, key, and value, respectively, and $d_k$ is the dimension of the query vector. The self-attention mechanism calculates the similarity between the query and the key, and then uses this similarity value as a weight to sum the values, obtaining the attention-weighted value.
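For illustration, the standard scaled dot-product attention, softmax(QKᵀ/√d_k)·V, can be written in plain Python for small matrices. This is a didactic sketch with made-up inputs, not the patent's implementation; a real model would use a deep learning framework and learned projections for Q, K, and V.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for matrices given as lists of rows."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
H = attention(Q, K, V)
```

Each output row is a convex combination of the value rows, pulled toward the value whose key best matches the query.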

[0086] During the learning process, the model is trained using the cross-entropy loss function and the stochastic gradient descent optimization algorithm to minimize the gap between the predicted results and the true labels. The cross-entropy loss function measures the difference between the model's predicted results and the true labels, and is particularly suitable for binary or multi-class classification tasks. Let $y_i$ be the true label of sample $i$ and $\hat{y}_i$ be the model's prediction result for sample $i$ (usually the probability after processing by the softmax activation function); then the cross-entropy loss function can be expressed as:

[0087] $L = -\dfrac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})$

[0088] where $N$ is the number of samples, $C$ is the number of categories, $y_{i,c}$ is the true label indicating whether sample $i$ belongs to category $c$ (usually 0 or 1), and $\hat{y}_{i,c}$ is the model's predicted probability that sample $i$ belongs to category $c$.
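A short sketch of the cross-entropy computation for one-hot labels, averaged over the samples. Averaging by N and clipping probabilities with a small epsilon before taking the logarithm are assumptions of this example; the labels and predictions are made up.

```python
import math


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean over samples of -sum_c y[i][c] * log(p[i][c])."""
    total = 0.0
    for labels, probs in zip(y_true, y_pred):
        for y_c, p_c in zip(labels, probs):
            total += y_c * math.log(max(p_c, eps))  # eps avoids log(0)
    return -total / len(y_true)


# Two samples, two classes (index 0 = SoftBot, index 1 = human user).
y_true = [[1, 0], [0, 1]]
y_pred = [[0.9, 0.1], [0.2, 0.8]]
loss = cross_entropy(y_true, y_pred)
```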

[0089] Stochastic gradient descent is an optimization algorithm used to update model parameters to minimize the loss function. Its update rule can be expressed as:

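[0090] $\theta \leftarrow \theta - \alpha \nabla_{\theta} L(\theta)$

[0091] where $\theta$ represents the model parameters, $\alpha$ is the learning rate, and $\nabla_{\theta} L(\theta)$ is the gradient of the loss function with respect to the parameters.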
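The update rule itself is a one-liner. The toy objective below, L(θ) = Σ θ_j², is an assumption chosen only because its gradient 2θ is easy to verify; in actual training the gradient would be estimated on a randomly drawn mini-batch, which is what makes the method stochastic.

```python
def sgd_step(theta, grad, alpha):
    """One update: theta <- theta - alpha * gradient."""
    return [t - alpha * g for t, g in zip(theta, grad)]


# Minimize the toy objective L(theta) = sum(theta_j ** 2).
theta = [1.0, -2.0]
for _ in range(100):
    grad = [2.0 * t for t in theta]
    theta = sgd_step(theta, grad, alpha=0.1)
```

With a learning rate of 0.1 each step shrinks every parameter by a factor of 0.8, so the parameters approach the minimum at zero.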

[0092] The sequence output by the Transformer is pooled, and then the sequence representation is mapped to the output space of the binary classification task through a fully connected layer. Finally, the classification results of SoftBot and human user are output to achieve automated recognition of software development assistance robots.

[0093] First, the sequence output by the Transformer is pooled. Average pooling is applied along each dimension of the sequence, producing a fixed-length vector that represents the feature information of the entire sequence. The formula is as follows:

[0094] z = AvgPool(H);

[0095] Here, H represents the sequence output by the Transformer, and z represents the vector representation after average pooling. The pooled vector z is then passed through a fully connected layer for linear and nonlinear transformations, mapping it to the output space of the binary classification task. The calculation formula for the fully connected layer is as follows:

[0096] o = ReLU(W·z + b);

[0097] Where W is the weight matrix, b is the bias vector, and ReLU represents the rectified linear unit activation function. Finally, the output o of the fully connected layer is passed through the Sigmoid activation function to obtain the classification results of SoftBot and human user. The Sigmoid function maps the output value to between 0 and 1, representing a probability value, which can be used to determine the probability of a sample belonging to SoftBot or human user.

[0098] $\hat{y} = \text{Sigmoid}(W_o \cdot o + b_o)$

[0099] where $\hat{y}$ is the predicted probability, and $W_o$ and $b_o$ are the weights and biases of the output layer.
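The classification head (average pooling, a fully connected layer with ReLU, then a Sigmoid output) can be traced on toy numbers. All weights below are made up so that the result is easy to check by hand; treating the output layer as a single weight vector and scalar bias is an assumption of this sketch.

```python
import math


def avg_pool(H):
    """z = AvgPool(H): mean of each feature dimension over the sequence."""
    n = len(H)
    return [sum(row[j] for row in H) / n for j in range(len(H[0]))]


def dense(W, x, b):
    """Affine map W·x + b with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(v):
    return [max(0.0, x) for x in v]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def classify(H, W, b, W_o, b_o):
    z = avg_pool(H)                            # pooled sequence representation
    o = relu(dense(W, z, b))                   # o = ReLU(W·z + b)
    return sigmoid(dense([W_o], o, [b_o])[0])  # probability of the SoftBot class


H = [[1.0, 3.0], [3.0, 1.0]]        # toy Transformer output, two positions
W = [[1.0, 0.0], [0.0, -1.0]]
b = [0.0, 0.0]
W_o, b_o = [0.5, 0.5], -1.0
p_softbot = classify(H, W, b, W_o, b_o)
```

Here z = [2, 2], o = [2, 0], and the output logit is 0, so the predicted probability is exactly 0.5.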

[0100] In one embodiment, binary classification models with different hyperparameter settings are trained on a training set, and the constructed binary classification models are evaluated on a test set. The performance of the binary classification models with different hyperparameter settings is compared, and the binary classification model with the best performance is selected as the recognition model, including:

[0101] The binary classification model with different hyperparameter settings is trained on the training set. The hyperparameters include learning rate, model complexity, and batch size. The learning rate can be set to [10⁻⁵, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹]. The model complexity can be set by the number of attention heads, with a range of [4, 6, 8, 12, 16]. The batch size can be set to [2, 4, 8, 16, 32, 64].
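The hyperparameter grid listed above can be enumerated with `itertools.product`. The dictionary keys are hypothetical names; exhaustively training every combination is one possible search strategy, and the text does not say whether all combinations are tried.

```python
from itertools import product

LEARNING_RATES = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
ATTENTION_HEADS = [4, 6, 8, 12, 16]
BATCH_SIZES = [2, 4, 8, 16, 32, 64]

# One configuration per combination of the three hyperparameters.
grid = [
    {"learning_rate": lr, "num_heads": heads, "batch_size": bs}
    for lr, heads, bs in product(LEARNING_RATES, ATTENTION_HEADS, BATCH_SIZES)
]
```

The full grid contains 5 × 5 × 6 = 150 configurations, each of which would be trained on the training set and scored on the test set.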

[0102] The model's performance under different hyperparameter settings is evaluated on the test set. The accuracy, recall, F1 score, and AUC of each model are calculated on the test set, and the best-performing model is selected as the final recognition model. The F1 score can be used as the selection metric.
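Model selection by F1 score can be sketched as follows, treating SoftBot as the positive class (label 1). The candidate names and predictions are made up, and AUC is omitted because it needs predicted probabilities rather than hard labels.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 with label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(y_true), "precision": precision,
            "recall": recall, "f1": f1}


def select_best(candidates, y_true):
    """Return the name of the candidate with the highest F1 score."""
    return max(candidates,
               key=lambda name: binary_metrics(y_true, candidates[name])["f1"])


y_true = [1, 1, 0, 0, 1]
candidates = {                      # hypothetical test-set predictions
    "lr=1e-3,heads=8": [1, 0, 0, 1, 1],
    "lr=1e-4,heads=4": [1, 1, 0, 0, 1],
}
metrics = binary_metrics(y_true, candidates["lr=1e-3,heads=8"])
best = select_best(candidates, y_true)
```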

[0103] In one embodiment, a recognition model is used to identify the account to be identified, and the recognition result is output, including:

[0104] Enter a user account from the open-source community and obtain relevant data about the account to be identified through the API of the open-source community platform, including account metadata, profile picture, comment text, and behavioral data;

[0105] The account metadata, profile picture, comment text, and behavioral data of the account to be identified are preprocessed to extract the four types of representation vectors, which are concatenated into a comprehensive representation vector;

[0106] The obtained representation vector is input into the recognition model, which determines whether the account to be identified is a SoftBot or a human user and gives the probabilities, as percentages, that the account is a SoftBot and that it is a human user.
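Turning the model's output probability into the reported result could look like this. The 0.5 decision threshold, the rounding to one decimal place, and the output keys are assumptions; the text only states that a label and percentages are returned.

```python
def report(p_softbot, threshold=0.5):
    """Format a SoftBot probability as a label plus two percentages."""
    label = "SoftBot" if p_softbot >= threshold else "human user"
    return {
        "label": label,
        "softbot_percent": round(100 * p_softbot, 1),
        "human_percent": round(100 * (1 - p_softbot), 1),
    }


result = report(0.87)
```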

[0107] In one embodiment, a software development-assisted robot recognition device is provided, comprising: a data acquisition module, a vector extraction module, a vector fusion module, a model construction module, a model selection module, and an account recognition module, wherein:

[0108] The data acquisition module is used to acquire data from software development bots and human users in the open-source community. The data specifically includes: account metadata, profile pictures, comment text, and behavioral data. A social interaction balance index is introduced into the numerical features of the account metadata. The social interaction balance index is calculated based on the number of followers and the number of accounts the user follows.

[0109] The vector extraction module is used to extract account metadata vectors, avatar image vectors, comment text vectors, and behavioral data vectors based on data.

[0110] The vector fusion module is used to obtain a comprehensive representation vector based on account metadata vectors, avatar image vectors, comment text vectors, and behavioral data vectors, and then split it into training and test sets.

[0111] The model construction module is used to learn the representation vectors of the training set to learn the feature representations of software development assistance robots and human users, and construct a binary classification model.

[0112] The model selection module is used to train binary classification models with different hyperparameter settings on the training set, evaluate the constructed binary classification models on the test set, compare the performance of binary classification models with different hyperparameter settings, and select the binary classification model with the best performance as the recognition model.

[0113] The account recognition module is used to identify the account to be identified using a recognition model and output the recognition result.

[0114] Specific limitations regarding the software development-assisted robot recognition device can be found in the limitations of the software development-assisted robot recognition method described above, and will not be repeated here. Each module in the aforementioned software development-assisted robot recognition device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device in hardware form, or stored in the memory of a computer device in software form, so that the processor can call and execute the operations corresponding to each module.

[0115] In one embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in Figure 2. The computer device includes a processor, memory, network interface, and database connected via a system bus. The processor provides computing and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and database. The internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The database stores user data from the open-source community. The network interface is used for communication with external terminals via a network connection. When executed by the processor, the computer program implements a software-assisted robot recognition method.

[0116] Those skilled in the art will understand that the structure shown in Figure 2 is merely a block diagram of a portion of the structure related to the present invention and does not constitute a limitation on the computer device to which the present invention is applied. A specific computer device may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0117] In one embodiment, a computer device is provided, including a memory and a processor, the memory storing a computer program, the processor executing the computer program to implement the steps of the method described above.

[0118] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements the steps described above.

[0119] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media used in the embodiments provided by this invention can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

[0120] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0121] The embodiments described above are merely illustrative of several implementations of the present invention, and while the descriptions are relatively specific and detailed, they should not be construed as limiting the scope of the invention patent. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this invention patent should be determined by the appended claims.

Claims

1. A software development assistance robot recognition method, characterized in that the method comprises: acquiring data of software development assistance robots and human users in an open-source community, the data specifically including account metadata, avatar images, comment text, and behavioral data, wherein a social interaction balance index is introduced into the numerical features of the account metadata, and the social interaction balance index is calculated from the number of accounts a user follows and the number of followers the user has; extracting an account metadata vector, an avatar image vector, a comment text vector, and a behavioral data vector based on the data; obtaining a comprehensive representation vector based on the account metadata vector, avatar image vector, comment text vector, and behavioral data vector, and splitting the result into a training set and a test set; learning the representation vectors of the training set to learn the feature representations of software development assistance robots and human users, and constructing a binary classification model; training binary classification models with different hyperparameter settings on the training set, evaluating the constructed binary classification models on the test set, comparing the performance of the binary classification models with different hyperparameter settings, and selecting the binary classification model with the best performance as the recognition model; and identifying the account to be identified using the recognition model and outputting the recognition result; wherein extracting the account metadata vector, avatar image vector, comment text vector, and behavioral data vector based on the data comprises: processing and combining the categorical features, the numerical features incorporating the social interaction balance index, and the text features integrating words and parts of speech in the account metadata to form the account metadata vector; loading the avatar image as a pixel matrix and preprocessing it to obtain the avatar image vector; segmenting the natural language text and code text in the comment text into words, converting the segmented text into word vector representations, and concatenating and combining them to form the comment text vector; and extracting behavioral sequences from the behavioral data in chronological order to obtain the behavioral data vector; and wherein processing and combining the categorical features, the numerical features incorporating the social interaction balance index, and the text features integrating words and parts of speech in the account metadata to form the account metadata vector comprises: encoding the categorical features in the account metadata, i.e., categorical metadata, into numerical form, and mapping each category to a corresponding categorical metadata vector; normalizing the numerical features in the account metadata, i.e., numerical metadata, to obtain general numerical features, while calculating the user's social interaction balance index feature from the number of accounts the user follows and the number of followers the user has, and then obtaining a numerical metadata vector from the general numerical features and the social interaction balance index feature; for the text features in the account metadata, i.e., textual metadata, converting the text into a text vector using a word embedding model, while labeling each word in the text with its part of speech and adding the part-of-speech tags to the text vector as additional features to obtain a textual metadata vector; and concatenating and combining the categorical metadata vector, numerical metadata vector, and textual metadata vector to form the account metadata vector.
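
The assembly of the account metadata vector from its three parts can be illustrated with a short sketch. Everything named here is an assumption made for illustration: one-hot encoding is one possible way to map categories to vectors, min-max scaling is one possible normalization, the field names (`type`, `public_repos`, `followers`) are hypothetical, the text vector is taken as already produced by some word embedding model with part-of-speech features appended, and the social interaction balance index is passed in as a precomputed value rather than derived in the code.

```python
from typing import Dict, List, Sequence, Tuple


def one_hot(value: str, categories: Sequence[str]) -> List[float]:
    """Encode a categorical metadata value as a one-hot vector (assumed encoding)."""
    return [1.0 if value == c else 0.0 for c in categories]


def min_max(value: float, lo: float, hi: float) -> float:
    """Scale a numerical metadata value into [0, 1] (assumed normalization)."""
    return 0.0 if hi == lo else (value - lo) / (hi - lo)


def metadata_vector(
    account: Dict[str, object],
    categories: Sequence[str],
    ranges: Dict[str, Tuple[float, float]],
    sibi: float,
    text_vec: Sequence[float],
) -> List[float]:
    """Concatenate categorical, numerical (plus SIBI), and textual metadata vectors."""
    cat_vec = one_hot(str(account["type"]), categories)
    num_vec = [min_max(float(account[name]), lo, hi) for name, (lo, hi) in ranges.items()]
    num_vec.append(sibi)  # SIBI is supplied precomputed; its formula is not coded here
    return cat_vec + num_vec + list(text_vec)


# Hypothetical account record and settings, invented for illustration.
account = {"type": "User", "public_repos": 50, "followers": 20}
vec = metadata_vector(
    account,
    categories=["User", "Organization"],
    ranges={"public_repos": (0.0, 100.0), "followers": (0.0, 200.0)},
    sibi=0.25,
    text_vec=[0.5, -0.5],
)
```

The resulting list holds the categorical part first, then the normalized numerical features followed by the index value, then the textual part.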

2. The method according to claim 1, characterized in that the formula for calculating the social interaction balance index is as follows: [formula image not reproduced]; where SIBI denotes the social interaction balance index, and follow and followed denote the number of accounts the user follows and the number of followers the user has, respectively.

3. The method according to claim 1, characterized in that obtaining a comprehensive representation vector based on the account metadata vector, avatar image vector, comment text vector, and behavioral data vector, and splitting the result into a training set and a test set, comprises: adjusting the dimensions of the account metadata vector, avatar image vector, comment text vector, and behavioral data vector, then weighting and concatenating them to form a complete representation vector; normalizing the concatenated representation vectors using mean-variance normalization to obtain the corresponding dataset; and splitting the dataset into a training set and a test set according to a given ratio.
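
The normalization and splitting steps of this claim can be sketched as follows. This is a minimal illustration, assuming that mean-variance normalization means per-feature standardization to zero mean and unit variance, that the split is random, and that the ratio is 80/20; the function names, the ratio, and the seed are hypothetical choices, not values given in the text.

```python
import random
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def standardize(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Mean-variance normalization: zero mean and unit variance per feature column."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]  # constant columns are left at zero
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def split(
    rows: Sequence[Sequence[float]],
    labels: Sequence[int],
    test_ratio: float = 0.2,
    seed: int = 0,
) -> Tuple[list, list]:
    """Shuffle sample indices and cut them into a training set and a test set."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    n_test = max(1, round(len(rows) * test_ratio))
    test_idx, train_idx = order[:n_test], order[n_test:]
    train = [(rows[i], labels[i]) for i in train_idx]
    test = [(rows[i], labels[i]) for i in test_idx]
    return train, test


# Toy concatenated representation vectors and labels (invented for illustration).
rows = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0], [4.0, 10.0], [5.0, 10.0]]
labels = [0, 0, 1, 1, 0]
normalized = standardize(rows)
train_set, test_set = split(normalized, labels)
```

As in the claim, normalization is applied to the whole dataset before it is split.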

4. The method according to claim 3, characterized in that the formula for adjusting the dimensions of the account metadata vector, avatar image vector, comment text vector, and behavioral data vector, and then weighting and concatenating them to form a complete representation vector, is as follows: [formula image not reproduced]; where $D_i$ denotes the dimension of vector $V_i$; $V_{meta}$ denotes the account metadata vector; $V_{image}$ denotes the avatar image vector; $V_{text}$ denotes the comment text vector; $V_{behavior}$ denotes the behavioral data vector; the function $f$ performs the zero-value completion (zero-padding) operation; $V_{combined}$ denotes the complete representation vector; and $W_i$ denotes the weight of vector $V_i$.
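
One possible reading of the dimension adjustment and weighted concatenation is sketched below. The formula itself is not reproduced in this text, so the details are assumptions: each vector is zero-padded to the largest of the four dimensions, scaled by its weight, and the results are concatenated in the order meta, image, text, behavior. The names `zero_pad` and `combine` and the example weights are hypothetical.

```python
from typing import Dict, List, Sequence

ORDER = ("meta", "image", "text", "behavior")


def zero_pad(vec: Sequence[float], target_dim: int) -> List[float]:
    """Zero-value completion: append zeros until the vector has target_dim entries."""
    if len(vec) > target_dim:
        raise ValueError("vector is longer than the target dimension")
    return list(vec) + [0.0] * (target_dim - len(vec))


def combine(vectors: Dict[str, Sequence[float]], weights: Dict[str, float]) -> List[float]:
    """Pad every vector to the largest dimension, weight it, and concatenate."""
    target_dim = max(len(vectors[name]) for name in ORDER)  # assumed common dimension
    combined: List[float] = []
    for name in ORDER:
        padded = zero_pad(vectors[name], target_dim)
        combined.extend(weights[name] * x for x in padded)
    return combined


# Toy vectors and weights, invented for illustration.
vectors = {"meta": [1.0, 2.0], "image": [3.0], "text": [4.0, 5.0], "behavior": [6.0]}
weights = {"meta": 1.0, "image": 0.5, "text": 2.0, "behavior": 1.0}
v_combined = combine(vectors, weights)
```

With four input vectors, the combined vector has four times the padded dimension.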

5. The method according to claim 1, characterized in that learning the representation vectors of the training set to learn the feature representations of software development assistance robots and human users, and constructing a binary classification model, comprises: using a Transformer architecture to learn the representation vectors of the training set, with the concatenated representation vectors as the model input; learning the relationships and importance within the input data through the self-attention mechanism and encoding this information into the feature vector; during learning, training the model with the cross-entropy loss function and the stochastic gradient descent optimization algorithm to minimize the gap between the predicted results and the true labels; pooling the sequence output by the Transformer, and then mapping the sequence representation to the output space of the binary classification task through a fully connected layer; and finally outputting the classification result, SoftBot or human user, to achieve automated recognition of software development assistance robots.
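
The data flow of this claim (self-attention, pooling, fully connected layer, cross-entropy loss, gradient step) can be traced with a deliberately reduced sketch. It is not a full Transformer: the attention below uses identity query, key, and value projections and omits multi-head attention, feed-forward sublayers, residual connections, and layer normalization, and the gradient step updates only the fully connected layer. Mean pooling, the toy token values, and all function names are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: Matrix) -> Matrix:
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        attn = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(attn, tokens)) for j in range(d)])
    return out


def mean_pool(tokens: Matrix) -> List[float]:
    """Pool the output sequence into one vector (mean pooling is an assumption)."""
    return [sum(col) / len(tokens) for col in zip(*tokens)]


def linear(x: Sequence[float], weight: Matrix, bias: Sequence[float]) -> List[float]:
    """Fully connected layer mapping the pooled vector to two class logits."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def forward(tokens: Matrix, weight: Matrix, bias: Sequence[float]) -> List[float]:
    return softmax(linear(mean_pool(self_attention(tokens)), weight, bias))


def cross_entropy(probs: Sequence[float], label: int) -> float:
    return -math.log(probs[label])


def sgd_step(tokens: Matrix, label: int, weight: Matrix, bias: Sequence[float],
             lr: float = 0.1) -> Tuple[Matrix, List[float]]:
    """One stochastic gradient descent step on the fully connected layer only."""
    pooled = mean_pool(self_attention(tokens))
    probs = softmax(linear(pooled, weight, bias))
    grad = [p - (1.0 if c == label else 0.0) for c, p in enumerate(probs)]
    new_weight = [[w - lr * g * x for w, x in zip(row, pooled)]
                  for row, g in zip(weight, grad)]
    new_bias = [b - lr * g for b, g in zip(bias, grad)]
    return new_weight, new_bias


# Four toy "tokens", one per modality vector (values invented for illustration).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
weight, bias = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]
loss_before = cross_entropy(forward(tokens, weight, bias), 1)
weight, bias = sgd_step(tokens, 1, weight, bias)
loss_after = cross_entropy(forward(tokens, weight, bias), 1)
```

In practice a deep learning framework would supply the learned projections and backpropagation through the attention layers; the sketch only shows how the pieces named in the claim connect.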

6. The method according to claim 1, characterized in that identifying the account to be identified using the recognition model and outputting the recognition result comprises: obtaining the relevant data of the account to be identified through the API of the open-source community platform, including account metadata, avatar image, comment text, and behavioral data; preprocessing the account metadata, avatar image, comment text, and behavioral data of the account to be identified, extracting the representation vectors of these four dimensions, and concatenating them into a comprehensive representation vector; and inputting the obtained representation vector into the recognition model, which determines whether the account to be identified is a SoftBot or a human user and outputs the respective percentages for SoftBot and human user.
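
The final output step can be illustrated with a small sketch. It assumes that the "percentages" are softmax class probabilities computed from the two logits of the recognition model and expressed in percent; the label order, the rounding to one decimal place, and the function name `classify` are hypothetical, and the retrieval of account data through the platform API is not shown.

```python
import math
from typing import Dict, Sequence

LABELS = ("human", "SoftBot")  # assumed class order of the model output


def classify(logits: Sequence[float]) -> Dict[str, object]:
    """Turn two class logits into a predicted label and per-class percentages."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)  # first class wins on ties
    return {
        "label": LABELS[best],
        "percentages": {name: round(100.0 * p, 1) for name, p in zip(LABELS, probs)},
    }


# Hypothetical logits, as if produced by the recognition model for one account.
result = classify([0.0, 2.0])
```

The predicted label is the class with the larger percentage.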

7. A software development assistance robot recognition device, characterized in that it implements the method according to any one of claims 1 to 6, the device comprising: a data acquisition module, used to acquire data of software development assistance robots and human users in the open-source community, the data specifically including account metadata, avatar images, comment text, and behavioral data, wherein a social interaction balance index is introduced into the numerical features of the account metadata, and the social interaction balance index is calculated from the number of accounts a user follows and the number of followers the user has; a vector extraction module, used to extract account metadata vectors, avatar image vectors, comment text vectors, and behavioral data vectors based on the data; a vector fusion module, used to obtain a comprehensive representation vector based on the account metadata vector, avatar image vector, comment text vector, and behavioral data vector, and to split the result into a training set and a test set; a model construction module, used to learn the representation vectors of the training set to learn the feature representations of software development assistance robots and human users, and to construct a binary classification model; a model selection module, used to train binary classification models with different hyperparameter settings on the training set, evaluate the constructed binary classification models on the test set, compare the performance of the binary classification models with different hyperparameter settings, and select the binary classification model with the best performance as the recognition model; and an account recognition module, used to identify the account to be identified using the recognition model and to output the recognition result.

8. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that when the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 6.