A method and apparatus for detecting artificial intelligence generated content in student compositions
By extracting and fusing multi-dimensional features, the method addresses the inability of existing technologies to identify grade-level differences, individual differences, and differences among artificial intelligence tools in student essays, thereby improving the accuracy of student essay detection.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HANGZHOU SHENYA TECHNOLOGY CO LTD
- Filing Date
- 2026-02-05
- Publication Date
- 2026-06-12
AI Technical Summary
Existing technologies cannot accurately identify differences in student essays across different age groups and among individual students, and cannot effectively distinguish essays generated by different artificial intelligence tools, resulting in low detection accuracy.
A multi-dimensional feature extractor obtains the textual, thematic, and semantic features of a student essay. Difference data are identified between the essay and both the essay collection for the student's grade level and the student's historical essay collection. These are combined with difference data relative to essay collections generated by multiple different artificial intelligence tools, and a target multilayer perceptron fuses them to generate the detection result.
The method accurately identifies grade-level differences and individual historical differences in students' compositions, effectively captures the characteristics of compositions generated by different artificial intelligence tools, and improves detection accuracy.
Figure CN122196180A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of educational technology, and in particular to a method and apparatus for detecting artificial intelligence-generated content in student essays.
Background Technology
[0002] As artificial intelligence tools become more and more widespread, teachers are increasingly finding that student essays may have been generated using artificial intelligence (AIGC). If teachers can detect whether an essay is generated by AI, they can gain a more objective understanding of students' true abilities and provide more precise guidance during the teaching process.
[0003] Currently, the main approach to checking student essays is to extract features with a single language model and use those features to detect whether an essay was generated by artificial intelligence.
[0004] Student essays are a type of text with unique characteristics, distinct from academic papers and short-answer questions. They exhibit age-specific stratification and individualized features: vocabulary and sentence structure differ significantly among students at different grade levels, and even students at the same grade level differ in vocabulary preferences and expression styles. In contrast, academic papers require a large amount of specialized information and adherence to standardized expression methods.
[0005] Because student essays are strongly shaped by age and by individual differences, applying detection methods designed for academic papers and short-answer questions fails to account for those differences, which results in low detection accuracy for AI-generated student essays. In addition, many AI tools are now available, and the text they generate differs significantly in language features from tool to tool. Detection based on the features of a single language model therefore also fails to accurately identify essays generated by different AI tools, further reducing detection accuracy.
Summary of the Invention
[0006] In view of this, this application provides a method and apparatus for detecting AI-generated content in student essays. Its main purpose is to address two problems of the current technology. First, it cannot accurately identify differences in student essays across age groups and among individual students. Second, because text generated by different AI tools differs significantly in language features, detection based on a single language model's features cannot accurately identify essays generated by different AI tools. Both problems result in low accuracy in detecting AI-generated student essays.
[0007] Firstly, this application provides a method for detecting AI-generated content in student essays, including:
obtaining the student's essay to be detected, and extracting features from the essay using a multi-dimensional feature extractor to obtain the multi-dimensional features corresponding to the essay, wherein the multi-dimensional features include at least one of text features, topic features, and semantic features;
identifying the target grade level of the student, combining the essays of different students in the target grade level to obtain a first essay set, and identifying, based on the multi-dimensional features, the first difference data between the essay to be detected and both the first essay set and the student's historical essay set;
using multiple different artificial intelligence tools to generate student essays to obtain a second essay set, and identifying, based on the multi-dimensional features, the second difference data between the essay to be detected and the second essay set;
performing a fusion analysis based on the first difference data and the second difference data to generate the target detection result corresponding to the essay to be detected, wherein the target detection result is used to determine whether the essay to be detected is a student essay generated by an artificial intelligence tool.
[0008] Optionally, the step of identifying the target grade level of the student, combining essays from different students in the target grade level to obtain a first essay set, and identifying the first difference data between the essay to be detected and both the first essay set and the student's historical essay set based on the multi-dimensional features includes:
identifying the target grade level to which the student belongs, determining the set of essays by students in the target grade level as the first essay set, and determining the set of the student's own essays in the target grade level as the historical essay set, wherein the first essay set includes essays from students in the target grade level other than the student;
extracting features from the first essay set and the historical essay set respectively with the multi-dimensional feature extractor, to obtain a first multi-dimensional essay feature matrix corresponding to the first essay set and a second multi-dimensional essay feature matrix corresponding to the historical essay set;
identifying, based on the multi-dimensional features, the first multi-dimensional essay feature matrix, and the second multi-dimensional essay feature matrix, the first difference data between the essay to be detected and both the first essay set and the historical essay set.
[0009] Optionally, the step of identifying the first difference data between the essay to be detected and the first essay set and the historical essay set based on the multi-dimensional features, the first multi-dimensional essay feature matrix, and the second multi-dimensional essay feature matrix includes: The multi-dimensional features, the first multi-dimensional essay feature matrix, and the second multi-dimensional essay feature matrix are input into the first target attention network layer. The first target attention network layer identifies the first interaction features and the first difference features between the multi-dimensional features and the first multi-dimensional essay feature matrix and the second multi-dimensional essay feature matrix, respectively. Based on the first interaction feature and the first difference feature, a non-linear scoring is performed to generate first difference data between the essay to be detected and the first essay set and the historical essay set, respectively.
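One way to picture the interaction and difference features is a single-query attention step over the rows of a reference feature matrix. The source does not specify the internals of the target attention network layer, so the following is only a minimal sketch under stated assumptions: scaled dot-product attention without learned projections, an element-wise product as the interaction feature, and an element-wise subtraction as the difference feature. All function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, matrix):
    """Scaled dot-product attention of one query vector over the matrix rows.

    Assumption: no learned query/key/value projections; a trained layer
    would normally include them.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * r for q, r in zip(query, row)) / scale for row in matrix]
    weights = softmax(scores)
    return [sum(w * row[j] for w, row in zip(weights, matrix))
            for j in range(len(query))]

def interaction_and_difference(query, matrix):
    """Return (interaction, difference) features for the essay to be detected.

    Assumption: interaction = element-wise product with the attended context,
    difference = element-wise subtraction of the attended context.
    """
    context = attend(query, matrix)
    interaction = [q * c for q, c in zip(query, context)]
    difference = [q - c for q, c in zip(query, context)]
    return interaction, difference
```

In this sketch the same function would be called once with the first multi-dimensional essay feature matrix and once with the second, giving one pair of features per reference set.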
[0010] Optionally, a non-linear scoring is performed based on the first interaction feature and the first difference feature to generate first difference data between the essay to be detected and the first essay set and the historical essay set, respectively, including: The multi-dimensional feature vector of the essay to be detected, the multi-dimensional feature vector of a single essay in the first essay set and the historical essay set, the first interaction feature, and the first difference feature are input into the target multilayer perceptron. In the target multilayer perceptron, the multidimensional feature vector of the essay to be detected, the multidimensional feature vector of a single essay in the first essay set and the historical essay set, the first interaction feature, and the first difference feature are concatenated to generate a concatenated vector. The concatenated vector is non-linearly scored to generate first difference data between the essay to be detected and the first essay set and the historical essay set.
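The concatenate-then-score step can be illustrated with a tiny multilayer perceptron. This is a sketch, not the patented network: the single ReLU hidden layer, the sigmoid output, the layer sizes, and the names `init_params` and `mlp_score` are all assumptions introduced for illustration.

```python
import math
import random

def init_params(input_dim, hidden_dim, seed=0):
    """Create small random weights for a one-hidden-layer MLP (hypothetical sizes)."""
    rng = random.Random(seed)
    return {
        "W1": [[rng.uniform(-0.1, 0.1) for _ in range(input_dim)]
               for _ in range(hidden_dim)],
        "b1": [0.0] * hidden_dim,
        "w2": [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)],
        "b2": 0.0,
    }

def mlp_score(essay_vec, reference_vec, interaction, difference, params):
    """Concatenate the four inputs and map them to a score in (0, 1)."""
    concat = list(essay_vec) + list(reference_vec) + list(interaction) + list(difference)
    if any(len(row) != len(concat) for row in params["W1"]):
        raise ValueError("W1 rows must match the concatenated vector length")
    # Hidden layer with ReLU provides the non-linearity.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, concat)) + b)
              for row, b in zip(params["W1"], params["b1"])]
    logit = sum(w * h for w, h in zip(params["w2"], hidden)) + params["b2"]
    return 1.0 / (1.0 + math.exp(-logit))
```

Calling `mlp_score` once per reference essay would yield one score per essay in the first essay set and the historical essay set, which is one plausible reading of how the first difference data is assembled.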
[0011] Optionally, the step of using multiple different artificial intelligence tools to generate student essays to obtain a second essay set, and identifying second difference data between the essay to be detected and the second essay set based on the multi-dimensional features, includes: The essay set corresponding to the target learning stage generated by the multiple different artificial intelligence tools is determined as the second essay set. The second essay set is then subjected to feature extraction by the multi-dimensional feature extractor to obtain the third multi-dimensional essay feature matrix corresponding to the second essay set. The multi-dimensional features and the third multi-dimensional essay feature matrix are input into the second target attention network layer, and the second target attention network layer identifies the second interaction features and the second difference features between the multi-dimensional features and the third multi-dimensional essay feature matrix. A non-linear scoring method is used based on the second interaction feature and the second difference feature to generate second difference data between the essay to be detected and the second essay set.
[0012] Optionally, the step of performing fusion analysis based on the first difference data and the second difference data to generate the target detection result corresponding to the essay to be detected includes: The first difference data and the second difference data are input into the target multilayer perceptron, which is trained based on the feature vectors in the multidimensional feature extractor, the first target attention network layer and the second target attention network layer using the binary classification cross-entropy loss function. The target multilayer perceptron is used to fuse and analyze the first difference data and the second difference data to generate the target detection result corresponding to the essay to be detected.
[0013] Optionally, the training process of the target multilayer perceptron includes:
determining the training parameters of the target multilayer perceptron, wherein the training parameters include the bucket vector and keyword vector in the multi-dimensional feature extractor, the parameters of the first target attention network layer, the parameters of the second target attention network layer, and the parameters of the target multilayer perceptron;
determining the sample data for training the target multilayer perceptron, wherein the sample data includes positive samples and negative samples, the positive samples being real essays submitted by students and the negative samples being essays generated by various artificial intelligence tools;
inputting the first difference data and the second difference data of each sample into the target multilayer perceptron in its initial state, which, based on the training parameters, outputs the predicted probability that the sample is an essay generated by an artificial intelligence tool;
calculating the loss data of each sample using the binary cross-entropy loss function, wherein the loss data measures the difference between the true label and the predicted probability of the sample;
determining the average loss data of the sample data based on the loss data, the number of samples, and the true labels, and adjusting the training parameters according to the average loss data;
completing the training of the target multilayer perceptron when the average loss data is less than the loss threshold.
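The training loop described above (binary cross-entropy per sample, average loss over the batch, parameter update, stop below a loss threshold) can be sketched as follows. For brevity the model here is a single logistic unit rather than a full multilayer perceptron, and the label convention (1 for AI-generated, 0 for a real student essay), the learning rate, the threshold value, and the function names are assumptions not stated in the source.

```python
import math

def bce(label, prob, eps=1e-12):
    """Binary cross-entropy for one sample; prob is clipped to avoid log(0)."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def predict(weights, bias, features):
    """Predicted probability that the sample is AI-generated (assumed label 1)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, labels, lr=0.5, loss_threshold=0.1, max_epochs=5000):
    """Gradient descent that stops once the average loss drops below the threshold."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    avg_loss = float("inf")
    for _ in range(max_epochs):
        preds = [predict(weights, bias, x) for x in samples]
        avg_loss = sum(bce(y, p) for y, p in zip(labels, preds)) / len(samples)
        if avg_loss < loss_threshold:
            break
        # Gradient of the average BCE with respect to each weight and the bias.
        for j in range(len(weights)):
            grad = sum((p - y) * x[j] for p, y, x in zip(preds, labels, samples))
            weights[j] -= lr * grad / len(samples)
        bias -= lr * sum(p - y for p, y in zip(preds, labels)) / len(samples)
    return weights, bias, avg_loss
```

In the method as described, the update would also flow back into the attention layers and the bucket and keyword vectors, which normally calls for an automatic differentiation framework rather than hand-written gradients.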
[0014] Optionally, the step of extracting features from the essay to be detected using a multi-dimensional feature extractor to obtain the multi-dimensional features corresponding to the essay to be detected includes: The text feature extractor in the multi-dimensional feature extractor extracts the word segmentation features and syntactic features corresponding to the essay to be detected, and performs bucket quantization based on the word segmentation features and the syntactic features to obtain the bucket quantized data corresponding to the essay to be detected, and maps the bucket quantized data to the text features; The topic feature extractor in the multi-dimensional feature extractor extracts the keywords and weights corresponding to the essay to be detected, and generates the topic features based on the keywords and weights. The semantic vector corresponding to the essay to be detected is extracted by the semantic feature extractor in the multi-dimensional feature extractor, and the semantic features are generated based on the semantic vector. The semantic feature extractor is obtained by training an open-source large language model using real student essay content as the training set.
[0015] Secondly, this application provides an apparatus for detecting AI-generated content in student essays, comprising: The extraction module is configured to acquire students' essays to be tested, and to extract features from the essays to be tested using a multi-dimensional feature extractor to obtain multi-dimensional features corresponding to the essays to be tested. The multi-dimensional features include at least one of text features, topic features, and semantic features. The identification module is configured to identify the target grade level of the student, combine the essays of different students in the target grade level to obtain a first essay set, and identify the first difference data between the essay to be detected and the first essay set and the student's historical essay set based on the multi-dimensional features; The identification module is also configured to use multiple different artificial intelligence tools to generate student essays, obtain a second set of essays, and identify second difference data between the essay to be detected and the second set of essays based on the multi-dimensional features; The generation module is configured to perform fusion analysis based on the first difference data and the second difference data to generate a target detection result corresponding to the essay to be detected. The target detection result is used to determine whether the essay to be detected is a student essay generated by an artificial intelligence tool.
[0016] Thirdly, this application provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the method for detecting artificial intelligence-generated content in student essays as described in the first aspect.
[0017] Fourthly, this application provides an electronic device, including a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, wherein the processor executes the computer program to implement the method for detecting artificial intelligence-generated content in student essays as described in the first aspect.
[0018] Fifthly, this application provides a computer program product, which includes a computer program that, when executed by a processor, implements the method for detecting artificial intelligence-generated content in student essays as described in the first aspect.
[0019] Using the above technical solution, this application provides a method and apparatus for detecting AI-generated content in student essays, comprising: acquiring a student's essay to be tested; extracting features from the essay using a multi-dimensional feature extractor to obtain multi-dimensional features corresponding to the essay, wherein the multi-dimensional features include at least one of text features, topic features, and semantic features; identifying the target grade level of the student; combining different student essays from the target grade level to obtain a first essay set; and identifying first difference data between the essay to be tested and the first essay set and the student's historical essay set based on the multi-dimensional features; generating student essays using multiple different AI tools to obtain a second essay set; and identifying second difference data between the essay to be tested and the second essay set based on the multi-dimensional features; and performing a fusion analysis based on the first difference data and the second difference data to generate a target detection result corresponding to the essay to be tested, wherein the target detection result is used to determine whether the essay to be tested is a student essay generated by an AI tool. 
Compared with existing technologies, this application achieves accurate identification of differences between student essays across different academic levels and individual historical differences by identifying first difference data between the essay to be detected and a first set of essays corresponding to the student's academic level and a set of the student's historical essays, based on the multi-dimensional features; it also achieves effective capture of the features of essays generated by different artificial intelligence tools by identifying second difference data between the essay to be detected and a second set of essays generated by multiple artificial intelligence tools, based on the multi-dimensional features; and it improves the detection accuracy of student essays generated by artificial intelligence by generating target detection results through fusion analysis based on the first and second difference data.
Attached Figure Description
[0020] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application.
[0021] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, for those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0022] Figure 1 shows a flowchart of a method for detecting AI-generated content in student essays according to an embodiment of this application;
Figure 2 shows a flowchart of a method for detecting AI-generated content in student essays according to an embodiment of this application;
Figure 3 shows a schematic diagram of an example of AIGC generation and detection of student essay content provided in an embodiment of this application;
Figure 4 shows a schematic diagram of the structure of a device for detecting AI-generated content in student essays according to an embodiment of this application;
Figure 5 shows a schematic diagram of the structure of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0023] The embodiments of this application will now be described in more detail with reference to the accompanying drawings. It should be noted that, unless otherwise specified, the embodiments and features described herein can be combined with each other.
[0024] Existing technologies cannot accurately identify differences in student essays across age groups and among individual students. Moreover, because text generated by different AI tools varies significantly in linguistic features, detection that relies on a single language model cannot accurately identify essays generated by different AI tools. Both limitations lead to low accuracy in detecting AI-generated student essays. To address them, this embodiment provides a method for detecting AI-generated content in student essays. As shown in Figure 1, the method includes:
Step 101: Obtain the student's essay to be detected, and use a multi-dimensional feature extractor to extract features from the essay to obtain the multi-dimensional features corresponding to the essay.
[0025] The multi-dimensional features include at least one of text features, topic features, and semantic features.
[0026] In this embodiment of the application, the essay to be tested can be various writing assignments submitted by students during their learning process. For example, the essay to be tested in this embodiment of the application can specifically include different types and age groups of essays, such as narrative essays by primary school students, argumentative essays by junior high school students, and descriptive essays by senior high school students.
[0027] In this embodiment, the multi-dimensional feature extractor can be a set of tools capable of extracting multiple features from an essay. The multi-dimensional feature extractor can be used to comprehensively capture the textual, thematic, and semantic features of the essay to be tested. For example, the multi-dimensional feature extractor in this embodiment may specifically include a text feature extractor, a thematic feature extractor, and a semantic feature extractor.
[0028] In the embodiments of this application, text features can be features extracted based on the vocabulary, syntax, and other aspects of the essay. Text features can be used to reflect the characteristics of the essay's linguistic expression. For example, the text features in the embodiments of this application may specifically include bag-of-words features, word count and proportion, part-of-speech distribution proportion, sentence count and length distribution, syntactic complexity distribution, etc.
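As a rough illustration of such text features, the sketch below computes a word count, a sentence count, and a bucketed sentence-length distribution. It is a simplification under stated assumptions: whitespace tokenization and punctuation-based sentence splitting stand in for the word segmentation and syntactic analysis a real system would need (especially for Chinese text), and the bucket edges are hypothetical values chosen only for the example.

```python
import re

# Hypothetical upper edges (in words) for sentence-length buckets.
SENTENCE_LENGTH_EDGES = [5, 10, 20, 40]

def bucketize(value, edges):
    """Return the index of the first bucket whose upper edge exceeds value."""
    for index, edge in enumerate(edges):
        if value < edge:
            return index
    return len(edges)

def text_features(essay):
    """Compute surface statistics and a normalized sentence-length histogram."""
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    histogram = [0] * (len(SENTENCE_LENGTH_EDGES) + 1)
    for length in lengths:
        histogram[bucketize(length, SENTENCE_LENGTH_EDGES)] += 1
    total = len(sentences) or 1  # avoid division by zero for empty input
    return {
        "word_count": sum(lengths),
        "sentence_count": len(sentences),
        "length_distribution": [count / total for count in histogram],
    }
```

Bucketing turns a continuous statistic into a small set of discrete indices, which can then be mapped to vectors in the way the bucket quantization step describes.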
[0029] In this embodiment, the theme feature can be a feature constructed based on keywords and weights related to the core idea of the essay. The theme feature can be used to reflect the core content expressed in the essay. For example, the theme feature in this embodiment can specifically be a vector feature generated from keywords such as life values and social responsibility and their corresponding weights.
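A keyword-and-weight representation of this kind can be approximated with simple term frequencies. The source does not say how keywords or weights are obtained, so the frequency-based weighting, the stopword list, and the name `topic_keywords` below are assumptions made only to show the shape of the data.

```python
from collections import Counter

# Hypothetical stopword list; a real system would use a language-specific one.
STOPWORDS = {"the", "a", "of", "and", "to", "is", "in"}

def topic_keywords(tokens, top_k=3):
    """Return the top_k most frequent non-stopword tokens with normalized weights."""
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    top = counts.most_common(top_k)
    total = sum(count for _, count in top) or 1
    return {word: count / total for word, count in top}
```

For instance, `topic_keywords("responsibility to society is the responsibility of all".split())` gives "responsibility" a weight of 0.5 and the other two keywords 0.25 each. Each keyword could then be looked up in a keyword vector table and the vectors combined using these weights.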
[0030] In this embodiment of the application, semantic features can be vector features extracted based on the deep semantic meaning of the essay, and semantic features can be used to reflect the semantic connotation of the essay. For example, the semantic features in this embodiment of the application can specifically be 4096-dimensional semantic vectors.
[0031] In this embodiment of the application, the student's essay to be tested is obtained, and the multi-dimensional feature extractor is used to extract features from the essay to be tested to obtain the multi-dimensional features corresponding to the essay to be tested. This can be done by obtaining the essay submitted by the student, calling the multi-dimensional feature extractor to extract features from the essay to be tested, obtaining corresponding features from at least one dimension such as text, topic, and semantics, and forming the multi-dimensional features of the essay to be tested.
[0032] Step 102: Identify the target grade level of the student, combine the essays of different students in the target grade level to obtain the first essay set, and identify the first difference data between the essay to be detected and the first essay set and the student's historical essay set based on the multi-dimensional features.
[0033] In this embodiment of the application, the first essay set may be a collection of authentic essays submitted by other students in the student's target grade level. The first essay set can be used to provide a reference for the general characteristics of essays in the target grade level. For example, in this embodiment of the application, the first essay set may specifically be a collection of authentic essays submitted by all students in the student's target grade level (primary, junior high, or senior high), totaling q essays.
[0034] In this embodiment, the historical essay set can be a collection of authentic essays submitted by the student before the current target learning stage. This historical essay set can reflect the student's writing habits and characteristics. For example, in this embodiment, the historical essay set can specifically be k authentic essays submitted by the student before the target learning stage.
[0035] In this embodiment, the first difference data can be the difference data between the multi-dimensional features of the essay to be detected and the features of the first essay set and the historical essay set. The first difference data can be used to reflect the degree of deviation between the essay to be detected and the general features of the grade level and the historical features of the individual student. For example, in this embodiment, the first difference data can specifically be a difference vector m whose dimensions are determined by the number k of real essays in the historical essay set and the number q of real essays in the first essay set.
[0036] In the embodiments of this application, the first difference data between the essay to be detected and the first essay set corresponding to the student's grade level and the student's historical essay set can be generated by comparing and analyzing the multi-dimensional features with the overall features of the first essay set and the overall features of the historical essay set. This can identify the differences between the essay to be detected and these two sets in terms of text, theme, semantics, etc., and thus generate the first difference data.
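To make the shape of the first difference data concrete, the sketch below builds a vector with one entry per reference essay. It assumes that m has k + q entries (k historical essays followed by q grade-level essays) and uses plain cosine distance in place of the learned scoring the method describes; both choices and the function names are illustrative only.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def first_difference_data(essay_vec, history_matrix, peer_matrix):
    """Compare the essay to each historical essay, then to each grade-level essay."""
    return [cosine_distance(essay_vec, row)
            for row in list(history_matrix) + list(peer_matrix)]
```

A large value against the student's own history combined with a small value against grade-level peers would suggest a change in personal style rather than an unusual level of writing.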
[0037] Step 103: Use multiple different artificial intelligence tools to generate student essays to obtain a second essay set, and identify the second difference data between the essay to be detected and the second essay set based on the multi-dimensional features.
[0038] In this application embodiment, the artificial intelligence tool can be an AI-generated content (AIGC) tool with text generation capabilities. For example, the artificial intelligence tool in this application embodiment may specifically include common large model tools such as Tongyi Qianwen, Doubao, Wenxin Yiyan, and ChatGPT.
[0039] In this embodiment, the second essay set can be a collection of essays generated by multiple different artificial intelligence tools for the topic of the essay to be tested. The second essay set can be used to provide feature references for essays generated by different AIGC tools. For example, in this embodiment, the second essay set can specifically be a collection of v essays generated by four AIGC tools for essay topic A.
[0040] In this embodiment, the second difference data can be quantitative difference data between the multi-dimensional features of the essay to be detected and the features of the second essay set. The second difference data can be used to reflect the similarity between the features of the essay to be detected and the features of essays generated by artificial intelligence. For example, in this embodiment, the second difference data can specifically be a difference vector n of dimension v, where v is the number of essays in the second essay set.
[0041] In this embodiment of the application, the features of the essay to be detected are compared one by one with those of each essay in the second essay set. This can identify the differences between the essay to be detected and each essay in the second essay set in terms of vocabulary, syntax, theme expression, semantic connotation, etc., and generate second difference data.
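The one-by-one comparison against the second essay set can be sketched as follows. The feature vectors of the generated essays are grouped by tool so the result can be traced back to its source. Euclidean distance is an assumed stand-in for the attention-based scoring in the method, and the tool names used with this function are placeholders.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def second_difference_data(essay_vec, generated_by_tool):
    """Return the difference vector n (length v) and the tool behind each entry.

    generated_by_tool maps a tool name to the feature vectors of the essays
    that tool produced for the same essay topic.
    """
    tools, n = [], []
    for tool, vectors in generated_by_tool.items():
        for vec in vectors:
            tools.append(tool)
            n.append(euclidean(essay_vec, vec))
    return n, tools
```

Keeping the tool label next to each entry makes it possible to see which tool's output the essay most resembles, although the method itself only requires the vector n.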
[0042] Step 104: Perform a fusion analysis based on the first difference data and the second difference data to generate the target detection result corresponding to the essay to be detected.
[0043] The target detection result is used to determine whether the essay to be detected is a student essay generated by an artificial intelligence tool.
[0044] In this embodiment, the target detection result can be a probability value that the essay to be detected was generated by an artificial intelligence tool. The target detection result can be used to intuitively reflect the generation source attribute of the essay. For example, in this embodiment, the target detection result can specifically be a probability value of 0.85, indicating that the essay has a high probability of being generated by artificial intelligence.
[0045] In this embodiment of the application, the target detection result corresponding to the essay to be detected can be generated by performing a fusion analysis based on the first difference data and the second difference data. This can be achieved by analyzing and processing the fused difference information through a model to generate the target detection result. Teachers can then use the target detection result to determine whether the essay to be detected is the student's actual writing, thereby gaining an understanding of the student's true level.
[0046] Compared with existing technologies, this embodiment acquires students' essays to be tested and uses a multi-dimensional feature extractor to extract features from the essays to obtain multi-dimensional features, thereby achieving comprehensive capture of the features of students' essays. By identifying the first difference data between the essays to be tested and the first set of essays corresponding to the student's grade level and the student's historical essay set based on the multi-dimensional features, it achieves accurate identification of grade level differences and individual historical differences in students' essays. By identifying the second difference data between the essays to be tested and the second set of essays generated by multiple different artificial intelligence tools based on the multi-dimensional features, it achieves effective capture of the features of essays generated by different artificial intelligence tools. By fusing and analyzing the first and second difference data to generate target detection results, it improves the detection accuracy of student essays generated by artificial intelligence.
[0047] As an optional approach, when performing the task of "identifying the target grade level of the student, combining essays from different students in the target grade level to obtain a first essay set, and identifying the first difference data between the essay to be detected and the first essay set and the student's historical essay set based on the multi-dimensional features," the following methods can be used, but are not limited to them: Figure 2 As shown, the method includes: Step 201: Identify the target grade level of the students, determine the set of student essays in the target grade level as the first essay set, and determine the set of student essays in the target grade level as the history essay set.
[0048] The first essay set includes essays written at the target grade level by students other than the student in question.
[0049] For the embodiments of this application, the target grade level can be a common educational stage such as primary school, junior high school, or senior high school.
[0050] In the embodiments of this application, identifying the target grade level of the student and determining the set of student essays in the target grade level as the first essay set can be done by clearly identifying the target grade level of the student and selecting a large number of real essays submitted by students in the target grade level to form the first essay set. The number of essays in the first essay set can be q. The first essay set can reflect the general characteristics of essays in the target grade level.
[0051] In the embodiments of this application, the set of the student's own essays in the target grade level can be determined as the historical essay set by collecting all the genuine essays that the student has previously submitted in the current target grade level. The number of essays in the historical essay set can be k, and the historical essay set can reflect the student's own writing style and characteristics.
[0052] Step 202: The multi-dimensional feature extractor extracts features from the first set of essays and the historical set of essays respectively, to obtain the first multi-dimensional essay feature matrix corresponding to the first set of essays and the second multi-dimensional essay feature matrix corresponding to the historical set of essays.
[0053] In the embodiments of this application, the multi-dimensional feature extractor extracts features from the first essay set to obtain the first multi-dimensional essay feature matrix corresponding to the first essay set. This can be achieved by calling the multi-dimensional feature extractor to extract features from each essay in the first essay set, extracting the text features, theme features, and semantic features of each essay. The multi-dimensional features of each essay can be arranged in order to form the first multi-dimensional essay feature matrix. The dimension of the first multi-dimensional essay feature matrix can be q×d, where q can represent the number of essays in the first essay set, and d can represent the total dimension of the multi-dimensional features of a single essay.
[0054] In this embodiment of the application, feature extraction is performed on the historical essay set by the multi-dimensional feature extractor to obtain the second multi-dimensional essay feature matrix corresponding to the historical essay set. This can be achieved by calling the multi-dimensional feature extractor to extract features from each essay in the historical essay set, extracting the text features, topic features, and semantic features of each essay. The extracted multi-dimensional features can be organized into the second multi-dimensional essay feature matrix, and the dimension of the second multi-dimensional essay feature matrix can be k×d, where k can represent the number of essays in the historical essay set, and d can represent the total dimension of the multi-dimensional features of a single essay.
[0055] For example, if the first essay set has 500 essays (q=500) and the feature dimension of a single essay is 4096 (d=4096), then the dimension of the first multi-dimensional essay feature matrix can be 500×4096; if the historical essay set has 8 essays (k=8), then the dimension of the second multi-dimensional essay feature matrix can be 8×4096.
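The stacking of per-essay feature vectors into the q×d and k×d matrices described above can be illustrated with a minimal sketch. The function names and the toy extractor (with d=4) are hypothetical stand-ins introduced only for illustration; the actual multi-dimensional feature extractor is the one described in this application.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Matrix = List[Vector]


def build_feature_matrix(essays: List[str], extract: Callable[[str], Vector]) -> Matrix:
    """Stack one d-dimensional feature vector per essay into an n x d matrix."""
    matrix = [extract(text) for text in essays]
    d = len(matrix[0]) if matrix else 0
    if any(len(row) != d for row in matrix):
        raise ValueError("every essay must map to a vector of the same dimension d")
    return matrix


def shape(matrix: Matrix) -> Tuple[int, int]:
    """Return (number of essays, feature dimension)."""
    return (len(matrix), len(matrix[0]) if matrix else 0)


def toy_extract(text: str) -> Vector:
    """Hypothetical stand-in extractor producing d = 4 toy features."""
    words = text.split()
    n = len(words)
    avg_len = sum(len(w) for w in words) / max(n, 1)
    return [float(len(text)), float(n), float(len(set(words))), avg_len]


grade_essays = ["spring comes early this year", "my favourite teacher", "a trip to the lake"]  # q = 3
history_essays = ["my summer holiday", "the old bridge"]  # k = 2

Q = build_feature_matrix(grade_essays, toy_extract)    # first feature matrix, q x d
K = build_feature_matrix(history_essays, toy_extract)  # second feature matrix, k x d
print(shape(Q), shape(K))  # (3, 4) (2, 4)
```

With q=500, k=8 and d=4096 the same stacking would yield the 500×4096 and 8×4096 matrices of the example above.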
[0056] Step 203: Based on multi-dimensional features, the first multi-dimensional essay feature matrix and the second multi-dimensional essay feature matrix, identify the first difference data between the essay to be detected and the first essay set and the historical essay set, respectively.
[0057] In the embodiments of this application, based on multi-dimensional features, a first multi-dimensional essay feature matrix, and a second multi-dimensional essay feature matrix, the first difference data between the essay to be detected and the first essay set and the historical essay set can be obtained by taking the multi-dimensional feature vector p (dimension d) of the essay to be detected, the first multi-dimensional essay feature matrix Q (dimension q×d), and the second multi-dimensional essay feature matrix K (dimension k×d) as input data, and calculating the differences between the features of the essay to be detected and the features of each essay in the first essay set, and the differences between the features of the essay to be detected and the features of each essay in the historical essay set through a feature comparison algorithm; and integrating the differences to obtain the first difference data that can comprehensively reflect the two types of differences.
[0058] Optionally, when performing the task of "identifying the first difference data between the essay to be detected and the first set of essays and the historical set of essays based on multi-dimensional features, the first multi-dimensional essay feature matrix, and the second multi-dimensional essay feature matrix", the following methods can be used, but are not limited to: inputting the multi-dimensional features, the first multi-dimensional essay feature matrix, and the second multi-dimensional essay feature matrix into the first target attention network layer; identifying the first interaction features and the first difference features between the multi-dimensional features and the first multi-dimensional essay feature matrix and the second multi-dimensional essay feature matrix through the first target attention network layer; performing non-linear scoring based on the first interaction features and the first difference features to generate the first difference data between the essay to be detected and the first set of essays and the historical set of essays.
[0059] In this embodiment, the first target attention network layer can be a network layer with a multilayer perceptron (MLP) as the core structure. The first target attention network layer can focus on the key correlation and difference information between the features of the essay to be detected and the features of the two essay sets.
[0060] In this embodiment, after inputting the multi-dimensional feature vector p of the essay to be detected, the first multi-dimensional essay feature matrix Q, and the second multi-dimensional essay feature matrix K into the first target attention network layer, the first target attention network layer can first perform deep processing on the input features to identify the first interaction features (i.e., mutually related feature information) and the first difference features (i.e., mutually distinguishable feature information) between the features of the essay to be detected and each feature in Q. It can simultaneously identify the first interaction features and the first difference features between the features of the essay to be detected and each feature in the second multi-dimensional essay feature matrix K.
[0061] Optionally, when performing the action of "generating first difference data between the essay to be detected and the first essay set and the historical essay set by performing nonlinear scoring based on the first interaction feature and the first difference feature", the following method can be used, but is not limited thereto: inputting the multi-dimensional feature vector of the essay to be detected, the multi-dimensional feature vector of a single essay in the first essay set and the historical essay set, the first interaction feature, and the first difference feature into a target multilayer perceptron; in the target multilayer perceptron, concatenating the multi-dimensional feature vector of the essay to be detected, the multi-dimensional feature vector of a single essay in the first essay set and the historical essay set, the first interaction feature, and the first difference feature to generate a concatenated vector; performing nonlinear scoring on the concatenated vector to generate first difference data between the essay to be detected and the first essay set and the historical essay set.
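The concatenate-then-score step described above can be sketched as follows. This is only an illustration under stated assumptions: the `TinyMLP` class, its layer sizes, its random placeholder weights, and the ordering of scores (historical essays first, then grade-level essays) are hypothetical and are not specified by this application.

```python
import math
import random
from typing import List

Vector = List[float]


def pair_features(t: Vector, h: Vector) -> Vector:
    """Concatenate [t, h, t*h (element-wise), t-h] into one 4d-dimensional vector."""
    interaction = [a * b for a, b in zip(t, h)]  # interaction feature
    difference = [a - b for a, b in zip(t, h)]   # difference feature
    return t + h + interaction + difference


class TinyMLP:
    """One tanh hidden layer and a scalar output; weights are random placeholders."""

    def __init__(self, in_dim: int, hidden: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [rng.uniform(-0.1, 0.1) for _ in range(hidden)]
        self.b2 = 0.0

    def score(self, x: Vector) -> float:
        hidden = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        return sum(w * h for w, h in zip(self.w2, hidden)) + self.b2


def difference_vector(t: Vector, K: List[Vector], Q: List[Vector], mlp: TinyMLP) -> Vector:
    """Score t against every row of K and then of Q, giving a vector of length k + q."""
    return [mlp.score(pair_features(t, h)) for h in K + Q]


d = 3
t = [0.2, 0.5, 0.1]                                        # essay to be detected
K = [[0.1, 0.4, 0.3], [0.3, 0.6, 0.0]]                     # historical essays, k = 2
Q = [[0.9, 0.1, 0.2], [0.5, 0.5, 0.5], [0.0, 0.2, 0.8]]    # grade-level essays, q = 3
m = difference_vector(t, K, Q, TinyMLP(in_dim=4 * d, hidden=8))
print(len(m))  # 5
```

Because each pair is scored independently, the length of the output depends only on the number of reference essays, not on the feature dimension d.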
[0062] In this embodiment of the application, the first interaction feature and the first difference feature are non-linearly scored by the MLP in the first target attention network layer. The formula for the non-linear scoring is shown in Formula 1, where $\oplus$ can represent vector concatenation, $\odot$ can represent element-wise multiplication (capturing interaction features), $t$ can represent the multi-dimensional feature vector of the essay to be detected (the vector p above), $h_i$ can represent the multi-dimensional feature vector of a single essay in the first essay set or the historical essay set, and $t - h_i$ can represent the vector difference (capturing difference features). After non-linear scoring based on the first interaction feature and the first difference feature, a difference vector m with dimension k+q can be output. The difference vector m is the first difference data. Each element $m_i$ of the difference vector m can correspond to the difference score between the essay to be detected and one essay in the historical essay set or the first essay set.
[0063] $$m_i = \mathrm{MLP}\big(t \oplus h_i \oplus (t \odot h_i) \oplus (t - h_i)\big), \quad i = 1, \dots, k+q$$ (Formula 1)
Optionally, the step of "using multiple different artificial intelligence tools to generate student essays, obtaining a second essay set, and identifying the second difference data between the essay to be detected and the second essay set based on the multi-dimensional features" can be performed in the following manner, without being limited thereto: determining the essay set corresponding to the target grade level generated by multiple different artificial intelligence tools as the second essay set; extracting features from the second essay set using the multi-dimensional feature extractor to obtain a third multi-dimensional essay feature matrix corresponding to the second essay set; inputting the multi-dimensional features and the third multi-dimensional essay feature matrix into a second target attention network layer; identifying the second interaction feature and the second difference feature between the multi-dimensional features and the third multi-dimensional essay feature matrix through the second target attention network layer; and performing non-linear scoring based on the second interaction feature and the second difference feature to generate the second difference data between the essay to be detected and the second essay set.
[0064] In this embodiment, a second set of essays corresponding to the target grade level generated by multiple different artificial intelligence tools is defined. A multi-dimensional feature extractor is used to extract features from the second set of essays, resulting in a third multi-dimensional essay feature matrix. Multiple different artificial intelligence tools can be invoked for the essays to be tested, with each tool generating essays that meet the writing requirements of the target grade level. These generated essays are then aggregated to form the second set of essays, which can contain v essays. The multi-dimensional feature extractor can be used to extract features from each essay in the second set, extracting textual features, thematic features, and semantic features. These features are then arranged in order to construct a third multi-dimensional essay feature matrix V. The dimension of the third multi-dimensional essay feature matrix V can be v×d, where v represents the number of essays in the second set, and d represents the total dimension of the multi-dimensional features of a single essay.
[0065] In this embodiment, the second target attention network layer can be a network layer with a structure similar to that of the first target attention network layer, and the second target attention network layer can use MLP as the core model.
[0066] In this embodiment of the application, the multi-dimensional feature vector p of the essay to be detected and the third multi-dimensional essay feature matrix V are input into the second target attention network layer. The second target attention network layer can identify the second interaction feature (mutually related feature information) and the second difference feature (mutually distinguishable feature information) between each feature in p and V.
[0067] In this embodiment of the application, the second interaction feature and the second difference feature are non-linearly scored by the MLP in the second target attention network layer. The formula for non-linear scoring is as shown in Formula 1. Non-linear scoring can generate a difference vector n with dimension v. The difference vector n is the second difference data. Each element in the difference vector n corresponds to the difference score between the essay to be detected and an AI-generated essay in the second essay set.
[0068] Optionally, the step of "performing fusion analysis based on the first difference data and the second difference data to generate the target detection result corresponding to the essay to be detected" can be performed in the following manner, without being limited thereto: inputting the first difference data and the second difference data into the target multilayer perceptron, the target multilayer perceptron being trained together with the feature vectors in the multi-dimensional feature extractor, the first target attention network layer, and the second target attention network layer using a binary cross-entropy loss function; and performing fusion analysis on the first difference data and the second difference data through the target multilayer perceptron to generate the target detection result corresponding to the essay to be detected.
[0069] In the embodiments of this application, the target detection result corresponding to the essay to be detected can be generated by fusing and analyzing the first difference data and the second difference data through the target multilayer perceptron. This can be achieved by inputting the first difference data (difference vector m) and the second difference data (difference vector n) into the target multilayer perceptron model. The target multilayer perceptron model can perform deep fusion and analysis on these two types of difference data. Combined with the discrimination rules learned during the training process, it can output a probability value, which is the target detection result. The target detection result can be used to indicate the probability that the essay to be detected was generated by an artificial intelligence tool.
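The fusion step described above can be sketched as a small network that takes the two difference vectors and returns a single probability. The `FusionMLP` class, the fusion by simple concatenation, the hidden size, and the random placeholder weights are all assumptions made for illustration; which class the probability refers to depends on the labels used in training.

```python
import math
import random
from typing import List

Vector = List[float]


class FusionMLP:
    """Toy fusion network: one ReLU hidden layer followed by a sigmoid output."""

    def __init__(self, in_dim: int, hidden: int = 8, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.in_dim = in_dim
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
        self.b2 = 0.0

    def predict(self, m: Vector, n: Vector) -> float:
        x = m + n  # fuse the two difference vectors by concatenation (assumed)
        if len(x) != self.in_dim:
            raise ValueError(f"expected {self.in_dim} inputs, got {len(x)}")
        hidden = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        z = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return 1.0 / (1.0 + math.exp(-z))  # sigmoid maps the score to (0, 1)


m = [0.4, -0.1, 0.3, 0.0, 0.2]  # first difference data, k + q = 5
n = [0.7, 0.6, 0.9]             # second difference data, v = 3
model = FusionMLP(in_dim=len(m) + len(n))
probability = model.predict(m, n)
print(0.0 <= probability <= 1.0)  # True
```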
[0070] As an optional approach, the training process of the target multilayer perceptron includes: determining the training parameters of the target multilayer perceptron, the training parameters including the bucket vector and keyword vector in the multidimensional feature extractor, the parameters of the first target attention network layer, the parameters of the second target attention network layer, and the parameters of the target multilayer perceptron; determining the training sample data of the target multilayer perceptron, the sample data including positive samples and negative samples, wherein positive samples are real writing essays submitted by students, and negative samples are essays generated by various artificial intelligence tools; inputting the first difference data and the second difference data of each sample in the sample data into the target multilayer perceptron in an initial state, and outputting the predicted probability of the sample data being essays generated by artificial intelligence tools based on the training parameters; calculating the loss data of each sample in the sample data using a binary classification cross-entropy loss function, the loss data being the difference data between the true label and the predicted probability of each sample; determining the average loss data of the sample data based on the loss data, the number of sample data, and the true label of the sample data, and adjusting the training parameters according to the average loss data; and completing the training of the target multilayer perceptron when the average loss data is less than a loss threshold.
[0071] In the embodiments of this application, during the training process of the target multilayer perceptron (MLP), the trainable parameters may include the bucket vector in the multidimensional feature extractor, the keyword vector in the topic feature extractor, the relevant parameters of the first target attention network layer, the relevant parameters of the second target attention network layer, and the parameters of the target multilayer perceptron itself; the positive samples in the training set may be a large number of real writing essays submitted by students, and the sample label may be label=1; the negative samples in the training set may be essays generated by various artificial intelligence tools, and the sample label may be label=0.
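The average binary cross-entropy loss and the loss-threshold stopping rule used in training can be sketched as follows. The labels follow the convention stated above (1 for real student essays, 0 for AI-generated essays); the predicted probabilities, the clamping constant `eps`, and the threshold value are hypothetical.

```python
import math
from typing import Sequence


def binary_cross_entropy(labels: Sequence[int], probs: Sequence[float],
                         eps: float = 1e-12) -> float:
    """Average binary cross-entropy over all samples."""
    if not labels or len(labels) != len(probs):
        raise ValueError("labels and probs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


labels = [1, 1, 0, 0]         # 1 = real student essay, 0 = AI-generated essay
probs = [0.9, 0.8, 0.2, 0.1]  # hypothetical model outputs for the four samples
loss = binary_cross_entropy(labels, probs)

LOSS_THRESHOLD = 0.2          # hypothetical stopping threshold
print(loss < LOSS_THRESHOLD)  # True
```

In a real training loop the parameters would be updated by gradient descent until this check passes; the sketch only shows the loss computation and the stopping condition.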
[0072] For the embodiments of this application, the loss function used for training can be the binary cross-entropy loss function, as shown in Formula 2, where L can represent the average loss data during model training. L measures how far the model's current predictions deviate from the true labels of the samples: the smaller the average loss data, the closer the predictions are to the true labels. The average loss data can be calculated by first computing the loss data of each single sample and then averaging over the number of samples. $y_i$ can represent the binary label of the i-th sample, $\hat{y}_i$ can represent the predicted probability for the i-th sample, and N can represent the total number of samples in the training set. Training of the target multilayer perceptron is completed when the average loss data is less than the loss threshold.
[0073] $$L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_i + (1 - y_i)\log\left(1 - \hat{y}_i\right)\right]$$ (Formula 2)
Optionally, the step of "extracting features from the essay to be detected using a multi-dimensional feature extractor to obtain the multi-dimensional features corresponding to the essay to be detected" can be performed in the following manner, without being limited thereto: extracting word segmentation features and syntactic features corresponding to the essay to be detected using the text feature extractor in the multi-dimensional feature extractor, performing bucket quantization based on the word segmentation features and syntactic features to obtain the bucket-quantized data corresponding to the essay to be detected, and mapping the bucket-quantized data to text features; extracting keywords and their corresponding weights from the essay to be detected using the topic feature extractor in the multi-dimensional feature extractor, and generating topic features based on the keywords and weights; and extracting semantic vectors corresponding to the essay to be detected using the semantic feature extractor in the multi-dimensional feature extractor, and generating semantic features based on the semantic vectors, wherein the semantic feature extractor is obtained by training an open-source large language model using real student essay content as the training set.
[0074] In this embodiment of the application, the text feature extractor in the multi-dimensional feature extractor extracts the word segmentation features and syntactic features corresponding to the essay to be detected, performs bucket quantization based on them to obtain the bucket-quantized data corresponding to the essay to be detected, and maps the bucket-quantized data to text features. The specific process may include the following. A word segmenter (including but not limited to Jieba and HanLP) segments the content of the essay to be detected, breaking the essay into independent words and determining the part of speech of each word. The word segmentation features can then be obtained through statistical methods and may include bag-of-words features composed of commonly used words, the total number of words, the proportion of single-character words, the proportion of two-character words, the proportion of multi-character words, and the distribution ratio of different parts of speech such as nouns (n), verbs (v), and adjectives (a). The syntactic features can be obtained by using a syntactic analyzer (including but not limited to spaCy) to parse the essay to be detected, and can include the total number of sentences, the distribution ratio of sentences of different lengths, and the distribution of syntactic complexity (e.g., counting the number of grammatical relations such as nsubj (subject), dobj (object), and acl (relative clause)). The word segmentation features and syntactic features are then bucketed and quantized. Integer-valued features (such as the total number of words and the total number of sentences) are divided into the ranges 0~50, 50~100, 100~500, 500~1000, and above 1000, and each feature value corresponds to a bucket ID. Proportion-valued features (such as the proportion of single-character words and the distribution ratio of nouns) can be divided into buckets at 10% intervals, and each feature value likewise corresponds to a bucket ID. Based on the bucket ID of each feature, the corresponding vector feature x_i can be obtained by looking up the preset vector table. The vector feature x_i is the text feature.
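The bucket quantization and table lookup described above can be sketched as follows. Several details are assumptions made for illustration: each range is treated as including its lower bound (so a value of 50 falls into 50~100), the vector table holds small toy vectors, and the vectors of the individual features are concatenated into one text feature.

```python
from bisect import bisect_right
from typing import Dict, List

# Upper edges of the integer ranges 0~50, 50~100, 100~500, 500~1000, above 1000.
COUNT_EDGES = [50, 100, 500, 1000]


def count_bucket(value: int) -> int:
    """Map an integer-valued feature to a bucket ID in 0..4."""
    if value < 0:
        raise ValueError("counts must be non-negative")
    return bisect_right(COUNT_EDGES, value)


def ratio_bucket(value: float) -> int:
    """Map a proportion in [0, 1] to one of ten 10%-wide buckets (IDs 0..9)."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("proportions must lie in [0, 1]")
    return min(int(value * 10), 9)  # 1.0 is placed in the last bucket


# Hypothetical preset vector table: feature name -> one toy vector per bucket ID.
VECTOR_TABLE: Dict[str, List[List[float]]] = {
    "total_words": [[0.0, 0.1], [0.1, 0.2], [0.2, 0.3], [0.3, 0.4], [0.4, 0.5]],
    "single_char_ratio": [[i / 10, 1 - i / 10] for i in range(10)],
}


def text_feature(total_words: int, single_char_ratio: float) -> List[float]:
    """Look up the vector for each bucket ID and concatenate the results."""
    x: List[float] = []
    x += VECTOR_TABLE["total_words"][count_bucket(total_words)]
    x += VECTOR_TABLE["single_char_ratio"][ratio_bucket(single_char_ratio)]
    return x


print(count_bucket(320), ratio_bucket(0.25))  # 2 2
```

In training, the entries of such a table would be learnable parameters rather than fixed toy values.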
[0075] In this embodiment, the topic feature extractor in the multi-dimensional feature extractor extracts the keywords and their corresponding weights from the essay to be tested. The specific process of generating topic features based on the keywords and weights may include: using a topic keyword extractor (including but not limited to TextRank4ZH and HanLP) to extract keywords from the content of the essay to be tested, and determining the weight of each keyword. The extraction result can be represented in the form of [(Life Value, 0.3), (Social Responsibility, 0.25), (Dialectical Thinking, 0.2), ...]. Each keyword is converted into a corresponding token ID. The word vector corresponding to each keyword can be obtained by looking up the preset vector table based on the token ID. The vector obtained by multiplying each word vector with its corresponding keyword weight is the topic feature x_j.
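The keyword-weighting step described above can be sketched as follows. The keywords and weights are those of the example in the text; the token IDs and the toy vector table are hypothetical. The text does not state how the weighted vectors are pooled afterwards, so the sketch simply returns one weighted vector per keyword.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def topic_features(keywords: List[Tuple[str, float]],
                   token_ids: Dict[str, int],
                   vector_table: List[Vector]) -> List[Vector]:
    """Scale each keyword's word vector by that keyword's weight."""
    features = []
    for word, weight in keywords:
        word_vector = vector_table[token_ids[word]]  # lookup by token ID
        features.append([weight * v for v in word_vector])
    return features


keywords = [("Life Value", 0.3), ("Social Responsibility", 0.25), ("Dialectical Thinking", 0.2)]
token_ids = {"Life Value": 0, "Social Responsibility": 1, "Dialectical Thinking": 2}  # hypothetical IDs
vector_table = [[1.0, 0.0], [2.0, 4.0], [0.0, 5.0]]                                    # toy word vectors

x_j = topic_features(keywords, token_ids, vector_table)
print(x_j[1])  # [0.5, 1.0]
```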
[0076] In this embodiment, the specific process of extracting the semantic vector corresponding to the essay to be detected through the semantic feature extractor in the multi-dimensional feature extractor, and generating semantic features based on the semantic vector, may include: using an open-source pre-trained large language model (including but not limited to Qwen3-8B) as the base model, and post-training the base model with real student essay content as the training set. The input corpus for post-training can be a large number of real student essays covering the three grade levels of primary school, junior high school, and senior high school. The loss function used in post-training can be an autoregressive loss function. The training process can be completed on an A100 GPU server. After post-training is completed, a specific prompt can be constructed, and the content of the essay to be detected can be input into the post-trained language model. The model can output a semantic vector, which can be 4096-dimensional. Based on the semantic vector, the semantic features of the essay to be detected can be generated.
[0077] Optionally, this application also provides an example of detecting AI-generated content in student essays, as shown in Figure 3. The AIGC detection example in Figure 3 can take the following as its core inputs: the student's historical essay set K (i.e., the historical essay set in this application embodiment), the set Q of real essays from the student's grade level (i.e., the first essay set in this application embodiment), the essay set V generated by various artificial intelligence tools for the topic of the essay to be detected (i.e., the second essay set in this application embodiment), and the student's currently submitted essay p (i.e., the essay to be detected in this application embodiment). All input essays undergo feature extraction by a multi-dimensional feature extractor that includes a text feature extractor, a topic feature extractor, and a semantic feature extractor. The resulting features are: a historical essay feature matrix with dimension k×d (i.e., the second multi-dimensional essay feature matrix in this application embodiment, where k is the number of historical essays and d is the feature dimension of a single essay); a grade-level essay feature matrix with dimension q×d (i.e., the first multi-dimensional essay feature matrix in this application embodiment, where q is the number of essays in the grade level); an artificial intelligence-generated essay feature matrix with dimension v×d (i.e., the third multi-dimensional essay feature matrix in this application embodiment, where v is the number of essays generated by artificial intelligence); and a submitted essay feature vector with dimension d.
[0078] Optionally, the student writing feature extraction module in Figure 3 can receive the historical essay feature matrix, the grade-level essay feature matrix, and the submitted essay feature vector. After capturing the difference and correlation features between the essay to be detected and the real essays through the target attention network layer, it can output a feature vector with dimension q+k (i.e., the first difference data in this application embodiment).
[0079] Optionally, the AIGC tool writing feature extraction module in Figure 3 can take the submitted essay feature vector and the AIGC-generated essay feature matrix as input and, after processing by another independent target attention network layer, output a difference feature vector of dimension v (i.e., the second difference data in this application embodiment). The two feature vectors are input into the MLP (multilayer perceptron) (i.e., the target multilayer perceptron in this application embodiment) for deep fusion and nonlinear transformation. The MLP performs a binary classification judgment and outputs a probability value for judging whether the essay to be detected was generated by AIGC (i.e., the target detection result in this application embodiment). The probability value lies in the range [0,1]. The closer the probability value is to 0, the more likely the essay was generated by AIGC; the closer it is to 1, the more likely the essay is the student's real writing.
[0080] Compared with existing technologies, this embodiment determines the target grade level of the student and the corresponding first essay set and historical essay set, extracts their feature matrices using a multi-dimensional feature extractor, and then identifies the first difference data based on the multi-dimensional features of the essay to be tested, thereby achieving standardized extraction of relevant difference data related to grade level and individual history. By inputting the multi-dimensional features of the essay to be tested, the first and second multi-dimensional essay feature matrices into the first target attention network layer, interaction features and difference features are identified and non-linear scoring is performed, thereby improving the accuracy of the identification of the first difference data. By determining the second essay set generated by multiple different artificial intelligence tools and extracting the third multi-dimensional essay feature matrix, and inputting it and the multi-dimensional features of the essay to be tested into the second target attention network layer to identify relevant features and score, the accurate acquisition of the second difference data is achieved. By inputting the first and second difference data into a target multilayer perceptron trained with specific methods for fusion analysis, the reliability of the target detection results is further improved. Through the binning quantization of the text feature extractor, the keyword and weight processing of the topic feature extractor, and the semantic vector extraction of the semantic feature extractor (trained with real student essays), the comprehensive and accurate extraction of the multi-dimensional features of the essay to be tested is achieved.
[0081] Furthermore, as a specific implementation of the method shown in Figure 1 and Figure 2, this embodiment provides a device for detecting AI-generated content in student essays. As shown in Figure 4, the device includes: an extraction module 31, an identification module 32, and a generation module 33.
[0082] The extraction module 31 is configured to acquire the student's essay to be detected, and to extract features from the essay to be detected using a multi-dimensional feature extractor to obtain the multi-dimensional features corresponding to the essay to be detected. The multi-dimensional features include at least one of text features, topic features, and semantic features.
The identification module 32 is configured to identify the target grade level of the student, combine different student essays of the target grade level to obtain a first essay set, and identify the first difference data between the essay to be detected and the first essay set and the student's historical essay set based on the multi-dimensional features.
The identification module 32 is also configured to use multiple different artificial intelligence tools to generate student essays, obtain a second essay set, and identify second difference data between the essay to be detected and the second essay set based on the multi-dimensional features.
The generation module 33 is configured to perform fusion analysis based on the first difference data and the second difference data to generate the target detection result corresponding to the essay to be detected. The target detection result is used to determine whether the essay to be detected is a student essay generated by an artificial intelligence tool.
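The division of work among the three modules can be sketched as a simple composition of callables. The `DetectionDevice` class, its field names, and the placeholder lambdas are hypothetical and serve only to show the data flow from modules 31 and 32 to module 33; they do not implement the feature extractor or the networks described in this application.

```python
import math
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class DetectionDevice:
    extract: Callable[[str], Vector]               # extraction module 31
    first_difference: Callable[[Vector], Vector]   # identification module 32, real-essay sets
    second_difference: Callable[[Vector], Vector]  # identification module 32, AI-generated set
    generate: Callable[[Vector, Vector], float]    # generation module 33

    def detect(self, essay: str) -> float:
        """Run the three modules in order and return the target detection result."""
        features = self.extract(essay)
        m = self.first_difference(features)    # first difference data
        n = self.second_difference(features)   # second difference data
        return self.generate(m, n)


# Placeholder callables standing in for the real modules.
device = DetectionDevice(
    extract=lambda text: [float(len(text.split()))],
    first_difference=lambda f: [f[0] - 4.0, f[0] - 6.0],
    second_difference=lambda f: [f[0] - 5.0],
    generate=lambda m, n: 1.0 / (1.0 + math.exp(-(sum(m) - sum(n)))),
)
result = device.detect("a short essay about spring")
print(result)  # 0.5
```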
[0083] In some examples of this embodiment, the identification module 32 is specifically configured to identify the target grade level of the student, determine the set of essays by students in the target grade level as the first essay set, and determine the set of the student's own essays in the target grade level as the historical essay set, wherein the first essay set includes essays written at the target grade level by students other than the student in question; the multi-dimensional feature extractor performs feature extraction on the first essay set and the historical essay set respectively to obtain a first multi-dimensional essay feature matrix corresponding to the first essay set and a second multi-dimensional essay feature matrix corresponding to the historical essay set; based on the multi-dimensional features, the first multi-dimensional essay feature matrix, and the second multi-dimensional essay feature matrix, the first difference data between the essay to be detected and the first essay set and the historical essay set respectively is identified.
[0084] In some examples of this embodiment, the identification module 32 is further configured to input the multi-dimensional features, the first multi-dimensional essay feature matrix, and the second multi-dimensional essay feature matrix into the first target attention network layer, and to identify the first interaction features and the first difference features between the multi-dimensional features and the first multi-dimensional essay feature matrix and the second multi-dimensional essay feature matrix through the first target attention network layer; and to perform non-linear scoring based on the first interaction features and the first difference features to generate the first difference data between the essay to be detected and the first essay set and the historical essay set.
[0085] In some examples of this embodiment, the identification module 32 is further configured to input the multi-dimensional feature vector of the essay to be detected, the multi-dimensional feature vector of a single essay in the first essay set and the historical essay set, the first interaction feature, and the first difference feature into a target multilayer perceptron; in the target multilayer perceptron, the multi-dimensional feature vector of the essay to be detected, the multi-dimensional feature vector of a single essay in the first essay set and the historical essay set, the first interaction feature, and the first difference feature are concatenated to generate a concatenated vector; the concatenated vector is non-linearly scored to generate first difference data between the essay to be detected and the first essay set and the historical essay set, respectively.
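The attention, concatenation, and non-linear scoring steps described in the two preceding paragraphs can be sketched roughly as follows. This is a minimal illustration only, not the applicant's implementation. It assumes scaled dot-product attention, an element-wise product as the interaction feature, an element-wise absolute gap as the difference feature, and a one-hidden-layer perceptron with a sigmoid output. It also assumes that the attention context vector stands in for the feature vector of a single reference essay. The function names, toy vectors, and untrained weights are all hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, reference_matrix):
    """Scaled dot-product attention of one essay vector over a reference set."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, row) / scale for row in reference_matrix])
    # Attention-weighted average of the reference essay vectors.
    return [sum(w * row[j] for w, row in zip(weights, reference_matrix))
            for j in range(len(query))]

def pair_features(query, context):
    interaction = [q * c for q, c in zip(query, context)]      # assumed interaction feature
    difference = [abs(q - c) for q, c in zip(query, context)]  # assumed difference feature
    return query + context + interaction + difference          # concatenated vector

def mlp_score(x, w1, b1, w2, b2):
    hidden = [max(0.0, dot(row, x) + b) for row, b in zip(w1, b1)]  # ReLU layer
    return 1.0 / (1.0 + math.exp(-(dot(w2, hidden) + b2)))         # sigmoid score

def score_against(essay, reference_matrix, params):
    x = pair_features(essay, attend(essay, reference_matrix))
    return mlp_score(x, *params)

# Toy 3-dimensional feature vectors (hypothetical values).
essay = [0.2, 0.8, 0.5]
grade_set = [[0.1, 0.9, 0.4], [0.3, 0.7, 0.6]]   # first essay set
history_set = [[0.6, 0.2, 0.9]]                  # student's historical essay set

# Untrained, hand-picked weights: 12 inputs -> 2 hidden units -> 1 score.
params = ([[0.1] * 12, [-0.1] * 12], [0.0, 0.0], [0.5, -0.5], 0.0)
first_difference = [score_against(essay, ref, params)
                    for ref in (grade_set, history_set)]
```

Here `first_difference` holds one score per reference set, mirroring the idea that the first difference data is produced for the first essay set and the historical essay set respectively. In a trained system the weights would come from the training procedure rather than being fixed by hand.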
[0086] In some examples of this embodiment, the identification module 32 is further configured to determine the set of essays corresponding to the target learning stage generated by multiple different artificial intelligence tools as the second essay set; extract features from the second essay set using a multi-dimensional feature extractor to obtain a third multi-dimensional essay feature matrix corresponding to the second essay set; input the multi-dimensional features and the third multi-dimensional essay feature matrix into a second target attention network layer; identify the second interaction feature and the second difference feature between the multi-dimensional features and the third multi-dimensional essay feature matrix through the second target attention network layer; perform non-linear scoring based on the second interaction feature and the second difference feature to generate second difference data between the essay to be detected and the second essay set.
[0087] In some examples of this embodiment, the generation module 33 is specifically configured to input the first difference data and the second difference data into the target multilayer perceptron, which is trained using a binary classification cross-entropy loss function based on the feature vectors in the multidimensional feature extractor, the first target attention network layer, and the second target attention network layer; the target multilayer perceptron performs fusion analysis on the first difference data and the second difference data to generate the target detection result corresponding to the essay to be detected.
[0088] In some examples of this embodiment, the generation module 33 is specifically configured to: determine the training parameters of the target multilayer perceptron, the training parameters including the bucket vector and keyword vector in the multidimensional feature extractor, the parameters of the first target attention network layer, the parameters of the second target attention network layer, and the parameters of the target multilayer perceptron; determine the training sample data of the target multilayer perceptron, the sample data including positive samples and negative samples, wherein the positive samples are genuine essays written and submitted by students, and the negative samples are essays generated by various artificial intelligence tools; input the first difference data and the second difference data of each sample in the sample data into the target multilayer perceptron in the initial state, and output, based on the training parameters, the predicted probability that each sample is an essay generated by an artificial intelligence tool; calculate the loss data of each sample in the sample data using a binary classification cross-entropy loss function, the loss data being the difference data between the true label and the predicted probability of each sample; determine the average loss data of the sample data based on the loss data, the number of sample data, and the true label of the sample data, and adjust the training parameters according to the average loss data; and complete the training of the target multilayer perceptron when the average loss data is less than a loss threshold.
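The training loop described above (per-sample binary cross-entropy, average loss, parameter update, stop once the average loss falls below a threshold) can be illustrated with the following sketch. It makes several simplifying assumptions that are not stated in the source: a single sigmoid unit stands in for the target multilayer perceptron, only the fusion weights are trained, the label y = 1 marks an AI-generated essay, and plain batch gradient descent is used. The sample values, learning rate, and threshold are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(y, p, eps=1e-12):
    """Binary cross-entropy for one sample with true label y and prediction p."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def predict(w, b, diffs):
    d1, d2 = diffs
    return sigmoid(w[0] * d1 + w[1] * d2 + b)

def train(samples, lr=0.5, loss_threshold=0.1, max_epochs=5000):
    """samples: list of ((first_difference, second_difference), label)."""
    w, b = [0.0, 0.0], 0.0
    avg_loss = float("inf")
    n = len(samples)
    for _ in range(max_epochs):
        grad_w, grad_b, total = [0.0, 0.0], 0.0, 0.0
        for (d1, d2), y in samples:
            p = predict(w, b, (d1, d2))
            total += bce(y, p)
            err = p - y              # gradient of BCE w.r.t. the pre-sigmoid logit
            grad_w[0] += err * d1
            grad_w[1] += err * d2
            grad_b += err
        avg_loss = total / n         # average loss over all samples
        if avg_loss < loss_threshold:
            break                    # training is considered complete
        w = [w[0] - lr * grad_w[0] / n, w[1] - lr * grad_w[1] / n]
        b -= lr * grad_b / n
    return w, b, avg_loss

# Hypothetical toy data: AI-generated essays differ strongly from the student's
# own references (large d1) and only slightly from the AI-generated set (small d2).
samples = [((0.9, 0.1), 1), ((0.8, 0.2), 1), ((0.2, 0.9), 0), ((0.1, 0.8), 0)]
w, b, avg_loss = train(samples)
```

After training, `predict(w, b, (d1, d2))` returns the estimated probability that an essay with the given difference scores was generated by an artificial intelligence tool. In the arrangement the source describes, the same loss would also adjust the bucket vectors, keyword vectors, and attention-layer parameters, which this sketch leaves out.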
[0089] In some examples of this embodiment, the extraction module 31 is specifically configured to extract the word segmentation features and syntactic features corresponding to the essay to be detected through the text feature extractor in the multi-dimensional feature extractor, and perform bucket quantization based on the word segmentation features and syntactic features to obtain the bucket quantized data corresponding to the essay to be detected, and map the bucket quantized data to text features; extract the keywords and the weights corresponding to the keywords corresponding to the essay to be detected through the topic feature extractor in the multi-dimensional feature extractor, and generate topic features based on the keywords and weights; extract the semantic vector corresponding to the essay to be detected through the semantic feature extractor in the multi-dimensional feature extractor, and generate semantic features based on the semantic vector. The semantic feature extractor is obtained by training an open-source large language model using real student essay content as the training set.
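As a rough illustration of the text and topic feature extractors described above, the sketch below buckets two simple surface statistics and derives keyword weights from term frequency. Everything specific here is an assumption made for illustration: whitespace and punctuation tokenization stands in for word segmentation, average sentence length stands in for a syntactic feature, the bucket boundaries are arbitrary, and term frequency stands in for whatever keyword weighting the applicant uses. The semantic feature extractor, which relies on a fine-tuned large language model, is not sketched.

```python
import re
from collections import Counter

def bucketize(value, boundaries):
    """Return the index of the bucket that value falls into (boundaries ascending)."""
    for i, upper in enumerate(boundaries):
        if value < upper:
            return i
    return len(boundaries)

def text_features(essay):
    """Bucket-quantized surface statistics (hypothetical choice of statistics)."""
    sentences = [s for s in re.split(r"[.!?。！？]+", essay) if s.strip()]
    words = re.findall(r"\w+", essay.lower())
    avg_sentence_len = len(words) / max(len(sentences), 1)
    type_token_ratio = len(set(words)) / max(len(words), 1)
    return [bucketize(avg_sentence_len, [5, 10, 20]),
            bucketize(type_token_ratio, [0.3, 0.6, 0.9])]

def topic_features(essay, top_k=3):
    """Top keywords with normalized term-frequency weights."""
    words = [w for w in re.findall(r"\w+", essay.lower()) if len(w) > 3]
    counts = Counter(words)
    total = sum(counts.values()) or 1
    return [(word, count / total) for word, count in counts.most_common(top_k)]

essay = "My dog runs fast. My dog likes the park. The park is big."
buckets = text_features(essay)
keywords = topic_features(essay)
```

In a full system each bucket index would typically select a trainable bucket vector, and each keyword a keyword vector, which matches the training parameters the source lists; here the indices and weights are returned directly to keep the example short.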
[0090] It should be noted that other corresponding descriptions of the functional units involved in the device for detecting AI-generated content in student essays provided in this embodiment can be found in the corresponding descriptions of Figure 1 and Figure 2, and will not be repeated here.
[0091] Based on the method shown in Figure 1 and Figure 2 above, this embodiment accordingly also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the method shown in Figure 1 and Figure 2.
[0092] Based on this understanding, the technical solution of this application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as CD-ROM, USB flash drive, mobile hard drive, etc.) and includes several instructions to cause a computer device (such as personal computer, server, or network device, etc.) to execute the methods of various implementation scenarios of this application.
[0093] Figure 5 is a schematic diagram of the hardware structure of an electronic device according to the present invention. As shown in Figure 5, the electronic device comprises: at least one processor 401; and a memory 402 communicatively connected to the at least one processor 401; wherein the memory 402 stores instructions that can be executed by the at least one processor 401 to enable the at least one processor 401 to perform the aforementioned method for detecting AI-generated content in student essays.
[0094] Figure 5 takes one processor 401 as an example.
[0095] The electronic device may also include an input device 403 and an output device 404.
[0096] The processor 401, memory 402, input device 403, and output device 404 can be connected via a bus or other means. Figure 5 takes connection via a bus as an example.
[0097] Memory 402, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions / modules corresponding to the method for detecting AI-generated content in student essays in this application embodiment, for example, the method flows shown in Figure 1 and Figure 2. The processor 401 executes various functional applications and data processing by running non-volatile software programs, instructions, and modules stored in the memory 402, thereby implementing the method for detecting AI-generated content in student essays in the above embodiments.
[0098] Memory 402 may include a program storage area and a data storage area. The program storage area may store an operating system and applications required for at least one function; the data storage area may store data created based on the use of the method for detecting AI-generated content in student essays. Furthermore, memory 402 may include high-speed random access memory and may also include non-volatile memory, such as at least one disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, memory 402 may optionally include memory remotely located relative to processor 401, and this remote memory may be connected via a network to the apparatus that performs the method for detecting AI-generated content in student essays. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
[0099] The input device 403 can receive user clicks and generate signal inputs related to user settings and function control of the apparatus that performs the method for detecting AI-generated content in student essays. The output device 404 may include a display device such as a screen.
[0100] One or more modules are stored in memory 402, and when run by one or more processors 401, the method for detecting AI-generated content in student essays in any of the above method embodiments is executed.
[0101] Optionally, the aforementioned physical devices may also include a user interface, a network interface, a camera, radio frequency (RF) circuitry, sensors, audio circuitry, a Wi-Fi module, etc. The user interface may include a display screen, input units such as a keyboard, etc., and optional user interfaces may also include USB interfaces, card reader interfaces, etc. The network interface may optionally include standard wired interfaces, wireless interfaces (such as Wi-Fi interfaces), etc.
[0102] Those skilled in the art will understand that the physical device structure provided in this embodiment does not constitute a limitation on the physical device, and may include more or fewer components, or combine certain components, or have different component arrangements.
[0103] The storage medium may also include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the aforementioned physical device, supporting the operation of information processing programs and other software and / or programs. The network communication module is used to enable communication between the various components within the storage medium, as well as communication with other hardware and software in the information processing physical device.
[0104] Through the above description of the embodiments, those skilled in the art can clearly understand that this application can be implemented by means of software plus necessary general-purpose hardware platforms, or it can be implemented by hardware. By applying the solution of this embodiment, compared with the existing technology, this embodiment obtains the student's essay to be detected and uses a multi-dimensional feature extractor to extract features from the essay to be detected to obtain multi-dimensional features, thereby comprehensively capturing the features of the student's essay; by identifying, based on the multi-dimensional features, the first difference data between the essay to be detected and each of the first essay set corresponding to the student's grade level and the student's historical essay set, it accurately identifies grade-level differences and individual historical differences; by identifying, based on the multi-dimensional features, the second difference data between the essay to be detected and the second essay set generated by multiple artificial intelligence tools, it effectively captures the features of essays generated by different artificial intelligence tools; and by performing fusion analysis based on the first difference data and the second difference data to generate the target detection result, it improves the accuracy of detecting student essays generated by artificial intelligence. The multi-dimensional features of the essay to be detected, along with the first and second multi-dimensional essay feature matrices, are input into the first target attention network layer to identify interaction and difference features and perform non-linear scoring, thereby improving the accuracy with which the first difference data is identified. A second essay set generated by multiple artificial intelligence tools is determined, and a third multi-dimensional essay feature matrix is extracted from it. This matrix, along with the multi-dimensional features of the essay to be detected, is input into the second target attention network layer to identify the relevant features and score them, so that the second difference data is obtained accurately. The reliability of the target detection result is further improved by feeding the first and second difference data into a specifically trained target multilayer perceptron for fusion. Comprehensive and accurate extraction of the multi-dimensional features of the essay to be detected is achieved through the bucket quantization of the text feature extractor, the keyword and weight processing of the topic feature extractor, and the semantic vector extraction of the semantic feature extractor (trained with real student essays).
[0105] It should be noted that, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes the element.
[0106] The above are merely specific embodiments of this application, enabling those skilled in the art to understand or implement this application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not to be limited to these embodiments, but is to be accorded the widest scope consistent with the principles and novel features claimed herein.
Claims
1. A method for detecting AI-generated content in student essays, characterized in that the method comprises: The student's essay to be detected is obtained, and features are extracted from the essay to be detected using a multi-dimensional feature extractor to obtain the multi-dimensional features corresponding to the essay to be detected. The multi-dimensional features include at least one of text features, topic features, and semantic features. Identify the target grade level of the student, combine the essays of different students in the target grade level to obtain a first essay set, and identify, based on the multi-dimensional features, the first difference data between the essay to be detected and each of the first essay set and the student's historical essay set; Multiple different artificial intelligence tools are used to generate student essays to obtain a second set of essays, and the second difference data between the essay to be detected and the second set of essays is identified based on the multi-dimensional features. Based on the first difference data and the second difference data, a fusion analysis is performed to generate the target detection result corresponding to the essay to be detected. The target detection result is used to determine whether the essay to be detected is a student essay generated by an artificial intelligence tool.
2. The method according to claim 1, characterized in that, The process of identifying the target grade level of the student, combining essays from different students in the target grade level to obtain a first essay set, and identifying first difference data between the essay to be detected and the first essay set and the student's historical essay set based on the multi-dimensional features includes: Identify the target grade level to which the student belongs, determine the set of student essays in the target grade level as the first essay set, and determine the set of the student's own essays in the target grade level as the historical essay set, wherein the first essay set includes essays from other students in the target grade level besides the student; The multi-dimensional feature extractor extracts features from the first set of essays and the set of historical essays respectively, to obtain a first multi-dimensional essay feature matrix corresponding to the first set of essays and a second multi-dimensional essay feature matrix corresponding to the set of historical essays. Based on the multi-dimensional features, the first multi-dimensional essay feature matrix, and the second multi-dimensional essay feature matrix, the first difference data between the essay to be detected and each of the first essay set and the historical essay set is identified.
3. The method according to claim 2, characterized in that, The step of identifying the first difference data between the essay to be detected and the first essay set and the historical essay set, based on the multi-dimensional features, the first multi-dimensional essay feature matrix, and the second multi-dimensional essay feature matrix, includes: The multi-dimensional features, the first multi-dimensional essay feature matrix, and the second multi-dimensional essay feature matrix are input into the first target attention network layer. The first target attention network layer identifies the first interaction features and the first difference features between the multi-dimensional features and the first multi-dimensional essay feature matrix and the second multi-dimensional essay feature matrix, respectively. Based on the first interaction feature and the first difference feature, a non-linear scoring is performed to generate first difference data between the essay to be detected and the first essay set and the historical essay set, respectively.
4. The method according to claim 3, characterized in that, The step of performing non-linear scoring based on the first interaction feature and the first difference feature to generate first difference data between the essay to be detected and the first essay set and the historical essay set, respectively, includes: The multi-dimensional feature vector of the essay to be detected, the multi-dimensional feature vector of a single essay in the first essay set and the historical essay set, the first interaction feature, and the first difference feature are input into the target multilayer perceptron. In the target multilayer perceptron, the multidimensional feature vector of the essay to be detected, the multidimensional feature vector of a single essay in the first essay set and the historical essay set, the first interaction feature, and the first difference feature are concatenated to generate a concatenated vector. The concatenated vector is non-linearly scored to generate first difference data between the essay to be detected and the first essay set and the historical essay set.
5. The method according to claim 2, characterized in that, The process of generating student essays using multiple different artificial intelligence tools to obtain a second set of essays, and identifying second difference data between the essay to be detected and the second set of essays based on the multi-dimensional features, includes: The essay set corresponding to the target learning stage generated by the multiple different artificial intelligence tools is determined as the second essay set. The second essay set is then subjected to feature extraction by the multi-dimensional feature extractor to obtain the third multi-dimensional essay feature matrix corresponding to the second essay set. The multi-dimensional features and the third multi-dimensional essay feature matrix are input into the second target attention network layer, and the second target attention network layer identifies the second interaction features and the second difference features between the multi-dimensional features and the third multi-dimensional essay feature matrix. A non-linear scoring method is used based on the second interaction feature and the second difference feature to generate second difference data between the essay to be detected and the second essay set.
6. The method according to claim 1, characterized in that, The step of fusing and analyzing the first difference data and the second difference data to generate the target detection result corresponding to the essay to be detected includes: The first difference data and the second difference data are input into the target multilayer perceptron, which is trained based on the feature vectors in the multidimensional feature extractor, the first target attention network layer and the second target attention network layer using the binary classification cross-entropy loss function. The target multilayer perceptron is used to fuse and analyze the first difference data and the second difference data to generate the target detection result corresponding to the essay to be detected.
7. The method according to claim 6, characterized in that, The training process of the target multilayer perceptron includes: The training parameters of the target multilayer perceptron are determined, and the training parameters include the bucket vector and keyword vector in the multidimensional feature extractor, the parameters of the first target attention network layer, the parameters of the second target attention network layer, and the parameters of the target multilayer perceptron. The sample data for training the target multilayer perceptron is determined, and the sample data includes positive samples and negative samples, wherein the positive samples are genuine essays written and submitted by students, and the negative samples are essays generated by various artificial intelligence tools; The first and second difference data of each sample in the sample data are input into the target multilayer perceptron in the initial state, and the predicted probability that each sample is an essay generated by an artificial intelligence tool is output based on the training parameters. The loss data for each sample in the sample data is calculated using the binary cross-entropy loss function, where the loss data is the difference between the true label and the predicted probability of each sample. The average loss data of the sample data is determined based on the loss data, the number of sample data, and the true label of the sample data, and the training parameters are adjusted according to the average loss data. Training of the target multilayer perceptron is completed when the average loss data is less than the loss threshold.
8. The method according to claim 1, characterized in that, The step of extracting features from the essay to be detected using a multi-dimensional feature extractor to obtain the multi-dimensional features corresponding to the essay to be detected includes: The text feature extractor in the multi-dimensional feature extractor extracts the word segmentation features and syntactic features corresponding to the essay to be detected, and performs bucket quantization based on the word segmentation features and the syntactic features to obtain the bucket quantized data corresponding to the essay to be detected, and maps the bucket quantized data to the text features; The topic feature extractor in the multi-dimensional feature extractor extracts the keywords and weights corresponding to the essay to be detected, and generates the topic features based on the keywords and weights. The semantic vector corresponding to the essay to be detected is extracted by the semantic feature extractor in the multi-dimensional feature extractor, and the semantic features are generated based on the semantic vector. The semantic feature extractor is obtained by training an open-source large language model using real student essay content as the training set.
9. A device for detecting AI-generated content in student essays, characterized in that the device comprises: The extraction module is configured to acquire the student's essay to be detected, and to extract features from the essay to be detected using a multi-dimensional feature extractor to obtain multi-dimensional features corresponding to the essay to be detected. The multi-dimensional features include at least one of text features, topic features, and semantic features. The identification module is configured to identify the target grade level of the student, combine the essays of different students in the target grade level to obtain a first essay set, and identify, based on the multi-dimensional features, the first difference data between the essay to be detected and each of the first essay set and the student's historical essay set; The identification module is also configured to use multiple different artificial intelligence tools to generate student essays, obtain a second set of essays, and identify second difference data between the essay to be detected and the second set of essays based on the multi-dimensional features; The generation module is configured to perform fusion analysis based on the first difference data and the second difference data to generate a target detection result corresponding to the essay to be detected. The target detection result is used to determine whether the essay to be detected is a student essay generated by an artificial intelligence tool.
10. An electronic device comprising a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, characterized in that, When the processor executes the computer program, it implements the method of any one of claims 1 to 8.