Knowledge graph-based data question clustering processing method and system

By acquiring students' historical answer records from online education platforms, identifying the characteristics of difficult test questions and generating comprehensive feature vectors, and dynamically adjusting the subject knowledge graph, the problem of the disconnect between clustering results and dynamic learning information in existing technologies is solved, and the accurate positioning of group-wide weak links is achieved.

CN122196176A · Pending · Publication Date: 2026-06-12 · BEIJING INFORMATION TECH COLLEGE

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: BEIJING INFORMATION TECH COLLEGE
Filing Date: 2026-03-16
Publication Date: 2026-06-12

Application Information

IPC: G06F16/35; G06F16/334; G06F16/36; G06F40/30; G06N5/022; G06Q50/20

AI Tagging

Application Domain

Data processing applications Semantic analysis

AI Technical Summary

Technical Problem

In existing technologies, test question clustering methods based on subject knowledge graphs cannot dynamically respond to students' real-time learning feedback, resulting in a disconnect between clustering results and dynamic learning situations, making it difficult to accurately pinpoint the group's weak points.

Method used

By acquiring students' historical answer records from online education platforms, we can identify the characteristics of difficult questions and generate comprehensive feature vectors. We can then dynamically adjust the connection strength of key knowledge points in the subject knowledge graph and select a set of questions that meet the criteria.

Benefits of technology

It enables precise identification of group weaknesses based on dynamic learning conditions, ensuring that the test question set revolves around the core weaknesses and key knowledge chains under the current learning conditions, thereby improving the accuracy of clustering results.

✦ Generated by Eureka AI based on patent content.

Abstract

The application provides a knowledge graph-based data question clustering processing method and system, and relates to the technical field of data processing. The application obtains students' historical answer records and question information from an online education platform, identifies difficult questions according to students' answering behavior, and extracts group answering pattern features. Meanwhile, the semantic features of the question text are analyzed, and the two kinds of features are fused into a comprehensive feature vector. When the difficult question feature in the comprehensive feature vector points to a key knowledge point in a preset subject knowledge graph, the connection strength between the key knowledge point and its associated knowledge points in the graph is strengthened according to the difficult question feature, and an updated knowledge graph is generated. Based on the importance of each knowledge point in the updated knowledge graph and the strengthened connection relationships, a target question set is selected and combined from the comprehensive feature vectors of the questions. In this way, knowledge associations can be dynamically strengthened based on students' actual answer data, and question sets targeting the weak links of group learning can be accurately aggregated.

Description

Technical Field

[0001] This application relates to the field of data processing technology, and in particular to a method and system for clustering data test questions based on knowledge graphs.

Background Technology

[0002] In the process of online education and teaching, intelligent clustering of massive test questions can automatically classify questions with similar test points or similar difficulty, thereby providing support for personalized learning path recommendation, accurate homework assignment and weak point identification, which has important application prospects.

[0003] In existing technologies, test question clustering methods based on subject knowledge graphs are often used. By using a pre-constructed static knowledge graph, the similarity of test questions in terms of knowledge point coverage is calculated, or test questions with the same or similar knowledge points are aggregated together by combining the text content features of the test questions.

[0004] The knowledge graphs relied upon by existing methods are usually statically constructed. The relationships between knowledge points are not combined with real-time, group-based student learning feedback data. As a result, the clustering results mainly reflect the static logical relationships of test questions in the knowledge system, and cannot dynamically respond to and focus on the confusion and difficulties that students generally encounter in actual learning. Therefore, existing technologies have the technical problem that clustering results are out of touch with dynamic learning situations, which makes it difficult to accurately locate the weak links of the group.

Summary of the Invention

[0005] The purpose of this application is to provide a data question clustering processing method and system based on knowledge graphs, so as to solve the problem that the clustering results are out of touch with the dynamic learning situation in the existing technology, and it is difficult to accurately locate the weak links of the group.

[0006] To address the aforementioned technical problems, in a first aspect, this application provides a data question clustering processing method based on knowledge graphs, comprising:

[0007] Obtain students' historical answer records and test question information generated on online education platforms;

[0008] Based on the students' answering behavior carried by the historical answer records, determine the difficult question features of the questions that students find difficult; perform semantic understanding on the text content carried by the question information to determine the semantic features of the questions; and generate a comprehensive feature vector of the questions based on the difficult question features and the semantic features of the questions;

[0009] If the difficult question features in the comprehensive feature vector of the test question point to the key knowledge points in the preset subject knowledge graph, then the connection strength between the key knowledge points in the subject knowledge graph and the knowledge points associated with the key knowledge points is strengthened to generate an updated subject knowledge graph.

[0010] Based on the importance of each knowledge point in the updated subject knowledge graph and the connection strength between the key knowledge points and their associated knowledge points, feature vectors that meet preset conditions are selected from the comprehensive feature vectors of the test questions, and the test questions corresponding to the feature vectors that meet the preset conditions are combined into a target test question set.

[0011] Optionally, based on the student's answering behavior carried in the historical answer records, the difficult question features corresponding to the student's difficult questions are determined; semantic understanding is performed on the text content carried by the question information to determine the question semantic features; and a comprehensive feature vector of the question is generated based on the difficult question features and the question semantic features, including:

[0012] Analyze the historical answer records of the student group to identify questions that more than a preset percentage of students answered incorrectly and whose average answer time exceeds a preset duration, and mark the identified questions as difficult questions;

[0013] Extract the group answering pattern of the difficult questions from the historical answering records, and the error consensus degree and time consumption anomaly degree in the group answering pattern constitute the characteristics of the difficult questions;

[0014] The text content in the test question information is analyzed to obtain the text semantic representation of each test question, and the text semantic representation is used as the semantic feature of the test question;

[0015] By combining the features of the difficult test questions with the semantic features of the corresponding test questions, a comprehensive feature vector is generated to represent the relationship between test question attributes and the difficulty of group responses.

[0016] Optionally, the group response pattern of the difficult questions is extracted from the historical response records. The error consensus and time consumption anomaly in the group response pattern constitute the characteristics of the difficult questions, including:

[0017] Based on the aforementioned difficult test questions, obtain all students' answers and answer times for the difficult test questions from the historical answer records;

[0018] Based on the answer results, calculate the proportion of students who answered incorrectly to the total number of students who answered the difficult question, and determine the proportion as the degree of consensus on the error of the difficult question.

[0019] Based on the answering times, calculate the average amount by which the students' answering times exceed the preset duration, and determine this average excess as the time consumption anomaly degree of the difficult test question;

[0020] The error consensus degree and the time consumption anomaly degree are combined to form the difficult question feature of the difficult question.

[0021] Optionally, if the difficult question features in the comprehensive feature vector of the test questions point to key knowledge points in a preset subject knowledge graph, then the connection strength between the key knowledge points in the subject knowledge graph and the knowledge points associated with the key knowledge points is strengthened to generate an updated subject knowledge graph, including:

[0022] Based on the difficult question features in the comprehensive feature vector of the test questions, determine the knowledge points corresponding to the difficult question features in the preset subject knowledge graph, and mark the knowledge points as group learning difficulties;

[0023] In the pre-defined subject knowledge graph, find all subsequent knowledge points related to the learning difficulties of the group;

[0024] Increase the weight value of the connection path between the group's learning difficulties and each subsequent knowledge point, so as to modify the weight of the corresponding connection path in the preset subject knowledge graph;

[0025] An updated subject knowledge graph is generated based on a preset subject knowledge graph with updated connection path weights.

[0026] Optionally, the weight values of the connection paths between the group's learning difficulties and each subsequent knowledge point are increased to modify the weights of the corresponding connection paths in the preset subject knowledge graph, including:

[0027] The weight adjustment range is calculated based on the error consensus degree and time consumption anomaly degree in the characteristics of the difficult test questions corresponding to the group learning difficulties.

[0028] Based on the weight adjustment range, the current weight value of the connection path between the group learning difficulties and subsequent knowledge points in the preset subject knowledge graph is increased to obtain the updated weight value.

[0029] The weight values of the connection paths between the group learning difficulties and the subsequent knowledge points are modified to the updated weight values.
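As an illustration only, the weight update in [0026] to [0029] might be sketched as follows. The text here does not give the formula for the weight adjustment range, so the linear combination below, the coefficients `alpha` and `beta`, the normalisation by `reference_seconds`, the nested-dictionary graph layout, and the restriction to direct successors are all assumptions made for this sketch.

```python
import copy

def weight_adjustment(error_consensus, time_anomaly, reference_seconds,
                      alpha=0.5, beta=0.5):
    """Hypothetical adjustment range: a weighted sum of the error consensus
    degree and the time anomaly normalised to [0, 1]. Not the source's formula."""
    normalised_anomaly = min(time_anomaly / reference_seconds, 1.0)
    return alpha * error_consensus + beta * normalised_anomaly

def strengthen(graph, difficulty_point, delta):
    """Return an updated copy of the graph in which every edge from the group
    learning difficulty to a direct successor has its weight increased by delta,
    together with the list of strengthened edges."""
    updated = copy.deepcopy(graph)  # leave the preset graph untouched
    strengthened = []
    for successor in updated.get(difficulty_point, {}):
        updated[difficulty_point][successor] += delta
        strengthened.append((difficulty_point, successor))
    return updated, strengthened

# Toy preset graph: graph[source][target] = connection weight (made-up values).
graph = {
    "fractions": {"ratios": 1.0, "percentages": 1.0},
    "ratios": {"proportions": 1.0},
    "percentages": {},
    "proportions": {},
}
delta = weight_adjustment(error_consensus=0.75, time_anomaly=30.0,
                          reference_seconds=60.0)
updated_graph, strengthened_edges = strengthen(graph, "fractions", delta)
print(delta, updated_graph["fractions"])
```

Returning a copy keeps the preset graph available for comparison; an implementation could equally update the weights in place.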

[0030] Optionally, based on the importance of each knowledge point in the updated subject knowledge graph and the connection strength between the key knowledge points and their associated knowledge points, feature vectors that meet preset conditions are selected from the comprehensive feature vectors of the test questions, and the test questions corresponding to the feature vectors that meet the preset conditions are combined into a target test question set, including:

[0031] Based on the weight distribution of the connection paths between all knowledge points in the updated subject knowledge graph, calculate the relative importance of each knowledge point in the updated subject knowledge graph.

[0032] Based on the relative importance, important knowledge points with a relative importance higher than a preset threshold are selected from the updated subject knowledge graph;

[0033] From the test questions corresponding to the comprehensive feature vector of the test questions, select the test questions that are related to the important knowledge points and are located on the connection path with increased weight values in the updated subject knowledge graph;

[0034] The selected test questions are grouped to obtain multiple test question clusters corresponding to the important knowledge points, and the multiple test question clusters together constitute the target test question set.
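The following minimal sketch shows one way the importance calculation and grouping in [0031] to [0034] could look. The importance measure used here (each knowledge point's share of the total edge weight touching it) is an assumption, since the text only says importance is derived from the weight distribution; the function names, the threshold value, and the toy data are likewise hypothetical.

```python
from collections import defaultdict

def relative_importance(graph):
    """Assumed measure: the share of all incident edge weight (incoming plus
    outgoing) that touches each knowledge point."""
    strength = defaultdict(float)
    for src, edges in graph.items():
        strength[src] += 0.0  # make sure isolated points are listed
        for dst, weight in edges.items():
            strength[src] += weight
            strength[dst] += weight
    total = sum(strength.values())
    return {kp: (s / total if total else 0.0) for kp, s in strength.items()}

def important_points(importance, threshold):
    """Knowledge points whose relative importance is above the threshold."""
    return {kp for kp, value in importance.items() if value > threshold}

def group_by_point(selected_questions, question_points, important):
    """Group already-selected questions into one cluster per important point."""
    clusters = defaultdict(list)
    for qid in selected_questions:
        for kp in question_points[qid]:
            if kp in important:
                clusters[kp].append(qid)
    return dict(clusters)

# Toy updated graph: edges out of "fractions" were previously strengthened.
updated_graph = {
    "fractions": {"ratios": 1.625, "percentages": 1.625},
    "ratios": {"proportions": 1.0},
    "percentages": {},
    "proportions": {},
}
importance = relative_importance(updated_graph)
important = important_points(importance, threshold=0.25)
question_points = {"q1": ["fractions"], "q2": ["ratios", "proportions"]}
clusters = group_by_point(["q1", "q2"], question_points, important)
print(important, clusters)
```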

[0035] Optionally, from the questions corresponding to the comprehensive feature vector of the test questions, select test questions that are related to the important knowledge points and are located on the connection path with increased weight values in the updated subject knowledge graph, including:

[0036] From the comprehensive feature vector of the test questions, determine the set of knowledge points associated with each test question in the updated subject knowledge graph;

[0037] From the set of knowledge points, questions belonging to the important knowledge points are selected to form a preliminary set of questions;

[0038] Determine whether the knowledge point associated with each question in the preliminary question set in the updated subject knowledge graph is located on the connection path between the group learning difficulty and the subsequent knowledge point with an increased weight value;

[0039] Questions whose associated knowledge points are located on the connection path with increased weight values are identified as the final selected questions.
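The two-stage filtering in [0036] to [0039] can be sketched as below. This sketch assumes that each question carries a set of associated knowledge points and that a knowledge point is "located on" a strengthened connection path when it is an endpoint of a strengthened edge; both are interpretations, and all names and data are hypothetical.

```python
def select_questions(question_points, important, strengthened_edges):
    """question_points maps a question id to its set of knowledge points.
    Returns (preliminary, final) lists of question ids."""
    # Stage 1: keep questions that touch at least one important knowledge point.
    preliminary = [q for q, kps in question_points.items() if kps & important]
    # Stage 2: keep questions with a knowledge point on a strengthened edge.
    on_path = {kp for edge in strengthened_edges for kp in edge}
    final = [q for q in preliminary if question_points[q] & on_path]
    return preliminary, final

question_points = {
    "q1": {"fractions"},
    "q2": {"ratios", "proportions"},
    "q3": {"percentages"},
    "q4": {"geometry"},
}
important = {"fractions", "ratios", "geometry"}
strengthened_edges = [("fractions", "ratios"), ("fractions", "percentages")]
preliminary, final = select_questions(question_points, important, strengthened_edges)
print(preliminary, final)
```

In this toy data, `q4` reaches the preliminary set because it covers an important point, but is dropped in the second stage because none of its knowledge points lies on a strengthened edge.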

[0040] In a second aspect, this application provides a knowledge graph-based data question clustering processing system, including:

[0041] The acquisition module is used to acquire students' historical answer records and test question information generated on the online education platform;

[0042] The determination module is used to determine the characteristics of difficult questions corresponding to the difficult questions of students based on the student's answering behavior carried in the historical answering records, perform semantic understanding on the text content carried by the question information, determine the semantic features of the questions, and generate a comprehensive feature vector of the questions based on the characteristics of difficult questions and the semantic features of the questions.

[0043] The reinforcement module is used to strengthen the connection strength between the key knowledge points in the subject knowledge graph and the knowledge points associated with the key knowledge points if the difficult question features in the comprehensive feature vector of the test question point to the key knowledge points in the preset subject knowledge graph, so as to generate an updated subject knowledge graph.

[0044] The combination module is used to select feature vectors that meet preset conditions from the comprehensive feature vectors of test questions based on the importance of each knowledge point in the updated subject knowledge graph and the connection strength between the key knowledge points and their associated knowledge points, and to combine the test questions corresponding to the feature vectors that meet the preset conditions into a target test question set.

[0045] In a third aspect, this application provides an electronic device, comprising:

[0046] A memory, configured to store a computer program;

[0047] A processor, configured to execute the computer program to implement the steps of the knowledge graph-based data question clustering processing method as described in the first aspect above.

[0048] In a fourth aspect, this application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the knowledge graph-based data question clustering processing method described in the first aspect above.

[0049] The knowledge graph-based data question clustering method provided in this application obtains students' historical answer records and question information generated on online education platforms, providing a comprehensive and authentic raw data foundation for subsequent analysis. By identifying the characteristics of difficult questions based on students' answering behavior and performing semantic understanding on the question text to obtain semantic features, a comprehensive feature vector is generated. This combines behavioral features representing the learning difficulty of the group with the inherent semantic attributes of the questions, constructing a more comprehensive digital feature representation of the questions. When the difficulty features in the comprehensive feature vector point to key knowledge points in the knowledge graph, the connection strength between the key knowledge point and its related knowledge points is strengthened to generate an updated knowledge graph. This allows the static knowledge structure to dynamically respond to and focus on the actual learning difficulties of the student group, enhancing the explicitness of key knowledge paths. By selecting questions from the comprehensive feature vector and combining them into a target question set based on the importance of knowledge points and the strengthened connection strength in the updated graph, the final aggregated question set accurately revolves around the core weaknesses and key knowledge chains under the current learning situation.

[0050] Furthermore, questions that students commonly answer incorrectly and that take excessively long to answer are identified from the historical answer records of the student group as difficult questions, and the error consensus degree and time consumption anomaly degree in the group answering pattern are extracted to form the difficult question features. Simultaneously, the text of the questions is analyzed to obtain semantic features, and the two are combined to generate a comprehensive feature vector. This enables group learning difficulties to be located automatically and quantitatively from massive behavioral data, and binds the dynamic learning difficulty indicators to the static text semantics of the questions in a computable feature form, thus laying an accurate feature foundation for subsequent data processing based on the dynamic knowledge graph.

Brief Description of the Drawings

[0051] To more clearly illustrate the technical solutions of the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0052] Figure 1 is a flowchart of a knowledge graph-based data question clustering processing method provided in an embodiment of this application;

[0053] Figure 2 is a schematic diagram of a specific implementation of the knowledge graph-based data question clustering processing method provided in an embodiment of this application;

[0054] Figure 3 is a schematic diagram of a specific implementation of a knowledge graph-based data question clustering processing system provided in an embodiment of this application.

Detailed Implementation

[0055] In the process of test question clustering analysis in online education, existing methods generally rely on pre-constructed, fixed subject knowledge graphs. These static graphs define the inherent logical relationships between knowledge points, but cannot perceive or incorporate real-time feedback data generated by students during actual learning. Therefore, test question clustering results based on such graphs can only reflect the static topological relationships of test questions in the knowledge system, and are unable to dynamically capture and focus on the real and common points of confusion and weakness that students face in their learning. This results in a disconnect between the aggregated test question set and the dynamically evolving actual learning situation, making it difficult to accurately serve targeted teaching for the group's weak points.

[0056] To address the disconnect between static knowledge graphs and dynamic learning situations, this application proposes a knowledge graph-based data question clustering method. Its core lies in dynamically adjusting and strengthening key connections within a pre-defined knowledge graph using historical answering behavior data of a student group. Specifically, firstly, common learning difficulties among students are identified and quantified from answer records, and these are integrated with the semantic features of the questions. Then, based on these difficulty characteristics representing the group's learning situation, the connection strength between corresponding knowledge points and their subsequent related points in the knowledge graph is automatically strengthened, thereby generating a dynamic knowledge graph that reflects real-time changes in learning focus. Finally, all question selection and aggregation are guided by this dynamic knowledge graph. Through this collaborative process of perceiving difficulties from learning data, using difficulty data to drive graph evolution, and then using the evolved graph to guide question aggregation, the final question set closely revolves around the core difficulties and key knowledge chains currently encountered by students. This effectively solves the problem of existing technologies relying on static knowledge graphs and resulting in a disconnect between clustering results and dynamic learning situations, achieving precise positioning of group weaknesses.

[0057] To enable those skilled in the art to better understand the present application, the present application will be further described in detail below with reference to the accompanying drawings and specific embodiments. Obviously, the described embodiments are merely some embodiments of the present application, and not all embodiments. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0058] Example 1

[0059] The core of this application is to provide a data question clustering processing method based on knowledge graphs. A flowchart of one specific implementation is shown in Figure 1. The method includes:

[0060] S101. Obtain students' historical answer records and test information generated on the online education platform.

[0061] In one specific implementation, a structured data query request is first sent to the backend database service of the online education platform. This request specifies that the data to be obtained is all historical answer logs of the target student within a specific time range. After receiving this request, the database service executes a predefined query statement, filters all data rows that meet the conditions from its data table storing answer records, and returns these data in a structured list format. This list is the student's historical answer record, which refers to the process and result data generated when the student completes exercises, assignments, or exams on the platform. It mainly includes the student identifier, question identifier, submitted answers, obtained scores, and the time of answering.

[0062] Secondly, after successfully obtaining the historical answer records, all the unique question identifiers that have appeared in the historical answer records are extracted from the historical answer records. Then, a new data query request is sent to the platform's backend database service. The goal of this request is to obtain the complete question details corresponding to these question identifiers. Based on this set of question identifiers, the database service accurately finds and returns the complete information of each question from its question information data table, and finally obtains the question information. The question information refers to the question-related metadata stored in the platform's question bank, which mainly includes the question text, standard answer, related knowledge points, difficulty level, and question type.
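The two-step retrieval in S101 could look roughly like the following sketch, which uses an in-memory SQLite database as a stand-in for the platform's backend database service. The table names, column names, and sample rows are hypothetical; the text lists only the kinds of fields involved, not a schema.

```python
import sqlite3

# Hypothetical schema; the source describes the fields but not the tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE answer_records (
    student_id TEXT, question_id TEXT, submitted_answer TEXT,
    score REAL, answer_seconds REAL, answered_at TEXT
);
CREATE TABLE questions (
    question_id TEXT PRIMARY KEY, text TEXT, standard_answer TEXT,
    knowledge_points TEXT, difficulty TEXT, question_type TEXT
);
""")
conn.executemany(
    "INSERT INTO answer_records VALUES (?, ?, ?, ?, ?, ?)",
    [("s1", "q1", "A", 0.0, 95.0, "2026-01-10"),
     ("s2", "q1", "B", 1.0, 60.0, "2026-01-11"),
     ("s1", "q2", "C", 1.0, 30.0, "2026-02-20")],
)
conn.executemany(
    "INSERT INTO questions VALUES (?, ?, ?, ?, ?, ?)",
    [("q1", "Solve x^2 = 4.", "B", "quadratic_equations", "medium", "choice"),
     ("q2", "Compute 2 + 3.", "C", "addition", "easy", "choice")],
)

def fetch_answer_records(conn, start, end):
    """First query: answer logs whose timestamp falls within [start, end]."""
    cur = conn.execute(
        "SELECT student_id, question_id, submitted_answer, score, answer_seconds "
        "FROM answer_records WHERE answered_at BETWEEN ? AND ? ORDER BY rowid",
        (start, end),
    )
    return cur.fetchall()

def fetch_question_info(conn, records):
    """Second query: details for the unique question ids seen in the records."""
    ids = sorted({row[1] for row in records})
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    cur = conn.execute(
        "SELECT question_id, text, knowledge_points FROM questions "
        f"WHERE question_id IN ({placeholders}) ORDER BY question_id",
        ids,
    )
    return cur.fetchall()

records = fetch_answer_records(conn, "2026-01-01", "2026-01-31")
question_info = fetch_question_info(conn, records)
print(records, question_info)
```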

[0063] S102. Based on the students' answering behavior carried in the historical answering records, determine the characteristics of the difficult questions corresponding to the students' difficult questions, perform semantic understanding on the text content carried by the question information, determine the semantic features of the questions, and generate a comprehensive feature vector of the questions based on the characteristics of the difficult questions and the semantic features of the questions.

[0064] In the embodiments of this application, as shown in Figure 2, S102 specifically includes:

[0065] S1021. Analyze the historical answer records of the student group, identify questions in which more than a preset proportion of students answered incorrectly and the average time taken by students to answer exceeded the preset time, and mark the identified questions as difficult questions.

[0066] Difficult questions refer to questions that exhibit widespread difficulty among students. The core criteria for determining difficulty are based on two quantifiable group behavior indicators: most students answer incorrectly and the average completion time for students is significantly too long.

[0067] In this embodiment of the application, the obtained historical answer records are analyzed, and two screening criteria are set: first, the proportion of students who answer incorrectly must be higher than a certain preset value; second, the average answering time of students must exceed a certain preset reasonable time. All questions are traversed, and for each question, the total number of students who answered and the number of students who answered incorrectly are counted, and the average answering time is calculated. Questions that meet both of the above criteria are selected and marked as difficult questions.
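A minimal sketch of the screening in S1021 follows. It assumes each record is a tuple of student id, question id, a correctness flag, and answering time in seconds, with at most one record per student per question; the threshold values and sample data are made up for illustration.

```python
from collections import defaultdict

def find_difficult_questions(records, max_error_ratio, max_avg_seconds):
    """records: iterable of (student_id, question_id, is_correct, seconds).
    Returns the ids of questions that exceed both preset thresholds."""
    stats = defaultdict(lambda: {"total": 0, "wrong": 0, "seconds": 0.0})
    for _student, qid, is_correct, seconds in records:
        entry = stats[qid]
        entry["total"] += 1
        entry["wrong"] += 0 if is_correct else 1
        entry["seconds"] += seconds
    difficult = set()
    for qid, entry in stats.items():
        error_ratio = entry["wrong"] / entry["total"]
        avg_seconds = entry["seconds"] / entry["total"]
        # Both screening criteria must hold at the same time.
        if error_ratio > max_error_ratio and avg_seconds > max_avg_seconds:
            difficult.add(qid)
    return difficult

records = [
    ("s1", "q1", False, 120.0), ("s2", "q1", False, 100.0), ("s3", "q1", True, 80.0),
    ("s1", "q2", False, 20.0), ("s2", "q2", False, 30.0), ("s3", "q2", True, 25.0),
    ("s1", "q3", True, 150.0), ("s2", "q3", True, 130.0), ("s3", "q3", False, 140.0),
]
difficult = find_difficult_questions(records, max_error_ratio=0.5,
                                     max_avg_seconds=60.0)
print(difficult)
```

Here `q2` fails the time criterion and `q3` fails the error criterion, so only `q1` is marked as difficult.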

[0068] S1022. Extract the group answering pattern of difficult questions from historical answering records. The error consensus and time consumption abnormality in the group answering pattern constitute the characteristics of difficult questions.

[0069] Specifically, S1022 may include:

[0070] For the difficult test questions, all students' answer results and answering times for the difficult test questions are retrieved from the historical answer records. Based on the answer results, the proportion of students who answered incorrectly out of the total number of students who answered the difficult test question is calculated, and this proportion is determined as the error consensus degree of the difficult test question. Based on the answering times, the average amount by which the students' answering times exceed the preset duration is calculated, and this average excess is determined as the time consumption anomaly degree of the difficult test question. The error consensus degree and the time consumption anomaly degree are combined to form the difficult question features.

[0071] Among them, the group answering mode is an abstract generalization of the common behavioral patterns shown by a group of students when answering the same test question, which is specified into two dimensions: error consensus degree and time consumption anomaly degree. Error consensus degree refers to the degree of consistency of students' mistakes on the test question, usually expressed as the proportion of students who answered incorrectly. Time consumption anomaly degree refers to the degree to which the time spent by students to complete the test question exceeds reasonable expectations.

[0072] In this embodiment of the application, the difficult question features are extracted for each marked difficult question. The difficult question features consist of error consensus degree and time consumption anomaly degree. This process is divided into two parallel computation branches:

[0073] In the error consensus calculation branch, the total number of students who answered the question and the number of students who answered incorrectly are obtained from the historical answer records. The error consensus is the ratio of the number of students who answered incorrectly to the total number of students who answered. In the time consumption anomaly calculation branch, the specific answer time of each student is obtained from the historical answer records, and a basic reasonable answer time for this question type is set as a reference benchmark. The portion of each student's actual time exceeding this time is calculated. If the time does not exceed the limit, it is recorded as 0. The average of these excess portions is the time consumption anomaly. Combining the calculated error consensus and time consumption anomaly values constitutes the difficult question characteristics that describe the answering difficulty pattern of this group of test takers.
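The two branches described above can be sketched in a few lines. The function name, the tuple layout of the answers, and the sample values are assumptions for illustration; the arithmetic follows the description (ratio of incorrect answers, and the mean of per-student excess time with non-exceeding answers counted as 0).

```python
def difficult_question_features(answers, reference_seconds):
    """answers: list of (is_correct, seconds) for one difficult question.
    Returns (error_consensus, time_anomaly)."""
    total = len(answers)
    if total == 0:
        raise ValueError("at least one answer is required")
    # Branch 1: share of students who answered incorrectly.
    wrong = sum(1 for is_correct, _ in answers if not is_correct)
    error_consensus = wrong / total
    # Branch 2: mean excess over the reference time, floored at 0 per student.
    excess = [max(0.0, seconds - reference_seconds) for _, seconds in answers]
    time_anomaly = sum(excess) / total
    return (error_consensus, time_anomaly)

answers = [(False, 120.0), (False, 100.0), (True, 50.0), (False, 90.0)]
features = difficult_question_features(answers, reference_seconds=60.0)
print(features)
```

With these sample values, three of four students answer incorrectly and the excess times are 60, 40, 0, and 30 seconds, so the result is `(0.75, 32.5)`.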

[0074] S1023. Analyze the text content in the test question information to obtain the text semantic representation of each test question, and use the text semantic representation as the semantic feature of the test question.

[0075] Among them, the semantic features of the test questions are the digital representation of the meaning expressed by the text content of the test questions, capturing the knowledge content and textual characteristics tested by the test questions.

[0076] In this embodiment, semantic features are extracted from the text content of all test questions, both difficult and non-difficult. A pre-trained natural language processing model, such as a neural network model that can understand text semantics, reads the text content in the test question information, such as the question stem, options, or question description, and performs deep semantic analysis on it, such as word segmentation and context understanding. The model converts the text of each question into a fixed-length numerical vector that carries semantic information. This vector is the semantic feature of the test question.
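To make the idea of a fixed-length text vector concrete, the sketch below uses a hashed bag-of-words encoding. This is a deliberately simple stand-in and not the pre-trained neural model the embodiment describes; it captures word overlap rather than real semantics, and the function names and vector dimension are hypothetical.

```python
import math
import re
import zlib

def hashed_text_vector(text, dim=16):
    """Stand-in encoder: map each token to one of `dim` buckets with CRC32,
    count the hits, then L2-normalise. A pre-trained model would replace this."""
    vec = [0.0] * dim
    for token in re.findall(r"\w+", text.lower()):
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def cosine(a, b):
    """Cosine similarity of two already-normalised vectors."""
    return sum(x * y for x, y in zip(a, b))

v1 = hashed_text_vector("Solve the quadratic equation x^2 - 4 = 0")
v2 = hashed_text_vector("Solve the quadratic equation x^2 - 9 = 0")
print(len(v1), cosine(v1, v2))
```

Whatever encoder is used, the property that matters for the later steps is that every question maps to a vector of the same length.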

[0077] S1024. Combine the features of difficult test questions with the semantic features of the corresponding test questions to generate a comprehensive feature vector that represents the relationship between test question attributes and the difficulty of group responses.

[0078] Here, the comprehensive feature vector is a numerical representation that integrates multiple attributes of a test question: it combines the inherent semantic attributes of the question with the difficulty attributes inferred from student behavior.

[0079] In this embodiment, the difficult question features are combined with the semantic features of the questions to generate a comprehensive feature vector. For questions marked as difficult, the difficult question features are concatenated with the corresponding semantic features to form a longer vector that integrates both types of information. For non-difficult questions, the difficult question features can be represented by a predefined zero vector or a vector of default values, which is then concatenated with the semantic features in the same way. The resulting vector is the comprehensive feature vector that represents the relationship between the question attributes and the difficulty the group has in answering.
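A minimal sketch of this fusion step, assuming plain Python lists as vectors and a zero vector as the default for non-difficult questions. The function name `comprehensive_vector` and its parameters are hypothetical, chosen only for illustration.

```python
def comprehensive_vector(semantic, difficulty=None, difficulty_dim=2):
    """Concatenate difficulty features (first) with semantic features (last).

    For a non-difficult question, pass difficulty=None; a zero vector of
    length difficulty_dim is used as the default (an assumed default).
    """
    if difficulty is None:
        difficulty = [0.0] * difficulty_dim
    if len(difficulty) != difficulty_dim:
        raise ValueError("difficulty features have the wrong length")
    return list(difficulty) + list(semantic)


semantic = [0.1] * 256                       # placeholder semantic vector
hard = comprehensive_vector(semantic, [0.75, 10.0])
easy = comprehensive_vector(semantic)        # non-difficult question
print(len(hard), hard[:2], easy[:2])         # 258 [0.75, 10.0] [0.0, 0.0]
```

Keeping the difficulty features at a fixed position and length means that every question, difficult or not, yields a vector of the same dimension, which downstream selection steps can rely on.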

[0080] As an example, first, in step 1021, assume that the preset error ratio threshold is … and the reasonable duration threshold is …. A certain test question was answered by 100 students in total, of whom … answered incorrectly, giving an error rate of …; because the error rate exceeds the error ratio threshold, the first condition is met. The average answering time of these 100 students is …; because this exceeds the reasonable duration threshold, the second condition is met. The question is therefore marked as a difficult question.

[0081] Next, in step 1022, for each question marked as difficult, all students' answer results and answer times for it are retrieved from the historical answer records. The error consensus is calculated from the answer results, using the following formula:

[0082] $C = \frac{N_{\text{wrong}}}{N_{\text{total}}}$;

[0083] where $C$ denotes the error consensus, $N_{\text{wrong}}$ denotes the number of students who answered incorrectly, and $N_{\text{total}}$ denotes the total number of students who answered the question. $C$ is a value between 0 and 1; the larger the value, the more consistently the students make mistakes on the question.

[0084] The time consumption anomaly is calculated from the answer time data. First, the excess time of each student is calculated; if a student does not exceed the baseline, that student's excess time is 0. Then the average excess time over all students is calculated using the following formula:

[0085] $T = \frac{1}{N_{\text{total}}} \sum_{i=1}^{N_{\text{total}}} \max(0,\ t_i - t_0)$;

[0086] where $T$ denotes the time consumption anomaly, $N_{\text{total}}$ denotes the total number of students who answered the question, $t_i$ denotes the time taken by the $i$-th student to answer the question, and $t_0$ denotes the preset baseline time. The $\max$ function ensures that only the time consumed beyond the baseline is accumulated. The larger the value of $T$, the more "extra" thinking or struggling time the student group spends on this question, and the higher the degree of abnormality.

[0087] Assume that 100 students answered the question in total and that … of them answered incorrectly, so the error consensus is …. Assume the baseline time is 120 seconds. The times taken by the 100 students differ, among which:

[0088] Eighty students took more than 120 seconds; their excess time totals 2000 seconds;

[0089] The other 20 students took 120 seconds or less, so according to the formula their excess time is 0 seconds;

[0090] Therefore, the total excess time of all students is 2000 + 0 = 2000 seconds.

[0091] According to the time consumption anomaly formula:

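[0092] time consumption anomaly $= \frac{2000}{100} = 20$ seconds;

[0093] Therefore, the time consumption anomaly of this question is 20 seconds, and the difficult question features of this test question can be expressed as the two-dimensional vector (error consensus, time consumption anomaly) = (…, 20).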

[0094] Then, in step 1023, the input text sequence $X = (x_1, x_2, \dots, x_n)$, in which each $x_i$ is a word or sub-word unit, is typically mapped by a semantic analysis model to a semantic vector of dimension $d$:

[0095] $\mathbf{s} = f_{\text{enc}}(X) \in \mathbb{R}^{d}$;

[0096] where $f_{\text{enc}}$ denotes the encoding function of the pre-trained semantic model and $d$ is the feature dimension.

[0097] Suppose that a test question about "finding the roots of a quadratic equation in one variable" is processed by a semantic model and transformed into a 256-dimensional vector. This vector encodes information such as mathematical concepts and expressions related to the content of the question.

[0098] Finally, in step 1024, two kinds of features are available for each question. A question marked as difficult has difficult question features $\mathbf{h}$ and semantic features $\mathbf{s}$; for a non-difficult question, the difficult question features can be represented by a specific default vector. The feature fusion operation combines these two feature vectors, which come from different sources and have different dimensions, into one unified feature vector. A typical fusion method is vector concatenation, in which the two vectors are joined end to end to generate the comprehensive feature vector $\mathbf{v}$, which can be represented as:

[0099] $\mathbf{v} = \mathbf{h} \oplus \mathbf{s}$;

[0100] where $\oplus$ denotes the vector concatenation operation. If $\mathbf{h}$ is a 2-dimensional vector and $\mathbf{s}$ is a $d$-dimensional vector, then $\mathbf{v}$ is a $(2+d)$-dimensional vector.

[0101] Suppose that, for the difficult question above, $\mathbf{h}$ is the 2-dimensional difficult question feature vector and $\mathbf{s}$ is a 256-dimensional vector. The comprehensive feature vector $\mathbf{v}$ obtained through concatenation is then a new 258-dimensional vector, whose first 2 dimensions reflect the difficulty of the group's answers and whose last 256 dimensions reflect the textual semantic content of the test question.

[0102] Through the steps described above, this application achieves the automatic identification of common learning difficulties among students from massive, heterogeneous raw data. It quantifies the degree of error consensus and the degree of time consumption abnormality in a formulaic way, and then integrates them with the deep semantic representation of the test questions to generate unified digital features that can simultaneously and accurately represent the test content and the group difficulty feedback. This provides accurate and calculable data input for subsequent analysis based on dynamic knowledge association.

[0103] S103. If the difficult question features in the comprehensive feature vector of the test questions point to key knowledge points in the preset subject knowledge graph, strengthen the connection between each key knowledge point and its associated knowledge points in the subject knowledge graph to generate an updated subject knowledge graph.

[0104] The preset subject knowledge graph is a pre-constructed graph structure model that represents the knowledge points of a subject area and the relationships between them. In this graph, nodes represent specific knowledge points, edges connecting nodes represent the relationships between knowledge points, and the weight value assigned to each edge quantifies the connection strength of the relationship; the larger the weight value, the closer or more important the relationship usually is. Key knowledge points are the knowledge points in the preset graph that the difficult question features point to, reflecting where students' learning is currently hindered.

[0105] In this embodiment of the application, S103 specifically includes:

[0106] S1031. Based on the difficult question features in the comprehensive feature vector of the test questions, determine the knowledge points in the preset subject knowledge graph that correspond to those features, and mark them as group learning difficulties.

[0107] Here, a group learning difficulty is a specific knowledge point in the preset subject knowledge graph: the core knowledge content tested by questions that a considerable proportion of students find difficult and spend an unusual amount of time answering.

[0108] In this embodiment, the key knowledge points corresponding to difficult questions in a preset subject graph are determined based on the difficult question feature portion of the comprehensive feature vector of the test questions, and these knowledge points are marked as group learning difficulties. Specifically, the knowledge point tags recorded in the test question information are read, and a node with the same tag name as this knowledge point is searched in the preset subject graph. Since the characteristics of difficult questions include high error consensus and high time consumption anomaly, it indicates that the knowledge point corresponding to this question poses a significant difficulty to the student group, so the found node in the graph is officially marked as a group learning difficulty.
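The tag lookup described above can be sketched as follows. The dictionary layout of a question and the key name `knowledge_tags` are assumptions for illustration; the application only states that knowledge point tags are recorded in the test question information.

```python
def mark_group_difficulties(graph_nodes, difficult_questions):
    """Return the graph nodes whose names match tags of difficult questions."""
    marked = set()
    for question in difficult_questions:
        for tag in question["knowledge_tags"]:
            if tag in graph_nodes:          # match tag name against node name
                marked.add(tag)
    return marked


graph_nodes = {
    "factorization of quadratic equations",
    "solving quadratic inequalities",
}
difficult_questions = [
    {"id": "q17", "knowledge_tags": ["factorization of quadratic equations"]},
    {"id": "q42", "knowledge_tags": ["a tag missing from the graph"]},
]
difficulties = mark_group_difficulties(graph_nodes, difficult_questions)
print(difficulties)  # {'factorization of quadratic equations'}
```

Tags with no matching node are silently skipped in this sketch; a real system might log them so that the graph or the tagging can be corrected.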

[0109] S1032. In the preset subject knowledge graph, find all subsequent knowledge points of the group learning difficulties.

[0110] Here, subsequent knowledge points are knowledge points in the preset subject knowledge graph that depend on a group learning difficulty and are usually learned or applied only after the difficult knowledge has been mastered. This sequential relationship is expressed by the direction or semantic definition of the edges in the graph.

[0111] In this embodiment, all subsequent knowledge points are searched for starting from each marked group learning difficulty node in the preset subject knowledge graph. Using the existing connections in the graph, which are usually directed edges representing the dependency "learn A first, then learn B", a graph traversal algorithm such as depth-first search or breadth-first search finds every other knowledge point node that can be reached from the group learning difficulty node by following the edge directions. The nodes found are the subsequent knowledge points, whose learning may be affected because the preceding group learning difficulty has not been mastered.
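A breadth-first version of this traversal might look like the sketch below. The adjacency layout (`graph[source][target] = weight`) and the node labels are assumptions for illustration.

```python
from collections import deque


def subsequent_points(graph, start):
    """Breadth-first search for every node reachable from `start`.

    `graph` maps a node to a dict of {successor: edge weight}; edges are
    followed only in their stored direction.
    """
    seen = {start}
    queue = deque([start])
    found = []
    while queue:
        node = queue.popleft()
        for successor in graph.get(node, {}):
            if successor not in seen:       # also guards against cycles
                seen.add(successor)
                found.append(successor)
                queue.append(successor)
    return found


graph = {
    "K1": {"K2": 1.0, "K3": 0.8},   # K1 is the group learning difficulty
    "K2": {"K4": 0.5},
    "K0": {"K1": 1.0},              # predecessor of K1, must not be returned
}
print(subsequent_points(graph, "K1"))  # ['K2', 'K3', 'K4']
```

The `seen` set keeps the traversal finite even if the graph contains cycles, and the start node itself is never reported as its own successor.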

[0112] S1033. Increase the weight value of the connection path between the group learning difficulty and each subsequent knowledge point, so as to modify the weight of the corresponding connection path in the preset subject knowledge graph.

[0113] Specifically, S1033 may include:

[0114] Based on the error consensus and time-consuming anomaly in the characteristics of the difficult test questions corresponding to the group learning difficulties, the weight adjustment range is calculated; based on the weight adjustment range, the current weight value of the connection path between the group learning difficulties and subsequent knowledge points in the preset subject knowledge graph is increased to obtain the updated weight value; the weight value of the connection path between the group learning difficulties and subsequent knowledge points is modified to the updated weight value.

[0115] In this embodiment, the strength of the connection path between the group learning difficulty and each subsequent knowledge point is enhanced. This process consists of two steps: First, the weight adjustment range is calculated. Based on the criteria for marking the knowledge point as a difficulty, namely the error consensus rate and time consumption anomaly rate in the characteristics of the corresponding difficulty test question, a weight adjustment value is calculated through a predefined function. This function usually makes the weight adjustment range of the knowledge point connection larger for questions with higher error consensus rate and time consumption anomaly rate. Second, the graph weight is updated. In the preset subject graph, the directed connection line connecting the group learning difficulty node and each subsequent knowledge point node is found. The currently stored weight value of this connection line is added to the weight adjustment range calculated in the first step to obtain an updated weight value, and the original old value is replaced with this new value.

[0116] S1034. Based on the preset subject knowledge graph with updated connection path weight values, generate an updated subject knowledge graph.

[0117] In this embodiment of the application, after completing the enhancement operation of the connection weight between all marked group learning difficulties and all their subsequent knowledge points, an updated subject knowledge graph is obtained. This new graph retains all the original knowledge points and connection structures, but the connection lines from the group's weak knowledge points to subsequent related knowledge are given higher weight values, reflecting the knowledge dependency relationship strengthened based on the latest student group learning data.

[0118] As an example, first, in step 1031, suppose the difficult question under consideration is "Solve...". By querying the mapping table, it is determined that the knowledge point it tests is "the factorization method of solving quadratic equations in one variable". Therefore, the node for this knowledge point in the graph is marked as the group's current learning difficulty.

[0119] Next, through step 1032, find the node "Factorization of Quadratic Equations" in the mathematical knowledge graph. It may be directly connected to the two knowledge point nodes "Relationship between the Graph of a Quadratic Function and the Roots of a Quadratic Equation" and "Solving Quadratic Inequalities" through the relationship "is the foundation of...". Then, the latter two are the subsequent knowledge points found.

[0120] Then, in step 1033, for each subsequent knowledge point, the current weight value of the edge connecting the group learning difficulty to that subsequent knowledge point is read from the preset subject graph and denoted as $w_{\text{old}}$. A weight adjustment range $\Delta w$ then needs to be calculated. This adjustment range is not fixed; it is determined jointly by the quantitative indicators that reflect the severity of this group learning difficulty, namely the error consensus $C$ and the time consumption anomaly $T$ in the features of the associated difficult test questions. One specific calculation method is a weighted sum of the two, as shown in the following formula:

[0121] $\Delta w = \alpha C + \beta T$;

[0122] where $\alpha$ and $\beta$ are preset adjustment coefficients used to balance the influence of the error consensus and the time consumption anomaly on the connection strength, and $C$ and $T$ take the feature values calculated for the difficult test questions corresponding to this group learning difficulty.

[0123] Then the calculated $\Delta w$ is used to increase the current weight value and obtain the updated weight value $w_{\text{new}}$ of the connection path:

[0124] $w_{\text{new}} = w_{\text{old}} + \Delta w$;

[0125] Perform the update operation, changing the weight of the edge connecting the group learning difficulty to the current subsequent knowledge point in the graph from the original value $w_{\text{old}}$ to the calculated $w_{\text{new}}$.

[0126] Suppose that the group learning difficulty is "factorization of quadratic equations", that the corresponding difficult question features are an error consensus of … and a time consumption anomaly of …, and that the preset adjustment coefficients are … and ….

[0127] Then the weight adjustment range is 2.85.

[0128] Suppose the current weight of the edge connecting this difficulty to a subsequent knowledge point, such as "the relationship between the graph of a quadratic function and the roots of a quadratic equation", is 1.0.

[0129] Then the updated weight is 1.0 + 2.85 = 3.85.

[0130] Change the weight of this edge to 3.85, and repeat this process for each subsequent knowledge point found.
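The weighted-sum adjustment and the additive update can be sketched as below. The feature values (0.85 and 20.0) and the coefficients (1.0 and 0.1) are hypothetical, chosen only so that the sketch reproduces the update to 3.85 in this example; the node labels and function names are likewise assumptions.

```python
def weight_adjustment(consensus, anomaly, alpha, beta):
    """Weighted sum of error consensus and time consumption anomaly."""
    return alpha * consensus + beta * anomaly


def strengthen_edges(graph, difficulty, successors, delta):
    """Return a copy of `graph` with edges difficulty -> successor raised.

    Only edges that already exist are changed; no new edges are created
    (an assumption of this sketch).
    """
    updated = {src: dict(edges) for src, edges in graph.items()}
    for node in successors:
        if node in updated.get(difficulty, {}):
            updated[difficulty][node] += delta
    return updated


graph = {"K1": {"K2": 1.0, "K3": 0.8}}
# Hypothetical inputs: consensus 0.85, anomaly 20.0 s, alpha 1.0, beta 0.1.
delta = weight_adjustment(consensus=0.85, anomaly=20.0, alpha=1.0, beta=0.1)
new_graph = strengthen_edges(graph, "K1", ["K2", "K3"], delta)
print(round(new_graph["K1"]["K2"], 2), round(new_graph["K1"]["K3"], 2))
# 3.85 3.65
```

Returning a copy leaves the preset graph unchanged, so the original weights stay available for comparison with the updated graph.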

[0131] Finally, in step 1034, this weighted and updated graph is output as the updated subject knowledge graph. The above example is only one implementation of this application. In practical applications, the calculation method of the weight adjustment range, the search scope of subsequent knowledge points, etc., can be set according to requirements, and this application does not limit them.

[0132] Through the steps described above, this application enables the targeted evolution of a pre-built knowledge graph driven by quantified student learning data. This dynamically strengthens static knowledge connections along paths directly related to actual learning difficulties, ensuring that the updated knowledge graph not only contains the logical structure of the subject but also integrates weak links that need to be emphasized in current teaching. This provides a core basis for accurately locating test question sets related to these dynamically strengthened knowledge paths.

[0133] S104. Based on the importance of each knowledge point in the updated subject knowledge graph and the connection strength between key knowledge points and their related knowledge points, select feature vectors that meet the preset conditions from the comprehensive feature vectors of the test questions, and combine the test questions corresponding to the feature vectors that meet the preset conditions into a target test question set.

[0134] The importance of a knowledge point is a relative metric within the knowledge graph context, reflecting its pivotal position or influence within the entire updated knowledge graph structure.

[0135] In this embodiment of the application, S104 specifically includes:

[0136] S1041. Based on the weight distribution of the connection paths between all knowledge points in the updated subject knowledge graph, calculate the relative importance of each knowledge point in the updated subject knowledge graph.

[0137] In this embodiment, a global analysis is performed on the updated subject knowledge graph to calculate the relative importance of each knowledge point. A graph algorithm specifically designed for analyzing the importance of network nodes is used, treating the knowledge graph as a network composed of knowledge points (nodes) and knowledge dependencies (weighted directed connections). A process of random walks and transmission of learning attention within the network is simulated. When a learning attention is located at a knowledge point node, it jumps to the next knowledge point along the connection line originating from that node with a certain probability. Through numerous simulations and iterative calculations, a stable and convergent score is calculated for each knowledge point node in the graph. This score represents the overall probability that the knowledge point is accessed or relied upon in the network, i.e., its relative importance. The higher the score, the more core and crucial the knowledge point's position within the knowledge system.
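One way to realize such a random-walk importance score is a weighted, PageRank-style power iteration. The sketch below is an assumed instantiation, not necessarily the algorithm used in this application: the damping factor, the uniform handling of nodes without outgoing edges, and the normalization of scores to sum to 1 are all choices made for illustration.

```python
def knowledge_importance(graph, damping=0.85, iterations=100, tol=1e-10):
    """Weighted PageRank-style scores for a graph[src][dst] = weight dict."""
    nodes = set(graph)
    for edges in graph.values():
        nodes.update(edges)
    nodes = sorted(nodes)
    n = len(nodes)
    score = {v: 1.0 / n for v in nodes}
    out_weight = {v: sum(graph.get(v, {}).values()) for v in nodes}
    for _ in range(iterations):
        # Attention sitting on nodes with no outgoing edges is spread evenly.
        dangling = sum(score[v] for v in nodes if out_weight[v] == 0)
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for src, edges in graph.items():
            if out_weight[src] == 0:
                continue
            for dst, weight in edges.items():
                # Jump probability is proportional to the edge weight.
                new[dst] += damping * score[src] * weight / out_weight[src]
        change = sum(abs(new[v] - score[v]) for v in nodes)
        score = new
        if change < tol:
            break
    return score


graph = {"K0": {"K1": 1.0}, "K1": {"K2": 3.85, "K3": 3.65}}
scores = knowledge_importance(graph)
print(scores["K2"] > scores["K3"])  # True
```

Because the jump probability out of a node is proportional to edge weight, a strengthened edge shifts more attention toward its target, so the more heavily weighted successor of `K1` ends up with the higher score. The scores here are normalized to sum to 1, so they are not on the same scale as the illustrative scores quoted elsewhere in this description.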

[0138] S1042. Based on relative importance, select important knowledge points in the updated subject knowledge graph whose relative importance is higher than a preset threshold.

[0139] Here, important knowledge points are the subset of knowledge points whose importance ranks highest, selected by means of a preset threshold. They are the core nodes of the current knowledge system.

[0140] In this embodiment, important knowledge points are selected based on the calculated relative importance score of each knowledge point. An importance score threshold is set; this threshold can be a fixed empirical value or dynamically determined based on the distribution of importance scores across all knowledge points. All knowledge point nodes in the knowledge graph are traversed, and their relative importance scores are compared with this preset threshold. All knowledge point nodes with scores greater than or equal to the threshold are selected to form a set of important knowledge points. These important knowledge points are the core basis for subsequent selection of test questions.
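Both threshold variants, a fixed empirical value or one derived from the score distribution, can be sketched as follows. The function name, the `top_fraction` rule, and the assignment of scores to the labels `K1` to `K5` are assumptions for illustration.

```python
import math


def select_important(scores, threshold=None, top_fraction=None):
    """Return nodes whose score is greater than or equal to the threshold.

    If no fixed threshold is given, it is derived from the distribution:
    the score of the last node inside the top `top_fraction` of the ranking.
    """
    if threshold is None:
        if top_fraction is None:
            raise ValueError("give either threshold or top_fraction")
        ranked = sorted(scores.values(), reverse=True)
        keep = max(1, math.ceil(len(ranked) * top_fraction))
        threshold = ranked[keep - 1]
    return {node for node, s in scores.items() if s >= threshold}


# Hypothetical scores; which node holds which score is assumed here.
scores = {"K1": 8.5, "K2": 7.2, "K3": 6.0, "K4": 2.5, "K5": 1.0}
print(sorted(select_important(scores, threshold=5.0)))     # ['K1', 'K2', 'K3']
print(sorted(select_important(scores, top_fraction=0.5)))  # ['K1', 'K2', 'K3']
```

Using "greater than or equal to" follows the comparison rule stated above, so ties at the threshold are always included.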

[0141] S1043. From the test questions corresponding to the comprehensive feature vector of the test questions, select the test questions that are related to important knowledge points and are located on the connection path with increased weight values in the updated subject knowledge graph.

[0142] Specifically, S1043 may include:

[0143] From the comprehensive feature vector of the test questions, determine the set of knowledge points associated with each test question in the updated subject knowledge graph; from the set of knowledge points, select test questions belonging to important knowledge points to form a preliminary test question set; determine whether the knowledge points associated with each test question in the preliminary test question set are located on the connection path between the group learning difficulty and subsequent knowledge points with increased weight values in the updated subject knowledge graph; and determine the test questions whose associated knowledge points are located on the connection path with increased weight values as the final selected test questions.

[0144] In this embodiment, questions that are closely related to important knowledge points and lie on key teaching paths are selected from all questions in the question bank for which comprehensive feature vectors have been generated. First, the set of knowledge points associated with each question in the updated subject knowledge graph is determined, typically by reading the knowledge point tags in the question metadata and matching them with nodes in the knowledge graph. Next, every question whose set of associated knowledge points contains at least one important knowledge point is selected, and all such questions form a preliminary question set. Finally, each question in the preliminary set is checked to see whether the important knowledge points associated with it lie on the enhanced connection paths in the updated subject knowledge graph. Only questions whose associated knowledge points lie on such enhanced paths, leading from group weaknesses to subsequent knowledge, are finally determined to meet the selection criteria.
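The two-stage selection can be sketched with set operations. This sketch assumes that "lying on an enhanced path" means being the start or end node of a strengthened edge; the application leaves the strictness of path matching open, and any finer-grained semantic check is outside this sketch. All names and data are hypothetical.

```python
def select_questions(question_tags, important, reinforced_edges):
    """Two-stage filter; returns {question: important points on a path}."""
    # Nodes touched by any strengthened edge (assumed path membership rule).
    path_nodes = {node for edge in reinforced_edges for node in edge}
    # Stage 1: keep questions that test at least one important point.
    preliminary = {
        q: tags & important
        for q, tags in question_tags.items()
        if tags & important
    }
    # Stage 2: keep questions whose important points lie on a strengthened path.
    return {
        q: hits & path_nodes
        for q, hits in preliminary.items()
        if hits & path_nodes
    }


question_tags = {
    "q1": {"K2"},
    "q2": {"K3"},
    "q3": {"K1", "K2"},
    "q4": {"perfect square formula"},   # not important: dropped in stage 1
    "q5": {"K9"},                       # important but off-path: dropped in stage 2
}
important = {"K1", "K2", "K3", "K9"}
reinforced_edges = [("K1", "K2"), ("K1", "K3")]
selected = select_questions(question_tags, important, reinforced_edges)
print(sorted(selected))  # ['q1', 'q2', 'q3']
```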

[0145] S1044. Group the selected test questions to obtain multiple test question clusters corresponding to important knowledge points. The multiple test question clusters together constitute the target test question set.

[0146] In this embodiment of the application, all the final selected test questions are organized and grouped according to the important knowledge points associated with these test questions. That is, test questions associated with the same important knowledge point are grouped together to form a test question cluster. All such test question clusters are combined to form the final target test question set. This test question set has a clear focus, which focuses on the core points in the knowledge system and particularly emphasizes those test questions that start from the discovered weak links in group learning and lead to these core points on the key learning path.
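The grouping step can be sketched as a simple inversion of the question-to-knowledge-point mapping. In this sketch a question associated with several important knowledge points appears in each corresponding cluster; assigning it only to its single core knowledge point would need an additional rule that is not modeled here. The names and sample data are hypothetical.

```python
from collections import defaultdict


def cluster_by_knowledge_point(selected):
    """Group questions by the important knowledge points they are tied to.

    `selected` maps a question id to the set of its important knowledge points.
    """
    clusters = defaultdict(list)
    for question, points in selected.items():
        for point in sorted(points):
            clusters[point].append(question)
    return dict(clusters)


selected = {"q1": {"K2"}, "q2": {"K3"}, "q3": {"K2"}}
target_set = cluster_by_knowledge_point(selected)
print(target_set)  # {'K2': ['q1', 'q3'], 'K3': ['q2']}
```

The resulting dictionary of clusters, taken together, plays the role of the target test question set.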

[0147] As an example, first, in step 1041, the updated subject knowledge graph is obtained. In it, the knowledge point "factorization of quadratic equations", which was marked as a group learning difficulty, is denoted $K_1$, and its subsequent knowledge points "the relationship between the graph of a quadratic function and the roots of a quadratic equation" and "solving quadratic inequalities in one variable" are denoted $K_2$ and $K_3$. The weights of the edges from $K_1$ to $K_2$ and from $K_1$ to $K_3$ have been increased from the initial 1.0 and 0.8 to 3.85 and 3.65 respectively. An iterative algorithm based on weight propagation is used to calculate the importance of each knowledge point. The algorithm initializes the importance score of all knowledge points to 1. In the first iteration, $K_2$ receives importance propagated from its predecessor $K_1$; because the weight of the edge from $K_1$ to $K_2$ is 3.85, which is very high, and $K_1$ itself has a certain base importance, $K_2$ obtains a high propagated importance value. At the same time, $K_3$ also receives importance propagated from its predecessor. After multiple rounds of iterative calculation, the importance scores stabilize.

[0148] Assume the final calculation yields importance scores of 8.5, 7.2 and 6.0 for the three knowledge points above, while the importance scores of most other knowledge points in the graph are between 1.0 and 3.0.

[0149] Next, in step 1042, an importance threshold is preset, such as ranking in the top 30% or having a score greater than 5.0. Under the results assumed in the previous step, the three knowledge points score 8.5, 7.2 and 6.0. These scores are all much higher than those of the other ordinary knowledge points, so the three are selected and together constitute the current set of important knowledge points.

[0150] Then, in step 1043, it is assumed that there is a question bank containing hundreds of math questions, each of which has generated a comprehensive feature vector and is associated with the set of knowledge points it tests.

[0151] The first round of screening is based on the important knowledge points: the question bank is traversed to check whether the set of knowledge points associated with each question includes at least one of the three important knowledge points. For example, questions that test one of these knowledge points, or two of them simultaneously, are all selected and enter the preliminary question set, which is assumed to contain 20 questions. A question that only tests the "perfect square formula", however, is filtered out in this round because its associated knowledge point is not in the set of important knowledge points.

[0152] The second round of screening is based on the reinforced paths: these 20 questions undergo detailed verification. The verification rule is that any knowledge point associated with a question must lie on a reinforced connection path. The reinforced paths are the two edges leading from "factorization of quadratic equations" to "the relationship between the graph of a quadratic function and the roots of a quadratic equation" and to "solving quadratic inequalities in one variable". Each question is checked:

[0153] A question whose associated knowledge point is the endpoint of a reinforced path passes screening.

[0154] A question for which the knowledge points involved are the starting point and the endpoint of a reinforced path satisfies the condition and passes screening.

[0155] A question whose associated knowledge points lie on a reinforced path passes screening.

[0156] Suppose another question, although it tests an important knowledge point, is found through finer-grained semantic analysis to focus mainly on the relationship between "quadratic function graph" and "Vieta's formulas", and to be unrelated to the knowledge chain formed by the specific difficulty "factorization of quadratic equations". During path checking, its knowledge connection is judged not to lie strictly on a path that was reinforced because of the specific learning situation, so it is filtered out in this round.

[0157] Suppose that 5 questions remain after this rigorous selection process.

[0158] Finally, in step 1044, the five finally selected questions are grouped according to the core important knowledge point each one tests. Analysis shows that the core examination point of three of the questions is "the relationship between the graph of a quadratic function and the roots of a quadratic equation in one variable", and the core examination point of the other two is "solving a quadratic inequality in one variable".

[0159] Note: where a question is also related to another knowledge point but its core difficulty and teaching objective lean more towards one of them, it is classified into the group of that knowledge point.

[0160] On this basis, two question clusters are formed: one built around "the relationship between the graph of a quadratic function and the roots of a quadratic equation in one variable", containing three questions, and another built around "solving a quadratic inequality in one variable", containing two questions. These two question clusters together constitute the target question set output by this processing. The questions within each cluster not only test important core knowledge points but are also all related to the students' current learning difficulty in factorization and its subsequent impact, demonstrating high teaching relevance. The above example details the complete logical chain from calculation and selection to grouping. In practical applications, parameters such as the specific algorithm for importance calculation, the strictness of path matching, and the grouping strategy can all be adjusted; this application does not impose any limitations on these aspects.

[0161] This application, through the aforementioned steps, achieves precise question location and intelligent aggregation based on a dynamically evolving knowledge graph. First, the importance of knowledge points, incorporating learning feedback, is calculated. Then, this importance, combined with specifically reinforced weak knowledge paths, is used to perform dual-condition filtering on the question bank. Finally, the selected questions are automatically grouped according to core knowledge points. This method ensures that each question in the final target question set simultaneously meets the two key teaching requirements of assessing core knowledge and relating to actual learning weaknesses, thus providing highly accurate and directly usable exercise resources for targeted teaching of group-wide weaknesses.

[0162] Example 2

[0163] Figure 3 is a schematic diagram illustrating a specific implementation of a knowledge graph-based data question clustering processing system provided in this application. Referring to Figure 3, the system may include:

[0164] The acquisition module 31 is used to acquire students' historical answer records and test question information generated on the online education platform;

[0165] The determination module 32 is used to determine the characteristics of difficult questions corresponding to students' difficult questions based on the students' answering behavior carried in the historical answer records, to perform semantic understanding on the text content carried by the question information, to determine the semantic features of the questions, and to generate a comprehensive feature vector of the questions based on the characteristics of difficult questions and the semantic features of the questions.

[0166] The reinforcement module 33 is used to strengthen the connection strength between key knowledge points and related knowledge points in the subject knowledge graph if the difficult question features in the comprehensive feature vector of the test question point to key knowledge points in the preset subject knowledge graph, so as to generate an updated subject knowledge graph.

[0167] The combination module 34 is used to select feature vectors that meet preset conditions from the comprehensive feature vectors of test questions based on the importance of each knowledge point in the updated subject knowledge graph and the connection strength between key knowledge points and their related knowledge points, and to combine the test questions corresponding to the feature vectors that meet the preset conditions into a target test question set.

[0168] The knowledge graph-based data test question clustering processing system of this embodiment is used to implement the aforementioned knowledge graph-based data test question clustering processing method. For its specific implementation, refer to the description of the corresponding method embodiment above; it is not repeated here.

[0169] This application also provides an electronic device, including: a memory for storing a computer program; and a processor for implementing the steps of the knowledge graph-based data question clustering processing method described above when executing the computer program.

[0170] This application also provides a computer-readable storage medium storing a computer program, which, when executed by a processor, implements the steps of any of the above-described knowledge graph-based data question clustering processing methods.

[0171] In one exemplary embodiment, the aforementioned computer-readable storage medium may include, but is not limited to, various media capable of storing computer programs, such as USB flash drives, read-only memory, random access memory, portable hard drives, magnetic disks, or optical disks.

[0172] Embodiments of the present invention also provide a computer program product, which includes a computer program that, when executed by a processor, implements the steps in any of the above embodiments of the knowledge graph-based data question clustering processing method.

[0173] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this invention.

[0174] The foregoing has provided a detailed description of a knowledge graph-based data question clustering processing method and system provided in this application. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the embodiments above are merely for the purpose of helping to understand the method and its core ideas. It should be noted that those skilled in the art can make various improvements and modifications to this application without departing from its principles, and these improvements and modifications also fall within the protection scope of this application.

Claims

1. A data question clustering processing method based on knowledge graphs, characterized in that it comprises: obtaining students' historical answer records and test question information generated on an online education platform; determining, based on the students' answering behavior carried in the historical answer records, the difficult question features of the test questions that students find difficult, performing semantic understanding on the text content carried in the test question information to determine the semantic features of the test questions, and generating a comprehensive feature vector of the test questions based on the difficult question features and the semantic features; if the difficult question features in the comprehensive feature vector of the test questions point to key knowledge points in a preset subject knowledge graph, increasing the connection strength between the key knowledge points in the subject knowledge graph and the knowledge points associated with the key knowledge points, to generate an updated subject knowledge graph; and selecting, based on the importance of each knowledge point in the updated subject knowledge graph and the connection strength between the key knowledge points and their associated knowledge points, feature vectors that meet preset conditions from the comprehensive feature vectors of the test questions, and combining the test questions corresponding to the feature vectors that meet the preset conditions into a target test question set.

2. The method according to claim 1, characterized in that determining the difficult question features based on the students' answering behavior carried in the historical answer records, performing semantic understanding on the text content carried in the test question information to determine the semantic features of the test questions, and generating a comprehensive feature vector of the test questions based on the difficult question features and the semantic features comprises: analyzing the students' historical answer records to identify test questions that more than a preset percentage of students answered incorrectly and whose average answer time exceeds a preset duration, and marking the identified test questions as difficult test questions; extracting the group answering pattern of the difficult test questions from the historical answer records, wherein the error consensus degree and the time consumption anomaly degree in the group answering pattern constitute the difficult question features; analyzing the text content in the test question information to obtain a text semantic representation of each test question, and using the text semantic representation as the semantic feature of the test question; and combining the difficult question features with the semantic features of the corresponding test questions to generate a comprehensive feature vector that represents the relationship between test question attributes and the difficulty of group responses.

3. The method according to claim 2, characterized in that extracting the group answering pattern of the difficult test questions from the historical answer records, wherein the error consensus degree and the time consumption anomaly degree in the group answering pattern constitute the difficult question features, comprises: obtaining, for each difficult test question, all students' answer results and answer times from the historical answer records; calculating, based on the answer results, the proportion of students who answered incorrectly among all students who answered the difficult test question, and determining the proportion as the error consensus degree of the difficult test question; calculating, based on the answer times, the average time by which all students' answer times exceed the preset duration, and determining this average excess time as the time consumption anomaly degree of the difficult test question; and combining the error consensus degree and the time consumption anomaly degree to form the difficult question features of the difficult test question.

4. The method according to claim 1, characterized in that, if the difficult question features in the comprehensive feature vector of the test questions point to key knowledge points in the preset subject knowledge graph, increasing the connection strength between the key knowledge points in the subject knowledge graph and the knowledge points associated with the key knowledge points to generate an updated subject knowledge graph comprises: determining, based on the difficult question features in the comprehensive feature vector of the test questions, the knowledge points corresponding to the difficult question features in the preset subject knowledge graph, and marking those knowledge points as group learning difficulties; finding, in the preset subject knowledge graph, all subsequent knowledge points associated with the group learning difficulties; increasing the weight value of the connection path between the group learning difficulties and each subsequent knowledge point, so as to modify the weight of the corresponding connection path in the preset subject knowledge graph; and generating the updated subject knowledge graph from the preset subject knowledge graph with the modified connection path weights.

5. The method according to claim 4, characterized in that increasing the weight value of the connection path between the group learning difficulties and each subsequent knowledge point, so as to modify the weight of the corresponding connection path in the preset subject knowledge graph, comprises: calculating a weight adjustment range based on the error consensus degree and the time consumption anomaly degree in the difficult question features corresponding to the group learning difficulties; increasing, based on the weight adjustment range, the current weight value of the connection path between the group learning difficulties and the subsequent knowledge points in the preset subject knowledge graph to obtain an updated weight value; and modifying the weight values of the connection paths between the group learning difficulties and the subsequent knowledge points to the updated weight values.

6. The method according to claim 1, characterized in that selecting, based on the importance of each knowledge point in the updated subject knowledge graph and the connection strength between the key knowledge points and their associated knowledge points, feature vectors that meet preset conditions from the comprehensive feature vectors of the test questions, and combining the test questions corresponding to the feature vectors that meet the preset conditions into a target test question set comprises: calculating the relative importance of each knowledge point in the updated subject knowledge graph based on the weight distribution of the connection paths between all knowledge points in the updated subject knowledge graph; selecting, from the updated subject knowledge graph, important knowledge points whose relative importance is higher than a preset threshold; selecting, from the test questions corresponding to the comprehensive feature vectors of the test questions, the test questions that are associated with the important knowledge points and are located on the connection paths with increased weight values in the updated subject knowledge graph; and grouping the selected test questions to obtain multiple test question clusters corresponding to the important knowledge points, wherein the multiple test question clusters together constitute the target test question set.

7. The method according to claim 1, characterized in that selecting, from the test questions corresponding to the comprehensive feature vectors of the test questions, the test questions that are associated with the important knowledge points and are located on the connection paths with increased weight values in the updated subject knowledge graph comprises: determining, from the comprehensive feature vectors of the test questions, the set of knowledge points associated with each test question in the updated subject knowledge graph; selecting, based on the set of knowledge points, the test questions associated with the important knowledge points to form a preliminary test question set; determining whether the knowledge point associated with each test question in the preliminary test question set is located, in the updated subject knowledge graph, on a connection path with an increased weight value between the group learning difficulties and the subsequent knowledge points; and identifying the test questions whose associated knowledge points are located on the connection paths with increased weight values as the final selected test questions.

8. A data question clustering processing system based on knowledge graphs, characterized in that it comprises: an acquisition module, used to acquire students' historical answer records and test question information generated on an online education platform; a determination module, used to determine, based on the students' answering behavior carried in the historical answer records, the difficult question features of the test questions that students find difficult, to perform semantic understanding on the text content carried in the test question information to determine the semantic features of the test questions, and to generate a comprehensive feature vector of the test questions based on the difficult question features and the semantic features; a reinforcement module, used to increase the connection strength between the key knowledge points in the subject knowledge graph and the knowledge points associated with the key knowledge points if the difficult question features in the comprehensive feature vector of the test questions point to the key knowledge points in a preset subject knowledge graph, so as to generate an updated subject knowledge graph; and a combination module, used to select feature vectors that meet preset conditions from the comprehensive feature vectors of the test questions based on the importance of each knowledge point in the updated subject knowledge graph and the connection strength between the key knowledge points and their associated knowledge points, and to combine the test questions corresponding to the feature vectors that meet the preset conditions into a target test question set.

9. An electronic device, characterized in that it comprises: a memory for storing a computer program; and a processor configured to implement, when executing the computer program, the steps of the knowledge graph-based data question clustering processing method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, implements the knowledge graph-based data question clustering processing method according to any one of claims 1 to 7.