Method, apparatus, electronic device, and storage medium for data processing

By extracting heterogeneous data from various data sources and using semantic processing models to generate relationships, the problem of slow response time of educational content has been solved, and timely matching of educational content with industry needs has been achieved.

CN122240809A · Pending · Publication Date: 2026-06-19 · NEW JINCIN

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
NEW JINCIN
Filing Date
2026-03-16
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In existing technologies, determining the direction and content of education based on industry talent demand reports regularly published by governments or industry associations takes too long and cannot meet the changing needs of the industry in a timely manner.

Method used

By receiving data processing instructions, the system obtains demand data from the target industry, extracts heterogeneous data to be processed from various data sources, uses a preset semantic processing model to determine the set of capability parameters for type identifiers, generates capability associations, and responds to data query instructions to obtain the matching capability parameter table for target words.

Benefits of technology

It enables timely response to changes in industry needs, improves the matching efficiency between educational data and industry requirements, and ensures that educational content and direction align with industry needs.

Abstract

This invention discloses a method, apparatus, electronic device, and storage medium for data processing, relating to the field of data processing technology. One specific embodiment of the method includes: receiving a data processing instruction; acquiring demand data corresponding to a target industry; locating corresponding data sources; extracting heterogeneous data to be processed from each data source, the data to be processed including standard documents, educational data, and enterprise-related data corresponding to the target industry; determining the type identifier of the data to be processed; processing the data to be processed based on a preset semantic processing model to obtain a set of capability parameters associated with each preset type indicator; acquiring historical capability parameters of the type identifier; generating capability association relationships corresponding to the type identifier based on the capability parameter set and historical capability parameters; responding to a data query instruction; acquiring the corresponding target word and target type identifier; matching the target word with the capability association relationships corresponding to the target type identifier; obtaining and displaying a matching capability parameter table.

Description

Technical Field

[0001] This invention relates to the field of data processing technology, and in particular to a data processing method, apparatus, electronic device, and storage medium.

Background Technology

[0002] Currently, to meet the talent needs of various industries, some teaching content in school education is often adjusted according to industry demands. Therefore, matching industry needs with educational data is an important task. One approach is to determine educational directions and corresponding content based on industry talent demand reports regularly published by governments or industry associations. However, this method has a long response time and cannot promptly adapt to changes in industry needs.

Summary of the Invention

[0003] In view of this, embodiments of the present invention provide a data processing method, apparatus, electronic device, and storage medium that can solve the following problem: determining educational directions and corresponding educational content based on industry talent demand reports regularly published by governments or industry associations has a long response time and cannot meet the changing needs of the industry in a timely manner.

[0004] To achieve the above objectives, according to one aspect of the present invention, a data processing method is provided.

[0005] An embodiment of the present invention provides a data processing method comprising: receiving a data processing instruction, obtaining demand data corresponding to a target industry, locating corresponding data sources, and extracting heterogeneous data to be processed from each of the data sources, wherein the data to be processed includes standard documents, educational data, and enterprise-related data corresponding to the target industry; determining the type identifier of the data to be processed, and processing the data to be processed based on a preset semantic processing model to obtain, for the type identifier, a set of capability parameters associated with each preset type indicator; obtaining the historical capability parameters of the type identifier, and generating the capability association relationship corresponding to the type identifier based on the capability parameter set and the historical capability parameters; and, in response to a data query instruction, obtaining the corresponding target word and target type identifier, matching the target word against the capability association relationship corresponding to the target type identifier, and obtaining and displaying a matching capability parameter table.

[0006] In one embodiment, the target type identifier includes a first type identifier and a second type identifier, and matching the target word against the capability association relationship corresponding to the target type identifier to obtain and display a matching capability parameter table includes: matching the target word against the capability association relationships corresponding to the first type identifier and the second type identifier, respectively, to obtain a first matching word list and a second matching word list; calculating the matching degree between each matching word in the first matching word list and each matching word in the second matching word list, and determining the text pairs between the first matching word list and the second matching word list based on the matching degree; and displaying the first matching word list, the second matching word list, and the text pairs.
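The pairing step described in this embodiment can be illustrated with a minimal sketch. The source does not specify how the matching degree is computed, so the character-level similarity from Python's standard `difflib` module, the function names, and the 0.5 threshold below are all assumptions made only for illustration.

```python
from difflib import SequenceMatcher


def matching_degree(a: str, b: str) -> float:
    """Assumed matching degree: character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def build_text_pairs(first_list, second_list, threshold=0.5):
    """Pair words from the two matching word lists whose degree meets the threshold."""
    pairs = []
    for w1 in first_list:
        for w2 in second_list:
            score = matching_degree(w1, w2)
            if score >= threshold:  # threshold is a hypothetical cut-off
                pairs.append((w1, w2, round(score, 3)))
    # Strongest pairs first, so a display layer can show them in order.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Hypothetical lists, e.g. matches under a course type and a recruitment type.
first_matching_words = ["data analysis", "statistics"]
second_matching_words = ["big data analysis", "welding"]
text_pairs = build_text_pairs(first_matching_words, second_matching_words)
```

In practice an embedding-based similarity would likely replace the string similarity used here; the structure of the pairing loop would stay the same.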

[0007] In another embodiment, processing the data to be processed based on a preset semantic processing model to obtain, for the type identifier, a set of capability parameters associated with each preset type indicator includes: obtaining the knowledge graph corresponding to the type identifier, recognizing the data to be processed based on the knowledge graph, and determining the entity text and entity label corresponding to each preset type indicator; obtaining the capability target level and knowledge dictionary corresponding to the preset type indicator, and determining the capability target of the entity text in combination with the entity label; and generating structured data of the entity text based on the capability target and the entity label to obtain the set of capability parameters for each preset type indicator.

[0008] In another embodiment, generating structured data of the entity text based on the capability target and the entity label to obtain the set of capability parameters for each preset type indicator includes: obtaining a preset capability weight matrix, and determining the capability weight coefficient of each entity text based on its capability target; identifying entity texts with the same entity label and generating a fusion vector based on the capability weight coefficients; and generating structured data of the entity text based on the fusion vector, the entity label, and the capability target to obtain the set of capability parameters for each preset type indicator.
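As a rough illustration of the fusion step, the sketch below treats the fusion vector as a weighted average of the vectors of entity texts that share an entity label. The source does not define the weight matrix layout or the fusion formula, so the nested-dictionary matrix, the weight values, and the averaging rule are assumptions.

```python
from collections import defaultdict

# Hypothetical capability weight matrix: entity label -> capability target -> coefficient.
CAPABILITY_WEIGHT_MATRIX = {
    "skill": {"understand": 0.2, "apply": 0.3, "master": 0.5},
    "knowledge": {"understand": 0.4, "apply": 0.3, "master": 0.3},
}


def fuse_by_label(entities):
    """Group entity texts by entity label and return one weighted-average vector per label.

    Each entity is a dict with keys 'label', 'target' and 'vector' (a list of floats).
    """
    groups = defaultdict(list)
    for entity in entities:
        groups[entity["label"]].append(entity)

    fused = {}
    for label, items in groups.items():
        weights = [CAPABILITY_WEIGHT_MATRIX[label][e["target"]] for e in items]
        total = sum(weights)
        dim = len(items[0]["vector"])
        fused[label] = [
            sum(w * e["vector"][i] for w, e in zip(weights, items)) / total
            for i in range(dim)
        ]
    return fused


entities = [
    {"label": "skill", "target": "master", "vector": [1.0, 0.0]},
    {"label": "skill", "target": "apply", "vector": [0.0, 1.0]},
]
fusion_vectors = fuse_by_label(entities)
```

With these toy inputs the fused "skill" vector leans toward the entity whose capability target carries the larger coefficient.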

[0009] In another embodiment, processing the data to be processed based on a preset semantic processing model to obtain, for the type identifier, a set of capability parameters associated with each preset type indicator includes: extracting a first term set from the data to be processed; obtaining the position information of each term in the first term set, calculating the statistical value of each term, and filtering a second term set from the first term set based on the position information, the statistical value, and the preset dictionary corresponding to the type identifier; and obtaining the source and the text segment corresponding to each term in the second term set, determining the corresponding preset type indicators based on the source and the text segment, and generating structured data for each term in the second term set to obtain the capability parameter set for each preset type indicator.

[0010] In another embodiment, obtaining the position information of each term in the first term set, calculating the statistical value of each term, and filtering a second term set from the first term set based on the position information, the statistical value, and a preset dictionary corresponding to the type identifier includes: for each term in the first term set, calculating the term frequency and determining the first weight of the term; obtaining the position information of the term and determining the second weight of the term based on the text structure corresponding to the position information; matching the term against the preset dictionary corresponding to the type identifier and determining the third weight of the term based on the matching result; and selecting the second term set from the first term set based on the first weight, the second weight, and the third weight.
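The three weights described above can be sketched as follows. The source names the inputs to each weight (term frequency, text structure at the term's position, dictionary match) but not the formulas, so the normalisation, the structure weight table, the simple average, and the 0.6 threshold are illustrative assumptions.

```python
from collections import Counter

# Hypothetical weights for the text structure in which a term appears.
STRUCTURE_WEIGHTS = {"title": 1.0, "heading": 0.8, "body": 0.5}


def score_terms(occurrences, preset_dictionary):
    """Return {term: (first_weight, second_weight, third_weight)}.

    occurrences is a list of (term, structure) pairs, one per occurrence.
    """
    counts = Counter(term for term, _ in occurrences)
    max_count = max(counts.values())
    scores = {}
    for term, count in counts.items():
        first = count / max_count  # normalised term frequency
        second = max(               # best structural position of the term
            STRUCTURE_WEIGHTS.get(s, 0.5) for t, s in occurrences if t == term
        )
        third = 1.0 if term in preset_dictionary else 0.0  # dictionary match
        scores[term] = (first, second, third)
    return scores


def select_terms(scores, threshold=0.6):
    """Keep terms whose average of the three weights meets the assumed threshold."""
    return sorted(t for t, ws in scores.items() if sum(ws) / 3 >= threshold)


occurrences = [
    ("python", "title"),
    ("python", "body"),
    ("teamwork", "body"),
    ("lunch", "body"),
]
second_term_set = select_terms(score_terms(occurrences, {"python", "teamwork"}))
```

Here "lunch" is dropped because it appears only once, only in body text, and is absent from the dictionary.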

[0011] In yet another embodiment, filtering a second term set from the first term set based on the first weight, the second weight, and the third weight includes: determining the initial weight of each term in the first term set based on the first weight, the second weight, and the third weight; identifying the terms in the first term set whose initial weight is greater than a preset weight threshold as high-weight terms; determining the co-occurrence frequency of each term in the first term set with the high-weight terms, and determining the fourth weight of each term in the first term set based on the co-occurrence frequency; and selecting the second term set from the first term set based on the initial weight and the fourth weight.
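A minimal sketch of the co-occurrence refinement follows. How the fourth weight is derived from the co-occurrence frequency and how it is combined with the initial weight are not stated in the source; the fraction-of-segments definition, the additive combination with `alpha`, and both thresholds are assumptions.

```python
def refine_with_cooccurrence(initial_weights, segments,
                             weight_threshold=0.7, final_threshold=0.6, alpha=0.5):
    """Select terms using the initial weight plus a co-occurrence (fourth) weight.

    initial_weights: {term: initial weight}
    segments: list of sets, each holding the terms found in one text segment
    """
    high_weight_terms = {t for t, w in initial_weights.items() if w > weight_threshold}
    n_segments = len(segments)
    selected = []
    for term, initial in initial_weights.items():
        others = high_weight_terms - {term}
        # Count segments where the term appears together with another high-weight term.
        co_count = sum(1 for seg in segments if term in seg and others & seg)
        fourth = co_count / n_segments if n_segments else 0.0
        if initial + alpha * fourth >= final_threshold:
            selected.append(term)
    return sorted(selected)


initial_weights = {"python": 0.9, "pandas": 0.4, "lunch": 0.4}
segments = [{"python", "pandas"}, {"python", "pandas"}, {"lunch"}, {"python"}]
refined_terms = refine_with_cooccurrence(initial_weights, segments)
```

In this toy input "pandas" has a low initial weight but is retained because it repeatedly co-occurs with the high-weight term "python", while "lunch" is not.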

[0012] In yet another embodiment, extracting heterogeneous data to be processed from each of the data sources includes: extracting heterogeneous data to be processed from each of the data sources based on a preset time period; obtaining a parser corresponding to the transmission protocol type of each data source, parsing the data to be processed to obtain the corresponding text data, and generating a semantic vector corresponding to the text data; and determining the type identifier corresponding to the semantic vector, obtaining a preset business strategy, determining the storage location of the semantic vector based on the type identifier and the business strategy, and storing the semantic vector.

[0013] To achieve the above objectives, according to another aspect of the present invention, a data processing apparatus is provided.

[0014] An embodiment of the present invention provides a data processing apparatus comprising: an acquisition unit, configured to receive a data processing instruction, acquire demand data corresponding to a target industry, locate corresponding data sources, and extract heterogeneous data to be processed from each of the data sources, wherein the data to be processed includes standard documents, educational data, and enterprise-related data corresponding to the target industry; a processing unit, configured to determine the type identifier of the data to be processed and process the data to be processed based on a preset semantic processing model to obtain, for the type identifier, a set of capability parameters associated with each preset type indicator; a generation unit, configured to obtain historical capability parameters of the type identifier and generate the capability association relationship corresponding to the type identifier based on the capability parameter set and the historical capability parameters; and a matching unit, configured to respond to a data query instruction, obtain the corresponding target word and target type identifier, match the target word against the capability association relationship corresponding to the target type identifier, and obtain and display the matching capability parameter table.

[0015] To achieve the above objectives, according to another aspect of the present invention, an electronic device is provided.

[0016] An electronic device according to an embodiment of the present invention includes: one or more processors; and a storage device for storing one or more programs, wherein when the one or more programs are executed by the one or more processors, the one or more processors implement the data processing method provided in the embodiment of the present invention.

[0017] To achieve the above objectives, according to another aspect of the present invention, a computer-readable medium is provided.

[0018] An embodiment of the present invention provides a computer-readable medium having a computer program stored thereon, which, when executed by a processor, implements the data processing method provided in the embodiment of the present invention.

[0019] To achieve the above objectives, according to another aspect of the present invention, a computer program product is provided.

[0020] A computer program product according to an embodiment of the present invention includes a computer program that, when executed by a processor, implements the data processing method provided in the embodiment of the present invention.

[0021] One embodiment of the above invention has the following advantages or beneficial effects. In this embodiment, data sources are located based on the demand data of the target industry, and heterogeneous data to be processed is obtained from them. This data can include standard documents, educational data, and enterprise-related data corresponding to the target industry. In other words, data related to the demand data is obtained from standards, the education sector, and enterprise-related sectors, which ensures the comprehensiveness of the data to be processed. The type identifier of the data to be processed is then determined, and the data is processed using a preset semantic processing model to determine, for the corresponding type identifier, the set of capability parameters of each preset type indicator, that is, the set of capability parameters associated with the target industry's demand. Finally, the obtained capability parameters are combined with historical capability parameters to generate the capability association relationships corresponding to the type identifier. These relationships reflect the association between each capability parameter and the preset type indicators; that is, through them, the capabilities of the preset type indicators in different types can be linked, so that comprehensive relationships spanning the education field and related enterprise fields can be established in a timely manner based on industry needs. When a relationship query is performed, a table of capability parameters matching the target word under the target type identifier can be obtained, so the association between the target word and the capability parameters in the education field and related enterprise fields can be accurately obtained. Based on the capability association relationships corresponding to the capability parameters, the educational content and educational direction that match the needs of various industries can be determined. This allows education to meet changes in industry needs in a timely manner and improves the efficiency of matching educational data with industry needs.

[0022] The further effects of the above non-conventional optional approaches will be explained below in conjunction with the specific embodiments.

Description of the Drawings

[0023] The accompanying drawings are provided to better understand the invention and are not intended to unduly limit the scope of the invention. Wherein:
Figure 1 is a schematic diagram of a main flow of a data processing method according to an embodiment of the present invention;
Figure 2 is a schematic diagram of a matching chart according to an embodiment of the present invention;
Figure 3 is a schematic diagram of a target word query operation according to an embodiment of the present invention;
Figure 4 is a schematic diagram of another main flow of a data processing method according to an embodiment of the present invention;
Figure 5 is a schematic diagram of the term processing procedure according to an embodiment of the present invention;
Figure 6 is a schematic diagram of another main flow of a data processing method according to an embodiment of the present invention;
Figure 7 is a schematic diagram of another main flow of a data processing method according to an embodiment of the present invention;
Figure 8 is a schematic diagram of another main flow of a data processing method according to an embodiment of the present invention;
Figure 9 is a schematic diagram of the main units of a data processing apparatus according to an embodiment of the present invention;
Figure 10 is an exemplary system architecture diagram in which embodiments of the present invention can be applied;
Figure 11 is a schematic diagram of the structure of a computer system suitable for implementing embodiments of the present invention.

Detailed Implementation

[0024] The following description, in conjunction with the accompanying drawings, illustrates exemplary embodiments of the present invention, including various details to aid understanding. These details should be considered merely exemplary. Therefore, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the invention. Similarly, for clarity and brevity, descriptions of well-known functions and structures are omitted in the following description.

[0025] It should be noted that, unless otherwise specified, the embodiments and features described in this invention can be combined with each other. The acquisition, transmission, storage, use, and processing of data in this application comply with relevant national laws and regulations. In the embodiments of this application, certain existing industry solutions such as software, components, and models may be mentioned. These should be considered exemplary, intended only to illustrate the feasibility of implementing the technical solution of this application, and do not imply that the applicant has already used or necessarily used such solutions.

[0026] This invention provides a data processing system that can be used in various fields for data processing, such as processing educational data and enterprise-related data to establish capability data associations between them.

[0027] Specifically, in this embodiment of the invention, capability parameters can be used to represent the parameters that reflect capabilities in educational data and enterprise-related data. Generally, educational data provides knowledge to people, which means that capabilities can be provided. That is, the capability parameters corresponding to educational data represent the capabilities that educational data can provide to people, such as knowledge points. Enterprise-related data represents the capabilities that enterprises or industries require of their personnel. That is, the capability parameters corresponding to enterprise-related data represent the capabilities that enterprises require of their personnel in their relevant fields, such as technical capabilities.

[0028] It should be noted that since industry development and demand changes can be reflected through enterprise needs or development directions, this embodiment of the invention determines the capability parameters corresponding to industry needs by analyzing educational data and enterprise-related data. The standard documents of the target industry can provide guidance on development directions, etc., so the standard documents can also be used to analyze the capability parameters corresponding to industry needs.

[0029] This invention provides a data processing method, which can be executed by a data processing system. As shown in Figure 1, the method includes the following steps.

[0030] S101: Receive a data processing instruction, obtain the demand data corresponding to the target industry, locate the corresponding data sources, and extract heterogeneous data to be processed from each data source.

[0031] The data to be processed includes standard documents, educational data, and enterprise-related data corresponding to the target industry. A data processing instruction is an instruction to update the capability association relationships; it can be sent by an external system or triggered automatically. For example, in this embodiment, a scheduled update task for the capability association relationships can be set up so that data processing instructions are triggered automatically at regular intervals. The scheduled update task can include information about the target industry, meaning the data processing instruction includes target industry information. Alternatively, data processing instructions sent by other systems can be received, and these instructions also include target industry information. In this step, the target industry, i.e., the industry whose needs drive the update of the capability association relationships, can be determined from the data processing instruction, and the demand data of that target industry can then be obtained.

[0032] Demand data for the target industry is descriptive data about the capabilities the industry requires. For example, if an industry requires big data analysis, then either "big data analysis" or "data analysis" can serve as demand data. In this step, the demand data can be carried in the data processing instruction, meaning it can be obtained from the instruction. Alternatively, this step can use technologies such as web scraping to obtain demand information for the target industry and then extract demand data from that information. For instance, the demand information might state that significant results have been achieved through big data analysis in a particular industry; semantic understanding can determine that "big data analysis" is descriptive data representing a capability the industry requires, and "big data analysis" is therefore identified as demand data.

[0033] It should be noted that in this embodiment of the invention, the capability association relationships of a target industry can be updated multiple times. Therefore, the demand data obtained in this step can be compared with the historical demand data of the target industry, and any item that duplicates the historical demand data can be removed from the newly obtained demand data.
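The de-duplication against historical demand data can be sketched as below. The source only says that items identical to historical demand data are removed; the case-insensitive, whitespace-trimmed comparison and the function name are assumptions.

```python
def new_demand_data(current, historical):
    """Return items of `current` that are absent from `historical`, preserving order."""
    seen = {item.strip().lower() for item in historical}
    result = []
    for item in current:
        key = item.strip().lower()  # assumed normalisation before comparing
        if key not in seen:
            seen.add(key)           # also drops repeats within `current`
            result.append(item)
    return result


fresh = new_demand_data(
    ["Big data analysis", "machine learning", "big data analysis "],
    ["big data analysis"],
)
```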

[0034] After obtaining the demand data, the corresponding data sources can be determined based on it. In this embodiment of the invention, the capability association relationships can be updated based on the standard documents, educational data, and enterprise-related data corresponding to the target industry, so the data sources can include those from which standard documents, educational data, and enterprise-related data can be obtained. Standard documents are usually industry standard documents issued by national agencies or certified security organizations, such as standard industry classification information, standard occupational classification information, and industry standards, and can usually be obtained from public platforms. Educational data can include educational course data, major information, student employment data, educational questionnaires, and other survey data related to education, schools, and students. This data is usually stored in, and can be obtained from, the databases of educational institutions. Enterprise-related data can include enterprise data, job recruitment data, corresponding industry chain data, etc., which can be obtained from enterprise websites, public platforms, and enterprise databases that allow data access. In this step, the enterprises corresponding to the target industry can be determined based on the enterprises in which it is applied, which in turn determines the data sources for the enterprise-related data. The corresponding major information (e.g., university majors) can be determined based on the target industry and the demand data, which in turn determines the data sources for the educational data corresponding to those majors. The major information corresponding to each industry can be pre-configured or determined through semantic analysis based on the target industry and the demand data. The data sources for the standard documents corresponding to the target industry can be pre-configured, and this step can determine them from the configuration information. After each data source is determined, the data to be processed can be obtained from it. Because the data sources differ, the format of the data to be processed is not uniform and can include video data, text data, image data, etc.; the data to be processed is therefore heterogeneous.

[0035] In one implementation, because new industry demands typically emerge within a recent period rather than over a long time span, only data from a recent timeframe needs to be acquired when obtaining the data to be processed. This step can therefore extract heterogeneous data from the various data sources based on a preset time period. The length of this time period can be set according to requirements, for example, the last two years. Because different data sources may use different transmission protocols, this embodiment can preset multiple parsers corresponding to the various transmission protocols. When the data to be processed is acquired, the parser corresponding to the transmission protocol is used to parse the data, converting it into text data and generating corresponding semantic vectors for subsequent data processing.

[0036] Specifically, in this embodiment of the invention, a lightweight LLM can be preset to identify, in real time, the transmission protocol type of a data source (such as SCORM, HLS, EdX API, or xAPI), so that the corresponding parser can be determined and used to parse the data to be processed. In addition, an LLM semantic understanding model can be pre-configured to convert the data to be processed into text data and generate semantic vectors. The method of generating semantic vectors is not limited; for example, they can be generated through an attention mechanism.

[0037] It should be noted that in this embodiment of the invention, a multimodal intelligent adaptation gateway can be pre-configured. By combining the intelligent adaptation gateway with the LLM semantic understanding model, unified access, semantic understanding, and dynamic routing of the data to be processed are achieved. The intelligent adaptation gateway can be configured with various parsers to dynamically adapt to and parse the corresponding data to be processed. In this embodiment of the invention, dynamic configuration of the parsers can also be implemented, such as adding or removing parsers as needed.
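The idea of selecting a parser by transmission protocol type, with parsers that can be added or removed at run time, can be sketched as a small registry. The class and function names are hypothetical, the protocol type is passed in directly rather than identified by an LLM, and the toy parser uses a flattened JSON shape that is much simpler than a real xAPI statement.

```python
import json
from typing import Callable, Dict


class ParserRegistry:
    """Maps a transmission protocol type to a parser that returns text data."""

    def __init__(self) -> None:
        self._parsers: Dict[str, Callable[[bytes], str]] = {}

    def register(self, protocol: str, parser: Callable[[bytes], str]) -> None:
        self._parsers[protocol] = parser

    def unregister(self, protocol: str) -> None:
        self._parsers.pop(protocol, None)

    def parse(self, protocol: str, payload: bytes) -> str:
        try:
            parser = self._parsers[protocol]
        except KeyError:
            raise ValueError(f"no parser registered for protocol {protocol!r}")
        return parser(payload)


def parse_simplified_statement(payload: bytes) -> str:
    """Toy parser: flattens a simplified actor/verb/object JSON record into text."""
    record = json.loads(payload)
    return f"{record['actor']} {record['verb']} {record['object']}"


registry = ParserRegistry()
registry.register("xapi", parse_simplified_statement)
text_data = registry.parse(
    "xapi", b'{"actor": "student", "verb": "completed", "object": "SQL course"}'
)
```

Keeping the mapping in a dictionary is one simple way to support the dynamic configuration mentioned above, since registering or unregistering a parser does not require changes to the calling code.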

[0038] The data to be processed can be pre-classified, and the type can represent a dimension of capability analysis. Capability analysis is performed based on the classified types. The classification method is not limited; for example, educational data can be classified into one or more types such as major type, course type, and student type, while enterprise-related data can be classified into one or more types such as enterprise type, industry type, occupation type, recruitment type, and sector type. Appropriate type identifiers can be configured for each classified type. In this embodiment of the invention, the type identifier of the data to be processed can be determined by analyzing semantic vectors.

[0039] It should be noted that the data to be processed can be divided into multiple text data based on different sources, etc. Each text data can be assigned a corresponding type identifier, and each text data can correspond to one or more type identifiers. To facilitate subsequent processing, the semantic vectors of the text data can be stored in this step. The storage method can be determined according to business needs. Specifically, a business strategy can be preset. For example, if the business needs include data analysis based on region, time, etc., then the semantic vectors can be stored according to region, time, etc. The business strategy can include setting data storage locations for different regions and times. Therefore, in this step, the corresponding storage locations can be determined based on the preset business strategy and type identifiers, and then the semantic vectors can be stored separately. The business strategy can also include data storage methods, such as real-time data storage, batch data storage, etc., to facilitate the use of different data storage methods for storing semantic vectors.
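One way to picture how a business strategy and a type identifier together determine a storage location is the path-building sketch below. The strategy fields (`root`, `partition_by`, `mode`), the metadata keys, and the path layout are all hypothetical; the source only states that the strategy can set locations by region, time, and similar dimensions, and can also name a storage method.

```python
def storage_location(type_id, metadata, strategy):
    """Build a storage path from the type identifier and the strategy's partition keys."""
    parts = [strategy["root"], type_id]
    for key in strategy["partition_by"]:
        # Fall back to "unknown" when the vector's metadata lacks a partition key.
        parts.append(str(metadata.get(key, "unknown")))
    return "/".join(parts)


# Hypothetical strategy: partition by region and year, write in batches.
business_strategy = {"root": "vectors", "partition_by": ["region", "year"], "mode": "batch"}
location = storage_location("course", {"region": "east", "year": 2025}, business_strategy)
```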

[0040] In this embodiment of the invention, when performing semantic understanding on the data to be processed, meaningless data (such as content in course videos that is irrelevant to the course), illogical data (such as obvious common-sense errors), and sensitive data can also be identified so that the semantic vectors can be cleaned. Specifically, data cleaning can be performed by combining preset verification rules with an LLM (large language model). For educational data, a preset educational data quality verifier can also be used.

[0041] It should be noted that, in this embodiment of the invention, the data to be processed may include multiple modalities. For example, educational data may include video data (such as recorded classroom videos), behavioral logs (such as student behavior during course learning through applications), course packages (such as textbook data packages for teaching courses), etc. To unify the data, this embodiment of the invention can convert multimodal data into unified semantic vectors, for example through a multimodal attention mechanism. For subsequent data processing, semantic tags can be added to the semantic vectors to improve the understanding of the data content and enrich its context. Specifically, semantic tags can be added to the semantic vectors using an LLM-based semantic understanding module. Semantic tags may include chapter node tags, course type tags, knowledge point mastery levels, enterprise association tags, industry association tags, recruitment association tags, region of origin, corresponding time period, etc., and can be set according to requirements.

[0042] S102: Determine the type identifier of the data to be processed, and process the data to be processed based on the preset semantic processing model to obtain, for the type identifier, the set of capability parameters associated with each preset type indicator.

[0043] In this embodiment of the invention, the corresponding type identifier can be determined through semantic analysis of the data to be processed. For example, a semantic vector of the data to be processed can be generated first, and semantic understanding of that vector can then be used to determine its type and thus the type identifier of the data to be processed. Semantic understanding can be achieved through a pre-trained LLM. The semantic processing model is pre-trained and can be implemented according to specific needs. The data corresponding to each type identifier can have multiple preset type indicators, each representing a different classification within that type of data. For example, the preset type indicators in the data corresponding to the course type identifier can include the course names after course classification, i.e., teaching subjects such as advanced data courses and physics courses; as another example, the preset type indicators in the data corresponding to the major type identifier can include the major names after major classification, such as philosophy, economics, law, education, literature, and history; as yet another example, the preset type indicators in the data corresponding to the recruitment type identifier can include the job names after job classification, such as management positions, professional and technical positions, and skilled worker positions.

[0044] It should be noted that the classification method of each type and the preset type indicators within each type in the embodiments of the present invention can be set according to the needs of the application scenario, and are not limited in the embodiments of the present invention.

[0045] In this step, after determining the type identifier of the data to be processed, the data can be further processed based on a preset semantic processing model to determine the set of capability parameters for each preset type indicator associated with the type identifier. Capability parameters can represent the capabilities that can be achieved or are required for the corresponding preset type indicator, and the set of capability parameters for each preset type indicator represents the set of capabilities that the personnel corresponding to that preset type indicator can achieve or are required to achieve.

[0046] Specifically, processing the data to be processed based on the preset semantic processing model can identify and extract capability parameters from the data. Then, based on the context of each capability parameter, the corresponding preset type indicator and source are determined to generate structured data corresponding to the capability parameter. Capability parameter identification can be achieved through text recognition. For example, a preset set of texts representing capability parameters can be used to perform text recognition on the data to be processed. Then, the identified text is verified and merged to obtain reasonable capability text, and the capability text is determined as the capability parameter.

[0047] S103: Obtain the historical capability parameters of the type identifier, and generate the capability association relationship corresponding to the type identifier based on the capability parameter set and the historical capability parameters.

[0048] After the set of capability parameters is obtained, a capability association relationship is established among the capability parameter sets belonging to the same type identifier. When a type identifier has corresponding historical capability parameters, the capability parameter set and the historical capability parameters can be merged to generate a comprehensive capability association relationship.

[0049] It should be noted that the capability association relationship can include the associations between capability parameters and data sources, data types, and preset type indicators. Since each capability parameter can belong to different types and preset type indicators, the capability association relationship can reflect the relationship between educational data and enterprise-related data.

[0050] In this embodiment of the invention, the capability association relationship can be specifically a graph. For example, capability parameters, preset type indicators, types, sources, etc. can all be used as entities in the graph. By establishing the association relationship between entities, a graph that can reflect the relationship between data can be obtained.
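As an illustration of the graph form described above, the following is a minimal sketch that stores capability parameters, preset type indicators, types, and sources as entities and links them with undirected associations. The class name `CapabilityGraph`, the `(kind, name)` entity encoding, and the sample entities are assumptions made for illustration; the embodiment does not prescribe a particular graph store.

```python
from collections import defaultdict


class CapabilityGraph:
    """Undirected graph whose entities are (kind, name) tuples (hypothetical encoding)."""

    def __init__(self):
        self.edges = defaultdict(set)

    def add_relation(self, a, b):
        # Store the association in both directions.
        self.edges[a].add(b)
        self.edges[b].add(a)

    def neighbors(self, node, kind=None):
        # Return the associated entities, optionally restricted to one kind.
        linked = self.edges.get(node, set())
        return sorted(n for n in linked if kind is None or n[0] == kind)


graph = CapabilityGraph()
cap = ("capability", "statistical analysis ability")
graph.add_relation(cap, ("indicator", "data analysis course"))
graph.add_relation(cap, ("indicator", "data analyst position"))
graph.add_relation(cap, ("type", "course"))
graph.add_relation(cap, ("source", "course package"))

print(graph.neighbors(cap, kind="indicator"))
```

Because the same capability entity can be linked to indicators from both education data and enterprise-related data, a neighbor lookup of this kind exposes the cross-type relationship that the capability association relationship is meant to reflect.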

[0051] S104: In response to a data query instruction, obtain the corresponding target word and target type identifier, match the target word against the capability association relationship corresponding to the target type identifier, and obtain and display the matching capability parameter table.

[0052] After establishing capability relationships, the connections between data can be viewed through data queries. For example, if business personnel want to view the capability parameters associated with a word in the education field, enterprise-related field, etc., they can use that word as the target word for this step of the data query.

[0053] The data query command includes the target term and the target type identifier. The target term indicates that the query aims to obtain the capability parameters associated with that term, and the target type identifier indicates the scope of the query. For example, if the industry requirement includes the target term "data analysis," and the user wants to know the capability parameters associated with it in the course type and industry type, then the target type identifier can be determined to include the course type identifier and the industry type identifier. Therefore, the data query command would include "data analysis," the course type identifier, and the industry type identifier.

[0054] Specifically, in this embodiment of the invention, an LLM can be preset to match the target word against the capability association relationship corresponding to the target type identifier. Through the reasoning ability of the LLM, the semantic limitations of short texts can be overcome, and the implicit semantic features of the target word under different types can be explicitly mined and expanded, thereby providing richer and more accurate semantic information for subsequent cross-type data matching. The output of the LLM can be represented in JSON format.

[0055] For example, taking "data analysis" as the target term, the corresponding Prompt can be generated from the target term and the target type identifier according to a preset format. After the Prompt is input into the preset LLM, the capability parameter tables in JSON format corresponding to the course type and the industry type are obtained respectively.

[0056] The prompt template can be as follows: You are a precise domain semantic analyzer. For a given "skill term," please output its relevance features in both the "curriculum" and "industry" scenarios, strictly following the JSON format below. Skill term: {input_term}. Requirements: 1. Curriculum side: List the "core competencies" or "teaching objectives" cultivated by this skill in school education, up to 3 items. 2. Industry side: List the "practical value" or "application scenarios" of this skill in corporate positions, up to 3 items. 3. The output must be in plain JSON format, without any other interpretation.

[0057] Based on the above, with "data analysis" as the input, the output can be represented as { "education": ["statistical analysis ability", "data visualization understanding", "critical thinking"], "industry": ["business insight mining", "decision support reporting", "user behavior analysis"]}. That is, the corresponding ability parameter table for the course type is: statistical analysis ability, data visualization understanding, and critical thinking; the corresponding ability parameter table for the industry field is: business insight mining, decision support reporting, and user behavior analysis.
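The prompt construction and output parsing described above can be sketched as follows. This is a minimal sketch: the template is abridged from the one quoted above, the function names `build_prompt` and `parse_capability_tables` are hypothetical, and the call to the LLM itself is omitted, with the sample output from the text standing in for the model response.

```python
import json

# Abridged from the prompt template quoted above; {input_term} is the only placeholder.
PROMPT_TEMPLATE = (
    'You are a precise domain semantic analyzer. For a given "skill term," '
    'please output its relevance features in both the "curriculum" and '
    '"industry" scenarios. Skill term: {input_term}. '
    "The output must be in plain JSON format, without any other interpretation."
)


def build_prompt(term: str) -> str:
    """Fill the target term into the preset template."""
    return PROMPT_TEMPLATE.format(input_term=term)


def parse_capability_tables(raw: str, max_items: int = 3) -> dict:
    """Parse the LLM output and keep at most max_items entries per table."""
    data = json.loads(raw)
    tables = {}
    for key in ("education", "industry"):
        items = data.get(key)
        if not isinstance(items, list) or not all(isinstance(i, str) for i in items):
            raise ValueError(f"'{key}' must be a list of strings")
        tables[key] = items[:max_items]
    return tables


# Sample output taken from the example in the text, standing in for a real LLM call.
sample_output = (
    '{"education": ["statistical analysis ability", '
    '"data visualization understanding", "critical thinking"], '
    '"industry": ["business insight mining", '
    '"decision support reporting", "user behavior analysis"]}'
)

prompt = build_prompt("data analysis")
tables = parse_capability_tables(sample_output)
print(tables["education"])
print(tables["industry"])
```

Validating the parsed structure before use is a design choice of this sketch; it guards against responses that are valid JSON but do not follow the requested layout.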

[0058] After the capability parameter tables are obtained, they can be displayed for relevant personnel to view; the display method is not limited. For example, the target word and the capability parameter tables can be concatenated and displayed as a whole. For instance, if the target word is "data analysis," the concatenated text could be represented as: "Data Analysis. Capability parameter table corresponding to course type: statistical analysis ability, data visualization understanding, critical thinking. Capability parameter table corresponding to industry type: business insight mining, decision support reporting, user behavior analysis."

[0059] It should be noted that, in this embodiment of the invention, the Prompt template substantially limits the randomness of the LLM output and ensures that the data is structured through role setting, task clarification, format constraints, and examples. The LLM output can be parsed as JSON to extract the text of the two word tables: education and industry.

[0060] In another implementation, the target type identifier can include multiple identifiers, such as the course type identifier and industry type identifier in the example above. To further determine the association between the target word and data of different types, the capability parameter tables corresponding to each target type can be matched against each other. Specifically, taking any two of the multiple type identifiers included in the target type identifier as an example, the target type identifier can include a first type identifier and a second type identifier. The step of obtaining and displaying the matching capability parameter table can be executed as follows: matching the target word with the capability association relationships corresponding to the first type identifier and the second type identifier respectively, to obtain a first matching word table and a second matching word table; calculating the matching degree between each matching word in the first matching word table and each matching word in the second matching word table, and determining the text pairs between the first matching word table and the second matching word table based on the matching degree; and displaying the first matching word table, the second matching word table, and the text pairs.

[0061] To represent the matching degree between capability parameters of two types, in this embodiment of the invention, a matching threshold between the two types can be preset. When the calculated matching degree between a capability parameter of one type and a capability parameter of the other type is greater than the matching threshold, the two capability parameters can be considered a matching text pair.

[0062] It should be noted that the matching thresholds can differ between different types, and the specific value of the matching threshold can be set according to needs and adjusted in real time. For example, the matching threshold between course type and industry type can be set to 0.75.
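The determination of text pairs between two matching word tables can be sketched as follows. This is a minimal sketch that uses cosine similarity over toy vectors as the matching degree; the vectors, the function names, and the dictionary layout are assumptions for illustration, and in practice the vectors would come from whatever embedding model is used.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_text_pairs(first_table, second_table, threshold=0.75):
    """Return (word1, word2, degree) for every pair whose degree exceeds the threshold."""
    pairs = []
    for w1, v1 in first_table.items():
        for w2, v2 in second_table.items():
            degree = cosine(v1, v2)
            if degree > threshold:
                pairs.append((w1, w2, round(degree, 3)))
    return pairs


# Toy vectors, invented purely for illustration.
course_table = {
    "algorithm thinking training": [0.9, 0.1, 0.2],
    "data structure teaching": [0.1, 0.9, 0.1],
}
industry_table = {
    "automated task processing": [0.8, 0.2, 0.3],
    "library function proficiency": [0.0, 0.2, 0.9],
}

pairs = find_text_pairs(course_table, industry_table, threshold=0.75)
print(pairs)
```

The threshold is a parameter so that a different value can be supplied for each pair of types, as the text allows.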

[0063] Specifically, in this embodiment of the invention, the matching degree between capability parameters in two matching word lists can be calculated in various ways, such as through a vector similarity calculation model (cosine similarity calculation model) or a preset LLM semantic model. To improve the accuracy of the calculation, in this embodiment of the invention, the matching degree between capability parameters in two matching word lists can be calculated using both the cosine similarity calculation model and the preset LLM semantic model, and then the final matching degree can be determined based on the calculation results of both.

[0064] For example, suppose the target term is "Python programming," and the first matching term list for course types is: algorithm thinking training, data structure teaching, and computational problem-solving ability. The second matching term list for industry types is: script development efficiency, library function proficiency, and automated task processing. Taking the calculation of the matching degree between "algorithm thinking training" in the first matching term list and "script development efficiency" in the second matching term list as an example, the first calculation result can be obtained through the vector similarity calculation model, and the second calculation result can be obtained through the LLM semantic model. If both the first and second calculation results are greater than the matching threshold, the matching degree is determined to be greater than the matching threshold, and the final matching degree is calculated by weighted summation of the first calculation result (e.g., with a weight of 0.7) and the second calculation result (e.g., with a weight of 0.3). If one of the first and second calculation results is greater than the matching threshold and the other is not, the final matching degree can be determined by other methods or by selecting one of the calculation results (e.g., the second calculation result can be used as the final matching degree). If neither the first nor the second calculation result is greater than the matching threshold, the final matching degree is determined to be less than the matching threshold, and the two capability parameters do not match.
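The three cases above can be sketched as a small fusion function. This is a minimal sketch: the function name `fuse_matching_degree` is hypothetical, the weights 0.7 and 0.3 and the fallback to the second (LLM) result follow the examples in the text, and returning the weighted sum in the no-match case is an assumption of this sketch (the sum cannot exceed the threshold when neither input does and the weights sum to 1).

```python
def fuse_matching_degree(vec_score, llm_score, threshold=0.75, w_vec=0.7, w_llm=0.3):
    """Combine the vector-similarity result and the LLM result.

    Returns (final_degree, is_match).
    """
    vec_ok = vec_score > threshold
    llm_ok = llm_score > threshold
    weighted = w_vec * vec_score + w_llm * llm_score

    if vec_ok and llm_ok:
        # Both results exceed the threshold: weighted summation.
        return weighted, True
    if vec_ok or llm_ok:
        # Only one result exceeds the threshold: one option named in the text
        # is to take the second (LLM) result as the final matching degree.
        return llm_score, llm_score > threshold
    # Neither result exceeds the threshold: the pair does not match.
    return weighted, False


print(fuse_matching_degree(0.9, 0.8))
print(fuse_matching_degree(0.9, 0.5))
print(fuse_matching_degree(0.5, 0.4))
```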

[0065] It should be noted that, in this embodiment of the invention, to avoid misjudgment when determining a matching pair, a verification method can be used to re-evaluate when the calculated matching degree is close to the matching threshold. Specifically, a range for the close matching degree can be set. For example, if the matching threshold is 0.7, the range for the close matching degree can be set to 0.6-0.7. When the calculated matching degree falls within this range, a preset LLM verification model can be used for re-evaluation. For some capability parameters, their corresponding weight coefficients can be determined through pre-calculation or pre-setting. In this embodiment of the invention, the corresponding weight coefficients can be added after calculating the matching degree between the capability parameter and other capability parameters to obtain the final matching degree.
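The re-evaluation of borderline matching degrees can be sketched as follows. This is a minimal sketch: `decide_match` and the `verify` callback are hypothetical names, and the callback merely stands in for the preset LLM verification model, which is not specified further in the text.

```python
def decide_match(degree, threshold=0.7, band_low=0.6, verify=None):
    """Decide whether a pair matches, re-checking degrees close to the threshold.

    verify is a stand-in for the LLM verification model: a callable that takes
    the matching degree and returns True when the pair should count as a match.
    """
    if degree > threshold:
        return True
    if band_low <= degree <= threshold and verify is not None:
        # The degree falls in the "close" range, so ask the verifier.
        return bool(verify(degree))
    return False


print(decide_match(0.8))
print(decide_match(0.65, verify=lambda d: True))
print(decide_match(0.5, verify=lambda d: True))
```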

[0066] In this embodiment of the invention, to show the relationship between text pairs in the matching word tables, the display methods of the matching word tables and text pairs can be preset. For example, they can be displayed using a matching chart, which may include the matching word table corresponding to each target type. Text pairs can be marked with preset labels, and different display styles can be set according to the matching degree of the text pairs. Figure 2 is a schematic diagram illustrating the display of a matching chart. In Figure 2, domain A-1 can represent the first target type, and domain B-4 can represent the second target type. The chart list for each target type displays the corresponding matching words. Text pairs are represented by identifiers consisting of a key and a number; for example, in Figure 2, the two capability words corresponding to key1 represent a text pair, and the display color of the identifier can indicate the magnitude of the matching degree of the text pair. For example, different display colors can be set for text pairs with different matching degrees.

[0067] Figure 3 illustrates a target word query operation in this step. The target word can be represented as term K, which is matched against three target types, A, B, and C, to obtain the corresponding capability word tables. The terms included in A, B, and C can represent different capability parameters.

[0068] In this embodiment of the invention, when performing a relational query, a table of capability parameters that match the target words in the target type can be obtained. This allows for accurate identification of the correlation between the target words and capability parameters in the education field and enterprise-related fields. Based on the capability correlations corresponding to the capability parameters, educational content and educational directions that match the needs of various industries can be determined. This enables education to meet the changing needs of the industry in a timely manner and improves the efficiency of matching educational data with industry needs.

[0069] In the following, with reference to the embodiment shown in Figure 1, one method for determining the set of capability parameters in step S102 of Figure 1 is described in detail. As shown in Figure 4, the method includes the following steps.

[0070] S401: Extract the first set of terms from the data to be processed.

[0071] In this embodiment of the invention, the data to be processed can be converted into semantic vectors for subsequent processing.

[0072] The method for extracting terms in this step is not limited. For example, terms can be extracted using a preset type dictionary or the word frequency of each word segment after the data to be processed is segmented, thus obtaining the first term set.

[0073] For different types of data, corresponding terminology dictionaries can be pre-set. These dictionaries can include commonly used terms and concepts relevant to the data type. For example, in the course type, the terminology dictionary for medical courses could include terms such as "myocardial infarction," "antibiotics," and "percutaneous coronary intervention." Therefore, in this step, terms can be extracted using the terminology dictionary. The extraction method is not limited; it can be implemented using methods such as AC automata or similarity calculation. The method for calculating the word frequency of each segment after data segmentation is also not limited; it can be implemented using TF-IDF. After calculating the word frequency of each segment, segments whose frequency reaches a preset threshold can be identified as the terms to be extracted.

[0074] It should be noted that in this embodiment of the invention, multiple methods can be used for term extraction to ensure comprehensiveness, and the extraction results of each method are merged to obtain the first term set. In this embodiment of the invention, term extraction using the AC automaton algorithm can make the extraction process fast and accurate, ensuring that core terms are not missed; term extraction using TF-IDF can avoid missing emerging terms or high-frequency important words in specific documents, effectively mining distinctive words that frequently appear in the current data but are not common in general corpora.
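The merging of dictionary-based and frequency-based extraction can be sketched as follows. This is a minimal sketch under stated assumptions: a plain substring scan stands in for the AC automaton (which would find all dictionary terms in a single pass), the TF-IDF formula uses a smoothed inverse document frequency chosen for this sketch, and the function names, corpus, and threshold are illustrative only.

```python
import math
from collections import Counter


def dictionary_terms(text, dictionary):
    """Dictionary matching; a substring scan stands in for an AC automaton."""
    return {term for term in dictionary if term in text}


def tfidf_terms(doc_tokens, corpus_tokens, threshold):
    """Select tokens whose TF-IDF score in doc_tokens reaches the threshold."""
    counts = Counter(doc_tokens)
    n_docs = len(corpus_tokens)
    selected = set()
    for token, count in counts.items():
        df = sum(1 for doc in corpus_tokens if token in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF (assumed variant)
        score = (count / len(doc_tokens)) * idf
        if score >= threshold:
            selected.add(token)
    return selected


text = "percutaneous coronary intervention with a stent"
medical_dictionary = {"percutaneous coronary intervention", "antibiotics"}

doc = ["stent", "stent", "the", "the"]
corpus = [doc, ["the", "cat"], ["the", "dog"]]

# The first term set is the union of the results of both methods.
first_term_set = dictionary_terms(text, medical_dictionary) | tfidf_terms(doc, corpus, 0.6)
print(sorted(first_term_set))
```

In this toy corpus "the" appears in every document and is filtered out, while "stent" is frequent only in the current document and is kept, which mirrors the stated purpose of the TF-IDF branch.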

[0075] S402: Obtain the position information of each term in the first term set, calculate the statistical value of each term, and filter the second term set from the first term set based on the position information, the statistical value, and the preset dictionary corresponding to the type identifier.

[0076] The terms in the resulting first term set can be further filtered, and the filtering method is not limited. Specifically, in this embodiment of the invention, filtering can be performed from multiple dimensions, for example from one or more of the term's position, its word frequency distribution, and whether it belongs to the domain terminology.

[0077] Specifically, in this embodiment of the invention, the filtering is illustrated using three dimensions: the position of the term, the word frequency distribution, and whether it belongs to a terminology. This step can be specifically executed as follows: For each term in the first term set, calculate the word frequency of the term to determine its first weight; obtain the position information of the term, and determine its second weight based on the text structure corresponding to the position information; match the term with a preset dictionary corresponding to the type identifier, and determine its third weight based on the matching result; and filter the second term set from the first term set based on the first, second, and third weights.

[0078] In this step, the higher the frequency of a term, the more important it is. Therefore, different frequency weight levels can be set for different frequencies, thus deriving the first weight for each term. The calculation method for frequency is not limited. For terms extracted from text, their importance varies depending on their location within the text. In this embodiment, the text can be structurally divided, such as into titles, chapter titles, body text, figure/table descriptions, abstracts, and conclusions. Different structures can be configured with different weights; for example, a weight of 1.5 can be set for titles, 1.3 for figure/table descriptions, and 1.0 for body text. Therefore, in this step, the position of a term in the text can be obtained, the text structure to which that position belongs can be determined, and the corresponding second weight can be derived. This allows terms in different positions to reflect different information densities and importance, highlighting terms that appear in key positions. For data of a certain type, terms belonging to that type usually reflect the corresponding capabilities. Therefore, in this step, a preset terminology dictionary can be obtained for each field. If a term belongs to the preset dictionary, its corresponding third weight can be derived, thereby increasing the importance of the term; if a term does not belong to the preset dictionary, its third weight can be considered to be 0. After the weights corresponding to the above dimensions are obtained, the weight of each term can be calculated by weighted summation. Then, based on the magnitude of the resulting weight, terms can be filtered to obtain the second term set.
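The three-weight scoring described above can be sketched as follows. This is a minimal sketch: the structure weights 1.5, 1.3, and 1.0 come from the text, while the frequency tiers, the dictionary weight of 1.0, the summation coefficients, and all function names are assumptions chosen for illustration.

```python
# Structure weights from the text; other structures would need their own values.
STRUCTURE_WEIGHTS = {"title": 1.5, "figure_table_description": 1.3, "body": 1.0}


def frequency_weight(count):
    """First weight: tiered by word frequency (tier boundaries are assumed)."""
    if count >= 10:
        return 1.5
    if count >= 3:
        return 1.2
    return 1.0


def initial_weight(count, structure, in_dictionary, coeffs=(0.4, 0.3, 0.3)):
    """Weighted sum of the first, second, and third weights."""
    first = frequency_weight(count)
    second = STRUCTURE_WEIGHTS.get(structure, 1.0)
    third = 1.0 if in_dictionary else 0.0  # 0 when the term is not in the dictionary
    return coeffs[0] * first + coeffs[1] * second + coeffs[2] * third


def filter_terms(term_features, threshold):
    """Keep the terms whose combined weight reaches the threshold."""
    return [
        term
        for term, (count, structure, in_dict) in term_features.items()
        if initial_weight(count, structure, in_dict) >= threshold
    ]


features = {
    "data analysis": (5, "title", True),
    "processing": (2, "body", False),
}
print(filter_terms(features, threshold=1.0))
```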

[0079] It should be noted that if some terms appear near more important terms, it indicates that they are strongly related to the more important terms. In this embodiment of the invention, terms can be screened again based on this.

[0080] Specifically, the process of filtering the second term set based on the first weight, second weight, and third weight can be performed as follows: determining the initial weight of each term in the first term set based on the first weight, second weight, and third weight; identifying terms in the first term set whose initial weight is greater than a preset weight threshold as high-weight terms; determining the co-occurrence frequency of each term in the first term set with the high-weight terms, and determining the fourth weight of each term in the first term set based on the co-occurrence frequency; and filtering the second term set from the first term set based on the initial weight and the fourth weight.

[0081] The initial weights can be calculated in any way, such as by weighted summation. After the initial weight of each term in the first term set is obtained, high-weight terms can be selected first. For example, terms with initial weights greater than a preset weight threshold can be identified as high-weight terms. The preset weight threshold can be set according to requirements. Then, the co-occurrence frequency of each term in the first term set with the high-weight terms can be calculated. The calculation method is not limited. For example, the distance between a term and a high-weight term can be calculated. If the distance is no greater than a preset number of characters, it can be determined that the term and the high-weight term co-occur, and the co-occurrence frequency can be calculated. Alternatively, in this embodiment of the invention, a sliding window can be preset. As the sliding window moves, if a term and a high-weight term appear in the same sliding window, it can be determined that the term and the high-weight term co-occur. The size of the sliding window can be a preset number of characters, and the step size can be set according to requirements, such as one character. In this step, different values of co-occurrence frequency can be set to correspond to different weights. Therefore, a fourth weight corresponding to the co-occurrence frequency can be determined for each term in the first term set. Then, the second term set can be selected from the first term set based on the initial weight and the fourth weight.
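The sliding-window variant of the co-occurrence count can be sketched as follows. This is a minimal sketch: the window is measured in characters with a step of one character as in the text, while the function names and the tier boundaries that map the count to a fourth weight are assumptions for illustration.

```python
def cooccurrence_count(text, term, high_term, window, step=1):
    """Count the sliding windows in which both terms appear in full."""
    count = 0
    last_start = max(len(text) - window, 0)
    for start in range(0, last_start + 1, step):
        segment = text[start:start + window]
        if term in segment and high_term in segment:
            count += 1
    return count


def fourth_weight(count):
    """Map the co-occurrence frequency to a weight (tier boundaries are assumed)."""
    if count >= 5:
        return 1.0
    if count >= 1:
        return 0.5
    return 0.0


text = "use pandas for data analysis"
near = cooccurrence_count(text, "pandas", "data", window=26)
far = cooccurrence_count(text, "pandas", "data", window=10)
print(near, fourth_weight(near))
print(far, fourth_weight(far))
```

With a 10-character window the two terms never fit in the same window, so the count is zero; widening the window lets more window positions contain both terms.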

[0082] It should be noted that after obtaining the second set of terms, the terms can be filtered, such as deleting terms that are too short (one character) or too long (possibly a whole sentence). In this embodiment of the invention, refined filtering of terms is achieved through comprehensive judgment of dimensions, and the collected terms are comprehensively evaluated to avoid judgment bias caused by a single dimension.

[0083] S403: Obtain the source and text segment corresponding to each term in the second term set, determine the corresponding preset type indicators based on the source and text segment, generate structured data for each term in the second term set, and derive the capability parameter set for each preset type indicator.

[0084] For each term in the second term set, its corresponding source and the text segment (context fragment) can be obtained to reflect its relationship with each type and preset type indicator. This can generate structured data, thereby realizing the relationship between the established terms and each type and preset type indicator. By determining each term as a capability parameter, the capability parameter set of each preset type indicator can be obtained.

[0085] In this embodiment of the invention, the structured data can be specifically in JSON format, and may include terms, sources, the text segment to which they belong, and the final weight value (such as the value calculated by weighted summation based on the fourth weight and the initial weight).
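A structured record of this kind can be sketched as follows. This is a minimal sketch: the field names, the mixing coefficient `alpha`, and the sample values are assumptions for illustration; the text only states that the record may contain the term, the source, the text segment, and a final weight obtained by weighted summation.

```python
import json


def build_record(term, source, segment, initial_weight, fourth_weight, alpha=0.7):
    """Serialize one term as JSON; alpha is an assumed mixing coefficient."""
    final_weight = alpha * initial_weight + (1 - alpha) * fourth_weight
    record = {
        "term": term,
        "source": source,
        "segment": segment,
        "final_weight": round(final_weight, 3),
    }
    # ensure_ascii=False keeps non-ASCII terms readable in the output.
    return json.dumps(record, ensure_ascii=False)


record_json = build_record(
    term="data analysis",
    source="course package",
    segment="This chapter introduces data analysis with worked examples.",
    initial_weight=1.23,
    fourth_weight=0.5,
)
print(record_json)
```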

[0086] It should be noted that in the embodiments of the present invention, the above-mentioned process of processing the entries can be implemented based on NLP, and the process of extracting entries is implemented based on rule-driven NLP. After extracting the entries, they can also be corrected and merged, which can be specifically implemented through an LLM semantic model based on cognitive enhancement to perform semantic calibration and normalization.

[0087] In another implementation manner, after obtaining the second set of entries in the embodiments of the present invention, semantic legality verification can be performed, such as correcting some traditional Chinese characters and non-standardized terms, and performing disambiguation processing on ambiguous terms. This can be specifically implemented through a preset LLM model. For example, "心臟" can be corrected to "心脏", and for polysemous terms (such as "苹果"), they can be corrected to related terms (such as "苹果水果") in combination with the structured data in this paragraph.

[0088] In the above correction process, a score can be given each time after the corrected entries are output. If the score is relatively low, manual review can be prompted. For example, when the score is lower than 0.7, the entry can be marked as pending manual review.

[0089] Since there may be synonyms, near-synonyms, and polysemous terms, etc. in the second set of entries, in the embodiments of the present invention, entries with the same or extremely similar semantics can also be merged to form a unified concept, laying a foundation for constructing a high-quality ability association relationship.

[0090] In another implementation manner, in the embodiments of the present invention, entries can be merged from multiple dimensions. For example, an enhanced decision tree can be established based on LLM, and the entries can be analyzed and judged in turn through the enhanced decision tree, thereby realizing the merging of entries.

[0091] Specifically, in the embodiments of the present invention, it can be determined whether two entries are homologous (such as "计算机" and "计算") by analyzing the roots, affixes, etc. of the entries, and then determine whether they need to be merged; it can also be combined with the text segment to which the entry belongs, and it can be judged by a preset LLM whether the two entries refer to the same concept in a specific context, and then determine whether they need to be merged; corresponding ontology libraries (such as the medical ontology SNOMED CT) can be established in advance in each field, and this step can also determine whether they need to be merged by whether the two entries are mapped to the same standard concept ID in the ontology library; the entries can also be converted into vectors and then subjected to clustering analysis to be used as an auxiliary judgment for whether to merge. For example, the vectors of the entries generated by SBERT can be used and projected into a 2D space for visual clustering analysis; expert rules can be preset to determine whether to merge through the expert rules.

[0092] It should be noted that in the above-mentioned correction and merging process, LLM is integrated into each process as a "super inference engine". Specifically, a corresponding Prompt can be designed for each LLM to perform the above judgment (e.g., "Based on the following context, determine whether 'heart failure' and 'heart strain' refer to the same disease? Context: {belonging text segment}") and give the reason for the judgment.

[0093] For the processing of the aforementioned terms, this embodiment of the invention can generate a correction and traceability system. This system can be used to record, manage, and visualize the entire term lifecycle (from extraction to correction and merging). Specifically, it can be implemented using a knowledge graph in a graph database. Entities in the graph represent different versions of the processed terms, the algorithms used during processing (e.g., NLP, LLM), operators (e.g., systems, experts), time information, etc., and edges in the graph represent processing actions such as correction and merging.

[0094] For example, the entities in the graph can include an original term, a candidate term, and a standard term A. The original term is connected to the candidate term, and the candidate term is connected to the standard term A. The original term represents the extracted term, and its corresponding attributes may include: extracted by NLP {timestamp: t1, confidence: 0.65}; the candidate term represents the original term after correction, and its corresponding attributes may include: corrected by LLM {timestamp: t2, Prompt: "...", before correction: "心臟", after correction: "心脏"}; the standard term A represents the term obtained after the candidate terms have been merged, and its corresponding attributes may include: merged {operation: LLM decision tree, basis: "five-layer validation passed", responsible party: system}.

[0095] If manual processing is involved during term processing, it can also be recorded. For example, when a business person objects to or rejects a processing result, this action can be recorded in the graph. Graph entities can include an expert user, and the connection from the expert user to the connected entity represents the manual operation, for example the manual rejection of a term merge. The attributes of this connection can include: [manual rejection of merging {timestamp: t3, reason: "There are slight differences between the two in the domain"}]->(merging operation).
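The traceability graph described above can be sketched with an in-memory structure. This is a minimal sketch: the text suggests a knowledge graph in a graph database, whereas the `ProvenanceGraph` class, its method names, and the action labels used here are hypothetical stand-ins chosen to keep the example self-contained.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceGraph:
    nodes: dict = field(default_factory=dict)  # node id -> attributes
    edges: list = field(default_factory=list)  # (src, action, dst, attributes)

    def add_node(self, node_id, **attrs):
        self.nodes[node_id] = attrs

    def add_edge(self, src, action, dst, **attrs):
        self.edges.append((src, action, dst, attrs))

    def trace(self, node_id):
        """Walk back along correction/merge edges to the original term."""
        path = [node_id]
        current = node_id
        while True:
            incoming = [
                e for e in self.edges
                if e[2] == current and e[1] in ("corrected", "merged")
            ]
            # Stop at the origin, or if an edge would revisit a node.
            if not incoming or incoming[0][0] in path:
                return path[::-1]
            current = incoming[0][0]
            path.append(current)


g = ProvenanceGraph()
g.add_node("original term", extracted_by="NLP", timestamp="t1", confidence=0.65)
g.add_node("candidate term")
g.add_node("standard term A")
g.add_node("expert user")
g.add_edge("original term", "corrected", "candidate term", by="LLM", timestamp="t2")
g.add_edge("candidate term", "merged", "standard term A",
           operation="LLM decision tree", responsible="system")
g.add_edge("expert user", "rejected merge", "standard term A", timestamp="t3",
           reason="There are slight differences between the two in the domain")

print(g.trace("standard term A"))
```

Manual actions are stored as ordinary edges with their own attributes, so they remain visible for auditing without affecting the lineage returned by `trace`.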

[0096] It should be noted that, through the above-mentioned graph in the embodiments of the present invention, the processing history of any term can be clearly traced, showing which original term it evolved from and through which processing steps. This is crucial for auditing, problem investigation, and model iteration. Through the correction and traceability system, the original "black box" processing flow can be transformed into a transparent, auditable white-box process that allows intervention, greatly enhancing the credibility and maintainability of the system in various fields of application.

[0097] In this embodiment of the invention, the merging process of terms can first be carried out by quickly filtering the candidate set based on the similarity between term vectors. For example, corresponding vectors are generated for each term in the second term set and the similarity between them is calculated. The term groups with a similarity greater than 0.6 are determined as the candidate set. Then, a secondary verification process can be carried out through the above-mentioned LLM-based merging process to reduce erroneous merging (such as the different meanings of "virus" in the medical and computer fields). The LLM-based merging process can use multi-level semantic verification mechanisms, such as word roots, context awareness, and the combination of rules and LLM.
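The two-stage merging flow can be sketched as follows. This is a minimal sketch: the 0.6 similarity threshold comes from the text, the toy vectors are invented for illustration, and the `verify` callback is a hypothetical stand-in for the LLM-based secondary verification.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_candidates(vectors, threshold=0.6):
    """Stage 1: fast filter that keeps term pairs with similarity above the threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(vectors), 2)
        if cosine(vectors[a], vectors[b]) > threshold
    ]


def confirm_merges(candidates, verify):
    """Stage 2: keep only the pairs accepted by the secondary verification."""
    return [pair for pair in candidates if verify(*pair)]


# Toy vectors, invented purely for illustration.
vectors = {
    "computer virus": [1.0, 0.1],
    "virus": [0.9, 0.3],
    "antibiotics": [0.0, 1.0],
}

candidates = merge_candidates(vectors)
# Stand-in verifier that rejects the pair, as an LLM might for a cross-domain term.
confirmed = confirm_merges(candidates, verify=lambda a, b: False)
print(candidates, confirmed)
```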

[0098] In order to evaluate the resulting terms, a multi-dimensional evaluation system can be set up in this embodiment of the invention to conduct quantitative evaluation from three core dimensions: term granularity, semantic quality, and system stability.

[0099] At the term granularity level, the rationality of term granularity can be measured by evaluating whether the extracted terms have reached an "atomic" state, i.e., whether a term represents an indivisible minimum independent concept. Specifically, this can be assessed using the Term Decomposition Degree (TDD) and the Conceptual Independence Index (CII). TDD measures the degree to which a term can be further decomposed into multiple sub-concepts, and is indirectly reflected by the number of direct subordinate terms of the corresponding entity in a preset knowledge graph. The preset knowledge graph for each type can be generated in advance, and the generation method is not limited. When TDD is approximately 0, the term has almost no finer-grained division in the knowledge graph and is a good candidate for an "atomic" term (e.g., "myocardial cell"). When TDD is much greater than 0, the term is a complex concept and may require further decomposition (e.g., the subordinate terms of "cardiovascular disease" include "coronary heart disease" and "hypertension," so it has a high TDD). Ideally, a term should correspond to a low TDD value. CII measures the semantic independence and clarity of a term when it is taken out of context. It is determined by calculating the mean cosine similarity (MCI) between the term's vector and the vectors of all contexts containing that term in a pre-built corpus. The method of building this corpus is not limited. A CII close to 1 indicates that the term's meaning is very stable across different contexts and has high independence (e.g., "Pythagorean theorem"). A CII close to 0 indicates that the term is highly context-dependent and its meaning is ambiguous when taken out of context (e.g., "processing"). During the evaluation process, terms with high CII values can be output preferentially to identify high-quality terms.
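The two granularity metrics can be sketched as follows. This is a minimal sketch: the knowledge graph is represented as a plain mapping from a term to its direct subordinate terms, the vectors are toy values, and the function names are hypothetical; how the knowledge graph and the context vectors are produced is left open, as in the text.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def term_decomposition_degree(term, knowledge_graph):
    """TDD: number of direct subordinate terms of the term in the knowledge graph."""
    return len(knowledge_graph.get(term, ()))


def conceptual_independence_index(term_vector, context_vectors):
    """CII: mean cosine similarity between the term vector and its context vectors."""
    if not context_vectors:
        return 0.0
    return sum(cosine(term_vector, c) for c in context_vectors) / len(context_vectors)


knowledge_graph = {
    "cardiovascular disease": ["coronary heart disease", "hypertension"],
}

print(term_decomposition_degree("myocardial cell", knowledge_graph))
print(term_decomposition_degree("cardiovascular disease", knowledge_graph))
print(conceptual_independence_index([1.0, 0.0], [[1.0, 0.0], [2.0, 0.0]]))
```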

[0100] In terms of semantic quality, the accuracy of merging semantically similar terms can be evaluated, i.e., the accuracy of the term merging result can be measured. This can be achieved using an improved BERTScore-DF (Domain-adapted F1) metric. For example, the context vector of a term can be generated using the SBERT model, and then the similarity between the term and each term in a pre-defined term library of the corresponding type can be calculated. When calculating similarity, a strategy based on bidirectional optimal matching (such as the Hungarian algorithm) is adopted to better handle complex merging scenarios involving one-to-many or many-to-one relationships. The pre-defined term library can be pre-generated and may include historical terms of the corresponding type. When it is determined that a collected term is similar to a term in the pre-defined term library, a term from the pre-defined term library can be used to replace the extracted term to improve the accuracy of the term merging.
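As a hedged sketch of the optimal matching step described above, the following code finds the one-to-one assignment between extracted terms and library terms that maximizes total similarity. For brevity it enumerates permutations instead of running the Hungarian algorithm itself, which reaches the same optimum in polynomial time. The similarity values, the 0.8 replacement threshold, and all names are illustrative assumptions; in practice the matrix would come from SBERT embeddings.

```python
from itertools import permutations

def best_assignment(sim):
    """Return (columns, score) for the assignment maximizing total similarity.

    sim[i][j] is the similarity between extracted term i and library term j.
    Requires len(sim) <= len(sim[0]). Exhaustive search: small inputs only.
    """
    n_rows, n_cols = len(sim), len(sim[0])
    best, best_score = None, float("-inf")
    for cols in permutations(range(n_cols), n_rows):
        score = sum(sim[i][j] for i, j in enumerate(cols))
        if score > best_score:
            best, best_score = cols, score
    return list(best), best_score

extracted = ["heart attack", "high blood pressure"]
library = ["hypertension", "myocardial infarction", "arrhythmia"]
sim = [
    [0.20, 0.91, 0.35],  # "heart attack" vs. each library term (made-up values)
    [0.88, 0.15, 0.30],  # "high blood pressure" vs. each library term
]

assignment, total = best_assignment(sim)

THRESHOLD = 0.8  # assumed cut-off for treating two terms as similar
merged = [
    library[j] if sim[i][j] >= THRESHOLD else extracted[i]
    for i, j in enumerate(assignment)
]
```

Terms whose best match falls below the threshold are left unchanged, so a near-synonym that domain rules say must not be merged can be protected by keeping its similarity score, or a rule-based veto, below the cut-off.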

[0101] In terms of system stability, the system's stability in the face of noisy data and non-standard input can be tested, i.e., its resistance to interference can be evaluated. Specifically, a test set can be constructed consisting of 90% normal data and 10% semantic interference data. Interference data can include spelling errors or typos (e.g., misspelled forms of "neural network" or "heart disease"), colloquial or informal expressions (e.g., "computer broken" for "computer malfunction"), near-synonyms (terms with similar meanings that should not be merged according to domain rules, such as "treatment" and "therapy" in medicine), cross-category ambiguous words (e.g., "apple"), and non-compliant abbreviations (e.g., an abbreviated form that should be merged into "myocardial infarction"). The system can process the test set so that its performance can be observed. If the system corrects most spelling and colloquial errors, resists the misleading influence of near-synonyms and ambiguous words without producing incorrect merges, and correctly identifies and merges compliant abbreviations, its performance can be considered to meet the requirements. In this embodiment of the invention, the robustness of the system can also be evaluated quantitatively by comparing the decrease in various indicators (such as merge accuracy and F1 score) between the clean test set (the dataset used in normal data processing) and the adversarial test set (the test set constructed for system evaluation). Generally, the smaller the decrease, the more robust the system.
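The robustness comparison above can be illustrated with a short sketch. It assumes that metrics have already been measured on both test sets; the helper names and the metric values are hypothetical placeholders, not measured results.

```python
def mix_test_set(normal, interference, total=100, interference_share=0.10):
    """Build a test set with the given share of interference samples."""
    n_noise = round(total * interference_share)
    return normal[: total - n_noise] + interference[:n_noise]

def metric_drop(clean, adversarial):
    """Absolute drop of each shared metric from the clean to the adversarial set."""
    return {name: clean[name] - adversarial[name]
            for name in clean if name in adversarial}

# Placeholder samples: 90% normal data, 10% semantic interference data.
normal = [f"normal-{i}" for i in range(200)]
interference = [f"noisy-{i}" for i in range(50)]
test_set = mix_test_set(normal, interference)

# Hypothetical metric values, for illustration only.
clean_metrics = {"merge_accuracy": 0.95, "f1": 0.92}
adversarial_metrics = {"merge_accuracy": 0.90, "f1": 0.85}
drops = metric_drop(clean_metrics, adversarial_metrics)
```

A smaller value in `drops` indicates a more robust system; the acceptable decrease is a design choice that the description leaves open.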

[0102] Figure 5 is a schematic representation of the term processing procedure in an embodiment of the present invention. As shown in Figure 5, the data to be processed takes Class A and Class K documents as examples. For each type of document, corresponding feature categories and known keywords can be preset, for example in the form of a preset terminology dictionary. Terms can be extracted from the data to be processed using an NLP-based term recognition engine, that is, by carrying out steps S401-S404 of this embodiment of the invention, to obtain a term set. Term correction can then be performed based on an LLM to obtain a corrected term set. The terms are then merged (reorganized) by combining vector matching and LLM semantic matching. In this embodiment of the invention, term extraction is performed with multi-dimensional term evaluation, and a multi-path backtracking mechanism is provided for terms.

[0103] It should be noted that the data processing principle in this embodiment of the present invention is the same as that of the embodiment shown in Figure 1 and will not be repeated here.

[0104] The following, with reference to the embodiment shown in Figure 1, describes another method for determining the set of capability parameters in step S102 of Figure 1. Specifically, the description takes data to be processed that is associated with the course type as an example. As shown in Figure 6, the method includes the following steps.

[0105] S601: Obtain the knowledge graph corresponding to the type identifier, identify the data to be processed based on the knowledge graph, and determine the entity text and entity label corresponding to each preset type indicator.

[0106] The knowledge graph is pre-built. Entities in the knowledge graph can be used to identify the entity text included in the data to be processed and to add entity tags to that entity text. An entity tag indicates that the entity text is a knowledge point and can reflect its connection relationships with other entity texts.

[0107] It should be noted that in this embodiment of the invention, a corresponding knowledge graph can be generated for each type. Entities in the knowledge graph can include terms and knowledge points, and connections can represent relationships such as inclusion and parallelism between entities. Entities corresponding to each type can be divided based on their associated preset type indicators, so each entity in the generated knowledge graph can correspond to a preset type indicator. Therefore, in this step, the entity text and entity tags corresponding to each preset type indicator can be determined in the data to be processed. Taking the course type as an example, a knowledge graph corresponding to the course type is constructed based on the knowledge points included in each course. The relationships between knowledge points are identified through connections between entities. For example, entities can include the Pythagorean theorem, World War II, etc. The knowledge graph can also include a knowledge base. The entity text of the data to be processed is identified and labeled through the knowledge points in the knowledge base, and the labels can be determined as tags of entity text. In this embodiment of the invention, the knowledge graph is used to identify the data to be processed, and it can also correct and verify ambiguous knowledge points in the data to be processed.

[0108] In this embodiment of the invention, the data to be processed is typically converted into semantic vectors of text before the processing in this step is performed. Non-text data in the data to be processed can also be parsed using corresponding parsing tools and then identified based on the knowledge graph to obtain the corresponding entity text and entity tags. Non-text data may include mathematical formulas, chemical formulas, etc. The parsing tools can be preset; for example, the parsing tool for a mathematical formula may include a LaTeX parser.

[0109] S602: Obtain the capability target level and knowledge dictionary corresponding to the preset type indicators, and determine the capability target of the entity text in combination with the entity tags.

[0110] The competency objective levels represent the required competencies for each knowledge point, and may include different levels such as understanding (memorization only) and mastery (ability to deduce and apply). The knowledge dictionary is pre-built and can include all knowledge points corresponding to the preset type of indicators. Combining the two yields the competency objectives corresponding to each knowledge point in the knowledge dictionary, thereby determining the competency objectives of the entity text. Entity tags can be used for semantic understanding of the entity text to accurately determine the competency objectives of each entity text.

[0111] Specifically, this step can be implemented using LLM. For course-type data, Curriculum-Aligned LLM can be used. Through LLM processing in this step, a semantic vector including competency objectives can be generated; that is, competency objectives can be added as tags to the corresponding entity text. The semantic vector output in this step can be a high-dimensional, numerical semantic vector that not only encapsulates the literal meaning of the text but also embeds its deep semantics within the curriculum system (such as entity tags) and objectives (competency objectives).

[0112] It should be noted that in this embodiment of the invention, a knowledge dictionary is pre-generated based on Curriculum-Aligned LLM, and a contrastive learning method is used to generate vector representations of course knowledge points. The knowledge dictionary can be dynamically updated, and can include different representations of the same knowledge point. During Curriculum-Aligned LLM training, texts describing the same knowledge point but with different expressions can be used as positive samples, and texts corresponding to different knowledge points can be used as negative samples. Training is performed using these samples, ensuring that the vectors generated by the model satisfy the following in space: semantically similar knowledge point vectors are close together, and semantically different vectors are far apart. When the knowledge points and / or competency requirements of the course are updated (e.g., adding the "AI ethics" knowledge point), incremental training of the model may be triggered. For example, data related to the new knowledge point can be collected, and the Curriculum-Aligned LLM can be quickly fine-tuned using PyTorch model services, thereby dynamically updating the knowledge dictionary. In this step, the target requirements of textbooks included in each preset type of indicator can be identified based on the knowledge dictionary, thereby determining the competency targets corresponding to each textbook, and ultimately deriving the competency targets of each knowledge point in the knowledge dictionary. The objectives and requirements of a textbook may include texts such as the textbook's syllabus that describe the knowledge points and required skills.
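The contrastive training objective described above can be sketched with an InfoNCE-style loss. The description does not name a specific loss function, so InfoNCE, the temperature value, and the toy two-dimensional vectors are all assumptions made for illustration; a real system would compute this over model embeddings inside a training framework.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: low when the anchor is close to the positive sample
    and far from the negative samples."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Toy vectors: two phrasings of the same knowledge point and one unrelated point.
phrasing_a = [1.0, 0.0]
phrasing_b = [0.9, 0.1]   # same knowledge point, different expression (positive)
other_point = [0.0, 1.0]  # different knowledge point (negative)

loss_aligned = info_nce(phrasing_a, phrasing_b, [other_point])
loss_misaligned = info_nce(phrasing_a, other_point, [phrasing_b])
```

Minimizing such a loss pulls vectors of the same knowledge point together and pushes vectors of different knowledge points apart, which is the spatial property the paragraph above requires.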

[0113] In this embodiment of the invention, the knowledge graph can provide an interpretable and structured "skeleton" for parsing the data to be processed, ensuring the basic accuracy of the knowledge points, and then endowing it with "flesh and soul" through capability objectives, that is, strengthening the deep understanding of the capability objectives of the data to be processed. Thus, the combination of the two can jointly constitute a semantic vector graph with capability objectives.

[0114] S603: Generate structured data of entity text based on capability objectives and entity labels to obtain a set of capability parameters for each preset type of indicator.

[0115] After obtaining the labels of the entity text, they can be identified as capability parameters. Based on the entity text, capability objectives, and entity labels, corresponding structured data is generated, thus obtaining the capability parameter set for each preset type of indicator.

[0116] It should be noted that, to enhance semantic understanding of entity text, the source and corresponding text segment of each entity text can be obtained to generate corresponding structured data by combining entity text, capability objectives, and entity tags. Since the semantic vectors generated from data of different modalities will differ, the knowledge text corresponding to the same knowledge entity may also differ. Therefore, in this embodiment of the invention, an attention mechanism can be used to unify the vectors corresponding to the same knowledge entity.
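One possible reading of the attention-based unification above is a softmax-weighted pooling of the modality-specific vectors of one knowledge entity. The sketch below assumes scaled dot-product scoring against a query vector; the choice of query, the function name `attention_pool`, and the sample vectors are hypothetical, since the description does not fix the attention design.

```python
import math

def attention_pool(vectors, query):
    """Fuse several vectors for one entity into a single vector.

    Each vector is scored by its scaled dot product with the query; the scores
    are turned into weights with a softmax, and the weighted sum is returned.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, v)) / scale for v in vectors]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    fused = [sum(w * v[d] for w, v in zip(weights, vectors))
             for d in range(len(query))]
    return fused, weights

# Vectors for the same knowledge entity derived from different modalities.
text_vec = [0.9, 0.1]
formula_vec = [0.7, 0.3]
query = [1.0, 0.0]  # assumed reference direction, e.g. the knowledge-dictionary vector

fused, weights = attention_pool([text_vec, formula_vec], query)
```

The vector that agrees more with the query receives the larger weight, so the fused representation stays close to the reference while still reflecting every modality.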

[0117] In another implementation, for data in the field of education, such as data corresponding to course types, an education priority weight matrix can be pre-set in this embodiment of the invention, which includes the weight coefficients of each ability objective. Then, the weight coefficients of the ability objective tags corresponding to each entity text can be determined based on the education priority weight matrix.

[0118] In this embodiment of the invention, for each entity text, a corresponding interpretability tuple can be stored. This interpretability tuple includes the structured data corresponding to the entity text, the corresponding original text segment, and a processing log (recording the processing process of the entity text), as well as the similarity degree between the entity text and the corresponding ability level in a preset teaching standard (such as curriculum standards or ability level descriptions) (specifically calculated using a preset LLM calculation method). For entity texts with low similarity (e.g., less than 0.7), they can be displayed through a web interface for manual verification by relevant personnel, who can then accept or correct them. This manual verification can serve as positive samples for periodic incremental fine-tuning of the Curriculum-Aligned LLM and the fusion model, thereby achieving continuous model optimization.
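The interpretability tuple and the manual-review routing above can be sketched as a small data structure. The field names, the sample record, and the helper `needs_manual_review` are illustrative assumptions; only the 0.7 threshold is taken from the example in the description.

```python
from dataclasses import dataclass, field

REVIEW_THRESHOLD = 0.7  # example threshold given in the description

@dataclass
class InterpretabilityTuple:
    """Record stored per entity text so that results can be traced and audited."""
    entity_text: str
    structured_data: dict
    source_segment: str
    similarity: float  # similarity to the matching level in the teaching standard
    processing_log: list = field(default_factory=list)

def needs_manual_review(record, threshold=REVIEW_THRESHOLD):
    """Low-similarity records are routed to a human reviewer."""
    return record.similarity < threshold

records = [
    InterpretabilityTuple(
        entity_text="Pythagorean theorem",
        structured_data={"capability_objective": "mastery"},
        source_segment="Students apply the Pythagorean theorem to ...",
        similarity=0.92,
        processing_log=["entity recognized", "objective assigned"],
    ),
    InterpretabilityTuple(
        entity_text="processing",
        structured_data={"capability_objective": "understanding"},
        source_segment="... during processing ...",
        similarity=0.55,
    ),
]

review_queue = [r.entity_text for r in records if needs_manual_review(r)]
```

Records accepted or corrected by reviewers could then be collected as samples for the periodic incremental fine-tuning mentioned above.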

[0119] In this embodiment of the invention, structured data can be stored, and a corresponding data index can be established during data storage. During data storage, data can be categorized into hot data and cold data based on the frequency of querying. Taking structured data corresponding to course types as an example, for hot data (such as core knowledge points of each chapter, text on popular professional skills), due to its high access frequency, GPU-accelerated Faiss indexes can be used to leverage the parallel computing power of GPUs to respond to tens of thousands of concurrent retrieval requests within milliseconds, ensuring the real-time requirements of data queries. For cold data (such as course materials from past semesters, archived student assignments), due to its low access frequency, course tree-structured encoding compression storage can be adopted. For example, based on the structure of the course knowledge system (such as subject-grade-chapter-knowledge point), a unique tree-structured code can be generated for each knowledge point, and then Huffman coding can be used to further compress these structured encoding sequences, thereby significantly reducing storage space usage while maintaining a good data organization structure, facilitating batch retrieval and historical analysis by course system. This hybrid indexing approach balances performance and cost in data storage.
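The cold-data path above, tree-structured codes followed by Huffman coding, can be sketched with the standard library alone. The code format `subject-grade-chapter-point` follows the example hierarchy in the description, but the separator, the character-level Huffman alphabet, and all function names are assumptions; the GPU-accelerated Faiss path for hot data is not shown.

```python
import heapq
from collections import Counter

def tree_code(subject, grade, chapter, point):
    """Tree-structured code following the subject-grade-chapter-knowledge point hierarchy."""
    return f"{subject}-{grade}-{chapter}-{point}"

def huffman_table(symbols):
    """Build a prefix-free bit-string table from symbol frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:  # degenerate case: a single distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (count, tie-breaker, partial table); the unique tie-breaker
    # ensures the dictionaries are never compared.
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, t1 = heapq.heappop(heap)
        c2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in t1.items()}
        merged.update({s: "1" + b for s, b in t2.items()})
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]

def encode(symbols, table):
    return "".join(table[s] for s in symbols)

def decode(bits, table):
    reverse = {b: s for s, b in table.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in reverse:  # prefix-free codes make the first hit unambiguous
            out.append(reverse[buf])
            buf = ""
    return "".join(out)

codes = [tree_code("MATH", 8, 3, 1), tree_code("MATH", 8, 3, 2), tree_code("MATH", 9, 1, 1)]
text = ";".join(codes)
table = huffman_table(text)
bits = encode(text, table)
restored = decode(bits, table)
```

Because sibling knowledge points share long prefixes, the code sequence is highly repetitive and compresses well, while the decoded codes still expose the course hierarchy for batch retrieval.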

[0120] In this embodiment of the invention, to provide data support for teaching analysis, the semantics of the structured data can be used to determine corresponding teaching event nodes. These nodes may include points before or after the study of a chapter, after a monthly exam, or after comprehensive practical training. A teaching event node represents the learning time node of each piece of structured data, from which the corresponding teaching stage is determined. When the structured data is stored, vector snapshots can be triggered based on these teaching event nodes. A snapshot stores the feature vectors corresponding to the same preset type indicator, and may include a timestamp, the teaching stage (determined based on the teaching event node), etc. This generates a time-ordered vector sequence for each student or course.
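The event-triggered snapshots described above can be sketched as a small in-memory store. The class and field names, the numeric timestamps, and the stage labels are hypothetical; a real deployment would persist the snapshots in the vector storage layer.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class VectorSnapshot:
    subject_id: str      # student or course identifier
    timestamp: float     # time of the teaching event
    teaching_stage: str  # stage derived from the teaching event node
    vector: tuple        # feature vector for one preset type indicator

class SnapshotStore:
    """Keeps one time-ordered snapshot sequence per student or course."""

    def __init__(self):
        self._by_subject = defaultdict(list)

    def on_teaching_event(self, subject_id, timestamp, teaching_stage, vector):
        """Called when a teaching event node is reached; records a snapshot."""
        snap = VectorSnapshot(subject_id, timestamp, teaching_stage, tuple(vector))
        self._by_subject[subject_id].append(snap)
        return snap

    def sequence(self, subject_id):
        """Return the snapshots for one subject sorted by time."""
        return sorted(self._by_subject.get(subject_id, []), key=lambda s: s.timestamp)

store = SnapshotStore()
store.on_teaching_event("student-1", 30.0, "after monthly exam", [0.6, 0.4])
store.on_teaching_event("student-1", 10.0, "before chapter 3", [0.3, 0.2])
store.on_teaching_event("student-1", 20.0, "after chapter 3", [0.5, 0.3])
stages = [s.teaching_stage for s in store.sequence("student-1")]
```

Sorting on read keeps the sketch short; events that can arrive out of order are one reason the timestamp is stored with each snapshot.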

[0121] This vector sequence can reflect the dynamic evolution of students' learning abilities over time, capturing and recording the trajectory of student or class ability at different learning stages. It can also quantify the rate of students' knowledge acquisition and the slope of their skill improvement, and identify different preset learning patterns such as steady progress, plateauing, or declining performance. Furthermore, by combining historical vector sequences, time series prediction models (such as LSTM) can be used to predict students' future ability development, so that the next learning content most likely to lead to success can be recommended and students' learning paths can be predicted. By using the vector sequences to compare students before and after a learning intervention, the actual effectiveness of the intervention can be evaluated quantitatively.
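As a simple stand-in for the trend analysis above, the sketch below fits a least-squares slope to a time-ordered sequence of mastery scores and maps it to one of the named learning patterns. It assumes each vector snapshot has already been reduced to a single score; the tolerance value and function names are illustrative, and the LSTM-based prediction mentioned in the description is not shown.

```python
def slope(values):
    """Least-squares slope of values observed at equally spaced time steps."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

def learning_pattern(values, tolerance=0.01):
    """Classify a score sequence by the sign of its slope (tolerance is assumed)."""
    s = slope(values)
    if s > tolerance:
        return "steady progress"
    if s < -tolerance:
        return "declining"
    return "plateau"

improving = learning_pattern([0.4, 0.5, 0.6, 0.7])
flat = learning_pattern([0.6, 0.6, 0.6])
falling = learning_pattern([0.8, 0.7, 0.5])
```

Comparing the slope of the sequence before and after an intervention gives one simple, quantitative view of the intervention's effect.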

[0122] It should be noted that in the field of course administration, much of the data is student-related (such as answer records and video viewing time). For such highly sensitive personal data, this embodiment of the invention can perform data cleaning or privacy protection. Specifically, before data storage, calibrated random noise can be added to the student-related data for privacy protection. The addition of noise can follow the mathematical definition of differential privacy, ensuring that query results are not significantly altered by the presence or absence of any single student's data. As a result, analytical conclusions obtained from the database (such as a class average mastery level of 70%) remain valid, while no individual student's original data can be deduced from the query results. In this embodiment of the invention, the noise intensity is controlled by preset parameters (such as the privacy budget), allowing a deliberate trade-off between data availability and the strength of privacy protection.
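The noise addition above can be illustrated with the Laplace mechanism, a standard way to satisfy differential privacy for numeric queries. The description does not state which mechanism is used, so the Laplace mechanism, the clipping bounds, the replace-one-record notion of neighboring datasets, and the function names are assumptions. The sketch perturbs an aggregate query result rather than each stored record.

```python
import random

def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_mean(values, lower, upper, epsilon, rng=random):
    """Differentially private mean of values clipped to [lower, upper].

    With a fixed number of records, replacing one record changes the mean by at
    most (upper - lower) / n, so the noise scale is that sensitivity / epsilon.
    """
    clipped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / len(clipped)
    true_mean = sum(clipped) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)

mastery = [0.62, 0.71, 0.80, 0.67]  # hypothetical per-student mastery levels
rng = random.Random(0)              # seeded only to make the sketch repeatable

noisy_average = private_mean(mastery, 0.0, 1.0, epsilon=1.0, rng=rng)
almost_exact = private_mean(mastery, 0.0, 1.0, epsilon=1e9, rng=rng)
```

A smaller `epsilon` (privacy budget) means more noise and stronger protection; a larger one keeps the result closer to the true average, which is the trade-off the paragraph above describes.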

[0123] It should be noted that the data processing principle in this embodiment of the present invention is the same as that of the embodiment shown in Figure 1 and will not be repeated here.

[0124] Figure 7 is an architectural schematic of a data processing system according to an embodiment of the present invention. As shown in Figure 7, in this embodiment of the invention, a data collection engine can extract heterogeneous data to be processed from multiple data sources, which can be categorized into internal data (such as educational data), external data (such as enterprise-related data), and standard data (such as specification documents). The heterogeneous data to be processed can be initially parsed and stored in a data pool. Data cleaning can then be performed, such as converting the data into semantic vectors of the corresponding text, performing preliminary semantic understanding, adding semantic tags, and deleting meaningless and sensitive data. Furthermore, the type identifier and preset type indicators of each piece of data to be processed can be identified, and the data can be stored according to a preset business strategy to obtain a formatted dataset. A semantic understanding engine can then process the formatted dataset using the data processing method of the embodiments shown in Figure 4 and / or Figure 6 to obtain a set of capability parameters associating each type identifier with each preset type indicator. The capability parameter set can be combined with historical capability parameters to generate capability association relationships; as shown in Figure 7, a corresponding capability parameter map is generated for each preset type indicator. These capability parameter maps can be matched using a capability cross-matching engine to establish relationships between them, which also reflects the relationships between the preset type indicators, such as the relationships between professions, positions, and industries shown in Figure 7. Based on this, the data processing system can provide data to various applications through the data application interface layer to facilitate target keyword queries. In this embodiment of the invention, the data processing system can establish associations between the various types and preset type indicators, support target keyword queries, derive capability parameters related to target keywords, analyze industry needs, and determine educational directions and content.

[0125] Figure 8 illustrates one possible data processing hierarchy during data access in an embodiment of the present invention. As shown in Figure 8, the hierarchy can include a data awareness layer, a semantic parsing layer, a dynamic merging layer, and a vectorized storage layer. The data awareness layer can handle heterogeneous data from various data sources and filters out data that it cannot identify. The filtered data then enters the semantic parsing layer, where preliminary semantic parsing is performed using an LLM-driven context-aware engine and any necessary processing is applied to obtain preliminary source data. The type identifier and preset type indicators of the preliminary source data can then be identified, enabling fine-grained classification and merging of the data to obtain merged data. The merged data can then be vectorized and stored according to business strategies for subsequent processing.

[0126] It should be noted that the LLM used in multiple processing steps in the embodiments of the present invention can be pre-deployed and optimized and fine-tuned according to requirements in order to achieve the corresponding functions.

[0127] To address the problems existing in the prior art, embodiments of the present invention provide a data processing apparatus 900. As shown in Figure 9, the apparatus 900 includes: an acquisition unit, used to receive data processing instructions, acquire demand data corresponding to the target industry, locate corresponding data sources, and extract heterogeneous data to be processed from each of the data sources, the data to be processed including standard documents, educational data, and enterprise-related data corresponding to the target industry; a processing unit, used to determine the type identifier of the data to be processed and process the data to be processed based on a preset semantic processing model to obtain a set of capability parameters associating the type identifier with each preset type indicator; a generation unit, used to obtain historical capability parameters of the type identifier and generate capability association relationships corresponding to the type identifier based on the capability parameter set and the historical capability parameters; and a matching unit, used to respond to a data query instruction, obtain the corresponding target word and target type identifier, match the target word with the capability association relationships corresponding to the target type identifier, and obtain and display the matching capability parameter table.

[0128] It should be understood that the manner in which this embodiment of the present invention is implemented is the same as that of the embodiment shown in Figure 1 and will not be described again here.

[0129] It should be understood that the manner in which this embodiment of the present invention is implemented is the same as that of the embodiments shown in Figures 1, 4, and 6 and will not be repeated here.

[0130] According to embodiments of the present invention, an electronic device and a readable storage medium are also provided.

[0131] An electronic device according to an embodiment of the present invention includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to cause the at least one processor to perform the data processing method provided in the embodiment of the present invention.

[0132] Figure 10 shows an exemplary system architecture 1000 to which the data processing method or data processing apparatus of embodiments of the present invention can be applied.

[0133] As shown in Figure 10, system architecture 1000 may include terminal devices 1001, 1002, and 1003, network 1004, and server 1005. Network 1004 serves as the medium that provides communication links between terminal devices 1001, 1002, and 1003 and server 1005. Network 1004 may include various connection types, such as wired or wireless communication links or fiber optic cables.

[0134] Users can use terminal devices 1001, 1002, and 1003 to interact with server 1005 via network 1004 to receive or send messages, etc. Various client applications can be installed on terminal devices 1001, 1002, and 1003.

[0135] Terminal devices 1001, 1002, and 1003 can be various electronic devices with displays and web browsing capabilities, including but not limited to smartphones, tablets, laptops, and desktop computers.

[0136] Server 1005 can be a server that provides various services. The server can analyze and process received data, such as product information query requests, and feed the processing results (for example, product information, given purely as an example) back to the terminal device.

[0137] It should be noted that the data processing method provided in the embodiments of the present invention is generally executed by the server 1005, and correspondingly, the data processing device is generally located in the server 1005.

[0138] It should be understood that the numbers of terminal devices, networks, and servers shown in Figure 10 are merely illustrative. Depending on implementation needs, any number of terminal devices, networks, and servers can be included.

[0139] Referring now to Figure 11, a schematic diagram of the structure of a computer system 1100 suitable for implementing embodiments of the present invention is shown. The computer system shown in Figure 11 is merely an example and should not be construed as limiting the functionality and scope of the embodiments of the present invention.

[0140] As shown in Figure 11, the computer system 1100 includes a central processing unit (CPU) 1101, which can perform various appropriate actions and processes based on programs stored in read-only memory (ROM) 1102 or programs loaded from storage section 1108 into random access memory (RAM) 1103. The RAM 1103 also stores various programs and data required for the operation of the system 1100. The CPU 1101, ROM 1102, and RAM 1103 are interconnected via a bus 1104. An input / output (I / O) interface 1105 is also connected to the bus 1104.

[0141] The following components are connected to I / O interface 1105: an input section 1106 including a keyboard, mouse, etc.; an output section 1107 including a cathode ray tube (CRT), liquid crystal display (LCD), etc., and speakers, etc.; a storage section 1108 including a hard disk, etc.; and a communication section 1109 including a network interface card such as a LAN card, modem, etc. The communication section 1109 performs communication processing via a network such as the Internet. A drive 1110 is also connected to I / O interface 1105 as needed. Removable media 1111, such as a disk, optical disk, magneto-optical disk, semiconductor memory, etc., are installed on drive 1110 as needed so that computer programs read from them can be installed into storage section 1108 as needed.

[0142] In particular, according to the embodiments disclosed in this invention, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments disclosed in this invention include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via communication section 1109, and / or installed from removable medium 1111. When the computer program is executed by central processing unit (CPU) 1101, it performs the functions defined above in the system of this invention.

[0143] It should be noted that the computer-readable medium described in this invention can be a computer-readable signal medium or a computer-readable storage medium, or any combination thereof. A computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination thereof. In this invention, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In this invention, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to wireless, wired, optical fiber cable, RF, etc., or any suitable combination thereof.

[0144] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a unit, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, may be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0145] The units described in the embodiments of the present invention can be implemented in software or hardware. The described units can also be located in a processor; for example, a processor can be described as including an acquisition unit, a processing unit, a generation unit, and a matching unit. The names of these units do not necessarily limit the specific unit; for example, an acquisition unit can also be described as a "unit with acquisition function".

[0146] In another aspect, the present invention also provides a computer-readable medium, which may be included in the device described in the above embodiments; or it may exist independently and not assembled into the device. The computer-readable medium carries one or more programs that, when executed by the device, cause the device to perform the data processing method provided by the present invention.

[0147] In another aspect, the present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the data processing method provided in the embodiments of the present invention.

[0148] The specific embodiments described above do not constitute a limitation on the scope of protection of this invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can occur depending on design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this invention should be included within the scope of protection of this invention.

Claims

1. A data processing method, characterized by comprising:
receiving a data processing instruction, acquiring demand data corresponding to a target industry, locating the corresponding data sources, and extracting heterogeneous data to be processed from each of the data sources, the data to be processed including standard documents, educational data, and enterprise-related data corresponding to the target industry;
determining a type identifier of the data to be processed, and processing the data to be processed based on a preset semantic processing model to obtain a capability parameter set that associates the type identifier with each preset type indicator;
acquiring historical capability parameters of the type identifier, and generating a capability association relationship corresponding to the type identifier based on the capability parameter set and the historical capability parameters;
in response to a data query instruction, acquiring the corresponding target word and target type identifier, matching the target word with the capability association relationship corresponding to the target type identifier, and obtaining and displaying a matching capability parameter table.

2. The method of claim 1, characterized in that the target type identifier includes a first type identifier and a second type identifier;
the matching of the target word with the capability association relationship corresponding to the target type identifier, and the obtaining and displaying of a matching capability parameter table, comprises:
matching the target word with the capability association relationships corresponding to the first type identifier and the second type identifier, respectively, to obtain a first matching word list and a second matching word list;
calculating a matching degree between each matching word in the first matching word list and each matching word in the second matching word list, and determining text pairs between the first matching word list and the second matching word list based on the matching degree;
displaying the first matching word list, the second matching word list, and the text pairs.
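The pairing step of claim 2 can be illustrated with a minimal Python sketch. The claim does not specify how the matching degree is computed; the sketch assumes a character-level similarity from the standard library's `difflib.SequenceMatcher`, and the names `matching_degree`, `text_pairs`, and the `threshold` value are hypothetical.

```python
from difflib import SequenceMatcher


def matching_degree(a: str, b: str) -> float:
    """Assumed matching degree: character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def text_pairs(first_list, second_list, threshold=0.6):
    """Pair each word of the first list with its best match in the second list.

    A pair is kept only when its matching degree reaches the (hypothetical)
    threshold. Returns a list of (first_word, second_word, degree) tuples.
    """
    pairs = []
    for w1 in first_list:
        scored = [(matching_degree(w1, w2), w2) for w2 in second_list]
        if not scored:
            continue
        degree, best = max(scored)
        if degree >= threshold:
            pairs.append((w1, best, round(degree, 3)))
    return pairs
```

An embedding-based similarity could be substituted for `matching_degree` without changing the pairing logic.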

3. The method of claim 1, characterized in that the processing of the data to be processed based on a preset semantic processing model to obtain a capability parameter set that associates the type identifier with each preset type indicator comprises:
acquiring a knowledge graph corresponding to the type identifier, recognizing the data to be processed based on the knowledge graph, and determining the entity text and entity label corresponding to each preset type indicator;
acquiring the capability target level and knowledge dictionary corresponding to the preset type indicator, and determining the capability target of the entity text in combination with the entity label;
generating structured data of the entity text based on the capability target and the entity label, and obtaining the capability parameter set of each preset type indicator.

4. The method of claim 3, characterized in that the generating of structured data of the entity text based on the capability target and the entity label, and the obtaining of the capability parameter set of each preset type indicator, comprises:
acquiring a preset capability weight matrix, and determining the capability weight coefficient of the entity text based on the capability target;
identifying entity texts having the same entity label, and generating a fusion vector based on the capability weight coefficients;
generating structured data of the entity text based on the fusion vector, the entity label, and the capability target, and obtaining the capability parameter set of each preset type indicator.
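One possible reading of the fusion step of claim 4 is a weighted average over the vectors of entity texts that share an entity label. The claim does not define the fusion operation, so the following Python sketch is an assumption; the dictionary keys `label`, `vector`, and `weight` and the function name `fuse_by_label` are hypothetical.

```python
from collections import defaultdict


def fuse_by_label(entities):
    """Fuse entity vectors that share a label into one weighted-average vector.

    entities: list of dicts with keys 'label' (str), 'vector' (list of floats
    of equal length within a label), and 'weight' (capability weight
    coefficient). Labels whose total weight is zero are skipped.
    """
    groups = defaultdict(list)
    for entity in entities:
        groups[entity["label"]].append(entity)

    fused = {}
    for label, members in groups.items():
        total = sum(m["weight"] for m in members)
        if total == 0:
            continue
        dim = len(members[0]["vector"])
        fused[label] = [
            sum(m["weight"] * m["vector"][i] for m in members) / total
            for i in range(dim)
        ]
    return fused
```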

5. The method of claim 1, characterized in that the processing of the data to be processed based on a preset semantic processing model to obtain a capability parameter set that associates the type identifier with each preset type indicator comprises:
extracting a first term set from the data to be processed;
acquiring position information of each term in the first term set, calculating a statistical value of each term, and selecting a second term set from the first term set based on the position information, the statistical value, and the preset dictionary corresponding to the type identifier;
acquiring the source and the text segment corresponding to each term in the second term set, determining the corresponding preset type indicator based on the source and the text segment, generating structured data for each term in the second term set, and obtaining the capability parameter set of each preset type indicator.

6. The method of claim 5, characterized in that the acquiring of position information of each term in the first term set, the calculating of a statistical value of each term, and the selecting of a second term set from the first term set based on the position information, the statistical value, and the preset dictionary corresponding to the type identifier comprises:
for each term in the first term set, calculating the term frequency and determining a first weight of the term; acquiring the position information of the term and determining a second weight of the term based on the text structure corresponding to the position information; matching the term against the preset dictionary corresponding to the type identifier and determining a third weight of the term based on the matching result;
selecting the second term set from the first term set based on the first weight, the second weight, and the third weight.
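The three weights of claim 6 can be sketched in Python as follows. The claim gives no formulas, so the normalized term frequency, the per-position values in `POSITION_WEIGHTS`, the binary dictionary weight, and the summed selection rule with its `threshold` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical weights for the text structure in which a term appears.
POSITION_WEIGHTS = {"title": 1.0, "heading": 0.7, "body": 0.3}


def score_terms(occurrences, dictionary):
    """Compute (first, second, third) weights for each term.

    occurrences: list of (term, position) tuples, one per occurrence.
    dictionary: set of terms in the preset dictionary for the type identifier.
    """
    counts = Counter(term for term, _ in occurrences)
    if not counts:
        return {}
    max_count = max(counts.values())
    scores = {}
    for term, count in counts.items():
        w1 = count / max_count  # first weight: normalized term frequency
        w2 = max(  # second weight: most prominent position of the term
            POSITION_WEIGHTS.get(pos, 0.0) for t, pos in occurrences if t == term
        )
        w3 = 1.0 if term in dictionary else 0.0  # third weight: dictionary hit
        scores[term] = (w1, w2, w3)
    return scores


def select_terms(scores, threshold=1.5):
    """Keep terms whose summed weights reach the (assumed) threshold."""
    return sorted(t for t, ws in scores.items() if sum(ws) >= threshold)
```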

7. The method of claim 6, characterized in that the selecting of the second term set from the first term set based on the first weight, the second weight, and the third weight comprises:
determining an initial weight of each term in the first term set based on the first weight, the second weight, and the third weight;
determining the terms in the first term set whose initial weight is greater than a preset weight threshold as high-weight terms;
determining the co-occurrence frequency of each term in the first term set with the high-weight terms, and determining a fourth weight of each term in the first term set based on the co-occurrence frequency;
selecting the second term set from the first term set based on the initial weight and the fourth weight.
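The co-occurrence refinement of claim 7 can be sketched in Python under stated assumptions: co-occurrence is counted per text segment (for example, per sentence), the fourth weight is the count normalized by the largest count, and the final selection adds `alpha` times the fourth weight to the initial weight. None of these choices is specified by the claim, and the function names are hypothetical.

```python
def fourth_weights(segments, initial, high_threshold):
    """Derive a fourth weight from co-occurrence with high-weight terms.

    segments: list of sets of terms, one set per text segment.
    initial: dict mapping each term to its initial weight.
    A term scores one count for each segment it shares with at least one
    high-weight term other than itself.
    """
    high = {t for t, w in initial.items() if w > high_threshold}
    cooc = {t: 0 for t in initial}
    for segment in segments:
        present_high = segment & high
        for term in segment:
            if term in cooc and (present_high - {term}):
                cooc[term] += 1
    top = max(cooc.values(), default=0)
    return {t: (c / top if top else 0.0) for t, c in cooc.items()}


def select_second_set(initial, fourth, alpha=0.5, threshold=1.0):
    """Keep terms whose combined score reaches the (assumed) threshold."""
    return sorted(t for t in initial if initial[t] + alpha * fourth[t] >= threshold)
```

In this sketch, a term with a modest initial weight can still be selected when it frequently appears alongside high-weight terms.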

8. The method of claim 1, characterized in that the extracting of heterogeneous data to be processed from each of the data sources comprises:
extracting the heterogeneous data to be processed from each of the data sources based on a preset time period;
acquiring a parser corresponding to the transmission protocol type associated with each of the data sources, parsing the data to be processed to obtain corresponding text data, and generating a semantic vector corresponding to the text data;
determining the type identifier corresponding to the semantic vector, acquiring a preset business strategy, determining a storage location for the semantic vector based on the type identifier and the business strategy, and storing the semantic vector.
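The parser lookup and storage routing of claim 8 can be sketched in Python as a registry keyed by a protocol label. The keys `http_html` and `http_json`, the two parser functions, and the dictionary-based business strategy are hypothetical stand-ins; the claim does not name concrete protocols or storage targets, and semantic vector generation is omitted here.

```python
import json
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the non-empty text nodes of an HTML payload."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def parse_html(payload: str) -> str:
    extractor = _TextExtractor()
    extractor.feed(payload)
    extractor.close()
    return " ".join(extractor.chunks)


def parse_json(payload: str) -> str:
    record = json.loads(payload)
    return " ".join(str(value) for value in record.values())


# Hypothetical registry: protocol label -> parser.
PARSERS = {"http_html": parse_html, "http_json": parse_json}


def to_text(protocol: str, payload: str) -> str:
    """Select the parser registered for the protocol and return plain text."""
    try:
        parser = PARSERS[protocol]
    except KeyError:
        raise ValueError(f"no parser registered for protocol {protocol!r}") from None
    return parser(payload)


def storage_location(type_identifier: str, strategy: dict) -> str:
    """Route by type identifier, falling back to the strategy's default."""
    return strategy.get(type_identifier, strategy["default"])
```

Registering a new data source then amounts to adding one entry to `PARSERS`, which keeps the extraction loop independent of any particular protocol.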

9. A data processing apparatus, characterized by comprising:
an acquisition unit, configured to receive a data processing instruction, acquire demand data corresponding to a target industry, locate the corresponding data sources, and extract heterogeneous data to be processed from each of the data sources, the data to be processed including standard documents, educational data, and enterprise-related data corresponding to the target industry;
a processing unit, configured to determine a type identifier of the data to be processed, and process the data to be processed based on a preset semantic processing model to obtain a capability parameter set that associates the type identifier with each preset type indicator;
a generation unit, configured to acquire historical capability parameters of the type identifier, and generate a capability association relationship corresponding to the type identifier based on the capability parameter set and the historical capability parameters;
a matching unit, configured to, in response to a data query instruction, acquire the corresponding target word and target type identifier, match the target word with the capability association relationship corresponding to the target type identifier, and obtain and display a matching capability parameter table.

10. An electronic device, characterized by comprising:
one or more processors;
a storage device for storing one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method of any one of claims 1 to 8.