A method, system and electronic device for extracting qualifiers
By using deep learning models and data item processing operations, qualifying words are automatically extracted, solving the problems of low efficiency and low accuracy of manual extraction. This enables fast and accurate naming of data items and improves the integration efficiency of the data management system.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ZHEJIANG DAHUA TECH CO LTD
- Filing Date
- 2022-11-28
- Publication Date
- 2026-06-30
AI Technical Summary
In existing technologies, manually extracting qualifiers is inefficient and inaccurate, making it difficult to standardize data item naming and affecting the data integration and usage efficiency of data management systems.
A deep learning model is used to transform and predict the attributes of data items, filter preset content, automatically extract target qualifiers, and improve the standardization of data item naming by combining data item processing operations and Chinese definition mapping.
It improves the efficiency and accuracy of extracting qualifying terms, reduces labor costs, minimizes errors caused by human experience, and enables fast and accurate naming of data items.
Figure CN115859966B_ABST
Description
Technical Field
[0001] This application relates to the fields of data analysis and data governance technology, and in particular to a method, system and electronic device for extracting qualifiers.
Background Technology
[0002] With the popularization and development of internet technology, data types have become increasingly diversified. The diversity of data formats and content increases the difficulty of data integration after raw data is connected to big data systems, hindering truly convenient data use. Therefore, a complete data management system is urgently needed to manage data.
[0003] In a data management system, the various subsystems are relatively independent, and the standards for entering data item names are inconsistent, resulting in disorganized data across subsystems. This makes it difficult for users to quickly and accurately retrieve the data they need based on the data item names. Therefore, in a data management system, it is extremely important to adopt standardized data item naming for various types of data.
[0004] Currently, the traditional method of standardizing data item naming involves manually extracting qualifiers from the data information within each item. In this process, operators rely on their experience to determine the qualifiers and data elements corresponding to the data information within each item, thereby achieving standardized naming.
[0005] For example, in the original business information table to be processed shown in Figure 1, the operator first determines, based on the database of data elements, that the data element corresponding to the four data items is "name". Then, based on the specific data information of the data items and their own experience, the operator extracts the qualifiers of the four data items as father, mother, spouse and children, and sets the names of the four data items to father_name, mother_name, spouse_name and children_name respectively.
[0006] However, when qualifiers are extracted manually, especially when there are many data items, matching the qualifiers with the data information takes a lot of time, resulting in low extraction efficiency. Furthermore, extracting qualifiers based on human experience lacks objectivity. For example, operator A, based on their own experience, might set the qualifier of the third column in Figure 1 to spouse, while operator B, based on their own experience, sets it to brother. This makes qualifier extraction highly subjective and results in low extraction accuracy.
Summary of the Invention
[0007] This application provides a solution to address the problems of low efficiency and low accuracy in extracting qualifiers. The specific implementation scheme is as follows:
[0008] In a first aspect, this application provides a method for extracting qualifiers, the method comprising:
[0009] Retrieve N data items from the current business information table, where N is a positive integer;
[0010] The attributes of the N data items are transformed according to the data item processing operations to obtain N input texts;
[0011] Input the N input texts into the target model to obtain the first qualifiers corresponding to each of the N data items;
[0012] By filtering the preset content in the first qualifiers corresponding to each of the N data items, the target qualifiers corresponding to each of the N data items are determined, and the target qualifiers corresponding to each of the N data items are extracted.
[0013] Data item processing operations perform attribute transformation on the data items in the business information table, and the resulting Chinese fields and data elements are used as input text for a target model that meets the preset conditions. The preset content is then filtered out of the first qualifier output by the model to obtain the target qualifier. This avoids matching qualifiers with data information based on human experience, which reduces labor costs and experience-induced errors and thus improves the efficiency and accuracy of qualifier extraction.
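The four steps of the first aspect can be read as a simple pipeline. The following Python sketch is illustrative only and is not the implementation of this application: the function names, the dictionary keys and the three toy stand-ins (a string-based "model" in particular) are assumptions chosen so that the example runs without a trained deep learning model.

```python
from typing import Callable, Dict, List


def extract_target_qualifiers(
    data_items: List[Dict[str, str]],
    transform: Callable[[Dict[str, str]], str],
    target_model: Callable[[str], str],
    filter_preset: Callable[[str], str],
) -> List[str]:
    """Run the steps on N data items and return N target qualifiers."""
    input_texts = [transform(item) for item in data_items]         # attribute transformation
    first_qualifiers = [target_model(t) for t in input_texts]      # model prediction
    return [filter_preset(q) for q in first_qualifiers]            # preset-content filtering


# --- Toy stand-ins (hypothetical, for illustration only) ---
def toy_transform(item: Dict[str, str]) -> str:
    # Join data element and cleaned field into one input string.
    return f"{item['data_element']}|{item['chinese_field'].replace('_', ' ')}"


def toy_model(text: str) -> str:
    # Stand-in for the trained target model: drop the data element from the field.
    element, field = text.split("|", 1)
    return field.replace(element, "").strip()


def toy_filter(qualifier: str) -> str:
    # Drop words that carry no qualifier meaning.
    return " ".join(w for w in qualifier.split() if w not in {"name", "code"})


items = [
    {"chinese_field": "father_ID number", "data_element": "ID number"},
    {"chinese_field": "region name_code", "data_element": "code"},
]
result = extract_target_qualifiers(items, toy_transform, toy_model, toy_filter)
print(result)
```

Passing the three stages as callables keeps the sketch independent of any particular model or filtering rule.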
[0014] In one possible implementation, before obtaining the N data items from the current business information table, the method further includes:
[0015] Retrieve M data items from the historical business information table, where M is a positive integer;
[0016] The M data items are transformed according to the data item processing operations to obtain M training texts;
[0017] The first model is trained using the M training texts to obtain the second model, wherein the first model is a deep learning model;
[0018] If the second model meets the first preset condition, the second model will be used as the target model.
[0019] Training determines the parameters of the model, so that the model can be used directly to extract qualifiers. The target model is determined based on the first preset condition, so that it is the model with the highest accuracy or the lowest error in extracting qualifiers within the deep learning model set, which further improves the accuracy of qualifier extraction.
[0020] In one possible implementation, the step of performing attribute transformation on the N data items according to the data item processing operation to obtain N input texts includes:
[0021] For each of the N data items, the following data item processing operations are performed:
[0022] Retrieve Chinese fields and data elements from data items;
[0023] Filter the first specified data in the Chinese field to obtain the first Chinese field;
[0024] Based on the data attributes in the target model, the data elements and the first Chinese field are converted into input text in word vector format;
[0025] After performing the data item processing operation on each of the N data items, N input texts are obtained.
[0026] The data item processing operation filters out redundant descriptive information and irrelevant identifiers in the data items, and transforms the data items into input text in a word vector format that conforms to the input attributes of the target model. This avoids the influence of interfering data on model prediction and further improves the accuracy of limiting word extraction.
[0027] In one possible implementation, filtering the first specified data in the Chinese field includes:
[0028] Determine whether the Chinese field contains English text;
[0029] If so, the second specified data in the English text is converted into Chinese through Chinese definition mapping to obtain the second Chinese field, and the first specified data in the second Chinese field is filtered out;
[0030] If not, filter the first specified data in the Chinese field.
[0031] Determining whether a Chinese field contains English decides whether Chinese definition mapping is necessary. When English is present, all English text other than that carrying business information is first mapped to Chinese, and the first specified data is then filtered. When no English is present, the first specified data is filtered directly. This avoids the impact of unnecessary English data on the model's prediction and further improves the accuracy of the target model in extracting qualifiers.
[0032] In one possible implementation, converting the data elements and the first Chinese field into input text in word vector format includes:
[0033] Determine whether the data element and the first Chinese field meet the second preset condition;
[0034] If so, mark the data item that corresponds to both the data element and the first Chinese field as a specified data item, and convert the data element and the first Chinese field into input text in word vector format;
[0035] If not, convert the data elements and the first Chinese field into input text in word vector format.
[0036] The second preset condition is that the first Chinese field is empty or the data element is "name". Checking this condition determines whether the data item should be marked, that is, whether qualifiers need to be extracted for it, so that abnormal data items are screened out. Only then are the data element and the first Chinese field converted into input text in word vector format. This avoids standardizing the naming of data items that have no qualifier or need no qualifier, and further improves the accuracy of qualifier extraction.
[0037] In one possible implementation, after extracting the target qualifying words corresponding to each of the N data items, the method further includes:
[0038] For each of the N data items, perform the following naming operation:
[0039] Determine if a data item is a specified data item;
[0040] If so, filter the data items and determine that the predicted name value corresponding to the data item is empty;
[0041] If not, determine the name prediction value corresponding to the data item based on the data element corresponding to the data item and the target qualifying word corresponding to the data item;
[0042] After performing the naming operation on each of the N data items, the predicted name value for each of the N data items is obtained.
[0043] By determining whether a data item is a specified data item, data items with abnormal conditions are excluded. Based on data elements and target qualifiers, standardized naming of data items is achieved, thereby effectively solving the problem that users have difficulty quickly and accurately obtaining the data they need based on the data item name.
[0044] In a second aspect, this application also provides a system for extracting qualifiers, the system comprising:
[0045] The acquisition module is used to acquire N data items from the current business information table, where N is an integer greater than zero;
[0046] The data item processing module is used to perform attribute transformation on the N data items according to the data item processing operation to obtain N input texts;
[0047] The prediction module is used to input the N input texts into the target model to obtain the first qualifiers corresponding to each of the N data items;
[0048] The processing module is used to determine the target qualifiers corresponding to each of the N data items by filtering the preset content in the first qualifiers corresponding to each of the N data items, and to extract the target qualifiers corresponding to each of the N data items.
[0049] In one possible implementation, the acquisition module is specifically used to acquire M data items from the historical business information table, where M is an integer greater than zero;
[0050] The M data items are transformed according to the data item processing operations to obtain M training texts;
[0051] The first model is trained using the M training texts to obtain the second model, wherein the first model is a deep learning model;
[0052] If the second model meets the first preset condition, the second model will be used as the target model.
[0053] In one possible implementation, the data item processing module is specifically configured to perform the following data item processing operation for each of the N data items:
[0054] Retrieve Chinese fields and data elements from data items;
[0055] Filter the first specified data in the Chinese field to obtain the first Chinese field;
[0056] Based on the data attributes in the target model, the data elements and the first Chinese field are converted into input text in word vector format;
[0057] After performing the data item processing operation on each of the N data items, N input texts are obtained.
[0058] In one possible implementation, the data item processing module is specifically used to determine whether the Chinese field contains English text;
[0059] If so, the second specified data in the English text is converted into Chinese through Chinese definition mapping to obtain the second Chinese field, and the first specified data in the second Chinese field is filtered out;
[0060] If not, filter the first specified data in the Chinese field.
[0061] In one possible implementation, the data item processing module is specifically used to determine whether the data element and the first Chinese field meet the second preset condition;
[0062] If so, mark the data item that corresponds to both the data element and the first Chinese field as a specified data item, and convert the data element and the first Chinese field into input text in word vector format;
[0063] If not, convert the data elements and the first Chinese field into input text in word vector format.
[0064] In one possible implementation, the processing module is specifically configured to perform the following naming operation for each of the N data items:
[0065] Determine whether the data item is the specified data item;
[0066] If so, filter the data items and determine that the predicted name value corresponding to the data item is empty;
[0067] If not, determine the name prediction value corresponding to the data item based on the data element corresponding to the data item and the target qualifying word corresponding to the data item;
[0068] After performing the naming operation on each of the N data items, the predicted name value for each of the N data items is obtained.
[0069] In a third aspect, this application provides an electronic device, comprising:
[0070] A memory, used to store a computer program;
[0071] A processor, used to implement the steps of the above-described method for extracting qualifiers when executing the computer program stored in the memory.
[0072] In a fourth aspect, this application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above-described method for extracting qualifiers.
[0073] For the second to fourth aspects above, and the technical effects that each may achieve, please refer to the above description of the technical effects of the first aspect or of the possible solutions in the first aspect; they will not be repeated here.
Attached Figure Description
[0074] Figure 1 is a schematic diagram of the original business information table to be processed provided in this application;
[0075] Figure 2 is a flowchart of a method for extracting qualifiers provided in this application;
[0076] Figure 3 is a schematic diagram illustrating the processing steps of a method for extracting qualifiers provided in this application;
[0077] Figure 4 is a schematic diagram of a qualifier extraction system provided in this application;
[0078] Figure 5 is a schematic diagram of an electronic device provided in this application.
Detailed Implementation
[0079] To make the objectives, technical solutions, and advantages of this application clearer, the application is described in further detail below with reference to the accompanying drawings. The specific operational methods in the method embodiments can also be applied to the device embodiments or system embodiments. It should be noted that, in the description of this application, "multiple" is understood as "at least two". "And/or" describes the relationship between related objects and indicates that three relationships can exist. For example, A and/or B can represent: A existing alone, A and B existing simultaneously, and B existing alone. "A connected to B" can represent either that A and B are directly connected or that A and B are connected through C. Furthermore, in the description of this application, terms such as "first" and "second" are used only to distinguish the items being described and should not be construed as indicating or implying relative importance or order.
[0080] The embodiments of this application will now be described in detail with reference to the accompanying drawings.
[0081] Currently, when extracting qualifiers manually, if the number of data items is large, it takes a lot of time to match the qualifiers with the data information, resulting in low extraction efficiency. Furthermore, extracting qualifiers based on human experience lacks objectivity, making the extraction highly subjective, which in turn leads to low extraction accuracy.
[0082] Therefore, this application proposes a method for extracting qualifiers. The method performs attribute transformation on the data items in the business information table according to data item processing operations, inputs the transformed Chinese fields and data elements into a target model that meets the preset conditions, and filters the preset content out of the first qualifier output by the model to obtain the target qualifier. This avoids matching qualifiers with data information based on human experience, which reduces labor costs and experience-induced errors and thus improves the efficiency and accuracy of qualifier extraction.
[0083] Referring to Figure 2, which is a flowchart of a method for extracting qualifiers according to an embodiment of this application, the method includes:
[0084] S1, retrieve N data items from the current business information table;
[0085] In the method for extracting qualifying terms, before obtaining N data items from the current business information table, it is first necessary to determine the target model. This target model is used to extract the qualifying terms corresponding to the data items in the business information table. The specific method for obtaining this target model is as follows:
[0086] Specifically, first, retrieve M data items from the historical business information table, where M is an integer greater than zero;
[0087] Next, the attributes of the above M data items are transformed according to the data item processing operations to obtain M training texts;
[0088] Then, based on these M training texts, the first model is trained to obtain the second model;
[0089] Further, determine whether the second model meets the first preset condition.
[0090] In this embodiment, the first preset condition may be that the accuracy of the second model in extracting qualifiers is the highest in the deep learning model set, or that the error of the second model in extracting qualifiers is the lowest in the deep learning model set. The first preset condition can be adjusted according to the actual application scenario.
[0091] If the second model does not meet the first preset condition, the first model is replaced with another deep learning model in the deep learning model set, and the replaced first model is then trained on the M training texts.
[0092] If the second model meets the first preset condition, then the second model is taken as the target model.
[0093] For example, first, 1000 data items are obtained from business information table A. Second, attribute transformation is performed on the 1000 data items according to the data item processing operation, yielding training text that conforms to the data attributes of model A. Model A is then trained on this training text to obtain the trained model A-1. Calculation shows that the qualifier extraction accuracy of model A-1 is 94%, while that of the other models in the deep learning model set is 80% for model B-1, 84% for model C-1, and 90% for model D-1. Model A-1 therefore has the highest qualifier extraction accuracy in the deep learning model set and is chosen as the target model.
[0094] With the above method, the target model is determined based on the preset condition of the highest accuracy or the lowest error in extracting qualifiers. The target model is thus the model with the highest accuracy or the lowest error in extracting qualifiers within the deep learning model set, which further improves the accuracy of qualifier extraction.
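The selection rule in the example above can be sketched in Python as follows. This is a simplified illustration that assumes every candidate in the deep learning model set has already been trained and evaluated; the function name `select_target_model` and the error values are hypothetical, while the accuracy figures are the ones from the example.

```python
from typing import Dict


def select_target_model(scores: Dict[str, float], lowest_error: bool = False) -> str:
    """Return the model name that satisfies the first preset condition.

    scores maps a trained model name to its qualifier extraction accuracy
    (or to its error when lowest_error is True).
    """
    if not scores:
        raise ValueError("the deep learning model set is empty")
    chooser = min if lowest_error else max
    return chooser(scores, key=scores.get)


# Accuracies from the example in paragraph [0093].
accuracies = {"A-1": 0.94, "B-1": 0.80, "C-1": 0.84, "D-1": 0.90}
print(select_target_model(accuracies))

# The same rule expressed with (hypothetical) error values.
errors = {"A-1": 0.06, "B-1": 0.20, "C-1": 0.16, "D-1": 0.10}
print(select_target_model(errors, lowest_error=True))
```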
[0095] After determining the target model, obtain N data items from the current business information table, where N is an integer greater than zero.
[0096] It should be noted that in this embodiment of the application, the current business information table is used for model prediction, while the historical business information table is used for model training. In order to avoid unreliable model prediction results, the current business information table and the historical business information table cannot be the same business information table.
[0097] S2, perform attribute transformation on N data items according to the data item processing operation to obtain N input texts;
[0098] Since raw data often contains interfering data that affects the model's prediction performance, raw data generally cannot be used directly for model prediction. Therefore, data item processing operations must be performed on the acquired N data items. In other words, before the qualifiers are extracted, attribute transformation is performed on the N data items according to the data item processing operations, so that the data items take a form that the target model can recognize.
[0099] For each of the N data items mentioned above, a data item processing operation needs to be performed. The specific operations are as follows:
[0100] First, obtain the Chinese fields and data elements in the data item; then filter the first specified data in the Chinese field to obtain the first Chinese field.
[0101] It should be noted that, in the embodiments of this application, the first specified data includes identifiers unrelated to Chinese characters, such as " / ", "\", ",", ".", "_", and spaces; it also includes redundant descriptive information, such as "cannot be empty, ID number", where the redundant descriptive information is "cannot be empty".
[0102] For example, the Chinese field is "cannot be empty, ID number", where the first specified data is the redundant information "cannot be empty" and the identifier ",". After filtering out "cannot be empty" and ",", the first Chinese field is "ID number".
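The filtering in this example can be sketched with a regular expression. The sketch below is an assumption-laden illustration: the phrase list `REDUNDANT_PHRASES` and the function name are hypothetical, and because the example is rendered in English, runs of identifiers are collapsed to a single space rather than deleted outright (in Chinese text, where words are not separated by spaces, they could simply be removed).

```python
import re

# Hypothetical list of redundant descriptive phrases.
REDUNDANT_PHRASES = ["cannot be empty"]

# Identifiers named in the text: "/", "\", ",", ".", "_" and whitespace.
IDENTIFIER_PATTERN = re.compile(r"[/\\,._\s]+")


def filter_first_specified_data(chinese_field: str) -> str:
    """Remove redundant descriptions and irrelevant identifiers from a field."""
    text = chinese_field
    for phrase in REDUNDANT_PHRASES:
        text = text.replace(phrase, " ")
    # Collapse each run of identifiers to one space, then trim the ends.
    return IDENTIFIER_PATTERN.sub(" ", text).strip()


print(filter_first_specified_data("cannot be empty, ID number"))
```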
[0103] After obtaining the first Chinese field, the data elements and the first Chinese field are converted into input text in word vector format based on the data attributes of the target model.
[0104] Finally, after performing the data item processing operation on each of the N data items, N input texts can be obtained.
[0105] Through the above data item processing operation, redundant description information and irrelevant identifiers in the data item are filtered, and the data item is converted into an input text in the form of word vectors that conform to the input attributes of the target model, avoiding the influence of interfering data on model prediction, and further improving the extraction accuracy of qualifiers.
[0106] Furthermore, when filtering the first specified data in the Chinese field, it is first necessary to determine whether the Chinese field contains English.
[0107] If the Chinese field contains English, the second specified data in the English text is converted into Chinese through Chinese definition mapping to obtain a second Chinese field, and the first specified data in the second Chinese field is then filtered.
[0108] It should be noted that, in the embodiments of this application, the second specified data is the English text that does not carry business information. For example, if the Chinese field is "gender_ID number_ip", only "ip" carries business information, so the second specified data is the English text other than "ip", that is, "gender".
[0109] If the Chinese field does not contain English, the first specified data in the Chinese field can be directly filtered.
[0110] For example, the Chinese field corresponding to data item A is "gender_ID number_ip", where "ID number" is Chinese text in the original field. The Chinese field contains English, and only "ip" carries business information, so the second specified data is "gender". Based on Chinese definition mapping, the English word "gender" is mapped to its Chinese equivalent (性别), which gives the second Chinese field. The second Chinese field still contains the identifier "_", which is unrelated to Chinese characters and belongs to the first specified data. Filtering out "_" therefore yields the first Chinese field, "gender ID number ip" in translation.
[0111] Determining whether the Chinese field contains English, as described above, decides whether Chinese definition mapping is required. When English is present, the English text that does not carry business information is mapped to Chinese through Chinese definition mapping before the first specified data is filtered. When no English is present, the first specified data is filtered directly. This avoids the influence of unnecessary English data on model prediction and further improves the accuracy of the target model in extracting qualifiers.
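A minimal Python sketch of the English check and the Chinese definition mapping follows. The mapping table `DEFINITION_MAP`, the set `BUSINESS_TERMS` and the Chinese rendering of the example field ("身份证号" for "ID number") are assumptions made for illustration; this application does not specify how the mapping is stored.

```python
import re

# Hypothetical Chinese definition mapping and business-term whitelist.
DEFINITION_MAP = {"gender": "性别"}
BUSINESS_TERMS = {"ip"}  # English that carries business information is kept

ENGLISH_WORD = re.compile(r"[A-Za-z]+")


def contains_english(chinese_field: str) -> bool:
    return ENGLISH_WORD.search(chinese_field) is not None


def to_second_chinese_field(chinese_field: str) -> str:
    """Map English without business information to Chinese; keep the rest."""
    if not contains_english(chinese_field):
        return chinese_field  # nothing to map, filter directly afterwards

    def replace(match: re.Match) -> str:
        word = match.group(0)
        if word.lower() in BUSINESS_TERMS:
            return word
        # Words missing from the mapping are left unchanged in this sketch.
        return DEFINITION_MAP.get(word.lower(), word)

    return ENGLISH_WORD.sub(replace, chinese_field)


print(to_second_chinese_field("gender_身份证号_ip"))
```

The first specified data (here the "_" identifiers) would then be filtered from the returned second Chinese field.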
[0112] Furthermore, when converting the data element and the first Chinese field into an input text in the form of word vectors, it is first necessary to determine whether the data element and the first Chinese field meet the second preset condition.
[0113] It should be noted that in this embodiment, since a Chinese field being empty indicates that the Chinese field has no practical meaning, the second preset condition is that the first Chinese field is empty; when the data element is a "name" in the form of a tag, it is not necessary to extract the qualifying word, so the second preset condition can also be that the data element is "name". Therefore, the second preset condition is that the first Chinese field is empty or the data element is "name".
[0114] If the data element and the first Chinese field meet the second preset condition, then the data item corresponding to the data element and the first Chinese field is marked as the specified data item, and the data element and the first Chinese field are converted into input text in word vector format;
[0115] If the data element and the first Chinese field do not meet the second preset conditions, then the data element and the first Chinese field are directly converted into input text in word vector format.
[0116] Checking whether the first Chinese field is empty or the data element is "name" determines whether the data item should be marked, that is, whether qualifiers need to be extracted for it, so that abnormal data items are screened out. Only then are the data element and the first Chinese field converted into input text in word vector format. This avoids standardizing the naming of data items that have no qualifier or need no qualifier, and further improves the accuracy of qualifier extraction.
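The marking step can be sketched in Python as follows. The dictionary keys and function names are hypothetical; the condition itself is the second preset condition stated above (the first Chinese field is empty, or the data element is "name").

```python
from typing import Dict, List


def meets_second_preset_condition(data_element: str, first_chinese_field: str) -> bool:
    """True when the first Chinese field is empty or the data element is "name"."""
    return first_chinese_field.strip() == "" or data_element == "name"


def mark_specified_items(items: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Return copies of the items with a boolean "specified" flag added."""
    return [
        dict(item, specified=meets_second_preset_condition(
            item["data_element"], item["first_chinese_field"]))
        for item in items
    ]


marked = mark_specified_items([
    {"data_element": "ID number", "first_chinese_field": "father ID number"},
    {"data_element": "ID number", "first_chinese_field": ""},
    {"data_element": "name", "first_chinese_field": "father"},
])
print([m["specified"] for m in marked])
```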
[0117] In one possible embodiment, when converting data elements and the first Chinese field into input text in word vector format, the data elements and the first Chinese field are first converted into first text that conforms to the input attributes of the target model. For example, if the input attribute of the target model requires the input data to be in three-dimensional format, the data elements and the first Chinese field are converted into first text in three-dimensional format.
[0118] Then, the first text is encoded using a language model (BERT) to obtain the input text in word vector format.
[0119] S3, input N input texts into the target model to obtain the first qualifying words corresponding to each of the N data items;
[0120] After processing N data items to obtain N input texts, these N input texts are input into the target model, which then automatically makes predictions and finally obtains the first qualifying words corresponding to each of the N data items.
[0121] It should be noted that in the embodiments of the present application, after the target model automatically makes a prediction, the labels corresponding to the Chinese character fields in each data item can be determined; then, according to the labels, the corresponding characters are processed accordingly, and the first qualifier corresponding to each data item can be obtained, and then the first qualifiers corresponding to each of the N data items can be obtained.
[0122] Here, each label is one of the types in the token dictionary of the target model, such as "KEEP", "DELETE", or "ADD".
[0123] For example, as shown in the prediction result table in Table 1:
[0124] Table 1 Prediction Result Table
[0125]
[0126] After inputting the processed data element and the processed Chinese character field (i.e., 3 input texts) into the A-1 model (i.e., the target model) for prediction, first, the labels corresponding to the Chinese character fields of each data item can be obtained. In the first data item, the label corresponding to "父" in the Chinese character field is "KEEP", and "KEEP" means to keep, so this character is kept. The label corresponding to "身" is "DELETE", and "DELETE" means to delete, so this character is deleted. Therefore, according to the labels corresponding to each character in the Chinese character field, the first qualifier can be obtained as "父亲"; similarly, the first qualifier corresponding to the second data item is obtained as "地区名称", and the first qualifier corresponding to the third data item is obtained as "没收非法财物的".
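The way per-character labels produce the first qualifier can be sketched as follows. The Chinese character field "父亲身份证号" (father's ID number) and its label sequence are assumed for illustration, since Table 1 is not reproduced here; only the "KEEP" and "DELETE" labels are handled, because the text does not describe what an "ADD" label inserts.

```python
from typing import List


def apply_labels(chars: str, labels: List[str]) -> str:
    """Keep characters labelled KEEP and drop characters labelled DELETE."""
    if len(chars) != len(labels):
        raise ValueError("each character needs exactly one label")
    kept = []
    for char, label in zip(chars, labels):
        if label == "KEEP":
            kept.append(char)
        elif label == "DELETE":
            continue
        else:
            # "ADD" and any other label are outside the scope of this sketch.
            raise ValueError(f"unsupported label: {label}")
    return "".join(kept)


# Hypothetical field and labels consistent with the description above.
field = "父亲身份证号"
labels = ["KEEP", "KEEP", "DELETE", "DELETE", "DELETE", "DELETE"]
print(apply_labels(field, labels))
```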
[0127] S4. By filtering the preset content in the first qualifiers corresponding to each of the N data items, determine the target qualifiers corresponding to each of the N data items, and extract the target qualifiers corresponding to each of the N data items;
[0128] After obtaining the first qualifiers corresponding to each of the N data items, the first qualifiers need to be optimized before the target qualifiers corresponding to each of the N data items can be determined.
[0129] Specifically, after obtaining the first qualifiers corresponding to each of the N data items, first determine whether the first qualifiers corresponding to each of the N data items contain the preset content.
[0130] It should be noted that, in the embodiments of the present application, the preset content includes words with repeated semantics, such as "法人" (legal person) and "法定代表人" (legal representative); modal particles, such as "的"; and words without qualifier meaning, such as "代码" (code) and "名称" (name).
[0131] If the first qualifier of a data item does not contain the preset content, the first qualifier is taken directly as the target qualifier of that data item, and the target qualifier is then extracted.
[0132] If the first qualifier of a data item contains preset content, the preset content is filtered out of the first qualifier, the result is taken as the target qualifier of that data item, and the target qualifier is then extracted.
[0133] For example, as shown in Table 2, the optimized qualifying words results are as follows:
[0134] Table 2 Results of Optimized Qualifiers
[0135]
[0136] The first qualifier for the first data item is "父亲" (father). Since this qualifier does not contain any preset content, the target qualifier for the first data item is "父亲". The first qualifier for the second data item is "地区名称" (region name). The word "名称" (name) belongs to the preset content as a word without qualifier meaning, so it is filtered out, and the target qualifier for the second data item is "地区" (region). The first qualifier for the third data item is "没收非法财物的" (of confiscated illegal property). Since it contains the modal particle "的", which belongs to the preset content, the particle is filtered out, and the target qualifier for the third data item is "没收非法财物" (confiscated illegal property).
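The filtering step can be sketched as below, under stated assumptions. The word lists contain only the examples given in this description ("的", "代码", "名称"). The function name `filter_preset_content` is hypothetical. Words with repeated semantics (such as "法人" and "法定代表人") are not handled, because the description does not specify how they are resolved.

```python
# Examples of preset content taken from the description; a real
# deployment would presumably use longer, curated lists (assumption).
MODAL_PARTICLES = ("的",)
NON_QUALIFIER_WORDS = ("代码", "名称")


def filter_preset_content(first_qualifier: str) -> str:
    """Remove preset content from a first qualifier to get the target qualifier."""
    target = first_qualifier
    for word in NON_QUALIFIER_WORDS + MODAL_PARTICLES:
        # Plain substring removal: a simplification chosen for this sketch.
        target = target.replace(word, "")
    return target


targets = [filter_preset_content(q) for q in ("父亲", "地区名称", "没收非法财物的")]
```

Plain substring removal would also strip "的" from inside words such as "目的", so a practical implementation might restrict removal to word boundaries or to the end of the qualifier.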
[0137] Furthermore, after extracting the target qualifiers corresponding to each of the N data items, we can then standardize the naming of each data item.
[0138] Specifically, for each of the N data items, the following naming operation is performed:
[0139] First, determine whether the data item is the specified data item;
[0140] If the data item is a specified data item, it indicates that the data item has an abnormal situation and does not have a target qualifier. In this case, there is no need to standardize the naming of the data item, and the predicted value of the name corresponding to the data item is determined to be empty.
[0141] If the data item is not a specified data item, then the name prediction value corresponding to the data item is determined based on the data element corresponding to the data item and the target qualifier corresponding to the data item.
[0142] After performing the above naming operation on each of the N data items, the predicted name value for each of the N data items can be obtained.
[0143] For example, as shown in Table 3, the predicted name values are as follows:
[0144] Table 3 Predicted Name Values
[0145]
[0146] The first, second, and third data items are not designated data items. Therefore, based on the data elements and target qualifiers, each data item is standardized in its naming. The data element corresponding to the first data item is "Citizen ID Card Number," and the target qualifier is "Father." Therefore, the first data item is named "Father_Citizen ID Card Number." Similarly, the second data item is named "Region_Administrative Division Code," and the third data item is named "Confiscated Illegal Property_Amount."
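A minimal sketch of the naming operation follows. It assumes that the name is formed as the target qualifier, an underscore, and the data element, as in the examples above, and that an "empty" predicted name is represented by an empty string. The function name `predict_name` is hypothetical.

```python
def predict_name(data_element: str, target_qualifier: str, is_specified: bool) -> str:
    """Return the predicted name of a data item, or "" for a specified data item."""
    if is_specified:
        # Specified (abnormal) data items receive no standardized name.
        return ""
    return f"{target_qualifier}_{data_element}"


name = predict_name("Citizen ID Card Number", "Father", is_specified=False)
```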
[0147] By determining whether a data item is a specified data item, data items with abnormal conditions are excluded. Based on data elements and target qualifiers, standardized naming of data items is achieved, thereby effectively solving the problem that users have difficulty quickly and accurately obtaining the data they need based on the data item name.
[0148] In summary, the method for extracting qualifiers proposed in this application transforms the attributes of data items in the business information table based on data item processing operations. The transformed Chinese fields and data elements are then input into the target model. The first qualifier output by the model is optimized by filtering preset content, thereby determining and extracting the final target qualifier. This avoids matching qualifiers with data information based on human experience, thus reducing labor costs and errors caused by human experience. Consequently, it improves the efficiency and accuracy of qualifier extraction, while also achieving automated and intelligent extraction of qualifiers.
[0149] Furthermore, after extracting the target qualifiers corresponding to the data items and excluding abnormal data items, standardized naming of data items was achieved based on the data elements and target qualifiers corresponding to the data items. This effectively solved the problem that users could not quickly and accurately obtain the data they needed based on the data item names, laying a technical foundation for data governance.
[0150] The technical solution of this application will be further explained below with reference to a specific application process.
[0151] Figure 3 shows the processing steps of the qualifier extraction method. First, the current business information table and the historical business information table are obtained. The current business information table is used for model prediction, and the historical business information table is used for model training.
[0152] Secondly, the current business information table and the historical business information table are input into the data item processing module. According to the data item processing operation, the data items in the current business information table and the data items in the historical business information table are transformed according to their attributes. The transformed Chinese fields and data elements in the current business information table are used as input text, and the transformed Chinese fields and data elements in the historical business information table are used as training text.
[0153] Then, the training text is input into the deep learning model to train the deep learning model. When the accuracy of the trained model is the highest in the set of deep learning models, the trained model is used as the target model and output to the model prediction module.
[0154] Furthermore, based on the input text, the target model makes predictions and outputs the first qualifying word;
[0155] Furthermore, the first qualifier is input into the optimization module, and the preset content in the first qualifier is filtered to extract the target qualifier. The preset content includes words with repeated semantics, words without qualifier meaning, and modal particles.
[0156] Finally, after extracting the target qualifiers, the target qualifiers are input into the data item standardization naming module. Based on the data elements and the target qualifiers, the data items are standardized and named to obtain the predicted names of the data items.
[0157] By using the above method, data items in the business information table are processed based on data item processing operations. Then, the processed data items are predicted based on the trained target model. The first qualifier obtained after prediction is optimized to obtain the target qualifier. This avoids matching qualifiers with data information based on human experience, thereby reducing labor costs and errors caused by human experience. This improves the efficiency and accuracy of qualifier extraction and realizes automated and intelligent extraction of qualifiers.
[0158] Furthermore, after extracting the target qualifiers corresponding to the data items, standardized naming of the data items was achieved based on the data elements and target qualifiers. This effectively solved the problem that users could not quickly and accurately obtain the data they needed based on the data item names, laying a technical foundation for data governance.
[0159] Based on the same inventive concept, this application also provides a system for extracting qualifying words. Figure 4 is a structural schematic of the qualifying word extraction system provided in this application. The system includes:
[0160] The acquisition module 401 is used to acquire N data items from the current business information table, where N is an integer greater than zero;
[0161] The data item processing module 402 is used to perform attribute transformation on N data items according to the data item processing operation to obtain N input texts;
[0162] Prediction module 403 is used to input N input texts into the target model and obtain the first qualifying words corresponding to each of the N data items;
[0163] The processing module 404 is used to determine the target qualifiers corresponding to each of the N data items by filtering the preset content in the first qualifiers corresponding to each of the N data items, and to extract the target qualifiers corresponding to each of the N data items.
[0164] In one possible implementation, the acquisition module 401 is specifically used to acquire M data items from the historical business information table, where M is an integer greater than zero.
[0165] Based on the data item processing operations, attribute transformations are performed on M data items to obtain M training texts;
[0166] The first model is trained using M training texts to obtain the second model, where the first model is a deep learning model.
[0167] If the second model meets the first preset condition, the second model will be used as the target model.
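The selection of the target model can be illustrated with the sketch below. It assumes that the first preset condition is the one described in paragraph [0153], namely that the trained model has the highest accuracy among the candidate deep learning models. The evaluation examples, the exact-match accuracy measure, and the names `accuracy` and `select_target_model` are assumptions for illustration; the candidate models are represented by plain callables rather than real networks.

```python
from typing import Callable, Dict, Sequence, Tuple

Example = Tuple[str, str]  # (input text, expected first qualifier)
Predictor = Callable[[str], str]


def accuracy(predict: Predictor, examples: Sequence[Example]) -> float:
    """Fraction of examples whose predicted qualifier matches exactly."""
    if not examples:
        return 0.0
    hits = sum(1 for text, expected in examples if predict(text) == expected)
    return hits / len(examples)


def select_target_model(candidates: Dict[str, Predictor],
                        examples: Sequence[Example]) -> str:
    """Return the name of the candidate with the highest accuracy."""
    return max(candidates, key=lambda name: accuracy(candidates[name], examples))


# Stand-in predictors for two hypothetical trained candidates.
examples = [("父亲身份证号", "父亲"), ("地区名称", "地区名称")]
candidates = {
    "A-1": lambda text: {"父亲身份证号": "父亲"}.get(text, text),
    "A-2": lambda text: text,
}
target_model = select_target_model(candidates, examples)
```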
[0168] In one possible implementation, the data item processing module 402 is specifically configured to perform the following data item processing operation for each of the N data items:
[0169] Retrieve Chinese fields and data elements from data items;
[0170] Filter the first specified data in the Chinese field to get the first Chinese field;
[0171] Based on the data attributes in the target model, the data elements and the first Chinese field are converted into input text in word vector format;
[0172] After performing data item processing operations on each of the N data items, N input texts are obtained.
[0173] In one possible implementation, the data item processing module 402 is specifically used to determine whether a Chinese field contains English text.
[0174] If the Chinese field contains English, the second specified data in the English is converted into Chinese through Chinese definition mapping to obtain the second Chinese field, and the first specified data in the second Chinese field is filtered out.
[0175] If the Chinese field does not contain English, filter the first specified data in the Chinese field.
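The branch on English text can be sketched as follows. The mapping entries and the characters treated as "first specified data" are hypothetical, since this part of the description does not enumerate them. The sketch assumes that the Chinese definition mapping is a lookup table keyed by upper-case English terms and that the first specified data consists of whitespace, underscores, and brackets.

```python
import re
from typing import Mapping

ENGLISH_RUN = re.compile(r"[A-Za-z]+")
# Hypothetical "first specified data": whitespace, underscores and brackets.
FIRST_SPECIFIED = re.compile(r"[\s_()（）]+")


def to_first_chinese_field(field: str, definition_map: Mapping[str, str]) -> str:
    """Map English terms to Chinese where possible, then filter specified data."""
    if ENGLISH_RUN.search(field):
        # Terms missing from the mapping are left unchanged.
        field = ENGLISH_RUN.sub(
            lambda m: definition_map.get(m.group(0).upper(), m.group(0)), field
        )
    return FIRST_SPECIFIED.sub("", field)


# Hypothetical mapping entry used only for this example.
result = to_first_chinese_field("父亲ID 号", {"ID": "身份证"})
```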
[0176] In one possible implementation, the data item processing module 402 is specifically used to determine whether the data element and the first Chinese field meet the second preset condition.
[0177] If the data element and the first Chinese field meet the second preset condition, mark the data item that corresponds to both the data element and the first Chinese field as the specified data item, and convert the data element and the first Chinese field into input text in word vector format;
[0178] If the data elements and the first Chinese field do not meet the second preset conditions, the data elements and the first Chinese field will be converted into input text in word vector format.
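The check against the second preset condition can be sketched as below. The condition itself (the first Chinese field is empty, or the data element is a name) is taken from the claims. Reading "the data element is a name" as equality with the literal data element "名称" is an assumption, as is the function name `is_specified_data_item`.

```python
# Assumed literal for the "name" data element (not confirmed by the source).
NAME_ELEMENT = "名称"


def is_specified_data_item(data_element: str, first_chinese_field: str) -> bool:
    """True if the first Chinese field is empty or the data element is a name."""
    return first_chinese_field.strip() == "" or data_element == NAME_ELEMENT


flags = [
    is_specified_data_item("公民身份号码", "父亲身份证号"),
    is_specified_data_item("公民身份号码", ""),
    is_specified_data_item("名称", "地区名称"),
]
```

In either branch, the data element and the first Chinese field are still converted into input text; the flag only determines whether a name is later predicted for the data item.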
[0179] In one possible implementation, processing module 404 is specifically configured to perform the following naming operation for each of the N data items:
[0180] Determine if a data item is a specified data item;
[0181] If the data item is a specified data item, filter out the data item and determine that the name prediction value corresponding to the data item is empty;
[0182] If the data item is not a specified data item, the name prediction value corresponding to the data item is determined based on the data element corresponding to the data item and the target qualifier corresponding to the data item.
[0183] After performing a naming operation on each of the N data items, we obtain the predicted name value for each of the N data items.
[0184] Based on the same inventive concept, this application also provides an electronic device that can realize the functions of the aforementioned qualifying word extraction system. Referring to Figure 5, the electronic device includes:
[0185] At least one processor 501 and a memory 502 connected to the at least one processor 501. In this embodiment, the specific connection medium between the processor 501 and the memory 502 is not limited. In Figure 5, the processor 501 and the memory 502 are connected via a bus 500 as an example. The bus 500 is drawn as a thick line in Figure 5; the connections between other components are shown for illustrative purposes only and are not limiting. The bus 500 can be divided into an address bus, a data bus, a control bus, etc. For ease of representation, it is drawn in Figure 5 as a single thick line, but this does not imply that there is only one bus or one type of bus. The processor 501 can also be called a controller; there is no restriction on the name.
[0186] In this embodiment, memory 502 stores instructions executable by at least one processor 501. By executing the instructions stored in memory 502, at least one processor 501 can perform the aforementioned method for extracting qualifying terms. Processor 501 can implement the functions of each module in the system shown in Figure 5.
[0187] The processor 501 is the control center of the system. It can connect to various parts of the control device through various interfaces and lines. By running or executing instructions stored in memory 502 and calling data stored in memory 502, the system can perform various functions and process data, thereby monitoring the system as a whole.
[0188] In one possible implementation, processor 501 may include one or more processing units. Processor 501 may integrate an application processor and a modem processor, wherein the application processor mainly handles the operating system, user interface, and applications, and the modem processor mainly handles wireless communication. It is understood that the modem processor may not be integrated into processor 501. In some embodiments, processor 501 and memory 502 may be implemented on the same chip; in some embodiments, they may be implemented separately on independent chips.
[0189] Processor 501 can be a general-purpose processor, such as a central processing unit (CPU), digital signal processor, application-specific integrated circuit, field-programmable gate array or other programmable logic device, discrete gate or transistor logic device, or discrete hardware component, capable of implementing or executing the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the method for extracting qualifying terms disclosed in the embodiments of this application can be directly manifested as execution by a hardware processor, or execution by a combination of hardware and software modules within the processor.
[0190] Memory 502, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. Memory 502 may include at least one type of storage medium, such as flash memory, hard disk, multimedia card, card-type memory, random access memory (RAM), static random access memory (SRAM), programmable read-only memory (PROM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), magnetic storage, magnetic disk, optical disk, etc. Memory 502 can be any other medium capable of carrying or storing desired program code in the form of instructions or data structures that can be accessed by a computer, but is not limited thereto. In the embodiments of this application, memory 502 can also be a circuit or any other device capable of implementing storage functions for storing program instructions and / or data.
[0191] By designing and programming the processor 501, the code corresponding to the qualifier extraction method described in the foregoing embodiments can be embedded into the chip, thereby enabling the chip, at runtime, to execute the steps of the qualifier extraction method of the embodiment shown in Figure 4. How to design and program the processor 501 is a technique well-known to those skilled in the art and will not be elaborated upon here.
[0192] Based on the same inventive concept, embodiments of this application also provide a storage medium storing computer instructions that, when executed on a computer, cause the computer to perform the aforementioned method for extracting qualifying words.
[0193] In some possible implementations, various aspects of the qualifying word extraction method provided in this application can also be implemented as a program product comprising program code that, when the program product is run on a device, causes the control device to perform the steps in the qualifying word extraction method according to the various exemplary embodiments of this application described above.
[0194] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0195] This application is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0196] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0197] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0198] Obviously, those skilled in the art can make various modifications and variations to this application without departing from the spirit and scope of this application. Therefore, if such modifications and variations fall within the scope of the claims of this application and their equivalents, this application also intends to include such modifications and variations.
Claims
1. A method for extracting limiting words, characterized in that, The method includes: Retrieve N data items from the current business information table, where N is a positive integer; The attributes of the N data items are transformed according to the data item processing operations to obtain N input texts; Input the N input texts into the target model to obtain the first limiting words corresponding to each of the N data items; By filtering the preset content in the first limiting words corresponding to each of the N data items, the target limiting words corresponding to each of the N data items are determined, and the target limiting words corresponding to each of the N data items are extracted; the preset content includes words with repetitive semantics, modal particles, and words that do not have limiting word meaning; For each of the N data items, perform the following naming operation: Determine whether a data item is a specified data item; wherein, the data element of the specified data item and the first Chinese field meet a second preset condition; the second preset condition is that the first Chinese field is empty or the data element is a name; If so, filter the data items and determine that the predicted name value corresponding to the data item is empty; If not, determine the name prediction value corresponding to the data item based on the data element corresponding to the data item and the target qualifying word corresponding to the data item; After performing the naming operation on each of the N data items, the predicted name value for each of the N data items is obtained.
2. The method as described in claim 1, characterized in that, Before retrieving the N data items from the current business information table, the following is also included: Retrieve M data items from the historical business information table, where M is a positive integer; The M data items are transformed according to the data item processing operations to obtain M training texts; The first model is trained using the M training texts to obtain the second model, wherein the first model is a deep learning model; If the second model meets the first preset condition, the second model will be used as the target model.
3. The method as described in claim 1, characterized in that, The step of performing attribute transformation on the N data items according to the data item processing operation to obtain N input texts includes: For each of the N data items, the following data item processing operations are performed: Retrieve Chinese fields and data elements from data items; Filter the first specified data in the Chinese field to obtain the first Chinese field; Based on the data attributes in the target model, the data elements and the first Chinese field are converted into input text in word vector format; After performing the data item processing operation on each of the N data items, N input texts are obtained.
4. The method as described in claim 3, characterized in that, The filtering of the first specified data in the Chinese field includes: Determine whether the Chinese field contains English text; If so, the second specified data in the English text is converted into Chinese through Chinese definition mapping to obtain the second Chinese field, and the first specified data in the second Chinese field is filtered out; If not, filter the first specified data in the Chinese field.
5. The method as described in claim 3, characterized in that, The step of converting the data elements and the first Chinese field into input text in word vector format includes: Determine whether the data element and the first Chinese field meet the second preset condition; If so, mark the data item that corresponds to both the data element and the first Chinese field as a specified data item, and convert the data element and the first Chinese field into input text in word vector format; If not, convert the data elements and the first Chinese field into input text in word vector format.
6. A system for extracting qualifying words, characterized in that, The system includes: The acquisition module is used to acquire N data items from the current business information table, where N is an integer greater than zero; The data item processing module is used to perform attribute transformation on the N data items according to the data item processing operation to obtain N input texts; The prediction module is used to input the N input texts into the target model to obtain the first limiting words corresponding to each of the N data items; The processing module is used to determine the target limiting words corresponding to each of the N data items by filtering the preset content in the first limiting words corresponding to each of the N data items, and extracting the target limiting words corresponding to each of the N data items; the preset content includes words with repetitive semantics, modal particles, and words without limiting word meaning; for each of the N data items, the following naming operation is performed: determining whether the data item is a specified data item; wherein, the data element of the specified data item and the first Chinese field meet the second preset condition; the second preset condition is that the first Chinese field is empty or the data element is a name; if yes, filtering the data item and determining that the name prediction value corresponding to the data item is empty; if no, determining the name prediction value corresponding to the data item based on the data element corresponding to the data item and the target limiting words corresponding to the data item; after performing the naming operation on each of the N data items, the name prediction value corresponding to each of the N data items is obtained.
7. The system as described in claim 6, characterized in that, The acquisition module is also used to acquire M data items from the historical business information table, where M is an integer greater than zero; The M data items are transformed according to the data item processing operations to obtain M training texts; The first model is trained using the M training texts to obtain the second model, wherein the first model is a deep learning model; If the second model meets the first preset condition, the second model will be used as the target model.
8. An electronic device, characterized in that, the electronic device includes: a memory, used to store a computer program; a processor, used to implement the method steps of any one of claims 1-5 when executing the computer program stored in the memory.
9. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program that, when executed by a processor, implements the steps of the method described in any one of claims 1-5.