Training data correction device

The training data correction device addresses the challenge of modifying hierarchical category structures in teacher data by identifying and removing documents with characteristic terms, enhancing the accuracy of document classification models.

JP7876710B2 · Active · Publication Date: 2026-06-19 · NTT DOCOMO INC

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Patents
Current Assignee / Owner
NTT DOCOMO INC
Filing Date
2024-04-04
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing correction methods for teacher data fail to address the modification of training data consisting of pairs of categories with a hierarchical structure and documents belonging to those categories, particularly when category structures change over time.

Method used

A training data correction device that includes an acquisition unit to identify incorrect hierarchical relationships and a deletion unit to remove documents containing characteristic terms of lower categories from training data, ensuring accurate classification by updating the hierarchical structure.

Benefits of technology

The device efficiently modifies training data to align with changing category structures, reducing misclassifications and improving the accuracy of document classification models.

✦ Generated by Eureka AI based on patent content.

Patent Text Reader

Abstract

The present invention addresses the problem of correcting teaching data made up of pairs of a category having a hierarchical structure and a document belonging to the category. A teaching data correction device (1) for correcting such teaching data comprises: a deletion data determination unit (15) that acquires category information indicating a first category, which is one category, and a second category, which is a category that has a hierarchical relationship with the first category and to which a document included in the teaching data that should belong to the first category erroneously belongs or is likely to erroneously belong, and that identifies a characteristic term in the documents that are included in the teaching data and belong to the first category indicated by the acquired category information; and a teaching data deletion unit (16) that deletes, from the teaching data, the pairs whose documents include the identified term among the documents included in the teaching data and belonging to the second category indicated by the acquired category information.

Description

Technical Field

[0001] One aspect of the present disclosure relates to a teacher data correction device that corrects teacher data.

Background Art

[0002] Patent Document 1 below discloses a method for correcting teacher data that has incorrect labels.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] The labels in the above correction method are composed of an OK label attached to the image data of normal products and an NG label attached to the image data of abnormal products. Therefore, in the above correction method, for example, it is impossible to correct teacher data consisting of a pair of a category having a hierarchical structure and a document belonging to the category.

Means for Solving the Problems

[0005] A training data correction device relating to one aspect of this disclosure is a training data correction device that corrects training data consisting of a set of categories having a hierarchical structure and documents belonging to said categories, comprising: an acquisition unit that acquires category information indicating a first category which is a single category and a second category which is a category that has a hierarchical relationship with said first category and to which documents included in training data that should belong to said first category belong incorrectly or may belong incorrectly; and a deletion unit that identifies characteristic terms in documents included in training data which belong to the first category indicated by the category information acquired by the acquisition unit, and deletes from the training data a set of documents included in said training data which belong to the second category indicated by said category information that contain the identified terms.

[0006] In this aspect, the training data consists of pairs of categories with a hierarchical structure and documents belonging to those categories. Among the documents belonging to the second category (a category that has a hierarchical relationship with the first category) included in the training data, the pairs whose documents contain terms characteristic of the documents belonging to the first category are deleted from the training data. In other words, training data consisting of pairs of categories with a hierarchical structure and documents belonging to those categories can be corrected.

Effects of the Invention

[0007] According to one aspect of this disclosure, training data consisting of pairs of hierarchical categories and documents belonging to those categories can be modified.

Brief Description of the Drawings

[0008]
[Figure 1] A diagram showing an example of the functional configuration of the training data correction device according to the embodiment.
[Figure 2] A diagram showing an example of a hierarchical category system.
[Figure 3] A diagram showing a scenario where the system example in Figure 2 has been modified.
[Figure 4] A diagram showing examples of misclassification by a document classification model trained on training data of an incorrect system example.
[Figure 5] A flowchart showing an example of the processing performed by the training data correction device according to the embodiment.
[Figure 6] A diagram showing an example of a training data table.
[Figure 7] A diagram showing an example of a table of upper-lower category pair data.
[Figure 8] A diagram showing an example of table data in which category classification results are joined horizontally to the training data.
[Figure 9] A diagram showing an example table of upper-lower category pair data with a correction-required flag.
[Figure 10] A diagram showing an example of a data table representing the feature words and feature quantities of a lower category.
[Figure 11] A diagram showing an example of deleting training data.
[Figure 12] A flowchart showing an example of the processing performed by the machine learning unit 11 and the inference unit 12.
[Figure 13] A diagram showing another example of a training data table.
[Figure 14] A diagram showing an example of a training data table.
[Figure 15] A diagram showing an example of an evaluation data table.
[Figure 16] A diagram showing an example of a table of category classification results.
[Figure 17] A diagram showing another example of table data in which category classification results are joined horizontally to the training data.
[Figure 18] A flowchart showing an example of the processing performed by the misclassification rate calculation unit 13 and the training data correction judgment unit 14.
[Figure 19] A diagram showing an example of records extracted from the example table in Figure 17.
[Figure 20] A diagram showing an example table in which a misclassification rate column has been added to the example table in Figure 7.
[Figure 21] A flowchart showing an example of the processing executed by the deletion data determination unit 15 and the teacher data deletion unit 16.
[Figure 22] A diagram showing an example table of the morphologically analyzed table data.
[Figure 23] A diagram showing an example of division of the table example in Figure 22.
[Figure 24] A diagram showing an example table of data obtained by combining all the morphological analysis columns of the tables extracted from the table examples in Figure 23.
[Figure 25] A diagram showing an example of the feature table.
[Figure 26] A diagram showing another example of a data table representing the feature words and feature quantities of the lower category.
[Figure 27] A diagram showing an example of deletion from the upper category table.
[Figure 28] A flowchart showing another example of the processing executed by the teacher data correction device according to the embodiment.
[Figure 29] A diagram showing an example of the hardware configuration of a computer used in the teacher data correction device according to the embodiment.

Embodiments for Carrying Out the Invention

[0009] Hereinafter, embodiments of the present disclosure will be described in detail with reference to the drawings. In the description of the drawings, the same reference numerals are given to the same elements, and duplicate descriptions are omitted. The embodiments described below are specific examples of the present invention, and the present invention is not limited to these embodiments unless a description specifically limits it.

[0010] FIG. 1 is a diagram showing an example of the functional configuration of a teacher data correction device 1 according to an embodiment. As shown in FIG. 1, the teacher data correction device 1 includes a storage unit 10, a machine learning unit 11 (learning unit), an inference unit 12 (learning unit), a misclassification rate calculation unit 13, a teacher data correction determination unit 14, a deletion data determination unit 15 (acquisition unit, deletion unit), and a teacher data deletion unit 16 (acquisition unit, deletion unit).

[0011] Each functional block of the training data correction device 1 is assumed to function within the training data correction device 1, but this is not a limitation. For example, some of the functional blocks may function in a separate computer device that is network-connected to the training data correction device 1, while exchanging information with the training data correction device 1 as appropriate. Furthermore, some functional blocks of the training data correction device 1 may be omitted, multiple functional blocks may be integrated into one functional block, or one functional block may be decomposed into multiple functional blocks.

[0012] The training data correction device 1 corrects training data consisting of pairs of categories with a hierarchical structure and documents belonging to those categories. The hierarchical structure of the categories may change over time.

[0013] Categories are explained first. As background, in a hierarchical category classification system, new categories may be created as needed, and subcategories may be added or deleted. For example, in the case of the news article website targeted in this embodiment, adding a separate category for highly popular topics makes it easier for more users to find the content they need on the website.

[0014] Figure 2 shows an example of a hierarchical category system. In the example system shown in Figure 2, the categories include sports, baseball, novel coronavirus, and vaccines. Each category has a hierarchical structure. For example, the sports category is a higher category than the baseball category, and conversely, the baseball category is a lower category than the sports category. Similarly, the novel coronavirus category is a higher category than the vaccine category, and conversely, the vaccine category is a lower category than the novel coronavirus category.

[0015] Each category is associated with (contains) documents belonging to that category. Specifically, the sports category is associated with articles on sumo wrestling, golf, basketball, and soccer. The baseball category is associated with articles on international baseball tournaments and professional baseball. The novel coronavirus category is associated with articles on masks and medical facilities. The vaccine category is associated with articles on pharmaceutical company F and pharmaceutical company M.

[0016] Figure 3 shows a scenario where the system example in Figure 2 is modified. As time progresses and the FIFA World Cup takes place, the popularity of soccer-related articles increases, and a new soccer category is added (separated). The soccer articles that were linked to the sports category before the soccer category was added are removed from the sports category (the link is broken) and moved to the added soccer category (linked). Specifically, articles about the World Cup, articles about the host country Q, and articles about player K, which are all soccer articles, are moved to the added soccer category. In this way, a soccer category is added during the World Cup period and removed from the category system once the trend subsides.
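The category change described for Figures 2 and 3 can be illustrated with a minimal sketch. The dictionary layout, the function name `split_off_category`, and the matching rule passed as `belongs_to_lower` are all assumptions made for illustration; the disclosure does not specify how the category system is stored.

```python
# Hypothetical in-memory category system; all names are illustrative only.
parent_of = {"Baseball": "Sports", "Vaccine": "Novel Coronavirus"}
articles = {
    "Sports": ["sumo article", "golf article", "World Cup article", "player K article"],
    "Baseball": ["professional baseball article"],
    "Novel Coronavirus": ["mask article"],
    "Vaccine": ["pharmaceutical company F article"],
}


def split_off_category(parent_of, articles, upper, lower, belongs_to_lower):
    """Add `lower` under `upper` and relink the matching articles to it."""
    parent_of[lower] = upper
    moved = [a for a in articles[upper] if belongs_to_lower(a)]
    articles[upper] = [a for a in articles[upper] if not belongs_to_lower(a)]
    articles.setdefault(lower, []).extend(moved)
    return moved


# Separate a "Soccer" category from "Sports" (the matching rule is an assumption).
moved = split_off_category(
    parent_of, articles, "Sports", "Soccer",
    lambda a: "World Cup" in a or "player K" in a,
)
```

Training data collected before such a split still labels the moved articles with the upper category, which is the mismatch the device is meant to remove.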

[0017] The challenges of the categories described above will be explained. When using machine learning to automatically classify articles by category, the presence of data with different past category structures can exacerbate misclassification. In machine learning, a larger amount of data generally leads to a more accurate model, so data other than the changed parts should be retained. It is necessary to reduce misclassification by removing only the changed parts (for example, articles related to soccer) from the past training data.

[0018] Figure 4 shows examples of misclassification by a document classification model trained on training data based on an incorrect category system. The training data shown in Figure 4, up to September 2022 (past training data) and as of December 2022 (current training data), each consists of pairs of hierarchical categories and documents belonging to those categories.

[0019] In past training data, World Cup articles were associated with the sports category. In current training data, World Cup articles have been removed from the sports category and are now associated with the newly added soccer category. If a new World Cup article is classified by a document classification model trained on past training data, it will be classified into the sports category, resulting in a misclassification. On the other hand, if a new World Cup article is classified by a document classification model trained on current training data, it will be correctly classified into the soccer category.

[0020] The training data correction device 1 can efficiently and easily correct (format) past training data when the category system changes.

[0021] The following will explain each function of the training data correction device 1 shown in Figure 1, using the flowchart shown in Figure 5 and example tables shown in Figures 6 to 11. Figure 5 is a flowchart showing an example of the process performed by the training data correction device 1.

[0022] The storage unit 10 stores training data consisting of pairs of categories with a hierarchical structure and documents belonging to those categories. Figure 6 shows an example of a training data table. In the example table shown in Figure 6, the article body, which is a document, is associated with the correct category (or its name), which is the category to which the article body belongs. The correct category may be a category that has been manually assigned by a person after reviewing the content of the article body. In the example table shown in Figure 6, the correct category for the soccer article body "Player M in the IP League of soccer..." is listed as "Sports" (it should be "Soccer"), and the goal of the process in the flowchart shown in Figure 5 is to remove this record.

[0023] The storage unit 10 stores upper-lower category pair data, which is data consisting of pairs of an upper category and a lower category. Figure 7 shows an example table of upper-lower category pair data. In the example table shown in Figure 7, the upper category (category name) and the lower category (category name) are associated.

[0024] The storage unit 10 also stores other information used in calculations by the teacher data correction device 1 (including various data described in the embodiment) and the results of calculations by the teacher data correction device 1. The information stored in the storage unit 10 may be referenced as appropriate by each function of the teacher data correction device 1.

[0025] The machine learning unit 11 trains a document classification model using the training data stored by the storage unit 10 (step S1). The document classification model is a model that classifies the category to which any input document belongs.

[0026] Following S1, the inference unit 12 inputs evaluation data (for example, the article text from the training data table example shown in Figure 6) into the trained document classification model, outputs the category classification result, and also outputs table data with the category classification result horizontally joined to the training data (step S2). Figure 8 shows an example of table data with the category classification result horizontally joined to the training data. In the example table shown in Figure 8, the article text (from the training data table example shown in Figure 6), the correct category (from the training data table example shown in Figure 6), and the category classification result described above are associated.

[0027] Following S2, the misclassification rate calculation unit 13 compares the predicted category of the table data output in S2 with the correct category based on the upper-lower category pair data stored by the storage unit 10, and calculates the misclassification rate to the upper category (described later) (step S3). Figure 9 shows an example table of upper-lower category pair data with a correction-required flag (described later). In the example table shown in Figure 9, the upper category (of the example table of upper-lower category pair data shown in Figure 7), the lower category (of the example table of upper-lower category pair data shown in Figure 7), the aforementioned misclassification rate, and the aforementioned correction-required flag are associated.

[0028] Following S3, the training data correction determination unit 14 determines whether or not a correction-required flag is present in the correction-required column of the flagged upper-lower category pair data (step S4). If it is determined in S4 that there is no correction-required flag (S4: NO), the process ends.

[0029] If it is determined in S4 that there is a correction-required flag (S4: YES), the deletion data determination unit 15 determines the words to be used for deletion as "category name + feature words" (step S5). Figure 10 shows an example of a data table representing the feature words and feature quantities of a subcategory. In the example table shown in Figure 10, the name of the subcategory, the feature words of the subcategory (described later), and the feature quantity of each feature word are associated.

[0030] Following S5, the training data deletion unit 16 deletes the lower category data that is mixed into the higher category (step S6). Figure 11 shows an example of training data deletion. In the deletion example shown in Figure 11, article text containing words included in the example table of data representing lower category feature words and features shown in Figure 10 has been deleted from the example table of training data shown in Figure 6.

[0031] After S6, the process returns to S1 and is repeated.
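The loop S1 to S6 in Figure 5 can be outlined as follows. This is only a sketch under stated assumptions: `evaluate`, `find_flagged`, and `delete_for_pair` are hypothetical callables standing in for units 11 to 16, and the `max_rounds` bound is an added safeguard that the disclosure does not mention (the flowchart ends only when no correction-required flag remains).

```python
def correct_training_data(training_data, category_pairs, evaluate,
                          find_flagged, delete_for_pair, max_rounds=10):
    """Repeat S1-S6 until no upper-lower pair is flagged as needing correction.

    evaluate(training_data) -> predictions                       (S1-S2)
    find_flagged(training_data, predictions, pairs) -> pairs     (S3-S4)
    delete_for_pair(training_data, pair) -> training_data        (S5-S6)
    max_rounds is an assumed safety bound, not part of the source.
    """
    for _ in range(max_rounds):
        predictions = evaluate(training_data)
        flagged = find_flagged(training_data, predictions, category_pairs)
        if not flagged:          # S4: NO, so the process ends
            return training_data
        for pair in flagged:     # S4: YES, so delete and go back to S1
            training_data = delete_for_pair(training_data, pair)
    return training_data
```

Passing the units in as callables keeps the control flow of Figure 5 separate from how each unit is implemented.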

[0032] The details of the machine learning unit 11 and the inference unit 12 will be explained below using the flowchart shown in Figure 12 and the example tables shown in Figures 13 to 17. Figure 12 is a flowchart showing an example of the processing performed by the machine learning unit 11 and the inference unit 12.

[0033] The machine learning unit 11 divides the previously acquired training data (stored by the storage unit 10) into K groups (K is an integer greater than or equal to 2) (step S10). Figure 13 shows another example table of training data. The example table shown in Figure 13 shows that the training data (with a similar structure to the example table of training data shown in Figure 6) is divided into three groups (K=3): group G1 (containing the 1st and 2nd records of the training data), group G2 (containing the 3rd and 4th records of the training data), and group G3 (containing the 5th and 6th records of the training data).

[0034] Following S10, the machine learning unit 11 trains a document classification model using the data of K-1 of the groups (training data) (step S11). For example, the machine learning unit 11 trains using data from group G1 and group G2. Figure 14 shows an example of a training data table. The example table shown in Figure 14 shows data from group G1 and group G2 from the example training data table shown in Figure 13.

[0035] The machine learning unit 11 repeats the learning process for S11 K times to obtain K document classification models. For example, the machine learning unit 11 obtains three document classification models: document classification model 1 trained on data from group G1 and group G2, document classification model 2 trained on data from group G1 and group G3, and document classification model 3 trained on data from group G2 and group G3.

[0036] Following S11, the inference unit 12 performs inference for all document classification models using training data (evaluation data) that was not used in training (step S12). For example, the inference unit 12 performs inference for document classification model 1 using data from group G3, for document classification model 2 using data from group G2, and for document classification model 3 using data from group G1. Figure 15 shows an example of an evaluation data table. The example table shown in Figure 15 shows the data from group G3 (evaluation data for document classification model 1) from the example training data table shown in Figure 13.

[0037] As a result of the inference in S12, the inference unit 12 outputs the category classification result. Figure 16 shows an example of a table of category classification results. In the example table shown in Figure 16, the first and second records are the results of inference using evaluation data for group G3, the third and fourth records are the results of inference using evaluation data for group G2, and the fifth and sixth records are the results of inference using evaluation data for group G1.

[0038] Next, the inference unit 12 vertically combines all the inferred category classification results, horizontally joins them to the training data, and inputs the resulting table data into the misclassification rate calculation unit 13 (step S13). Figure 17 shows another example of table data obtained by horizontally joining category classification results to training data. In the example table shown in Figure 17, the example training data table shown in Figure 13 and the example category classification result table shown in Figure 16 are combined.

[0039] The processing in S11-S12 is so-called cross-validation. For example, the training data is divided into K=5 groups, with K-1=4 groups used as training data and 1 group used as test data. The learning process is repeated K times so that every group serves as test data once. Cross-validation is performed to eliminate the possibility that the data used to calculate the misclassification rate (evaluation data) contains no, or very few, articles (news) for the correct categories "soccer" and "sports" that are subject to correction in this training data. In other words, it ensures that the evaluation data includes the article text corresponding to the higher-lower category pair data. Alternatively, the evaluation data may be defined so that it always includes the article text corresponding to the higher-lower category pair data.
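Steps S10 to S13 can be sketched as a K-fold procedure that yields one held-out prediction per training record. The function name, the `(text, correct_category)` record layout, and the `train` callable (which returns a model callable) are assumptions for illustration; the actual document classification model is not specified here. Groups are contiguous, as in the Figure 13 example, and predictions are written back by record index so that the result stays aligned with the training data for the horizontal join.

```python
def out_of_fold_predictions(records, k, train):
    """Return (text, correct, predicted) rows using K-fold cross-validation.

    records: list of (text, correct_category) tuples.
    train:   hypothetical callable, train(rows) -> model, where model(text)
             returns a predicted category.
    """
    if k < 2:
        raise ValueError("K must be an integer greater than or equal to 2")

    # S10: divide the record indices into K contiguous groups.
    size, remainder = divmod(len(records), k)
    groups, start = [], 0
    for g in range(k):
        end = start + size + (1 if g < remainder else 0)
        groups.append(list(range(start, end)))
        start = end

    predictions = [None] * len(records)
    for held_out in range(k):
        # S11: train on the other K-1 groups.
        train_rows = [records[i]
                      for g in range(k) if g != held_out
                      for i in groups[g]]
        model = train(train_rows)
        # S12: infer on the group that was not used for training.
        for i in groups[held_out]:
            predictions[i] = model(records[i][0])

    # S13: join the predictions to the training data row by row.
    return [(text, correct, pred)
            for (text, correct), pred in zip(records, predictions)]
```

Because every record is predicted by a model that never saw it, the misclassification rate can be computed over the whole training set rather than over a single evaluation split.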

[0040] The details of the misclassification rate calculation unit 13 and the training data correction judgment unit 14 will be explained below using the flowchart shown in Figure 18 and the example tables shown in Figures 19 and 20. Figure 18 is a flowchart showing an example of the processes performed by the misclassification rate calculation unit 13 and the training data correction judgment unit 14.

[0041] Following S13, the misclassification rate calculation unit 13 compares the correct category and predicted category pairs in the table data (table data obtained by horizontally joining category classification results to training data) obtained by the machine learning unit 11 and the inference unit 12, and calculates the misclassification rate (step S20). Specifically, the misclassification rate calculation unit 13 calculates the misclassification rate to the higher category based on the higher-lower category pair data prepared in advance (stored by the storage unit 10).

[0042] More specifically, first, the misclassification rate calculation unit 13 extracts one record (hereinafter referred to as "upper / lower pair") from the upper-lower category pair data. Next, the misclassification rate calculation unit 13 extracts records from the table data obtained from the machine learning unit 11 and the inference unit 12 (table data obtained by horizontally joining category classification results to training data) where the correct category corresponds to the lower category of the upper / lower pair. It then calculates the percentage of the extracted records where the predicted category-correct category pair matches the order of the upper / lower pair, and adds this to the misclassification rate column in the upper-lower category pair data (step S20). For example, the misclassification rate calculation unit 13 calculates the misclassification rate for the correct category "soccer" to the higher category "sports". The misclassification rate calculation unit 13 performs the same operation for all upper-lower category pair data.

[0043] Figure 19 shows an example of records extracted from the example table in Figure 17. The example table in Figure 19 contains the records whose correct category in the example table in Figure 17 corresponds to the subcategory "soccer" of one upper / lower pair.

[0044] Figure 20 shows an example table from Figure 7 with an added misclassification rate column. In the example table shown in Figure 20, the misclassification rates calculated in S20 are newly associated with the example table from Figure 7.

[0045] Following S20, the misclassification rate calculation unit 13 flags the records of the upper-lower category pair data (see the table example shown in Figure 20) whose value in the misclassification rate column is higher than a predetermined threshold, and outputs the upper-lower category pair data with the correction-required flag (see the table example shown in Figure 9) (step S21).
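Steps S20 and S21 can be sketched as follows. The row layout `(correct, predicted)`, the output field names, and the threshold value used in the example are assumptions; the disclosure only states that a predetermined threshold is used. Treating a lower category with no records as rate 0.0 is likewise an assumption.

```python
def misclassification_rates(rows, category_pairs, threshold):
    """Compute, per (upper, lower) pair, the share of records whose correct
    category is `lower` but whose predicted category is `upper` (S20),
    and flag pairs whose rate exceeds `threshold` (S21).

    rows: list of (correct_category, predicted_category) tuples.
    """
    result = []
    for upper, lower in category_pairs:
        lower_rows = [r for r in rows if r[0] == lower]
        wrong = sum(1 for _correct, predicted in lower_rows if predicted == upper)
        # Assumption: a pair with no lower-category records gets rate 0.0.
        rate = wrong / len(lower_rows) if lower_rows else 0.0
        result.append({
            "upper": upper,
            "lower": lower,
            "misclassification_rate": rate,
            "needs_correction": rate > threshold,
        })
    return result


rows = [
    ("Soccer", "Sports"), ("Soccer", "Sports"), ("Soccer", "Sports"),
    ("Soccer", "Soccer"), ("Sports", "Sports"), ("Vaccine", "Vaccine"),
]
pairs = [("Sports", "Soccer"), ("Novel Coronavirus", "Vaccine")]
flagged_pairs = misclassification_rates(rows, pairs, threshold=0.5)
```

In this example three of the four "Soccer" records are predicted as "Sports", so only the Sports / Soccer pair is flagged.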

[0046] Following S21, the training data correction determination unit 14 determines whether or not there is a correction flag in the column requiring correction (step S22). If there is a correction flag in the column requiring correction (step S22: YES), processing proceeds to the deletion data determination unit 15. If there is no correction flag in the column requiring correction (step S22: NO), the process ends.

[0047] The details of the deletion data determination unit 15 and the training data deletion unit 16 will be explained below using the flowchart shown in Figure 21 and the example tables shown in Figures 22 to 27. Figure 21 is a flowchart showing an example of the process performed by the deletion data determination unit 15 and the training data deletion unit 16.

[0048] Following S22:YES, the deletion data determination unit 15 calculates words characteristic of the subcategory.

[0049] Specifically, first, the deletion data determination unit 15 performs morphological analysis on the table data obtained from the machine learning unit 11 and the inference unit 12 (table data obtained by horizontally joining category classification results to training data) (step S30). Figure 22 shows an example of the table data after morphological analysis. The example table shown in Figure 22 is the same as the example table data obtained by horizontally joining category classification results to training data shown in Figure 17, but with the morphological analysis results of each data in the article body column added as morphological analysis columns.

[0050] Next, the deletion data determination unit 15 divides the morphologically analyzed table data by correct category (step S31). Alternatively, the deletion data determination unit 15 may divide the data into lower categories, higher categories, and others based on the higher category-lower category pair data. Next, the deletion data determination unit 15 extracts the records with the "Needs Correction" flag one at a time from the higher category-lower category pair data obtained from the misclassification rate calculation unit 13 (each such record is hereinafter referred to as a "Higher / Lower Pair Needing Correction"). The deletion data determination unit 15 compares the extracted Higher / Lower Pair Needing Correction with the divided table data and extracts the tables other than the table corresponding to the higher category of that record (step S32). Figure 23 shows an example of the division of the table example in Figure 22. In Figure 23, the example table in Figure 22 has been divided by the correct categories "Sports," "Soccer," and "Novel Coronavirus" (referred to as the higher-level category table, lower-level category table, and novel coronavirus table, respectively). For the Higher / Lower Pair Needing Correction consisting of the higher-level category "Sports" and the lower-level category "Soccer," the lower-level category table and the novel coronavirus table, which are the tables other than the table corresponding to the higher-level category "Sports," have been extracted.
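Steps S31 and S32 amount to grouping rows by correct category and setting aside the higher category's table. A minimal sketch follows, assuming each row is a dictionary with hypothetical keys `text`, `correct`, and `tokens` (the latter holding the output of a morphological analyzer, which is not reproduced here).

```python
from collections import defaultdict


def split_and_extract(rows, pair_needing_correction):
    """S31: split rows into one table per correct category.
    S32: extract every table except the one for the pair's higher category.
    """
    tables = defaultdict(list)
    for row in rows:
        tables[row["correct"]].append(row)

    upper, _lower = pair_needing_correction
    extracted = {cat: table for cat, table in tables.items() if cat != upper}
    return dict(tables), extracted


rows = [
    {"text": "Golf tournament results", "correct": "Sports",
     "tokens": ["Golf", "tournament", "results"]},
    {"text": "World Cup opens in Country Q", "correct": "Soccer",
     "tokens": ["World Cup", "opens", "Country Q"]},
    {"text": "Mask supply update", "correct": "Novel Coronavirus",
     "tokens": ["Mask", "supply", "update"]},
]
tables, extracted = split_and_extract(rows, ("Sports", "Soccer"))
```

The higher category table is left out of the extraction because it may still contain lower category articles, which would distort the feature words computed in the following steps.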

[0051] Following S32, the deletion data determination unit 15 calculates words characteristic of the subcategory.

[0052] Specifically, first, the deletion data determination unit 15 combines all the morphological analysis columns of each category table extracted in S32 (step S33). Figure 24 shows an example table of data obtained by combining all the morphological analysis columns of the extracted tables from the example table in Figure 23. The example table shown in Figure 24 includes an example table of data obtained by combining all the morphological analysis columns of the subordinate category tables from the example table in Figure 23, and an example table of data obtained by combining all the morphological analysis columns of the novel coronavirus category table from the example table in Figure 23.

[0053] Next, the deletion data determination unit 15 calculates the importance (TFIDF value) of each word within its category based on the TFIDF calculation formula (step S34). An example of the TFIDF calculation formula is shown below.

$$\mathrm{TFIDF}_{w_i,d_j} = \mathrm{TF}_{w_i,d_j} \times \mathrm{IDF}_{w_i}$$

$$\mathrm{TF}_{w_i,d_j} = \text{frequency of occurrence of the word } w_i \text{ in document } d_j$$

$$\mathrm{IDF}_{w_i} = \log\left(\frac{1 + J}{\text{number of documents containing the word } w_i}\right)$$

Here, J is the total number of documents.

[0054] In this case, document d represents the morphological analysis results combined above, and j represents each category. In other words, dj is the combined morphological analysis result of each category j.

[0055] The deletion data determination unit 15 performs the above calculations and obtains a feature table. Figure 25 shows an example of the feature table. In the example table shown in Figure 25, the importance (TFIDF value) of each word is associated with each category.
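
The feature table calculation of steps S33 and S34 can be sketched as follows. This is a sketch under stated assumptions, not the implementation of the disclosure: the function name `tfidf_feature_table` and the input layout are hypothetical, the raw occurrence count is used as the term frequency, and the natural logarithm is used because the base is not specified.

```python
import math
from collections import Counter

def tfidf_feature_table(category_docs):
    """Return {category: {word: TFIDF}} for combined per-category documents.

    category_docs maps a category name to a list of token lists,
    one token list per article (the morphological analysis column).
    """
    # Step S33: combine all token lists of a category into one document d_j.
    combined = {cat: [tok for doc in docs for tok in doc]
                for cat, docs in category_docs.items()}
    total_docs = len(combined)  # J: one combined document per category

    # Number of combined documents that contain each word.
    doc_freq = Counter()
    for tokens in combined.values():
        doc_freq.update(set(tokens))

    # Step S34: TFIDF = TF * log((1 + J) / document frequency).
    table = {}
    for cat, tokens in combined.items():
        term_freq = Counter(tokens)  # raw counts as TF (an assumption)
        table[cat] = {word: count * math.log((1 + total_docs) / doc_freq[word])
                      for word, count in term_freq.items()}
    return table

# Hypothetical token lists for the two extracted tables.
features = tfidf_feature_table({
    "Soccer": [["Player M", "World Cup"], ["Player M", "match"]],
    "Novel Coronavirus": [["vaccine", "match"]],
})
```

In this toy input, "match" appears in both combined documents and therefore receives a lower value than "Player M", which appears twice and only in the "Soccer" document.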

[0056] Next, the training data deletion unit 16 deletes articles related to the lower category from the higher category.

[0057] Specifically, first, the deletion data determination unit 15 extracts, from the feature table obtained in the previous step, the records corresponding to the lower category of the Higher / Lower Pair Needing Correction, sorts them in descending order by TFIDF value, and extracts the top four words (step S35). Next, the training data deletion unit 16 uses a total of five characteristic words, namely the top four words plus the lower category name, to delete by keyword matching the lower category articles that are mixed into the higher category table (step S36).

[0058] Figure 26 shows another example table of data representing the characteristic words and features of a subcategory. The example table in Figure 26 includes the subcategory name ("Soccer") and the characteristic words of the subcategory, which include the top four words sorted in descending order by TFIDF value ("Player M", "Country Q", "Player K", and "World Cup") and the features of each of those words.

[0059] Figure 27 shows an example of deletion in a higher-level category table. In the deletion example shown in Figure 27, records of article text containing the word ("Player M") which is included in the example table of data representing characteristic words and features of the lower-level category shown in Figure 26, have been deleted from the higher-level category table shown in Figure 23.
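
Steps S35 and S36 can be sketched as below. The function names, the `text` field, and the TFIDF values are hypothetical, and case-sensitive substring matching is only one possible reading of "keyword matching"; the disclosure does not specify the matching rule.

```python
def characteristic_words(feature_row, category_name, top_n=4):
    """Step S35: the category name plus the top_n words by TFIDF value."""
    ranked = sorted(feature_row, key=feature_row.get, reverse=True)
    return [category_name] + ranked[:top_n]

def delete_mixed_in(higher_table, keywords):
    """Step S36: drop higher category records whose text contains any keyword."""
    return [row for row in higher_table
            if not any(word in row["text"] for word in keywords)]

# Hypothetical TFIDF values for the lower category "Soccer".
soccer_features = {"Player M": 5.0, "Country Q": 4.0, "Player K": 3.0,
                   "World Cup": 2.0, "match": 0.5}
keywords = characteristic_words(soccer_features, "Soccer")

# Hypothetical higher category table for "Sports".
sports_table = [
    {"category": "Sports", "text": "Player M signs with a new club"},
    {"category": "Sports", "text": "Marathon record broken"},
    {"category": "Sports", "text": "Soccer league opens this weekend"},
    {"category": "Sports", "text": "Tennis match postponed"},
]
cleaned = delete_mixed_in(sports_table, keywords)
```

Limiting the keywords to the highest-ranked words keeps general words such as "match" out of the list, so the tennis article in this toy table is retained.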

[0060] Next, the training data deletion unit 16 vertically joins the higher category table, the lower category table, and the other category tables as modified in the previous step (step S37). The training data deletion unit 16 performs these operations for all records of the higher category-lower category pair data that have the "Needs Correction" flag. Once processing is complete for all records, it deletes the morphological analysis column and the predicted category column, and returns the resulting data to the machine learning unit 11 and the inference unit 12 (step S38). In other words, the training data deletion unit 16 outputs the corrected (formatted) training data (or has it stored by the storage unit 10).
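
A minimal sketch of steps S37 and S38 follows, assuming each table is a list of dictionaries and that the working columns are named `tokens` and `predicted`; these names and the function name are illustrative and do not appear in the disclosure.

```python
def rebuild_training_data(tables, work_columns=("tokens", "predicted")):
    """Steps S37-S38: stack the per-category tables and drop working columns."""
    # Step S37: vertical join of all category tables.
    merged = [row for rows in tables.values() for row in rows]
    # Step S38: remove the morphological analysis and predicted category columns.
    return [{key: value for key, value in row.items() if key not in work_columns}
            for row in merged]

# Hypothetical tables after the deletion step.
tables = {
    "Sports": [{"category": "Sports", "text": "Marathon record broken",
                "tokens": ["marathon", "record"], "predicted": "Sports"}],
    "Soccer": [{"category": "Soccer", "text": "Player M scores twice",
                "tokens": ["Player M", "scores"], "predicted": "Sports"}],
}
training_data = rebuild_training_data(tables)
```

The result again has only the category and the document text, which is the form in which the training data is handed back for retraining.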

[0061] The following describes other aspects of each functional block.

[0062] The machine learning unit 11 and the inference unit 12 may learn and output a document classification model that classifies the category to which any input document belongs, based on the training data from which documents have been deleted by the training data deletion unit 16.

[0063] The deletion data determination unit 15 may acquire category information indicating a first category, which is a single category, and a second category, which is a category hierarchically related to the first category, to which a document included in the training data that should belong to the first category belongs incorrectly or may belong incorrectly. The unit may also identify characteristic terms in documents included in the training data that belong to the first category indicated by the acquired category information. The second category may be a higher hierarchical level than the first category.

[0064] The deletion data determination unit 15 may acquire category information indicating the first and second categories if the misclassification rate of a document classification model, which classifies the category to which any input document belongs and is trained on the training data, meets a predetermined standard. The misclassification rate may be the probability that a document that should belong to the first category is incorrectly classified as belonging to the second category. Cross-validation may be performed during training based on the training data.
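
The misclassification rate described here can be sketched as a simple ratio. The function names, the field names, and the threshold value 0.3 are assumptions; the disclosure only states that the rate must meet a predetermined standard, and a strict "greater than" comparison is used here as one possible reading.

```python
def misclassification_rate(rows, first_category, second_category):
    """Share of documents whose correct category is first_category
    but which the model classified as second_category."""
    first_rows = [r for r in rows if r["category"] == first_category]
    if not first_rows:
        return 0.0
    wrong = sum(1 for r in first_rows if r["predicted"] == second_category)
    return wrong / len(first_rows)

def needs_correction(rate, threshold=0.3):
    """Hypothetical standard: flag the pair when the rate exceeds the threshold."""
    return rate > threshold

# Hypothetical inference results: correct category versus predicted category.
predictions = [
    {"category": "Soccer", "predicted": "Sports"},
    {"category": "Soccer", "predicted": "Soccer"},
    {"category": "Soccer", "predicted": "Sports"},
    {"category": "Soccer", "predicted": "Soccer"},
    {"category": "Sports", "predicted": "Sports"},
]
rate = misclassification_rate(predictions, "Soccer", "Sports")
print(rate, needs_correction(rate))  # 0.5 True
```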

[0065] The training data deletion unit 16 may delete from the training data a set of documents that are included in the training data, belong to the second category indicated by the category information obtained by the deletion data determination unit 15, and contain a term identified by the deletion data determination unit 15. The training data deletion unit 16 may also perform deletion on the training data if the misclassification rate of a document classification model that classifies the category to which any input document belongs, and which has been trained based on the training data, meets a predetermined standard.

[0066] The training data deletion unit 16 may delete from the training data a set of documents that are included in the training data and belong to the second category indicated by the category information obtained by the deletion data determination unit 15, and which include a term identified by the deletion data determination unit 15 and a name indicating the first category indicated by the category information.

[0067] Next, another example of the processing performed by the training data correction device 1 is described with reference to Figure 28. Figure 28 is a flowchart showing this example.

[0068] First, the deletion data determination unit 15 obtains category information (upper-lower category pair data with a correction flag) indicating a first category, which is a single category, and a second category, which is a category that has a hierarchical relationship with the first category, and to which documents included in the training data that should belong to the first category belong incorrectly or may belong incorrectly (step S40). Next, the deletion data determination unit 15 identifies characteristic terms in documents included in the training data that belong to the first category indicated by the category information obtained in S40, and the training data deletion unit 16 deletes from the training data a set of documents that include the identified terms from among the documents included in the training data that belong to the second category indicated by the category information (step S41).

[0069] Next, the effects and benefits of the training data correction device 1 according to this embodiment will be explained.

[0070] The training data correction device 1 corrects training data consisting of a set of categories having a hierarchical structure and documents belonging to those categories, and includes a deletion data determination unit 15 and a training data deletion unit 16. The deletion data determination unit 15 acquires category information indicating a first category, which is a single category, and a second category, which is a category hierarchically related to the first category and to which documents included in the training data that should belong to the first category belong incorrectly or may belong incorrectly, and identifies characteristic terms in documents included in the training data that belong to the first category indicated by the acquired category information. The training data deletion unit 16 deletes from the training data the set of documents that are included in the training data, belong to the second category indicated by the category information acquired by the deletion data determination unit 15, and contain the terms identified by the deletion data determination unit 15. With this configuration, the set of documents that belong to the second category (a category hierarchically related to the first category) and that contain terms characteristic of documents belonging to the first category is deleted from the training data. In other words, training data consisting of a set of categories having a hierarchical structure and documents belonging to those categories can be corrected.

[0071] Furthermore, in the training data correction device 1, the hierarchical structure of the categories may change over time. This configuration allows the training data to be corrected more appropriately so that it conforms to the changed hierarchical structure, even as time passes.

[0072] Furthermore, in the training data correction device 1, the second category may be a higher-level category than the first category. With this configuration, even if the training data contains documents belonging to the lower-level first category in the higher-level second category, those documents can be appropriately deleted.

[0073] Furthermore, in the training data correction device 1, if the misclassification rate of the document classification model that classifies the category to which any input document belongs, and which has been learned based on training data, meets a predetermined standard, deletion may be performed on the training data by the training data deletion unit 16 (and the deletion data determination unit 15). This configuration makes it possible to more reliably correct the training data if there are defects in the training data.

[0074] Furthermore, in the training data correction device 1, cross-validation may be performed during learning based on the training data. This configuration avoids the situation in which the evaluation data used to calculate the misclassification rate contains none, or very few, of the documents that are to be corrected in the training data.
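
The benefit stated above can be illustrated with a k-fold split, which is one common form of cross-validation; the disclosure does not specify which form is used, and the function name and fold assignment below are assumptions. Because every document is held out exactly once, every document contributes to the misclassification rate.

```python
def k_fold_splits(n_documents, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    # Assign document i to fold i % k (an illustrative, unshuffled assignment).
    folds = [list(range(start, n_documents, k)) for start in range(k)]
    for held_out, test_idx in enumerate(folds):
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train_idx, test_idx

splits = list(k_fold_splits(10, 3))
# Collect every index that was used as evaluation data in some fold.
evaluated = sorted(i for _, test_idx in splits for i in test_idx)
```

In practice the documents would usually be shuffled, or stratified by category, before being assigned to folds.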

[0075] Furthermore, in the training data correction device 1, the misclassification rate may be the probability that a document that should belong to the first category is incorrectly classified as belonging to the second category. This configuration makes it possible to more reliably correct training data in which documents that should belong to the first category are incorrectly classified as belonging to the second category.

[0076] Furthermore, in the training data correction device 1, the deletion data determination unit 15 may acquire category information indicating the first and second categories if the misclassification rate, which is the probability that a document that should belong to the first category is incorrectly classified as belonging to the second category, of a document classification model that classifies the category to which any input document belongs, and which has been learned based on training data, meets a predetermined standard. This configuration makes it possible to more reliably correct the training data when there are deficiencies in the training data.

[0077] Furthermore, in the training data correction device 1, the deletion data determination unit 15 may identify characteristic terms in documents included in the training data that belong to the first category indicated by the acquired category information, and the training data deletion unit 16 may delete from the training data a set of documents that include the terms identified by the deletion data determination unit 15 and the name indicating the first category, among documents included in the training data that belong to the second category indicated by the category information acquired by the deletion data determination unit 15. With this configuration, since the set of documents that include the name indicating the first category is also deleted from the training data, the training data can be corrected with greater accuracy.

[0078] Furthermore, the training data correction device 1 may also include a machine learning unit 11 that learns and outputs a document classification model that classifies the category to which any input document belongs, based on the training data from which documents have been deleted by the training data deletion unit 16. This configuration can improve the accuracy of the document classification model.

[0079] The training data correction device 1 relates to the automation of formatting training data. After training with inaccurate training data containing a mix of old and new categories, the training data correction device 1 calculates the misclassification rate to higher categories, identifies higher and lower categories that require correction of the training data based on the calculated misclassification rate, identifies words characteristic of lower categories, and deletes documents related to lower categories from past training data, thereby correcting the training data quickly and at low cost and improving the accuracy of the model.

[0080] According to the training data correction device 1, from a user's perspective, the effect is that articles are correctly classified into subcategories, eliminating the situation where articles are scattered across both higher and lower categories. From an operator's perspective, the effect is that data formatting can be easily performed even when a new category is added.

[0081] The training data correction device 1 of this disclosure may have the following configuration.

[0082] [A] A training data correction device that deletes training data of lower categories mixed in with training data of higher categories, comprising: a machine learning and inference unit that trains a document classification model using training data, inputs the text data constituting the training data into the trained document classification model, and outputs a category classification result; a misclassification rate calculation unit that compares the output category classification result with the categories previously assigned to the training data, and calculates the misclassification rate to the higher category based on the higher-lower category pair data; a training data correction determination unit that determines whether or not to correct the training data based on whether the misclassification rate to the higher category is higher than a threshold; and a training data correction unit that corrects the training data by deleting training data of lower categories mixed in with training data of higher categories, using characteristic words that represent the lower categories.

[0083] [B] The training data correction device described in [A], further comprising a deletion data determination unit that extracts the characteristic words representing the lower categories, wherein the training data correction unit corrects the training data using the characteristic words and category names obtained using the deletion data determination unit.

[0084] The training data correction device 1 of this disclosure may have the following configuration.

[0085] [1] A training data correction device that corrects training data consisting of a set of categories having a hierarchical structure and documents belonging to those categories, comprising: an acquisition unit that acquires category information indicating a first category, which is one of the categories, and a second category, which is a category hierarchically related to the first category and to which a document included in the training data that should belong to the first category belongs incorrectly or may belong incorrectly; and a deletion unit that identifies characteristic terms in documents included in the training data that belong to the first category indicated by the category information acquired by the acquisition unit, and deletes from the training data the set of documents that include the identified terms among the documents included in the training data that belong to the second category indicated by the category information.

[0086] [2] The training data correction device described in [1], wherein the hierarchical structure of the categories changes over time.

[0087] [3] The training data correction device described in [1] or [2], wherein the second category is at a higher level than the first category.

[0088] [4] The training data correction device described in any one of [1] to [3], wherein the deletion unit performs deletion on the training data when the misclassification rate of a document classification model, which classifies the category to which any input document belongs and has been trained based on the training data, meets a predetermined standard.

[0089] [5] The training data correction device described in [4], wherein cross-validation is performed in the training based on the training data.

[0090] [6] The training data correction device described in [4] or [5], wherein the misclassification rate is the probability that a document that should belong to the first category is incorrectly classified as belonging to the second category.

[0091] [7] The training data correction device described in any one of [1] to [6], wherein the acquisition unit acquires category information indicating the first category and the second category when the misclassification rate of a document classification model, which classifies the category to which any input document belongs and has been trained based on the training data, meets a predetermined standard, the misclassification rate being the probability that a document that should belong to the first category is incorrectly classified as belonging to the second category.

[0092] [8] The training data correction device described in any one of [1] to [7], wherein the deletion unit identifies characteristic terms in documents included in the training data that belong to the first category indicated by the category information acquired by the acquisition unit, and deletes from the training data the set of documents that include the identified terms and the name indicating the first category among the documents included in the training data that belong to the second category indicated by the category information.

[0093] [9] The training data correction device described in any one of [1] to [8], further comprising a learning unit that learns and outputs a document classification model that classifies the category to which any input document belongs, based on the training data from which documents have been deleted by the deletion unit.

[0094] The block diagrams used in the description of the above embodiments show functional units. These functional blocks (components) are realized by any combination of at least one of hardware and software. Furthermore, the method of realizing each functional block is not particularly limited. That is, each functional block may be realized using one device that is physically or logically coupled, or it may be realized using two or more physically or logically separated devices that are directly or indirectly connected (for example, using wired or wireless connections). A functional block may also be realized by combining the above one device or the above multiple devices with software.

[0095] Functions include, but are not limited to, judging, deciding, determining, calculating, computing, processing, deriving, investigating, searching, confirming, receiving, transmitting, outputting, accessing, resolving, selecting, choosing, establishing, comparing, assuming, expecting, regarding, broadcasting, notifying, communicating, forwarding, configuring, reconfiguring, allocating (mapping), and assigning. For example, a functional block (component) that performs transmission is called a transmitting unit or transmitter. As mentioned above, the method of implementation is not particularly limited.

[0096] For example, the training data correction device 1 in one embodiment of the present disclosure may function as a computer that processes the training data correction method of the present disclosure. Figure 29 is a diagram showing an example of the hardware configuration of the training data correction device 1 according to one embodiment of the present disclosure. The above-described training data correction device 1 may be physically configured as a computer device including a processor 1001, memory 1002, storage 1003, communication device 1004, input device 1005, output device 1006, bus 1007, etc.

[0097] In the following explanation, the term "device" can be read as "circuit," "unit," etc. The hardware configuration of the training data correction device 1 may include one or more of each of the devices shown in the figure, or it may be configured to omit some of the devices.

[0098] Each function in the training data correction device 1 is realized by loading predetermined software (programs) onto hardware such as the processor 1001 and memory 1002, which causes the processor 1001 to perform calculations, control communication by the communication device 1004, and control at least one of the reading and writing of data in the memory 1002 and storage 1003.

[0099] The processor 1001 controls the entire computer, for example, by running the operating system. The processor 1001 may be composed of a central processing unit (CPU) that includes interfaces with peripheral devices, control units, arithmetic units, registers, etc. For example, the machine learning unit 11, inference unit 12, misclassification rate calculation unit 13, training data correction determination unit 14, deletion data determination unit 15, and training data deletion unit 16 described above may be implemented by the processor 1001.

[0100] Furthermore, the processor 1001 reads programs (program code), software modules, data, etc., from at least one of the storage 1003 and the communication device 1004 into the memory 1002 and executes various processes accordingly. The program used is one that causes the computer to execute at least a part of the operations described in the above embodiment. For example, the machine learning unit 11, the inference unit 12, the misclassification rate calculation unit 13, the training data correction determination unit 14, the deletion data determination unit 15, and the training data deletion unit 16 may be implemented by a control program stored in the memory 1002 and running on the processor 1001, and other functional blocks may be implemented similarly. The above-described various processes have been explained as being executed by one processor 1001, but they may be executed simultaneously or sequentially by two or more processors 1001. The processor 1001 may be implemented by one or more chips. The program may also be transmitted from a network via a telecommunications line.

[0101] Memory 1002 is a computer-readable recording medium and may consist of at least one of the following: ROM (Read Only Memory), EPROM (Erasable Programmable ROM), EEPROM (Electrically Erasable Programmable ROM), RAM (Random Access Memory), etc. Memory 1002 may also be called a register, cache, main memory, etc. Memory 1002 can store executable programs (program code), software modules, etc., for carrying out the training data correction method according to one embodiment of the present disclosure.

[0102] Storage 1003 is a computer-readable recording medium and may consist of at least one of the following: an optical disc such as a CD-ROM (Compact Disc ROM), a hard disk drive, a flexible disk, a magneto-optical disk (e.g., a compact disc, a digital versatile disc, a Blu-ray® disc), a smart card, flash memory (e.g., a card, a stick, a key drive), a floppy® disk, a magnetic strip, etc. Storage 1003 may also be called an auxiliary storage device. The above-mentioned storage medium may be, for example, a database, server, or other suitable medium including at least one of memory 1002 and storage 1003.

[0103] The communication device 1004 is hardware (a transmitting and receiving device) for communicating between computers via at least one of a wired network and a wireless network, and is also referred to as a network device, network controller, network card, communication module, etc. The communication device 1004 may be configured to include a high-frequency switch, duplexer, filter, frequency synthesizer, etc., in order to implement at least one of frequency division duplex (FDD) and time division duplex (TDD). For example, the machine learning unit 11, inference unit 12, misclassification rate calculation unit 13, training data correction determination unit 14, deletion data determination unit 15, and training data deletion unit 16 described above may be implemented by the communication device 1004.

[0104] The input device 1005 is an input device that accepts input from an external source (e.g., a keyboard, mouse, microphone, switch, button, sensor, etc.). The output device 1006 is an output device that outputs to an external source (e.g., a display, speaker, LED lamp, etc.). The input device 1005 and the output device 1006 may be configured as an integrated unit (e.g., a touch panel).

[0105] Furthermore, each device, such as the processor 1001 and memory 1002, is connected by a bus 1007 for communicating information. The bus 1007 may be configured using a single bus, or different buses may be configured for each device.

[0106] Furthermore, the training data correction device 1 may be configured to include hardware such as a microprocessor, a digital signal processor (DSP), an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field Programmable Gate Array), and some or all of each functional block may be realized by such hardware. For example, the processor 1001 may be implemented using at least one of these hardware components.

[0107] The notification of information is not limited to the manner / embodiments described herein and may be carried out by other means.

[0108] Each aspect / embodiment described in this disclosure may be applied to at least one of the following systems: LTE (Long Term Evolution), LTE-A (LTE-Advanced), SUPER 3G, IMT-Advanced, 4G (4th generation mobile communication system), 5G (5th generation mobile communication system), FRA (Future Radio Access), NR (New Radio), W-CDMA (registered trademark), GSM (registered trademark), CDMA2000, UMB (Ultra Mobile Broadband), IEEE 802.11 (Wi-Fi (registered trademark)), IEEE 802.16 (WiMAX (registered trademark)), IEEE 802.20, UWB (Ultra-WideBand), Bluetooth (registered trademark), and other appropriate systems, as well as next-generation systems extended based thereon. Furthermore, multiple systems may be applied in combination (for example, a combination of at least one of LTE and LTE-A with 5G).

[0109] The processing procedures, sequences, flowcharts, etc., of each aspect / embodiment described herein may be reordered, provided they are consistent with each other. For example, the methods described herein present various step elements in an exemplary order and are not limited to that specific order.

[0110] Input and output information may be stored in a specific location (e.g., memory) or managed using a management table. Input and output information may be overwritten, updated, or appended to. Output information may be deleted. Input information may be transmitted to other devices.

[0111] The determination may be made by a value represented by 1 bit (0 or 1), by a boolean value (true or false), or by a numerical comparison (for example, a comparison with a predetermined value).

[0112] Each aspect / embodiment described herein may be used individually, in combination, or switched between as needed during implementation. Furthermore, notification of specific information (e.g., notification that "it is X") is not limited to explicit notification, but may also be implicit (e.g., by not providing such notification).

[0113] Although the present disclosure has been described in detail above, it will be clear to those skilled in the art that the present disclosure is not limited to the embodiments described herein. The present disclosure can be implemented in modified and altered forms without departing from the intent and scope of the present disclosure as defined by the claims. Therefore, the descriptions in the present disclosure are illustrative and not intended to be restrictive in any way.

[0114] Software should be broadly interpreted to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executable files, execution threads, procedures, functions, and so on, whether they are called software, firmware, middleware, microcode, hardware description languages, or by any other name.

[0115] Furthermore, software, instructions, information, etc., may be transmitted and received via a transmission medium. For example, if software is transmitted from a website, server, or other remote source using at least one of wired technology (such as coaxial cable, fiber optic cable, twisted pair, or digital subscriber line (DSL)) and wireless technology (such as infrared or microwave), then at least one of these wired and wireless technologies is included in the definition of a transmission medium.

[0116] The information, signals, etc. described in this disclosure may be represented using any of the various different techniques. For example, the data, instructions, commands, information, signals, bits, symbols, chips, etc. that may be referred to throughout the above description may be represented by voltage, current, electromagnetic waves, magnetic fields or magnetic particles, optical fields or photons, or any combination thereof.

[0117] In addition, terms used in this disclosure and terms necessary for understanding this disclosure may be replaced with terms having the same or similar meaning.

[0118] The terms “system” and “network” as used in this disclosure are interchangeable.

[0119] Furthermore, the information, parameters, etc., described in this disclosure may be expressed using absolute values, relative values from a predetermined value, or other corresponding information.

[0120] The names used for the parameters described above are not restrictive in any way. Furthermore, the formulas and other expressions using these parameters may differ from those expressly disclosed in this disclosure.

[0121] As used in this disclosure, the terms “judging” and “determining” may encompass a wide variety of actions. “Judging” and “determining” may include, for example, regarding judging, calculating, computing, processing, deriving, investigating, looking up, searching, or inquiring (e.g., searching in a table, database, or other data structure), or ascertaining as having been “judged” or “determined.” “Judging” and “determining” may also include, for example, regarding receiving (e.g., receiving information), transmitting (e.g., sending information), inputting, outputting, or accessing (e.g., accessing data in memory) as having been “judged” or “determined.” Furthermore, “judging” and “determining” may include regarding resolving, selecting, choosing, establishing, comparing, etc. as having been “judged” or “determined.” In other words, “judging” and “determining” may include regarding some action as having been “judged” or “determined.” Also, “judging (determining)” may be read as “assuming,” “expecting,” or “considering.”

[0122] The terms “connected,” “coupled,” or any variation thereof, mean any direct or indirect connection or coupling between two or more elements, and may include the presence of one or more intermediate elements between two elements that are “connected” or “coupled” with each other. The coupling or connection between elements may be physical, logical, or a combination thereof. For example, “connection” may be reinterpreted as “access.” As used in this disclosure, two elements may be considered to be “connected” or “coupled” with each other using at least one of one or more wires, cables, and printed electrical connections, and, in some non-limiting and non-exclusive examples, electromagnetic energy having wavelengths in the radio frequency domain, microwave domain, and optical (both visible and invisible) domain.

[0123] In this disclosure, the phrase "based on" does not mean "based solely on" unless otherwise specified. In other words, the phrase "based on" means both "based solely on" and "based at least on."

[0124] Any reference to elements using the designations “first,” “second,” etc., as used in this disclosure does not generally limit the quantity or order of those elements. These designations may be used in this disclosure as a convenient way to distinguish between two or more elements. Accordingly, references to the first and second elements do not imply that only two elements may be employed, or that the first element must precede the second element in any way.

[0125] In the configuration of each of the above devices, "means" may be replaced with "part," "circuit," "device," etc.

[0126] Where the terms “include,” “including,” and variations thereof are used in this disclosure, these terms are intended to be inclusive, as is the term “comprising.” Furthermore, the term “or” as used in this disclosure is not intended to mean exclusive OR.

[0127] In this disclosure, where articles such as “a,” “an,” and “the” in English have been added by translation, this disclosure may include cases in which the nouns following these articles are plural.

[0128] In this disclosure, the term "A and B are different" may mean "A and B are different from each other." The term may also mean "A and B are each different from C." Terms such as "separate" and "combine" may be interpreted similarly to "different."

Explanation of Symbols

[0129] 1...Training data correction device, 10...Storage unit, 11...Machine learning unit, 12...Inference unit, 13...Misclassification rate calculation unit, 14...Training data correction judgment unit, 15...Deletion data determination unit, 16...Training data deletion unit, 1001...Processor, 1002...Memory, 1003...Storage, 1004...Communication device, 1005...Input device, 1006...Output device, 1007...Bus.

Claims

1. A training data correction device that corrects training data consisting of pairs of a category having a hierarchical structure and a document belonging to that category, the device comprising: an acquisition unit that acquires category information indicating a first category, which is one of the categories, and a second category, which is a category hierarchically related to the first category and to which a document in the training data that should belong to the first category erroneously belongs or may erroneously belong; and a deletion unit that identifies characteristic terms in the documents that are included in the training data and belong to the first category indicated by the category information acquired by the acquisition unit, and deletes from the training data the pairs whose documents include the identified terms, among the documents that are included in the training data and belong to the second category indicated by the category information.
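For illustration only, the deletion step of claim 1 can be sketched in Python as follows. The claim does not state how characteristic terms are identified, so the scoring rule here (document frequency inside the first category minus document frequency outside it), the whitespace tokenization, the function names, the `top_n` parameter, and the sample categories are all assumptions made for this sketch.

```python
from collections import Counter

def characteristic_terms(training_data, first_category, top_n=3):
    """Return the top_n terms most characteristic of first_category.

    Assumption: a term's score is its document frequency inside the first
    category minus its document frequency in all other categories.
    """
    inside = [set(doc.split()) for cat, doc in training_data if cat == first_category]
    outside = [set(doc.split()) for cat, doc in training_data if cat != first_category]
    df_in = Counter(t for terms in inside for t in terms)
    df_out = Counter(t for terms in outside for t in terms)

    def score(term):
        return df_in[term] / max(len(inside), 1) - df_out[term] / max(len(outside), 1)

    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(df_in, key=lambda t: (-score(t), t))
    return set(ranked[:top_n])

def delete_conflicting(training_data, first_category, second_category, top_n=3):
    """Drop (category, document) pairs of second_category whose document
    contains a characteristic term of first_category."""
    terms = characteristic_terms(training_data, first_category, top_n)
    return [(cat, doc) for cat, doc in training_data
            if not (cat == second_category and terms & set(doc.split()))]

# Hypothetical training data: (category, document) pairs.
training_data = [
    ("electronics/phone", "smartphone battery screen"),
    ("electronics/phone", "smartphone camera review"),
    ("electronics", "smartphone case discount"),  # arguably belongs to the lower category
    ("electronics", "television stand sale"),
    ("electronics", "laptop keyboard sale"),
]
cleaned = delete_conflicting(training_data, "electronics/phone", "electronics", top_n=1)
```

In this toy data the only characteristic term selected is "smartphone", so the single upper-category pair that mentions it is removed and the other four pairs are kept.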

2. The training data correction device according to claim 1, wherein the hierarchical structure of the categories changes over time.

3. The training data correction device according to claim 1, wherein the second category is at a higher level in the hierarchy than the first category.

4. The training data correction device according to claim 1, wherein the deletion unit performs the deletion on the training data when the misclassification rate of a document classification model, which classifies an arbitrary input document into the category to which it belongs and has been trained on the training data, meets a predetermined criterion.

5. The training data correction device according to claim 4, wherein cross-validation is performed in the training based on the training data.
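As a rough illustration of the cross-validation mentioned in claim 5, the sketch below produces one out-of-fold prediction per training pair, so that each document is classified by a model that never saw it during training. The fold assignment by index, the function names, and the toy majority-vote "model" are assumptions for this sketch; the claim does not specify the cross-validation scheme or the classifier.

```python
from collections import Counter

def train_majority_model(train):
    """Toy stand-in for a classifier: always predicts the most common
    category in its training pairs (assumption for illustration)."""
    majority = Counter(cat for cat, _ in train).most_common(1)[0][0]
    return lambda doc: majority

def out_of_fold_predictions(samples, train_fn, k=3):
    """Return one prediction per sample, each made by a model trained
    on the other k-1 folds."""
    predictions = [None] * len(samples)
    for fold in range(k):
        # Sample i is held out in fold (i % k); all other samples train the model.
        train = [s for i, s in enumerate(samples) if i % k != fold]
        model = train_fn(train)
        for i, (_, doc) in enumerate(samples):
            if i % k == fold:
                predictions[i] = model(doc)
    return predictions

# Hypothetical (category, document) pairs.
samples = [("a", "x"), ("a", "y"), ("a", "z"), ("b", "w"), ("a", "v"), ("a", "u")]
predictions = out_of_fold_predictions(samples, train_majority_model, k=3)
```

The resulting predictions can be compared with the labels in the training data to estimate a misclassification rate without evaluating a model on documents it was trained on.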

6. The training data correction device according to claim 4, wherein the misclassification rate is the probability that a document that should belong to the first category is erroneously classified as belonging to the second category.

7. The training data correction device according to claim 1, wherein the acquisition unit acquires the category information indicating the first category and the second category when the misclassification rate of a document classification model, which classifies an arbitrary input document into the category to which it belongs and has been trained on the training data, meets a predetermined criterion, the misclassification rate being the probability that a document that should belong to the first category is erroneously classified as belonging to the second category.
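The misclassification rate used in claims 4, 6, and 7 can be sketched as follows. The sketch estimates the probability as a simple ratio over labeled predictions and assumes that "meets a predetermined criterion" means the rate is at or above a threshold; the function names, the threshold value, and the sample labels are hypothetical.

```python
def misclassification_rate(true_labels, predicted_labels, first_category, second_category):
    """Fraction of documents labeled first_category that the model
    classified as second_category."""
    total = sum(1 for t in true_labels if t == first_category)
    if total == 0:
        return 0.0
    wrong = sum(1 for t, p in zip(true_labels, predicted_labels)
                if t == first_category and p == second_category)
    return wrong / total

def pairs_needing_correction(true_labels, predicted_labels, hierarchy_pairs, threshold=0.3):
    """Return the (first, second) category pairs whose rate meets the
    criterion (assumed here to be rate >= threshold)."""
    return [(first, second) for first, second in hierarchy_pairs
            if misclassification_rate(true_labels, predicted_labels, first, second) >= threshold]

# Hypothetical labels and model predictions.
true_labels = ["phone", "phone", "phone", "phone", "tv", "tv"]
predicted = ["phone", "electronics", "electronics", "phone", "tv", "tv"]
hierarchy_pairs = [("phone", "electronics"), ("tv", "electronics")]

rate = misclassification_rate(true_labels, predicted, "phone", "electronics")
flagged = pairs_needing_correction(true_labels, predicted, hierarchy_pairs, threshold=0.3)
```

Here two of the four "phone" documents are classified into "electronics", so only that pair is flagged as category information for the deletion step.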

8. The training data correction device according to claim 1, wherein the deletion unit identifies characteristic terms in the documents that are included in the training data and belong to the first category indicated by the category information acquired by the acquisition unit, and deletes from the training data the pairs whose documents include both the identified terms and the name indicating the first category, among the documents that are included in the training data and belong to the second category indicated by the category information.
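The narrower deletion condition of claim 8 can be sketched as below: a second-category pair is removed only when its document contains both a characteristic term and the name of the first category. The sketch assumes the characteristic terms have already been identified, matches whole whitespace-separated tokens, and treats "include the identified terms" as containing at least one of them; these choices, the function name, and the sample data are assumptions.

```python
def delete_with_name(training_data, second_category, terms, first_category_name):
    """Keep every pair except second_category pairs whose document contains
    at least one characteristic term AND the first category's name."""
    kept = []
    for cat, doc in training_data:
        tokens = set(doc.split())
        has_term = bool(terms & tokens)
        has_name = first_category_name in tokens
        if cat == second_category and has_term and has_name:
            continue  # both conditions hold: delete this pair
        kept.append((cat, doc))
    return kept

# Hypothetical data: the first category is named "smartphone".
data = [
    ("electronics", "smartphone battery replacement"),  # term and name: deleted
    ("electronics", "laptop battery replacement"),      # term only: kept
    ("electronics", "smartphone sale"),                 # name only: kept
    ("electronics/smartphone", "smartphone battery life"),  # not second category: kept
]
kept = delete_with_name(data, "electronics", {"battery"}, "smartphone")
```

Requiring the category name in addition to a characteristic term makes the deletion more conservative than the condition in claim 1, since fewer second-category documents satisfy both tests.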

9. The training data correction device according to claim 1, further comprising a learning unit that trains and outputs a document classification model, which classifies an arbitrary input document into the category to which it belongs, based on the training data after the deletion by the deletion unit.