A text classification model analysis method and device
By segmenting the text into fragments, classifying sampled subsets of those fragments, and calculating the contribution of each text fragment, the method explains the text classification results of a deep learning model and addresses the difficulty of interpreting text classification models.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING ZITIAO NETWORK TECH CO LTD
- Filing Date
- 2022-05-09
- Publication Date
- 2026-06-16
Figure CN117076662B_ABST (abstract drawing; image not reproduced)
Abstract
Description
Technical Field
[0001] This invention relates to the field of information processing technology, and in particular to a method and apparatus for parsing a text classification model.
Background Technology
[0002] Text classification is a natural language processing technique. Given a set of texts and a set of labels, text classification maps each text in the text set to a label in the label set using machine learning or deep learning methods. The model that realizes this mapping is called a text classification model.
[0003] Deep learning has made great progress in the field of natural language processing. Text classification models are becoming increasingly complex, with more and more layers, which makes them harder to interpret. A text classification model trained with machine learning or deep learning behaves like a black box: given a piece of text data, it returns a classification result, but it does not reveal why it returned that result or which text features it relied on.
Summary of the Invention
[0004] In view of this, the present invention provides a method and apparatus for parsing text classification models, which is used to solve the problem that text classification models are difficult to interpret.
[0005] To achieve the above objectives, the embodiments of the present invention provide the following technical solutions:
[0006] In a first aspect, embodiments of the present invention provide a method for parsing a text classification model, comprising:
[0007] The target text is segmented into words to obtain a set of text fragments corresponding to the target text, and the set of text fragments includes multiple text fragments;
[0008] Multiple sampling sets are generated based on the set of text fragments, and each of the sampling sets includes at least one text fragment from the set of text fragments.
[0009] The text fragment set and each of the sample sets are classified based on a preset text classification model to obtain the target text category of the text fragment set and the text category of each of the sample sets;
[0010] Based on the target text category, the text categories of each of the sample sets, and the text fragments contained in each of the sample sets, the contribution of each text fragment in the text fragment set is obtained. The contribution is used to characterize the degree of influence of the text fragment on the preset text classification model to classify the text fragment set into the target text category.
[0011] As an optional implementation of this invention, the step of generating multiple sampling sets based on the text fragment set includes:
[0012] Text segments in the set of text segments are selected one by one based on a preset selection probability;
[0013] Determine whether the set of selected text fragments is an empty set;
[0014] If the set of selected text fragments is not empty, then the set of selected text fragments is determined as a sampling set.
[0015] As an optional implementation of this invention, the step of obtaining the contribution of each text segment in the text segment set based on the target text category, the text categories of each of the sampling sets, and the text segments contained in each of the sampling sets includes:
[0016] Determine whether the text category of each of the sampling sets is the target text category, and obtain the determination result;
[0017] Based on the determined results and the text fragments contained in each of the sampling sets, the contribution of each text fragment in the text fragment set is obtained.
[0018] As an optional implementation of this invention, the step of obtaining the contribution of each text segment in the text segment set based on the determination result and the text segments included in each of the sampling sets includes:
[0019] Based on the text fragments contained in each of the sampling sets, the number of times each text fragment in the text fragment set is sampled is obtained, where the number of times a text fragment is sampled is the number of sampling sets that include that text fragment;
[0020] Based on whether the text category of each of the sampling sets is the target text category, the hit count difference of each text segment in the text segment set is obtained; the hit count difference is the difference between the correct count of the text segment and the incorrect count of the text segment, the correct count is the number of sampling sets that contain the text segment and whose text category is the target text category, and the incorrect count is the number of sampling sets that contain the text segment and whose text category is not the target text category;
[0021] The contribution of each text segment in the text segment set is obtained based on the number of times each text segment is sampled and its hit count difference.
[0022] As an optional implementation of this invention, obtaining the contribution of each text segment in the text segment set based on the number of times each text segment is sampled and its hit count difference includes:
[0023] The ratio of the hit count difference of each text segment in the text segment set to the number of times that text segment is sampled is determined as the contribution of that text segment.
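In symbols, the ratio can be written as follows. The notation is introduced here only for illustration and does not appear in the source: for text segment $i$, $n_i$ is the number of sampling sets that contain it, $c_i$ is its correct count, and $w_i$ is its incorrect count.

```latex
\mathrm{contribution}_i = \frac{c_i - w_i}{n_i}, \qquad n_i = c_i + w_i > 0
```

Because every sampling set that contains the segment is counted as either correct or incorrect, the value lies between -1 and 1. A value near 1 means the target category is almost always kept when the segment is present. A value near -1 means it is almost always lost.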
[0024] As an optional implementation of this invention, obtaining the contribution of each text segment in the text segment set based on the target text category and the text categories of each sampling set includes:
[0025] Obtain the contribution of each identical text fragment in the set of text fragments;
[0026] Calculate the average of the contributions of the identical text segments to obtain the average contribution;
[0027] The average contribution is determined as the contribution of each identical text segment.
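The averaging of identical text fragments might be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `average_identical` and the numeric values are hypothetical, and contributions are assumed to be stored per position in the fragment list.

```python
from collections import defaultdict

def average_identical(fragments, contributions):
    """Give every occurrence of the same fragment text the mean of their contributions."""
    groups = defaultdict(list)
    for text, value in zip(fragments, contributions):
        groups[text].append(value)
    means = {text: sum(vals) / len(vals) for text, vals in groups.items()}
    # Each position receives the average of all positions holding the same text.
    return [means[text] for text in fragments]

fragments = ["Game B", "new", "map", "new"]
contributions = [0.5, 1.0, -0.5, 0.0]  # hypothetical per-position values
averaged = average_identical(fragments, contributions)
print(averaged)
```

Both occurrences of "new" end up with the same value, so the explanation no longer depends on where in the text the fragment appeared.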
[0028] As an optional implementation of this invention, the step of performing word segmentation on the target text to obtain a set of text fragments corresponding to the target text includes:
[0029] The target text is segmented using the Byte-Pair Encoding (BPE) algorithm to obtain a set of text fragments corresponding to the target text.
[0030] As an optional implementation of this invention, the method further includes:
[0031] Obtain the contribution of each text segment in the text segment set corresponding to each text in the target text set, wherein the target text set includes at least one target text;
[0032] Based on the contribution of each text segment in the text segment set corresponding to each text in the target text set and the text category of the text segment set corresponding to each text in the target text set determined by the preset text classification model, the contribution of each text segment corresponding to each text category in the preset text classification model is obtained.
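One way to picture the aggregation over a target text set is the sketch below. The source does not state here how values from different texts are combined, so averaging is an assumption made for illustration, and `contributions_by_category` and the sample data are hypothetical.

```python
from collections import defaultdict

def contributions_by_category(per_text_results):
    """per_text_results: iterable of (category, {fragment: contribution}) pairs,
    one pair per target text, where category is the model's output for that text."""
    collected = defaultdict(lambda: defaultdict(list))
    for category, frag_scores in per_text_results:
        for fragment, score in frag_scores.items():
            collected[category][fragment].append(score)
    # Assumption: values for the same fragment and category are combined by averaging.
    return {
        category: {frag: sum(vals) / len(vals) for frag, vals in frags.items()}
        for category, frags in collected.items()
    }

results = [
    ("games", {"Game B": 1.0, "map": 0.5}),
    ("games", {"Game B": 0.5, "card": 0.25}),
    ("news", {"map": -0.5}),
]
by_category = contributions_by_category(results)
print(by_category["games"])
```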
[0033] As an optional implementation of this invention, after determining the text category of the text segment set corresponding to each text in the target text set based on the contribution of each text segment in the text segment set corresponding to each text in the target text set and the preset text classification model, and obtaining the contribution of each text segment corresponding to each text category of the preset text classification model, the method further includes:
[0034] Retrieve positive text fragments for each text category;
[0035] Among them, the positive text fragments of any text category are the text fragments whose contribution to the corresponding text category is greater than the first threshold contribution, or the positive text fragments of any text category are the first preset number of text fragments with the largest contribution to the corresponding text category.
[0036] As an optional implementation of this invention, after determining the text category of the text segment set corresponding to each text in the target text set based on the contribution of each text segment in the text segment set corresponding to each text in the target text set and the preset text classification model, and obtaining the contribution of each text segment corresponding to each text category of the preset text classification model, the method further includes:
[0037] Retrieve negative text fragments for each text category;
[0038] Among them, the negative text fragments of any text category are the text fragments whose contribution to the corresponding text category is less than the second threshold contribution, or the negative text fragments of any text category are the second preset number of text fragments whose contribution to the corresponding text category is the smallest.
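The two selection rules for positive and negative text fragments (a threshold, or a preset number of extreme fragments) could look roughly like this. The function names, parameter names and scores are hypothetical and chosen only for illustration.

```python
def positive_fragments(scores, threshold=None, top_k=None):
    """Fragments scoring above `threshold`, or else the `top_k` highest-scoring ones."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if threshold is not None:
        return [frag for frag, s in ranked if s > threshold]
    return [frag for frag, _ in ranked[:top_k]]

def negative_fragments(scores, threshold=None, bottom_k=None):
    """Fragments scoring below `threshold`, or else the `bottom_k` lowest-scoring ones."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1])
    if threshold is not None:
        return [frag for frag, s in ranked if s < threshold]
    return [frag for frag, _ in ranked[:bottom_k]]

# Hypothetical contributions of fragments to one text category.
scores = {"Game B": 0.8, "map": 0.3, "new": -0.1, "launch": -0.6}
pos = positive_fragments(scores, threshold=0.5)
neg = negative_fragments(scores, bottom_k=2)
print(pos, neg)
```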
[0039] In a second aspect, embodiments of the present invention provide a parsing apparatus for a text classification model, comprising:
[0040] The word segmentation unit is used to segment the target text into words and obtain a set of text fragments corresponding to the target text, wherein the set of text fragments includes multiple text fragments;
[0041] A sampling unit is configured to generate multiple sampling sets based on the set of text fragments, wherein any one of the sampling sets includes at least one text fragment from the set of text fragments.
[0042] A classification unit is used to classify the text fragment set and each of the sample sets based on a preset text classification model, and to obtain the target text category of the text fragment set and the text category of each of the sample sets;
[0043] The processing unit is configured to obtain the contribution of each text segment in the text segment set based on the target text category, the text category of each of the sample sets, and the text segments contained in each of the sample sets. The contribution is used to characterize the degree of influence of the text segment on the preset text classification model to classify the text segment set into the target text category.
[0044] As an optional implementation of the present invention,
[0045] The sampling unit is specifically used for:
[0046] Text segments in the set of text segments are selected one by one based on a preset selection probability;
[0047] Determine whether the set of selected text fragments is an empty set;
[0048] If the set of selected text fragments is not empty, then the set of selected text fragments is determined as a sampling set.
[0049] As an optional implementation of the present invention,
[0050] The processing unit is specifically used for:
[0051] Determine whether the text category of each of the sampling sets is the target text category, and obtain the determination result;
[0052] Based on the determined results and the text fragments contained in each of the sampling sets, the contribution of each text fragment in the text fragment set is obtained.
[0053] As an optional implementation of the present invention,
[0054] The processing unit is specifically used for:
[0055] Based on the text fragments contained in each of the sampling sets, the number of times each text fragment in the text fragment set is sampled is obtained, where the number of times a text fragment is sampled is the number of sampling sets that include that text fragment;
[0056] Based on whether the text category of each of the sampling sets is the target text category, the hit count difference of each text segment in the text segment set is obtained; the hit count difference is the difference between the correct count of the text segment and the incorrect count of the text segment, the correct count is the number of sampling sets that contain the text segment and whose text category is the target text category, and the incorrect count is the number of sampling sets that contain the text segment and whose text category is not the target text category;
[0057] The contribution of each text segment in the text segment set is obtained based on the number of times each text segment is sampled and its hit count difference.
[0058] As an optional implementation of the present invention,
[0059] The processing unit is specifically used for:
[0060] The ratio of the hit count difference of each text segment in the text segment set to the number of times that text segment is sampled is determined as the contribution of that text segment.
[0061] As an optional implementation of the present invention,
[0062] The processing unit is specifically used for:
[0063] Obtain the contribution of each identical text fragment in the set of text fragments;
[0064] Calculate the average of the contributions of the identical text segments to obtain the average contribution;
[0065] The average contribution is determined as the contribution of each identical text segment.
[0066] As an optional implementation of this invention, the word segmentation unit is specifically used to perform word segmentation processing on the target text based on the Byte-Pair Encoding (BPE) algorithm to obtain a set of text fragments corresponding to the target text.
[0067] As an optional implementation of the present invention,
[0068] The processing unit is further configured to:
[0069] Obtain the contribution of each text segment in the text segment set corresponding to each text in the target text set, wherein the target text set includes at least one target text;
[0070] Based on the contribution of each text segment in the text segment set corresponding to each text in the target text set and the text category of the text segment set corresponding to each text in the target text set determined by the preset text classification model, the contribution of each text segment corresponding to each text category in the preset text classification model is obtained.
[0071] As an optional implementation of the present invention,
[0072] The processing unit is further configured to:
[0073] After determining the text category of the text segment set corresponding to each text in the target text set based on the contribution of each text segment in the text segment set corresponding to each text in the target text set and the preset text classification model, and obtaining the contribution of each text segment corresponding to each text category in the preset text classification model, the positive text segment of each text category is obtained.
[0074] Among them, the positive text fragments of any text category are the text fragments whose contribution to the corresponding text category is greater than the first threshold contribution, or the positive text fragments of any text category are the first preset number of text fragments with the largest contribution to the corresponding text category.
[0075] As an optional implementation of the present invention,
[0076] The processing unit is further configured to:
[0077] After determining the text category of the text segment set corresponding to each text in the target text set based on the contribution of each text segment in the text segment set corresponding to each text in the target text set and the preset text classification model, and obtaining the contribution of each text segment corresponding to each text category in the preset text classification model, the negative text segment of each text category is obtained.
[0078] Among them, the negative text fragments of any text category are the text fragments whose contribution to the corresponding text category is less than the second threshold contribution, or the negative text fragments of any text category are the second preset number of text fragments whose contribution to the corresponding text category is the smallest.
[0079] In a third aspect, embodiments of the present invention provide an electronic device, including a memory and a processor, wherein the memory is used to store a computer program, and the processor is used to execute the parsing method of the text classification model described in the first aspect or any optional implementation of the first aspect when the computer program is invoked.
[0080] In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the parsing method of the text classification model described in the first aspect or any optional implementation of the first aspect.
[0081] In a fifth aspect, embodiments of the present invention provide a computer program product that, when run on a computer, enables the computer to implement the text classification model parsing method described in the first aspect or any optional implementation of the first aspect.
[0082] The text classification model parsing method provided in this invention first performs word segmentation on the target text to obtain a set of text segments, including multiple text fragments, corresponding to the target text. Then, it generates multiple sampling sets based on the set of text fragments. Next, based on a preset text classification model, it classifies the set of text fragments and each sampling set to obtain the target text category of the set of text fragments and the text category of each sampling set. Finally, based on the target text category, the text category of each sampling set, and the text fragments contained in each sampling set, it obtains the contribution degree of each text fragment in the set of text fragments. Since the contribution degree can characterize the degree of influence of the text fragment on the preset text classification model's classification of the set of text fragments into the target text category, the reason why the preset text classification model classifies the set of text fragments into the target text category can be explained by the contribution degree of each text fragment in the set of text fragments. Therefore, this invention can solve the problem of text classification models being difficult to explain.
Attached Figure Description
[0083] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with the invention and, together with the description, serve to explain the principles of the invention.
[0084] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, for those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0085] Figure 1 is the first flowchart of the steps of the text classification model parsing method provided in an embodiment of the present invention;
[0086] Figure 2 is the second flowchart of the steps of the text classification model parsing method provided in an embodiment of the present invention;
[0087] Figure 3 is the third flowchart of the steps of the text classification model parsing method provided in an embodiment of the present invention;
[0088] Figure 4 is a schematic structural diagram of the parsing apparatus for the text classification model provided in an embodiment of the present invention;
[0089] Figure 5 is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of the present invention.
Detailed Implementation
[0090] To better understand the above-mentioned objectives, features, and advantages of the present invention, the solutions of the present invention will be further described below. It should be noted that, unless otherwise specified, the embodiments of the present invention and the features thereof can be combined with each other.
[0091] Many specific details are set forth in the following description in order to provide a full understanding of the invention, but the invention may also be practiced in other ways different from those described herein; obviously, the embodiments in the specification are only some embodiments of the invention, and not all embodiments.
[0092] In the embodiments of the present invention, the terms "exemplary" or "for example" are used to indicate that something is an example, illustration, or description. Any embodiment or design described as "exemplary" or "for example" in the embodiments of the present invention should not be construed as being more preferred or advantageous than other embodiments or designs. Specifically, the use of the terms "exemplary" or "for example" is intended to present the relevant concepts in a specific manner. Furthermore, in the description of the embodiments of the present invention, unless otherwise stated, "a plurality of" means two or more.
[0093] Based on the above, an embodiment of the present invention provides a method for parsing a text classification model. Referring to Figure 1, the method includes the following steps:
[0094] S11. Perform word segmentation on the target text to obtain a set of text fragments corresponding to the target text.
[0095] The text fragment set includes multiple text fragments.
[0096] That is, the target text is split into multiple text fragments, and the resulting text fragments are combined into a set of text fragments corresponding to the target text.
[0097] As an optional implementation of this invention, step S11 (performing word segmentation on the target text to obtain a set of text fragments corresponding to the target text) includes:
[0098] The target text is segmented using the Byte-Pair Encoding (BPE) algorithm to obtain a set of text fragments corresponding to the target text.
[0099] Specifically, the process of segmenting the target text using the BPE algorithm to obtain the set of text segments corresponding to the target text includes the following steps a to d:
[0100] Step a: Divide the target text into the smallest text segments and obtain the word frequency of each smallest text segment.
[0101] For example, when the target text is "FloydHub is the fastest way to build, train and deploy deep learning models. Build deep learning models in the cloud. Train deep learning models", the smallest text segments obtained by splitting the target text and the word frequency of each smallest text segment can be as shown in Table 1 below:
[0102] Table 1
[0103]
[0104] Step b: Merge the pair of text segments with the highest frequency into a new text segment, add the new text segment to the vocabulary, and update the word frequencies of the new text segment and of each smallest text segment.
[0105] Following the example above, the pair of text segments with the highest frequency is text segment [d] and text segment [e]. Merging them into text segment [de] updates Table 1 above to Table 2 below:
[0106] Table 2
[0107]
[0108] As shown in Table 2, the word frequency of text fragment [de] is 7, the word frequencies of text fragments [e] and [d] are each reduced by 7, to 9 and 5 respectively, and text fragment [de] has been added to the vocabulary.
[0109] Step c: Determine whether the number of text fragments in the vocabulary has reached a first quantity or whether the highest frequency equals a second quantity.
[0110] If, in step c, the number of text fragments in the vocabulary has not reached the first quantity and the highest frequency does not equal the second quantity, step b is repeated until the number of text fragments in the vocabulary reaches the first quantity or the highest frequency equals the second quantity.
[0111] If, in step c, the number of text fragments in the vocabulary has reached the first quantity or the highest frequency equals the second quantity, proceed to step d:
[0112] Step d: Obtain the set of text fragments corresponding to the target text based on the vocabulary.
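Steps a to d can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions rather than the patented procedure: `learn_bpe`, `max_vocab` (standing in for the first quantity) and `min_pair_freq` (standing in for the second quantity) are hypothetical names, merges are restricted to whitespace-delimited words, and ties between equally frequent pairs go to the pair encountered first.

```python
from collections import Counter

def learn_bpe(text, max_vocab=50, min_pair_freq=1):
    # Step a: split every whitespace-delimited word into its smallest segments.
    words = [list(word) for word in text.split()]
    vocab = {segment for word in words for segment in word}
    merges = []
    while len(vocab) < max_vocab:  # step c, first stopping condition
        # Count how often each adjacent pair of segments occurs.
        pairs = Counter()
        for word in words:
            pairs.update(zip(word, word[1:]))
        if not pairs:
            break
        (left, right), freq = pairs.most_common(1)[0]
        if freq <= min_pair_freq:  # step c, second stopping condition
            break
        # Step b: merge the most frequent pair everywhere and extend the vocabulary.
        merged = left + right
        for k, word in enumerate(words):
            out, j = [], 0
            while j < len(word):
                if j + 1 < len(word) and word[j] == left and word[j + 1] == right:
                    out.append(merged)
                    j += 2
                else:
                    out.append(word[j])
                    j += 1
            words[k] = out
        vocab.add(merged)
        merges.append((left, right))
    # Step d: the segmented words form the text fragment set.
    fragments = [segment for word in words for segment in word]
    return fragments, vocab, merges

text = ("FloydHub is the fastest way to build, train and deploy deep learning "
        "models. Build deep learning models in the cloud. Train deep learning models")
fragments, vocab, merges = learn_bpe(text, max_vocab=60)
print(fragments[:10])
```

Raising `max_vocab` yields longer, more word-like fragments. Lowering it keeps fragments close to single characters.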
[0113] It should be noted that the BPE algorithm is used to segment the target text only in order to select more reasonable text fragments for interpreting the text classification model; the embodiments of the present invention are not limited to it. Other word segmentation algorithms can also be used to segment the target text and obtain the corresponding set of text fragments. For example, the Jieba algorithm can be used to segment the target text.
[0114] For example, the target text is: "Analysis of the top 30 cards in Game A: Experts teach you how to match cards". The corresponding text fragment set includes: [Game A], [top], [30], [name], [card], [card], [data], [analysis], [:], [expert], [teach you], [match], [card].
[0115] For example, the target text is: "Game B new map and new species are coming soon". The corresponding text fragment set includes: [Game B], [new], [map], [new], [species], [coming soon], [launch].
[0116] S12. Generate multiple sampling sets based on the set of text fragments.
[0117] Wherein, any of the sampling sets includes at least one text segment from the set of text segments.
[0118] As an optional implementation of this invention, step S12 (generating multiple sampling sets based on the text fragment set) includes the following steps 1 to 3:
[0119] Step 1: Select text segments from the set of text segments one by one based on the preset selection probability.
[0120] For example, the preset selection probability is 50%. That is, each text fragment in the text fragment set has a 50% probability of being selected as a text fragment in the current sampling set.
[0121] Step 2: Determine whether the set of selected text fragments is an empty set.
[0122] Since each text fragment in the text fragment set has a certain probability of not being selected, there may be a situation where no text fragments are selected, and in this case, the set of selected text fragments is an empty set.
[0123] Step 3: If the set of selected text fragments is not empty, then the set of selected text fragments is determined as a sampling set.
[0124] Multiple sampling sets can be obtained by repeatedly executing steps 1 to 3 above.
[0125] Following the example above, when the set of text fragments includes [Game B], [new], [map], [new], [species], [coming soon], [launch], the multiple sampling sets can include:
[0126] Sampling set 1: {species, coming soon, launch};
[0127] Sampling set 2: {new, map, new, coming soon};
[0128] Sampling set 3: {Game B, new, species, launch};
[0129] Sampling set 4: {Game B, new, new, species, coming soon};
[0130] Sampling set 5: {map, species}.
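Steps 1 to 3 of the sampling procedure can be sketched as follows. The function name `generate_sampling_sets` and its parameters are hypothetical. Sampling sets are stored as lists of positions rather than fragment texts, which is an implementation choice made here so that the two occurrences of [new] stay distinguishable.

```python
import random

def generate_sampling_sets(fragments, num_sets, p=0.5, seed=None):
    """Return `num_sets` non-empty sampling sets, each a list of fragment positions."""
    if not fragments or not 0 < p <= 1:
        raise ValueError("need at least one fragment and a probability in (0, 1]")
    rng = random.Random(seed)
    sampling_sets = []
    while len(sampling_sets) < num_sets:
        # Step 1: keep each fragment independently with the preset probability p.
        chosen = [i for i in range(len(fragments)) if rng.random() < p]
        # Steps 2 and 3: discard an empty selection, otherwise keep it as a sampling set.
        if chosen:
            sampling_sets.append(chosen)
    return sampling_sets

fragments = ["Game B", "new", "map", "new", "species", "coming soon", "launch"]
sampling_sets = generate_sampling_sets(fragments, num_sets=5, p=0.5, seed=0)
for positions in sampling_sets:
    print([fragments[i] for i in positions])
```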
[0131] It should be noted that the number of sample sets is not limited in the embodiments of the present invention. The more sample sets there are, the smaller the interpretation bias of the text classification model will be, but the more data needs to be processed. Conversely, the fewer sample sets there are, the less data needs to be processed, and the greater the interpretation bias of the text classification model will be. Therefore, in actual use, the number of sample sets can be set according to the computing performance of the device and the accuracy requirements of the interpretation of the text classification model.
[0132] S13. Classify the text fragment set and each of the sample sets based on a preset text classification model to obtain the target text category of the text fragment set and the text category of each of the sample sets.
[0133] That is, the set of text fragments corresponding to the target text and each of the sample sets are respectively input into a preset text classification model, and the output of the preset text classification model is respectively obtained as the target text category of the set of text fragments and the text category of each of the sample sets.
[0134] The preset text classification model in this embodiment of the invention is the text classification model to be explained by the text classification model parsing method. This model can be a model trained on any preset machine learning model using sample text data.
[0135] S14. Based on the target text category, the text category of each of the sampling sets, and the text segments contained in each of the sampling sets, obtain the contribution of each text segment in the text segment set.
[0136] The contribution degree is used to characterize the degree of influence of the text fragment on the preset text classification model to classify the set of text fragments into the target text category.
[0137] The text classification model parsing method provided in this invention first performs word segmentation on the target text to obtain a set of text segments, including multiple text fragments, corresponding to the target text. Then, it generates multiple sampling sets based on the set of text fragments. Next, based on a preset text classification model, it classifies the set of text fragments and each sampling set to obtain the target text category of the set of text fragments and the text category of each sampling set. Finally, based on the target text category, the text category of each sampling set, and the text fragments contained in each sampling set, it obtains the contribution degree of each text fragment in the set of text fragments. Since the contribution degree can characterize the degree of influence of the text fragment on the preset text classification model's classification of the set of text fragments into the target text category, the reason why the preset text classification model classifies the set of text fragments into the target text category can be explained by the contribution degree of each text fragment in the set of text fragments. Therefore, this invention can solve the problem of text classification models being difficult to explain.
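Steps S11 to S14 taken together might be sketched as below. This is an illustrative outline only: `explain` and `toy_classifier` are hypothetical names, whitespace splitting stands in for word segmentation, the keyword rule stands in for a trained model, and assigning 0.0 to a fragment that was never sampled is an assumption the source does not address.

```python
import random

def explain(fragments, classify, num_sets=200, p=0.5, seed=0):
    rng = random.Random(seed)
    target = classify(fragments)  # S13: category of the full text fragment set
    sampled = [0] * len(fragments)   # number of sampling sets containing each fragment
    hit_diff = [0] * len(fragments)  # correct count minus incorrect count
    done = 0
    while done < num_sets:
        # S12: draw one sampling set and skip it if it is empty.
        positions = [i for i in range(len(fragments)) if rng.random() < p]
        if not positions:
            continue
        done += 1
        # S13: classify the sampling set and compare with the target category.
        same = classify([fragments[i] for i in positions]) == target
        for i in positions:
            sampled[i] += 1
            hit_diff[i] += 1 if same else -1
    # S14: contribution = hit count difference / number of times sampled.
    contributions = [d / n if n else 0.0 for d, n in zip(hit_diff, sampled)]
    return target, contributions

def toy_classifier(fragments):
    # Stand-in for the preset text classification model.
    return "games" if "Game" in fragments else "other"

fragments = "Game B new map".split()  # S11: whitespace split as a stand-in
target, contributions = explain(fragments, toy_classifier)
for fragment, value in zip(fragments, contributions):
    print(f"{fragment}: {value:+.2f}")
```

With this toy rule, every sampling set that contains "Game" keeps the target category, so that fragment receives the maximum contribution, while the other fragments land near zero.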
[0138] As an extension and refinement of the above embodiment, an embodiment of the present invention provides another method for parsing a text classification model. Referring to Figure 2, the method includes the following steps:
[0139] S201. Perform word segmentation on the target text to obtain a set of text fragments corresponding to the target text.
[0140] The text fragment set includes multiple text fragments.
[0141] S202. Generate multiple sampling sets based on the set of text fragments.
[0142] Wherein, any of the sampling sets includes at least one text segment from the set of text segments.
[0143] S203. Classify the text fragment set and each of the sample sets based on a preset text classification model to obtain the target text category of the text fragment set and the text category of each of the sample sets.
[0144] Steps S201 to S203 are implemented in a similar way to steps S11 to S13 in the embodiment shown in Figure 1 and are not described again here.
[0145] S204. Determine whether the text category of each of the sampling sets is the target text category, and obtain the determination result.
[0146] That is, it is determined whether the text category that the preset text classification model outputs for each sampling set is the same as the text category of the text fragment set corresponding to the target text, and the results for the individual sampling sets are combined into the determination result.
[0147] S205. Based on the determination result and the text fragments contained in each of the sampling sets, obtain the contribution of each text fragment in the text fragment set.
[0148] As an optional implementation of this invention, step S205 (obtaining the contribution of each text segment in the text segment set based on the determination result and the text segments contained in each of the sampling sets) includes the following steps I to III:
[0149] Step I: Based on the text fragments contained in each of the sampling sets, obtain the number of times each text fragment in the text fragment set has been sampled.
[0150] Wherein, the number of samplings is the number of sampling sets including the text fragment.
[0151] For example, take the set of text fragments corresponding to the target text to be {Game B, New, Map, New, Species, Coming Soon, Online}, and suppose that five sampling sets are generated based on this set of text fragments. The text fragments contained in each sampling set and the text category of each sampling set are shown in Table 3 below:
[0152] Table 3
[0153]
[0154] Wherein, “N” indicates that the sample set does not contain the text fragment, “Y” indicates that the sample set contains the text fragment, “No” indicates that the text category of the corresponding sample set is not the target text category, “Yes” indicates that the text category of the corresponding sample set is the target text category, and “New_1” and “New_2” represent text fragments [New] located in different positions.
[0155] As shown in Table 3 above, the number of sample sets containing the text fragment [Game B] is 2, the number of sample sets containing the text fragment [New_1] is 3, the number of sample sets containing the text fragment [Map] is 2, the number of sample sets containing the text fragment [New_2] is 2, the number of sample sets containing the text fragment [Species] is 3, the number of sample sets containing the text fragment [Coming Soon] is 4, and the number of sample sets containing the text fragment [Online] is 3. Therefore, the number of times [Game B], [New_1], [Map], [New_2], [Species], [Coming Soon], and [Online] are sampled are 2, 3, 2, 2, 3, 4, and 3, respectively.
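Step I can be sketched in a few lines of Python. This is only an illustration: the sampling sets below are hypothetical (they are not the contents of Table 3), and the function name `count_samplings` is an assumption introduced here. Fragments are identified by their position in the text fragment set so that repeated words such as [New] are counted separately.

```python
from collections import Counter

def count_samplings(sampling_sets):
    """Return, per fragment position, the number of sampling sets containing it."""
    counts = Counter()
    for sampling_set in sampling_sets:
        # set() guards against a position being listed twice in one sampling set.
        counts.update(set(sampling_set))
    return counts

# Hypothetical data: positions index into the text fragment set.
fragments = ["Game B", "New", "Map", "New"]
sampling_sets = [{0, 1}, {1, 2, 3}, {0, 3}]

counts = count_samplings(sampling_sets)
for position, text in enumerate(fragments):
    print(position, text, counts[position])
```

Keying by position rather than by fragment text keeps the two occurrences of [New] apart, which matches the New_1 / New_2 distinction used in the tables.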
[0156] Step II: Based on whether the text category of each of the sampling sets is the target text category, obtain the hit difference of each text segment in the text segment set.
[0157] Wherein, the hit count difference is the difference between the number of correct hits of the text segment and the number of incorrect hits of the text segment, the number of correct hits is the number of sample sets containing the text segment and whose text category is the target text category, and the number of incorrect hits is the number of sample sets containing the text segment and whose text category is not the target text category.
[0158] Continuing with the example in Table 3 above, the number of correct hits, the number of incorrect hits, and the resulting hit count difference of each text fragment are as follows:
- [Game B]: 2 sampling sets containing the fragment have the target text category and 0 do not, so the hit count difference is 2 - 0 = 2;
- [New_1]: 2 sampling sets containing the fragment have the target text category and 1 does not, so the hit count difference is 2 - 1 = 1;
- [Map]: 0 sampling sets containing the fragment have the target text category and 2 do not, so the hit count difference is 0 - 2 = -2;
- [New_2]: 1 sampling set containing the fragment has the target text category and 2 do not, so the hit count difference is 1 - 2 = -1;
- [Species]: 2 sampling sets containing the fragment have the target text category and 1 does not, so the hit count difference is 2 - 1 = 1;
- [Coming Soon]: 1 sampling set containing the fragment has the target text category and 3 do not, so the hit count difference is 1 - 3 = -2;
- [Online]: 2 sampling sets containing the fragment have the target text category and 1 does not, so the hit count difference is 2 - 1 = 1.
Therefore, the number of samplings and the hit count difference of each text fragment are as shown in Table 4 below:
[0159] Table 4
[0160]
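Step II can be sketched as follows. The data are hypothetical and the function name `hit_count_differences` is an assumption made for illustration; `is_target[k]` stands for the determination result of sampling set k from step S204.

```python
def hit_count_differences(sampling_sets, is_target):
    """Correct hits minus incorrect hits for every sampled fragment position."""
    diffs = {}
    for sampling_set, hit in zip(sampling_sets, is_target):
        for position in sampling_set:
            # +1 if the set was classified into the target category, -1 otherwise.
            diffs[position] = diffs.get(position, 0) + (1 if hit else -1)
    return diffs

# Hypothetical sampling sets and determination results.
sampling_sets = [{0, 1}, {1, 2}, {0, 2}]
is_target = [True, False, True]

diffs = hit_count_differences(sampling_sets, is_target)
print(diffs)
```

Accumulating +1 and -1 directly yields the difference without storing the correct and incorrect counts separately; the two counts can be kept instead if they are needed for reporting.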
[0161] Step III: Based on the number of times each text fragment in the text fragment set is sampled and its hit count difference, obtain the contribution of each text fragment in the text fragment set.
[0162] As an optional implementation of this invention, step III (obtaining the contribution of each text fragment in the text fragment set based on the number of times each text fragment is sampled and its hit count difference) includes:
[0163] For each text fragment in the text fragment set, determine the ratio of its hit count difference to the number of times it is sampled, and take this ratio as the contribution of the text fragment.
[0164] Following the example shown in Table 4 above, the ratio of the hit count difference to the number of samplings is determined for each text fragment and used as the contribution of that text fragment. The resulting contributions are shown in Table 5 below:
[0165] Table 5
[0166]
[0167] Furthermore, if "N" in Table 3 is set to 0 and "Y" is set to 1, and the model prediction score is set to 1 when the text category of a sampling set is the target text category and to -1 when it is not, the following Table 6 is obtained:
[0168] Table 6
[0169]
[0170] Further, the multiple sampling sets generated based on the text fragment set are denoted as $S$, and each sampling set is denoted as $S_k$, $k \in \{0, \dots, m\}$; the sampling result of the $j$-th text fragment in sampling set $k$ is denoted as $x_{k,j}$ (1 if the sampling set contains the text fragment, 0 otherwise); the model prediction score of sampling set $k$ is denoted as $P_k$; and the contribution of the $j$-th text fragment in the text fragment set is denoted as $\mathrm{score}_j$. The contribution of each text fragment in the set of text fragments corresponding to the target text can then be calculated using the following formula:
[0171] $$\mathrm{score}_j = \frac{\sum_{k=0}^{m} x_{k,j}\, P_k}{\sum_{k=0}^{m} x_{k,j}}$$
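With the 0/1 encoding of set membership and the +1/-1 encoding of the model prediction score described above, the ratio of hit count difference to sampling count reduces to two column sums. The following is a minimal sketch under that reading; the matrix values are hypothetical and the name `contribution_scores` is an assumption introduced here.

```python
def contribution_scores(membership, predictions):
    """membership[k][j] is 1 if sampling set k contains fragment j, else 0.
    predictions[k] is +1 if set k received the target category, else -1."""
    n_fragments = len(membership[0])
    scores = []
    for j in range(n_fragments):
        sampled = sum(row[j] for row in membership)
        hit_diff = sum(row[j] * p for row, p in zip(membership, predictions))
        # A fragment that was never sampled has no defined ratio.
        scores.append(hit_diff / sampled if sampled else None)
    return scores

# Hypothetical encoding: four sampling sets over three fragments.
membership = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
]
predictions = [1, -1, 1, 1]

scores = contribution_scores(membership, predictions)
print(scores)
```

Returning `None` for a fragment that appears in no sampling set is a design choice of this sketch; the text does not say how that case is handled.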
[0172] It should be noted that, for ease of demonstration, only five sampling sets were randomly generated in the above embodiment. Therefore, the contribution of each text segment in the set of text segments corresponding to the target text has a relatively large deviation. This deviation can be reduced by increasing the number of sampling sets. For example, if 5000 sampling sets are randomly generated, the contribution of each text segment is shown in Table 7 below:
[0173] Table 7
[0174]
| | Game B | New_1 | Map | New_2 | Species | Coming Soon | Online |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Contribution | 0.8346 | 0.1975 | 0.2632 | 0.1744 | 0.1200 | 0.0468 | 0.1208 |
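The effect of the number of sampling sets can be explored with a small simulation. The sketch below is not the model or the data behind Table 7: `toy_classifier` is a hypothetical keyword rule standing in for the preset text classification model, and the selection probability of 0.5, the seed, and the function names are assumptions made for illustration.

```python
import random

def estimate_contributions(fragments, classify, n_sets, p_select=0.5, seed=0):
    """Estimate per-fragment contributions from n_sets random sampling sets."""
    rng = random.Random(seed)
    target = classify(fragments)
    sampled = [0] * len(fragments)
    hit_diff = [0] * len(fragments)
    drawn = 0
    while drawn < n_sets:
        positions = [j for j in range(len(fragments)) if rng.random() < p_select]
        if not positions:
            continue  # an empty selection is not a sampling set
        drawn += 1
        label = classify([fragments[j] for j in positions])
        sign = 1 if label == target else -1
        for j in positions:
            sampled[j] += 1
            hit_diff[j] += sign
    return [d / s if s else None for d, s in zip(hit_diff, sampled)]

def toy_classifier(frags):
    # Hypothetical stand-in for the preset model: a single keyword rule.
    return "games" if "Game B" in frags else "other"

fragments = ["Game B", "New", "Map", "New", "Species", "Coming Soon", "Online"]
few = estimate_contributions(fragments, toy_classifier, n_sets=5)
many = estimate_contributions(fragments, toy_classifier, n_sets=5000)
print(few)
print(many)
```

For this toy rule, only [Game B] influences the classification, so with many sampling sets the estimates for the other fragments should settle near zero, whereas five sets give a much noisier picture.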
[0175] As an optional implementation of this invention, obtaining the contribution of each text segment in the text segment set based on the target text category and the text categories of each sampling set includes:
[0176] Obtain the contribution of each occurrence of an identical text fragment in the text fragment set; average these contributions to obtain an average contribution; and take the average contribution as the contribution of that text fragment.
[0177] For example, if the contribution scores of “New_1” and “New_2” in the example in Table 7 above are 0.1975 and 0.1744 respectively, then the average contribution score of “New_1” and “New_2”, 0.1859, can be obtained as the contribution score of the text fragment “New”.
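The averaging of identical text fragments can be sketched as below, using the contributions listed in Table 7 in text order. The function name `average_identical_fragments` is an assumption introduced for illustration.

```python
from collections import defaultdict

def average_identical_fragments(fragments, contributions):
    """Average the contributions of fragments that share the same text."""
    grouped = defaultdict(list)
    for text, score in zip(fragments, contributions):
        grouped[text].append(score)
    return {text: sum(scores) / len(scores) for text, scores in grouped.items()}

# Fragments in text order with the Table 7 contributions.
fragments = ["Game B", "New", "Map", "New", "Species", "Coming Soon", "Online"]
contributions = [0.8346, 0.1975, 0.2632, 0.1744, 0.1200, 0.0468, 0.1208]

averaged = average_identical_fragments(fragments, contributions)
print(averaged["New"])
```

The exact mean of 0.1975 and 0.1744 is 0.18595, which the text reports as 0.1859.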
[0178] As an optional implementation of the present invention, based on any of the above embodiments and as shown in Figure 3, the parsing method of the text classification model provided in the embodiment of the present invention further includes:
[0179] S31. Obtain the contribution of each text segment in the text segment set corresponding to each text in the target text set.
[0180] The target text set includes at least one of the target texts.
[0181] Optionally, each text in the target text set can be taken in turn as the target text of the embodiment shown in Figure 1, and the parsing method of the text classification model shown in Figure 1 can be executed on it, so as to obtain the contribution of each text fragment in the text fragment set corresponding to each text in the target text set.
[0182] S32. Based on the contribution of each text segment in the text segment set corresponding to each text in the target text set and the text category of the text segment set corresponding to each text in the target text set determined by the preset text classification model, obtain the contribution of each text segment corresponding to each text category in the preset text classification model.
[0183] Specifically, in this embodiment of the invention, the contribution of a text fragment corresponding to a text category refers to the contribution of that text fragment to classifying a set of text fragments into that text category. For example, the text fragment "[Instruction Manual]" has a contribution of 0.2345 to classifying a set of text fragments into the text category "Technology," while the text fragment "[Instruction Manual]" has a contribution of -0.6581 to classifying a set of text fragments into the text category "Entertainment." Therefore, the contribution of the text fragment "[Instruction Manual]" corresponding to the text category "Technology" is 0.2345, and the contribution of the text fragment "[Instruction Manual]" corresponding to the text category "Entertainment" is -0.6581.
[0184] Since the above embodiments can obtain the contribution of each text segment corresponding to each text category of the preset text classification model, the above embodiments can explain which text segments each text category of the preset text classification model has learned, thereby achieving a further explanation of the preset text classification model.
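One possible way to organize step S32 is sketched below. The text does not state how contributions from different texts are combined, so the arithmetic mean used here is an assumption, as are the function name `contributions_by_category` and all data values.

```python
from collections import defaultdict

def contributions_by_category(per_text_results):
    """per_text_results: iterable of (target_category, {fragment: contribution})."""
    collected = defaultdict(lambda: defaultdict(list))
    for category, fragment_scores in per_text_results:
        for fragment, score in fragment_scores.items():
            collected[category][fragment].append(score)
    # Assumption: repeated observations of a fragment are combined by their mean.
    return {
        category: {frag: sum(vals) / len(vals) for frag, vals in frags.items()}
        for category, frags in collected.items()
    }

# Hypothetical per-text results: the category assigned by the model and the
# per-fragment contributions computed for that text.
results = [
    ("Technology", {"chip": 0.6, "review": 0.1}),
    ("Technology", {"chip": 0.8}),
    ("Games", {"review": 0.4}),
]

table = contributions_by_category(results)
print(table)
```

The same fragment can end up with different contributions under different categories, as in the [Instruction Manual] example above.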
[0185] As an optional implementation of the present invention, after step S32 (obtaining the contribution of each text fragment corresponding to each text category of the preset text classification model, based on the contribution of each text fragment in the text fragment set corresponding to each text in the target text set and the text category of the text fragment set corresponding to each text in the target text set determined by the preset text classification model), the method further includes: obtaining the positive text fragments of each text category.
[0186] Among them, the positive text fragments of any text category are the text fragments whose contribution to the corresponding text category is greater than the first threshold contribution, or the positive text fragments of any text category are the first preset number of text fragments with the largest contribution to the corresponding text category.
[0187] For example, if the first threshold contribution is 0.5, then text segments with a contribution greater than 0.5 for each text category can be identified as positive text segments for each text category.
[0188] For example, if the first preset number is 20, then the 20 text fragments with the largest contribution to each text category can be identified as the positive text fragments for each text category.
[0189] For example, positive text fragments for the text category "games" may include: [review], [explanation], [version], [player], [game], etc.; positive text fragments for the text category "cars" may include: [tire], [car], [electric car], [fuel], [steering], [review], [electric], [new model], [configuration], [mileage], [driver], etc.; positive text fragments for the text category "technology" may include: [telecommunications], [smart], [big data], [internet], [cutting-edge technology], [machine], [technology], [algorithm], [chip], [breakthrough], [development], etc.
[0190] As an optional implementation of the present invention, after step S32 (obtaining the contribution of each text fragment corresponding to each text category of the preset text classification model, based on the contribution of each text fragment in the text fragment set corresponding to each text in the target text set and the text category of the text fragment set corresponding to each text in the target text set determined by the preset text classification model), the method further includes: obtaining the negative text fragments of each text category.
[0191] Among them, the negative text fragments of any text category are the text fragments whose contribution to the corresponding text category is less than the second threshold contribution, or the negative text fragments of any text category are the second preset number of text fragments whose contribution to the corresponding text category is the smallest.
[0192] For example, if the second threshold contribution is -0.8, then text fragments whose contribution to each text category is less than -0.8 can be identified as negative text fragments for each text category.
[0193] For example, if the second preset number is 100, then the 100 text fragments with the smallest contribution to each text category can be identified as the negative text fragments for each text category.
[0194] For example, negative text snippets for the text category "Entertainment" may include: [cars], [tires], [achievements], [chips], [electric], [reviews], [players], [games], etc.; negative text snippets for the text category "Finance" may include: [legends], [smart], [telecommunications], [fuel], [games], [travel], [fire alarm], [new models], [drivers], etc.
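Both selection rules (threshold and preset number) can be sketched for positive and negative text fragments as follows. The function names, parameter names, and scores are hypothetical; only the thresholds 0.5 and -0.8 come from the examples above.

```python
def positive_fragments(scores, threshold=None, top_n=None):
    """Fragments scoring above `threshold`, or the `top_n` largest contributions."""
    if threshold is not None:
        return [f for f, s in scores.items() if s > threshold]
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]

def negative_fragments(scores, threshold=None, bottom_n=None):
    """Fragments scoring below `threshold`, or the `bottom_n` smallest contributions."""
    if threshold is not None:
        return [f for f, s in scores.items() if s < threshold]
    ranked = sorted(scores, key=scores.get)
    return ranked[:bottom_n]

# Hypothetical contributions of fragments to one text category.
scores = {"chip": 0.7, "algorithm": 0.55, "travel": -0.9, "review": 0.1}

print(positive_fragments(scores, threshold=0.5))
print(positive_fragments(scores, top_n=1))
print(negative_fragments(scores, threshold=-0.8))
print(negative_fragments(scores, bottom_n=2))
```

In this sketch the threshold rule takes precedence when both arguments are given; the text presents the two rules as alternatives and does not describe combining them.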
[0195] Using the same inventive concept, as an implementation of the above method, this embodiment of the invention also provides a text classification model parsing device. This device embodiment corresponds to the aforementioned method embodiment. For ease of reading, this device embodiment will not repeat the details of the aforementioned method embodiment one by one, but it should be clear that the text classification model parsing device in this embodiment can correspondingly implement all the contents of the aforementioned method embodiment.
[0196] Figure 4 is a schematic diagram of the structure of the parsing device for the text classification model provided in an embodiment of the present invention. As shown in Figure 4, the text classification model parsing device 400 provided in this embodiment includes:
[0197] Word segmentation unit 41 is used to perform word segmentation on the target text and obtain a set of text fragments corresponding to the target text, wherein the set of text fragments includes multiple text fragments;
[0198] Sampling unit 42 is configured to generate multiple sampling sets based on the text fragment set, wherein any sampling set includes at least one text fragment from the text fragment set;
[0199] Classification unit 43 is used to classify the text fragment set and each of the sample sets based on a preset text classification model, and obtain the target text category of the text fragment set and the text category of each of the sample sets;
[0200] Processing unit 44 is configured to obtain the contribution of each text segment in the text segment set based on the target text category, the text category of each of the sampling sets, and the text segments contained in each of the sampling sets. The contribution is used to characterize the degree of influence of the text segment on the preset text classification model to classify the text segment set into the target text category.
[0201] As an optional implementation of the present invention, the sampling unit 42 is specifically used to select text segments in the text segment set one by one based on a preset selection probability; determine whether the set of selected text segments is an empty set; if the set of selected text segments is not an empty set, then determine the set of selected text segments as a sampling set.
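The behavior of the sampling unit can be sketched as below. The selection probability of 0.5, the seed, and the function name `generate_sampling_sets` are assumptions; the text only states that a non-empty selection is determined to be a sampling set, so discarding an empty selection and drawing again is likewise an assumption of this sketch.

```python
import random

def generate_sampling_sets(n_fragments, n_sets, p_select=0.5, seed=None):
    """Draw n_sets non-empty sampling sets of fragment positions."""
    rng = random.Random(seed)
    sampling_sets = []
    while len(sampling_sets) < n_sets:
        # Visit the fragments one by one and keep each with probability p_select.
        selected = [j for j in range(n_fragments) if rng.random() < p_select]
        if selected:  # an empty selection is not used as a sampling set
            sampling_sets.append(selected)
    return sampling_sets

sets_ = generate_sampling_sets(n_fragments=7, n_sets=5, seed=42)
for s in sets_:
    print(s)
```

Positions are kept in their original order so that each sampling set can be turned back into text by joining the corresponding fragments before classification.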
[0202] As an optional implementation of this invention, the processing unit 44 is specifically used to determine whether the text category of each of the sampling sets is the target text category, and obtain a determination result; based on the determination result and the text segments contained in each of the sampling sets, to obtain the contribution of each text segment in the text segment set.
[0203] As an optional implementation of this invention, the processing unit 44 is specifically configured to: obtain, based on the text fragments contained in each of the sampling sets, the number of times each text fragment in the text fragment set is sampled, wherein the number of times a text fragment is sampled is the number of sampling sets including the text fragment; obtain the hit count difference of each text fragment in the text fragment set based on whether the text category of each sampling set is the target text category, wherein the hit count difference is the difference between the number of correct hits of the text fragment and the number of incorrect hits of the text fragment, the number of correct hits is the number of sampling sets including the text fragment and whose text category is the target text category, and the number of incorrect hits is the number of sampling sets including the text fragment and whose text category is not the target text category; and obtain the contribution of each text fragment in the text fragment set based on the number of times each text fragment is sampled and its hit count difference.
[0204] As an optional implementation of this invention, the processing unit 44 is specifically used to determine, for each text fragment in the text fragment set, the ratio of its hit count difference to the number of times it is sampled, so as to obtain the contribution of each text fragment in the text fragment set.
[0205] As an optional implementation of the present invention, the processing unit 44 is specifically used to obtain the contribution of each identical text segment in the text segment set; determine the average contribution of each identical text segment, obtain the average contribution; and determine the average contribution as the contribution of each identical text segment.
[0206] As an optional implementation of the present invention, the word segmentation unit 41 is specifically used to perform word segmentation processing on the target text based on the byte pair encoding (BPE) algorithm to obtain a set of text fragments corresponding to the target text.
[0207] As an optional implementation of this invention, the processing unit 44 is further configured to obtain the contribution of each text segment in the text segment set corresponding to each text in the target text set, wherein the target text set includes at least one target text, and to obtain the contribution of each text segment corresponding to each text category of the preset text classification model based on the contribution of each text segment in the text segment set corresponding to each text in the target text set and the text category of the text segment set corresponding to each text in the target text set determined by the preset text classification model.
[0208] As an optional implementation of the present invention, the processing unit 44 is further configured to obtain the positive text fragments of each text category after obtaining the contribution of each text fragment corresponding to each text category of the preset text classification model, based on the contribution of each text fragment in the text fragment set corresponding to each text in the target text set and the text category of the text fragment set corresponding to each text in the target text set determined by the preset text classification model.
[0209] Among them, the positive text fragments of any text category are the text fragments whose contribution to the corresponding text category is greater than the first threshold contribution, or the positive text fragments of any text category are the first preset number of text fragments with the largest contribution to the corresponding text category.
[0210] As an optional implementation of the present invention, the processing unit 44 is further configured to obtain the negative text fragments of each text category after obtaining the contribution of each text fragment corresponding to each text category of the preset text classification model, based on the contribution of each text fragment in the text fragment set corresponding to each text in the target text set and the text category of the text fragment set corresponding to each text in the target text set determined by the preset text classification model.
[0211] Among them, the negative text fragments of any text category are the text fragments whose contribution to the corresponding text category is less than the second threshold contribution, or the negative text fragments of any text category are the second preset number of text fragments whose contribution to the corresponding text category is the smallest.
[0212] The text classification model parsing device provided in this embodiment can execute the text classification model parsing method provided in the above method embodiment. Its implementation principle and technical effect are similar, and will not be repeated here.
[0213] Using the same inventive concept, embodiments of the present invention also provide an electronic device. Figure 5 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present invention. As shown in Figure 5, the electronic device provided in this embodiment includes: a memory 501 and a processor 502. The memory 501 is used to store computer programs; the processor 502 is used to execute the parsing method of the text classification model provided in the above embodiment when the computer program is invoked.
[0214] This invention also provides a computer-readable storage medium storing a computer program, which, when executed by a processor, implements the text classification model parsing method provided in the above embodiments.
[0215] This invention provides a computer program product that, when run on a computer, enables the computer to implement the text classification model parsing method provided in the above embodiments.
[0216] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product implemented on one or more computer-usable storage media containing computer-usable program code.
[0217] The processor can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor can be a microprocessor or any conventional processor.
[0218] Memory may include non-permanent memory in computer-readable media, in the form of random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
[0219] Computer-readable media include both permanent and non-permanent, removable and non-removable storage media. Storage media can store information using any method or technology; the information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
[0220] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of the present invention.
Claims
1. A method for parsing a text classification model, characterized in that the method comprises: performing word segmentation on the target text to obtain a set of text fragments corresponding to the target text, the set of text fragments including multiple text fragments; generating multiple sampling sets based on the set of text fragments, each of the sampling sets including at least one text fragment from the set of text fragments; classifying the text fragment set and each of the sampling sets based on a preset text classification model to obtain the target text category of the text fragment set and the text category of each of the sampling sets; and obtaining the contribution of each text fragment in the text fragment set based on the target text category, the text category of each of the sampling sets, and the text fragments contained in each of the sampling sets, wherein the contribution is used to characterize the degree of influence of the text fragment on the preset text classification model classifying the text fragment set into the target text category; wherein the step of obtaining the contribution of each text fragment in the text fragment set based on the target text category, the text category of each of the sampling sets, and the text fragments contained in each of the sampling sets includes: determining whether the text category of each of the sampling sets is the target text category to obtain a determination result; and obtaining the contribution of each text fragment in the text fragment set based on the determination result and the text fragments contained in each of the sampling sets.
2. The method according to claim 1, characterized in that, The step of generating multiple sampling sets based on the set of text fragments includes: Text segments in the set of text segments are selected one by one based on a preset selection probability; Determine whether the set of selected text fragments is an empty set; If the set of selected text fragments is not empty, then the set of selected text fragments is determined as a sampling set.
3. The method according to claim 1, characterized in that the step of obtaining the contribution of each text fragment in the text fragment set based on the determination result and the text fragments contained in each of the sampling sets includes: obtaining, based on the text fragments contained in each of the sampling sets, the number of times each text fragment in the text fragment set is sampled, where the number of times a text fragment is sampled is the number of sampling sets that include the text fragment; obtaining, based on whether the text category of each of the sampling sets is the target text category, the hit count difference of each text fragment in the text fragment set, where the hit count difference is the difference between the number of correct hits and the number of incorrect hits of the text fragment, the number of correct hits is the number of sampling sets that contain the text fragment and whose text category is the target text category, and the number of incorrect hits is the number of sampling sets that contain the text fragment and whose text category is not the target text category; and obtaining the contribution of each text fragment in the text fragment set based on the number of times each text fragment is sampled and its hit count difference.
4. The method according to claim 3, characterized in that the step of obtaining the contribution of each text fragment in the text fragment set based on the number of times each text fragment is sampled and its hit count difference includes: determining, for each text fragment in the text fragment set, the ratio of its hit count difference to the number of times it is sampled, to obtain the contribution of each text fragment in the text fragment set.
5. The method according to claim 1, characterized in that, The step of obtaining the contribution of each text segment in the text segment set based on the target text category and the text categories of each sampling set includes: Obtain the contribution of each identical text fragment in the set of text fragments; Determine the average contribution of each identical text segment and obtain the average contribution. The average contribution is determined as the contribution of each identical text segment.
6. The method according to claim 1, characterized in that, The step of segmenting the target text to obtain a set of text fragments corresponding to the target text includes: The target text is segmented using the Byte-Pair Encoding (BPE) algorithm to obtain a set of text fragments corresponding to the target text.
7. The method according to any one of claims 1-6, characterized in that, The method further includes: Obtain the contribution of each text segment in the text segment set corresponding to each text in the target text set, wherein the target text set includes at least one target text; Based on the contribution of each text segment in the text segment set corresponding to each text in the target text set and the text category of the text segment set corresponding to each text in the target text set determined by the preset text classification model, the contribution of each text segment corresponding to each text category in the preset text classification model is obtained.
8. The method according to claim 7, characterized in that, after the contribution of each text segment corresponding to each text category of the preset text classification model is obtained based on the contribution of each text segment in the text segment set corresponding to each text in the target text set and on the text category, determined by the preset text classification model, of the text segment set corresponding to each text in the target text set, the method further includes: obtaining positive text fragments for each text category; wherein the positive text fragments of any text category are the text fragments whose contribution to that text category is greater than a first threshold contribution, or the positive text fragments of any text category are a first preset number of text fragments with the largest contributions to that text category.
9. The method according to claim 2, characterized in that, after the contribution of each text segment corresponding to each text category of the preset text classification model is obtained based on the contribution of each text segment in the text segment set corresponding to each text in the target text set and on the text category, determined by the preset text classification model, of the text segment set corresponding to each text in the target text set, the method further includes: obtaining negative text fragments for each text category; wherein the negative text fragments of any text category are the text fragments whose contribution to that text category is less than a second threshold contribution, or the negative text fragments of any text category are a second preset number of text fragments with the smallest contributions to that text category.
10. A parsing device for a text classification model, characterized by comprising: a word segmentation unit, configured to perform word segmentation on a target text to obtain a set of text fragments corresponding to the target text, wherein the set of text fragments includes multiple text fragments; a sampling unit, configured to generate multiple sampling sets based on the set of text fragments, wherein any one of the sampling sets includes at least one text fragment from the set of text fragments; a classification unit, configured to classify the set of text fragments and each of the sampling sets based on a preset text classification model, to obtain a target text category of the set of text fragments and a text category of each of the sampling sets; and a processing unit, configured to obtain the contribution of each text fragment in the set of text fragments based on the target text category, the text category of each of the sampling sets, and the text fragments contained in each of the sampling sets, wherein the contribution characterizes the degree to which the text fragment influences the preset text classification model in classifying the set of text fragments into the target text category; wherein the processing unit is specifically configured to determine whether the text category of each of the sampling sets is the target text category to obtain a determination result, and to obtain the contribution of each text fragment in the set of text fragments based on the determination result and the text fragments contained in each of the sampling sets.
11. An electronic device, characterized by comprising: a memory and a processor, wherein the memory is configured to store a computer program, and the processor is configured to execute, when the computer program is invoked, the parsing method of the text classification model according to any one of claims 1-9.
12. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the parsing method of the text classification model according to any one of claims 1-9.
13. A computer program product, characterized in that, when the computer program product is run on a computer, the computer is caused to implement the parsing method of the text classification model according to any one of claims 1-9.
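
For illustration only, the sampling and contribution steps recited in claims 4, 5, 8, 9 and 10 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation. The names `fragment_contributions`, `select_fragments` and `toy_classifier` are hypothetical. The difference in the number of hits is assumed to be the number of sampling sets containing a fragment that are classified into the target category minus the number that are not. Each fragment is assumed to be kept in a sampling set with probability 0.5. The fragments are assumed to have been produced already by a segmentation step such as BPE.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Classifier = Callable[[List[str]], str]  # hypothetical stand-in for the preset model


def fragment_contributions(
    fragments: Sequence[str],
    classify: Classifier,
    num_samples: int = 200,
    seed: int = 0,
) -> Tuple[str, Dict[str, float]]:
    """Return the target category and one contribution per distinct fragment."""
    rng = random.Random(seed)
    target = classify(list(fragments))  # category of the full fragment set

    sampled = defaultdict(int)   # position -> number of sampling sets containing it
    hit_diff = defaultdict(int)  # position -> hits minus misses (assumed definition)

    for _ in range(num_samples):
        # Assumption: keep each position independently with probability 0.5.
        kept = [i for i in range(len(fragments)) if rng.random() < 0.5]
        if not kept:  # a sampling set must contain at least one fragment
            kept = [rng.randrange(len(fragments))]
        is_hit = classify([fragments[i] for i in kept]) == target
        for i in kept:
            sampled[i] += 1
            hit_diff[i] += 1 if is_hit else -1

    # Claim 4: ratio of the hit difference to the number of times sampled.
    per_position = {i: hit_diff[i] / sampled[i] for i in sampled}

    # Claim 5: identical fragments share the average of their contributions.
    grouped = defaultdict(list)
    for i, score in per_position.items():
        grouped[fragments[i]].append(score)
    return target, {frag: sum(s) / len(s) for frag, s in grouped.items()}


def select_fragments(contrib: Dict[str, float], count: int,
                     positive: bool = True) -> List[str]:
    """Claims 8 and 9, preset-number variant: largest or smallest contributions."""
    ranked = sorted(contrib.items(), key=lambda kv: kv[1], reverse=positive)
    return [frag for frag, _ in ranked[:count]]


def toy_classifier(frags: List[str]) -> str:
    """Toy model: 'positive' only when the fragment 'great' is present."""
    return "positive" if "great" in frags else "negative"


fragments = ["the", "movie", "was", "great", "the", "end"]
target, contrib = fragment_contributions(fragments, toy_classifier)
print(target, select_fragments(contrib, 1))
```

With this toy classifier, every sampling set that contains "great" is classified into the target category, so the ratio for "great" is 1, while the other fragments score lower because they also appear in sampling sets that miss the target category. The threshold variants of claims 8 and 9 would filter `contrib` by a threshold value instead of taking a fixed number of fragments.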