Text selection method and apparatus
The masked language model, which utilizes a sparse attention mechanism, overcomes the limitations of the BERT model on sequence length, supports ultra-long text processing, achieves accurate prediction of candidate word fill-in-the-blank answers, and improves the accuracy of English cloze tests.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING YUANLI WEILAI SCI & TECH CO LTD
- Filing Date
- 2021-11-19
- Publication Date
- 2026-06-26
AI Technical Summary
Because of its full attention mechanism, the existing BERT model has a quadratic dependency on sequence length and cannot handle extremely long texts of more than 512 tokens, so it cannot effectively handle candidate word filling scenarios for such texts.
Masked language models employing a sparse attention mechanism (such as the BigBird model) reduce the quadratic dependency on sequence length to linear and support input sequences longer than 512; combined with a vocabulary, they predict answers for candidate word filling.
The method effectively processes ultra-long texts with a length greater than 512, improves the accuracy of predicting candidate word fill-in-the-blank answers, and improves accuracy on English cloze test questions.
Figure CN116151221B_ABST
Description
Technical Field
[0001] This specification relates to the field of natural language processing technology, and in particular to a text word selection method. This specification also relates to a text word selection device, a computing device, and a computer-readable storage medium.
Background Technology
[0002] In recent years, artificial intelligence technology has developed rapidly, and natural language processing, which uses deep learning to handle tasks such as dialogue understanding in everyday life, has become a popular technology.
[0003] Natural Language Processing (NLP) is an important field in computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers using natural language.
[0004] Transformer-based models have become one of the most successful deep learning models in NLP. For natural language processing tasks, the Bidirectional Encoder Representations from Transformers (BERT) neural network model is typically used. Currently, the task of selecting candidate words to fill in blanks in sentences is usually handled by the BERT model. However, because the BERT model uses a full attention mechanism, one of its core limitations is a quadratic dependence on sequence length, which means it can only support input sequences with a maximum length of 512. As a result, it cannot process extremely long texts with a length greater than 512 in the candidate word filling scenario and therefore cannot predict the answers for candidate word filling in such texts.
[0005] Therefore, how to process extremely long texts (length greater than 512) in candidate word fill-in-the-blank scenarios and predict the answers for these texts has become a pressing problem for technical personnel.
Summary of the Invention
[0006] In view of this, embodiments of this specification provide a text word selection method. This specification also relates to a text word selection device, a computing device, and a computer-readable storage medium to address the technical deficiencies existing in the prior art.
[0007] According to a first aspect of the embodiments of this specification, a text word selection method is provided, including:
[0008] Determine the initial text containing the candidate word region, and the candidate word set corresponding to the candidate word region;
[0009] Based on the candidate word region and the candidate word set, the initial text is processed to obtain the target text;
[0010] After segmenting the target text into words, a masked language model is used to obtain the target vector of the target words in the segmented target text. The masked language model employs a sparse attention mechanism.
[0011] The target probability value of each candidate word in the candidate word set relative to the target vector is determined based on the vocabulary, and the target candidate word is determined based on the target probability value.
[0012] According to a second aspect of the embodiments of this specification, a text word selection device is provided, comprising:
[0013] The text determination module is configured to determine the initial text containing the candidate word region and the candidate word set corresponding to the candidate word region;
[0014] The text processing module is configured to perform text processing on the initial text based on the candidate word region and the candidate word set to obtain the target text;
[0015] The model processing module is configured to segment the target text into words and then use a masked language model to obtain the target vector of the target words in the segmented target text, wherein the masked language model adopts a sparse attention mechanism;
[0016] The target candidate word determination module is configured to determine the target probability value of each candidate word in the candidate word set relative to the target vector based on the vocabulary, and to determine the target candidate word based on the target probability value.
[0017] According to a third aspect of the embodiments of this specification, a computing device is provided, comprising:
[0018] Memory and processor;
[0019] The memory is used to store computer-executable instructions, and the processor is used to execute the computer-executable instructions to implement the steps of the above-described text word selection method.
[0020] According to a fourth aspect of the embodiments of this specification, a computer-readable storage medium is provided that stores computer-executable instructions that, when executed by a processor, implement the steps of the text word selection method described above.
[0021] This specification provides a text word selection method and apparatus. The text word selection method includes determining an initial text containing a region of words to be selected and a set of candidate words corresponding to the region of words to be selected; performing text processing on the initial text based on the region of words to be selected and the set of candidate words to obtain target text; segmenting the target text and obtaining target vectors of the target words in the segmented target text using a masked language model, wherein the masked language model employs a sparse attention mechanism; determining the target probability value of each candidate word in the set of candidate words relative to the target vector based on a vocabulary, and determining the target candidate word based on the target probability value.
[0022] Specifically, by using a masked language model with a sparse attention mechanism, the text word selection method reduces the quadratic dependence on sequence length to linear, enabling the masked language model to support input sequences longer than 512. This solves the problem of processing ultra-long texts with a length greater than 512 in the candidate word filling scenario and, combined with a vocabulary, achieves accurate prediction of the answers for candidate word filling.
Attached Figure Description
[0023] Figure 1 is a flowchart of a text word selection method provided in one embodiment of this specification;
[0024] Figure 2 is a flowchart of a text word selection method applied to English cloze tests, provided in one embodiment of this specification;
[0025] Figure 3 is a schematic diagram of the structure of a text word selection device provided in one embodiment of this specification;
[0026] Figure 4 is a structural block diagram of a computing device provided in one embodiment of this specification.
Detailed Implementation
[0027] Many specific details are set forth in the following description to provide a full understanding of this specification. However, this specification can be implemented in many other ways than those described herein, and those skilled in the art can make similar extensions without departing from the spirit of this specification. Therefore, this specification is not limited to the specific implementations disclosed below.
[0028] The terminology used in one or more embodiments of this specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of this specification. The singular forms “a,” “said,” and “the” as used in one or more embodiments of this specification and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and / or” as used in one or more embodiments of this specification refers to and includes any or all possible combinations of one or more associated listed items.
[0029] It should be understood that although the terms first, second, etc., may be used to describe various information in one or more embodiments of this specification, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, first may also be referred to as second without departing from the scope of one or more embodiments of this specification, and similarly, second may also be referred to as first. Depending on the context, the word "if" as used herein may be interpreted as "upon," "when," or "in response to a determination."
[0030] First, the terms and concepts used in one or more embodiments of this specification will be explained.
[0031] BigBird: Transformer-based models (such as BERT) have become some of the most successful deep learning models in NLP. Unfortunately, due to their full attention mechanism, one of their core limitations is a quadratic dependency on sequence length, which limits them to input sequences with a maximum length of 512. To address this issue, the BigBird model employs a sparse attention mechanism that reduces this quadratic dependency to linear, supporting input sequences with a maximum length of 4096, while also showing significant performance improvements on masked language modeling tasks.
[0032] Masked language model: A model trained by randomly masking some words in the input sequence (usually with sentences as the basic unit), that is, replacing them with "[MASK]", and then predicting from the modified sequence which words were masked; the BigBird model is one example.
[0033] Mask Language Head Classifier: The vocabulary classifier in a masked language model, such as the output layer of the Big Bird model, is a basic function of the masked language model, used to determine the mapping from a vector dimension to a vocabulary dimension; in the masked language model, the Mask Language Head classifier predicts which word originally occupied the position of the replaced [MASK].
[0034] English cloze tests involve intentionally removing words from a short English passage or dialogue to create blanks. The task is to select the correct or best answer from the given options to restore the passage to its original form. This tests both the ability to apply basic knowledge of grammar, vocabulary, idioms, sentence structure, and collocations, as well as reading comprehension skills.
[0035] This specification provides a text word selection method, and also relates to a text word selection device, a computing device, and a computer-readable storage medium, which will be described in detail in the following embodiments.
[0036] Referring to Figure 1, which shows a flowchart of a text word selection method according to an embodiment of this specification, the method specifically includes the following steps:
[0037] Step 102: Determine the initial text containing the candidate word region and the candidate word set corresponding to the candidate word region.
[0038] Specifically, the text word selection method provided in the embodiments of this specification can be applied to English cloze tests, Chinese cloze tests, or cloze tests in other languages. Depending on the specific application scenario, the pre-trained masked language model and its accompanying vocabulary can be adapted accordingly. This specification does not impose any limitations on this. For ease of understanding, the embodiments in this specification focus on the application of the text word selection method to English cloze tests.
[0039] In the context of applying the text word selection method to English cloze tests, the initial text can be understood as an English passage or dialogue; the candidate word region can be understood as the blank formed by removing a word or phrase from the initial text, which is to be filled with a word later; the candidate word set can be understood as the options corresponding to each candidate word region, comprising two or more candidate words. The correct or best candidate word is selected from the candidate word set and filled into the candidate word region to restore the initial text to its complete form.
[0040] Taking "_you help me" and "A.can B.will C.must D.could" as an example, "_you help me" can be understood as the initial text, "_" can be understood as the candidate word region contained in the initial text, and "A.can B.will C.must D.could" can be understood as the set of candidate words corresponding to the candidate word region.
[0041] In practice, since it is necessary to select the correct or best target candidate word from the candidate word set corresponding to each candidate word region for word filling, if the number of candidate word regions is different from the number of candidate word sets, the subsequent selection of target candidate words cannot be performed. Therefore, before selecting target candidate words, it is necessary to ensure that the number of candidate word regions in the initial text is consistent with the number of candidate word sets. The specific implementation method is as follows:
[0042] The process of determining the initial text containing the candidate word region and the candidate word set corresponding to the candidate word region includes:
[0043] Obtain the initial text containing the region of candidate words, and the set of candidate words corresponding to the initial text;
[0044] Determine the number of candidate word regions and the number of candidate word sets;
[0045] When the number of regions and the number of sets are the same, the candidate word set corresponding to the candidate word region is determined.
[0046] In practical applications, the initial text containing one, two, or more candidate word regions and the candidate word set corresponding to the initial text are first obtained; then the number of candidate word regions and the number of candidate word sets are determined; when the number of regions and the number of sets are the same, the candidate word set corresponding to each candidate word region is determined.
[0047] When the number of regions differs from the number of sets, there may be instances of redundant or missing candidate word regions, or redundant or missing candidate word sets. Regardless of the cause, this discrepancy will result in a mismatch between the candidate word regions and the candidate word sets in the initial text, preventing subsequent selection of target candidate words. Therefore, to avoid this problem during target candidate word selection and prevent interruptions that negatively impact the user experience, after obtaining the initial text containing the candidate word regions and the corresponding candidate word sets, the correspondence between the candidate word regions and candidate word sets can be pre-determined based on the number of regions and sets, thus improving the user experience when using target candidate words.
[0048] For example, if the initial text is: _you help me; and the candidate word set is: A.can B.will C.must D.could, then from the initial text and the candidate word set, the number of candidate word regions is 1 and the number of candidate word sets is 1. The number of regions and the number of sets are therefore the same, so the corresponding candidate word set can be determined for the candidate word region.
[0049] If the initial text is: _you help me; and the candidate word sets are: A.can B.will C.must D.could; A.can B.will C.may D.could; then from the initial text and the candidate word sets, the number of candidate word regions is 1 and the number of candidate word sets is 2. The number of regions and the number of sets are therefore not the same, which means that the subsequent selection of target candidate words cannot be performed.
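The count check in paragraphs [0043] to [0049] can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment: it assumes each candidate word region is marked by a single underscore, as in the example, and the function names `count_regions` and `regions_match_sets` are hypothetical.

```python
def count_regions(initial_text, blank="_"):
    """Count candidate word regions, assuming one `blank` marker per region."""
    return initial_text.count(blank)


def regions_match_sets(initial_text, candidate_sets, blank="_"):
    """Return True when the number of regions equals the number of sets."""
    return count_regions(initial_text, blank) == len(candidate_sets)


text = "_you help me"
one_set = [["can", "will", "must", "could"]]
two_sets = one_set + [["can", "will", "may", "could"]]

print(regions_match_sets(text, one_set))   # True: 1 region, 1 set
print(regions_match_sets(text, two_sets))  # False: 1 region, 2 sets
```

Running the check before any model call lets a mismatch be reported early, which is the motivation given in paragraph [0047].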
[0050] Furthermore, given the same number of regions and sets, it is also necessary to match a corresponding set of candidate words for each region to ensure the accuracy of subsequent target candidate word selection. The specific implementation method is as follows:
[0051] Determining the candidate word set corresponding to the candidate word region includes:
[0052] Determine the text order of the candidate word regions in the initial text, and the arrangement order of the candidate word set;
[0053] The text order is matched with the arrangement order to determine the set of candidate words corresponding to the candidate word region.
[0054] In practical applications, in English cloze tests, there is a one-to-one correspondence between the candidate word regions in the initial text and the candidate word set, and the candidate word set is arranged according to the order of the candidate word regions in the initial text.
[0055] Therefore, after determining the text order of each candidate word region in the initial text and the arrangement order of the candidate word set, the text order and the arrangement order are matched to determine the corresponding candidate word set for each candidate word region.
[0056] For example, if the initial text is: _you help me, I_; and the candidate word set is: A.can B.will C.must D.could; A.can B.will C.may D.could; then the text order of the first candidate word region in the initial text is 1, and the text order of the second candidate word region in the initial text is 2; the set order of the candidate word set "A.can B.will C.must D.could" is 1, and the set order of the candidate word set "A.can B.will C.may D.could" is 2; then, after matching the text order with the set order, the corresponding candidate word set determined for the first candidate word region is "A.can B.will C.must D.could"; and the corresponding candidate word set determined for the second candidate word region is "A.can B.will C.may D.could".
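The order matching described in paragraphs [0052] to [0056] amounts to pairing the k-th region in the text with the k-th candidate word set. A minimal sketch follows, assuming underscore blanks and a hypothetical helper name `match_sets_to_regions`; the dictionary layout is chosen only for illustration.

```python
def match_sets_to_regions(initial_text, candidate_sets, blank="_"):
    """Pair the k-th region (in text order) with the k-th candidate set."""
    # Character offsets of the blanks, in the order they appear in the text.
    offsets = [i for i, ch in enumerate(initial_text) if ch == blank]
    if len(offsets) != len(candidate_sets):
        raise ValueError("number of regions and number of sets differ")
    # zip pairs text order with arrangement order, position by position.
    return [
        {"text_order": k, "offset": off, "candidates": words}
        for k, (off, words) in enumerate(zip(offsets, candidate_sets), start=1)
    ]


pairs = match_sets_to_regions(
    "_you help me, I_",
    [["can", "will", "must", "could"], ["can", "will", "may", "could"]],
)
for p in pairs:
    print(p["text_order"], p["candidates"])
```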
[0057] Step 104: Perform text processing on the initial text based on the candidate word region and the candidate word set to obtain the target text.
[0058] Specifically, the step of processing the initial text based on the candidate word region and the candidate word set to obtain the target text includes:
[0059] Replace the candidate word region with a preset mask of the masked language model;
[0060] Each candidate word in the candidate word set is concatenated with the preset mask to obtain the concatenated target text.
[0061] The masked language model is a model that employs a sparse attention mechanism, such as the BigBird model. However, it is not limited to the BigBird model; it can also be any other model that employs a sparse attention mechanism and can perform natural language processing on very long texts (texts with a length greater than 512). For ease of understanding, this specification uses the BigBird model as an example in all embodiments.
[0062] In practice, after determining the initial text containing the candidate word regions and the candidate word set corresponding to each candidate word region, each candidate word region in the initial text is replaced with a preset mask of the mask language model, for example, the preset mask is [MASK]. Then, each candidate word in the candidate word set is concatenated with its corresponding preset mask to obtain the concatenated target text.
[0063] Using the previous example, suppose the initial text is: _you help me; the candidate word set is: A.can B.will C.must D.could; and the preset mask of the masked language model is [MASK].
[0064] After the candidate word region is replaced with the preset mask of the masked language model, the initial text becomes: [MASK]you help me. Each candidate word in the candidate word set is then concatenated with its corresponding preset mask to obtain the concatenated target text: [SEP]can[SEP]will[SEP]must[SEP]could[MASK]you help me.
[0065] In practical applications, each candidate word in the candidate word set can be concatenated before or after the preset mask, depending on the specific application. This specification does not impose any restrictions on this.
[0066] In this embodiment of the specification, the answers to the English cloze test (i.e., each candidate word in the candidate word set) are concatenated into the text. Subsequently, the probability value of each answer in the corresponding blank can be determined in the text, and the target answer can be determined based on the probability value, thereby improving the accuracy of English cloze test questions. This avoids the situation where the target answer and the blank are matched incorrectly when the answers and blanks are processed separately.
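The construction of the target text in Step 104 can be sketched as plain string processing. The sketch below reproduces the worked example of paragraph [0064] for a single candidate word region; the function name `build_target_text` is hypothetical, and how several regions would be laid out in one target text is not specified in the text, so that case is deliberately rejected.

```python
MASK = "[MASK]"  # preset mask named in the text
SEP = "[SEP]"    # separator used in the worked example


def build_target_text(initial_text, candidates, blank="_"):
    """Replace the blank with the mask and prepend the candidates."""
    if initial_text.count(blank) != 1:
        raise ValueError("this sketch handles exactly one candidate word region")
    masked = initial_text.replace(blank, MASK)
    # Each candidate is preceded by a separator, as in the example.
    prefix = "".join(SEP + word for word in candidates)
    return prefix + masked


print(build_target_text("_you help me", ["can", "will", "must", "could"]))
# [SEP]can[SEP]will[SEP]must[SEP]could[MASK]you help me
```

Here the candidates are placed before the mask; as paragraph [0065] notes, placing them after it is equally permitted.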
[0067] Step 106: After segmenting the target text into words, obtain the target vector of the target words in the segmented target text through a masked language model, wherein the masked language model adopts a sparse attention mechanism.
[0068] For a detailed introduction to the mask language model, please refer to the above embodiments and terminology explanations, which will not be repeated here.
[0069] Specifically, the step of segmenting the target text and then using a masked language model to obtain the target vector of the target word in the segmented target text includes:
[0070] The target text is segmented based on a vocabulary list to obtain the text segmentation of the target text;
[0071] The text segmentation is input into a masked language model to obtain the segmentation vectors of the text segmentation;
[0072] The mask vector of the preset mask is determined from the word segmentation vector, and the mask vector of the preset mask is used as the target vector of the target word in the target text after word segmentation.
[0073] The vocabulary can be understood as the English vocabulary of the masked language model, from whose entries any English word or phrase can be composed. In practical applications, segmenting the target text based on the vocabulary to obtain the text segmentation of the target text can be understood as inputting the target text into the masked language model, which segments the target text according to its vocabulary. Furthermore, the target words described below are determined from this text segmentation and are a subset of it.
[0074] In practice, the target text is segmented based on the vocabulary to obtain multiple text segmentations (that is, tokens). These are then input into a pre-trained masked language model (such as the BigBird model) to obtain a segmentation vector for each of them. From these segmentation vectors, the mask vectors of all preset masks (i.e., the vectors of [MASK]) are selected. Each preset mask is treated as a target word, and its mask vector is used as the target vector of that target word in the segmented target text.
[0075] In the embodiments of this specification, the BigBird model is used to obtain the segmentation vector of each token in the segmented target text. The mask vector of [MASK] is then selected from these segmentation vectors so that the probability value of each word in the vocabulary relative to [MASK] can be determined from it, which ensures that the target candidate word is subsequently determined accurately.
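Selecting the mask vectors from the segmentation vectors, as described in paragraphs [0072] to [0075], can be illustrated without a real model. In the sketch below the tokens and their two-dimensional vectors are made-up placeholders standing in for the output of the masked language model, and `select_mask_vectors` is a hypothetical helper name.

```python
def select_mask_vectors(tokens, vectors, mask_token="[MASK]"):
    """Return the vectors that sit at the positions of the preset mask."""
    if len(tokens) != len(vectors):
        raise ValueError("expected one vector per token")
    return [vec for tok, vec in zip(tokens, vectors) if tok == mask_token]


# Toy data: three tokens with made-up 2-dimensional vectors.
tokens = ["[MASK]", "you", "help"]
vectors = [[0.1, 0.9], [0.4, 0.2], [0.7, 0.3]]

target_vectors = select_mask_vectors(tokens, vectors)
print(target_vectors)  # [[0.1, 0.9]]
```

With several candidate word regions the function returns one target vector per [MASK], in text order.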
[0076] Step 108: Determine the target probability value of each candidate word in the candidate word set relative to the target vector based on the vocabulary, and determine the target candidate word based on the target probability value.
[0077] For a detailed explanation of the vocabulary, please refer to the above embodiments, which will not be repeated here.
[0078] Specifically, determining the target probability value of each candidate word in the candidate word set relative to the target vector based on the vocabulary, and determining the target candidate word based on the target probability value, includes:
[0079] The target vector is passed through the word classifier of the masked language model to calculate the initial probability value of each word in the word list relative to the target vector.
[0080] Based on the initial probability values, determine the target probability value of each candidate word in the candidate word set relative to the target vector;
[0081] Based on the target probability value, the target candidate word corresponding to the candidate word region is determined from the candidate word set.
[0082] In practice, each target vector is passed through the word classifier of the masked language model to calculate the initial probability value of each word in the vocabulary relative to the target vector, that is, the probability that each word in the vocabulary fills the position represented by the target vector. Then, based on the initial probability value of each word in the vocabulary, the target probability value of each candidate word in the candidate word set relative to the target vector is determined. Finally, based on the target probability value, the target candidate word corresponding to each candidate word region is determined from the candidate word set.
[0083] In the embodiments of this specification, by combining the target vector output by the Big Bird model with the word classifier of the mask language model, the initial probability value of each word in the word list relative to the target vector can be accurately obtained. Furthermore, the Big Bird model can significantly improve the performance on the Mask Language Model task.
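The mapping from a target vector to one probability per vocabulary entry can be sketched as a linear projection followed by a softmax. This is a simplifying assumption made for illustration: the text only states that the classifier maps the vector dimension to the vocabulary dimension, and the weights, bias, and function name below are made up.

```python
import math


def vocabulary_probabilities(target_vector, weights, bias):
    """Map a hidden vector to a probability for every vocabulary entry."""
    # One logit per vocabulary entry: dot(row, target_vector) + bias.
    logits = [
        sum(w * x for w, x in zip(row, target_vector)) + b
        for row, b in zip(weights, bias)
    ]
    # Softmax, shifted by the maximum logit for numerical stability.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy classifier: hidden size 2, vocabulary size 3 (all values made up).
weights = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]
probs = vocabulary_probabilities([1.0, 0.5], weights, bias)
print(probs)
```

The returned list has one entry per vocabulary position and sums to 1, so a candidate's initial probability value can later be read off by its vocabulary id.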
[0084] In practical applications, before determining the target probability value of each candidate word relative to the target vector based on the initial probability value, it is necessary to segment each candidate word in the candidate word set to obtain the position representation of each segmented candidate word. This position representation is then used to accurately locate the position of each candidate word in the vocabulary. Finally, based on this position and the initial probability value of the words in the vocabulary, the target probability value of each candidate word is determined. The specific implementation method is as follows:
[0085] Before determining the target probability value of each candidate word in the candidate word set relative to the target vector based on the initial probability value, the method further includes:
[0086] Based on the vocabulary, each candidate word in the candidate word set is segmented to obtain the segmented candidate words corresponding to each candidate word; and
[0087] The position of the segmentation candidate word corresponding to each candidate word in the word list is determined.
[0088] Specifically, firstly, the candidate words in each candidate word set are segmented based on the word list to obtain the segmented candidate words corresponding to each candidate word. Then, based on the segmented candidate words and the word list, the position representation of each segmented candidate word in the word list is determined; that is, the id of each segmented candidate word in the word list.
[0089] Continuing with the previous example, if the candidate word set is: A.can B.will C.must D.could, the segmented candidate words are: [[can],[will],[must],[could]]. That is, can in the candidate word set corresponds to the segmented candidate word [can]; will in the candidate word set corresponds to the segmented candidate word [will]; must in the candidate word set corresponds to the segmented candidate word [must]; and could in the candidate word set corresponds to the segmented candidate word [could].
[0090] Furthermore, based on the segmentation candidate word [can] and the vocabulary, the position ID of the segmentation candidate word [can] in the vocabulary can be determined: the segmentation candidate word [can] is the first one on the far left of the first row.
[0091] In practice, after determining the position representation of the segmented candidate word corresponding to each candidate word in each candidate set within the vocabulary, the initial probability value of the segmented candidate word relative to the target vector can be determined based on its position representation in the vocabulary. Subsequently, based on the initial probability value of the segmented candidate word corresponding to each candidate word, the target probability value of each candidate word relative to the target vector can be obtained quickly and accurately. The specific implementation method is as follows:
[0092] Determining the target probability value of each candidate word in the candidate word set relative to the target vector based on the initial probability value includes:
[0093] Based on the position representation of the segmentation candidate word in the vocabulary, the initial probability value of the segmentation candidate word relative to the target vector is determined from the vocabulary;
[0094] Based on the initial probability values of the segmentation candidate words relative to the target vector, the target probability value of each candidate word in the candidate set relative to the target vector is determined.
[0095] In practical applications, since the initial probability value of each word in the vocabulary relative to the target vector has been pre-calculated, the segmentation candidate word and its pre-calculated initial probability value relative to the target vector can be found from the vocabulary based on the position of each segmentation candidate word in the vocabulary. Since there is a correspondence between each candidate word and its corresponding segmentation candidate word, after determining the initial probability value of each segmentation candidate word relative to the target vector, the target probability value of its corresponding candidate word relative to the target vector can be calculated based on the initial probability value.
[0096] Using the previous example, if [can] is the first one on the far left of the first row of the vocabulary, the Mask Language Head classifier has already pre-calculated the initial probability value of [can] relative to the target vector to be 90%.
[0097] The candidate word [can] corresponds to the segmentation candidate word [can], and the position of [can] in the vocabulary is represented as the first leftmost word in the first row. Based on this position representation, we can find the initial probability value of [can] in the vocabulary relative to the target vector: 90%. Therefore, the target probability value of the candidate word [can] relative to the target vector is 90%.
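The lookup in paragraphs [0093] to [0097] can be sketched as indexing the pre-calculated probabilities by vocabulary id and keeping the candidate with the highest value. The sketch assumes every candidate word is a single vocabulary entry; the vocabulary ids and all probabilities other than the 90% for "can" are hypothetical placeholders, as is the function name `pick_target_candidate`.

```python
def pick_target_candidate(candidates, vocab_ids, initial_probs):
    """Look up each candidate's probability by its vocabulary id."""
    target_probs = {
        word: initial_probs[vocab_ids[word]] for word in candidates
    }
    # The target candidate word is the one with the highest probability.
    best = max(target_probs, key=target_probs.get)
    return best, target_probs


# Hypothetical vocabulary ids; "can" is placed first, as in the example.
vocab_ids = {"can": 0, "will": 1, "must": 2, "could": 3, "the": 4}
# Only the 0.90 for "can" comes from the text; the rest are placeholders.
initial_probs = [0.90, 0.04, 0.01, 0.03, 0.02]

best, target_probs = pick_target_candidate(
    ["can", "will", "must", "could"], vocab_ids, initial_probs
)
print(best, target_probs[best])  # can 0.9
```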
[0098] In practical applications, in one scenario, the candidate words in the candidate word set are relatively short, such as a single word. After segmenting the candidate words based on the vocabulary, the candidate word and its corresponding segmented candidate word may be the same. In this case, the initial probability value of the segmented candidate word relative to the target vector can be directly used as the target probability value of the corresponding candidate word relative to the target vector. The specific implementation method is as follows:
[0099] The step of determining the target probability value of each candidate word in the candidate set relative to the target vector based on the initial probability values of the segmented candidate words relative to the target vector includes:
[0100] If each candidate word is determined to be the same as the segmentation candidate word corresponding to each candidate word, the initial probability value of the segmentation candidate word relative to the target vector is determined as the target probability value of each candidate word in the candidate set relative to the target vector.
[0101] Specifically, the implementation method of the target probability value of each candidate word in the candidate set relative to the target vector can be found in the above embodiment, and will not be repeated here.
[0102] However, although most candidate words in the candidate word set are single words, for which the candidate word and its segmentation candidate word may be identical after segmentation based on the vocabulary, a candidate word may also be long; for example, the candidate word may be a phrase.
[0103] In this scenario, after segmenting the candidate word based on the vocabulary, one candidate word may correspond to multiple segmentation candidate words. When obtaining the target probability value of a candidate word relative to the target vector based on its initial probability value relative to the target vector, it is necessary to consider taking the average of the initial probability values of multiple segmentation candidate words relative to the target vector as the target probability value for that candidate word relative to the target vector, to ensure the accuracy of the target probability value. The specific implementation method is as follows:
[0104] The step of determining the target probability value of each candidate word in the candidate set relative to the target vector based on the initial probability values of the segmented candidate words relative to the target vector includes:
[0105] If any of the candidate words is determined to be different from the corresponding segmentation candidate words, then each segmentation candidate word corresponding to the candidate word is determined.
[0106] Obtain the initial probability value of each segmentation candidate word relative to the target vector corresponding to the candidate word;
[0107] The average probability value of each segmentation candidate word corresponding to the candidate word relative to the target vector is determined as the target probability value of the candidate word relative to the target vector.
[0108] The case where a candidate word differs from its corresponding segmentation candidate words can be understood as follows: after the candidate word is segmented based on the vocabulary, it corresponds to two or more segmentation candidate words.
[0109] In specific implementation, if more than one segmentation candidate word is obtained after segmenting any candidate word in the candidate word set, all segmentation candidate words corresponding to that candidate word are determined; then, based on the method of the above embodiment, the initial probability value of each segmentation candidate word relative to the target vector is obtained; finally, the average of the initial probability values of all segmentation candidate words relative to the target vector is taken, and this average probability value is used as the target probability value of the candidate word relative to the target vector.
[0110] Continuing with the previous example, if the candidate word [could] is segmented based on the vocabulary into two segmentation candidate words, [cou] and [ld], then the target probability value of the candidate word [could] relative to the target vector is the average of the initial probability values of [cou] and [ld] relative to the target vector, as shown in Formula 1:
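[0111] $$P_{\text{target}}(\text{could}) = \frac{P_{\text{init}}(\text{cou}) + P_{\text{init}}(\text{ld})}{2} \qquad \text{(Formula 1)}$$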
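The averaging rule for a candidate word that is split into several segmentation candidate words can be sketched as below. The initial probability values for [cou] and [ld] are hypothetical; only the averaging itself follows the description.

```python
def target_probability(pieces, initial_prob_of):
    """Average the initial probability values of a candidate word's pieces.

    A single-piece candidate simply keeps its own initial probability value.
    """
    if not pieces:
        raise ValueError("a candidate word must have at least one piece")
    return sum(initial_prob_of[p] for p in pieces) / len(pieces)


# Hypothetical initial probability values relative to one target vector.
initial_prob_of = {"can": 0.90, "cou": 0.02, "ld": 0.01}

p_can = target_probability(["can"], initial_prob_of)          # unchanged
p_could = target_probability(["cou", "ld"], initial_prob_of)  # mean of two pieces
```

With these made-up values, [can] keeps 0.90 while [could] receives the mean of its two pieces, so both cases share one code path.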
[0112] After the target probability value of each candidate word relative to the target vector is obtained, it is mapped to the corresponding candidate word concatenated in the text. Finally, the candidate word with the highest target probability value is selected from each candidate word set as the target candidate word corresponding to that candidate word region.
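The final selection can be sketched as a per-region maximum over target probability values. The candidate words and scores below are hypothetical and only illustrate the selection rule.

```python
def select_targets(scores_per_region):
    """For each candidate word region, pick the candidate with the highest
    target probability value."""
    return [max(scores, key=scores.get) for scores in scores_per_region]


# One dict per candidate word region: candidate word -> target probability.
scores_per_region = [
    {"can": 0.90, "could": 0.015, "must": 0.03},
    {"in": 0.10, "on": 0.70, "at": 0.20},
]

chosen = select_targets(scores_per_region)
```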
[0113] In the text word selection method provided by the embodiments of this specification, a masked language model with a sparse attention mechanism reduces the quadratic dependence on sequence length to a linear one, so that the masked language model supports input sequences longer than 512. This solves the problem of processing ultra-long texts longer than 512 in the candidate word filling scenario and, combined with a vocabulary, achieves accurate prediction of the answer for candidate word filling.
[0114] Furthermore, by combining the method of splicing answers into the text, when obtaining the target answer, the answer with the largest target prediction value among the corresponding options in each text fill-in-the-blank can be selected as the target answer, thereby improving the accuracy of English cloze test questions.
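The difference between quadratic and linear dependence on sequence length can be illustrated by counting how many (query, key) pairs each attention pattern evaluates. The sketch below is a simplified stand-in, assuming a sliding window plus a few global tokens; it is not the Big Bird implementation, which also uses random attention and works on blocks. The parameters `window` and `num_global` are illustrative.

```python
def full_attention_pairs(n):
    """Full attention: every token attends to every token."""
    return n * n


def sparse_attention_pairs(n, window=3, num_global=2):
    """Count attended pairs for a sliding window plus global tokens.

    Simplified assumption: the first `num_global` tokens are global (they
    attend to, and are attended by, every token); all other tokens attend
    only to a local window and to the global tokens.
    """
    half = window // 2
    pairs = 0
    for q in range(n):
        if q < num_global:
            pairs += n  # a global token attends to the whole sequence
            continue
        keys = set(range(max(0, q - half), min(n, q + half + 1)))
        keys.update(range(min(num_global, n)))
        pairs += len(keys)
    return pairs


full_ratio = full_attention_pairs(1024) / full_attention_pairs(512)
sparse_ratio = sparse_attention_pairs(1024) / sparse_attention_pairs(512)
```

Doubling the sequence length quadruples the work for full attention, while under this simplified pattern the work roughly doubles.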
[0115] The text word selection method is further explained below with reference to Figure 2, taking the application of the method provided in this specification to an English cloze test scenario as an example. Figure 2 shows a processing flowchart of a text word selection method applied to English cloze tests, provided by an embodiment of this specification. The method specifically includes the following steps:
[0116] Step 202: Obtain the English cloze test text containing the areas to be filled, and the answer options corresponding to each area to be filled.
[0117] Step 204: Replace each area to be filled in the English cloze test text with the Big Bird model's special token [MASK].
[0118] Step 206: Add each answer option before the corresponding [MASK].
[0119] Step 208: Input the processed English cloze test text into the Big Bird model after word segmentation to obtain the vector of each token (text segmentation) of the English cloze test text after word segmentation.
[0120] Step 210: Extract the vector of all [MASK] tokens from the vector of each token (text segmentation) in the English cloze test text after word segmentation.
[0121] Step 212: Segment the answer options corresponding to each [MASK] vector and obtain the id representation of each answer in the vocabulary after segmentation.
[0122] Step 214: Pass the vector of each [MASK] through the Mask Language Head classifier to calculate the initial probability value of each word in the vocabulary relative to the vector of each [MASK].
[0123] Step 216: Based on the id representation of each answer in the vocabulary and the initial probability value of each word in the vocabulary relative to each [MASK] vector, determine the target probability value of each answer relative to its corresponding [MASK] vector.
[0124] Specifically, the detailed implementation of determining the target probability value of each answer relative to its corresponding [MASK] vector based on the ID representation of each answer in the vocabulary and the initial probability value of each word in the vocabulary relative to each [MASK] vector can be found in the above embodiment, and will not be repeated here.
[0125] Step 218: Based on the target probability value of each answer in the answer options relative to its corresponding [MASK] vector, select the answer with the largest target probability value as the target answer.
[0126] Specifically, after determining the target answer, fill the target answer into the position of the [MASK] vector to complete the English cloze test.
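The flow of steps 202 to 218 can be sketched end to end as follows. This is only an illustration under stated assumptions: the blank marker `___`, the helper names, and the sample sentence are hypothetical, and the Big Bird forward pass and Mask Language Head classifier (steps 208 to 214) are replaced by hard-coded per-[MASK] distributions rather than a real model.

```python
import re

BLANK = "___"  # hypothetical marker for an area to be filled


def build_model_input(text, options_per_blank):
    """Steps 204-206: replace each blank with [MASK] and add its answer
    options directly before it."""
    option_iter = iter(options_per_blank)

    def splice(_match):
        return " ".join(next(option_iter)) + " [MASK]"

    return re.sub(re.escape(BLANK), splice, text)


def answer_cloze(text, options_per_blank, segment, mask_distributions):
    """Steps 212-218, given one vocabulary distribution per [MASK]."""
    answers = []
    for options, dist in zip(options_per_blank, mask_distributions):
        scores = {}
        for option in options:
            pieces = segment(option)  # step 212: segment the answer option
            # Step 216: average the initial probability values of the pieces.
            scores[option] = sum(dist[p] for p in pieces) / len(pieces)
        answers.append(max(scores, key=scores.get))  # step 218
    filled = text
    for answer in answers:  # fill each target answer into its blank
        filled = filled.replace(BLANK, answer, 1)
    return answers, filled


text = "She ___ swim when she was five , and she lives ___ London ."
options = [["can", "could"], ["in", "on"]]


def segment(word):
    # Stand-in for vocabulary-based segmentation (hypothetical split).
    return {"could": ["cou", "ld"]}.get(word, [word])


# Stand-in for the classifier output at each [MASK] (made-up values).
distributions = [
    {"can": 0.30, "cou": 0.50, "ld": 0.40},
    {"in": 0.80, "on": 0.10},
]

model_input = build_model_input(text, options)
answers, filled = answer_cloze(text, options, segment, distributions)
```

In a real system the distributions would come from passing each [MASK] vector through the classifier; the selection and fill-in logic would stay the same.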
[0127] The embodiments provided in this specification describe a text word selection method applied to English cloze tests. English cloze test texts, once their answer options are included, frequently exceed the length limit of 512; the Big Bird model, which employs a sparse attention mechanism, can handle lengths greater than 512. Furthermore, the Big Bird model, trained on extremely long texts, provides a better vector representation of [MASK] than typical pre-trained models (e.g., BERT). Combined with the method of concatenating answers into the text, when obtaining the target answer, the answer with the highest predicted value among the corresponding options in each fill-in-the-blank area can be selected as the target answer, improving the accuracy of English cloze tests.
[0128] Corresponding to the above method embodiments, this specification also provides embodiments of a text word selection device. Figure 3 shows a schematic diagram of a text word selection device according to an embodiment of this specification. As shown in Figure 3, the device includes:
[0129] The text determination module 302 is configured to determine the initial text containing the candidate word region and the candidate word set corresponding to the candidate word region;
[0130] Text processing module 304 is configured to perform text processing on the initial text based on the candidate word region and the candidate word set to obtain target text;
[0131] The model processing module 306 is configured to segment the target text into words and then use a masked language model to obtain the target vector of the target words in the segmented target text, wherein the masked language model adopts a sparse attention mechanism.
[0132] The target candidate word determination module 308 is configured to determine the target probability value of each candidate word in the candidate word set relative to the target vector based on the vocabulary, and to determine the target candidate word based on the target probability value.
[0133] Optionally, the text determination module 302 is further configured to:
[0134] Obtain the initial text containing the region of candidate words, and the set of candidate words corresponding to the initial text;
[0135] Determine the number of regions in the candidate word region and the number of sets in the candidate word set;
[0136] When the number of regions and the number of sets are the same, the candidate word set corresponding to the candidate word region is determined.
[0137] Optionally, the text determination module 302 is further configured to:
[0138] Determine the text order of the candidate word regions in the initial text, and the arrangement order of the candidate word set;
[0139] The text order is matched with the arrangement order to determine the set of candidate words corresponding to the candidate word region.
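The count check and order matching performed by the text determination module can be sketched as follows. Representing each candidate word region by its character offset in the initial text is an assumption made for illustration, as is the function name.

```python
def pair_regions_with_sets(region_offsets, candidate_sets):
    """Pair each candidate word region with its candidate word set.

    Pairing happens only when the number of regions equals the number of
    sets; regions are then taken in text order and matched with the sets
    in their arrangement order.
    """
    if len(region_offsets) != len(candidate_sets):
        raise ValueError("number of regions and number of sets differ")
    ordered_regions = sorted(region_offsets)  # text order of the regions
    return list(zip(ordered_regions, candidate_sets))


# Hypothetical offsets of two regions, given out of order on purpose.
pairs = pair_regions_with_sets([42, 7], [["can", "could"], ["in", "on"]])
```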
[0140] Optionally, the text processing module 304 is further configured to:
[0141] Replace the candidate word region with a preset mask of the masked language model;
[0142] Each candidate word in the candidate word set is concatenated with the preset mask to obtain the concatenated target text.
[0143] Optionally, the model processing module 306 is further configured to:
[0144] The target text is segmented based on a vocabulary to obtain the text segmentation of the target text;
[0145] The text segmentation is input into a masked language model to obtain the segmentation vectors of the text segmentation;
[0146] The mask vector of the preset mask is determined from the word segmentation vector, and the mask vector of the preset mask is used as the target vector of the target word in the target text after word segmentation.
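Extracting the mask vectors from the segmentation vectors can be sketched as a simple filter over token positions. The tokens and two-dimensional vectors below are hypothetical; a real model would return one high-dimensional vector per token.

```python
def extract_mask_vectors(tokens, token_vectors, mask_token="[MASK]"):
    """Return the vectors at the positions of the preset mask, in text order."""
    if len(tokens) != len(token_vectors):
        raise ValueError("expected one vector per token")
    return [vec for tok, vec in zip(tokens, token_vectors) if tok == mask_token]


tokens = ["She", "can", "cou", "ld", "[MASK]", "swim"]
token_vectors = [
    [0.1, 0.2], [0.3, 0.1], [0.0, 0.5],
    [0.2, 0.2], [0.9, 0.4], [0.6, 0.7],
]

mask_vectors = extract_mask_vectors(tokens, token_vectors)
```

Each extracted vector then serves as the target vector for its candidate word region.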
[0147] Optionally, the model processing module 306 is further configured to:
[0148] The target vector is passed through the word classifier of the masked language model to calculate the initial probability value of each word in the vocabulary relative to the target vector;
[0149] Based on the initial probability values, determine the target probability value of each candidate word in the candidate word set relative to the target vector;
[0150] Based on the target probability value, the target candidate word corresponding to the candidate word region is determined from the candidate word set.
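The specification does not detail the internals of the classifier. A minimal sketch, assuming it is a linear projection onto the vocabulary followed by a softmax (real masked language model heads typically include additional layers), could look like this; the weights and the target vector are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def vocabulary_probabilities(target_vector, weight_rows, bias):
    """Assumed classifier: one weight row and one bias per vocabulary entry."""
    logits = [
        sum(w * x for w, x in zip(row, target_vector)) + b
        for row, b in zip(weight_rows, bias)
    ]
    return softmax(logits)


# Hypothetical 3-entry vocabulary and 2-dimensional target vector.
weight_rows = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
bias = [0.0, 0.0, 0.0]
probs = vocabulary_probabilities([1.0, 0.0], weight_rows, bias)
```

The resulting list holds one initial probability value per vocabulary position, which is what the later lookup by position relies on.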
[0151] Optionally, the device further includes:
[0152] The word segmentation module is configured to:
[0153] Based on the vocabulary, each candidate word in the candidate word set is segmented to obtain the segmentation candidate words corresponding to each candidate word; and
[0154] The position representation, in the vocabulary, of the segmentation candidate words corresponding to each candidate word is determined.
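Vocabulary-based segmentation of a candidate word and the lookup of each piece's position can be sketched as below. Greedy longest-match segmentation is an assumption chosen for illustration; a real tokenizer applies its own trained subword algorithm, and the toy vocabulary is hypothetical.

```python
def segment(word, vocab):
    """Split a word into vocabulary entries by greedy longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the span until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            raise ValueError(f"cannot segment {word!r} with this vocabulary")
        pieces.append(word[start:end])
        start = end
    return pieces


vocab = ["can", "cou", "ld", "must"]
position_of = {piece: index for index, piece in enumerate(vocab)}

can_pieces = segment("can", vocab)      # identical to the candidate word
could_pieces = segment("could", vocab)  # split into two pieces
could_positions = [position_of[p] for p in could_pieces]
```

The positions are what allow the pre-calculated initial probability values to be read off without re-running the classifier per candidate.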
[0155] Optionally, the target candidate word determination module 308 is further configured to:
[0156] Based on the position representation of the segmentation candidate word in the vocabulary, the initial probability value of the segmentation candidate word relative to the target vector is determined from the vocabulary;
[0157] Based on the initial probability values of the segmentation candidate words relative to the target vector, the target probability value of each candidate word in the candidate set relative to the target vector is determined.
[0158] Optionally, the target candidate word determination module 308 is further configured to:
[0159] If each candidate word is determined to be the same as the segmentation candidate word corresponding to each candidate word, the initial probability value of the segmentation candidate word relative to the target vector is determined as the target probability value of each candidate word in the candidate set relative to the target vector.
[0160] Optionally, the target candidate word determination module 308 is further configured to:
[0161] If any of the candidate words is determined to be different from the corresponding segmentation candidate words, then each segmentation candidate word corresponding to the candidate word is determined.
[0162] Obtain the initial probability value of each segmentation candidate word relative to the target vector corresponding to the candidate word;
[0163] The average probability value of each segmentation candidate word corresponding to the candidate word relative to the target vector is determined as the target probability value of the candidate word relative to the target vector.
[0164] In the text word selection device provided by the embodiments of this specification, a masked language model with a sparse attention mechanism reduces the quadratic dependence on sequence length to a linear one, so that the masked language model supports input sequences longer than 512. This solves the problem of processing ultra-long texts longer than 512 in the candidate word filling scenario and, combined with a vocabulary, achieves accurate prediction of the answer for candidate word filling.
[0165] Furthermore, by combining the method of splicing answers into the text, when obtaining the target answer, the answer with the largest target prediction value among the corresponding options in each text fill-in-the-blank can be selected as the target answer, thereby improving the accuracy of English cloze test questions.
[0166] The above is an illustrative scheme of a text word selection device according to this embodiment. It should be noted that the technical solution of this text word selection device and the technical solution of the above-described text word selection method belong to the same concept. For details not described in detail in the technical solution of the text word selection device, please refer to the description of the technical solution of the above-described text word selection method.
[0167] Figure 4 shows a structural block diagram of a computing device 400 according to an embodiment of this specification. The components of the computing device 400 include, but are not limited to, a memory 410 and a processor 420. The processor 420 is connected to the memory 410 via a bus 430, and a database 450 is used to store data.
[0168] The computing device 400 also includes an access device 440, which enables the computing device 400 to communicate via one or more networks 460. Examples of these networks include a Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. The access device 440 may include one or more of any type of wired or wireless network interface (e.g., a Network Interface Card (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) interface, a Wi-MAX interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so on.
[0169] In one embodiment of this specification, the aforementioned components of the computing device 400 and other components not shown in Figure 4 can also be connected to each other, for example, via a bus. It should be understood that the block diagram of the computing device shown in Figure 4 is for illustrative purposes only and is not intended to limit the scope of this specification. Those skilled in the art can add or replace other components as needed.
[0170] The computing device 400 can be any type of stationary or mobile computing device, including mobile computers or mobile computing devices (e.g., tablet computers, personal digital assistants, laptop computers, notebook computers, netbooks, etc.), mobile phones (e.g., smartphones), wearable computing devices (e.g., smartwatches, smart glasses, etc.) or other types of mobile devices, or stationary computing devices such as desktop computers or PCs. The computing device 400 can also be a mobile or stationary server.
[0171] The processor 420 is configured to execute computer-executable instructions that implement the steps of the above-described text word selection method.
[0172] The above is an illustrative scheme of a computing device according to this embodiment. It should be noted that the technical solution of this computing device and the technical solution of the text word selection method described above belong to the same concept. For details not described in detail in the technical solution of the computing device, please refer to the description of the technical solution of the text word selection method described above.
[0173] An embodiment of this specification also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, perform the steps of the text word selection method described above.
[0174] The above is an illustrative scheme of a computer-readable storage medium according to this embodiment. It should be noted that the technical solution of this storage medium and the technical solution of the text word selection method described above belong to the same concept. For details not described in detail in the technical solution of the storage medium, please refer to the description of the technical solution of the text word selection method described above.
[0175] The foregoing has described specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than that shown in the embodiments and may still achieve the desired result. Furthermore, the processes depicted in the drawings do not necessarily require the specific or sequential order shown to achieve the desired result. In some embodiments, multitasking and parallel processing are possible or may be advantageous.
[0176] The computer instructions include computer program code, which may be in the form of source code, object code, executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording media, USB flash drive, portable hard drive, magnetic disk, optical disk, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media, etc. It should be noted that the content included in the computer-readable medium may be appropriately added to or subtracted according to the requirements of legislation and patent practice in the jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, computer-readable media may not include electrical carrier signals and telecommunication signals.
[0177] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that this specification is not limited to the described order of actions, as some steps may be performed in other orders or simultaneously according to this specification. Furthermore, those skilled in the art should also understand that the embodiments described in this specification are preferred embodiments, and the actions and modules involved are not necessarily essential to this specification.
[0178] In the above embodiments, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.
[0179] The preferred embodiments disclosed above are merely illustrative of this specification. The optional embodiments do not exhaustively describe all details, nor do they limit the invention to the specific implementations described. Clearly, many modifications and variations can be made based on the content of this specification. These embodiments have been selected and specifically described in this specification to better explain the principles and practical applications of this specification, thereby enabling those skilled in the art to better understand and utilize this specification. This specification is limited only by the claims and their full scope and equivalents.
Claims
1. A text word selection method, characterized in that it includes: determining an initial text containing a candidate word region, and a candidate word set corresponding to the candidate word region; performing text processing on the initial text based on the candidate word region and the candidate word set to obtain a target text; after segmenting the target text into words, using a masked language model to obtain the target vector of the target word in the segmented target text, wherein the masked language model employs a sparse attention mechanism; and determining, based on a vocabulary, the target probability value of each candidate word in the candidate word set relative to the target vector, and determining the target candidate word based on the target probability value, which includes: passing the target vector through the vocabulary classifier of the masked language model to calculate the initial probability value of each word in the vocabulary relative to the target vector; determining the target probability value of each candidate word in the candidate word set relative to the target vector based on the initial probability value; and determining the target candidate word corresponding to the candidate word region from the candidate word set based on the target probability value.
2. The text word selection method according to claim 1, characterized in that determining the initial text containing the candidate word region and the candidate word set corresponding to the candidate word region includes: obtaining the initial text containing the candidate word region, and the candidate word set corresponding to the initial text; determining the number of regions of the candidate word region and the number of sets of the candidate word set; and, when the number of regions and the number of sets are the same, determining the candidate word set corresponding to the candidate word region.
3. The text word selection method according to claim 2, characterized in that determining the candidate word set corresponding to the candidate word region includes: determining the text order of the candidate word regions in the initial text, and the arrangement order of the candidate word sets; and matching the text order with the arrangement order to determine the candidate word set corresponding to the candidate word region.
4. The text word selection method according to claim 1, characterized in that processing the initial text based on the candidate word region and the candidate word set to obtain the target text includes: replacing the candidate word region with a preset mask of the masked language model; and concatenating each candidate word in the candidate word set with the preset mask to obtain the concatenated target text.
5. The text word selection method according to claim 4, characterized in that segmenting the target text into words and then using a masked language model to obtain the target vector of the target word in the segmented target text includes: segmenting the target text based on a vocabulary to obtain the text segmentation of the target text; inputting the text segmentation into the masked language model to obtain the segmentation vectors of the text segmentation; and determining the mask vector of the preset mask from the segmentation vectors, and using the mask vector of the preset mask as the target vector of the target word in the segmented target text.
6. The text word selection method according to claim 5, characterized in that, before determining the target probability value of each candidate word in the candidate word set relative to the target vector based on the initial probability value, the method further includes: segmenting each candidate word in the candidate word set based on the vocabulary to obtain the segmentation candidate words corresponding to each candidate word; and determining the position representation, in the vocabulary, of the segmentation candidate words corresponding to each candidate word.
7. The text word selection method according to claim 6, characterized in that determining the target probability value of each candidate word in the candidate word set relative to the target vector based on the initial probability value includes: determining, based on the position representation of the segmentation candidate word in the vocabulary, the initial probability value of the segmentation candidate word relative to the target vector from the vocabulary; and determining, based on the initial probability values of the segmentation candidate words relative to the target vector, the target probability value of each candidate word in the candidate word set relative to the target vector.
8. The text word selection method according to claim 7, characterized in that determining the target probability value of each candidate word in the candidate word set relative to the target vector based on the initial probability values of the segmentation candidate words relative to the target vector includes: if each candidate word is determined to be the same as its corresponding segmentation candidate word, determining the initial probability value of the segmentation candidate word relative to the target vector as the target probability value of each candidate word in the candidate word set relative to the target vector.
9. The text word selection method according to claim 7, characterized in that determining the target probability value of each candidate word in the candidate word set relative to the target vector based on the initial probability values of the segmentation candidate words relative to the target vector includes: if any candidate word is determined to be different from its corresponding segmentation candidate words, determining each segmentation candidate word corresponding to the candidate word; obtaining the initial probability value, relative to the target vector, of each segmentation candidate word corresponding to the candidate word; and determining the average of the initial probability values of the segmentation candidate words corresponding to the candidate word as the target probability value of the candidate word relative to the target vector.
10. A text word selection device, characterized in that it includes: a text determination module configured to determine an initial text containing a candidate word region and a candidate word set corresponding to the candidate word region; a text processing module configured to perform text processing on the initial text based on the candidate word region and the candidate word set to obtain a target text; a model processing module configured to segment the target text into words and then use a masked language model to obtain the target vector of the target word in the segmented target text, wherein the masked language model adopts a sparse attention mechanism; and a target candidate word determination module configured to determine, based on a vocabulary, the target probability value of each candidate word in the candidate word set relative to the target vector, and to determine the target candidate word based on the target probability value; wherein determining the target probability value of each candidate word in the candidate word set relative to the target vector based on the vocabulary and determining the target candidate word based on the target probability value includes: passing the target vector through a vocabulary classifier of the masked language model to calculate the initial probability value of each word in the vocabulary relative to the target vector; determining the target probability value of each candidate word in the candidate word set relative to the target vector based on the initial probability value; and determining the target candidate word corresponding to the candidate word region from the candidate word set based on the target probability value.
11. A computing device, characterized in that it includes a memory and a processor; the memory is used to store computer-executable instructions, and the processor is used to execute the computer-executable instructions to implement the steps of the text word selection method according to any one of claims 1 to 9.
12. A computer-readable storage medium storing computer instructions, characterized in that the computer instructions, when executed by a processor, implement the steps of the text word selection method according to any one of claims 1 to 9.