Keyword determination method, model, apparatus, medium, and device
By combining statistical and contextual information of candidate words for feature extraction, the problem of insufficient accuracy in keyword extraction in existing technologies is solved, and more accurate keyword identification is achieved.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: TENCENT TECHNOLOGY (SHENZHEN) CO LTD
- Filing Date: 2024-12-11
- Publication Date: 2026-06-12
AI Technical Summary
Existing keyword extraction technologies fail to fully utilize the contextual information of the text, resulting in insufficient accuracy in keyword identification.
By combining statistical, semantic, and contextual information of candidate words, and through feature extraction and classification layer processing, it is determined whether candidate words belong to the keywords of the target text.
It improves the accuracy of keyword identification, reduces the impact of clustering parameter selection on the results, and adapts to specific downstream tasks.
Figure CN122197875A_ABST
Description
Technical Field
[0001] This application relates to the field of artificial intelligence technology, and in particular to a keyword determination method, keyword determination model, keyword determination device, computer-readable storage medium, and electronic device.
Background Technology
[0002] Keyword extraction technology refers to the technique of identifying words or phrases from text that represent the theme or content of a document. The essence of keyword extraction technology lies in determining keywords. Keyword extraction technology is beneficial for improving the efficiency of information retrieval, enhancing text understanding capabilities, and optimizing content recommendation, and has wide applications in many fields.
[0003] Keyword extraction solutions provided by related technologies can be divided into two categories. One category includes solutions that do not involve artificial intelligence algorithms; these specifically include statistical keyword extraction methods (such as the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm) and ranking-based keyword extraction methods (such as the TextRank algorithm). These solutions do not consider contextual information during keyword extraction, and the accuracy of the extracted keywords needs improvement. The other category includes solutions that involve artificial intelligence algorithms, such as clustering-based keyword extraction methods, whose keyword extraction accuracy also needs further improvement.
Summary of the Invention
[0004] This application provides a keyword determination method, keyword determination model, keyword determination device, computer-readable storage medium, and electronic device. By combining the candidate word's own statistical information and semantic information with its contextual information in the text in which the candidate word is located, the method determines whether the candidate word is a keyword of the current text, thereby improving the accuracy of keyword determination.
[0005] In a first aspect, embodiments of this application provide a keyword determination method, the method comprising: acquiring candidate words contained in a target text; determining statistical information corresponding to the candidate words; performing feature extraction on the statistical information to obtain a first feature vector corresponding to the candidate words; performing feature extraction on the target text to obtain a second feature vector corresponding to the candidate words, wherein the second feature vector includes semantic features of the candidate words and contextual features of the candidate words in the target text; and determining whether the candidate words belong to the keywords of the target text based on the first feature vector and the second feature vector corresponding to the candidate words.
[0006] In a second aspect, embodiments of this application provide a keyword determination model, which includes: a second feature extraction layer, used to receive target text and extract features from the target text, and output a second feature vector corresponding to candidate words in the target text, wherein the second feature vector includes the semantic features of the candidate words and the contextual features of the candidate words in the target text; a first feature extraction layer, used to extract features from the statistical information corresponding to the candidate words to obtain a first feature vector; and a classification layer, used to process the concatenated vector of the first feature vector and the second feature vector corresponding to the candidate words, and output whether the candidate words belong to the keywords of the target text.
[0007] In a third aspect, embodiments of this application provide a keyword determination device, which includes: an acquisition module, a first determination module, a first feature extraction module, a second feature extraction module, and a second determination module.
[0008] The acquisition module is used to acquire candidate words contained in the target text; the first determination module is used to determine the statistical information corresponding to the candidate words; the first feature extraction module is used to extract features from the target text to obtain a second feature vector corresponding to the candidate words, wherein the second feature vector includes the semantic features of the candidate words and the contextual features of the candidate words in the target text; the second feature extraction module is used to extract features from the statistical information to obtain a first feature vector corresponding to the candidate words; and the second determination module is used to determine whether the candidate words belong to the keywords of the target text based on the first feature vector and the second feature vector corresponding to the candidate words.
[0009] In an exemplary embodiment, based on the above scheme, the first determining module is specifically used to: determine the i-th length information of the i-th candidate word; determine the statistical information of the i-th candidate word based on the i-th length information; and / or determine the i-th frequency of the i-th candidate word in the target text; determine the i-th density information based on the i-th frequency and the total number of candidate words in the target text; and determine the statistical information of the i-th candidate word based on the i-th density information; wherein i takes the value of a positive integer.
[0010] In an exemplary embodiment, based on the above scheme, the apparatus further includes: a third determining module;
[0011] The third determining module is used to: determine the position identifier sequence according to the order in which candidate words appear in the target text, wherein each candidate word corresponds to a position identifier;
[0012] The aforementioned first feature extraction module includes: a word segmentation unit, a tagging unit, an extraction unit, a first determination unit, and a second determination unit;
[0013] The word segmentation unit is used to segment the target text to obtain M words, where M is an integer greater than 1; the tagging unit is used to tag the M words based on the position identifier sequence; the extraction unit is used to extract features from the M words to obtain M intermediate feature vectors corresponding to the M words; the first determining unit is used to determine the intermediate feature vector corresponding to the tagged word among the M intermediate feature vectors; and the second determining unit is used to determine the second feature vector corresponding to the candidate word based on the intermediate feature vector corresponding to the tagged word.
[0014] In an exemplary embodiment, based on the above scheme, the apparatus further includes: a fourth determining module;
[0015] The fourth determining module is used to: determine the candidate word identifier sequence according to the order of the positions of the candidate words in the target text, wherein, for the first candidate word with a frequency greater than 1, multiple identifiers corresponding to the same first candidate word are associated in the candidate word identifier sequence;
[0016] The second determining unit is specifically used to perform vector summation on multiple intermediate feature vectors corresponding to the associated identifiers based on the candidate word identifier sequence to obtain the second feature vector corresponding to the first candidate word.
[0017] In an exemplary embodiment, based on the above scheme, the second determining unit is specifically used to: perform max pooling on the multiple intermediate feature vectors corresponding to the associated identifiers respectively; and to perform vector summation on the multiple intermediate feature vectors after max pooling on the multiple associated identifiers to obtain the second feature vector corresponding to the first candidate word.
[0018] In an exemplary embodiment, based on the above scheme, the second determining unit is further specifically used to: perform max pooling on the intermediate feature vector corresponding to the second candidate word whose frequency of occurrence is equal to 1, to obtain the second feature vector corresponding to the second candidate word.
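The aggregation described in paragraphs [0016] to [0018] can be sketched roughly as follows. This is a minimal sketch, not the application's implementation: it assumes each occurrence of a candidate word covers one or more tokens, that "max pooling" means an element-wise maximum over the token vectors of one occurrence, and that vectors are plain Python lists. The function names and toy values are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def max_pool(token_vectors: List[Vector]) -> Vector:
    """Element-wise maximum over the token vectors of one occurrence."""
    return [max(column) for column in zip(*token_vectors)]


def vector_sum(vectors: List[Vector]) -> Vector:
    """Element-wise sum of equally sized vectors."""
    return [sum(column) for column in zip(*vectors)]


def second_feature_vectors(
    occurrences: Dict[str, List[List[Vector]]]
) -> Dict[str, Vector]:
    """Map each candidate word to one second feature vector.

    occurrences[word] holds one entry per occurrence of the word; each entry
    is the list of intermediate vectors of the tokens in that occurrence.
    """
    result = {}
    for word, per_occurrence in occurrences.items():
        pooled = [max_pool(tokens) for tokens in per_occurrence]
        # Frequency 1: the pooled vector is used directly.
        # Frequency > 1: pooled vectors of the associated occurrences are summed.
        result[word] = pooled[0] if len(pooled) == 1 else vector_sum(pooled)
    return result


# Toy intermediate vectors (assumed values, two dimensions for readability).
toy = {
    "Dongshan Avenue": [
        [[0.1, 0.5], [0.3, 0.2]],  # first occurrence, two tokens
        [[0.4, 0.1], [0.2, 0.6]],  # second occurrence, two tokens
    ],
    "hoisting": [[[0.7, 0.2]]],    # single occurrence, one token
}
features = second_feature_vectors(toy)
```

Summation keeps the output dimension fixed regardless of how often a word occurs, which is what allows the later concatenation step to work with vectors of a known size.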
[0019] In an exemplary embodiment, based on the above scheme, the acquisition module is further configured to: acquire the text feature vector corresponding to the target text;
[0020] The second determining module is specifically used to: concatenate the first feature vector and the second feature vector corresponding to the target candidate word, as well as the text feature vector, to obtain the third feature vector corresponding to the target candidate word; and, based on the third feature vector corresponding to the target candidate word, determine whether the target candidate word belongs to the keywords of the target text.
[0021] In an exemplary embodiment, based on the above scheme, the second feature extraction module is specifically used to: input the statistical information corresponding to the candidate words in the target text into the first feature extraction layer of the keyword determination model, so as to perform feature extraction through the first feature extraction layer and obtain the first feature vector corresponding to any candidate word in the target text;
[0022] The first feature extraction module is specifically used to: input the position identifier sequence of the target text and candidate words into the second feature extraction layer of the keyword determination model, so as to extract features from the target text through the second feature extraction layer and determine the second feature vector corresponding to the candidate words from the extracted feature vector based on the position identifier sequence;
[0023] The second determining module is specifically used to: concatenate the first feature vector and the second feature vector corresponding to the target candidate word, as well as the text feature vector, to obtain the third feature vector corresponding to the target candidate word; and process the third feature vector corresponding to the target candidate word through the classification layer of the keyword determining model to obtain the classification result of whether the target candidate word belongs to the keyword.
[0024] In an exemplary embodiment, based on the above scheme, the second determining module is specifically used to: obtain a corresponding number of third feature vectors from the third feature vectors corresponding to multiple candidate words in the target text through a sliding window; and input the obtained third feature vectors into the classification layer so that the classification layer processes the received third feature vectors to obtain a classification result of whether the relevant candidate words belong to the keywords.
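The concatenation and sliding-window steps of paragraphs [0020] to [0024] might look like the sketch below. The window size, the stride, and the helper names `concat` and `sliding_windows` are assumptions made for illustration; the text does not specify them.

```python
from typing import Iterator, List, Sequence

Vector = List[float]


def concat(first: Sequence[float], second: Sequence[float],
           text: Sequence[float]) -> Vector:
    """Third feature vector: first, second, and text feature vectors joined."""
    return [*first, *second, *text]


def sliding_windows(vectors: List[Vector], size: int,
                    stride: int) -> Iterator[List[Vector]]:
    """Yield groups of at most `size` vectors, advancing by `stride`."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    for start in range(0, len(vectors), stride):
        yield vectors[start:start + size]
        if start + size >= len(vectors):
            break


# Five candidate words with toy one-dimensional feature vectors.
thirds = [concat([float(i)], [i + 0.5], [9.0]) for i in range(5)]
windows = list(sliding_windows(thirds, size=2, stride=2))
```

Each yielded window would then be passed to the classification layer as one batch.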
[0025] In an exemplary embodiment, based on the above solution, the apparatus further includes: a model training module;
[0026] The above model training module is used to: determine multiple sets of training samples, wherein each set of training samples includes multiple candidate words belonging to the same sample text, and the label of each candidate word indicates a keyword or a non-keyword; and optimize the parameters of the keyword determination model through the multiple sets of training samples, wherein, before optimization, the second feature extraction layer is a pre-trained feature extraction model.
[0027] In a fourth aspect, embodiments of this application provide an electronic device, including a processor and a memory. The memory is used to store a computer program, and the processor is used to call and run the computer program stored in the memory to perform the keyword determination method provided in the first aspect.
[0028] In a fifth aspect, embodiments of this application provide a chip for implementing the keyword determination method provided in the first aspect above. Specifically, the chip includes a processor for retrieving and running a computer program from a memory, causing a device equipped with the chip to execute the keyword determination method provided in the first aspect above.
[0029] In a sixth aspect, embodiments of this application provide a computer-readable storage medium for storing a computer program that causes a computer to execute the keyword determination method provided in the first aspect.
[0030] In a seventh aspect, embodiments of this application provide a computer program product, including computer program instructions, which cause a computer to execute the keyword determination method provided in the first aspect.
[0031] In an eighth aspect, embodiments of this application provide a computer program that, when run on a computer, causes the computer to execute the keyword determination method provided in the first aspect.
[0032] In summary, the keyword determination scheme provided in this application embodiment, on the one hand, determines the statistical information of candidate words in the target text and further extracts features to obtain a first feature vector of the candidate words. On the other hand, it performs feature extraction on the target text to obtain a second feature vector corresponding to each candidate word, wherein the second feature vector contains the semantic features of the current candidate word and its contextual features in the target text. Furthermore, by combining the candidate word's own statistical feature vector, semantic features, and contextual features for prediction, a classification result of whether the candidate word belongs to the keywords of the target text can be obtained. In the keyword determination process provided in this application embodiment, the candidate word's own information and its contextual information in the text are combined to determine whether the candidate word belongs to the keywords of the current text, thereby improving the accuracy of keyword determination.
Attached Figure Description
[0033] To more clearly illustrate the technical solutions in the embodiments of this application, the drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0034] Figure 1 is a flowchart illustrating a keyword determination method provided in an embodiment of this application;
[0035] Figure 2 is a schematic diagram illustrating statistical information of candidate words in a text, provided in an embodiment of this application;
[0036] Figure 3 is a schematic block diagram illustrating a keyword determination model provided in an embodiment of this application;
[0037] Figure 4 is a schematic block diagram of the first feature extraction layer in a keyword determination model provided in an embodiment of this application;
[0038] Figure 5A is a schematic block diagram illustrating a method for determining a second feature vector provided in an embodiment of this application;
[0039] Figure 5B is a schematic diagram illustrating a method for extracting a second feature vector by combining a second feature extraction layer, as provided in an embodiment of this application;
[0040] Figure 6A is a schematic block diagram illustrating a method for determining a second feature vector provided in an embodiment of this application;
[0041] Figure 6B is a schematic diagram illustrating a method for extracting a second feature vector by combining a second feature extraction layer, as provided in an embodiment of this application;
[0042] Figure 7 is a schematic diagram illustrating a method for extracting a first feature vector by combining a first feature extraction layer, as provided in an embodiment of this application;
[0043] Figure 8 is a schematic diagram illustrating keyword extraction using a keyword determination model, as provided in an embodiment of this application;
[0044] Figure 9 is a schematic diagram of a training sample provided in an embodiment of this application;
[0045] Figure 10 is a schematic block diagram of a keyword determination device provided in an embodiment of this application;
[0046] Figure 11 is a schematic block diagram of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0047] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application.
[0048] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in sequences other than those illustrated or described herein. In embodiments of this application, "B corresponding to A" means that B is associated with A. In one implementation, B can be determined based on A. However, it should also be understood that determining B based on A does not mean determining B solely based on A; B can also be determined based on A and / or other information. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or server that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to these processes, methods, products, or devices. In the description of this application, unless otherwise stated, "a plurality of" means two or more.
[0049] In this application embodiment, the terms "module" or "unit" refer to a computer program or part of a computer program that has a predetermined function and works with other related parts to achieve a predetermined goal, and can be implemented wholly or partially using software, hardware (such as processing circuitry or memory), or a combination thereof. Similarly, a processor (or multiple processors or memory) can be used to implement one or more modules or units. Furthermore, each module or unit can be part of an overall module or unit that includes the functionality of that module or unit.
[0050] Keyword extraction technology is beneficial for improving information retrieval efficiency, enhancing text understanding, and optimizing content recommendation, and has wide applications in many fields. For example, with the explosive growth of articles on the internet, an increasing amount of online text needs to be filtered. Often, this filtering works by attaching keywords to the information source and passing only articles that contain the specified keywords to downstream tasks. Extracting keywords that meet downstream needs (for example, road-related keywords) is therefore particularly important.
[0051] First, the methods provided by related technologies and the problems with them are introduced.
[0052] The statistical keyword extraction methods provided by related technologies mainly utilize Term Frequency (TF) and Inverse Document Frequency (IDF) to evaluate the importance of words in text. Specifically, the TF-IDF value measures the importance of a word in a document by combining TF and IDF. This method requires TF-IDF calculation for a single document each time keywords are extracted, and usually requires statistical analysis of the entire document set, making the calculation process cumbersome. Furthermore, this method relies solely on statistical information (such as TF and IDF) without considering the contextual information within the document, which may reduce the accuracy of keyword identification.
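To make the TF-IDF baseline concrete, here is a minimal sketch in plain Python. The smoothed IDF formula is one common variant chosen for illustration, and the three-document toy corpus is invented; neither comes from the application. Note that the score needs the whole corpus to compute document frequencies, which is the overhead the paragraph above refers to.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(doc: List[str], corpus: List[List[str]]) -> Dict[str, float]:
    """Score each word of `doc` by term frequency times inverse document frequency."""
    counts = Counter(doc)
    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        tf = count / len(doc)
        df = sum(1 for other in corpus if word in other)
        # Smoothed IDF (an assumed variant) avoids division by zero.
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[word] = tf * idf
    return scores


# Toy corpus of already segmented documents (assumed data).
corpus = [
    ["road", "closed", "for", "construction"],
    ["road", "opened", "after", "construction"],
    ["concert", "tickets", "on", "sale"],
]
scores = tf_idf(corpus[0], corpus)
```

In this toy corpus "closed" outranks "road" because "road" appears in two of the three documents. The score uses no information about the words surrounding each occurrence.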
[0053] The core idea behind ranking-based keyword extraction methods provided by related technologies is to assess word importance by constructing a word co-occurrence graph. Specifically, if a word appears frequently in the vicinity of many other words, its importance is considered high. The TextRank algorithm borrows from PageRank, iteratively calculating the score of each word, where a high TextRank value positively influences the scores of its neighboring words. However, this method typically requires combining part-of-speech tagging and word frequency filtering to remove stop words and irrelevant words. Furthermore, this approach relies heavily on statistical information (such as word frequency and co-occurrence relationships) without fully utilizing the contextual information in the document, which may reduce the accuracy of keyword identification.
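A rough sketch of the TextRank idea follows: build an undirected co-occurrence graph, then run PageRank-style updates. The window size, damping factor, and iteration count are conventional but assumed values, and the function is a simplified illustration rather than a reference implementation of TextRank.

```python
from collections import defaultdict
from typing import Dict, List


def textrank(words: List[str], window: int = 2, damping: float = 0.85,
             iterations: int = 30) -> Dict[str, float]:
    """Score words by PageRank-style iteration on a co-occurrence graph."""
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        # Words within `window` positions of each other are linked.
        for other in words[i + 1:i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        # Each word receives an equal share of every neighbor's score.
        scores = {
            word: (1 - damping) + damping * sum(
                scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return scores


# Toy word sequence after stop-word filtering (assumed data).
ranked = textrank(["road", "closed", "road", "construction", "road", "bridge"])
best = max(ranked, key=ranked.get)
```

Here "road" scores highest only because it co-occurs with the most distinct words, which illustrates how the ranking depends on graph statistics rather than on the meaning of the surrounding sentence.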
[0054] The core idea of clustering-based keyword extraction methods provided by related technologies is to use word embedding models (such as Word2vec) to represent candidate keywords in text as word vectors, and then use clustering algorithms (such as K-means) to cluster these word vectors. Finally, the top N words closest to the cluster center from each cluster are selected as the final keywords. This method can capture the semantic relationships between words through word embedding models, thus better combining contextual information. However, the effectiveness of this method may be affected by the choice of clustering parameters (such as the number of clusters), leading to instability in the accuracy of keyword determination.
[0055] It is evident that the accuracy of keyword selection in the relevant technologies needs improvement.
[0056] To address the aforementioned issues, the keyword extraction scheme provided in this application embodiment involves, on one hand, determining the statistical information of candidate words in the target text and further extracting features to obtain a first feature vector for the candidate words. On the other hand, feature extraction is performed on the target text to obtain a second feature vector corresponding to each candidate word, wherein the second feature vector contains the semantic features of the current candidate word and its contextual features in the target text. Furthermore, by combining the second feature vector and the first feature vector of the candidate word for prediction, a classification result indicating whether the candidate word belongs to the keywords of the target text can be obtained. In the keyword determination process provided in this application embodiment, the statistical information of the candidate word itself and the contextual information in the text containing the candidate word are combined to determine whether the candidate word belongs to the keywords of the current text, thereby improving the accuracy of keyword determination. In addition, this application embodiment does not involve the selection of clustering parameters (such as the number of clusters), so the accuracy of the determined keywords is not affected by such parameter choices.
[0057] The keyword determination method of this application will be described in detail below through some embodiments. These embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments.
[0058] Figure 1 is a flowchart illustrating a keyword determination method P100 provided in an embodiment of this application. The execution entity of method P100 is a computing device such as a server or terminal.
[0059] In step S110, candidate words contained in the target text are obtained.
[0060] For example, the solution of this application can be applied to text filtering scenarios targeting specific topics, such as texts about roads being closed or closed roads being reopened. The text to be filtered can come from online intelligence (such as Weibo posts, WeChat official account articles, etc.). Specifically, for the acquired online intelligence, the solution provided in the embodiments of this application is used to identify keywords in each text, and the identified keywords are then used as the basis for deciding whether to filter out the current text, thereby filtering out articles unrelated to the target topic.
[0061] The following describes an example of how to determine candidate words contained in a target text, where the target text is any text to be processed.
[0062] In step S110-1, the target text is segmented to obtain the first word set.
[0063] For example, word segmentation tools (such as jieba segmentation, pkuseg segmentation, etc.) can be used to segment the target text to obtain the words contained in the target text, thus obtaining the first word set mentioned above.
[0064] For example, rule-based word segmentation methods can be used. These include: forward maximum matching, which scans the sentence in the target text from left to right and selects the longest possible word as the segmentation result each time; backward maximum matching, which scans the sentence in the target text from right to left and selects the longest possible word as the segmentation result each time; and bidirectional maximum matching, which combines forward and backward maximum matching and determines the final segmentation result by comparing the results of the two methods. This allows us to obtain the words contained in the target text, resulting in the first word set mentioned above.
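The three matching strategies can be sketched as below. The dictionary, the maximum word length, and in particular the tie-breaking rule in `bidirectional_max_match` (fewer words wins, backward result on a tie) are assumptions; the text only says the two results are compared. Unknown single characters are emitted as one-character words.

```python
from typing import List, Set


def forward_max_match(text: str, vocab: Set[str], max_len: int) -> List[str]:
    """Scan left to right, taking the longest dictionary word at each step."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text: str, vocab: Set[str], max_len: int) -> List[str]:
    """Scan right to left, taking the longest dictionary word at each step."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def bidirectional_max_match(text: str, vocab: Set[str],
                            max_len: int) -> List[str]:
    """Assumed rule: prefer fewer words; on a tie keep the backward result."""
    forward = forward_max_match(text, vocab, max_len)
    backward = backward_max_match(text, vocab, max_len)
    return forward if len(forward) < len(backward) else backward


# Toy dictionary and sentence (assumed data, not from the application).
vocab = {"研究", "研究生", "生命", "起源"}
forward = forward_max_match("研究生命起源", vocab, max_len=3)
chosen = bidirectional_max_match("研究生命起源", vocab, max_len=3)
```

On this sentence the forward scan greedily takes the three-character word first and leaves a stray single character, while the backward scan recovers three dictionary words, which is why the two results are compared.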
[0065] For example, a predefined dictionary can also be used for word segmentation. Specifically, consecutive characters in the target text are matched with words in the existing dictionary to obtain the words contained in the target text, thus obtaining the first word set mentioned above.
[0066] It is understood that the methods used to segment the target text in the embodiments of this application are not limited to the above-described schemes.
[0067] In step S110-2, the words in the first word set are filtered according to preset rules to obtain a second word set containing candidate words; wherein the preset rules include at least one of the following: part of speech of the words, and preset stop words.
[0068] When the preset rule is the part of speech of the words, the words in the first word set can be tagged with their part of speech, and words that are neither verbs nor nouns (such as adjectives and adverbs) can be filtered out to obtain a second word set containing candidate words.
[0069] When the preset rule is preset stop words, the first word set can be compared with a general stop word list to filter out words in the first word set that match words in the stop word list, leaving the non-stop words and thus obtaining a second word set containing candidate words.
[0070] When the preset rules include both the part of speech of the words and preset stop words, the words in the first word set can first be tagged with their part of speech, and words that are neither verbs nor nouns (such as adjectives and adverbs) are filtered out. The nouns and verbs remaining after this first filtering are then compared with a general stop word list to filter out words that match the list. The words remaining after these two steps are used as the candidate words of the target text.
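The two-step filtering of step S110-2 can be sketched as follows. The part-of-speech tag names, the stop word list, and the tagged input are hypothetical; in practice the tags would come from whatever tagger accompanies the segmentation tool.

```python
from typing import Iterable, List, Set, Tuple

KEPT_POS = {"noun", "verb"}  # hypothetical tag names


def filter_candidates(tagged_words: Iterable[Tuple[str, str]],
                      stop_words: Set[str]) -> List[str]:
    """Keep nouns and verbs that are not stop words, preserving text order."""
    return [word for word, pos in tagged_words
            if pos in KEPT_POS and word not in stop_words]


# First word set with assumed part-of-speech tags.
first_word_set = [
    ("Dongshan Avenue", "noun"), ("temporarily", "adverb"),
    ("closed", "verb"), ("is", "verb"), ("construction", "noun"),
]
second_word_set = filter_candidates(first_word_set, stop_words={"is"})
```

Order and repeats are preserved because the later steps rely on where, and how often, each candidate word appears in the text.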
[0071] It is understood that the methods by which the first word set is filtered in the embodiments of this application are not limited to the above-described scheme.
[0072] In an exemplary embodiment, text A reads as follows: "Attention! The section of Dongshan Avenue near Xianmu Bridge and Fule Garden gatehouse will be temporarily closed for construction. Southern reporters learned from the Meizhou traffic police that due to the steel box girder hoisting work required for the pedestrian overpass project at the Meixian District Senior High School sports field, the section of Dongshan Avenue near Xianmu Bridge and Fule Garden gatehouse in Meijiang District needs to be temporarily closed." The candidate words included in text A are: detour, Dongshan Avenue, Xianmu Bridge, Fule Garden, gatehouse section, closure, construction, reporter, Meizhou, Meixian District, Senior High School, sports field, pedestrian overpass, project, steel box girder, hoisting, construction, fully closed, Meijiang District, Dongshan Avenue, Xianmu Bridge, Fule Garden, section, road.
[0073] Continuing to refer to Figure 1, in step S120, the statistical information corresponding to the candidate words is determined.
[0074] In an exemplary embodiment, the above statistical information includes: length information of candidate words and / or density information of candidate words in the target text.
[0075] The length information of a candidate word can be the number of characters in the candidate word. For example, the length information of the candidate word "closed" is 2, and the length information of the candidate word "Dongshan Avenue" is 4 (the counts refer to the characters of the words in the original-language text, not to their English translations).
[0076] The density information of a candidate word in the target text refers to the ratio of the frequency of the current candidate word to the total number of candidate words in the target text. For example, if the frequency of the candidate word "Dongshan Avenue" in the target text is 2, and the total number of candidate words in the target text is 23, then the density information of the candidate word "Dongshan Avenue" in the target text is 2 / 23. That is, let the total number of candidate words in the current text be N (a positive integer), let n be the frequency of the current candidate word in the text, and let n / N be the density information corresponding to the current candidate word.
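The length and density computation can be sketched as below. The sketch assumes that N counts repeated occurrences, and it uses placeholder words plus an assumed original-language spelling of "Dongshan Avenue" so that `len()` returns the length of 4 mentioned in the text; the helper name is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from typing import Dict, List, Tuple


def statistical_info(candidates: List[str]) -> Dict[str, Tuple[int, Fraction]]:
    """Return (length, density) for each distinct candidate word.

    length  = number of characters in the word
    density = n / N, where n is the word's frequency and N is the total
              number of candidate words (repeats included, by assumption)
    """
    total = len(candidates)
    counts = Counter(candidates)
    return {word: (len(word), Fraction(n, total))
            for word, n in counts.items()}


# 21 placeholder words plus two occurrences of one word, so N = 23.
candidates = [f"w{i}" for i in range(21)] + ["东山大道", "东山大道"]
info = statistical_info(candidates)
length, density = info["东山大道"]
```

`Fraction` is used only to show the ratio n / N exactly; a float would serve equally well as model input.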
[0077] For example, Figure 2 shows the statistical information of the candidate words in text A above. It is understood that the statistical information of the candidate words is not limited to the length and / or density information mentioned above. Other statistical information may also be included, such as word frequency, inverse document frequency, and part-of-speech tags, which is not limited in this embodiment.
[0078] Figure 3 is a schematic block diagram of a keyword determination model 300 provided in an embodiment of this application. The keyword determination model 300 can be configured in the execution device of method P100, or in other computing devices that can communicate with the aforementioned execution device.
[0079] This application embodiment will predict whether candidate words in the target text can be used as keywords in the target text based on the above keyword model. Specifically, refer to... Figure 3In the keyword determination model 300, the first feature extraction layer 310 processes the statistical information of the candidate words to determine the first feature vector of the candidate words, where the first feature vector is a vector of statistical features of the candidate words. The second feature extraction layer 320 in the keyword determination model 300 processes the target text to determine the second feature vector of the candidate words, where the second feature vector includes the semantic features of the candidate words and the contextual features of the candidate words in the target text. For example, the second feature extraction layer, capable of extracting feature vectors containing the semantics of the current word and its contextual semantics from the text, can be implemented using a pre-trained model, such as a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model or a pre-trained T5 (Text-to-Text Transfer Transformer) model. It should be noted that the second feature extraction layer 320, as part of the keyword determination model, is used for keyword prediction only after model training; that is, the pre-trained model is fine-tuned before being used for extracting the semantic feature vectors of the candidate words.
[0080] Keyword extraction methods based on deep learning (such as the typical JointKPE algorithm) utilize a pre-trained BERT model to represent candidate keywords as word vectors, then calculate the similarity between the candidate word vectors and document vectors, selecting the top N words with the highest similarity as the final keywords. Compared to traditional word embedding-based methods (such as Word2vec), JointKPE, by using BERT, can better capture the contextual information and semantic features of words. However, JointKPE's direct use of the pre-trained BERT model for feature extraction may lead to results influenced by the pre-trained model itself, making it less adaptable to specific downstream tasks.
[0081] Furthermore, in this embodiment, for any candidate word (such as "closed"), its corresponding first feature vector and second feature vector are concatenated to obtain a concatenated vector, which can be called the third feature vector of the current candidate word. Further, the classification layer 330 processes the third feature vector to output a classification result indicating whether the candidate word belongs to the keyword category. It is evident that in the keyword determination process of this embodiment, on the one hand, it can combine the candidate word's own statistical feature vector, semantic features, and contextual features for prediction, which helps improve the accuracy of keyword determination. On the other hand, the keyword determination model is learned using labeled data, allowing it to better predict outputs based on previous experience, thus making it more suitable for downstream road-related keywords and business scenarios.
[0082] The training examples of the keyword determination model will be described in detail below. Next, we will discuss examples of processing statistical information based on the keyword determination model.
[0083] Continuing to refer to Figure 1, in step S130, feature extraction is performed on the statistical information to obtain the first feature vector corresponding to the candidate word.
[0084] Figure 4 is a schematic block diagram of a first feature extraction layer 310 in a keyword determination model provided in an embodiment of this application. Exemplarily, the first feature extraction layer may include multiple embedding layers, each of which processes one type of statistical information. Referring to Figure 4, embedding layer 1 processes the first type of statistical information and outputs the first feature vector a; embedding layer 2 processes the second type of statistical information and outputs the first feature vector b; and so on.
[0085] If the statistical information determined in this embodiment includes length information and density information, then the first feature extraction layer 310 includes two embedding layers. One embedding layer processes the length information and outputs a first feature vector (which can be called a length feature vector) containing the length information of the candidate words, thereby mapping the length information of each candidate word to a corresponding vector. For example, the dimension of each length feature vector is set to 768 to be consistent with the dimension of the feature vector output by the second feature extraction layer. The other embedding layer processes the density information and outputs a first feature vector (which can be called a density feature vector) containing the density information of the candidate words, thereby mapping the density information of each candidate word to a corresponding density feature vector. The dimension of each density feature vector is likewise set to 768, consistent with the length feature vector.
[0086] For example, before the statistical information of all candidate words corresponding to the target text is input into the corresponding embedding layer, it can be normalized to standardize the model input, which helps training efficiency. For example, the density information of each candidate word in the target text can be scaled into a certain range, such as [0.01, 0.05], and the values within that range can then be mapped to the corresponding density vectors through the corresponding embedding layer. Normalization ensures that the density values of different candidate words are on the same scale; if the density values differ greatly, unnormalized data may cause instability or slower convergence of optimization algorithms such as gradient descent during model training.
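The text does not specify how densities are brought into the range [0.01, 0.05]. The sketch below assumes plain min-max scaling, which is only one possible choice; the function name `rescale` and its parameters are hypothetical.

```python
def rescale(values, low=0.01, high=0.05):
    """Min-max scale a list of numbers into [low, high].

    Assumption: min-max scaling is used; the source only says that the
    densities are brought into a fixed range before the embedding layer.
    """
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # All values equal: map everything to the lower bound.
        return [low for _ in values]
    span = v_max - v_min
    return [low + (v - v_min) * (high - low) / span for v in values]


densities = [1 / 23, 2 / 23, 3 / 23]
scaled = rescale(densities)
```

After scaling, the smallest density sits at the lower bound, the largest at the upper bound, and the middle value halfway between them.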
[0087] Continuing to refer to Figure 1, in step S140, feature extraction is performed on the target text to obtain a second feature vector corresponding to the candidate word, wherein the second feature vector contains the semantic features of the candidate word and the contextual features of the candidate word in the target text.
[0088] To accurately obtain the semantic features of each candidate word in the target text, this embodiment of the application also determines a position identifier sequence based on the order in which the candidate words appear in the target text. Each candidate word corresponds to a position identifier, meaning different candidate words correspond to different position identifiers, and the number of position identifiers in the position identifier sequence is the same as the total number of candidate words in the target text. For example, the sequentially appearing candidate words can be directly used as the position identifier sequence. For instance, the position identifier sequence for the aforementioned text A is: detour, Dongshan Avenue, Xianmu Bridge, Fule Garden, gate tower section, closed, construction, reporter, Meizhou, Meixian District, senior high school, sports field, pedestrian overpass, engineering project, steel box girder, hoisting, construction, fully enclosed, Meijiang District, Dongshan Avenue, Xianmu Bridge, Fule Garden, building section, road.
[0089] Specifically, this application provides two implementations for obtaining the second feature vector. One is the second feature vector extraction method shown in Figure 5A and Figure 5B; the semantic features of the candidate words extracted in the manner shown in Figure 5A and Figure 5B are denoted as the second feature vector A. The other, described in subsequent embodiments, is the second feature vector extraction method shown in Figure 6A and Figure 6B; the semantic features of the candidate words extracted in the manner shown in Figure 6A and Figure 6B are denoted as the second feature vector B.
[0090] The following describes an example of extracting the second feature vector A with reference to Figure 5A and Figure 5B.
[0091] Referring to Figure 5A, the target text and the position identifier sequence of its candidate words can be input into the second feature extraction layer 320. For example, the second feature extraction layer 320 is a fine-tuned BERT. The specific execution includes steps S140-1 to S140-5.
[0092] In step S140-1, the target text is segmented to obtain M words, where M is an integer greater than 1.
[0093] For example, the target text above can be tokenized using a BERT model tokenizer (such as the WordPiece tokenizer).
[0094] In step S140-2, the M words are marked based on the above position identifier sequence.
[0095] For example, the segmented results are converted into token IDs acceptable to the BERT model. Simultaneously, the position of each candidate word in the token sequence is recorded according to the aforementioned position identifier sequence.
[0096] In step S140-3, feature extraction is performed on the M word segments to obtain the M intermediate feature vectors corresponding to the M word segments.
[0097] For example, token IDs, etc., are constructed as tensors and fed into the BERT model. The BERT model is then run to obtain the hidden state vector for each token. Referring to Figure 5B, this yields M intermediate feature vectors. These vectors contain the semantic information and contextual information of each token.
[0098] In step S140-4, the intermediate feature vector corresponding to the tagged word is determined from the M intermediate feature vectors.
[0099] Referring to Figure 5B, based on the positional information of each candidate word recorded during word segmentation, their indices in the token sequence are found, thereby locating the intermediate feature vectors corresponding to the candidate words. Further, the hidden state vectors corresponding to these indices are extracted from the output of the BERT model to obtain the intermediate feature vectors corresponding to the labeled word segments.
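Steps S140-2 and S140-4 amount to recording where each candidate word sits in the token sequence and then reading the hidden states at those indices. The sketch below shows only the index bookkeeping, under the assumption that each candidate occurrence is given as its list of sub-word pieces in order of appearance. The function name `locate_candidates`, the English tokens, and the `##` sub-word prefix are illustrative and not taken from the application.

```python
def locate_candidates(tokens, candidate_pieces):
    """Map each candidate occurrence to its (start, end) span in `tokens`.

    `candidate_pieces` lists the candidates in order of appearance, each as
    the sequence of sub-word pieces produced by the tokenizer. The scan moves
    left to right so repeated words are matched to successive occurrences.
    """
    spans, cursor = [], 0
    for pieces in candidate_pieces:
        k = len(pieces)
        for start in range(cursor, len(tokens) - k + 1):
            if tokens[start:start + k] == pieces:
                spans.append((start, start + k))
                cursor = start + k   # the next candidate must appear later
                break
        else:
            raise ValueError(f"candidate {pieces} not found after index {cursor}")
    return spans


# Hypothetical token sequence with special tokens and one split word.
tokens = ["[CLS]", "road", "closed", "take", "de", "##tour", "[SEP]"]
spans = locate_candidates(tokens, [["closed"], ["de", "##tour"]])
```

The hidden state vectors at indices `range(start, end)` of each span would then be the intermediate feature vectors of that candidate occurrence.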
[0100] In step S140-5, the second feature vector corresponding to the candidate word is determined based on the intermediate feature vector corresponding to the labeled word segmentation.
[0101] Exemplarily, referring to Figure 5B, the intermediate feature vector corresponding to each labeled word can be input into pooling layer 340 for max pooling. Max pooling retains only the maximum value within each pooling window, which makes the model somewhat invariant to small-range translations of the input. Even if the positions of candidate words in the input text change slightly, the pooled features can still capture them, which helps improve the model's robustness to noise and translation transformations in the input data. At the same time, the maximum value within each pooling window typically corresponds to the most significant or important features, so the model can focus more on the features most helpful for classification or detection tasks. Therefore, in this embodiment, after pooling the intermediate feature vector corresponding to each candidate word, the semantic feature vector of the current candidate word, i.e., the second feature vector, is obtained.
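As a rough illustration of the pooling step, the sketch below takes the element-wise maximum over the token vectors of one candidate word. It assumes the pooling window spans all tokens of that candidate, which the text does not state explicitly; `max_pool` is a hypothetical helper name.

```python
def max_pool(token_vectors):
    """Element-wise max over the token vectors of one candidate word.

    Assumption: the pooling window covers every token of the candidate, so
    a candidate split into several sub-word tokens yields a single vector.
    """
    return [max(column) for column in zip(*token_vectors)]


# Two 3-dimensional token vectors belonging to the same candidate word.
pooled = max_pool([[0.1, 0.9, -0.3],
                   [0.4, 0.2, -0.1]])
```

Each output dimension keeps the largest activation seen across the candidate's tokens.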
[0102] In the embodiment provided in Figure 5A and Figure 5B, the target text and its candidate words (position identifier sequence) are simultaneously input into the second feature extraction layer of the model, that is, the feature extraction process is completed in one go, which can greatly shorten the time for text feature calculation and improve the efficiency of keyword extraction.
[0103] The following describes an example of extracting the second feature vector B with reference to Figure 6A and Figure 6B.
[0104] Referring to Figure 6A, in addition to inputting the target text and the position identifier sequence of its candidate words into the second feature extraction layer 320 as shown in Figure 5A, a candidate word identifier sequence is also input into the second feature extraction layer 320. The candidate word identifier sequence is determined based on the order in which candidate words appear in the target text. Specifically, for a first candidate word with a frequency greater than 1, the multiple identifiers corresponding to the same first candidate word are associated in the candidate word identifier sequence. For example, if the word frequency of candidate word s in the target text is S (S is an integer greater than 1), then the S occurrences of candidate word s correspond to the same candidate word identifier, and these S identifiers corresponding to candidate word s are associated. That is, different types of candidate words correspond to different candidate word identifiers. Different types of candidate words refer to candidate words with different textual representations; for example, "closed" and "detour" have different textual representations and belong to different types of candidate words. Referring to Figure 6B, in the candidate word identifier sequence, the identifier "candidate word i" represents the i-th candidate word: "candidate word 1" corresponds to "detour", "candidate word 2" corresponds to "caution detour", "candidate word 3" corresponds to "closed", and "candidate word N" corresponds to "road". Among these, "closed" is a candidate word with a frequency greater than 1 in text A, and the multiple "candidate word 3" entries in the candidate word identifier sequence are associated. Setting the above candidate word identifier sequence helps to accurately extract the semantic vectors of candidate words with a frequency greater than 1 in the target text.
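One simple way to build such a candidate word identifier sequence is to number each distinct candidate word at its first appearance and reuse that number for later occurrences. The sketch below assumes integer identifiers starting at 1; the function name `identifier_sequence` is hypothetical.

```python
def identifier_sequence(candidates):
    """Assign one identifier per distinct candidate word, in order of first
    appearance, and return the identifier of every occurrence."""
    ids = {}
    sequence = []
    for word in candidates:
        if word not in ids:
            ids[word] = len(ids) + 1   # identifiers start at 1 (assumption)
        sequence.append(ids[word])
    return sequence


occurrences = ["detour", "Dongshan Avenue", "Xianmu Bridge", "Dongshan Avenue"]
id_seq = identifier_sequence(occurrences)
```

Occurrences that share an identifier are the "associated" entries: here both occurrences of "Dongshan Avenue" map to identifier 2.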
[0105] For example, the second feature extraction layer 320 mentioned above is a fine-tuned BERT. The specific execution method includes steps S140-1' to S140-5'.
[0106] In step S140-1', the target text is segmented to obtain M words, where M is an integer greater than 1.
[0107] In step S140-2', the M words are marked based on the above position identifier sequence.
[0108] In step S140-3', features are extracted from the M word segments to obtain the M intermediate feature vectors corresponding to the M word segments.
[0109] In step S140-4', the intermediate feature vector corresponding to the tagged word is determined from the M intermediate feature vectors.
[0110] The specific implementation methods of steps S140-1' to S140-4' are the same as those of steps S140-1 to S140-4, and will not be repeated here.
[0111] In step S140-5', the second feature vector corresponding to the candidate word is determined based on the intermediate feature vector corresponding to the marked word segmentation.
[0112] One specific implementation of step S140-5' is used to determine the second feature vector corresponding to candidate words (denoted as the first candidate word) with a word frequency greater than 1 in the target text. To enhance the feature representation of the first candidate word (e.g., "closed" in text A), the feature vectors of repeatedly occurring candidate words can be superimposed. This accumulates information about these words in different contexts, thus helping to generate a more comprehensive and representative feature vector. Simultaneously, repeatedly occurring words usually have high importance in the text; by superimposing the feature vectors of these words, they can be given more weight in the final feature representation. Therefore, in this embodiment, specifically, based on the candidate word identifier sequence, multiple intermediate feature vectors corresponding to the associated identifiers are vector-summed to obtain the second feature vector corresponding to the first candidate word.
[0113] For example, referring to Figure 6B, max pooling is performed on each of the multiple intermediate feature vectors corresponding to the associated identifiers. Further, the multiple max-pooled intermediate feature vectors corresponding to the associated identifiers are summed to obtain the second feature vector corresponding to the first candidate word. For example, Figure 6B shows the candidate word "closed" corresponding to "candidate word 3". The intermediate feature vectors 60 and 61 corresponding to the associated identifier "candidate word 3" are subjected to max pooling to obtain vectors 62 and 63. Further, the max-pooled vectors 62 and 63 are summed to obtain the second feature vector 64 corresponding to the first candidate word "closed". Since the candidate word "closed" appears more than once in text A, its corresponding semantic feature vector is determined by the above vector summation method. This accumulates information about the candidate word in different contexts, helping to generate a more comprehensive and representative feature vector, thereby enhancing the feature representation of the candidate word ("closed").
[0114] Another specific implementation of step S140-5' is used to determine the second feature vector corresponding to a candidate word (denoted as the second candidate word) whose frequency in the target text is equal to 1. For the intermediate feature vector corresponding to the second candidate word, max pooling is performed to obtain the second feature vector corresponding to the second candidate word. Exemplarily, referring to Figure 6B, the word "detour" corresponding to "candidate word 1" is a second candidate word that appears in text A with a frequency of 1. Max pooling is performed on its corresponding intermediate feature vector 65 to obtain the second feature vector 66 corresponding to the second candidate word.
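Both cases of step S140-5' can be covered by one loop: pool every occurrence, then sum the pooled vectors that share a candidate word identifier. The sketch below uses plain Python lists in place of model tensors; the function name `second_feature_vectors` and the toy numbers are assumptions for illustration.

```python
def second_feature_vectors(id_sequence, occurrence_vectors):
    """Return {identifier: second feature vector}.

    occurrence_vectors[i] holds the token vectors of the i-th candidate
    occurrence. Each occurrence is max-pooled; pooled vectors that share an
    identifier (frequency > 1) are summed, while a word with frequency 1
    simply keeps its pooled vector.
    """
    result = {}
    for cid, token_vectors in zip(id_sequence, occurrence_vectors):
        pooled = [max(column) for column in zip(*token_vectors)]
        if cid in result:
            result[cid] = [a + b for a, b in zip(result[cid], pooled)]
        else:
            result[cid] = pooled
    return result


# Identifier 3 occurs twice; its first occurrence spans two tokens.
features = second_feature_vectors(
    [1, 3, 3],
    [[[1.0, 2.0]],
     [[0.0, 4.0], [3.0, 1.0]],
     [[1.0, 1.0]]],
)
```

Here identifier 3 receives the sum of its two pooled vectors, `[3.0, 4.0]` and `[1.0, 1.0]`.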
[0115] In the embodiment provided in Figure 6A and Figure 6B, the target text, its candidate words (position identifier sequence), and the candidate word identifier sequence are simultaneously input into the second feature extraction layer of the model. That is, the feature extraction process is completed in one go, which can greatly shorten the time for text feature calculation and improve keyword extraction efficiency.
[0116] It should be noted that when the second feature vector corresponding to the candidate word is determined through the embodiment corresponding to Figure 6A and Figure 6B, the process of determining the first feature vector corresponding to the candidate word requires inputting the candidate word identifier sequence in order to determine the statistical features corresponding to each candidate word. For example, the length feature corresponding to candidate word 1 is 'a', the length feature corresponding to candidate word 2 is 'b', and candidate word 3 occurs in multiple places (e.g., the frequency of the word 'closed' in text A is greater than 1), but the length feature corresponding to candidate word 3 is unique, namely 'c', etc. The method for determining the first feature vector of candidate words is explained in detail below with reference to Figure 7.
[0117] Referring to Figure 7, this embodiment is explained using length information and density information as an example of the statistical information. Specifically, the length information corresponding to each candidate word in the target text and the candidate word identifier sequence are input into the embedding layer 310-1 to extract features and obtain the length feature vector corresponding to each candidate word. For example, taking the candidate words "detour, Dongshan Avenue, Xianmu Bridge, Dongshan Avenue" in text A as an example, the length information of the candidate words is "2, 4, 3, 4", and the corresponding candidate word identifier sequence is "candidate word 1, candidate word 2, candidate word 3, candidate word 2". That is, the above input contains 3 types of candidate words, and the output of the embedding layer 310-1 is accordingly the length feature vectors corresponding to the three types of candidate words.
[0118] Similarly, the density information and candidate word identifier sequences corresponding to the candidate words in the target text are input into the embedding layer 310-2 to extract features and obtain the density feature vector corresponding to each candidate word. For example, taking the following candidate words "detour, Dongshan Avenue, Xianmu Bridge, Dongshan Avenue" in text A as an example, the density information of each candidate word and its corresponding candidate word identifier are as follows: "2 / 23, 4 / 23, 3 / 23, 2 / 23", and the candidate word identifier sequence is as follows: "candidate word 1, candidate word 2, candidate word 3, candidate word 2". That is, the above input contains 3 kinds of candidate words, and the output of the embedding layer 310-2 is also the density feature vector corresponding to the three kinds of candidate words.
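The role of the candidate word identifier sequence in the two paragraphs above is to collapse per-occurrence statistics into one entry per type of candidate word before embedding. A minimal sketch of that grouping follows. The function name `stats_by_identifier` is hypothetical, and the consistency check is an added assumption, not something the text requires.

```python
def stats_by_identifier(id_sequence, values):
    """Collapse per-occurrence statistics to one value per identifier.

    Occurrences of the same candidate word must carry the same statistic
    (for example the same length); a mismatch is reported as an error.
    """
    table = {}
    for cid, value in zip(id_sequence, values):
        if table.setdefault(cid, value) != value:
            raise ValueError(f"identifier {cid} has conflicting values")
    return table


ids = [1, 2, 3, 2]        # candidate word 1, 2, 3, 2
lengths = [2, 4, 3, 4]    # illustrative lengths, one per occurrence
length_table = stats_by_identifier(ids, lengths)
```

Four occurrences reduce to three entries, one per type of candidate word, which is the number of feature vectors the embedding layer would output.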
[0119] Therefore, in a further embodiment, the first feature vector and the second feature vector corresponding to the current candidate word can be determined based on the candidate word identifier. By setting the above-mentioned candidate word identifier sequence, the orderliness of the model processing results can be guaranteed, which is conducive to ensuring the accuracy of the determined keywords.
[0120] In the embodiment provided in Figure 7, the statistical information of candidate words and the candidate word identifier sequence are simultaneously input into the first feature extraction layer of the model, that is, the feature extraction process is completed in one go, which can greatly shorten the time of statistical feature calculation and improve the efficiency of keyword extraction.
[0121] Continuing to refer to Figure 1, in step S150, based on the first feature vector and the second feature vector corresponding to the candidate word, it is determined whether the candidate word belongs to the keywords of the target text.
[0122] For example, the first and second feature vectors belonging to the same type of candidate word can be concatenated. Since the concatenated features contain the semantic features of the current candidate word and its context, as well as the statistical information features of the current candidate word, predicting whether the candidate word belongs to the keyword of the target text based on the concatenated features can ensure the accuracy of the prediction results.
[0123] In other embodiments, to further improve prediction accuracy, the full-text context information of the target text can be added to the concatenated features of the current candidate word. Specifically, the text feature vector corresponding to the target text is obtained, and then the first feature vector, the second feature vector, and the aforementioned text feature vector belonging to the same type of candidate word are concatenated. Further, based on the concatenated features, it is predicted whether the candidate word belongs to the keywords of the target text.
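The concatenation described in the two paragraphs above can be sketched as follows. The order in which the vectors are joined and the tiny 2-dimensional vectors are assumptions for readability; `concat_features` is a hypothetical helper.

```python
def concat_features(first_vectors, second_vector, text_vector=None):
    """Concatenate the statistical vectors, the semantic/contextual vector,
    and optionally the full-text vector into one feature vector."""
    third = []
    for vec in first_vectors:        # e.g. length vector, density vector
        third.extend(vec)
    third.extend(second_vector)      # semantic and contextual features
    if text_vector is not None:      # optional full-text context
        third.extend(text_vector)
    return third


length_vec = [0.1, 0.2]
density_vec = [0.3, 0.4]
semantic_vec = [0.5, 0.6]
doc_vec = [0.7, 0.8]
third = concat_features([length_vec, density_vec], semantic_vec, doc_vec)
```

Omitting `text_vector` gives the basic variant that uses only the first and second feature vectors.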
[0124] The keyword determination process has been described in general above with reference to Figures 1 to 7. The following describes a specific embodiment of the keyword determination method with reference to the keyword determination model shown in Figure 8.
[0125] Referring to Figure 8, the input information for the second feature extraction layer 320 of the keyword determination model includes: the target text, the candidate word identifier sequence, and the position identifier sequence of the candidate words. The second feature vector corresponding to each candidate word can then be determined based on, for example, the implementation shown in Figure 6A and Figure 6B, which will not be described in detail here. The input information of the first feature extraction layer 310 of the keyword determination model includes: the candidate word identifier sequence and the statistical information of the candidate words. The first feature vector corresponding to each candidate word can then be determined based on, for example, the implementation shown in Figure 7, which will not be described in detail here.
[0126] As can be seen, in this embodiment, the second feature extraction layer 320 is specifically used for: receiving target text, a sequence of position identifiers for candidate words, and a sequence of candidate word identifiers; extracting features from the target text; and determining the second feature vector corresponding to the candidate words from the extracted feature vectors based on the position identifier sequence and the candidate word identifier sequence. The candidate word identifier sequence and the candidate word position sequence are determined according to the order in which the candidate words appear in the target text. In the candidate word identifier sequence, for the first candidate word with a frequency greater than 1, its multiple corresponding candidate word identifiers are associated. In the position identifier sequence, each candidate word corresponds to one position identifier. In this embodiment, the target text, its candidate words (position identifier sequence), and the candidate word identifier sequence are simultaneously input into the second feature extraction layer of the model, meaning the feature extraction process is completed in one step. This greatly shortens the time for text feature calculation and improves keyword extraction efficiency.
[0127] In this embodiment, the first feature extraction layer 310 is specifically used to: receive the statistical information of the candidate words in the target text and the candidate word identifier sequence, so as to determine the first feature vector corresponding to each candidate word. In this embodiment, the statistical information of the candidate words and the candidate word identifier sequence are simultaneously input into the first feature extraction layer of the model, that is, the feature extraction process is completed in one go, which can greatly shorten the time of statistical feature calculation and improve the efficiency of keyword extraction.
[0128] Next, this embodiment of the application will use the processing of "candidate word 3" (closed) as an example for explanation. The first feature vector (including length feature vector 82 and density feature vector 81), the second feature vector 64, and the text feature vector 80 corresponding to "candidate word 3" are concatenated to obtain the third feature vector corresponding to candidate word 3. By analogy, the third feature vector corresponding to each candidate word can be determined.
[0129] For example, the keyword determination model may also include a custom layer. This custom layer is used to perform vector summation on multiple max-pooled feature vectors corresponding to the associated candidate word identifiers in the candidate word identifier sequence.
[0130] For example, a sliding window method is used to determine the third feature vectors that are input together. The third feature vectors covered by the sliding window are then input into classification layer 330 to predict the corresponding candidate words. For example, the size of the sliding window can be set to 100, meaning that the third feature vectors of 100 candidate words are input into classification layer 330 each time, which then outputs the classification results for those 100 candidate words. If there are X candidate words in the target text, i.e., the candidate word identifier sequence is "candidate word 1" - "candidate word X", then the classification layer needs to perform ⌈X / 100⌉ prediction processes (X / 100 rounded up). It is evident that using the sliding window method described above helps to accelerate the prediction process of the classification layer.
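The windowing can be sketched as a simple generator. It assumes non-overlapping windows, so the number of classification passes is X / 100 rounded up; the text does not say whether windows overlap. The name `windows` and the integer stand-ins for feature vectors are illustrative.

```python
import math


def windows(vectors, size=100):
    """Yield consecutive, non-overlapping windows of at most `size` items."""
    for start in range(0, len(vectors), size):
        yield vectors[start:start + size]


# 250 placeholder items standing in for third feature vectors.
third_vectors = list(range(250))
batches = list(windows(third_vectors))
num_passes = len(batches)
```

For 250 candidate words this gives three passes, the last one holding the remaining 50 vectors.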
[0131] The following describes a training example of the keyword identification model. Since the keyword identification model is used to determine whether candidate words in the current text are keywords, this model can also be called a keyword extraction model. An exemplary implementation can be performed using the following steps.
[0132] In step S21, multiple sets of training samples are determined, wherein each set of training samples includes multiple candidate words belonging to the same sample text, and the label of each candidate word indicates a keyword or a non-keyword.
[0133] The sample text is processed as described in steps S110 and S120 to obtain candidate words and their statistical information, and the labels corresponding to the candidate words are further determined. For example, if text A is used as a sample text, then this set of training samples is as shown in Figure 9.
[0134] In step S22, the parameters of the keyword determination model are optimized using the multiple sets of training samples. Before optimization, the second feature extraction layer is a pre-trained feature extraction model.
[0135] Specifically, the objective is to minimize the cross-entropy loss function. Since the keyword determination model described above performs a binary classification task, the loss $L_i$ is calculated for each candidate word as shown in formula (1).
[0136] $$L_i = -\sum y_i \log p_i \qquad (1)$$
[0137] Where $p_i$ represents the prediction result of whether the i-th candidate word is a keyword, and $y_i$ represents the actual result of whether the i-th candidate word is a keyword, that is, the label of the i-th candidate word.
[0138] During training, the losses of the candidate words (say, Z in total) in all training samples are further summed, and the overall loss function $\mathrm{Loss}_{all}$ is as shown in formula (2).
[0139] $$\mathrm{Loss}_{all} = \sum_{i=1}^{Z} L_i \qquad (2)$$
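As a numerical illustration of the per-candidate cross-entropy loss and its sum over the Z candidate words, the sketch below assumes that the sum in the per-candidate loss runs over the two classes (keyword, non-keyword) with a one-hot label, and that the model outputs the probability of the keyword class. The function names and sample values are hypothetical.

```python
import math


def candidate_loss(label, p_keyword):
    """Cross-entropy for one candidate word.

    label: 1 for keyword, 0 for non-keyword.
    p_keyword: predicted probability that the word is a keyword.
    Assumption: the sum runs over the two classes with a one-hot label.
    """
    y = (label, 1 - label)
    p = (p_keyword, 1.0 - p_keyword)
    return -sum(y_c * math.log(p_c) for y_c, p_c in zip(y, p) if y_c)


def total_loss(labels, probabilities):
    """Sum of the per-candidate losses over all Z candidate words."""
    return sum(candidate_loss(y, p) for y, p in zip(labels, probabilities))


loss = total_loss([1, 0], [0.5, 0.5])
```

With both predictions at 0.5, each candidate contributes log 2, so the total is 2 log 2.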
[0140] As mentioned earlier, the second feature extraction layer in the keyword determination model can employ a pre-trained model. It is understandable that the training process for the keyword determination model is equivalent to the fine-tuning process for the aforementioned pre-trained model.
[0141] The keyword determination scheme provided in this application has two advantages. First, it can combine the candidate words' own statistical feature vectors, semantic features, and contextual features for prediction, which helps improve the accuracy of keyword determination. Second, the keyword determination model is learned using labeled data, allowing it to better predict outputs based on prior experience. Therefore, it is more suitable for downstream road-related keywords and business scenarios.
[0142] In addition, both textual features and statistical features are extracted in one go based on different pieces of information (see the implementations corresponding to Figure 5 to Figure 7 for details), which can effectively save computation time and improve keyword extraction efficiency. At the same time, the sliding window method of the classification layer reduces the number of inputs to the classification layer, which also greatly compresses the model training and inference time.
[0143] The method embodiments and model embodiments of this application have been described above with reference to Figures 1 to 9. The following describes an embodiment of the keyword determination device of this application with reference to Figure 10.
[0144] Figure 10 is a schematic block diagram of a keyword determination device 1000 provided in an embodiment of this application.
[0145] Referring to Figure 10, the keyword determination device 1000 provided in this application embodiment includes: an acquisition module 1010, a first determination module 1020, a first feature extraction module 1030, a second feature extraction module 1040, and a second determination module 1050.
[0146] The acquisition module 1010 is used to acquire candidate words contained in the target text; the first determination module 1020 is used to determine the statistical information corresponding to the candidate words; the first feature extraction module 1030 is used to extract features from the target text to obtain a second feature vector corresponding to the candidate words, wherein the second feature vector includes the semantic features of the candidate words and the contextual features of the candidate words in the target text; the second feature extraction module 1040 is used to extract features from the statistical information to obtain a first feature vector corresponding to the candidate words; and the second determination module 1050 is used to determine whether the candidate words belong to the keywords of the target text based on the first feature vector and the second feature vector corresponding to the candidate words.
[0147] In an exemplary embodiment, based on the above scheme, the first determining module 1020 is specifically used to: determine the i-th length information of the i-th candidate word; determine the statistical information of the i-th candidate word based on the i-th length information; and / or determine the i-th frequency of the i-th candidate word in the target text; determine the i-th density information based on the i-th frequency and the total number of candidate words in the target text; and determine the statistical information of the i-th candidate word based on the i-th density information; wherein i is a positive integer.
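The statistical information described above can be illustrated with a minimal sketch. It assumes that length is measured in characters, that density is the frequency divided by the total number of candidate word occurrences, and that the function name `statistical_info` is hypothetical; the embodiment does not fix any of these details.

```python
from collections import Counter


def statistical_info(candidates):
    """Compute length, frequency, and density for each candidate word.

    `candidates` lists every candidate word occurrence in order of appearance.
    Assumption: density = frequency / total number of occurrences.
    """
    total = len(candidates)
    frequency = Counter(candidates)
    return {
        word: {
            "length": len(word),      # assumed: length in characters
            "frequency": count,
            "density": count / total,
        }
        for word, count in frequency.items()
    }


# Example input (illustrative only).
info = statistical_info(["neural network", "keyword", "neural network", "model"])
```

Either the length information, the density information, or both could then be passed on as the statistical information of a candidate word.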
[0148] In an exemplary embodiment, based on the above scheme, the apparatus further includes: a third determining module;
[0149] The third determining module is used to: determine the position identifier sequence according to the order in which candidate words appear in the target text, wherein each candidate word corresponds to a position identifier;
[0150] The aforementioned first feature extraction module 1030 includes: a word segmentation unit, a tagging unit, an extraction unit, a first determination unit, and a second determination unit;
[0151] The word segmentation unit is used to segment the target text to obtain M words, where M is an integer greater than 1; the tagging unit is used to tag the M words based on the position identifier sequence; the extraction unit is used to extract features from the M words to obtain M intermediate feature vectors corresponding to the M words; the first determining unit is used to determine the intermediate feature vector corresponding to the tagged word among the M intermediate feature vectors; and the second determining unit is used to determine the second feature vector corresponding to the candidate word based on the intermediate feature vector corresponding to the tagged word.
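The tagging and selection steps performed by these units can be sketched as follows. The sketch assumes that each candidate word occurrence is given as a hypothetical `(start, end)` span over the M word segments, that position identifiers are assigned as 1, 2, 3, ... in order of appearance, and that 0 marks a word segment that belongs to no candidate word. The intermediate feature vectors are toy values standing in for the output of a feature extractor.

```python
def tag_tokens(num_tokens, spans):
    """Tag each of the M word segments with the position identifier of its candidate word."""
    tags = [0] * num_tokens  # 0 = not part of any candidate word (assumption)
    for pos_id, (start, end) in enumerate(spans, start=1):
        for t in range(start, end):
            tags[t] = pos_id
    return tags


def gather_tagged(vectors, tags):
    """Collect the intermediate feature vectors of tagged word segments per position identifier."""
    grouped = {}
    for vec, tag in zip(vectors, tags):
        if tag:
            grouped.setdefault(tag, []).append(vec)
    return grouped


# Five word segments; candidates cover segments 0-1 and 3-4.
tokens = ["deep", "learning", "improves", "keyword", "extraction"]
spans = [(0, 2), (3, 5)]
tags = tag_tokens(len(tokens), spans)

# Toy intermediate feature vectors, one per word segment.
vectors = [[float(i), float(i) + 0.5] for i in range(len(tokens))]
grouped = gather_tagged(vectors, tags)
```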
[0152] In an exemplary embodiment, based on the above scheme, the apparatus further includes: a fourth determining module;
[0153] The fourth determining module is used to: determine the candidate word identifier sequence according to the order of the positions of the candidate words in the target text, wherein, for the first candidate word with a frequency greater than 1, multiple identifiers corresponding to the same first candidate word are associated in the candidate word identifier sequence;
[0154] The second determining unit is specifically used to perform vector summation on multiple intermediate feature vectors corresponding to the associated identifiers based on the candidate word identifier sequence to obtain the second feature vector corresponding to the first candidate word.
[0155] In an exemplary embodiment, based on the above scheme, the second determining unit is specifically used to: perform max pooling on the multiple intermediate feature vectors corresponding to the associated identifiers respectively; and to perform vector summation on the multiple intermediate feature vectors after max pooling on the multiple associated identifiers to obtain the second feature vector corresponding to the first candidate word.
[0156] In an exemplary embodiment, based on the above scheme, the second determining unit is further specifically used to: perform max pooling on the intermediate feature vector corresponding to the second candidate word whose frequency of occurrence is equal to 1, to obtain the second feature vector corresponding to the second candidate word.
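The pooling and summation described in the preceding paragraphs can be sketched as below. The sketch assumes that "associated" identifiers are modeled by giving every occurrence of the same candidate word the same candidate word identifier, and that max pooling is taken element-wise over the word segment vectors of one occurrence. The names `max_pool` and `second_feature_vectors` are hypothetical.

```python
def max_pool(vectors):
    """Element-wise maximum over the word segment vectors of one occurrence."""
    return [max(column) for column in zip(*vectors)]


def second_feature_vectors(grouped, candidate_ids):
    """Max-pool each occurrence, then sum the pooled vectors of associated occurrences.

    grouped:       position identifier -> list of intermediate feature vectors
    candidate_ids: position identifier -> candidate word identifier
                   (occurrences of the same word share one identifier; assumption)
    """
    result = {}
    for pos_id, vecs in grouped.items():
        pooled = max_pool(vecs)
        cid = candidate_ids[pos_id]
        if cid in result:
            result[cid] = [a + b for a, b in zip(result[cid], pooled)]
        else:
            result[cid] = pooled
    return result


# Candidate "A" occurs twice (positions 1 and 3); candidate "B" occurs once.
grouped = {1: [[1.0, 4.0], [3.0, 2.0]], 2: [[0.5, 0.5]], 3: [[2.0, 1.0]]}
candidate_ids = {1: "A", 2: "B", 3: "A"}
second = second_feature_vectors(grouped, candidate_ids)
```

A candidate word that occurs only once is simply max-pooled, which matches the handling of the second candidate word above.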
[0157] In an exemplary embodiment, based on the above scheme, the acquisition module 1010 is further configured to: acquire the text feature vector corresponding to the target text;
[0158] The second determining module 1050 is specifically used to: concatenate the first feature vector and the second feature vector corresponding to the target candidate word, as well as the text feature vector, to obtain the third feature vector corresponding to the target candidate word; and determine whether the target candidate word belongs to the keyword of the target text based on the third feature vector corresponding to the target candidate word.
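The concatenation step can be illustrated with a short sketch. The order of concatenation and the single linear scoring layer with a sigmoid are assumptions made for illustration only; the weights are placeholders, not trained parameters, and the function names are hypothetical.

```python
import math


def third_feature_vector(first_vec, second_vec, text_vec):
    """Concatenate statistical, semantic/contextual, and text-level features."""
    return list(first_vec) + list(second_vec) + list(text_vec)


def classify(third_vec, weights, bias=0.0, threshold=0.5):
    """Toy stand-in for a classification layer: linear score followed by a sigmoid."""
    if len(weights) != len(third_vec):
        raise ValueError("weights and feature vector must have the same length")
    score = sum(w * x for w, x in zip(weights, third_vec)) + bias
    probability = 1.0 / (1.0 + math.exp(-score))
    return probability >= threshold


v = third_feature_vector([0.5], [1.0, 2.0], [3.0])
is_keyword = classify(v, [1.0, 1.0, 1.0, 1.0])
```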
[0159] In an exemplary embodiment, based on the above scheme, the second feature extraction module 1040 is specifically used to: input the statistical information corresponding to the candidate words in the target text into the first feature extraction layer of the keyword determination model, so as to perform feature extraction through the first feature extraction layer and obtain the first feature vector corresponding to any candidate word in the target text;
[0160] The first feature extraction module 1030 is specifically used to: input the position identifier sequence of the target text and the candidate words into the second feature extraction layer of the keyword determination model, so as to extract features from the target text through the second feature extraction layer and determine the second feature vector corresponding to the candidate words from the extracted feature vector based on the position identifier sequence;
[0161] The second determining module 1050 is specifically used to: concatenate the first feature vector and the second feature vector corresponding to the target candidate word, as well as the text feature vector, to obtain the third feature vector corresponding to the target candidate word; and process the third feature vector corresponding to the target candidate word through the classification layer of the keyword determining model to obtain the classification result of whether the target candidate word belongs to the keyword.
[0162] In an exemplary embodiment, based on the above scheme, the second determining module 1050 is specifically used to: obtain a corresponding number of third feature vectors from the third feature vectors corresponding to multiple candidate words in the target text through a sliding window; and input the obtained third feature vectors into the classification layer so that the classification layer processes the received third feature vectors to obtain a classification result of whether the relevant candidate words belong to the keywords.
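One way to picture the sliding window is the sketch below, which groups the third feature vectors of a text into windows that would each be passed to the classification layer in one call. The window size, the stride, and the handling of a shorter final window are assumptions; the embodiment does not specify them here.

```python
def sliding_windows(vectors, window_size, stride=None):
    """Split a sequence of third feature vectors into windows.

    With the default stride (equal to window_size) the windows do not overlap;
    the final window may be shorter than window_size.
    """
    if stride is None:
        stride = window_size
    if window_size < 1 or stride < 1:
        raise ValueError("window_size and stride must be positive integers")
    return [vectors[s:s + window_size] for s in range(0, len(vectors), stride)]


# Five candidate words, each with a toy two-dimensional third feature vector.
third_vectors = [[0.1 * i, 0.2 * i] for i in range(5)]
windows = sliding_windows(third_vectors, window_size=2)
```

With five candidate words and a window size of 2, the classification layer would receive three inputs instead of five.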
[0163] In an exemplary embodiment, based on the above solution, the apparatus further includes: a model training module;
[0164] The model training module is used to: determine multiple groups of training samples, wherein each group of training samples includes multiple candidate words belonging to the same sample text, and the label of each candidate word indicates a keyword or a non-keyword; and optimize the parameters of the keyword determination model using the multiple groups of training samples, wherein, before optimization, the second feature extraction layer is a pre-trained feature extraction model.
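For illustration, one group of training samples could be represented as in the sketch below. The class `TrainingGroup`, the helper `build_group`, and the use of a set of annotated keywords as the label source are all hypothetical; labels use 1 for keyword and 0 for non-keyword by assumption.

```python
from dataclasses import dataclass, field


@dataclass
class TrainingGroup:
    """One group of training samples: candidate words drawn from one sample text."""
    sample_text: str
    candidates: list = field(default_factory=list)
    labels: list = field(default_factory=list)  # 1 = keyword, 0 = non-keyword


def build_group(sample_text, candidates, gold_keywords):
    """Label each candidate word by membership in the annotated keyword set."""
    labels = [1 if c in gold_keywords else 0 for c in candidates]
    return TrainingGroup(sample_text, list(candidates), labels)


group = build_group(
    "Keyword extraction improves retrieval.",
    ["keyword extraction", "retrieval", "improves"],
    {"keyword extraction", "retrieval"},
)
```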
[0165] It should be understood that the embodiment of the keyword determination device shown in Figure 10 corresponds to the embodiment of the keyword determination method described above, and similar descriptions can be found in the method embodiment. To avoid repetition, further details are omitted here. Specifically, the above-described embodiments of the keyword determination method can be executed through the information interaction between the modules of the keyword determination device shown in Figure 10. For the sake of brevity, the method embodiments corresponding to the aforementioned and other operations and / or functions of each module in the device will not be described again here.
[0166] The above description, in conjunction with the accompanying drawings, describes the keyword determination device according to the embodiments of this application from the perspective of functional modules. It should be understood that these functional modules can be implemented in hardware, in software instructions, or in a combination of hardware and software modules. Specifically, the steps of the method embodiments in this application can be completed by integrated logic circuits in the processor's hardware and / or by software instructions. The steps of the method disclosed in the embodiments of this application can be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. Optionally, the software module can reside in a storage medium that is mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory, and the processor reads information from the memory and, in conjunction with its hardware, completes the steps in the above method embodiments.
[0167] This application also provides an electronic device.
[0168] Figure 11 is a schematic block diagram of an electronic device 1100 provided in an embodiment of this application. As described above, the keyword determination device can be deployed in, for example, the electronic device shown in Figure 11, which can therefore be used to perform the keyword determination method described above.
[0169] As shown in Figure 11, the electronic device 1100 may include:
[0170] A memory 1110 and a processor 1120. The memory 1110 is used to store a computer program 1130 and to transfer the computer program 1130 to the processor 1120. In other words, the processor 1120 can call and run the computer program 1130 from the memory 1110 to implement the methods in the embodiments of this application.
[0171] For example, the processor 1120 can be used to execute the steps in the keyword determination method described above according to the instructions in the computer program 1130.
[0172] In some embodiments of this application, the processor 1120 may include, but is not limited to:
[0173] General-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
[0174] In some embodiments of this application, the memory 1110 includes, but is not limited to:
[0175] Volatile memory and / or non-volatile memory. Non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. Volatile memory can be random access memory (RAM), which is used as an external cache. By way of example, but not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced Synchronous DRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).
[0176] In some embodiments of this application, the computer program 1130 may be divided into one or more modules, which are stored in the memory 1110 and executed by the processor 1120 to complete the steps in the keyword determination method provided in this application. The one or more modules may be a series of computer program instruction segments capable of performing specific functions, which describe the execution process of the computer program 1130 in the electronic device.
[0177] As shown in Figure 11, the electronic device 1100 may further include:
[0178] Transceiver 1140, which can be connected to processor 1120 or memory 1110.
[0179] The processor 1120 can control the transceiver 1140 to communicate with other devices; specifically, it can send information or data to other devices or receive information or data sent by other devices. The transceiver 1140 may include a transmitter and a receiver. The transceiver 1140 may further include antennas, and the number of antennas may be one or more.
[0180] It should be understood that the various components in the electronic device 1100 are connected through a bus system, which includes a data bus, a power bus, a control bus, and a status signal bus.
[0181] According to one aspect of this application, a computer storage medium is provided that stores a computer program thereon, which, when executed by a computer, enables the computer to perform the methods of the above-described method embodiments. Alternatively, embodiments of this application also provide a computer program product containing instructions that, when executed by a computer, cause the computer to perform the methods of the above-described method embodiments.
[0182] According to another aspect of this application, a computer program product or computer program is provided, comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the computer device to perform the method described in the above-described method embodiments.
[0183] In other words, when implemented using software, the embodiments can be implemented wholly or partially in the form of a computer program product. This computer program product includes one or more computer instructions. When these computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of this application are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave, etc.) means. The computer-readable storage medium can be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., Digital Video Disc (DVD)), or a semiconductor medium (e.g., Solid State Disk (SSD)).
[0184] Those skilled in the art will recognize that the modules and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0185] In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods can be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division of modules is only a logical functional division, and in actual implementation there may be other division methods. For instance, multiple modules or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or modules, and may be electrical, mechanical, or in other forms.
[0186] The modules described as separate components may or may not be physically separate. The components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. For example, the functional modules in the various embodiments of this application may be integrated into one processing module, or each module may exist physically separately, or two or more modules may be integrated into one module.
[0187] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A keyword determination method, characterized in that the method includes: acquiring candidate words contained in a target text; determining statistical information corresponding to the candidate words; performing feature extraction on the statistical information to obtain a first feature vector corresponding to the candidate words; performing feature extraction on the target text to obtain a second feature vector corresponding to the candidate words, wherein the second feature vector contains the semantic features of the candidate words and the contextual features of the candidate words in the target text; and determining, based on the first feature vector and the second feature vector corresponding to the candidate words, whether the candidate words belong to the keywords of the target text.
2. The method according to claim 1, characterized in that determining the statistical information corresponding to the candidate words includes: determining the i-th length information of the i-th candidate word, and determining the statistical information of the i-th candidate word based on the i-th length information; and / or determining the i-th frequency of the i-th candidate word in the target text, determining the i-th density information based on the i-th frequency and the total number of candidate words in the target text, and determining the statistical information of the i-th candidate word based on the i-th density information; wherein i is a positive integer.
3. The method according to claim 1, characterized in that the method further includes: determining a position identifier sequence according to the order in which the candidate words appear in the target text, wherein each candidate word corresponds to one position identifier; wherein performing feature extraction on the target text to obtain the second feature vector corresponding to the candidate words includes: segmenting the target text to obtain M word segments, where M is an integer greater than 1; tagging the M word segments based on the position identifier sequence; performing feature extraction on the M word segments to obtain M intermediate feature vectors corresponding to the M word segments; determining, among the M intermediate feature vectors, the intermediate feature vectors corresponding to the tagged word segments; and determining the second feature vector corresponding to the candidate words based on the intermediate feature vectors corresponding to the tagged word segments.
4. The method according to claim 3, characterized in that, The method further includes: Based on the order in which candidate words appear in the target text, a candidate word identifier sequence is determined, wherein for the first candidate word with a frequency greater than 1, multiple identifiers corresponding to the same first candidate word are associated in the candidate word identifier sequence; Based on the intermediate feature vector corresponding to the labeled word segmentation, the second feature vector corresponding to the candidate word is determined, including: Based on the candidate word identifier sequence, multiple intermediate feature vectors corresponding to the associated identifiers are vector-summed to obtain the second feature vector corresponding to the first candidate word.
5. The method according to claim 4, characterized in that performing vector summation on the multiple intermediate feature vectors corresponding to the associated identifiers based on the candidate word identifier sequence to obtain the second feature vector corresponding to the first candidate word includes: performing max pooling on each of the multiple intermediate feature vectors corresponding to the associated identifiers; and performing vector summation on the max-pooled intermediate feature vectors corresponding to the multiple associated identifiers to obtain the second feature vector corresponding to the first candidate word.
6. The method according to claim 3, characterized in that determining the second feature vector corresponding to the candidate words based on the intermediate feature vectors corresponding to the tagged word segments includes: performing max pooling on the intermediate feature vector corresponding to a second candidate word whose frequency of occurrence is equal to 1, to obtain the second feature vector corresponding to the second candidate word.
7. The method according to any one of claims 1 to 6, characterized in that, The method further includes: Obtain the text feature vector corresponding to the target text; The step of determining whether a candidate word belongs to the keywords of the target text based on the first feature vector and the second feature vector corresponding to the candidate word includes: The third feature vector corresponding to the target candidate word is obtained by concatenating the first feature vector and the second feature vector corresponding to the target candidate word and the text feature vector. Based on the third feature vector corresponding to the target candidate word, determine whether the target candidate word belongs to the keywords of the target text.
8. The method according to claim 7, characterized in that, The step of extracting features from the statistical information to obtain the first feature vector corresponding to the candidate word includes: The statistical information corresponding to the candidate words in the target text is input into the first feature extraction layer of the keyword determination model, so as to perform feature extraction through the first feature extraction layer and obtain the first feature vector corresponding to any candidate word in the target text. The step of extracting features from the target text to obtain the second feature vector corresponding to the candidate word includes: The target text and the position identifier sequence of the candidate words are input into the second feature extraction layer of the keyword determination model, so as to extract features from the target text through the second feature extraction layer and determine the second feature vector corresponding to the candidate words from the extracted feature vector based on the position identifier sequence; The step of determining whether a candidate word belongs to the keywords of the target text based on the first feature vector and the second feature vector corresponding to the candidate word includes: The first feature vector and the second feature vector corresponding to the target candidate word are concatenated with the text feature vector to obtain the third feature vector corresponding to the target candidate word. The classification layer of the keyword determination model processes the third feature vector corresponding to the target candidate word to obtain the classification result of whether the target candidate word belongs to the keyword.
9. The method according to claim 8, characterized in that, The step of processing the third feature vector corresponding to the target candidate word through the classification layer to obtain the classification result of whether the target candidate word belongs to the keyword includes: By using a sliding window, a corresponding number of third feature vectors are obtained from the third feature vectors corresponding to multiple candidate words in the target text. The obtained third feature vector is input into the classification layer, so that the classification layer processes the received third feature vector to obtain the classification result of whether the relevant candidate words belong to the keyword.
10. The method according to claim 8, characterized in that the method further includes: determining multiple groups of training samples, wherein each group of training samples includes multiple candidate words belonging to the same sample text, and the label of each candidate word indicates a keyword or a non-keyword; and optimizing the parameters of the keyword determination model using the multiple groups of training samples, wherein, before optimization, the second feature extraction layer is a pre-trained feature extraction model.
11. A keyword determination model, characterized in that, The model includes: The second feature extraction layer is used to receive the target text and extract features from the target text, and output the second feature vector corresponding to the candidate words in the target text, wherein the second feature vector contains the semantic features of the candidate words and the contextual features of the candidate words in the target text; The first feature extraction layer is used to extract features from the statistical information corresponding to the candidate words to obtain the first feature vector; The classification layer is used to process the concatenated vector of the first feature vector and the second feature vector corresponding to the candidate word, and output whether the candidate word belongs to the keyword of the target text.
12. The model according to claim 11, characterized in that, The second feature extraction layer is specifically used to receive the target text, the position identifier sequence of candidate words, and the candidate word identifier sequence, and to perform feature extraction on the target text and determine the second feature vector corresponding to the candidate word from the extracted feature vector based on the position identifier sequence and the candidate word identifier sequence; wherein, the candidate word identifier sequence and the candidate word position sequence are both determined according to the order in which the candidate words appear in the target text, and for the first candidate word in the candidate word identifier sequence, the multiple candidate word identifiers corresponding to it are associated, and each candidate word in the position identifier sequence corresponds to a position identifier; The model also includes: a max pooling layer and a custom layer; The max pooling layer is used to perform max pooling on the feature vector corresponding to any one of the location identifiers in the location identifier sequence. The custom layer is used to sum the feature vectors of multiple max-pooled feature vectors corresponding to the associated candidate word identifiers in the candidate word identifier sequence to obtain the second feature vector corresponding to the first candidate word. Before training the keyword determination model, the second feature extraction layer is a pre-trained feature extraction model.
13. A keyword determination device, characterized in that, The device includes: The acquisition module is used to acquire candidate words contained in the target text; The first determining module is used to determine the statistical information corresponding to the candidate words; The first feature extraction module is used to extract features from the target text to obtain a second feature vector corresponding to the candidate word, wherein the second feature vector includes the semantic features of the candidate word and the contextual features of the candidate word in the target text; The second feature extraction module is used to extract features from the statistical information to obtain the first feature vector corresponding to the candidate word; The second determining module is used to determine whether the candidate word belongs to the keyword of the target text based on the first feature vector and the second feature vector corresponding to the candidate word.
14. A computer-readable storage medium, characterized in that it is used to store a computer program, wherein the computer program causes a computer to perform the keyword determination method according to any one of claims 1 to 12.
15. An electronic device, characterized in that it includes a processor and a memory; the memory is used to store a computer program; and the processor is configured to execute the computer program to implement the keyword determination method according to any one of claims 1 to 12.