Text topic extraction method and device, computer device and storage medium

By constructing a topological graph model of BERT word vectors and sentence vectors, and using a graph autoencoder with attention mechanism and clustering algorithm, the problem of low accuracy in text topic extraction in existing technologies is solved, and more efficient text topic extraction is achieved.

CN116011436B · Active · Publication Date: 2026-06-23 · SF TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: SF TECH CO LTD
Filing Date: 2021-08-23
Publication Date: 2026-06-23

Application Information

Patent Timeline

23 Aug 2021: Application
23 Jun 2026: Publication (CN116011436B)

IPC: G06F40/258; G06F40/30; G06F16/334; G06N3/088; G06N3/045

AI Tagging

Application Domain

Digital data information retrieval; Semantic analysis

Technology Topics

Cluster algorithm; Algorithm

Technical Efficacy Phrases

Improve extraction accuracy; easy to dig

Abstract

The application provides a text topic extraction method and device, computer equipment and a storage medium. The method comprises the following steps: obtaining a Bert word vector and a Bert sentence vector of a to-be-processed text; taking the Bert word vector and the Bert sentence vector as nodes to construct a topological graph model; optimizing the nodes of the topological graph model by using a graph autoencoder with an attention mechanism to obtain multiple features of each node; and analyzing the multiple features by using a clustering algorithm to extract topic information of the to-be-processed text. The method can improve the accuracy of text topic extraction.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, specifically to a text topic extraction method, apparatus, computer device, and storage medium.

Background Technology

[0002] With the rapid development of e-commerce, customer complaints have become a major business focus in the industry. Accurately predicting customer complaint topics can effectively mitigate losses, thereby saving costs, increasing account activity, and improving customer satisfaction.

[0003] However, existing topic prediction technologies often involve manually screening order features and then using machine learning models, which is extremely time-consuming and results in unsatisfactory accuracy and recall.

[0004] Therefore, existing machine learning-based text topic extraction methods suffer from low accuracy.

Summary of the Invention

[0005] Therefore, it is necessary to provide a text topic extraction method, apparatus, computer equipment, and storage medium to address the aforementioned technical problems, so as to improve the accuracy of text topic extraction.

[0006] Firstly, this application provides a method for extracting text topics, including:

[0007] obtaining BERT word vectors and BERT sentence vectors of the text to be processed;

[0008] constructing a topological graph model using the BERT word vectors and BERT sentence vectors as nodes;

[0009] optimizing the nodes of the topological graph model using a graph autoencoder with an attention mechanism to obtain multiple features of each node; and

[0010] analyzing the multiple features using a clustering algorithm to extract topic information of the text to be processed.

[0011] In some embodiments of this application, multiple features are analyzed using a clustering algorithm to extract topic information of the text to be processed, including: parallel processing of multiple features of each node to obtain a feature matrix of the topological graph model, wherein the multiple features include model features, position features, weighted average features, and importance features; analyzing the feature matrix using a clustering algorithm to obtain a first probability distribution and a second probability distribution corresponding to each node, wherein the second probability distribution is the result of quadratic analysis of the first probability distribution; and extracting topic information of the text to be processed based on the first probability distribution and the second probability distribution.

[0012] In some embodiments of this application, the topic information of the text to be processed is extracted according to the first probability distribution and the second probability distribution, including: obtaining KL divergence information between the first probability distribution and the second probability distribution as the target loss; analyzing the target loss through a preset optimization algorithm to update the model parameters of the topology graph model and obtain the updated model parameters; if the KL divergence information is less than a preset divergence threshold, the topic information of the text to be processed is extracted according to the updated model parameters.

[0013] In some embodiments of this application, if the KL divergence information is less than a preset divergence threshold, the topic information of the text to be processed is extracted according to the updated model parameters, including: if the KL divergence information is less than the preset divergence threshold, the trained topology graph model is obtained according to the updated model parameters; based on the trained topology graph model, the target set to which each node belongs is obtained; the center node of each target set is extracted to obtain the topic information of the text to be processed.
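As a rough illustration of the last two operations in this embodiment (finding the target set of each node, then extracting the center node of each set), the following Python sketch may help. It rests on assumptions that the text does not state: the target set of a node is taken to be the cluster with the highest probability, and the "center node" is taken to be the member closest to the cluster mean. The names `assign_clusters` and `center_nodes` are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def assign_clusters(q: List[List[float]]) -> List[int]:
    """Target set of each node: the cluster with the highest probability (assumption)."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]


def center_nodes(z: List[Vector], labels: List[int]) -> Dict[int, int]:
    """For each cluster, return the index of the member node closest to the cluster mean."""
    members: Dict[int, List[int]] = {}
    for idx, lab in enumerate(labels):
        members.setdefault(lab, []).append(idx)

    centers: Dict[int, int] = {}
    for lab, idxs in members.items():
        dim = len(z[idxs[0]])
        mean = [sum(z[i][d] for i in idxs) / len(idxs) for d in range(dim)]
        # Assumed notion of "center node": smallest squared distance to the mean.
        centers[lab] = min(
            idxs, key=lambda i: sum((a - b) ** 2 for a, b in zip(z[i], mean))
        )
    return centers
```

In a real pipeline the word or sentence behind each returned node index would then be reported as the topic of its cluster.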

[0014] In some embodiments of this application, constructing a topological graph model using BERT word vectors and BERT sentence vectors as nodes includes: constructing an initial topological graph model using the BERT word vectors and BERT sentence vectors as nodes; determining target BERT word vectors and target BERT sentence vectors in the initial topological graph model, wherein the target BERT word vectors are two adjacent BERT word vector nodes and the target BERT sentence vectors are an adjacent BERT word vector node and BERT sentence vector node; and obtaining the edge values corresponding to the target BERT word vectors and the target BERT sentence vectors respectively, so as to optimize the initial topological graph model using the edge values to obtain the topological graph model.

[0015] In some embodiments of this application, obtaining the edge values corresponding to the target BERT word vectors and the target BERT sentence vectors respectively, so as to optimize the initial topological graph model using the edge values to obtain the topological graph model, includes: analyzing the first co-occurrence degree of the target BERT word vectors in the text to be processed to obtain the first edge value corresponding to the target BERT word vectors; analyzing the second co-occurrence degree of the BERT word vector node in the target BERT sentence vectors under the corresponding BERT sentence vector node, and analyzing the third co-occurrence degree of that BERT word vector node in the text to be processed, to obtain the second edge value corresponding to the target BERT sentence vectors; and optimizing the initial topological graph model based on the first edge value and the second edge value to obtain the topological graph model.

[0016] In some embodiments of this application, a graph autoencoder with an attention mechanism is used to optimize the nodes of a topological graph model and obtain multiple features of each node. This includes: determining the master node in the topological graph model and the neighbor nodes adjacent to the master node; optimizing the weights of each neighbor node using the graph autoencoder with an attention mechanism to obtain optimized weights, where the weights are determined based on the importance of the neighbor nodes to the master node; and performing a weighted summation calculation based on the optimized weights to optimize the nodes of the topological graph model, obtain the high-level features of each master node, and obtain the multiple features of each node.

[0017] Secondly, this application provides a text topic extraction device, comprising:

[0018] The vector acquisition module is used to acquire BERT word vectors and BERT sentence vectors of the text to be processed;

[0019] The model building module is used to construct a topology graph model by using BERT word vectors and BERT sentence vectors as nodes.

[0020] The node optimization module is used to optimize the nodes of the topological graph model by using a graph autoencoder with an attention mechanism to obtain multiple features of each node.

[0021] The topic extraction module is used to analyze multiple features through clustering algorithms to extract topic information from the text to be processed.

[0022] Thirdly, this application also provides a computer device, comprising:

[0023] One or more processors;

[0024] Memory; and one or more applications, wherein one or more applications are stored in memory and configured to be executed by a processor to implement a text topic extraction method.

[0025] Fourthly, this application also provides a computer-readable storage medium having a computer program stored thereon, the computer program being loaded by a processor to perform the steps in the text topic extraction method.

[0026] Fifthly, embodiments of this application provide a computer program product or computer program that includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the computer device to perform the method provided in the first aspect described above.

[0027] The aforementioned text topic extraction method, apparatus, computer equipment, and storage medium acquire BERT word vectors and BERT sentence vectors of the text to be processed. Using these vectors as nodes, a topological graph model can be constructed. Then, a graph autoencoder with an attention mechanism is used to optimize the nodes of the topological graph model, obtaining multiple features of each node. Finally, clustering algorithms are used to analyze these multiple features, extracting the topic information of the text to be processed. Because this application uses a topological graph model to extract text topics, it can more comprehensively capture co-occurring word information with long, discontinuous intervals, which is beneficial for in-depth text information mining. Furthermore, using a topological graph model to analyze BERT word vectors and BERT sentence vectors avoids the problem of vectors not being able to be updated purposefully due to complete separation between the vectors and the graph model, thus improving the accuracy of text topic extraction.

Attached Figure Description

[0028] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0029] Figure 1 is a schematic diagram of a scenario illustrating the text topic extraction method in an embodiment of this application;

[0030] Figure 2 is a flowchart illustrating the text topic extraction method in an embodiment of this application;

[0031] Figure 3 is a schematic diagram illustrating the specific process of the text topic extraction method in the embodiments of this application;

[0032] Figure 4 is a schematic diagram of the structure of the text topic extraction device in the embodiments of this application;

[0033] Figure 5 is a schematic diagram of the structure of the computer device in the embodiments of this application.

Detailed Implementation

[0034] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0035] In the description of this application, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of the stated features. In the description of this application, "multiple" means two or more, unless otherwise explicitly specified.

[0036] In the description of this application, the term "for example" is used to mean "used as an example, illustration, or description." Any embodiment described as "for example" in this application is not necessarily to be construed as being more preferred or advantageous than other embodiments. The following description is provided to enable any person skilled in the art to make and use the invention. Details are set forth in the following description for purposes of explanation. It should be understood that those skilled in the art will recognize that the invention can be made without using these specific details. In other instances, well-known structures and processes will not be described in detail to avoid obscuring the description of the invention with unnecessary detail. Therefore, the invention is not intended to be limited to the embodiments shown, but is consistent with the broadest scope of the principles and features disclosed in this application.

[0037] The solution provided in this application involves artificial intelligence technology, which is illustrated in the following embodiments:

[0038] Artificial intelligence (AI) is the theory, methods, technology, and application systems that use digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to achieve optimal results. In other words, AI is a comprehensive technology within computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a way similar to human intelligence. AI studies the design principles and implementation methods of various intelligent machines, enabling them to possess the functions of perception, reasoning, and decision-making.

[0039] Artificial intelligence (AI) is a comprehensive discipline encompassing a wide range of fields, including both hardware and software technologies. Fundamental AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operating / interactive systems, and mechatronics. AI software technologies primarily include computer vision, speech processing, natural language processing, and machine learning / deep learning.

[0040] Natural Language Processing (NLP) is an important field within computer science and artificial intelligence. It studies the theories and methods for enabling effective communication between humans and computers using natural language. NLP is a science that integrates linguistics, computer science, and mathematics. Therefore, research in this field involves natural language—the language people use in daily life—and thus it has a close relationship with linguistic research. NLP techniques typically include text processing, semantic understanding, machine translation, question answering, and knowledge graphs.

[0041] This application provides a text topic extraction method, apparatus, computer device, and storage medium, which will be described in detail below.

[0042] Referring to Figure 1, Figure 1 is a schematic diagram of a scenario for the text topic extraction method provided in this application, which can be applied to a text topic extraction system. The text topic extraction system includes a terminal 100 and a server 200. The terminal 100 can be a device that includes both receiving and transmitting hardware, i.e., a device capable of bidirectional communication over a bidirectional communication link. Such a device can include a cellular or other communication device with a single-line display or a multi-line display, or a cellular or other communication device without a multi-line display. Specifically, the terminal 100 can be a desktop terminal or a mobile terminal, and can also be a mobile phone, tablet computer, or laptop computer. The server 200 can be a standalone server, or a server network or server cluster, including but not limited to computers, network hosts, a single network server, multiple network server sets, or a cloud server composed of multiple servers. The cloud server consists of a large number of computers or network servers based on cloud computing. Furthermore, the terminal 100 and the server 200 establish a communication connection through a network, which can be any of a wide area network (WAN), local area network (LAN), or metropolitan area network (MAN).

[0043] Those skilled in the art will understand that the application environment shown in Figure 1 is merely one applicable scenario for the solution in this application and does not constitute a limitation on the application scenario of the solution in this application. Other application environments may include more or fewer computer devices than shown in Figure 1; for example, only one server 200 is shown in Figure 1. It is understood that the text topic extraction system may include one or more other servers, which is not specifically limited here. Additionally, as shown in Figure 1, the text topic extraction system may also include a storage device for storing data, such as text data for various scenarios.

[0044] It should be noted that the schematic diagram of the text topic extraction system shown in Figure 1 is merely an example. The text topic extraction system and scenario described in this embodiment of the invention are for the purpose of more clearly illustrating the technical solutions of this embodiment of the invention, and do not constitute a limitation on the technical solutions provided by this embodiment of the invention. As those skilled in the art will know, with the evolution of text topic extraction systems and the emergence of new business scenarios, the technical solutions provided by this embodiment of the invention are also applicable to similar technical problems.

[0045] Referring to Figure 2, this application provides a text topic extraction method. This embodiment is described mainly by taking the application of the method to the server 200 in Figure 1 as an example. The method includes steps S201 to S204, as follows:

[0046] S201, obtain the BERT word vectors and BERT sentence vectors of the text to be processed.

[0047] The text to be processed can be a sentence, a paragraph, or a discourse from which the topic needs to be extracted. For example, the text to be processed can be service evaluation text, event log text, news text, etc.; the text to be processed can be viewed through an application installed on the terminal; the application can be a browser, a dedicated information browsing application, etc.

[0048] In practice, the prerequisite for text topic extraction is that server 200 receives a text topic extraction request submitted by a user. This request can be a request carrying text content or a request carrying a text identifier. If the request carries a text identifier, server 200 can retrieve the text of the topic to be extracted from its local database based on the text identifier. If the text is not stored in the local database, server 200 can request the text from other servers with which it has a pre-established communication connection and which store the text. In this case, the text identifier can be used as an index to request the other servers to return the required text content. After server 200 obtains the text to be processed, the text can be displayed on the user's electronic device (mobile phone, computer, tablet, etc.) or not.

[0049] Furthermore, after obtaining the text to be processed, the server 200 can acquire BERT word vectors and BERT sentence vectors by loading a BERT pre-trained model and inputting the text into it. Specifically, the BERT pre-trained model is a language model built on a bidirectional Transformer, which can more accurately represent the semantics of specific scenarios with structured data, especially in logistics complaint scenarios. Compared with the traditional approach of looking up context-independent word2vec word vectors in a dictionary, this embodiment proposes using a flexible, trainable, and optimizable BERT model to train word vectors for the specific scenario. Word vectors obtained this way learn a semantic representation that fits the specific scenario better than other word vector representations.

[0050] Therefore, in some embodiments, this application proposes that the Bert word vectors and Bert sentence vectors of the text to be processed can be obtained by loading a Bert pre-trained model. The word vectors generated by the Bert pre-trained model are dynamically calculated by combining the words around the words, and are not fixed 200-dimensional vectors, but 768-dimensional trainable and updatable variables.

[0051] S202, construct a topological graph model using the BERT word vectors and BERT sentence vectors as nodes.

[0052] The topological graph model, also known as a heterogeneous graph model, is used for representation learning, the purpose of which is to find a meaningful vector representation for each node to facilitate subsequent applications such as link prediction, personalized recommendation, and node classification.

[0053] For specific implementation details, refer to Figure 3. The BERT word vectors and BERT sentence vectors obtained by server 200 include not only those of the text to be processed but also those of a large volume of other text, and the resulting vectors are not fixed sequences but variables. After obtaining the BERT word vectors and BERT sentence vectors, server 200 can use all word vectors and sentence vectors as nodes to build a topological graph model using a graph neural network. The topological graph model construction steps involved in this embodiment are described in detail below.

[0054] In one embodiment, this step includes: constructing an initial topological graph model using the BERT word vectors and BERT sentence vectors as nodes; determining the target BERT word vectors and target BERT sentence vectors in the initial topological graph model, wherein the target BERT word vectors are two adjacent BERT word vector nodes and the target BERT sentence vectors are an adjacent BERT word vector node and BERT sentence vector node; and obtaining the edge values corresponding to the target BERT word vectors and the target BERT sentence vectors respectively, so as to optimize the initial topological graph model using the edge values to obtain the topological graph model.
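The initial graph described in this step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not define when two word nodes are "adjacent", so here two words are linked when they occur in the same sentence, and a word is linked to every sentence that contains it. The node encoding and the function name `build_initial_graph` are hypothetical, and plain tokens stand in for the BERT vectors attached to each node.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Node = Tuple[str, str]  # ("word", token) or ("sent", sentence index as text)


def build_initial_graph(sentences: List[List[str]]) -> Dict[Node, Set[Node]]:
    """Return an undirected adjacency map with word nodes and sentence nodes."""
    adj: Dict[Node, Set[Node]] = {}

    def link(u: Node, v: Node) -> None:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    for idx, tokens in enumerate(sentences):
        sent_node: Node = ("sent", str(idx))
        adj.setdefault(sent_node, set())
        unique = sorted(set(tokens))
        for tok in unique:
            link(("word", tok), sent_node)      # word-sentence edge
        for a, b in combinations(unique, 2):
            link(("word", a), ("word", b))      # word-word edge (assumed adjacency rule)
    return adj
```

Edge values are deliberately left out of this sketch; it only fixes which node pairs are candidates for the two edge types.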

[0055] In specific implementation, the topological graph model used to extract the text topic in this application embodiment not only needs nodes, but also heterogeneous attributes and content associated between each node. These attributes and content can be identified by edge values, including two types of edges: edges between words and edges between words and sentences. The graph model containing edge value information between nodes can more comprehensively capture the information of co-occurring words with long distances of non-continuous intervals. That is, it is used to effectively obtain semantic co-occurring words with long distances of non-continuous intervals at the global level in the corpus, which makes up for the locality and sequentiality of other neural networks such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN).

[0056] Understandably, traditional word vectors and sentence vectors use methods like "TF-IDF" or "word2vec" to treat sentences as unordered word combinations or ordered word sequences, only recording information about the N nearest words. In contrast, this application's embodiment employs a graph model (topological graph model) structure to more comprehensively capture information about co-occurring words with long, discontinuous intervals. This facilitates deep mining of word-to-word, word-to-text, and word importance relationships within texts and even corpora, thus initially improving the accuracy of text topic extraction. This application's embodiment proposes supplementing the initial topological graph model with edge values between nodes to add weights to the relationships between nodes, thereby optimizing the initial topological graph model and obtaining the final topological graph model.

[0057] In one embodiment, the step of obtaining the edge values corresponding to the target BERT word vectors and the target BERT sentence vectors, and then using the edge values to optimize the initial topological graph model to obtain the topological graph model, includes: analyzing the first co-occurrence degree of the target BERT word vectors in the text to be processed to obtain the first edge value corresponding to the target BERT word vectors; analyzing the second co-occurrence degree of the BERT word vector node in the target BERT sentence vectors under the corresponding BERT sentence vector node, and analyzing the third co-occurrence degree of that BERT word vector node in the text to be processed, to obtain the second edge value corresponding to the target BERT sentence vectors; and optimizing the initial topological graph model based on the first edge value and the second edge value to obtain the topological graph model.

[0058] In specific implementation, the first edge value represents the edge value between words, that is, the edge value between the two adjacent BERT word vector nodes included in the target BERT word vectors, while the second edge value represents the edge value between a word and a sentence, that is, the edge value between the adjacent BERT word vector node and BERT sentence vector node included in the target BERT sentence vectors. The edge value between words is determined by the degree to which the two words appear together, and the edge value between a word and a sentence is determined by the ratio of the number of times the word appears in this text to the number of texts in which the word appears, as in the examples that follow.

[0059] For example, if there are 50,000 texts, and 10,000 of them contain both the word "A" and the word "B", then the edge value between the two adjacent BERT word vector nodes "A" and "B" is 10,000 divided by 50,000, so the first edge value corresponding to the target BERT word vectors is 0.2.

[0060] For example, for an adjacent BERT word vector node and BERT sentence vector node, if the word "C" appears 3 times in sentence "D" and there are 10 texts containing the word "C", then the edge value between this word and this sentence is 3 divided by 10, so the second edge value corresponding to the target BERT sentence vectors is 0.3.
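The two worked examples can be reproduced with a short Python sketch. It follows the arithmetic of the examples (texts containing both words divided by all texts; occurrences of the word in the sentence divided by the number of texts containing the word). The function names and the token-list representation of texts are assumptions made for illustration only.

```python
from typing import List


def word_word_edge(texts: List[List[str]], a: str, b: str) -> float:
    """Fraction of texts that contain both word `a` and word `b`."""
    if not texts:
        return 0.0
    both = sum(1 for tokens in texts if a in tokens and b in tokens)
    return both / len(texts)


def word_sentence_edge(texts: List[List[str]], sentence: List[str], word: str) -> float:
    """Occurrences of `word` in `sentence` divided by the number of texts containing `word`."""
    containing = sum(1 for tokens in texts if word in tokens)
    if containing == 0:
        return 0.0
    return sentence.count(word) / containing
```

With 10,000 of 50,000 texts containing both "A" and "B", `word_word_edge` gives 0.2; with "C" appearing 3 times in sentence "D" and 10 texts containing "C", `word_sentence_edge` gives 0.3.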

[0061] S203, optimize the nodes of the topological graph model using a graph autoencoder with an attention mechanism to obtain multiple features of each node.

[0062] Graph autoencoders are mainly used in unsupervised learning and are suitable for learning graph node representations when no label information is available.

[0063] In specific implementation, to integrate neighbor node word information and location information to obtain more abstract word and text representation features, this application proposes to employ a graph autoencoder with an attention mechanism to construct a higher-level, more abstract node representation that integrates more information. Unlike directly averaging the neighbor node information of a node, the attention mechanism learns to optimize the weights of different neighbor nodes and then calculates a weighted average to optimize the master node, avoiding the shortcomings of traditional methods that treat all nodes equally.

[0064] In one embodiment, this step includes: determining the master node in the topology graph model and the neighbor nodes adjacent to the master node; optimizing the weights of each neighbor node using a graph autoencoder with an attention mechanism to obtain optimized weights, the weights being determined based on the importance of the neighbor nodes to the master node; performing a weighted summation calculation based on the optimized weights to optimize the nodes of the topology graph model, obtain the high-level features of each master node, and obtain the multiple features of each node.

[0065] Since each word in each sentence of the text needs to have its corresponding features calculated, each word will take turns being the master node, and the words adjacent to it will be the neighbor nodes. Therefore, in this embodiment of the application, the master node can refer to the node currently being analyzed.

[0066] The weight can refer to the importance of the neighboring node to the master node.

[0067] In the specific implementation, the server 200 can call a graph autoencoder with an attention mechanism to analyze and calculate the high-level features of each node. That is, it uses a weighted summation method to fuse the information of neighboring nodes to calculate the high-level features of each master node and obtain the multiple features of each node.
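The weighted summation over neighbor nodes can be illustrated with a minimal Python sketch. It shows only the aggregation step, not a full graph autoencoder, and it makes assumptions that the text leaves open: the importance of a neighbor to the master node is scored by a dot product of their feature vectors, and the scores are normalized with a softmax. The name `attention_aggregate` is hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attention_aggregate(h: List[Vector], neighbors: Dict[int, List[int]]) -> List[Vector]:
    """For each master node i, return the attention-weighted sum of its neighbors' features."""
    out: List[Vector] = []
    for i, h_i in enumerate(h):
        nbrs = neighbors.get(i, [])
        if not nbrs:
            out.append(list(h_i))  # isolated node keeps its own features
            continue
        scores = [dot(h_i, h[j]) for j in nbrs]      # assumed importance score
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]   # softmax, shifted for stability
        total = sum(exps)
        alpha = [e / total for e in exps]            # weights sum to 1 per master node
        out.append(
            [sum(a * h[j][d] for a, j in zip(alpha, nbrs)) for d in range(len(h_i))]
        )
    return out
```

Each node takes its turn as the master node in the loop, which mirrors the description that every word is in turn treated as the node currently being analyzed. In a trained model the scoring function would have learnable parameters rather than a fixed dot product.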

[0068] S204: Analyze the multiple features using a clustering algorithm to extract the topic information of the text to be processed.

[0069] In specific implementation, this application proposes to use the KL divergence of node probability distribution and node pseudo-label probability distribution as the loss function of the model to optimize the model parameters, thereby extracting the topic information of the text to be processed. The specific analysis steps will be described in detail below.

[0070] In one embodiment, this step includes: processing multiple features of each node in parallel to obtain a feature matrix of the topology graph model, wherein the multiple features include model features, position features, weighted average features and importance features; analyzing the feature matrix through a clustering algorithm to obtain a first probability distribution and a second probability distribution corresponding to each node, wherein the second probability distribution is the result of quadratic analysis of the first probability distribution; and extracting the topic information of the text to be processed based on the first probability distribution and the second probability distribution.

[0071] The multiple features include model features, positional features, weighted average features, and importance features. The positional feature refers to the average position at which a word appears within a sentence; for example, if a word appears once at each of positions 3, 4, and 5, its positional feature is the average, 4. The weighted average feature can refer to the weighted information from neighboring nodes. For instance, each node in the graph structure can be viewed as a (k, v) pair, where the k and v of each word are randomly initialized and optimized later. To calculate the weighted average feature, the product of the k of the current master node and the k of neighboring node j is used as the weight of neighboring node j, and this weight is then multiplied by the v of neighboring node j, yielding the weighted information of that neighboring node. The importance feature can be the reciprocal of the number of times the word appears in the text.
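The positional, importance, and weighted-neighbor features can be illustrated with a short sketch. The helper names `position_and_importance` and `weighted_neighbor_info` are hypothetical, positions are assumed to be counted from 1, and the contributions of all neighbors are assumed to be summed; the text does not fix these details.

```python
from collections import defaultdict
import numpy as np

def position_and_importance(tokens):
    """Map each word to (average position, 1 / occurrence count)."""
    positions = defaultdict(list)
    for index, token in enumerate(tokens, start=1):  # assumed 1-based
        positions[token].append(index)
    return {
        token: (sum(found) / len(found), 1.0 / len(found))
        for token, found in positions.items()
    }

def weighted_neighbor_info(k, v, master, neighbors):
    """Sum over neighbors j of (k[master] . k[j]) * v[j]."""
    total = np.zeros_like(v[master], dtype=float)
    for j in neighbors:
        weight = float(k[master] @ k[j])  # product of the two k vectors
        total += weight * v[j]            # weight applied to neighbor's v
    return total
```

For a word found at positions 3, 4, and 5, the first helper returns an average position of 4 and an importance of 1/3, matching the example in [0071].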

[0072] In the specific implementation, server 200 analyzes and obtains multiple features from each node. These multiple features can be integrated to obtain the "embedding" of each node. "Embedding" is a way to convert discrete variables into continuous vector representations. In neural networks, "embedding" is very useful because it can not only reduce the spatial dimensionality of discrete variables, but also meaningfully represent the variable. The parallel concatenation of the "embeddings" of all nodes yields the feature matrix of the topological graph model. Using a clustering algorithm to calculate the probability of each node belonging to each set, the first probability distribution "Q" can be obtained. Taking the square of the first probability distribution "Q" yields the second probability distribution "P".

[0073] For example, the probability "q_iu" that each node "i" belongs to each set "u" is calculated, and "q_iu" is then squared to give "p_iu=q_iu^2", which makes the resulting distribution sharper than that of "q_iu".
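A small sketch of how "Q" and "P" might be computed is given below. The text does not say how the clustering algorithm produces "q_iu", so the distance-based kernel in `soft_assignments` is an assumption, as is the row renormalization in `sharpen`, which is added only so that "P" remains a probability distribution after squaring. Both function names are hypothetical.

```python
import numpy as np

def soft_assignments(embeddings, centers):
    """First distribution Q: probability of node i belonging to set u."""
    diff = embeddings[:, None, :] - centers[None, :, :]
    squared_distance = (diff ** 2).sum(axis=2)
    # Assumption: closer cluster centers get higher probability.
    q = 1.0 / (1.0 + squared_distance)
    return q / q.sum(axis=1, keepdims=True)

def sharpen(q):
    """Second distribution P: square Q, then renormalize each row."""
    p = q ** 2
    return p / p.sum(axis=1, keepdims=True)
```

Squaring pushes each row further toward its largest entry, so the most likely set for a node gains probability mass in "P" relative to "Q".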

[0074] In one embodiment, the step of extracting topic information of the text to be processed based on the first probability distribution and the second probability distribution includes: obtaining KL divergence information between the first probability distribution and the second probability distribution as the target loss; analyzing the target loss through a preset optimization algorithm to update the model parameters of the topology graph model and obtain the updated model parameters; if the KL divergence information is less than a preset divergence threshold, extracting the topic information of the text to be processed based on the updated model parameters.

[0075] The first probability distribution (“Q”) is the predicted probability distribution of each node belonging to different sets, and the second probability distribution (“P”) is the pseudo-label probability distribution of each node belonging to different sets.

[0076] In specific implementation, server 200 can quantify the difference between two probability distributions P and Q by calculating the KL divergence between "P" and "Q". Then, the difference in probability distributions is used as the loss. The model parameters are updated using the stochastic gradient descent optimization algorithm or the Adam optimization algorithm. For example, the new "Q" and "P" are repeatedly calculated, and the model parameters are iteratively optimized until the KL divergence is less than a certain threshold. The topic information of the text to be processed can then be analyzed based on the updated model parameters.
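The loss and the stopping rule can be sketched as follows. The direction KL(P || Q) is an assumption, since the text only speaks of the KL divergence between "P" and "Q". The `update_step` callback is hypothetical: it stands in for one stochastic gradient descent or Adam update that returns the recomputed "P" and "Q".

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) summed over all nodes and sets."""
    p = np.clip(p, eps, 1.0)  # avoid log(0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def optimize_until_converged(update_step, threshold=1e-3, max_iterations=100):
    """Repeat updates until the KL divergence falls below the threshold."""
    iteration, loss = 0, float("inf")
    for iteration in range(1, max_iterations + 1):
        p, q = update_step()          # hypothetical: one optimizer update
        loss = kl_divergence(p, q)
        if loss < threshold:          # preset divergence threshold reached
            break
    return iteration, loss
```

The `max_iterations` cap is a safeguard added for the example so the loop terminates even if the threshold is never reached.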

[0077] Furthermore, the formula for the target loss can be expressed as:

[0078]

[0079] Where M refers to the number of categories; y_ic is an indicator variable (0 or 1) that is 1 if the predicted class of sample i is the same as the true class (equal to c), and 0 otherwise; and p_ic represents the predicted probability that observed sample i belongs to category c.
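The symbols defined in [0079] are those of the standard multi-class cross-entropy loss. As an assumption, and not a confirmed reproduction of the formula in [0078], that loss can be written for N samples as follows, where N is a symbol introduced here for illustration:

```latex
L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{M} y_{ic} \log\left(p_{ic}\right)
```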

[0080] In one embodiment, the step of extracting the topic information of the text to be processed based on the updated model parameters if the KL divergence information is less than a preset divergence threshold includes: if the KL divergence information is less than the preset divergence threshold, obtaining the trained topology graph model based on the updated model parameters; obtaining the target set to which each node belongs based on the trained topology graph model; and extracting the center node of each target set to obtain the topic information of the text to be processed.

[0081] In the specific implementation, after the server 200 analyzes and determines the set to which each node ultimately belongs, it can extract the word nodes closest to the center of each set to obtain the topic word set representing the content of this set, and then output the topic of each cluster category as the topic information of the text to be processed.
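Selecting the word nodes closest to the center of each set can be sketched as below. The function name `topic_words`, the use of the mean embedding as the center, and Euclidean distance are assumptions; the text only states that the nodes closest to the center are taken.

```python
import numpy as np

def topic_words(words, embeddings, labels, top_n=3):
    """Return, per set, the top_n words nearest to that set's center."""
    labels = np.asarray(labels)
    topics = {}
    for cluster in np.unique(labels):
        members = np.flatnonzero(labels == cluster)
        # Assumption: the center is the mean of the member embeddings.
        center = embeddings[members].mean(axis=0)
        distances = np.linalg.norm(embeddings[members] - center, axis=1)
        nearest = members[np.argsort(distances)[:top_n]]
        topics[int(cluster)] = [words[i] for i in nearest]
    return topics
```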

[0082] Furthermore, server 200 can employ the Single-Pass incremental clustering algorithm to categorize each node into sets, and then use the TextRank text summarization algorithm to calculate the semantic topics of each set, thereby obtaining the topic information of the text to be processed. Specifically, server 200 can calculate the similarity between the text to be processed and the centers of existing "topic sets," and determine the relationship between the text and the current topic set based on a set similarity threshold. If the similarity is within the set threshold range, the text will be categorized into this topic set; if it exceeds the threshold, a new topic set will be created. After server 200 completes the set classification operation, it can select the N words closest to the center point of each set as representatives of that set, and then call Python toolkits to analyze the representative words of each set to calculate the text topic information.
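A minimal sketch of Single-Pass incremental clustering follows. It assumes cosine similarity, non-zero input vectors, and the reading that an item joins the most similar existing set when the similarity reaches the threshold and otherwise starts a new set. The function name and the default threshold of 0.8 are illustrative, and the TextRank step is not shown.

```python
import numpy as np

def single_pass_cluster(vectors, threshold=0.8):
    """Assign each vector to a set in one pass over the data."""
    centers, counts, labels = [], [], []
    for vector in vectors:
        unit = vector / np.linalg.norm(vector)  # assumes a non-zero vector
        best, best_similarity = -1, -1.0
        for index, center in enumerate(centers):
            similarity = float(unit @ (center / np.linalg.norm(center)))
            if similarity > best_similarity:
                best, best_similarity = index, similarity
        if best >= 0 and best_similarity >= threshold:
            # Join the closest set and update its running mean center.
            counts[best] += 1
            centers[best] = centers[best] + (unit - centers[best]) / counts[best]
            labels.append(best)
        else:
            # Not similar enough to any existing set: create a new one.
            centers.append(unit.copy())
            counts.append(1)
            labels.append(len(centers) - 1)
    return labels, centers
```

Because each item is visited once and compared only against the current centers, new complaint texts can be added without re-clustering earlier ones, which suits the periodic aggregation scenario described in [0083].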

[0083] If the aforementioned text topic extraction method is applied to a logistics customer complaint scenario, it can periodically and rapidly aggregate the semantics of complaint texts on a large scale, automatically summarizing and categorizing the different complaint topics. This allows decision-makers to grasp customer demands promptly and use them to optimize business processes. Furthermore, because machine computation costs are very low, applying the method makes it possible to respond quickly to problems arising in business processes, enabling timely loss mitigation and optimization. Therefore, it can improve both the accuracy and efficiency of text topic extraction.

[0084] The text topic extraction method in the above embodiments, by employing a BERT language model followed by a topological graph model, achieves task-oriented integrated optimization of word vectors and model parameters, unlike traditional methods which obtain word vectors that are not task-oriented and cannot be flexibly adjusted according to different scenarios. This avoids the problem of vectors not being able to be updated purposefully due to the complete separation of vectors and graph models. Furthermore, because a topological graph model is used to extract text topics, it can more comprehensively capture co-occurring word information with long distances and discontinuous intervals, which is beneficial for in-depth text information mining. Moreover, because a graph autoencoder with an attention mechanism is used to compute the high-level features of nodes, it avoids the problem of traditional methods treating all neighboring nodes equally, thus improving the accuracy of text topic extraction.

[0085] To better implement the text topic extraction method provided in the embodiments of this application, this application also provides, on the basis of that method, a text topic extraction device. As shown in Figure 4, the text topic extraction device 400 includes:

[0086] Vector acquisition module 410 is used to acquire BERT word vectors and BERT sentence vectors of the text to be processed;

[0087] Model building module 420 is used to construct a topology graph model by using BERT word vectors and BERT sentence vectors as nodes.

[0088] The node optimization module 430 is used to optimize the nodes of the topological graph model by using a graph autoencoder with an attention mechanism to obtain multiple features of each node.

[0089] The topic extraction module 440 is used to analyze multiple features through clustering algorithms to extract topic information from the text to be processed.

[0090] In some embodiments of this application, the topic extraction module 440 is also used to process the multiple features of each node in parallel to obtain the feature matrix of the topology graph model. The multiple features include model features, position features, weighted average features, and importance features. The feature matrix is analyzed by a clustering algorithm to obtain the first probability distribution and the second probability distribution corresponding to each node. The second probability distribution is the result of the quadratic analysis of the first probability distribution. The topic information of the text to be processed is extracted based on the first probability distribution and the second probability distribution.

[0091] In some embodiments of this application, the topic extraction module 440 is further configured to obtain KL divergence information between the first probability distribution and the second probability distribution as the target loss; analyze the target loss through a preset optimization algorithm to update the model parameters of the topology graph model and obtain the updated model parameters; if the KL divergence information is less than a preset divergence threshold, extract the topic information of the text to be processed according to the updated model parameters.

[0092] In some embodiments of this application, the topic extraction module 440 is further configured to, if the KL divergence information is less than a preset divergence threshold, obtain the trained topology graph model according to the updated model parameters; obtain the target set to which each node belongs based on the trained topology graph model; extract the center node of each target set to obtain the topic information of the text to be processed.

[0093] In some embodiments of this application, the model building module 420 is further configured to construct an initial topology graph model using BERT word vectors and BERT sentence vectors as nodes; determine the target BERT word vectors and target BERT sentence vectors in the initial topology graph model, wherein the target BERT word vectors are two adjacent BERT word vector nodes and the target BERT sentence vectors are an adjacent BERT word vector node and BERT sentence vector node; and obtain the edge values corresponding to the target BERT word vectors and target BERT sentence vectors respectively, so as to optimize the initial topology graph model using the edge values to obtain the topology graph model.

[0094] In some embodiments of this application, the model building module 420 is further configured to analyze the first co-occurrence degree of the target BERT word vector in the text to be processed, and obtain the first edge value corresponding to the target BERT word vector; analyze the second co-occurrence degree of the BERT word vector node in the target BERT sentence vector under the corresponding BERT sentence vector node, and analyze the third co-occurrence degree of the BERT word vector node in the target BERT sentence vector in the text to be processed, and obtain the second edge value corresponding to the target BERT sentence vector; and optimize the initial topology graph model based on the first edge value and the second edge value to obtain the topology graph model.
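The edge values can be illustrated with a simple counting sketch. The concrete formulas here are assumptions, since this passage names the co-occurrence degrees without defining them: the first edge value is taken as the number of times two different words appear next to each other, and the second edge value as a word's count within a sentence divided by its count in the whole text. The function name `build_edges` is hypothetical.

```python
from collections import Counter

def build_edges(sentences):
    """sentences: list of token lists. Returns (word_word, word_sentence)."""
    word_word = Counter()      # first edge value: adjacent word pairs
    in_sentence = Counter()    # occurrences of a word under one sentence
    in_text = Counter()        # occurrences of a word in the whole text
    for sentence_index, tokens in enumerate(sentences):
        for left, right in zip(tokens, tokens[1:]):
            if left != right:
                word_word[tuple(sorted((left, right)))] += 1
        for token in tokens:
            in_sentence[(token, sentence_index)] += 1
            in_text[token] += 1
    # Assumed second edge value: in-sentence count relative to text count.
    word_sentence = {
        (token, sentence_index): count / in_text[token]
        for (token, sentence_index), count in in_sentence.items()
    }
    return dict(word_word), word_sentence
```

Word-word edges link word nodes, while word-sentence edges link each word node to the sentence node it appears under, which is how long-distance and non-contiguous co-occurrence can reach the graph through shared sentence nodes.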

[0095] In some embodiments of this application, the node optimization module 430 is further used to determine the master node in the topology graph model and the neighbor nodes adjacent to the master node; optimize the weights of each neighbor node by using a graph autoencoder with an attention mechanism to obtain optimized weights, the weights being determined based on the importance of the neighbor nodes to the master node; perform weighted summation calculation based on the optimized weights to optimize the nodes of the topology graph model, obtain the high-level features of each master node, and obtain the multiple features of each node.

[0096] In the above embodiments, the BERT language model is followed by a topological graph model. Unlike traditional methods, whose word vectors are not task-oriented and cannot be flexibly adjusted for different scenarios and tasks, this achieves task-oriented integrated optimization of word vectors and model parameters, avoiding the problem of vectors not being able to be updated purposefully due to the complete separation of vectors and graph models. Furthermore, because a topological graph model is used to extract text topics, co-occurring word information that is far apart or non-contiguous can be captured more comprehensively, which is conducive to in-depth mining of text information. Moreover, because a graph autoencoder with an attention mechanism is used to calculate the high-level features of nodes, it avoids the problem of traditional methods treating all neighboring nodes equally, thus improving the accuracy of text topic extraction.

[0097] In some embodiments of this application, the text topic extraction device 400 can be implemented in the form of a computer program that runs on a computer device such as the one shown in Figure 5. The memory of the computer device can store the program modules that make up the text topic extraction device 400, for example the vector acquisition module 410, model building module 420, node optimization module 430, and topic extraction module 440 shown in Figure 4. The computer program composed of these program modules causes the processor to execute the steps in the text topic extraction methods of the various embodiments of this application described in this specification.

[0098] For example, the computer device shown in Figure 5 can execute step S201 via the vector acquisition module 410 of the text topic extraction device 400 shown in Figure 4. The computer device can execute step S202 via the model building module 420. The computer device can execute step S203 via the node optimization module 430. The computer device can execute step S204 via the topic extraction module 440. The computer device includes a processor, memory, and a network interface connected via a system bus. The processor of the computer device provides computational and control capabilities. The memory of the computer device includes a non-volatile storage medium and internal memory. The non-volatile storage medium stores the operating system and computer programs. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used to communicate with external computer devices via a network connection. When the computer program is executed by the processor, it implements a text topic extraction method.

[0099] Those skilled in the art will understand that the structure shown in Figure 5 is merely a block diagram of the portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0100] In some embodiments of this application, a computer device is provided, including one or more processors; a memory; and one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the processors to implement the steps of the text topic extraction method described above. The steps of the text topic extraction method here may be steps from the text topic extraction methods of the various embodiments described above.

[0101] In some embodiments of this application, a computer-readable storage medium is provided, storing a computer program that is loaded by a processor, causing the processor to execute the steps of the text topic extraction method described above. The steps of the text topic extraction method here can be the steps in the text topic extraction methods of the various embodiments described above.

[0102] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, or optical storage, etc. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can be in various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM), etc.

[0103] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0104] The foregoing has provided a detailed description of a text topic extraction method, apparatus, computer device, and storage medium provided in the embodiments of this application. Specific examples have been used to illustrate the principles and implementation methods of the present invention. The descriptions of the above embodiments are only for the purpose of helping to understand the method and core ideas of the present invention. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of the present invention. Therefore, the content of this specification should not be construed as a limitation of the present invention.

Claims

1. A method for extracting text topics, characterized in that it includes: obtaining the BERT word vectors and BERT sentence vectors of the text to be processed, wherein the BERT word vectors and the BERT sentence vectors are trainable and updatable variables; constructing a topological graph model using the BERT word vectors and the BERT sentence vectors as nodes; optimizing the topological graph model using a graph autoencoder with an attention mechanism to obtain multiple features of each node; and analyzing the multiple features using a clustering algorithm to extract the topic information of the text to be processed; wherein the step of analyzing the multiple features using a clustering algorithm to extract the topic information of the text to be processed includes: processing the multiple features of each node in parallel to obtain the feature matrix of the topology graph model, wherein the multiple features include model features, position features, weighted average features, and importance features; analyzing the feature matrix using the clustering algorithm to obtain the first probability distribution and the second probability distribution corresponding to each node, wherein the second probability distribution is the result of quadratic analysis of the first probability distribution; and extracting the topic information of the text to be processed based on the first probability distribution and the second probability distribution, wherein the first probability distribution and the second probability distribution are used to determine the target loss of the topological graph model.

2. The method as described in claim 1, characterized in that the step of extracting the topic information of the text to be processed based on the first probability distribution and the second probability distribution includes: obtaining the KL divergence information between the first probability distribution and the second probability distribution as the target loss; analyzing the target loss using a preset optimization algorithm to update the model parameters of the topology graph model and obtain the updated model parameters; and, if the KL divergence information is less than the preset divergence threshold, extracting the topic information of the text to be processed according to the updated model parameters.

3. The method as described in claim 2, characterized in that the step of extracting the topic information of the text to be processed based on the updated model parameters if the KL divergence information is less than a preset divergence threshold includes: if the KL divergence information is less than the preset divergence threshold, obtaining the trained topology graph model according to the updated model parameters; obtaining, based on the trained topology graph model, the target set to which each node belongs; and extracting the central node of each target set to obtain the topic information of the text to be processed.

4. The method as described in claim 1, characterized in that the step of constructing a topology graph model using the BERT word vectors and the BERT sentence vectors as nodes includes: constructing an initial topology graph model using the BERT word vectors and the BERT sentence vectors as nodes; determining the target BERT word vector and the target BERT sentence vector in the initial topology graph model, wherein the target BERT word vector is two adjacent BERT word vector nodes, and the target BERT sentence vector is an adjacent BERT word vector node and BERT sentence vector node; and obtaining the edge values corresponding to the target BERT word vector and the target BERT sentence vector, and using the edge values to optimize the initial topology graph model to obtain the topology graph model.

5. The method as described in claim 4, characterized in that the step of obtaining the edge values corresponding to the target BERT word vector and the target BERT sentence vector, and using the edge values to optimize the initial topology graph model to obtain the topology graph model, includes: analyzing the first co-occurrence degree of the target BERT word vector in the text to be processed, and obtaining the first edge value corresponding to the target BERT word vector; analyzing the second co-occurrence degree of the BERT word vector node in the target BERT sentence vector under the corresponding BERT sentence vector node, analyzing the third co-occurrence degree of the BERT word vector node in the target BERT sentence vector in the text to be processed, and obtaining the second edge value corresponding to the target BERT sentence vector; and optimizing the initial topology graph model based on the first edge value and the second edge value to obtain the topology graph model.

6. The method as described in claim 1, characterized in that the step of using the graph autoencoder with an attention mechanism to optimize the nodes of the topological graph model and obtain multiple features of each node includes: determining the master node in the topology graph model and the neighbor nodes adjacent to the master node; optimizing the weights of each neighbor node using a graph autoencoder with an attention mechanism to obtain optimized weights, the weights being determined based on the importance of each neighbor node to the master node; and performing a weighted summation calculation based on the optimized weights to optimize the nodes of the topology graph model, obtain the high-level features of each master node, and obtain the multiple features of each node.

7. A text topic extraction device, characterized in that it includes: a vector acquisition module, used to acquire the BERT word vectors and BERT sentence vectors of the text to be processed, wherein the BERT word vectors and the BERT sentence vectors are trainable and updatable variables; a model building module, used to construct a topology graph model by using the BERT word vectors and the BERT sentence vectors as nodes; a node optimization module, used to optimize the nodes of the topological graph model by using a graph autoencoder with an attention mechanism to obtain multiple features of each node; and a topic extraction module, used to analyze the multiple features through a clustering algorithm to extract the topic information of the text to be processed; wherein the topic extraction module is also used to: process the multiple features of each node in parallel to obtain the feature matrix of the topology graph model, wherein the multiple features include model features, position features, weighted average features, and importance features; analyze the feature matrix using the clustering algorithm to obtain the first probability distribution and the second probability distribution corresponding to each node, wherein the second probability distribution is the result of quadratic analysis of the first probability distribution; and extract the topic information of the text to be processed based on the first probability distribution and the second probability distribution, wherein the first probability distribution and the second probability distribution are used to determine the target loss of the topological graph model.

8. A computer device, characterized in that the computer device includes: one or more processors; a memory; and one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the processor to implement the text topic extraction method of any one of claims 1 to 6.

9. A computer-readable storage medium, characterized in that it stores a computer program, which is loaded by a processor to perform the steps of the text topic extraction method according to any one of claims 1 to 6.

Citation Information

Patent Citations

CN107679135A
CN110688537A
CN113127632A
