A sentence similarity generation method and device

By using an encoder model and topic-word distribution to apply vector weights to sentences, the accuracy problem of sentence similarity calculation in search engines is solved, enabling more flexible and objective sentence similarity calculation and supporting fine-grained analysis in fields such as financial analysis.

CN116108827B · Active · Publication Date: 2026-06-16 · THE PEOPLES BANK OF CHINA NAT CLEARING CENT

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
THE PEOPLES BANK OF CHINA NAT CLEARING CENT
Filing Date
2022-11-11
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

In existing technologies, the calculation of sentence similarity in search engines relies on human judgment, which lacks accuracy and leads to inaccurate calculation results.

Method used

By using a pre-defined encoder model and topic-word distribution, sentences are vector-weighted to generate sentence similarity. A document topic generation algorithm is then used for topic-word decomposition, and the topic-word distribution is dynamically updated to improve the flexibility and objectivity of sentence similarity.

Benefits of technology

It achieves both accuracy and flexibility in sentence similarity, enabling more precise calculation of information changes in text and supporting fine-grained analysis in fields such as financial analysis.



Abstract

Embodiments of the present application provide a sentence similarity generation method and device, which can be used in the field of artificial intelligence. The method comprises: obtaining a first sentence and a second sentence to be compared, the first sentence being a sentence in a first text and the second sentence being a sentence in a second text; obtaining the word vectors of the first sentence and the second sentence through a preset encoder model; performing vector weighting on the word vectors of the first sentence and the second sentence respectively through a topic-word distribution to obtain a first sentence vector and a second sentence vector, wherein the topic-word distribution is obtained by performing topic-word decomposition on texts in a text library of the same type; and calculating the similarity from the first sentence vector and the second sentence vector to obtain the sentence similarity. Generating the sentence similarity between the first sentence and the second sentence based on the topic-word distribution improves the flexibility and objectivity with which topics and words are set, thereby ensuring the accuracy of the sentence similarity.

Description

Technical Field

[0001] This invention relates to the field of computer technology, particularly to the field of artificial intelligence technology, and especially to a method and apparatus for generating sentence similarity.

Background Technology

[0002] With the rapid development of the internet, the amount of information online is growing exponentially. This massive amount of data provides users with a rich source of query data. In related technologies, users typically use various search engines to find similar sentences or texts. However, the topics assigned to similar sentences or texts in search engines are determined subjectively. Judging which topic a word belongs to in this way relies on human cognition and lacks accuracy, resulting in inaccurate calculations of sentence similarity.

Summary of the Invention

[0003] One object of this invention is to provide a sentence similarity generation method that generates the sentence similarity between a first sentence and a second sentence based on a topic-word distribution, improving the flexibility and objectivity of setting topics and words, thereby ensuring the accuracy of sentence similarity. Another object of this invention is to provide a sentence similarity generation apparatus. A further object of this invention is to provide a computer-readable medium. Yet another object of this invention is to provide a computer device.

[0004] To achieve the above objectives, this invention discloses a method for generating sentence similarity, comprising:

[0005] Obtaining a first sentence and a second sentence to be compared, the first sentence being a sentence in a first text and the second sentence being a sentence in a second text;

[0006] Performing vector weighting on the first sentence and the second sentence respectively, using a preset encoder model and a topic-word distribution, to obtain a first sentence vector and a second sentence vector;

[0007] Calculating the similarity based on the first sentence vector and the second sentence vector to obtain the sentence similarity.

[0008] Preferably, before performing vector weighting on the first and second sentences respectively using a preset encoder model and topic-word distribution to obtain the first sentence vector and the second sentence vector, the method further includes:

[0009] Obtaining historical text;

[0010] Performing topic-word decomposition on the historical text using a document topic generation algorithm to obtain the topic-word distribution.

[0011] Preferably, the method further includes:

[0012] Perform language preprocessing on the first text and the second text respectively to obtain the preprocessed first text and the second text;

[0013] Using a document topic generation algorithm, topic-word decomposition is performed on the preprocessed first and second texts respectively to obtain the topic-word distribution of the first and second texts.

[0014] Based on the topic-word distribution of the first text and the topic-word distribution of the second text, the topic-word distribution is updated to obtain the updated topic-word distribution.

[0015] Preferably, the topic-word distribution includes multiple topics, multiple words corresponding to each topic, and the probability of each word;

[0016] Using a pre-defined encoder model and topic-word distribution, the first and second sentences are weighted by vectors to obtain the first sentence vector and the second sentence vector, respectively, including:

[0017] The word weight distribution is obtained by normalizing the probability corresponding to each word.

[0018] Using a pre-defined encoder model and word weight distribution, the first and second sentences are weighted by vectors to obtain the vectors for the first and second sentences.

[0019] Preferably, by using a preset encoder model and word weight distribution, the first and second sentences are weighted into vectors to obtain the first sentence vector and the second sentence vector, respectively, including:

[0020] The first sentence is encoded using a pre-defined encoder model to obtain multiple first word vectors;

[0021] By weighting and fusing multiple first word vectors through word weight distribution, the first sentence vector is obtained;

[0022] The second sentence is encoded using a pre-defined encoder model to obtain multiple second word vectors.

[0023] By weighting and fusing multiple second word vectors through word weight distribution, the second sentence vector is obtained.

[0024] Preferably, after calculating the similarity based on the first sentence vector and the second sentence vector to obtain the sentence similarity, the method further includes:

[0025] Using a preset retrieval and annotation strategy, the first sentence is annotated in the first text and the second sentence is annotated in the second text based on the sentence similarity.

[0026] The present invention also discloses a sentence similarity generation device, comprising:

[0027] The first acquisition unit is used to acquire a first sentence and a second sentence to be compared, wherein the first sentence is a sentence in the first text and the second sentence is a sentence in the second text;

[0028] The vector weighting unit is used to perform vector weighting on the first and second sentences respectively using a preset encoder model and topic-word distribution to obtain the first sentence vector and the second sentence vector;

[0029] The similarity generation unit is used to calculate the similarity between the first sentence vector and the second sentence vector to obtain the sentence similarity.

[0030] Preferably, the device further includes:

[0031] The second acquisition unit is used to acquire historical text;

[0032] The first topic-word decomposition unit is used to perform topic-word decomposition on historical text using a document topic generation algorithm to obtain the topic-word distribution.

[0033] Preferably, the device further includes:

[0034] The language preprocessing unit is used to perform language preprocessing on the first text and the second text respectively, to obtain the preprocessed first text and the second text.

[0035] The second topic-word decomposition unit is used to perform topic-word decomposition on the preprocessed first and second texts respectively using a document topic generation algorithm to obtain the topic-word distribution of the first text and the topic-word distribution of the second text.

[0036] The update unit is used to update the topic-word distribution based on the first text topic-word distribution and the second text topic-word distribution to obtain the updated topic-word distribution.

[0037] Preferably, the topic-word distribution includes multiple topics, multiple words corresponding to each topic, and the probability of each word;

[0038] The vector weighting unit is specifically used to normalize the probability of each word to obtain the word weight distribution; through the preset encoder model and word weight distribution, the first sentence and the second sentence are weighted by vector to obtain the first sentence vector and the second sentence vector.

[0039] Preferably, the vector weighting unit is specifically used to encode the first sentence using a preset encoder model to obtain multiple first word vectors; to weight and fuse the multiple first word vectors using word weight distribution to obtain a first sentence vector; to encode the second sentence using a preset encoder model to obtain multiple second word vectors; and to weight and fuse the multiple second word vectors using word weight distribution to obtain a second sentence vector.

[0040] Preferably, the device further includes:

[0041] The annotation unit is used to annotate the first sentence in the first text and the second sentence in the second text based on the sentence similarity, according to the preset retrieval and annotation strategy.

[0042] The present invention also discloses a computer-readable medium having a computer program stored thereon, which, when executed by a processor, implements the method described above.

[0043] The present invention also discloses a computer device, including a memory and a processor, wherein the memory is used to store information including program instructions, and the processor is used to control the execution of the program instructions, wherein the processor executes the program to implement the method described above.

[0044] The present invention also discloses a computer program product, including a computer program / instruction, which, when executed by a processor, implements the method described above.

[0045] This invention obtains a first sentence and a second sentence to be compared. The first sentence is a sentence in a first text, and the second sentence is a sentence in a second text. Through a preset encoder model and a topic-word distribution, vector weighting is performed on the first sentence and the second sentence respectively to obtain a first sentence vector and a second sentence vector. The similarity is calculated based on the first sentence vector and the second sentence vector to obtain the sentence similarity. Because the sentence similarity between the first sentence and the second sentence is generated based on the topic-word distribution, the flexibility and objectivity of setting topics and words are improved, thereby ensuring the accuracy of the sentence similarity.

Attached Figure Description

[0046] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0047] Figure 1 is a flowchart of a sentence similarity generation method provided in an embodiment of the present invention;

[0048] Figure 2 is a flowchart of another sentence similarity generation method provided in an embodiment of the present invention;

[0049] Figure 3 is a schematic diagram of a topic-word distribution provided in an embodiment of the present invention;

[0050] Figure 4 is a schematic diagram of the structure of a sentence similarity generation device provided in an embodiment of the present invention;

[0051] Figure 5 is a schematic diagram of the structure of a computer device provided in an embodiment of the present invention.

Detailed Implementation

[0052] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0053] It should be noted that the sentence similarity generation method and apparatus disclosed in this application can be used in the field of artificial intelligence technology, or in any field other than artificial intelligence technology. The application field of the sentence similarity generation method and apparatus disclosed in this application is not limited.

[0054] To facilitate understanding of the technical solution provided in this application, the relevant content of the technical solution will be explained below. Currently, text similarity is widely used in search engines for recommending similar articles, online websites for recommending similar products, and social media for recommending similar content. Text similarity calculation is also widely used in financial analysis, such as in financial sentiment analysis and tracking, financial report analysis, and central bank communication behavior analysis.

[0055] In financial professional reports, the texts are often rich in content, cover multiple financial topics, and use highly specialized vocabulary. Directly comparing the similarity of the entire text is too coarse-grained and not conducive to further quantitative analysis. If the text could be broken down into fine-grained topics and similarity calculations could be performed on each topic, it would facilitate the work of financial analysts, allowing them to quickly focus their attention on the topics of interest and providing support for quantitative analysis.

[0056] This invention proposes a sentence similarity generation method. It constructs a topic-word model by performing machine learning on a large amount of text; it then performs topic segmentation on the target text, weights the sentence vectors according to the target topic, and extracts the principal components of the target topic direction to achieve fine-grained topic-direction similarity calculation. This invention can perform machine learning on large amounts of text, dynamically build and improve the topic-word model in real time, and accurately calculate the similarity of text based on target topics of interest to users, helping users to more accurately grasp the information changes in the text.

[0057] The following uses a sentence similarity generation device as an example to illustrate the implementation process of the sentence similarity generation method provided in this embodiment of the invention. It is understood that the execution subject of the sentence similarity generation method provided in this embodiment of the invention includes, but is not limited to, a sentence similarity generation device.

[0058] Figure 1 is a flowchart of a sentence similarity generation method provided in an embodiment of the present invention. As shown in Figure 1, the method includes:

[0059] Step 101: Obtain the first sentence and the second sentence to be compared. The first sentence is a sentence in the first text, and the second sentence is a sentence in the second text.

[0060] Step 102: Using the preset encoder model and topic-word distribution, the first sentence and the second sentence are weighted by vector to obtain the vector of the first sentence and the vector of the second sentence.

[0061] Step 103: Calculate the similarity based on the first sentence vector and the second sentence vector to obtain the sentence similarity.

[0062] It is worth noting that the acquisition, storage, use, and processing of data in the technical solution of this application all comply with the relevant provisions of national laws and regulations. The user information in the embodiments of this application was obtained through legal and compliant means, and the acquisition, storage, use, and processing of user information have been authorized and agreed upon by the client.

[0063] In the technical solution provided by this invention, a first sentence and a second sentence to be compared are obtained. The first sentence is a sentence in a first text, and the second sentence is a sentence in a second text. Through a preset encoder model and a topic-word distribution, vector weighting is performed on the first sentence and the second sentence respectively to obtain a first sentence vector and a second sentence vector. The similarity is calculated based on the first sentence vector and the second sentence vector to obtain the sentence similarity. Because the sentence similarity between the first sentence and the second sentence is generated based on the topic-word distribution, the flexibility and objectivity of setting topics and words are improved, thereby ensuring the accuracy of the sentence similarity.

[0064] Figure 2 is a flowchart of another sentence similarity generation method provided in an embodiment of the present invention. As shown in Figure 2, the method includes:

[0065] Step 201: Obtain historical text.

[0066] In this embodiment of the invention, each step is performed by a sentence similarity generation device.

[0067] In this embodiment of the invention, historical texts can be selected from a database, such as financial reports. Multiple financial reports constitute a financial text sequence. Each text in the sequence, viewed from a time series perspective, is equivalent to different words appearing in the thesaurus at different points in time.

[0068] Step 202: Using the document topic generation algorithm (Latent Dirichlet Allocation, LDA), perform topic-word decomposition on the historical text to obtain the topic-word distribution, which includes multiple words and the probability corresponding to each word.

[0069] In this embodiment of the invention, the LDA algorithm is a three-layer Bayesian probabilistic model comprising three layers: word, topic, and document. Each word in a text is generated by first selecting a topic with a certain probability and then selecting the word from that topic with a certain probability. The LDA algorithm is an unsupervised learning technique that can be used to identify hidden topic information in massive document collections. It employs the bag-of-words method, which represents each document as a word-frequency vector, transforming textual information into numerical information.

[0070] Specifically, the historical text is input into the LDA algorithm for decomposition, and the text-topic distribution and the topic-word distribution are output. The text-topic distribution includes multiple topics corresponding to the historical text and the probability that the historical text belongs to each topic. The topic-word distribution includes multiple topics, multiple words corresponding to each topic, and the probability of each word. All words belong to the historical text. Figure 3 is a schematic diagram of a topic-word distribution provided in an embodiment of the present invention. As shown in Figure 3, the upper plot is the distribution for topic 0, with words on the horizontal axis and probabilities on the vertical axis, where the probability reflects how often a word appears under topic 0 in the text; the lower plot is the distribution for topic 1, with words on the horizontal axis and probabilities on the vertical axis, where the probability reflects how often a word appears under topic 1 in the text.

[0071] In this embodiment of the invention, the same word can belong to different topics, and the probability of a word is determined by word frequency, thereby improving the flexibility of word usage.
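
The two-stage selection described in [0069] (pick a topic, then pick a word from that topic) can be illustrated with a minimal Python sketch. It is not the LDA training procedure itself, only the sampling step it assumes; the distributions `TOPIC_WORD` and `DOC_TOPIC`, the example words, and the function name `generate_words` are hypothetical and chosen for illustration. Note that the word "rate" appears under both topics with different probabilities, as [0071] allows.

```python
import random

# Hypothetical topic-word distribution: topic id -> {word: probability}.
# Each topic's probabilities sum to 1; "rate" belongs to both topics.
TOPIC_WORD = {
    0: {"rate": 0.5, "inflation": 0.3, "policy": 0.2},
    1: {"rate": 0.2, "credit": 0.5, "loan": 0.3},
}

# Hypothetical text-topic distribution for one text: topic id -> probability.
DOC_TOPIC = {0: 0.7, 1: 0.3}


def generate_words(doc_topic, topic_word, n_words, seed=0):
    """Draw n_words by sampling a topic first, then a word from that topic."""
    rng = random.Random(seed)  # seeded so that repeated runs give the same draw
    topics = list(doc_topic)
    topic_probs = [doc_topic[t] for t in topics]
    words = []
    for _ in range(n_words):
        topic = rng.choices(topics, weights=topic_probs, k=1)[0]
        vocab = list(topic_word[topic])
        word_probs = [topic_word[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=word_probs, k=1)[0])
    return words


sample = generate_words(DOC_TOPIC, TOPIC_WORD, n_words=8)
print(sample)
```

LDA training runs this picture in reverse: given only the observed words, it estimates distributions of the shape shown in `TOPIC_WORD` and `DOC_TOPIC`.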

[0072] Step 203: Obtain the first sentence and the second sentence to be compared. The first sentence is a sentence in the first text, and the second sentence is a sentence in the second text.

[0073] In this embodiment of the invention, the first text and the second text are two texts to be compared. The first sentence is extracted from the first text, and the second sentence is extracted from the second text.

[0074] It is worth noting that the embodiments of the present invention do not limit the correspondence between the first sentence in the first text and the second sentence in the second text. The embodiments of the present invention take the example of the first text including only the first sentence and the second text including only the second sentence to illustrate the process of sentence similarity calculation.

[0075] Further, language preprocessing is performed on the first and second texts respectively to obtain preprocessed first and second texts. Specifically, natural language processing (NLP) is used to preprocess the first and second texts, including but not limited to word segmentation and word cleaning. Word segmentation includes, but is not limited to, sentence segmentation and word segmentation, while word cleaning includes, but is not limited to, stop word removal. The LDA algorithm is then used to perform topic-word decomposition on the preprocessed first and second texts respectively, resulting in topic-word distributions for the first and second texts. Based on these topic-word distributions, the topic-word distribution is updated to obtain an updated topic-word distribution. Specifically, the topic-word distributions of the first and second texts are added to the topic-word distribution, and the probability of each word under each topic is updated to obtain the updated topic-word distribution. Dynamically updating the topic-word distribution improves the accuracy of sentence similarity.
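
The preprocessing and update described in [0075] can be sketched as follows. This is a simplified illustration under stated assumptions: the stop-word list `STOP_WORDS`, the regular-expression tokenizer (suited to English text only), and the function names are hypothetical; topics from the new texts are assumed to be aligned with the existing topics by topic id; and "adding" the distributions is read here as summing probabilities per word and renormalizing each topic, which is one possible interpretation rather than the patent's stated formula.

```python
import re

# Hypothetical stop-word list; a real system would use a fuller, domain-specific one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def preprocess(text):
    """Split text into sentences, then into lowercase tokens without stop words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [
        [w for w in re.findall(r"[a-z]+", s.lower()) if w not in STOP_WORDS]
        for s in sentences
    ]


def update_topic_word(base, *new_distributions):
    """Add new per-topic word probabilities to the base, then renormalize each topic.

    Assumes topic ids in the new distributions refer to the same topics as in base.
    """
    merged = {topic: dict(words) for topic, words in base.items()}
    for dist in new_distributions:
        for topic, words in dist.items():
            bucket = merged.setdefault(topic, {})
            for word, prob in words.items():
                bucket[word] = bucket.get(word, 0.0) + prob
    for topic, words in merged.items():
        total = sum(words.values())
        if total > 0:
            merged[topic] = {w: p / total for w, p in words.items()}
    return merged


# Usage with hypothetical distributions for the existing model and two new texts.
base = {0: {"rate": 0.6, "policy": 0.4}}
first = {0: {"rate": 0.5, "inflation": 0.5}}
second = {0: {"policy": 1.0}}
updated = update_topic_word(base, first, second)
print(preprocess("The rate is rising. Inflation and credit!"))
print(updated)
```

Renormalizing after the merge keeps each topic a valid probability distribution, so the normalization in step 204 can be applied to the updated distribution unchanged.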

[0076] Step 204: Normalize the probability of each word to obtain the word weight distribution.

[0077] As an optional solution, suppose that under topic A, word a1 has the highest probability, 0.9, and word a10 has the lowest probability, 0.1. The range between them is divided into 100 intervals, each of width (0.9-0.1) / 100. If the probability of word a3 falls in the nth interval, then the weight of word a3 is n(0.9-0.1) / 100.

[0078] It is worth noting that other normalization methods can also be adopted, and the embodiments of the present invention do not limit this.
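
The interval-based normalization in [0077] can be sketched in Python as follows. The function name `interval_weights` and the parameter `n_intervals` are hypothetical. Two points are assumptions rather than details from the text: intervals are numbered from 1 so that the lowest-probability word lands in the first interval, and when all probabilities are equal every word receives the same weight of 1.0.

```python
import math


def interval_weights(word_probs, n_intervals=100):
    """Map each word's probability to n * (p_max - p_min) / n_intervals,
    where n is the 1-based index of the interval the probability falls in."""
    p_min = min(word_probs.values())
    p_max = max(word_probs.values())
    step = (p_max - p_min) / n_intervals
    if step == 0:
        # Assumption: with no spread in probability, weight all words equally.
        return {word: 1.0 for word in word_probs}
    weights = {}
    for word, p in word_probs.items():
        n = math.ceil((p - p_min) / step)
        # Clamp so the minimum falls in interval 1 and rounding cannot exceed the top.
        n = min(max(n, 1), n_intervals)
        weights[word] = n * step
    return weights


# Hypothetical probabilities under one topic, following the example in [0077].
topic_a = {"a1": 0.9, "a3": 0.504, "a10": 0.1}
print(interval_weights(topic_a))
```

With this scheme the weights lie between one interval width and the full range p_max - p_min, so words with higher probability under the topic contribute more to the sentence vector.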

[0079] Step 205: Using the preset encoder model and word weight distribution, the first sentence and the second sentence are weighted by vector to obtain the vector of the first sentence and the vector of the second sentence.

[0080] In this embodiment of the invention, the encoder model is either a BERT model or an s-BERT model.

[0081] In this embodiment of the invention, step 205 specifically includes:

[0082] Step 2051: Encode the first sentence using a preset encoder model to obtain multiple first word vectors.

[0083] Specifically, the first sentence is input into the encoder model for encoding, and multiple first word vectors are output.

[0084] Step 2052: Through word weight distribution, multiple first word vectors are weighted and fused to obtain the first sentence vector.

[0085] Specifically, the weight corresponding to each first word vector is retrieved from the word weight distribution, and the first sentence vector is obtained by weighting each first word vector with its corresponding weight: $s_1 = \sum_i \alpha_i w_i$, where $s_1$ is the first sentence vector, $\alpha_i$ is the weight corresponding to the first word vector $w_i$, and $w_i$ is the first word vector.

[0086] Step 2053: Encode the second sentence using a preset encoder model to obtain multiple second word vectors.

[0087] Specifically, the second sentence is input into the encoder model for encoding, and multiple second word vectors are output.

[0088] Step 2054: Through word weight distribution, multiple second word vectors are weighted and fused to obtain the second sentence vector.

[0089] Specifically, the weight corresponding to each second word vector is retrieved from the word weight distribution, and the second sentence vector is obtained by weighting each second word vector with its corresponding weight: $s_2 = \sum_m \beta_m w_m$, where $s_2$ is the second sentence vector, $\beta_m$ is the weight corresponding to the second word vector $w_m$, and $w_m$ is the second word vector.
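
The weighted fusion in steps 2052 and 2054 can be sketched as a weighted sum of word vectors. In this illustration the encoder output is replaced by small hand-written vectors, since running a BERT model is outside the scope of a short example; the function name `weighted_sentence_vector`, the `(token, vector)` pair format, and the `default_weight` used for words absent from the word weight distribution are all hypothetical.

```python
def weighted_sentence_vector(encoded, word_weights, default_weight=0.0):
    """Fuse word vectors into one sentence vector as a weighted sum.

    encoded: list of (token, vector) pairs, standing in for encoder output.
    word_weights: word -> weight, taken from the word weight distribution.
    default_weight: assumption for words missing from word_weights.
    """
    if not encoded:
        raise ValueError("sentence has no word vectors")
    dim = len(encoded[0][1])
    sentence = [0.0] * dim
    for token, vector in encoded:
        alpha = word_weights.get(token, default_weight)
        for k, value in enumerate(vector):
            sentence[k] += alpha * value
    return sentence


# Toy two-dimensional vectors in place of real encoder output.
first_encoded = [("rate", [1.0, 0.0]), ("rises", [0.0, 1.0])]
weights = {"rate": 0.8, "rises": 0.2}
s1 = weighted_sentence_vector(first_encoded, weights)
print(s1)
```

The same function serves both sentences: the second sentence vector is obtained by passing the second sentence's word vectors with the same word weight distribution.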

[0090] Step 206: Calculate the similarity based on the first sentence vector and the second sentence vector to obtain the sentence similarity.

[0091] Specifically, the similarity is calculated through $\cos\theta = \frac{s_1 \cdot s_2}{\lVert s_1 \rVert \, \lVert s_2 \rVert}$, where $s_1$ is the first sentence vector, $s_2$ is the second sentence vector, and $\cos\theta$ is the sentence similarity between the first sentence vector and the second sentence vector.
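
The cosine similarity in step 206 can be computed with the standard library alone, as in the sketch below. The function name `cosine_similarity` is hypothetical, and returning 0.0 when either vector has zero length is an assumption made to avoid division by zero, not a rule stated in the text.

```python
import math


def cosine_similarity(s1, s2):
    """Return the cosine of the angle between two sentence vectors."""
    if len(s1) != len(s2):
        raise ValueError("sentence vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(s1, s2))
    norm1 = math.sqrt(sum(a * a for a in s1))
    norm2 = math.sqrt(sum(b * b for b in s2))
    if norm1 == 0.0 or norm2 == 0.0:
        # Assumption: a zero vector is treated as having no similarity.
        return 0.0
    return dot / (norm1 * norm2)


# Toy sentence vectors for illustration.
print(cosine_similarity([0.8, 0.2], [0.6, 0.4]))
```

Because cosine similarity depends only on direction, scaling all word weights by the same factor leaves the sentence similarity unchanged.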

[0092] Step 207: Using a preset retrieval and annotation strategy, the first sentence is annotated in the first text and the second sentence is annotated in the second text based on the sentence similarity.

[0093] In this embodiment of the invention, the retrieval and annotation strategy can be set according to actual needs, and this embodiment of the invention does not limit it. As an optional solution, the retrieval and annotation strategy includes a similarity threshold. If the sentence similarity is greater than or equal to the similarity threshold, the first sentence is annotated in the first text and the second sentence is annotated in the second text according to the set annotation method. For example, if the annotation method is highlighting, the first sentence is highlighted in the first text and the second sentence is highlighted in the second text, which helps users quickly locate relevant content.
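
The threshold-based strategy in [0093] can be sketched as follows. The function name `annotate_pair`, the default threshold of 0.8, and the `[[` and `]]` markers standing in for highlighting are hypothetical; the text leaves both the threshold and the annotation method open.

```python
def annotate_pair(first_text, second_text, first_sentence, second_sentence,
                  similarity, threshold=0.8, marks=("[[", "]]")):
    """Mark both sentences in their texts when similarity meets the threshold."""
    if similarity < threshold:
        # Below the threshold, both texts are returned unchanged.
        return first_text, second_text
    open_mark, close_mark = marks

    def highlight(text, sentence):
        # Wrap only the first occurrence of the sentence in the text.
        return text.replace(sentence, f"{open_mark}{sentence}{close_mark}", 1)

    return highlight(first_text, first_sentence), highlight(second_text, second_sentence)


# Usage with hypothetical texts and a hypothetical similarity score.
a, b = annotate_pair(
    "Growth slowed. Rates were raised.",
    "Rates went up. Exports fell.",
    "Rates were raised.",
    "Rates went up.",
    similarity=0.91,
)
print(a)
print(b)
```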

[0094] In the technical solution of the sentence similarity generation method provided in this embodiment of the invention, a first sentence and a second sentence to be compared are obtained. The first sentence is a sentence in a first text, and the second sentence is a sentence in a second text. Through a preset encoder model and topic-word distribution, the first sentence and the second sentence are respectively weighted by vector to obtain a first sentence vector and a second sentence vector. The similarity is calculated based on the first sentence vector and the second sentence vector to obtain the sentence similarity. The sentence similarity between the first sentence and the second sentence is generated based on the topic-word distribution, which improves the flexibility and objectivity of setting topics and words, thereby ensuring the accuracy of sentence similarity.

[0095] Figure 4 is a schematic diagram of the structure of a sentence similarity generation device provided in an embodiment of the present invention. This device is used to execute the above-described sentence similarity generation method. As shown in Figure 4, the device includes: a first acquisition unit 11, a vector weighting unit 12, and a similarity generation unit 13.

[0096] The first acquisition unit 11 is used to acquire a first sentence and a second sentence to be compared. The first sentence is a sentence in the first text, and the second sentence is a sentence in the second text.

[0097] The vector weighting unit 12 is used to perform vector weighting on the first sentence and the second sentence respectively through a preset encoder model and topic-word distribution to obtain the first sentence vector and the second sentence vector.

[0098] The similarity generation unit 13 is used to calculate the similarity based on the first sentence vector and the second sentence vector to obtain the sentence similarity.

[0099] In this embodiment of the invention, the device further includes a second acquisition unit 14 and a first topic-word decomposition unit 15.

[0100] The second acquisition unit 14 is used to acquire historical text.

[0101] The first topic-word decomposition unit 15 is used to perform topic-word decomposition on historical text using a document topic generation algorithm to obtain topic-word distribution.

[0102] In this embodiment of the invention, the apparatus further includes: a language preprocessing unit 16, a second topic-word decomposition unit 17, and an update unit 18.

[0103] The language preprocessing unit 16 is used to perform language preprocessing on the first text and the second text respectively, to obtain the preprocessed first text and the second text.

[0104] The second topic-word decomposition unit 17 is used to perform topic-word decomposition on the preprocessed first text and second text respectively through a document topic generation algorithm to obtain the topic-word distribution of the first text and the topic-word distribution of the second text.

[0105] The update unit 18 is used to update the topic-word distribution based on the first text topic-word distribution and the second text topic-word distribution to obtain the updated topic-word distribution.

[0106] In this embodiment of the invention, the topic-word distribution includes multiple topics, multiple words corresponding to each topic, and the probability corresponding to each word; the vector weighting unit 12 is specifically used to perform normalization processing based on the probability corresponding to each word to obtain the word weight distribution; through the preset encoder model and word weight distribution, the first sentence and the second sentence are respectively vector-weighted to obtain the first sentence vector and the second sentence vector.

[0107] In this embodiment of the invention, the vector weighting unit 12 is specifically used to encode the first sentence using a preset encoder model to obtain multiple first word vectors; to perform weighted fusion of the multiple first word vectors using word weight distribution to obtain a first sentence vector; to encode the second sentence using a preset encoder model to obtain multiple second word vectors; and to perform weighted fusion of the multiple second word vectors using word weight distribution to obtain a second sentence vector.

[0108] In this embodiment of the invention, the device further includes an annotation unit 19.

[0109] The annotation unit 19 is used to annotate the first sentence in the first text and the second sentence in the second text based on the sentence similarity, according to a preset retrieval and annotation strategy.

[0110] In the embodiment of the present invention, a first sentence and a second sentence to be compared are obtained. The first sentence is a sentence in a first text, and the second sentence is a sentence in a second text. Through a preset encoder model and a topic-word distribution, vector weighting is performed on the first sentence and the second sentence respectively to obtain a first sentence vector and a second sentence vector. The similarity is calculated based on the first sentence vector and the second sentence vector to obtain the sentence similarity. Because the sentence similarity between the first sentence and the second sentence is generated based on the topic-word distribution, the flexibility and objectivity of setting topics and words are improved, thereby ensuring the accuracy of the sentence similarity.

[0111] The systems, devices, modules, or units described in the above embodiments can be implemented by computer chips or entities, or by products with certain functions. A typical implementation device is a computer device. Specifically, the computer device can be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or any combination of these devices.

[0112] This invention provides a computer device including a memory and a processor. The memory is used to store information including program instructions, and the processor is used to control the execution of the program instructions. When the program instructions are loaded and executed by the processor, they implement the steps of the above-described sentence similarity generation method. For a detailed description, please refer to the above-described embodiments of the sentence similarity generation method.

[0113] Reference is now made to Figure 5, which shows a schematic diagram of the structure of a computer device 600 suitable for implementing the embodiments of this application.

[0114] As shown in Figure 5, the computer device 600 includes a central processing unit (CPU) 601, which can perform various appropriate tasks and processes based on programs stored in read-only memory (ROM) 602 or programs loaded from storage section 608 into random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the computer device 600. The CPU 601, ROM 602, and RAM 603 are interconnected via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

[0115] The following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, etc.; an output section 607 including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, etc.; a storage section 608 including a hard disk, etc.; and a communication section 609 including a network interface card such as a LAN card, a modem, etc. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, optical disk, magneto-optical disk, or semiconductor memory, is mounted on the drive 610 as needed so that computer programs read from it can be installed in the storage section 608 as needed.

[0116] In particular, according to embodiments of the present invention, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of the present invention include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program including program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via the communication section 609, and/or installed from the removable medium 611.

[0117] Computer-readable media include both permanent and non-permanent, removable and non-removable media that can store information using any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

[0118] For ease of description, the above devices are described as various units divided by function. Of course, in implementing this application, the functions of the units can be implemented in one or more pieces of software and/or hardware.

[0119] This invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0120] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0121] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0122] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0123] The acquisition, storage, use, and processing of data in this application all comply with the relevant provisions of national laws and regulations.

[0124] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0125] This application can be described in the general context of computer-executable instructions, such as program modules, that are executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform a specific task or implement a specific abstract data type. This application can also be practiced in distributed computing environments where tasks are performed by remote processing devices connected via a communication network. In distributed computing environments, program modules can reside in local and remote computer storage media, including storage devices.

[0126] The various embodiments in this specification are described in a progressive manner. For identical or similar parts, the embodiments may be cross-referenced, and each embodiment focuses on its differences from the other embodiments. In particular, because the system embodiments are basically similar to the method embodiments, their description is relatively brief; for the relevant parts, refer to the descriptions in the method embodiments.

[0127] The above description is merely an embodiment of this application and is not intended to limit the scope of this application. Various modifications and variations can be made to this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of the claims of this application.

Claims

1. A method for generating sentence similarity, characterized in that the method includes: obtaining historical text; performing topic-word decomposition on the historical text using a document topic generation algorithm to obtain an initial topic-word distribution; obtaining a first sentence and a second sentence to be compared, where the first sentence is a sentence in a first text and the second sentence is a sentence in a second text; performing language preprocessing on the first text and the second text respectively to obtain a preprocessed first text and a preprocessed second text; performing topic-word decomposition on the preprocessed first text and the preprocessed second text respectively, using a document topic generation algorithm, to obtain a first text topic-word distribution and a second text topic-word distribution; updating the initial topic-word distribution based on the first text topic-word distribution and the second text topic-word distribution to obtain a target topic-word distribution, where the updating includes adding the first text topic-word distribution and the second text topic-word distribution to the initial topic-word distribution and updating the probability of each word under each topic; performing vector weighting on the first sentence and the second sentence respectively, using a preset encoder model and the target topic-word distribution, to obtain a first sentence vector and a second sentence vector; and calculating the similarity based on the first sentence vector and the second sentence vector to obtain the sentence similarity.

2. The sentence similarity generation method according to claim 1, characterized in that the target topic-word distribution includes multiple topics, multiple words corresponding to each topic, and the probability corresponding to each word; and performing vector weighting on the first sentence and the second sentence respectively, using a preset encoder model and the target topic-word distribution, to obtain a first sentence vector and a second sentence vector includes: normalizing the probability corresponding to each word in the target topic-word distribution to obtain a word weight distribution; and performing vector weighting on the first sentence and the second sentence respectively, using the preset encoder model and the word weight distribution, to obtain the first sentence vector and the second sentence vector.

3. The sentence similarity generation method according to claim 2, characterized in that performing vector weighting on the first sentence and the second sentence respectively, using the preset encoder model and the word weight distribution, to obtain the first sentence vector and the second sentence vector includes: encoding the first sentence using the preset encoder model to obtain multiple first word vectors; performing weighted fusion of the multiple first word vectors using the word weight distribution to obtain the first sentence vector; encoding the second sentence using the preset encoder model to obtain multiple second word vectors; and performing weighted fusion of the multiple second word vectors using the word weight distribution to obtain the second sentence vector.

4. The sentence similarity generation method according to claim 1, characterized in that, after calculating the similarity based on the first sentence vector and the second sentence vector to obtain the sentence similarity, the method further includes: annotating, based on a preset retrieval annotation strategy and according to the sentence similarity, the first sentence in the first text and the second sentence in the second text.

5. A sentence similarity generation device, characterized in that the device includes: a first acquisition unit, used to acquire historical text; a first topic-word decomposition unit, used to perform topic-word decomposition on the historical text using a document topic generation algorithm to obtain an initial topic-word distribution; a second acquisition unit, used to acquire a first sentence and a second sentence to be compared, where the first sentence is a sentence in a first text and the second sentence is a sentence in a second text; a language preprocessing unit, used to perform language preprocessing on the first text and the second text respectively to obtain a preprocessed first text and a preprocessed second text; a second topic-word decomposition unit, used to perform topic-word decomposition on the preprocessed first text and the preprocessed second text respectively, using a document topic generation algorithm, to obtain a first text topic-word distribution and a second text topic-word distribution; an update unit, used to update the initial topic-word distribution according to the first text topic-word distribution and the second text topic-word distribution to obtain a target topic-word distribution, the update unit being specifically used to add the first text topic-word distribution and the second text topic-word distribution to the initial topic-word distribution and to update the probability of each word under each topic to obtain the target topic-word distribution; a vector weighting unit, used to perform vector weighting on the first sentence and the second sentence respectively, using a preset encoder model and the target topic-word distribution, to obtain a first sentence vector and a second sentence vector; and a similarity generation unit, used to calculate the similarity based on the first sentence vector and the second sentence vector to obtain the sentence similarity.

6. The sentence similarity generation device according to claim 5, characterized in that, The target topic-word distribution includes multiple topics, multiple words corresponding to each topic, and the probability of each word. The vector weighting unit is specifically used to normalize the probability of each word in the target topic-word distribution to obtain the word weight distribution; and to perform vector weighting on the first sentence and the second sentence respectively through the preset encoder model and word weight distribution to obtain the first sentence vector and the second sentence vector.

7. The sentence similarity generation device according to claim 6, characterized in that the vector weighting unit is specifically used to encode the first sentence using the preset encoder model to obtain multiple first word vectors; to perform weighted fusion of the multiple first word vectors using the word weight distribution to obtain the first sentence vector; to encode the second sentence using the preset encoder model to obtain multiple second word vectors; and to perform weighted fusion of the multiple second word vectors using the word weight distribution to obtain the second sentence vector.

8. The sentence similarity generation device according to claim 5, characterized in that the device further includes: an annotation unit, used to annotate, based on a preset retrieval annotation strategy and according to the sentence similarity, the first sentence in the first text and the second sentence in the second text.

9. A computer-readable medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the sentence similarity generation method according to any one of claims 1 to 4.

10. A computer device comprising a memory and a processor, the memory for storing information including program instructions, and the processor for controlling the execution of the program instructions, characterized in that, when the program instructions are loaded and executed by the processor, they implement the sentence similarity generation method according to any one of claims 1 to 4.

11. A computer program product, comprising a computer program or instructions, characterized in that, when the computer program or instructions are executed by a processor, they implement the sentence similarity generation method according to any one of claims 1 to 4.
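
As a non-limiting illustration of the updating step recited in claims 1 and 5, adding the first text and second text topic-word distributions to the initial topic-word distribution and then updating the probability of each word under each topic might be sketched as follows. The nested-dictionary layout, the matching of topics by name, the summing of probabilities, and the per-topic renormalization are assumptions made for this sketch; the claims do not specify how the probabilities are recomputed.

```python
from typing import Dict

# Hypothetical layout: topic -> {word -> probability of the word under that topic}.
TopicWord = Dict[str, Dict[str, float]]


def update_distribution(initial: TopicWord, *additions: TopicWord) -> TopicWord:
    """Merge new topic-word distributions into the initial one and renormalize."""
    # Copy so that the initial distribution is left unmodified.
    merged: TopicWord = {topic: dict(words) for topic, words in initial.items()}
    for distribution in additions:
        for topic, words in distribution.items():
            bucket = merged.setdefault(topic, {})
            for word, prob in words.items():
                # Assumption: probabilities of the same word under the same topic are summed.
                bucket[word] = bucket.get(word, 0.0) + prob
    target: TopicWord = {}
    for topic, words in merged.items():
        total = sum(words.values())
        # Renormalize so that the word probabilities under each topic sum to 1.
        target[topic] = {w: p / total for w, p in words.items()} if total else dict(words)
    return target


# Illustrative values only; they are not taken from the patent.
initial = {"rates": {"rate": 0.6, "cut": 0.4}}
first_text = {"rates": {"rate": 0.5, "hike": 0.5}}
second_text = {"fx": {"yuan": 1.0}}
target = update_distribution(initial, first_text, second_text)
```

In this toy example, the new word "hike" and the new topic "fx" both enter the target distribution, and the probabilities under "rates" are rescaled to account for the added mass.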