Text deduplication method and electronic device

By using a deduplication model based on a self-attention mechanism and multi-dimensional similarity calculation, the problem of low text deduplication accuracy in professional scenarios is solved, achieving efficient and accurate text deduplication.

CN122285643A · Pending · Publication Date: 2026-06-26 · LAUNCH TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications (China)
Current Assignee / Owner
LAUNCH TECH CO LTD
Filing Date
2026-03-26
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing text deduplication solutions have low deduplication accuracy in professional scenarios and struggle to identify synonymous heterogeneous texts, that is, texts with the same meaning but different wording.

Method used

A deduplication model based on self-attention mechanism is adopted, which combines domain parameters and multi-dimensional similarity calculation, dynamically adjusts the similarity threshold, and performs semantic understanding and text deduplication through self-attention mechanism.

Benefits of technology

It improves the accuracy and efficiency of text deduplication, adapts to the text characteristics of different fields, and is suitable for efficient deduplication in professional scenarios.


Figure CN122285643A_ABST

Abstract

This invention provides a text deduplication method and an electronic device. The text deduplication method includes: acquiring a set of texts to be deduplicated; inputting the set of texts to be deduplicated into a deduplication model, causing the model to deduplicate the texts to obtain deduplication results. The set of texts to be deduplicated includes multiple texts, which belong to at least one domain. The deduplication model is configured as a semantic understanding model based on a self-attention mechanism and provides at least one domain parameter. The domain parameter is used to determine whether the texts to be deduplicated in the corresponding domain are similar texts. Texts determined to be similar texts are deleted after deduplication. This text deduplication method can improve the accuracy of text deduplication.

Description

Technical Field

[0001] This application belongs to the technical field of data deduplication, and particularly relates to a text deduplication method and electronic device.

Background Technology

[0002] With the rapid development of the internet and information technology, text data has experienced explosive growth, widely distributed across social media, corporate documents, academic databases, and various professional platforms. While this massive amount of text contains valuable information, it also contains a large amount of redundant data with duplicate or similar content. This not only consumes storage resources but also affects the efficiency and accuracy of subsequent data analysis, retrieval, and recommendation tasks. Therefore, efficient and accurate text deduplication technology has become a crucial step in data processing.

[0003] However, current text deduplication solutions suffer from low deduplication accuracy, which is particularly pronounced in professional scenarios.

Summary of the Invention

[0004] This application provides a text deduplication method and electronic device, which can solve the problem of low deduplication accuracy in current text deduplication schemes.

[0005] In a first aspect, embodiments of this application provide a text deduplication method. This text deduplication method includes: obtaining a set of texts to be deduplicated; inputting the set of texts to be deduplicated into a deduplication model, so that the deduplication model deduplicates the set of texts to be deduplicated, thereby obtaining a deduplication result. The set of texts to be deduplicated includes multiple texts to be deduplicated, and the multiple texts to be deduplicated are texts from at least one domain. The deduplication model is configured as a model for semantic understanding based on a self-attention mechanism and provides at least one domain parameter, which is used to determine whether the texts to be deduplicated in the corresponding domain are similar texts. Texts determined to be similar texts are deleted after deduplication.

[0006] The text deduplication method provided in this embodiment introduces a deduplication model based on a self-attention mechanism for semantic understanding. This self-attention mechanism allows the deduplication model to directly calculate the association weights between a given word and all other words in the text, and then aggregate the information in a weighted manner. This enables the model to simultaneously focus on all information in the entire text and deeply understand the context and long-distance dependencies, thus possessing the ability to identify synonymous heterogeneous text. Furthermore, the deduplication model utilizes domain parameters that are dynamically adjusted according to the domain of the text to be deduplicated to identify similar texts, allowing the text deduplication method to adapt to the text characteristics of different domains. Therefore, this text deduplication method enhances the model's generalization ability across different domains while improving the depth of semantic understanding to increase deduplication accuracy, thereby improving the deduplication accuracy in professional scenarios.
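As a non-limiting illustration of the association weights and weighted aggregation described above, the following Python sketch computes single-head scaled dot-product self-attention over a few toy word vectors. It is a simplified assumption rather than the model of this application: the learned query, key, and value projections of a real Transformer are omitted, and the names `softmax` and `self_attention` and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head scaled dot-product self-attention without learned projections."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Association score between this word and every word in the text.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Weighted aggregation of all word vectors.
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        )
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy word vectors
contextual = self_attention(tokens)
```

Each output vector is a weighted mix of every input vector, which is why the representation of a word can reflect context that is far away in the text.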

[0007] In some design approaches, the deduplication model described above deduplicates the text set to be deduplicated, obtaining a deduplication result. This includes: the deduplication model preprocessing each text in the text set to be deduplicated, obtaining at least one preprocessed text. The preprocessing includes at least labeling with domain tags. Each preprocessed text has corresponding domain parameters based on its labeled domain tag. These domain parameters include a similarity threshold range, which is used to determine the similarity threshold parameter. For each preprocessed text, the similarity between the preprocessed text and each compared text in the corresponding domain's comparison text set is calculated. The similarity is then compared with the similarity threshold parameter corresponding to the preprocessed text to obtain at least one comparison result for the preprocessed text. Based on the at least one comparison result for each preprocessed text, similar texts in the text set to be deduplicated are determined. All texts determined to be similar are deleted from the text set to be deduplicated, resulting in the deduplication result.

[0008] It should be noted that deduplication of the text set to be deduplicated involves deleting texts identified as similar to other texts within the set. Similar texts are those with a high degree of similarity to the comparison object. In this embodiment, the comparison object is each text in the corresponding domain's comparison text set. It can be understood that the higher the similarity between the preprocessed text and the comparison text, the more similar the two texts are, and vice versa. Therefore, this embodiment uses the comparison result between the similarity between the preprocessed text and the comparison text and a similarity threshold parameter as the basis for determining the level of similarity to identify similar texts in the text set to be deduplicated.

[0009] This embodiment uses each compared text in the corresponding domain's comparison text set as the comparison object, so that the preprocessed text is compared only with the comparison texts of its own domain. This filters out a large number of invalid text comparisons, thereby improving deduplication efficiency. Furthermore, this embodiment's scheme uses domain tags to dynamically adjust the similarity threshold (that is, the similarity threshold parameter) based on the domain to which the text belongs, solving the problem of poor adaptability of a fixed threshold across different domains.

[0010] Optionally, the preprocessing also includes removing general stop words from the text to be deduplicated; and calling a domain stop word list based on the domain of the text to be deduplicated to remove domain stop words from the text to be deduplicated.

[0011] In this embodiment, text cleaning is performed by combining general stop words and domain-specific stop words to remove the interference of irrelevant words, strengthen the extraction effect of the core semantics of the text, and further improve the purity of semantic representation and the deduplication accuracy. General stop words (such as "of, is, in") and domain stop words (such as non-distinguishing words like "inspection, repair" in the automotive repair field) are deleted to eliminate meaningless redundant information, enabling the semantic vector to focus on core keywords (such as "OBD fault code P0135"), and improving the accuracy and efficiency of similarity calculation.
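The two-stage cleaning described above can be sketched as follows. This is a minimal illustration under stated assumptions: the stop word lists, the domain key `auto_repair`, the whitespace tokenization, and the function name `remove_stop_words` are hypothetical and are not specified by this application.

```python
# Hypothetical stop word lists; real lists would be maintained per domain.
GENERAL_STOP_WORDS = {"of", "is", "in", "the", "a"}
DOMAIN_STOP_WORDS = {"auto_repair": {"inspection", "repair"}}


def remove_stop_words(text, domain):
    """Drop general stop words, then the stop words of the text's domain."""
    domain_words = DOMAIN_STOP_WORDS.get(domain, set())
    kept = [
        token
        for token in text.split()  # assumption: whitespace tokenization
        if token.lower() not in GENERAL_STOP_WORDS
        and token.lower() not in domain_words
    ]
    return " ".join(kept)


cleaned = remove_stop_words(
    "Inspection of the OBD fault code P0135 is in repair", "auto_repair"
)
# cleaned == "OBD fault code P0135"
```

A text from a domain without its own list still has the general stop words removed, because the domain lookup falls back to an empty set.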

[0012] In some optional embodiments, the domain parameters further include model fine-tuning parameters. Calculating the similarity between the preprocessed text and each comparison text in the comparison text set corresponding to the domain includes: determining the semantic vector of the preprocessed text and obtaining the semantic vector of each comparison text in the comparison text set corresponding to the domain; calculating the cosine similarity between the semantic vector of the preprocessed text and the semantic vector of each comparison text respectively; inputting each text pair formed by the preprocessed text and a comparison text into the semantic embedding module of the deduplication model, so that the semantic model configured with the model fine-tuning parameters in the semantic embedding module outputs the semantic correlation degree between the preprocessed text and each comparison text respectively; and performing a fusion calculation on the semantic correlation degree between the preprocessed text and each comparison text and the cosine similarity between their semantic vectors to obtain the similarity between the preprocessed text and each comparison text.

[0013] In this embodiment, surface features are captured through cosine similarity, and deep semantic associations are captured through the semantic correlation degree output by the semantic model. A multi-dimensional similarity fusion calculation method is used to obtain the similarity between the preprocessed text and the comparison text, effectively improving the accuracy of similarity judgment and reducing the risk of misjudgment caused by relying on single-vector similarity alone. In addition, through the configuration of the dimensionality reduction parameter and the third parameter, the similarity calculation is adapted to the domain characteristics, further improving the deduplication accuracy in professional scenarios and the adaptability to different domains.
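One possible form of the fusion calculation is a weighted sum, sketched below. This is an assumption for illustration only: this application does not state the fusion formula in this passage, the weight `cosine_weight` is hypothetical, and the semantic correlation degree is passed in as a plain number because the semantic model that would produce it is outside the scope of the sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two semantic vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def fused_similarity(cosine, relevance, cosine_weight=0.5):
    """Weighted fusion of cosine similarity and semantic correlation degree.

    cosine_weight is a hypothetical tuning value in [0, 1].
    """
    return cosine_weight * cosine + (1.0 - cosine_weight) * relevance


cos = cosine_similarity([1.0, 0.0], [1.0, 0.0])
score = fused_similarity(cos, relevance=0.8)
```

With equal weights, a pair whose vectors coincide but whose semantic correlation degree is only 0.8 receives a fused score between the two values, so neither signal decides the outcome alone.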

[0014] In some non-restrictive embodiments, the comparison text set is a set formed by at least part of the original texts in the original text set. Determining the similar texts in the text set to be deduplicated according to at least one comparison result of each preprocessed text includes: for each preprocessed text, if there is a comparison result with a similarity exceeding the similarity threshold parameter in at least one comparison result, the text to be deduplicated corresponding to the preprocessed text is determined as a similar text.

[0015] This embodiment provides a deduplication scheme for incremental text deduplication scenarios. In this scenario, there is typically a pre-deduplicated and stored set of original text. Newly acquired text only needs to be compared with at least a portion of the existing original text to determine if it is similar, thus achieving efficient deduplication of incremental text and avoiding duplicate storage. Furthermore, this scheme compares the newly added text with the existing text, eliminating the need for pairwise comparisons within the newly added text, reducing computational complexity and improving deduplication efficiency. It is particularly suitable for real-time data streams or periodically updated data scenarios.
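The incremental scenario can be sketched as follows. The sketch assumes that texts are already represented by unit-length semantic vectors, uses a plain dot product as a stand-in for the similarity described above, and uses a single fixed threshold; the function and variable names are hypothetical.

```python
def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def find_similar_new_texts(new_items, stored_items, similarity, threshold):
    """Return IDs of new texts that exceed the threshold against any stored text.

    New texts are compared only with stored original texts, never with each other.
    """
    similar_ids = []
    for text_id, vector in new_items.items():
        if any(similarity(vector, old) > threshold for old in stored_items.values()):
            similar_ids.append(text_id)
    return similar_ids


stored = {"o1": [1.0, 0.0]}                      # already deduplicated originals
incoming = {"n1": [0.6, 0.8], "n2": [1.0, 0.0]}  # newly acquired texts
to_delete = find_similar_new_texts(incoming, stored, dot, threshold=0.9)
# to_delete == ["n2"]
```

For n new texts and m stored texts this performs at most n × m comparisons and skips the pairwise comparisons among the new texts.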

[0016] In other non-limiting embodiments, the comparison text set is a set formed by at least some of the other texts among the at least one preprocessed text. Determining the similar texts in the text set to be deduplicated based on at least one comparison result for each preprocessed text includes: dividing the at least one preprocessed text into at least one connected component based on the at least one comparison result for each preprocessed text; and, for each connected component, determining the text to be deduplicated corresponding to one node in the connected component as the representative text, and determining the texts to be deduplicated corresponding to the remaining nodes in the connected component as similar texts.

[0017] This embodiment provides a deduplication scheme for a full-text deduplication scenario. In this scenario, the focus is on addressing the repetitive relationships within the entire text set to be deduplicated. The scheme first obtains similarity relationships through similarity comparison, and then clusters directly or indirectly similar texts into the same connected component based on the connected component algorithm. Within each connected component, a representative text is selected and retained, while the rest are marked as similar texts and deleted. This achieves the goal of removing redundancy while retaining representative information within each cluster of similar texts, improving the deduplication efficiency of large-scale data. It is suitable for comprehensive deduplication scenarios that require a thorough cleaning of large-scale text sets in one go.
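The connected component step can be sketched with a union-find structure. This is an illustration under assumptions: this application does not name a specific connected component algorithm or a rule for choosing the representative text, so union-find and "first node in input order" are choices made here, and all names are hypothetical.

```python
def connected_components(node_ids, similar_pairs):
    """Group nodes that are directly or indirectly linked by a similar pair."""
    parent = {node: node for node in node_ids}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for node in node_ids:
        groups.setdefault(find(node), []).append(node)
    return list(groups.values())


def split_representatives(components):
    """Keep one node per component (assumption: the first); mark the rest similar."""
    kept, removed = [], []
    for component in components:
        kept.append(component[0])
        removed.extend(component[1:])
    return kept, removed


ids = ["t1", "t2", "t3", "t4", "t5"]
pairs = [("t1", "t2"), ("t2", "t3")]  # pairs whose similarity exceeded the threshold
kept, removed = split_representatives(connected_components(ids, pairs))
# kept == ["t1", "t4", "t5"], removed == ["t2", "t3"]
```

Here t1 and t3 are never compared as similar directly, yet they land in the same component through t2, which is the indirect similarity the paragraph above refers to.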

[0018] In some design approaches, obtaining the semantic vector of each compared text in the comparison text set corresponding to the domain includes: performing an ANN search in a vector database that supports Approximate Nearest Neighbor Search (ANN) to obtain the semantic vector of each compared text in the comparison text set corresponding to the domain; the vector database stores the semantic vectors of multiple texts and their corresponding domain labels, and the search scope of the ANN search is the semantic vector of the text associated with the domain labels of the preprocessed text.

[0019] This embodiment utilizes a vector database that supports Approximate Nearest Neighbor (ANN) search to achieve efficient semantic vector retrieval. By limiting the retrieval scope through domain tags, it avoids indiscriminate searching across the entire database, significantly reducing computational power consumption and retrieval time, and supporting rapid deduplication of large-scale text.
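The effect of restricting retrieval by domain tag can be sketched as follows. Note the substitution: the sketch performs an exact brute-force top-k search as a stand-in for ANN search, because no particular vector database or ANN index is named in this application. The record layout and the function name `domain_top_k` are hypothetical.

```python
import heapq


def domain_top_k(query, records, domain, k=2):
    """Return IDs of the k most similar stored vectors within one domain.

    Exact search over unit-length vectors; a real system would delegate this
    to an ANN index with a domain filter.
    """
    scored = [
        (sum(q * v for q, v in zip(query, record["vector"])), record["id"])
        for record in records
        if record["domain"] == domain  # domain filtering happens before scoring
    ]
    return [text_id for _, text_id in heapq.nlargest(k, scored)]


records = [
    {"id": "a", "domain": "medical", "vector": [1.0, 0.0]},
    {"id": "b", "domain": "auto_repair", "vector": [1.0, 0.0]},
    {"id": "c", "domain": "medical", "vector": [0.0, 1.0]},
    {"id": "d", "domain": "medical", "vector": [0.8, 0.6]},
]
nearest = domain_top_k([1.0, 0.0], records, "medical", k=2)
# nearest == ["a", "d"]
```

Record b matches the query exactly but is never scored, because it carries a different domain tag.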

[0020] Optionally, the domain parameters also include dimensionality reduction parameters. The above-mentioned determination of the semantic vector of the preprocessed text includes: determining the semantic embedding vector of the preprocessed text; performing vector optimization on the semantic embedding vector to obtain the semantic vector of the preprocessed text; the vector optimization includes at least one of dimensionality reduction processing according to the dimensionality reduction dimension indicated by the dimensionality reduction parameters, unit vector normalization processing, and redundant vector filtering.

[0021] In this embodiment, by performing vector optimization (dimensionality reduction, normalization, and redundancy filtering) on the semantic vector of the preprocessed text, the storage and computational load are significantly reduced while retaining key semantic information, thereby improving the computational efficiency of large-scale text deduplication.
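Two of the three optimization steps can be sketched with the standard library alone. The sketch makes two assumptions: "redundant vector filtering" is interpreted here as dropping vectors that are nearly identical to one already kept, which this application does not define, and the cutoff `max_cosine` is a hypothetical value. The dimensionality reduction step is omitted because it normally relies on a numerical library.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return list(vector) if norm == 0.0 else [x / norm for x in vector]


def filter_redundant(unit_vectors, max_cosine=0.999):
    """Keep a vector only if it is not nearly identical to one already kept."""
    kept = []
    for vector in unit_vectors:
        if all(sum(a * b for a, b in zip(vector, k)) < max_cosine for k in kept):
            kept.append(vector)
    return kept


raw = [[3.0, 4.0], [6.0, 8.0], [0.0, 2.0]]
unit = [l2_normalize(v) for v in raw]
unique = filter_redundant(unit)
```

After normalization the first two vectors point in the same direction, so only one of them survives the filter, and the dot product of unit vectors can be used directly as cosine similarity.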

[0022] Optionally, preprocessing may also include keyword addition for very short text and / or segmentation for very long text. Very long text refers to text with more than a first threshold of characters, and very short text refers to text with less than a second threshold of characters.

[0023] In this embodiment, keyword enhancement and segmentation strategies are employed for very short texts and very long texts, respectively, to optimize the semantic representation quality of texts of different lengths. To address the insufficient semantic information in very short texts, keywords are added to supplement the semantics; to address the issue of very long texts exceeding the model's input limits, segmentation is used to adapt the model's capabilities. This ensures the quality of semantic vector generation for texts of different lengths, avoids deduplication bias caused by text length, and improves the adaptability of the deduplication model to texts of varying lengths.
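The length-dependent branch can be sketched as follows. The threshold values, the fixed-width character segmentation, and the way keywords are appended are assumptions chosen for brevity; this application only states that a first threshold defines very long text and a second threshold defines very short text.

```python
def adapt_length(text, keywords, first_threshold=10, second_threshold=5):
    """Return a list of segments ready for semantic embedding.

    first_threshold: more characters than this means "very long" (segment it).
    second_threshold: fewer characters than this means "very short" (add keywords).
    """
    if len(text) > first_threshold:
        # Assumption: fixed-width segmentation; sentence-aware splitting is also possible.
        return [
            text[i:i + first_threshold]
            for i in range(0, len(text), first_threshold)
        ]
    if len(text) < second_threshold:
        return [" ".join([text] + list(keywords))]
    return [text]


long_segments = adapt_length("a" * 25, [])
short_enriched = adapt_length("O2", ["oxygen", "sensor"])
unchanged = adapt_length("P0135", ["oxygen"])
```

In practice the first threshold would be tied to the input limit of the semantic model rather than to a small constant.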

[0024] In some examples, if the text to be deduplicated is a long text, it is segmented into a preprocessed text containing multiple segments. The determination of the semantic embedding vector of the preprocessed text includes: inputting each segment of the preprocessed text into the semantic embedding module to extract the basic semantic vector of each segment from the semantic model of the semantic embedding module; and performing at least attention-weighted average aggregation on the basic semantic vectors of each segment to obtain the semantic embedding vector of the preprocessed text.

[0025] In this embodiment, an attention-weighted aggregation strategy is used to integrate the core semantics of each segment of the long text into a complete basic semantic vector of the long text. This avoids the problem of semantic loss in long texts and semantic fragmentation caused by segmentation processing, and significantly improves the accuracy of deduplication of long texts.
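Attention-weighted average aggregation can be sketched as a softmax-weighted mean of the segment vectors. How the per-segment attention scores are obtained is not described in this passage, so the sketch takes them as an input; the function name and the toy values are hypothetical.

```python
import math


def aggregate_segments(segment_vectors, attention_scores):
    """Combine per-segment vectors into one vector using softmax weights."""
    peak = max(attention_scores)
    exps = [math.exp(s - peak) for s in attention_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(segment_vectors[0])
    return [
        sum(w * vec[i] for w, vec in zip(weights, segment_vectors))
        for i in range(dim)
    ]


segments = [[1.0, 0.0], [0.0, 1.0]]
uniform = aggregate_segments(segments, [0.0, 0.0])  # equal scores: plain mean
skewed = aggregate_segments(segments, [2.0, 0.0])   # first segment dominates
```

With equal scores the result reduces to an ordinary average; unequal scores pull the aggregated vector toward the segments judged more important.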

[0026] In some embodiments, before comparing the similarity with the similarity threshold parameter corresponding to the preprocessed text, the text deduplication method further includes: determining a value from the similarity threshold range corresponding to the domain label of the preprocessed text as the similarity threshold parameter corresponding to the preprocessed text based on the text length of the preprocessed text.

[0027] In this embodiment, the similarity threshold (similarity threshold parameter) can be dynamically adjusted according to the text domain and text length (e.g., a more lenient threshold for long texts and a more stringent threshold for academic fields), which enhances the adaptability of the deduplication model to diverse text scenarios and further improves the flexibility and accuracy of deduplication judgment.
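One way to pick a value from a domain's similarity threshold range based on text length is linear interpolation, sketched below. The ranges, the length bounds `short_len` and `long_len`, and the linear rule itself are assumptions; this application only indicates that longer texts may receive a more lenient threshold.

```python
# Hypothetical per-domain similarity threshold ranges (low, high).
THRESHOLD_RANGES = {"medical": (0.85, 0.95), "auto_repair": (0.80, 0.90)}


def pick_threshold(domain, text_length, short_len=50, long_len=500):
    """Choose a threshold inside the domain range; longer text gets a lower value."""
    low, high = THRESHOLD_RANGES[domain]
    ratio = (text_length - short_len) / (long_len - short_len)
    ratio = min(1.0, max(0.0, ratio))  # clamp so the result stays in [low, high]
    return high - ratio * (high - low)


strict = pick_threshold("medical", text_length=10)     # short text, upper bound
lenient = pick_threshold("medical", text_length=1000)  # long text, lower bound
middle = pick_threshold("medical", text_length=275)    # halfway between the bounds
```

Clamping guarantees that the chosen value never leaves the range configured for the domain, however short or long the text is.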

[0028] In a second aspect, embodiments of this application provide an electronic device. The electronic device includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the text deduplication method provided in any embodiment of the first aspect.

[0029] In a third aspect, embodiments of this application provide a computer program product. This computer program product stores a computer program, which, when executed by a processor, implements the text deduplication method provided in any embodiment of the first aspect.

[0030] It is understood that, for the beneficial effects of the electronic device provided in the second aspect and the computer program product provided in the third aspect, reference can be made to the relevant description of the text deduplication method provided in the first aspect, and they will not be repeated here.

Attached Figure Description

[0031] To more clearly illustrate the technical solutions in the embodiments of this application, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0032] Figure 1 is a flowchart of a text deduplication method provided in an embodiment of this application;

Figure 2 is a schematic diagram of the deduplication process of the deduplication model provided in an embodiment of this application;

Figure 3 is a schematic diagram of the architecture of a deduplication model provided in an embodiment of this application;

Figure 4 is a first flowchart of the model semantic embedding provided in an embodiment of this application;

Figure 5 is a second flowchart of the model semantic embedding provided in an embodiment of this application;

Figure 6 is a flowchart of the multi-dimensional similarity calculation process provided in an embodiment of this application;

Figure 7 is a first flowchart of the process of determining and deduplicating similar text provided in an embodiment of this application;

Figure 8 is a second flowchart of the process of determining and deduplicating similar text provided in an embodiment of this application;

Figure 9 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0033] In the following description, specific details such as particular system architectures and techniques are set forth for illustrative purposes and not for limitation, in order to provide a thorough understanding of the embodiments of this application. However, those skilled in the art will understand that this application may also be implemented in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits, and methods have been omitted so as not to obscure the description of this application with unnecessary detail.

[0034] In this application, "and / or" is merely a description of the relationship between related objects, indicating that three relationships can exist; for example, A and / or B can represent three cases: A alone, A and B simultaneously, and B alone. Additionally, the character " / " in this document generally indicates that the preceding and following related objects have an "or" relationship.

[0035] It should be understood that, when used in this application specification and the appended claims, the term "comprising" indicates the presence of the described features, integers, steps, operations, elements and / or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and / or a collection thereof.

[0036] Furthermore, in the description of this application and the appended claims, the terms "first," "second," "third," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.

[0037] References to "one embodiment" or "some embodiments" as described in this specification mean that one or more embodiments of this application include a specific feature, structure, or characteristic described in connection with that embodiment. Therefore, the phrases "in one embodiment," "in some embodiments," "in other embodiments," "in still other embodiments," etc., appearing in different parts of this specification do not necessarily refer to the same embodiment, but rather mean "one or more, but not all, embodiments," unless otherwise specifically emphasized. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless otherwise specifically emphasized.

[0038] Current text deduplication schemes are mainly divided into two categories: traditional deduplication schemes based on string matching and semantic deduplication schemes based on deep learning.

[0039] Traditional deduplication schemes rely on surface character features of text. They cannot accurately determine repetition of texts that have superficial differences but the same meaning, such as synonym replacement, word order adjustment, and sentence transformation. They also have difficulty identifying synonymous heterogeneous texts.

[0040] Currently, most semantic deduplication schemes based on deep learning are based on RNNs and LSTMs. While these schemes improve semantic understanding to some extent, because RNNs / LSTMs use a sequential recursive structure to process text, their hidden states decay or are lost in the long sequence, resulting in weak long-range dependency capture capabilities and difficulty in capturing deep semantic relationships.

[0041] It is evident that both traditional deduplication schemes and RNN / LSTM-based semantic deduplication schemes suffer from insufficient semantic understanding depth, resulting in low deduplication accuracy, which is particularly pronounced in specialized scenarios. To address this technical problem, embodiments of this application provide the text deduplication method shown in Figures 1 to 9.

[0042] Please refer to Figure 1, which is a flowchart of a text deduplication method provided in an embodiment of this application. The text deduplication method includes the following steps S100 to S300:

S100: Obtain the set of texts to be deduplicated.

[0043] The text set to be deduplicated includes multiple texts to be deduplicated; that is, the text set to be deduplicated is a collection of multiple texts to be deduplicated. Furthermore, the multiple texts to be deduplicated belong to at least one domain, meaning the text set to be deduplicated covers texts from at least one domain, such as medical, auto repair, and technology. For example, if the text set to be deduplicated includes five texts to be deduplicated: t1, t2, t3, t4, and t5, where t1 and t2 belong to the medical domain, t3 and t4 belong to the auto repair domain, and t5 belongs to the technology domain, then the text set to be deduplicated covers texts from three domains: medical, auto repair, and technology.

[0044] The set of texts to be deduplicated can be obtained through various methods, such as receiving batch files via an interface, extracting data from a database, monitoring data streams in real time, or loading data from storage media. Each text in the obtained set, along with its corresponding text ID, can be stored in a text database.

[0045] S200: Input the text set to be deduplicated into the deduplication model so that the deduplication model can deduplicate the text set to be deduplicated, obtain the deduplication result, and generate a deduplication result report.

[0046] The term "deduplication of a text set" refers to the process of deleting texts identified as similar to other texts within the set. It should be noted that "similar texts" refers to texts with a high degree of similarity to the comparison objects. In this embodiment, the comparison objects can be other texts within the current set of texts to be deduplicated, or original texts stored in the text database or another database. The former typically involves full-scale text deduplication, aiming to thoroughly clean a large-scale text set in one go; the latter typically involves incremental text deduplication, aiming to clean real-time or periodically updated text data.

[0047] As can be seen, the primary task in deduplicating a set of texts is to determine whether each text to be deduplicated is similar. This embodiment achieves this through a deduplication model. To improve the accuracy of similar text identification, the deduplication model is required to have deep semantic understanding capabilities. Based on this, the deduplication model is configured as a model for semantic understanding based on a self-attention mechanism. In other words, the module of the deduplication model used for semantic understanding is implemented based on a self-attention mechanism.

[0048] For example, the deduplication model embeds a deep self-attention (Transformer) model, such as a self-attention-based Bidirectional Encoder Representations from Transformers (BERT) model or a Generative Pre-trained Transformer (GPT) model, which enables semantic understanding.

[0049] To enhance semantic understanding capabilities in professional scenarios, this deduplication model provides at least one domain parameter covering at least one domain of the text set to be deduplicated. The domain parameter is used to determine whether the text to be deduplicated in the corresponding domain is similar to other text. In this way, the deduplication model can determine whether the text to be deduplicated is similar to other text based on the domain parameter of the domain to which the text to be deduplicated belongs.

[0050] Continuing with the previous example, when the set of texts to be deduplicated covers three fields: medical, auto repair, and technology, the deduplication model should provide at least the domain parameters for the medical field, the auto repair field, and the technology field. Specifically, the deduplication model uses the domain parameters for the medical field to identify whether texts t1 and t2 are similar, uses the domain parameters for the auto repair field to identify whether texts t3 and t4 are similar, and uses the domain parameters for the technology field to identify whether text t5 is similar.

[0051] The deduplication result report mentioned above indicates which texts in the text set to be deduplicated have been deleted and / or retained. For example, the report may include a list of similar (also called redundant) / dissimilar (also called unique) texts, similarity scores, etc. Because the deduplication model needs to determine which texts in the text set are similar during the deduplication process, and can rely on the similarity scores between the texts to be deduplicated and the comparison texts to determine similar texts, a deduplication result report can be generated simultaneously.
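A deduplication result report of the kind described above could be assembled as in the following sketch. The report layout, the field names, and the function name `build_report` are hypothetical; this application only states that the report may list similar and dissimilar texts together with similarity scores.

```python
def build_report(best_scores, thresholds):
    """Assemble a report from per-text best similarity scores.

    best_scores: text ID -> highest similarity against its comparison texts.
    thresholds: text ID -> similarity threshold parameter for that text.
    """
    report = {"similar": [], "unique": [], "scores": dict(best_scores)}
    for text_id, score in best_scores.items():
        key = "similar" if score > thresholds[text_id] else "unique"
        report[key].append(text_id)
    return report


report = build_report(
    {"t1": 0.97, "t2": 0.42, "t3": 0.91},
    {"t1": 0.90, "t2": 0.90, "t3": 0.95},
)
# report["similar"] == ["t1"], report["unique"] == ["t2", "t3"]
```

Text t3 scores higher than t2 but is still reported as unique, because its own threshold is stricter than its score.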

[0052] S300: Output the deduplication result report. During implementation, the deduplication result report can be output via a display screen or other means.

[0053] The text deduplication method provided in the embodiment shown in Figure 1 introduces a deduplication model based on a self-attention mechanism for semantic understanding. This self-attention mechanism allows the deduplication model to directly calculate the association weights between a word and all other words in the text during semantic understanding, and then aggregate the information in a weighted manner. This enables the deduplication model to simultaneously focus on all information in the entire text and deeply understand the context and long-distance dependencies, thus possessing the ability to identify synonymous heterogeneous texts. Furthermore, the deduplication model utilizes domain parameters that are dynamically adjusted according to the domain of the text to be deduplicated to identify similar texts, allowing the text deduplication method to adapt to the text characteristics of different domains. Therefore, this text deduplication method enhances the model's generalization ability across different domains while improving the depth of semantic understanding.

[0054] The following explanation takes as an example the case where at least a portion of the original texts in the original text set already stored in the text database serves as the comparison texts. In this case, the text deduplication method shown in Figure 1 is suitable for incremental text deduplication scenarios.

[0055] Please refer to Figure 2, which is a schematic diagram of the deduplication process of the deduplication model provided in the embodiments of this application. It can be understood as a further refinement of step S200 in Figure 1 and may include steps S210 to S291.

[0056] Step S210: Receive the set of texts to be deduplicated;

Step S220: Text preprocessing (domain labeling → data cleaning → text standardization → adaptive length processing);

Step S230: Determine whether the text is empty / invalid. If so, discard the text; otherwise, proceed to step S240;

Step S240: Model semantic embedding;

Step S250: Vector optimization (PCA dimensionality reduction → L2 normalization → redundant vector filtering);

Step S260: Vector database retrieval (domain filtering → ANN retrieval → return Top-K vectors);

Step S270: Multi-dimensional similarity calculation (cosine similarity → semantic relevance → weighted fusion);

Step S280: Dynamic threshold decision (load domain threshold → adjust based on text length → determine similarity threshold);

Step S290: Identify and remove duplicate text;

Step S291: Generate a deduplication result report.

[0057] Each of the above steps is explained in detail below in combination with the deduplication model shown in Figure 3.

[0058] Please see Figure 3, which is a schematic diagram of the architecture of the deduplication model provided in this embodiment. The deduplication model may include a text preprocessing module, a semantic embedding module, a vector optimization module, a vector retrieval module, a similarity matching and dynamic decision-making module, and a domain adaptation module. In specific implementations, the deduplication model may include more or fewer modules, and this embodiment does not limit this.

[0059] The text preprocessing module primarily preprocesses the text to be deduplicated, providing high-quality input for the semantic embedding module. Subsequent embodiments will describe the preprocessing process in more detail; it will not be elaborated upon here.

[0060] The semantic embedding module leverages the strong semantic understanding capabilities of large models to extract deep semantic features from text and generate highly discriminative semantic embedding vectors. This module is the core of the deduplication model for semantic understanding and can include semantic models based on self-attention mechanisms, such as deep Transformer models. Unlike shallow Transformer models, deep Transformer models typically have more than six neural network layers, such as 24 or more.

[0061] For example, the semantic model can be either a BERT model or a GPT model. In practice, the size of the semantic model can be dynamically selected based on hardware resource constraints. For instance, if hardware resources are sufficient, the BERT-large version can be chosen to improve accuracy; if hardware resources are limited, the BERT-base version can be chosen to balance efficiency. The BERT-Base version typically supports 768-dimensional vectors, while the BERT-Large version typically supports 1024-dimensional vectors. Higher dimensions represent stronger semantic representation capabilities, but also higher computational and storage costs.

[0062] The semantic model is trained through two stages: pre-training and fine-tuning. Pre-training involves training the initial model using general text to enable it to understand the semantics of common terms. Fine-tuning involves further training the pre-trained model using domain-specific text to enable it to understand specialized terminology. The model training task is text semantic matching, which gives the trained semantic model the ability to match text semantics, such as calculating the semantic relevance between two texts.

[0063] The vector optimization module is used to further optimize the semantic embedding vectors generated by the semantic embedding module. While ensuring semantic distinguishability, it reduces vector dimension and storage cost, and improves subsequent retrieval efficiency.

[0064] The vector retrieval module is used to achieve efficient vector storage and approximate nearest neighbor (ANN) retrieval, thereby improving the efficiency of deduplication in large-scale text.

[0065] The similarity matching and dynamic decision-making module is used to accurately calculate text similarity and determine whether the text to be deduplicated is similar to other texts.

[0066] The domain adaptation module provides domain parameters to other modules, improving scenario adaptability. For example, domain parameters may include a domain stop word list, model fine-tuning parameters, dimensionality reduction parameters, and similarity threshold ranges. Among these, model fine-tuning parameters refer to the model's specific parameters obtained after fine-tuning training using domain text. Dimensionality reduction parameters indicate the dimensionality reduction dimension of the vector dimensionality reduction process. Similarity threshold ranges are used to determine the similarity threshold parameters.

[0067] In practice, a domain parameter library can be built to store domain parameters and corresponding domain tags for each domain. When loading is required, the domain adaptation module can respond to the needs of each module and dynamically load domain parameters for each module based on the domain tags of the text. Furthermore, this domain parameter library can be developed to support user adjustments to parameters and updates to the library, continuously optimizing deduplication performance.
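A domain parameter library of this kind can be sketched as a small lookup structure. The following is a minimal illustration only: the class and field names (`DomainParams`, `DomainParamLibrary`, `reduced_dim`, and so on) are hypothetical, the fallback to a default entry for unknown tags is an assumption of the sketch, and model fine-tuning parameters are omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainParams:
    """Domain parameters named in the text; field names are illustrative."""
    stop_words: frozenset = frozenset()       # domain stop word list
    reduced_dim: int = 256                    # dimensionality reduction parameter
    threshold_range: tuple = (0.75, 0.85)     # similarity threshold range


class DomainParamLibrary:
    """Maps a domain tag to its parameters, with a default fallback."""

    def __init__(self, default: DomainParams):
        self._default = default
        self._params = {}

    def update(self, tag: str, params: DomainParams) -> None:
        # Supports user adjustments and updates to the library.
        self._params[tag] = params

    def load(self, tag: str) -> DomainParams:
        # Assumption: unknown tags fall back to the default parameters.
        return self._params.get(tag, self._default)


library = DomainParamLibrary(default=DomainParams())
library.update("medical", DomainParams(threshold_range=(0.85, 0.95)))
library.update(
    "automotive_repair",
    DomainParams(stop_words=frozenset({"inspection", "repair"})),
)
```

Each module would then call `library.load(tag)` with the domain tag of the text it is processing.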

[0068] It is evident that, apart from the semantic embedding module, the other modules do not participate in semantic understanding and perform other functions in the deduplication process. To further clarify the role of each module, the steps of step S200 shown in Figure 2 are explained below in conjunction with the modules described above.

[0069] The aforementioned step S220 can be executed by the text preprocessing module, specifically including: the text preprocessing module preprocesses each text to be deduplicated in the set of texts to be deduplicated, obtaining at least one preprocessed text, where the preprocessing at least includes marking domain labels. After text preprocessing, high-quality input can be provided to the semantic embedding module.

[0070] The following takes the preprocessing including the following four processing steps as an example for illustration.

[0071] (1) Marking domain tags: The domain of the text can be automatically recognized by the domain classification module. In this case, the deduplication model also has a domain classification module, which is not shown in the figure. Of course, in some other embodiments, it can also be manually specified. The marked domain tags are used for the adaptation of domain parameters. It should be noted that each preprocessed text has corresponding domain parameters according to the domain tag it is marked with. In other words, the domain parameters of the domain to which the text to be deduplicated belongs are the corresponding domain parameters of the domain marked by the corresponding preprocessed text.

[0072] (2) Data cleaning: Regular expressions are used to match and remove special characters (such as the characters "@", "#", "¥"), redundant spaces, line breaks, and meaningless characters; a Chinese spelling checker is integrated to correct common spelling mistakes; non-target characters in multilingual texts are filtered. Non-target characters are non-target-language characters that serve as a translation of characters of the target language.

[0073] It should be noted that the target language can be configured as needed. For example, if the target language is configured as Chinese, then characters in other languages in the document, such as English characters, etc., belong to non-target language characters. In this embodiment, the difference between non-target characters and non-target language characters is that non-target characters are characters in non-target language characters used for language conversion (i.e., translation) of the characters of the target language. For example, in the text "今天(today)是个好天气,I like the weather", both "today" and "I like the weather" belong to non-target language characters, but "today" is the translation of the Chinese character "今天", so "today" is a non-target character. The non-target character and the target language character express the same semantics. Repeated calculation twice not only does not help semantic recognition but also increases the computational amount. Therefore, in this embodiment, filtering out non-target characters can reduce the deduplication computational amount and improve the deduplication efficiency.

[0074] (3) Text standardization: Unify the text case (English) and punctuation format; perform word segmentation using Jieba (Chinese) and the Natural Language Toolkit (NLTK) (English); call the general stop word list to delete the general stop words of the text to be deduplicated, and / or call the domain stop word list according to the domain of the text to be deduplicated to delete the domain stop words of the text to be deduplicated.

[0075] It should be noted that the domain used to select the domain stop word list can come from the domain tags marked for the text to be deduplicated, as mentioned above. In this case, the domain tags must be marked before the domain stop word list is called to delete the domain stop words in the text to be deduplicated. The domain stop word list can be regarded as one of the domain parameters.

[0076] In this embodiment, general stop words and domain-specific stop words are combined for text cleaning to remove the interference of irrelevant words and enhance the extraction effect of the core semantics of the text, further improving the purity of semantic representation and the de-duplication accuracy. General stop words (such as "of, is, in") and domain stop words (such as non-distinguishing words like "inspection, repair" in the automotive repair field) are deleted to eliminate meaningless redundant information, allowing the semantic vector to focus on the core keywords (such as "OBD fault code P0135"), improving the accuracy and efficiency of similarity calculation.

[0077] (4) Adaptive length processing: The keyword enhancement strategy is adopted to perform keyword addition processing on extremely short texts, such as extracting keywords through a model and splicing them with the original text; and / or, the segmentation strategy is adopted to perform segmentation processing on extremely long texts, such as segmenting them into segmented texts of about 500 characters according to semantic logic. Here, an extremely long text refers to a text with a character count greater than the first threshold, and an extremely short text refers to a text with a character count less than the second threshold. The first threshold and the second threshold can be set as needed.

[0078] Through the adaptive length processing in aspect (4) above, keyword enhancement and segmentation processing strategies are respectively adopted for extremely short texts and extremely long texts to optimize the semantic representation quality of texts of different lengths. For the problem of insufficient semantic information in extremely short texts, semantics are supplemented by adding keywords; for the problem that extremely long texts exceed the input limit of the model, the model capabilities are adapted through segmentation processing. In this way, the quality of generating semantic vectors for texts of different lengths is guaranteed, avoiding de-duplication biases caused by text length, and improving the adaptability of the de-duplication model to texts of different lengths.
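The adaptive length processing described above can be sketched as follows. This is a simplified illustration under stated assumptions: sentence-ending punctuation stands in for "semantic logic" when segmenting, the keyword extractor is passed in as a hypothetical callable (`extract_keywords`) rather than a real model, and the function and parameter names are not from the original.

```python
import re


def segment_text(text: str, max_chars: int = 500) -> list:
    """Pack whole sentences into segments of at most max_chars characters."""
    # Split after sentence-ending punctuation (Chinese or English).
    sentences = [s for s in re.split(r"(?<=[。!?.!?])", text) if s.strip()]
    segments, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that is itself too long.
        pieces = [sentence[i:i + max_chars]
                  for i in range(0, len(sentence), max_chars)]
        for piece in pieces:
            if current and len(current) + len(piece) > max_chars:
                segments.append(current)
                current = ""
            current += piece
    if current:
        segments.append(current)
    return segments


def adapt_length(text, short_threshold, long_threshold, extract_keywords):
    """Return a list of texts to embed: one enhanced text or several segments."""
    if len(text) < short_threshold:          # extremely short text
        keywords = extract_keywords(text)
        return [text + " " + " ".join(keywords)] if keywords else [text]
    if len(text) > long_threshold:           # extremely long text
        return segment_text(text)
    return [text]


long_text = "A" * 499 + "." + "B" * 10 + "."
segments = adapt_length(long_text, 10, 500, lambda t: [])
enhanced = adapt_length("P0135", 10, 500, lambda t: ["oxygen", "sensor"])
```

Here `long_threshold` plays the role of the first threshold and `short_threshold` the role of the second threshold.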

[0079] Through the above preprocessing in aspects (1) to (4), text noise is removed and the format is standardized, making the text suitable for large model processing, and adaptive processing is performed in combination with text length and domain characteristics, providing high-quality input for semantic embedding. It should be noted that the preprocessing in aspects (2) to (4) may include more or fewer operations, or may even be omitted entirely.
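The data cleaning and stop word removal of aspects (2) and (3) can be illustrated with a short sketch. It rests on assumptions that are not in the original: a translation gloss is taken to be a parenthesised run of Latin letters directly following a Chinese character, spelling correction is left out, and tokens are assumed to have been produced already by a word segmenter. All names are hypothetical.

```python
import re

# Special characters named in the text; the set is configurable.
SPECIAL_CHARS = re.compile(r"[@#¥]")
# Assumption: a gloss such as 今天(today) is a parenthesised Latin run
# that directly follows a Chinese character.
GLOSS = re.compile(r"(?<=[\u4e00-\u9fff])[((][A-Za-z][A-Za-z ]*[))]")
WHITESPACE = re.compile(r"\s+")

GENERAL_STOP_WORDS = frozenset({"of", "is", "in"})


def clean_text(text: str) -> str:
    """Drop translation glosses and special characters, collapse whitespace."""
    text = GLOSS.sub("", text)
    text = SPECIAL_CHARS.sub("", text)
    return WHITESPACE.sub(" ", text).strip()


def remove_stop_words(tokens, domain_stop_words=frozenset()):
    """Remove general stop words and, if given, domain stop words."""
    stop = GENERAL_STOP_WORDS | frozenset(domain_stop_words)
    return [t for t in tokens if t.lower() not in stop]


cleaned = clean_text("今天(today)是个好天气,I like the weather")
kept = remove_stop_words(
    ["inspection", "of", "OBD", "fault", "code", "P0135"],
    domain_stop_words={"inspection", "repair"},
)
```

With the example from the text, "(today)" is removed because it repeats "今天", while "I like the weather" is kept because it is not a translation of adjacent target-language characters.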

[0080] The above step S240 can be executed by the semantic embedding module, specifically including: for each preprocessed text, the semantic embedding module determines the semantic embedding vector of the preprocessed text.

[0081] It should be noted that the semantic embedding module is used to leverage the strong semantic understanding capabilities of large models to extract deep semantic features from text and generate highly discriminative semantic embedding vectors. A semantic embedding vector is a vector that represents the core semantics of the preprocessed text, typically obtained from the output vectors of deeper neural network layers, such as the output vector of the last encoder layer in BERT.

[0082] The semantic embedding module is the core module for semantic understanding in the deduplication model. It can include semantic models that achieve semantic understanding based on self-attention mechanisms, such as deep Transformer models. Unlike shallow Transformer models, deep Transformer models typically have more than 6 neural network layers, such as 24 or more.

[0083] For example, the semantic model can be either a BERT model or a GPT model. In practice, the size of the semantic model can be dynamically selected based on hardware resource constraints. For instance, if hardware resources are sufficient, the BERT-Large version can be chosen to improve accuracy; if hardware resources are limited, the BERT-Base version can be chosen to balance efficiency. The BERT-Base version typically supports 768-dimensional vectors, while the BERT-Large version typically supports 1024-dimensional vectors. Higher dimensions represent stronger semantic representation capabilities, but also higher computational and storage costs.

[0084] The semantic model is trained through two stages: pre-training and fine-tuning. Pre-training involves training the initial model using general text to enable it to understand the semantics of common terms. Fine-tuning involves further training the pre-trained model using domain-specific text to enable it to understand specialized terminology. The model training task is text semantic matching, which gives the trained semantic model the ability to match text semantics, such as calculating the semantic relevance between two texts.

[0085] The specific semantic embedding process of the above model differs depending on whether the text to be deduplicated is extremely short or extremely long. The two cases are discussed separately below. It should be noted that a text with a character count less than the second threshold is considered an extremely short text, while a text with a character count greater than the first threshold is considered an extremely long text. Extremely short text is not segmented, while extremely long text is segmented into multiple segments.

[0086] For cases where the text to be deduplicated is a short text, please refer to Figure 4, which is a flowchart of the model semantic embedding provided in this application embodiment. It includes the following steps S241a to S243a: S241a, the semantic embedding module converts the preprocessed text into the model input format.

[0087] For example, the input format for the BERT model is as follows: [CLS] + word segmentation + [SEP], where [CLS] is the summary marker at the beginning of the text, and [SEP] is the end marker at the end of the text. By recognizing [CLS] and [SEP], the start and end positions of the text can be determined. For example, "car engine failure" can be broken down into three words: "car", "engine", and "failure". "Car engine failure" can be transformed into [CLS] Car engine failure [SEP].

[0088] S242a, the semantic embedding module inputs the converted text into the semantic model and extracts the basic semantic vector from the semantic model. For example, the basic semantic vector is the output vector of the last encoder layer of the BERT model. In specific implementation, the basic semantic vector corresponding to a very short text can be the CLS vector of the BERT model.

[0089] S243a, perform unit vector normalization (such as Euclidean L2 normalization) on the basic semantic vector to obtain the semantic embedding vector of the preprocessed text. Processing the semantic vectors through L2 normalization simplifies the similarity calculation, reduces the interference of vector magnitude differences on similarity evaluation, and further improves computational efficiency and result consistency.

[0090] For cases where the text to be deduplicated is a long text, please refer to Figure 5, which is a flowchart of the model semantic embedding provided in this application embodiment. It includes the following steps S241b to S243b: S241b, the semantic embedding module converts each segmented text of the preprocessed text into the model input format.

[0091] For the conversion format of each segmented text, refer to the example of the conversion format for unsegmented preprocessed text in the aforementioned S241a.

[0092] S242b, the semantic embedding module inputs the transformed segmented text into the semantic model and extracts the corresponding basic semantic vectors from the semantic model.

[0093] S243b, perform at least attention-weighted average aggregation on the basic semantic vectors corresponding to each segment of text to obtain the semantic embedding vector of the preprocessed text.

[0094] Attention-weighted average aggregation can be performed according to the following formula: $v = \sum_{j=1}^{m} \alpha_j h_j$, where $v$ is the global semantic vector, $h_j$ is the semantic vector of the $j$-th segment of text, $m$ is the number of segments into which the text to be deduplicated is divided, and $\alpha_j$ is the attention weight of the $j$-th segment of text, with $\alpha_j = \exp(e_j) / \sum_{k=1}^{m} \exp(e_k)$, where $e_j$ is the original attention score of the $j$-th segment of text. The original attention scores are calculated by a learnable feedforward neural network, which contains a hidden layer and an output layer, and can be represented as follows: $e_j = w_2^{\top} \sigma(W_1 h_j + b_1) + b_2$, where $\sigma(\cdot)$ is the activation function and $W_1$, $b_1$, $w_2$, $b_2$ are all the parameters in the network. All of these parameters can be obtained by training the semantic model through two stages: pre-training and fine-tuning. These parameters can also be regarded as one of the domain parameters.

[0095] In practice, an L2 normalization process can be performed on the vector obtained by attention-weighted average aggregation to reduce the interference of vector magnitude differences on similarity evaluation, thereby obtaining the semantic embedding vector of the preprocessed text. Of course, in some other embodiments, L2 normalization may not be performed.

[0096] In this embodiment, an attention-weighted aggregation strategy is used to integrate the core semantics of each segment of the long text into a complete basic semantic vector of the long text. This avoids the problem of semantic loss in long texts and semantic fragmentation caused by segmentation processing, and significantly improves the accuracy of deduplication of long texts.
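The attention-weighted aggregation of segment vectors can be sketched in plain Python. The sketch makes assumptions beyond the text: raw scores are normalized with a softmax, the activation is `tanh`, and the parameter names `W1`, `b1`, `w2`, `b2` are illustrative labels for the hidden-layer and output-layer parameters. In practice these parameters would be learned, not set by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_score(h, W1, b1, w2, b2):
    """Raw score of one segment: a tanh hidden layer, then a scalar output."""
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, h)) + bias)
        for row, bias in zip(W1, b1)
    ]
    return sum(w * x for w, x in zip(w2, hidden)) + b2


def aggregate_segments(segment_vectors, W1, b1, w2, b2):
    """Weighted sum of segment vectors; returns (global_vector, weights)."""
    scores = [attention_score(h, W1, b1, w2, b2) for h in segment_vectors]
    weights = softmax(scores)
    dim = len(segment_vectors[0])
    pooled = [
        sum(a * h[i] for a, h in zip(weights, segment_vectors))
        for i in range(dim)
    ]
    return pooled, weights


# With all-zero hidden weights every segment scores the same,
# so the aggregation reduces to a plain average.
pooled, weights = aggregate_segments(
    [[1.0, 0.0], [0.0, 1.0]],
    W1=[[0.0, 0.0]], b1=[0.0], w2=[1.0], b2=0.0,
)
```

The resulting vector can then be L2-normalized before it is used as the semantic embedding vector of the long text.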

[0097] The above step S250 can be executed by the vector optimization module, specifically including: the vector optimization module performs vector optimization on each semantic embedding vector to obtain the semantic vector of each preprocessed text. Vector optimization includes at least one of the following: dimensionality reduction processing according to the dimensionality reduction parameter, L2 normalization processing, and redundant vector filtering.

[0098] Dimensionality reduction refers to using Principal Component Analysis (PCA) to reduce the dimensionality of semantic embedding vectors, resulting in semantic vectors with lower dimensions. For example, the original semantic vectors can be reduced from 768 / 1024 dimensions to 256 / 512 dimensions (adjustable). In practice, variance analysis can be performed before dimensionality reduction to ensure that the reduced semantic vectors retain more than 90% of the variance of the semantic embedding vectors, thus avoiding the loss of semantic information.
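The variance analysis before dimensionality reduction can be illustrated with a small helper that picks the number of principal components. It assumes the PCA eigenvalues (per-component variances) have already been computed elsewhere; the function name and the "at least" reading of the 90% criterion are assumptions of this sketch.

```python
def components_for_variance(eigenvalues, target=0.90):
    """Smallest k whose top-k eigenvalues explain at least `target` of the variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target:
            return k
    return len(ordered)


# Toy spectrum: the first two components carry 95% of the variance.
k = components_for_variance([8.0, 1.5, 0.3, 0.2])
```

If the chosen `k` exceeds the configured dimensionality reduction parameter (for example 256), the parameter could be raised for that domain so that the variance criterion still holds.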

[0099] L2 normalization refers to performing L2 normalization on vectors to make the vector magnitude 1, simplifying similarity calculation (converting it into inner product calculation) and reducing the bias caused by differences in magnitude.
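The reason L2 normalization turns cosine similarity into an inner product can be written out in one line. The symbols $A$, $B$, $\hat{A}$, $\hat{B}$ are introduced here for illustration only.

```latex
\hat{A} = \frac{A}{\lVert A \rVert_2}, \qquad
\hat{B} = \frac{B}{\lVert B \rVert_2}
\quad\Longrightarrow\quad
\hat{A} \cdot \hat{B}
  = \frac{A \cdot B}{\lVert A \rVert_2 \, \lVert B \rVert_2}
  = \cos(A, B)
```

Since $\lVert \hat{A} \rVert_2 = \lVert \hat{B} \rVert_2 = 1$, no norms need to be computed at query time; only the inner product remains.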

[0100] Redundant vector filtering refers to constructing a filtering mechanism that marks a semantic embedding vector as a redundant vector and discards it directly if the similarity between the semantic embedding vector and an existing vector in the vector database exceeds a preset threshold (such as 0.95, which can be dynamically adjusted), thereby reducing invalid computation.
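A redundancy filter of this kind can be sketched as below. The sketch scans the stored vectors linearly, which is an illustrative simplification and not how a vector database would perform the check; the function names are hypothetical, and "exceeds" is read as strictly greater than the threshold.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_redundant(vector, stored_vectors, threshold=0.95):
    """True if the vector is closer than `threshold` to any stored vector."""
    return any(cosine(vector, s) > threshold for s in stored_vectors)


stored = [[1.0, 0.0], [0.0, 1.0]]
near_duplicate = is_redundant([1.0, 0.01], stored)   # almost parallel to [1, 0]
distinct = is_redundant([1.0, 1.0], stored)          # 45 degrees from both
```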

[0101] In this embodiment, the vector optimization module performs vector optimization (dimensionality reduction, L2 normalization, and redundancy filtering) on the semantic embedding vector, which can reduce storage costs and computational load while retaining key semantic information, improve the computational efficiency of large-scale text deduplication, and improve subsequent retrieval efficiency.

[0102] The above step S260 can be executed by the vector retrieval module, specifically including: for each preprocessed text, the vector retrieval module obtains the semantic vector of each comparison text in the comparison text set of the corresponding domain.

[0103] This embodiment uses each comparison text in the corresponding domain's comparison text set as the comparison object, focusing on comparing the preprocessed text with the comparison texts of the corresponding domain. This filters out a large number of invalid text comparisons, thereby improving deduplication efficiency. It should be noted that, for incremental text deduplication scenarios, the text database and vector database already store the original text set and its semantic vectors, respectively. The comparison text set is a collection of at least a portion of the original texts stored in the text database. Based on this, the vector retrieval module obtaining the semantic vector of each comparison text in the corresponding domain's comparison text set may include: performing an ANN search on a vector database that supports Approximate Nearest Neighbor (ANN) search to obtain the semantic vector of each comparison text in the corresponding domain's comparison text set. The vector database stores the semantic vectors of multiple texts and their corresponding domain labels. The search scope of the ANN search is the semantic vectors associated with the domain label of the preprocessed text.

[0104] In practice, during retrieval, the scope can be filtered by domain tags first, and then an ANN retrieval is performed to return Top-K similar vectors (K value is configurable, such as 10). This embodiment utilizes a vector database that supports Approximate Nearest Neighbor (ANN) search to achieve efficient semantic vector retrieval. By limiting the retrieval scope through domain tags, indiscriminate retrieval across the entire database can be avoided, significantly reducing computational power consumption and retrieval time, and supporting rapid deduplication of large-scale text. Optionally, when storing the semantic vectors of multiple texts, the vector database can also store text IDs, so that the text associated with the text ID can be retrieved from the text database or other locations where texts are stored.
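The "filter by domain tag, then return Top-K" flow can be illustrated without a vector database. Note the substitution: the sketch below uses an exact brute-force scan in place of an ANN index, so it shows the retrieval contract rather than the indexing technique. It assumes all vectors are already L2-normalized (so the inner product equals cosine similarity), and the record layout and function name are hypothetical.

```python
import heapq


def top_k_in_domain(query, domain, records, k=10):
    """Exact Top-K by inner product among records sharing the query's domain.

    records: iterable of (text_id, domain_tag, unit_vector).
    Returns a list of (text_id, score), best match first.
    """
    candidates = (
        (sum(q * v for q, v in zip(query, vec)), text_id)
        for text_id, tag, vec in records
        if tag == domain  # domain filtering happens before scoring
    )
    return [(text_id, score) for score, text_id in heapq.nlargest(k, candidates)]


records = [
    ("R11", "medical", [1.0, 0.0]),
    ("R12", "medical", [0.6, 0.8]),
    ("R13", "medical", [0.0, 1.0]),
    ("G1", "general", [1.0, 0.0]),
]
hits = top_k_in_domain([1.0, 0.0], "medical", records, k=2)
```

The returned text IDs can then be used to fetch the comparison texts from the text database. The general-domain record `G1` is never scored, even though its vector matches the query exactly.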

[0105] It is worth noting that vector stores supporting Approximate Nearest Neighbor (ANN) search include Milvus and FAISS. They are specifically designed to store vectors and can quickly find the most similar ones. Index parameters can be configured based on vector dimensions and data size, for example by selecting IVF_FLAT (for small to medium scales) or HNSW (for large scales) as the index type and dynamically adjusting the number of clusters / hierarchical depth. Vector databases can be sharded by domain / timestamp to improve read and write speeds, and maintainers can regularly clean the database, deleting invalid vectors and updating expired vectors to ensure data validity.

[0106] In other embodiments, the comparison text can also be deduplicated text. For incremental text deduplication scenarios, the text database and vector database already store the original text set and its semantic vectors, respectively, and the comparison text set is a collection of at least a portion of the original texts stored in the text database. For full text deduplication scenarios, the semantic vectors of each preprocessed text, along with their corresponding domain labels and text IDs, can be stored in the aforementioned vector database. The above step S270 can be executed jointly by the similarity matching and dynamic decision-making module and the semantic embedding module.

[0107] Please refer to Figure 6, which is a flowchart of the multi-dimensional similarity calculation provided in the embodiments of this application. It includes the following steps S271 to S273: S271, for each preprocessed text, the similarity matching and dynamic decision module calculates the cosine similarity between the semantic vector of the preprocessed text and the semantic vector of each comparison text. Here, "each comparison text" is shorthand for each comparison text in the comparison text set corresponding to the domain of the preprocessed text.

[0108] Continuing with the example above, suppose that after preprocessing, the five texts to be deduplicated yield the preprocessed text T1 corresponding to text t1, T2 corresponding to text t2, T3 corresponding to text t3, T4 corresponding to text t4, and T5 corresponding to text t5. Furthermore, each of the preprocessed texts T1 to T5 has a comparison text set corresponding to its domain.

[0109] Take as an example preprocessed text T1 and the comparison text set R1 corresponding to the domain of preprocessed text T1, which includes comparison texts R11, R12, and R13. For the single text T1, step S271 requires calculating the cosine similarity between the semantic vector of preprocessed text T1 and the semantic vector of comparison text R11, the cosine similarity between the semantic vector of preprocessed text T1 and the semantic vector of comparison text R12, and the cosine similarity between the semantic vector of preprocessed text T1 and the semantic vector of comparison text R13.

[0110] Taking the cosine similarity between the semantic vector of preprocessed text T1 and the semantic vector of comparison text R11 as an example, assume the semantic vector of preprocessed text T1 is $A = (a_1, a_2, \ldots, a_i, \ldots, a_n)$, where $n$ is the dimension of the semantic vector, such as 256 / 512 dimensions, and the semantic vector of comparison text R11 is $B = (b_1, b_2, \ldots, b_i, \ldots, b_n)$. The cosine similarity is obtained using the following formula: $\cos(A, B) = \dfrac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}$

[0111] It should be noted that the calculation of the cosine similarity between the semantic vectors of other preprocessed texts and the semantic vectors of each compared text can be understood in the same way.

[0112] S272, for each preprocessed text, the similarity matching and dynamic decision module inputs each text pair, consisting of the preprocessed text and one comparison text in the comparison text set of the corresponding domain, into the semantic embedding module, so that the semantic model of the semantic embedding module, configured with the model fine-tuning parameters, outputs the semantic relevance between the preprocessed text and each comparison text.

[0113] In practice, the comparison text of each text pair can be retrieved from the text database based on its text ID.

[0114] Continuing with the example of preprocessed text T1, preprocessed text T1 and comparison text R11 form a text pair, preprocessed text T1 and comparison text R12 form a text pair, and preprocessed text T1 and comparison text R13 form a text pair. Inputting these three text pairs into the semantic model yields the semantic relevance between preprocessed text T1 and comparison text R11, the semantic relevance between preprocessed text T1 and comparison text R12, and the semantic relevance between preprocessed text T1 and comparison text R13, as output by the semantic model.

[0115] It should be noted that the calculation of the semantic relevance between other preprocessed texts and each comparison text can be understood in the same way.

[0116] S273, for each preprocessed text, the similarity matching and dynamic decision module fuses and calculates the semantic relevance between the preprocessed text and each comparison text, as well as the cosine similarity between their semantic vectors, to obtain the similarity between the preprocessed text and each comparison text.

[0117] Continuing with the example of preprocessed text T1, this step requires fusing and calculating the semantic relevance between preprocessed text T1 and comparison text R11, as well as the cosine similarity between the semantic vectors of preprocessed text T1 and comparison text R11, to obtain the similarity between preprocessed text T1 and comparison text R11; it also requires fusing and calculating the semantic relevance between preprocessed text T1 and comparison text R12, as well as the cosine similarity between the semantic vectors of preprocessed text T1 and comparison text R12, to obtain the similarity between preprocessed text T1 and comparison text R12; finally, it requires fusing and calculating the semantic relevance between preprocessed text T1 and comparison text R13, as well as the cosine similarity between the semantic vectors of preprocessed text T1 and comparison text R13, to obtain the similarity between preprocessed text T1 and comparison text R13.

[0118] Taking the similarity between preprocessed text T1 and comparison text R11 as an example, the fusion calculation can be achieved by weighted average fusion using the following formula: S1=S11*W1+ S12*W2.

[0119] Wherein, S1 represents the similarity between the preprocessed text T1 and the comparison text R11, S11 is the cosine similarity between the semantic vectors of the preprocessed text T1 and the comparison text R11, S12 is the semantic relevance between the preprocessed text T1 and the comparison text R11, W1 is the weight of the cosine similarity, and W2 is the weight of the semantic relevance. W1 and W2 can be dynamically adjusted according to the domain; for example, the semantic relevance weight is higher in academic domains compared to other domains. W1 and W2 can be regarded as a type of domain parameter.
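The weighted fusion S1 = S11*W1 + S12*W2 with domain-dependent weights can be sketched as follows. The weight values and domain names in the table are hypothetical placeholders chosen for illustration, and the convention that W1 + W2 = 1 is an assumption of this sketch, not a requirement stated in the text.

```python
# Hypothetical per-domain weights: (W1 for cosine similarity,
# W2 for semantic relevance). Values are illustrative only.
FUSION_WEIGHTS = {
    "general": (0.5, 0.5),
    "academic": (0.3, 0.7),   # semantic relevance weighted higher
}


def fuse_similarity(cosine_sim, semantic_relevance, domain):
    """Weighted fusion S = S_cos * W1 + S_rel * W2 for the given domain."""
    w1, w2 = FUSION_WEIGHTS.get(domain, FUSION_WEIGHTS["general"])
    return cosine_sim * w1 + semantic_relevance * w2


s_general = fuse_similarity(0.8, 0.9, "general")    # 0.8*0.5 + 0.9*0.5
s_academic = fuse_similarity(0.8, 0.9, "academic")  # 0.8*0.3 + 0.9*0.7
```

Because the semantic relevance (0.9) is higher than the cosine similarity (0.8) in this toy pair, the academic-domain weights produce the larger fused score.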

[0120] It should be noted that the calculation of similarity between other preprocessed texts and each comparison text can be understood in the same way.

[0121] Steps S271 to S273 utilize both cosine similarity to consider surface features and semantic relevance output from the semantic embedding module of the deduplication model to consider deep semantic relevance. A multi-dimensional similarity fusion calculation method is employed to obtain the similarity between the preprocessed text and the comparison text, effectively improving the accuracy of similarity judgment and reducing the risk of misjudgment due to single-vector similarity. Furthermore, by configuring dimensionality reduction parameters and a third parameter, the similarity calculation is adapted to domain characteristics, further improving deduplication accuracy in professional scenarios and adaptability to different domains.

[0122] The above step S280 can be executed by the similarity matching and dynamic decision module, specifically including: for each preprocessed text, the similarity matching and dynamic decision module determines, based on the text length of the preprocessed text, a value from the similarity threshold range of the corresponding domain as the similarity threshold parameter corresponding to the preprocessed text.

[0123] It should be noted that the similarity threshold parameter is a similarity threshold. In the specific implementation, the domain threshold library can store the similarity threshold range for each domain. The domain threshold library stores domain labels and their corresponding similarity threshold ranges in association. For example, the similarity threshold range for the medical domain is [0.85, 0.95], and the similarity threshold range for the general domain is [0.75, 0.85]. When it is necessary to determine the similarity threshold parameter, it is fine-tuned within the aforementioned similarity threshold ranges for each domain based on the text length to obtain the specific value of the similarity threshold parameter. For example, the similarity threshold parameter for very short texts in the general domain is relatively low, such as 0.75, while the similarity threshold parameter for very long texts is relatively high, such as 0.85.
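One way to fine-tune the threshold within a domain's range based on text length is linear interpolation, sketched below. The interpolation rule and the length bounds `short_len` and `long_len` are assumptions of this sketch; the text only states that shorter texts sit near the lower end of the range and longer texts near the upper end. The threshold ranges are the example values given above.

```python
# Example ranges from the text; other domains would be added the same way.
THRESHOLD_RANGES = {
    "medical": (0.85, 0.95),
    "general": (0.75, 0.85),
}


def similarity_threshold(domain, text_length, short_len=20, long_len=2000):
    """Pick a threshold inside the domain's range, rising with text length."""
    low, high = THRESHOLD_RANGES.get(domain, THRESHOLD_RANGES["general"])
    # Clamp the length into [short_len, long_len], then interpolate linearly.
    clamped = min(max(text_length, short_len), long_len)
    fraction = (clamped - short_len) / (long_len - short_len)
    return low + fraction * (high - low)


short_general = similarity_threshold("general", 5)      # lower end of range
long_general = similarity_threshold("general", 5000)    # upper end of range
mid_medical = similarity_threshold("medical", 1010)     # halfway point
```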

[0124] This embodiment adopts a dynamic threshold decision mechanism, abandoning fixed thresholds. The similarity threshold (similarity threshold parameter) can be dynamically adjusted according to the text domain and text length (e.g., the threshold is more lenient for long texts and more stringent for academic fields), which enhances the adaptability of the deduplication model to diverse text scenarios and further improves the flexibility and accuracy of deduplication judgment.

[0125] In addition, in some other embodiments, the similarity threshold parameter can also be a value that the user can fine-tune within the above threshold range according to their needs. In this case, the electronic device can record user preferences to achieve adaptive optimization.

[0126] The above steps S290 and S291 can be executed through the similarity matching and dynamic decision-making module.

[0127] Please refer to Figure 7, which is a flowchart illustrating the process of determining and deduplicating similar text provided in this application embodiment. This flowchart can be understood as a further refinement of step S290, specifically including the following steps S2901a to S2903a: S2901a: For each preprocessed text, the similarity matching and dynamic decision module compares the calculated similarity with the similarity threshold parameter corresponding to the preprocessed text to obtain at least one comparison result corresponding to the preprocessed text.

[0128] Continuing with the example of preprocessed text T1, through the aforementioned steps S205 to S207, the similarity scores S1 between preprocessed text T1 and comparison text R11, S2 between preprocessed text T1 and comparison text R12, and S3 between preprocessed text T1 and comparison text R13 are obtained. Assuming the similarity threshold parameter for the medical field to which preprocessed text T1 belongs is 0.9, step S208 compares similarity scores S1, S2, and S3 with 0.9 respectively, obtaining three comparison results. It should be noted that the process of obtaining at least one comparison result for other preprocessed texts can be understood similarly. In this step, by labeling with domain tags, the similarity threshold parameter can be dynamically adjusted according to the domain to which the text belongs, solving the problem of poor adaptability of a fixed threshold in different domains.

[0129] S2902a: For each preprocessed text, if at least one of the corresponding comparison results has a similarity exceeding the similarity threshold parameter, then the text to be deduplicated corresponding to the preprocessed text is determined as a similar text, and the similar text in the text to be deduplicated set is obtained.

[0130] Continuing with the example of preprocessed text T1, if at least one of the similarities S1, S2, and S3 is greater than 0.9, then the text t1 to be deduplicated corresponding to preprocessed text T1 is determined to be similar text. This determination is made by marking t1 as redundant text (referred to as marked redundant), for example with a similar-text label. If none of the similarities exceeds 0.9, then t1 is determined to be dissimilar text. This determination is made by marking t1 as unique text (referred to as marked unique), for example with a dissimilar-text label.

[0131] S2903a: Delete all texts identified as similar from the set of texts to be deduplicated, and obtain the deduplication result.

[0132] It should be noted that texts identified as similar to other texts can be discarded directly to obtain the deduplication result. Texts identified as dissimilar to other texts can be retained and stored in the text database in association with their text IDs. Their semantic vectors can also be stored in the vector database in association with the corresponding domain labels and text IDs, so that they become part of the original text set for the next cleaning process.
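Steps S2901a to S2903a can be sketched as a simple filter. This is an illustrative outline only: the function name and the dictionary-based inputs are hypothetical, and the similarities and thresholds are assumed to have been computed in the preceding steps.

```python
def deduplicate_incremental(new_texts, similarities, thresholds):
    """Split newly acquired texts into kept (unique) and removed (redundant).

    new_texts:    dict mapping text ID -> text to be deduplicated
    similarities: dict mapping text ID -> list of similarity scores
                  against the comparison texts of the same domain
    thresholds:   dict mapping text ID -> similarity threshold parameter
    All three structures are hypothetical stand-ins for the module inputs.
    """
    kept = {}
    removed = []
    for text_id, text in new_texts.items():
        scores = similarities.get(text_id, [])
        # A single score above the threshold marks the text as redundant.
        if any(score > thresholds[text_id] for score in scores):
            removed.append(text_id)
        else:
            kept[text_id] = text  # marked unique, retained for storage
    return kept, removed
```

In a fuller pipeline, the entries in `kept` would then be written to the text database and vector database as described in [0132]; that storage step is omitted here.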

[0133] The foregoing embodiments provide a deduplication scheme for incremental text deduplication scenarios. In this scenario, there is typically a pre-deduplicated and stored set of original text. Newly acquired text only needs to be compared with at least a portion of the existing original text to determine if it is similar, thus achieving efficient deduplication of incremental text and avoiding redundant storage. Furthermore, this scheme compares newly added text with existing text, eliminating the need for pairwise comparisons within the newly added text, reducing computational complexity and improving deduplication efficiency. It is particularly suitable for real-time data streams or periodically updated data scenarios.

[0134] In practice, there are also full-text deduplication scenarios, in which a large-scale text set is thoroughly cleaned and deduplicated in one pass. In this case, the comparison text set consists of at least a portion of the other texts among the at least one preprocessed text. For this full-text deduplication scenario, please refer to Figure 8, which is a flowchart illustrating the process of determining and deduplicating similar text provided in this application embodiment. This flowchart can be understood as a further refinement of step S290, specifically including the following steps S2901b to S2903b.

[0135] S2901b: The similarity matching and dynamic decision module divides the semantic vector of at least one preprocessed text into at least one connected component based on at least one comparison result corresponding to each preprocessed text.

[0136] In the specific implementation process, based on at least one comparison result corresponding to each preprocessed text, all pairs of semantically similar vectors can be found and a similarity graph can be constructed: if the similarity between two semantic vectors exceeds the similarity threshold, an edge is established between them. For example, if A and B are similar, then an edge is established between A and B; if A and C are similar, then an edge is established between A and C. Then, a connected component algorithm is used to traverse the graph and group the vectors connected by edges into "duplicate clusters" (connected components). The connected components cover both direct similarity (A and B are similar) and indirect similarity (A and B are similar, B and C are similar → A and C are indirectly similar). For example, if an edge is established between A and B, and an edge is established between A and C, then A, B, and C are placed in the same cluster.
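The grouping into connected components can be sketched with a union-find structure. The text only says that a connected component algorithm is used, so union-find is one possible choice rather than the method of the embodiment; the function name and the integer node indices are hypothetical.

```python
def connected_components(num_nodes, similar_pairs):
    """Group nodes 0..num_nodes-1 into clusters linked by similarity edges.

    similar_pairs: iterable of (a, b) index pairs whose similarity
    exceeded the threshold, i.e. the edges of the similarity graph.
    """
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for node in range(num_nodes):
        clusters.setdefault(find(node), []).append(node)
    return sorted(clusters.values())
```

With nodes A, B, C mapped to 0, 1, 2, the edges (0, 1) and (0, 2) put all three in one cluster, and so do the edges (0, 1) and (1, 2), which is the indirect-similarity case.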

[0137] S2902b: For each connected component, the text to be deduplicated corresponding to one node in the connected component is determined as the representative text (i.e., the dissimilar text, also known as the unique text), and the text to be deduplicated corresponding to the remaining nodes in the connected component is determined as the similar text.

[0138] A connected component is a cluster of semantic vectors that are directly or indirectly duplicates of one another, where each semantic vector serves as a node of the connected component. In S2902b above, the text to be deduplicated corresponding to a node that satisfies a preset rule can be determined as the representative text. For example, the rule can select the vector closest to the cluster center, select according to text quality indicators, or select the text with the earliest or latest timestamp.
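One of the preset rules mentioned above, choosing the vector closest to the cluster center, can be sketched as follows. The cluster center is assumed here to be the arithmetic mean of the vectors and closeness is measured by Euclidean distance; both are assumptions, and the function names are hypothetical.

```python
import math


def pick_representative(cluster, vectors):
    """Return the node whose vector is closest to the cluster centroid.

    cluster: list of node IDs in one connected component
    vectors: dict mapping node ID -> semantic vector (list of floats)
    ASSUMPTION: centroid = mean vector, distance = Euclidean.
    """
    dim = len(vectors[cluster[0]])
    centroid = [
        sum(vectors[node][i] for node in cluster) / len(cluster)
        for i in range(dim)
    ]

    def distance(node):
        return math.sqrt(
            sum((vectors[node][i] - centroid[i]) ** 2 for i in range(dim))
        )

    return min(cluster, key=distance)


def split_cluster(cluster, vectors):
    """Return (representative, similar_nodes) for one connected component."""
    representative = pick_representative(cluster, vectors)
    similar = [node for node in cluster if node != representative]
    return representative, similar
```

A component with a single node returns that node as the representative and an empty list of similar nodes, so unique texts pass through unchanged.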

[0139] It should be noted that the methods for identifying a text to be deduplicated as representative text or similar text include, but are not limited to, marking the corresponding tags.

[0140] S2903b: Delete all texts identified as similar from the set of texts to be deduplicated, and obtain the deduplicated result. The specific implementation of S2903b can be understood by referring to S2903a.

[0141] In this full text deduplication scenario, the solution addresses the repetitive relationships within the entire text set to be deduplicated. The approach first obtains similarity relationships through similarity comparison, and then clusters directly or indirectly similar texts into the same connected component based on the connected component algorithm. In each connected component, a representative text is selected and retained, while the rest are marked as similar texts and deleted. This achieves the goal of removing redundancy while retaining representative information in each cluster of similar texts, which can improve the deduplication efficiency of large-scale data and is suitable for full deduplication scenarios that thoroughly clean large-scale text sets at once.

[0142] It should be understood that the sequence number of each step in the above embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.

[0143] Please refer to Figure 9, which is a schematic diagram of the structure of an electronic device 100 provided in an embodiment of this application. The electronic device 100 can be a server, including a processor 110 and a memory 120. The memory 120 is used to store computer programs; the processor 110 is used to execute the computer programs. When the processor 110 executes the computer programs, it can implement the steps in the text deduplication method provided in any of the above embodiments. The number of processors 110 can be one or more, and this embodiment of the application does not limit this.

[0144] Furthermore, this application also provides a computer program product that stores a computer program that can be executed by a processor. When the computer program is executed by the processor, it can implement the steps in the text deduplication method provided in any of the above embodiments.

[0145] The aforementioned computer program includes computer program code, which may be in the form of source code, object code, executable files, or certain intermediate forms. The computer program product may include at least: any entity or device capable of carrying the computer program code to a photographic device / terminal device, recording media, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media. Examples include USB flash drives, portable hard drives, magnetic disks, or optical disks. In some jurisdictions, according to legislation and patent practice, computer-readable media may not be electrical carrier signals or telecommunication signals.

[0146] In the above embodiments, the descriptions of each embodiment have different focuses. For parts that are not described in detail or recorded in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0147] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these components are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described components for each specific application, but such implementation should not be considered beyond the scope of this application.

[0148] The above-described embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application, and should all be included within the protection scope of this application.

Claims

1. A text deduplication method, characterized in that, The text deduplication method includes: Obtain a set of texts to be deduplicated, the set of texts to be deduplicated includes multiple texts to be deduplicated, the multiple texts to be deduplicated are texts from at least one domain; The set of texts to be deduplicated is input into a deduplication model so that the deduplication model can deduplicate the set of texts to be deduplicated to obtain a deduplication result; wherein, the deduplication model is configured to be a model based on a self-attention mechanism for semantic understanding and provides at least one domain parameter, the domain parameter is used to determine whether the texts to be deduplicated in the corresponding domain are similar texts, and the texts to be deduplicated that are determined to be similar texts are deleted after deduplication.

2. The text deduplication method according to claim 1, characterized in that, The deduplication model removes duplicates from the set of texts to be deduplicated, obtaining the deduplication results, including: The deduplication model preprocesses each of the texts to be deduplicated in the text set to be deduplicated, to obtain at least one preprocessed text. The preprocessing includes at least labeling a domain tag. Each preprocessed text has a corresponding domain parameter based on its labeled domain tag. The domain parameter includes a similarity threshold range, which is used to determine the similarity threshold parameter. For each preprocessed text, the similarity between the preprocessed text and each comparison text in the comparison text set of the corresponding domain is calculated, and each calculated similarity is compared with the similarity threshold parameter corresponding to the preprocessed text to obtain at least one comparison result corresponding to the preprocessed text; Based on at least one comparison result corresponding to each preprocessed text, similar texts in the set of texts to be deduplicated are determined; The deduplication result is obtained by deleting all texts identified as similar from the set of texts to be deduplicated.

3. The text deduplication method according to claim 2, characterized in that, The domain parameters also include model fine-tuning parameters. The calculation of the similarity between the preprocessed text and each comparison text in the comparison text set of the corresponding domain includes: Determine the semantic vector of the preprocessed text and obtain the semantic vector of each of the comparison texts in the comparison text set of the corresponding domain; Calculate the cosine similarity between the semantic vector of the preprocessed text and the semantic vector of each of the comparison texts; Each text pair consisting of the preprocessed text and each of the comparison texts is input into the semantic embedding module of the deduplication model, so that the semantic model of the semantic embedding module, configured with the model fine-tuning parameters, outputs the semantic relevance between the preprocessed text and each of the comparison texts; The semantic relevance between the preprocessed text and each of the comparison texts, as well as the cosine similarity between their semantic vectors, are fused and calculated to obtain the similarity between the preprocessed text and each of the comparison texts.

4. The text deduplication method according to claim 3, characterized in that, The comparison text set is a collection of at least a portion of the original text from the original text set; The step of determining similar texts in the set of texts to be deduplicated based on at least one comparison result for each of the preprocessed texts includes: For each preprocessed text, if there is a comparison result in the corresponding at least one comparison result whose similarity exceeds the similarity threshold parameter, then the text to be deduplicated corresponding to the preprocessed text is determined as the similar text.

5. The text deduplication method according to claim 3, characterized in that, The comparison text set is a collection of at least some of the other texts in the at least one preprocessed text; The step of determining similar texts in the set of texts to be deduplicated based on at least one comparison result for each of the preprocessed texts includes: Based on at least one comparison result for each preprocessed text, the at least one preprocessed text is divided into at least one connected component; For each connected component, the text to be deduplicated corresponding to one node in the connected component is determined as the representative text, and the text to be deduplicated corresponding to the remaining nodes in the connected component is determined as the similar text.

6. The text deduplication method according to any one of claims 3 to 5, characterized in that, The step of obtaining the semantic vector of each of the comparison texts in the comparison text set corresponding to the domain includes: An ANN search is performed in a vector database that supports Approximate Nearest Neighbor (ANN) search to obtain the semantic vector of each of the compared texts in the comparison text set corresponding to the domain. The vector database stores the semantic vectors of multiple texts and their respective domain labels. The search scope of the ANN search is the semantic vectors of the texts associated with the domain labels of the preprocessed texts.

7. The text deduplication method according to any one of claims 3 to 5, characterized in that, The domain parameters also include dimensionality reduction parameters; determining the semantic vector of the preprocessed text includes: Determine the semantic embedding vector of the preprocessed text; The semantic embedding vector is optimized to obtain the semantic vector of the preprocessed text; the vector optimization includes at least one of dimensionality reduction processing according to the dimensionality reduction dimension indicated by the dimensionality reduction parameter, unit vector normalization processing, and redundant vector filtering.

8. The text deduplication method according to claim 7, characterized in that, The preprocessing also includes adding keywords to very short texts and / or segmenting very long texts. The term "very long text" refers to text with a character count greater than a first threshold, while "very short text" refers to text with a character count less than a second threshold.

9. The text deduplication method according to claim 8, characterized in that, If the text to be deduplicated is a very long text, the text to be deduplicated is segmented into the preprocessed text containing multiple segments; Determining the semantic embedding vector of the preprocessed text includes: Each segment of the preprocessed text is input into the semantic embedding module to extract the basic semantic vector of each segment from the semantic model of the semantic embedding module; The basic semantic vectors of the segments are subjected to at least attention-weighted average aggregation to obtain the semantic embedding vector of the preprocessed text.

10. An electronic device, characterized in that, The electronic device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the text deduplication method as described in any one of claims 1 to 9.