Text deduplication method and device, computer device and storage medium

By segmenting the published text, counting the frequency of the first letter of each character, and bucketing the text with locality-sensitive hashing, the invention addresses the low efficiency of text deduplication and achieves efficient deduplication.

CN117216239B · Active · Publication Date: 2026-06-12 · DONSON TIMES INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
DONSON TIMES INFORMATION TECH CO LTD
Filing Date
2023-10-23
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing technologies for deduplication of Chinese text have low efficiency and cannot effectively handle the problem of massive amounts of duplicate content on the Internet.

Method used

By segmenting the published text, counting and normalizing the frequency of the first letter of each segment, and then using the Locality Sensitive Hash algorithm to bucket the text matrix, the target text is filtered based on the bucketing results.

Benefits of technology

It improves the accuracy and efficiency of text deduplication, reduces the comparison time and storage overhead of published text, and effectively deletes a large amount of duplicate text.

Generated by Eureka AI based on patent content.

Figure CN117216239B_ABST

Abstract

The application relates to the technical field of text processing and, in particular, discloses a text deduplication method. The method comprises the following steps: obtaining at least one published text; segmenting all the published texts according to a preset text length to obtain segmented texts; counting the first-letter frequencies of the segmented texts and normalizing them to obtain text matrices; bucketing the published texts corresponding to the text matrices with a locality-sensitive hashing algorithm to obtain a bucketing result; and filtering all the published texts based on the bucketing result to obtain target texts. Counting and normalizing the first-letter frequencies of the segmented texts determines the text matrices. Bucketing the published texts with the locality-sensitive hashing algorithm classifies them, and the hashing algorithm reduces their dimensionality, so that the comparison time and storage cost of the published texts are reduced and the efficiency of text deduplication is improved.

Description

Technical Field

[0001] This invention relates to the field of text processing technology, and in particular to a text deduplication method, apparatus, computer device, and storage medium.

Background Technology

[0002] With the explosive growth of information in the internet age, the speed and breadth of information dissemination have increased dramatically. The internet is filled with massive amounts of text, including a large amount of duplicate content. For example, a single marketing message may be reprinted, modified, and edited by various media outlets, resulting in multiple similar marketing texts. The presence of a large amount of duplicate content on the internet not only lowers the overall quality of content but also wastes significant storage resources. Therefore, text deduplication is necessary.

[0003] Existing text deduplication methods primarily rely on the similarity of text feature vectors or the Hamming distance from word segmentation results, comparing multiple texts pairwise and deduplicating them based on the comparison results. However, these methods are inefficient for handling massive amounts of text. Therefore, improving the efficiency of text deduplication is a pressing issue.

Summary of the Invention

[0004] Therefore, it is necessary to provide a text deduplication method, apparatus, computer device, and storage medium to address the aforementioned technical problems and solve the problem of low text deduplication efficiency in the prior art.

[0005] A text deduplication method includes:

[0006] Obtain at least one published text, and segment all the published texts according to a preset text length to obtain at least one segmented text corresponding to the published text;

[0007] The frequency of the first letter of each character in each segmented text is counted and normalized to obtain a text matrix corresponding to each published text.

[0008] The published texts corresponding to each of the text matrices are bucketed using the locality-sensitive hashing algorithm to obtain the bucketing results.

[0009] Based on the bucketing results, all the published texts are filtered to obtain the target text.

[0010] A text deduplication device, comprising:

[0011] The segmented text module is used to obtain at least one published text, segment all the published texts according to a preset text length, and obtain at least one segmented text corresponding to the published text.

[0012] The text matrix module is used to count the frequency of the first letter of each character in each segmented text and normalize it to obtain a text matrix corresponding to each published text.

[0013] The bucketing module is used to perform bucketing on the published text corresponding to each of the text matrices using the locality-sensitive hashing algorithm to obtain the bucketing result.

[0014] The text filtering module is used to filter all the published text based on the bucketing results to obtain the target text.

[0015] A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the above-described text deduplication method when executing the computer-readable instructions.

[0016] One or more readable storage media storing computer-readable instructions, which, when executed by one or more processors, cause the one or more processors to perform the above-described text deduplication method.

[0017] The text deduplication method, apparatus, computer device, and storage medium of this invention segment the published text according to a preset text length to obtain segmented texts. The frequency of the first letter of each character in the segmented text is counted and normalized, converting the first-letter frequencies into a text matrix. A locality-sensitive hashing (LSH) algorithm buckets the text matrices, which classifies the published texts and yields the bucketing results. The hash algorithm also reduces the dimensionality of the published text, which reduces comparison time, lowers storage overhead, and improves the accuracy and efficiency of text deduplication. Based on the bucketing results, all published texts are filtered: a large amount of duplicate text is deleted and the target text is retained, avoiding a large amount of invalid and redundant computation.

Attached Figure Description

[0018] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments of the present invention will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0019] Figure 1 is a flowchart of a text deduplication method in one embodiment of the present invention;

[0020] Figure 2 is a flowchart of step S20 of the text deduplication method in one embodiment of the present invention;

[0021] Figure 3 is a schematic diagram of the structure of a text deduplication device in one embodiment of the present invention;

[0022] Figure 4 is a schematic diagram of a computer device according to an embodiment of the present invention.

Detailed Implementation

[0023] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0024] In one embodiment, as shown in Figure 1, a text deduplication method is provided, including the following steps:

[0025] S10. Obtain at least one published text, and segment all the published texts according to a preset text length to obtain at least one segmented text corresponding to the published text.

[0026] Understandably, the published text can be a report, analysis, or commentary on a specific topic. Segmented text refers to shorter texts resulting from dividing the published text, with each segment having a preset length.

[0027] Specifically, web crawler software is used to scrape a large amount of published text from various data sources (such as news websites, social media platforms, blogs, etc.) to obtain at least one published text. Then, a preset text length is obtained, and all published texts are segmented according to this preset length (e.g., 50 characters). If the length of the published text is not an integer multiple of the preset length, special characters are added to the end of the last segment to ensure that the length of the segmented text reaches the preset length. In this way, at least one segmented text corresponding to the original published text is obtained.
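
The fixed-length segmentation with end padding described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `segment_fixed` and the pad character `#` are assumptions, since the source only says that "special characters" are appended to the last segment.

```python
def segment_fixed(text: str, length: int = 50, pad: str = "#") -> list[str]:
    """Split text into chunks of exactly `length` characters.

    The last chunk is right-padded with `pad` (a hypothetical choice of
    "special character") when the text length is not a multiple of `length`.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    chunks = [text[i:i + length] for i in range(0, len(text), length)]
    if chunks and len(chunks[-1]) < length:
        chunks[-1] = chunks[-1].ljust(length, pad)
    return chunks


segments = segment_fixed("abcdefghijkl", length=5)
print(segments)  # ['abcde', 'fghij', 'kl###']
```

Padding keeps every segment the same length, so later per-segment statistics are computed over equally sized units.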

[0028] S20. Count the frequency of the first letter of each character in each segmented text and normalize it to obtain a text matrix corresponding to each published text.

[0029] Understandably, the initial letter frequency refers to the number of times each first letter occurs in the pinyin of the characters in the segmented text. For example, if the initial letters of one text are "wbdwbnr" and the initial letters of another text, which ends with "真好看", are "dwbnrzhk", then across both texts the initial letter frequency of "b" is 3, that of "w" is 3, and that of "d" is 2. The text matrix is a 26-dimensional matrix (one entry per letter) obtained by mapping the initial letter frequencies, for example [0, 3, 0, 5, 6, 7, 3, 1, 4, 2, ……, 6, 3, 1].

[0030] Specifically, each segmented text is first cleaned to remove stop words and punctuation marks. Each segmented text is then converted to pinyin: a preset pinyin toolkit is obtained and used to convert each segmented text into a pinyin text, either by replacing the characters with their pinyin or by adding pinyin annotations above each character. The conversion is then checked against the context. For example, if the text contains polyphonic characters, the conversion result for those characters is checked for correctness. Next, the initial letters of the characters in each segmented text are counted to obtain the number of occurrences of each initial letter, for example 23 occurrences of "H" and 21 occurrences of "N". The counts for all segmented texts belonging to the same published text are combined and mapped to values, and the values are normalized. For example, Z-score standardization can be used: the mean is subtracted from each value and the result is divided by the standard deviation of all values, so that the values have a mean of 0 and a variance of 1. This yields the text matrix corresponding to each published text. The normalization method is not limited to this one.
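
As a rough sketch of the counting and Z-score steps, the snippet below starts from strings of initial letters (so no pinyin toolkit is needed) and uses the two example strings from the description. The helper names `initial_counts` and `z_score` are illustrative assumptions, and the population standard deviation is assumed because the source does not say which variant is meant.

```python
import string
from collections import Counter
from statistics import mean, pstdev


def initial_counts(initials: str) -> list[int]:
    """Return a 26-element list with the count of each letter a..z."""
    counts = Counter(ch for ch in initials.lower() if ch in string.ascii_lowercase)
    return [counts.get(ch, 0) for ch in string.ascii_lowercase]


def z_score(values: list[float]) -> list[float]:
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # all values equal: nothing to scale
    return [(v - mu) / sigma for v in values]


# Initial-letter strings of two segments of the same published text.
segments = ["wbdwbnr", "dwbnrzhk"]
totals = [sum(col) for col in zip(*(initial_counts(s) for s in segments))]
text_vector = z_score(totals)

print(totals[1], totals[3], totals[22])  # counts of b, d, w: 3 2 3
```

After scaling, `text_vector` has mean 0 and variance 1, which is the property the paragraph above describes.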

[0031] In another embodiment, the initial letter frequency corresponding to each segmented text can be normalized to obtain a segmented matrix corresponding to each segmented text. Then, according to the segmented texts corresponding to the same published text, integrate all the segmented matrices corresponding to the same published text to obtain the text matrix corresponding to each published text.

[0032] S30. Perform bucketing processing on the published texts corresponding to each of the text matrices through the locality sensitive hashing algorithm to obtain a bucketing result.

[0033] Understandably, the locality-sensitive hashing (LSH) algorithm is an algorithm for measuring text similarity: if two texts are similar in the original data space, they remain highly similar after being converted by a hash function; conversely, if they are not similar, they should still not be similar after conversion.

[0034] Specifically, the locality-sensitive hashing algorithm is used to hash the text matrix corresponding to each published text. That is, an appropriate hash function (such as MD5, SHA-1, or SHA-256) is selected and applied to the text matrix of each published text, yielding a hash value for each published text. Then, based on the hash values, similar published texts (i.e., texts whose text matrices have a Euclidean distance below a certain threshold or a cosine similarity above a certain threshold) are mapped to the same bucket, while dissimilar published texts are mapped to different buckets, yielding the bucketing result for each published text.
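
The bucketing idea can be illustrated with one well-known locality-sensitive family, random hyperplane hashing for cosine similarity. This is a swapped-in technique chosen for illustration: the source names MD5, SHA-1, and SHA-256, which are cryptographic hashes and by design do not map similar inputs to the same value. All function names, the number of bits, and the seed below are assumptions.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_bits: int, dim: int, seed: int = 0) -> list[list[float]]:
    """Draw `num_bits` random hyperplane normals of dimension `dim`."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_key(vector: list[float], planes: list[list[float]]) -> str:
    """One bit per hyperplane: '1' if the vector lies on its non-negative side."""
    bits = []
    for plane in planes:
        dot = sum(p * v for p, v in zip(plane, vector))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)


def bucket_texts(vectors: dict[str, list[float]], num_bits: int = 8,
                 seed: int = 0) -> dict[str, list[str]]:
    """Group text ids by their LSH key; texts sharing a key share a bucket."""
    dim = len(next(iter(vectors.values())))
    planes = make_hyperplanes(num_bits, dim, seed)
    buckets = defaultdict(list)
    for text_id, vec in vectors.items():
        buckets[lsh_key(vec, planes)].append(text_id)
    return dict(buckets)


vectors = {
    "t1": [1.0, 2.0, 0.5],
    "t2": [2.0, 4.0, 1.0],     # same direction as t1, so same key
    "t3": [-1.0, -2.0, -0.5],  # opposite direction, so a different key
}
buckets = bucket_texts(vectors)
```

Vectors pointing in similar directions agree on most bits and tend to land in the same bucket, so the expensive pairwise comparison only has to run inside each bucket.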

[0035] In another embodiment, the k-means clustering algorithm is used to classify the published texts. The published texts are to be divided into K groups, and K published texts are randomly selected as initial cluster centers. The distance between each published text and each cluster center is then calculated, and each published text is assigned to the nearest cluster center. Each cluster center together with its assigned published texts represents a cluster. After all published texts are assigned, the cluster centers are recalculated from the published texts currently in each cluster, and the process repeats until a termination condition is met. The termination condition can be that no (or a minimum number of) objects are reassigned to different clusters, that no (or a minimum number of) cluster centers change, or that the sum of squared errors reaches a local minimum. This yields the bucketing results corresponding to each published text.
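
A compact sketch of the k-means loop described above, using only the standard library. It assumes squared Euclidean distance and uses "no text changes cluster" as the termination condition; the function name `kmeans` and the toy points are illustrative.

```python
import random


def kmeans(points: list[list[float]], k: int, max_iter: int = 100, seed: int = 0):
    """Return (assignment, centers) for a plain k-means clustering."""
    rng = random.Random(seed)
    # Randomly pick k of the points as initial cluster centers.
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # Assign every point to its nearest center (squared Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # termination: no point was reassigned
        assignment = new_assignment
        # Recompute each center as the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
assignment, centers = kmeans(points, k=2)
```

In the deduplication setting, each point would be the text matrix of one published text and each resulting cluster would play the role of a bucket.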

[0036] S40. Based on the bucketing results, perform text filtering on all the published texts to obtain the target text.

[0037] Understandably, target text refers to one of the texts selected from a large number of duplicate texts, such as the earliest published text.

[0038] Specifically, based on the bucketing results, text filtering is performed on all published texts. The similarity value between the published texts in each bucket is calculated and compared with a preset threshold, and the publication times of all published texts whose similarity values are greater than the preset threshold are compared. The earliest published text (or any one of these texts) is selected, and the other texts with similarity values greater than the preset threshold are deleted, yielding the target text. In another embodiment, similarity is calculated among all published texts at each cluster center, and the calculated similarity value is compared with a preset threshold. When the similarity value is greater than the preset threshold, one of the published texts with similarity values greater than the preset threshold is selected as the target text.

[0039] In this embodiment of the invention, the published text is segmented according to a preset text length to obtain segmented texts. The first-letter frequencies of each segment are counted and normalized, converting them into a text matrix. The text matrix is then bucketed using a locality-sensitive hashing algorithm, which classifies the published text and yields the bucketing results. Furthermore, the hash algorithm reduces the dimensionality of the published text, which reduces comparison time, lowers storage overhead, and improves the accuracy and efficiency of text deduplication. Based on the bucketing results, all published texts are filtered: a large amount of duplicate text is deleted and the target text is retained, avoiding a large amount of invalid and repetitive computation.

[0040] In one embodiment, step S10, namely, segmenting all the published text according to a preset text length to obtain at least one segmented text corresponding to the published text, includes:

[0041] S101, determine the text length corresponding to each of the published texts;

[0042] S102, determine the text overlap rate based on the text length and the preset text length;

[0043] S103, the published text is segmented using the text overlap rate and the preset text length to obtain at least one segmented text.

[0044] Understandably, text length can refer to the length of a published text, that is, the number of characters or bytes in the text. Text overlap rate is used to measure the degree of overlap between two or more texts.

[0045] Specifically, after the published texts are obtained, the length of each published text is calculated with a text length recognition model or another text length calculation tool, yielding the text length corresponding to each published text. For example, the text length recognition model can be built on Python's len() function. Next, the length of each published text is divided by the preset text length to obtain a length ratio. This length ratio is then compared with a preset overlap rate threshold, such as 1. If the length ratio is greater than or equal to the threshold, the published text is considered to overlap with the preset text length, and the preset text length divided by the text length is taken as the text overlap rate. For example, if the text length is 2n and the preset text length is n, the overlap rate is 50%. Otherwise, the published text is considered not to overlap with the preset text length, and the preset text length is adjusted according to the text length of the published text, for example to half of the text length. The published text is then segmented using the text overlap rate and the preset text length. The first segment is cut from the first character according to the preset text length. The second segment is also cut according to the preset text length, such that the length of its overlap with the first segment equals the preset text length multiplied by the text overlap rate, and so on, yielding at least one segmented text. For example, with a text length of 20, a preset text length of 10, and an overlap rate of 50%, three segments are obtained, covering characters 1-10, 6-15, and 11-20.
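
The overlap-based segmentation can be sketched as a sliding window. The sketch below assumes a threshold of 1 on the length ratio, computes the overlap rate as preset length divided by text length, and advances the window by the non-overlapping part of each segment. The function names and the choice to return an overlap rate of 0 for short texts are assumptions, not details given in the source.

```python
def overlap_rate(text_len: int, preset_len: int, threshold: float = 1.0) -> float:
    """Overlap rate = preset_len / text_len when text_len / preset_len >= threshold."""
    if text_len / preset_len >= threshold:
        return preset_len / text_len
    return 0.0  # assumption: shorter texts are treated as non-overlapping


def segment_with_overlap(text: str, preset_len: int, rate: float) -> list[str]:
    """Cut windows of `preset_len` characters that overlap by `preset_len * rate`."""
    if not text:
        return []
    overlap = int(preset_len * rate)
    stride = max(1, preset_len - overlap)  # always move forward by at least 1
    segments = []
    start = 0
    while True:
        segments.append(text[start:start + preset_len])
        if start + preset_len >= len(text):
            break  # this window already reaches the end of the text
        start += stride
    return segments


text = "abcdefghijklmnopqrst"                # 20 characters
rate = overlap_rate(len(text), 10)           # 10 / 20 = 0.5
segments = segment_with_overlap(text, 10, rate)
print(segments)  # ['abcdefghij', 'fghijklmno', 'klmnopqrst']
```

With a text length of 20, a preset length of 10, and a 50% overlap, the window starts at offsets 0, 5, and 10, giving three segments of 10 characters each.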

[0046] The text segmentation method described in this embodiment, based on text length and a preset text length, enables automated segmentation of long published texts and achieves text splitting into segments. By using preset text length and text overlap rate thresholds, the granularity of text segmentation can be flexibly adjusted to meet the needs of different application scenarios.

[0047] In one embodiment, before step S10, that is, before segmenting all the published texts according to a preset text length to obtain at least one segmented text corresponding to the published text, the method further includes:

[0048] S104, perform cleaning processing on each of the published texts to remove stop words and punctuation marks from the published texts.

[0049] S105, perform pinyin conversion on each of the published texts to obtain pinyin texts corresponding to each of the published texts;

[0050] S106, extract the first letter of each pinyin in the pinyin text to obtain the target pinyin text corresponding to each of the published texts.

[0051] Understandably, the pinyin text refers to the text obtained by converting the characters in the published text into pinyin. The target pinyin text refers to the text that includes the initial letters of each character in the published text.

[0052] Specifically, before the published text is segmented according to the preset text length, each published text is first cleaned to remove stop words and punctuation marks. That is, a preset stop word list is obtained, the stop words in the published text are removed by matching, and all punctuation marks are deleted. Next, a Chinese word segmentation tool is used to segment the published text to obtain the word segmentation result. A dictionary containing a large number of Chinese words is then queried for the pinyin of each word segmentation result. When the pinyin of a word segmentation result cannot be found in the dictionary, a rule-based method or the pypinyin toolkit is used to convert it, for example by inferring the pinyin of the word from common combinations of initials and finals. This yields the pinyin text corresponding to each published text. Further, for each pinyin in the pinyin text, it is first judged whether the character is polyphonic: the character is looked up in the dictionary, and if the dictionary lists multiple different pronunciations, the character is determined to be polyphonic. If it is polyphonic, its correct pronunciation is determined from its context and its initial letter is extracted; if it is not polyphonic, its initial letter is extracted directly.
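
The cleaning, lookup, and initial-letter extraction steps can be sketched with a toy dictionary. Everything named here is hypothetical: `TOY_PINYIN`, `STOP_WORDS`, and the single context rule in `pick_reading` stand in for the real dictionary, stop-word list, and polyphone disambiguation, which the source does not specify. Stop words are matched per character for brevity, whereas the description matches them after word segmentation.

```python
import string

# Hypothetical toy dictionary: character -> candidate pinyin readings.
TOY_PINYIN = {
    "真": ["zhen"],
    "好": ["hao"],
    "看": ["kan"],
    "重": ["zhong", "chong"],  # polyphonic: 重量 (zhong) vs 重复 (chong)
    "量": ["liang"],
}
STOP_WORDS = {"的"}  # hypothetical one-entry stop-word list
PUNCTUATION = set("，。！？、；：") | set(string.punctuation)


def pick_reading(char: str, next_char: str, readings: list[str]) -> str:
    """Toy context rule: read 重 as 'chong' when followed by 复 or 新."""
    if char == "重" and next_char in {"复", "新"}:
        return "chong"
    return readings[0]


def to_initials(text: str) -> str:
    """Clean the text, look up each character, and keep the first pinyin letter."""
    chars = [c for c in text if c not in PUNCTUATION and c not in STOP_WORDS]
    initials = []
    for i, c in enumerate(chars):
        readings = TOY_PINYIN.get(c)
        if readings is None:
            continue  # characters missing from the toy dictionary are skipped
        next_char = chars[i + 1] if i + 1 < len(chars) else ""
        initials.append(pick_reading(c, next_char, readings)[0])
    return "".join(initials)


print(to_initials("真好看！"))                    # zhk
print(to_initials("重量"), to_initials("重复"))   # zl c
```

In practice the dictionary lookup and fallback would be handled by a full pinyin resource such as the pypinyin toolkit mentioned above; the toy rule only shows where context enters the decision.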

[0053] In another embodiment, a model can also be trained with a deep learning-based method to judge the correct pronunciation from the context and extract the initial letter automatically. That is, the pinyin text is input into the trained extraction model, which performs error correction and checking on the pinyin text and extracts the initial letters: each pinyin in the pinyin text is separated, and the first letter of each pinyin is extracted, producing the target pinyin text. For example, if the initial letters of one text are wbdwbnr and the initial letters of another text, which ends with "真好看", are dwbnrzhk, the output target pinyin text is wbdwbnr+dwbnrzhk.

[0054] In another embodiment, the published text is first segmented, and the pypinyin toolkit is then called to convert the segmented text to pinyin. The segmented text is converted with initials, finals, and tones as basic units to obtain a pinyin sequence for each segmented text. After error correction and checking, the pinyin sequences are spliced in order to construct the pinyin text. For example, a segmented text meaning "Please measure the weight of the following items" contains the polyphonic characters "重" (pinyin "chong2, zhong4") and "量" (pinyin "liang2, liang4"). The tones "neutral tone, first tone, second tone, third tone, fourth tone" are represented by the numbers "0, 1, 2, 3, 4" respectively. Converting this segmented text with initials, finals, and tones as basic units gives "qing3 ce4 liang2 yi3 xia4 wu4 pin3 zhong4 liang4".

[0055] In this embodiment, through pinyin conversion and initial letter extraction of the published text, efficient pinyin conversion and initial letter extraction of the published text are achieved, and further conversion of the target pinyin text is realized.

[0056] In one embodiment, as shown in Figure 2, step S20, that is, counting the initial letter frequencies of the characters in each of the segmented texts and performing normalization to obtain a text matrix corresponding to each of the published texts, includes:

[0057] S201, according to all the segmented texts corresponding to the same published text and the target pinyin text, determine the segmented pinyin texts corresponding to each of the segmented texts.

[0058] S202, count the number of times of the initial letters of the pinyin in the segmented pinyin texts corresponding to the published text to obtain the initial letter frequencies;

[0059] S203, perform mapping and normalization on the initial letter frequencies corresponding to the same published text to obtain the text matrix.

[0060] It can be understood that the initial letter frequency refers to the number of times a certain letter appears in the published text. For example, the number of times h appears is 21 times. The text matrix refers to converting text data into a numerical matrix form for easy data analysis. For example, [0, 2, 4, 3, 8, 6,...] is a 26-dimensional matrix.

[0061] Specifically, after the segmented texts are obtained, the segmented pinyin text corresponding to each segmented text is determined from all the segmented texts corresponding to the same published text and the target pinyin text. That is, the characters in the segmented text are matched against the pinyin-annotated characters in the target pinyin text, and the successfully matched pinyin-annotated characters form the segmented pinyin text. Then, the first letters of the pinyin in the segmented pinyin text of each segmented text are counted, that is, the number of times the first letter is a, b, c, d, and so on, giving the number of occurrences of each letter. The occurrences of the same letter across all segmented pinyin texts of the same published text are then added up to obtain the initial letter frequency. For example, if one segment has the initial letters wbdwbnr, with counts [a:0; b:2;...], and another segment, which ends with "真好看", has the initial letters dwbnrzhk, with counts [b:1; c:0;...], then the output target pinyin text is wbdwbnr+dwbnrzhk and the initial letter frequency is [a:0; b:3; c:0;...]. Further, the initial letter frequencies are sorted in alphabetical order to obtain an ordered initial letter frequency sequence. A mapping function, such as a sine, cosine, or sigmoid function, then maps this sequence to a predefined mapping space to obtain mapped values. The mapped values are normalized, for example with L1 norm normalization, L2 norm normalization, or min-max normalization, to obtain a normalized vector or value. Finally, the normalized vector or value is used as the elements of the text matrix, and the text matrix is constructed.
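
One possible reading of the mapping and normalization steps is sketched below: the alphabetically ordered frequencies are passed through a sigmoid and then scaled to unit L2 norm. The source lists several interchangeable mapping and normalization options, so the specific pairing here, and the helper names, are assumptions made for illustration.

```python
import math
import string
from collections import Counter


def ordered_frequencies(initials: str) -> list[int]:
    """Initial-letter counts in alphabetical order (a..z)."""
    counts = Counter(initials)
    return [counts.get(ch, 0) for ch in string.ascii_lowercase]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def map_and_normalize(freqs: list[int]) -> list[float]:
    """Map each frequency with a sigmoid, then scale the vector to unit L2 norm."""
    mapped = [sigmoid(f) for f in freqs]
    norm = math.sqrt(sum(m * m for m in mapped))  # > 0 because sigmoid(x) > 0
    return [m / norm for m in mapped]


# Target pinyin text from the example: wbdwbnr + dwbnrzhk
freqs = ordered_frequencies("wbdwbnr" + "dwbnrzhk")
text_matrix = map_and_normalize(freqs)
print(freqs[:4])  # counts of a, b, c, d: [0, 3, 0, 2]
```

The result is a 26-element vector with unit length, so texts of different sizes become directly comparable.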

[0062] In this embodiment, by performing initial letter frequency statistics and text matrix construction on the published text, efficient initial letter frequency statistics of the published text are realized, and text matrix construction is realized, thereby meeting the user's requirements for initial letter frequency statistics and text matrix construction of the published text.

[0063] In one embodiment, in step S40, that is, based on the bucketing result, perform text filtering on all the published texts to obtain the target text, including:

[0064] S401, calculate the similarity between each of the published texts in the bucketing result to obtain a text similarity value corresponding to each of the published texts;

[0065] S402, perform text filtering on all the published texts through the text similarity value to obtain the target text.

[0066] Understandably, the text similarity value refers to the similarity between two published texts. The target text refers to the text that meets the conditions and is screened out.

[0067] Specifically, after the bucketing results are obtained, similarity is calculated between the published texts in each bucket. The published texts are first vector-encoded, for example with a deep learning-based vector model such as Word2Vec or BERT, to obtain vector representations. Then, for any two published texts, a measure such as cosine similarity, Euclidean distance, Jaccard similarity, or edit distance is calculated to evaluate their similarity, yielding the text similarity value corresponding to each published text. In another embodiment, a trained text similarity model is used to calculate the similarity between the published texts in the bucketing results to obtain the text similarity value corresponding to each published text. Further, a preset similarity threshold is obtained, and highly similar published text pairs are selected based on it: the text similarity value is compared with the preset similarity threshold, and when it is greater than the threshold, the two published texts corresponding to that value are determined to be a published text pair. Then, each published text is sorted according to its similarity values with other texts, for example with a sorting algorithm such as quicksort or mergesort, and the text with the highest similarity value is selected as the target text.
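
The within-bucket comparison can be sketched with cosine similarity over precomputed vectors. The vectors below are made-up toy values standing in for Word2Vec or BERT encodings, and the names `cosine_similarity` and `similar_pairs` are illustrative.

```python
import math
from itertools import combinations


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_pairs(bucket: dict[str, list[float]], threshold: float):
    """Return (id_a, id_b, score) for pairs in one bucket with score > threshold,
    sorted from most to least similar."""
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(bucket.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score > threshold:
            pairs.append((id_a, id_b, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


bucket = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0]}
pairs = similar_pairs(bucket, threshold=0.9)
```

Because only texts in the same bucket are compared, the number of pairwise comparisons grows with the bucket size rather than with the size of the whole corpus.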

[0068] In this embodiment, similarity calculation is performed on the published texts within the bucketed results. This enables the calculation of similarity within the same bucket, thereby reducing the computational workload of text similarity calculation and improving the efficiency of text similarity calculation. Text filtering is then performed on all published texts based on the text similarity values, enabling the selection of target texts and further improving the efficiency of target text selection.

[0069] In one embodiment, step S401, namely, calculating the similarity between each published text in the bucketing result to obtain a text similarity value corresponding to each published text, includes:

[0070] S4011, Randomly shuffle the text matrix corresponding to each of the published texts, and determine the minimum hash value of the text matrix after random row shuffling;

[0071] S4012, Count the minimum hash value of the preset number of times corresponding to the same published text to obtain the hash signature corresponding to each published text;

[0072] S4013, calculate the similarity between the published texts in the bucketing result using all the hash signatures to obtain the text similarity value.

[0073] Understandably, the minimum hash value refers to the numerical value used to estimate the similarity between two documents, which is calculated using a hash algorithm. A hash signature is a set of multiple hash values, that is, a set of multiple minimum hash values, for example, h1[0, 0, 2]; h2[0, 1, 0]; h3[0, 0, 1].

[0074] Specifically, after the bucketing results are obtained, the rows of the text matrix corresponding to each published text are randomly shuffled. That is, a random number generator produces a random number for each row of the text matrix, the random number is mapped to a row index, and the rows are reordered accordingly. For example, given S1[1, 0, 1, ...]; S2[1, 1, 0, ...]; S3[1, 1, 1, ...], the order after shuffling may be S3, S1, S2. The minimum hash value of the text matrix after random row shuffling is then determined using a hash function (such as MD5, SHA-1, etc.). Specifically, the row number containing the first 1 in each column is the minimum hash value. For example, L1 is 0, L2 is 1, and L3 is 0. Furthermore, the text matrix is shuffled a preset number of times, which yields that same number of minimum hash values. All minimum hash values corresponding to the same published text, that is, to the same line of the text matrix, are collected to obtain the hash signature corresponding to each published text. For example, if the L1 values for S1 are 0, 2, 1, 3, and 1, then the hash signature is h(S1) = [0, 2, 1, 3, 1]. Next, the similarity between the published texts in the bucketing result is calculated using the hash signatures. The similarity between two hash signatures is the number of positions at which the two signatures hold the same element, divided by the hash signature length; this gives the text similarity value. For example, given h(S1) = [0, 2, 1, 3, 1]; h(S2) = [0, 2, 1, 4, 1]; h(S3) = [0, 2, 2, 4, 1], the similarity between S1 and S2 is 0.8, the similarity between S1 and S3 is 0.6, and the similarity between S2 and S3 is 0.8.
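The shuffling, minimum-hash, and signature-comparison steps of this embodiment can be illustrated with a minimal Python sketch. It assumes a binary matrix whose rows are shuffled and whose columns are the items being compared. The function names `minhash_signatures` and `signature_similarity`, the `seed` parameter, and the use of the shuffled row position as the hash value are illustrative assumptions, not part of the disclosed method.

```python
import random


def minhash_signatures(matrix, num_shuffles, seed=0):
    """Build one signature per column of a 0/1 matrix by repeated row shuffling.

    matrix: list of rows, each row a list of 0/1 values (assumed layout).
    num_shuffles: the preset number of shuffles, i.e. the signature length.
    """
    rng = random.Random(seed)
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    signatures = [[] for _ in range(n_cols)]
    for _ in range(num_shuffles):
        order = list(range(n_rows))
        rng.shuffle(order)  # random row order for this round
        for col in range(n_cols):
            # Position of the first row (in shuffled order) holding a 1 in this column.
            # If the column has no 1 at all, n_rows is used as a sentinel (assumption).
            value = next(
                (pos for pos, row in enumerate(order) if matrix[row][col] == 1),
                n_rows,
            )
            signatures[col].append(value)
    return signatures


def signature_similarity(sig_a, sig_b):
    """Fraction of positions at which two equal-length signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

Applied to the example signatures in the paragraph above, `signature_similarity([0, 2, 1, 3, 1], [0, 2, 1, 4, 1])` counts four matching positions out of five, which corresponds to the stated value of 0.8.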

[0075] In this embodiment, the minimum hash value is calculated by randomly shuffling the rows of the text matrix. By collecting the minimum hash values obtained for the same published text over a preset number of shuffles, the hash signature is determined, which enables the calculation of text similarity values, reduces text matching time, and improves the efficiency of text deduplication.

[0076] In one embodiment, step S402, namely, filtering all the published texts using the text similarity value to obtain the target text, includes:

[0077] S4011, Obtain a preset similarity threshold, and compare the text similarity value with the preset similarity threshold;

[0078] S4012, when the text similarity value is greater than the preset similarity threshold, obtain the text publication time corresponding to each of the published texts;

[0079] S4013, determine the published text corresponding to the earliest text publication time as the target text.

[0080] Understandably, the preset similarity threshold can be adjusted according to actual needs, for example, it can be set to 0.6 or 0.9. The text publication time refers to the time when the text was published, for example, 2023-09-01, etc.

[0081] Specifically, after obtaining the text similarity value, a preset similarity threshold is acquired. Then, each text similarity value is compared with the preset similarity threshold. When the text similarity value is greater than the preset similarity threshold, the text publication time corresponding to each published text is acquired. This text publication time is acquired simultaneously with the published text and may be included in the published text. Next, the two text publication times are compared, and the published text corresponding to the earliest text publication time is determined as the target text. In another embodiment, when multiple published texts have a similarity exceeding the preset similarity threshold with the target text, the text publication times of all texts are compared, and the published text corresponding to the earliest text publication time is selected and determined as the target text.

[0082] When the text similarity value is less than or equal to the preset similarity threshold, the two published texts corresponding to that text similarity value are retrieved, and the similarity between each of these two published texts and the other published texts is checked. If neither of the two published texts has a similarity value with other published texts, both are identified as target texts. If only one of them has no similarity value with other published texts, that published text is identified as a target text, and the other published text is re-evaluated. If both published texts have similarity values with other published texts, both are re-evaluated until each is either deleted or retained.
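One possible reading of this filtering rule can be sketched in Python as follows. The sketch groups texts that are linked by a similarity above the threshold and keeps the earliest-published text of each group; texts with no above-threshold link are retained as they are. The transitive grouping (via a small union-find), the function name `filter_duplicates`, and the input layout are assumptions made for illustration and are not stated in the source.

```python
from datetime import date


def filter_duplicates(pub_times, similarities, threshold):
    """Return the ids of the texts kept as target texts.

    pub_times: dict mapping text id -> publication date (hypothetical layout).
    similarities: dict mapping (id_a, id_b) -> text similarity value.
    threshold: the preset similarity threshold.
    """
    parent = {text_id: text_id for text_id in pub_times}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), sim in similarities.items():
        if sim > threshold:  # strictly greater, as in step S4012
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = {}
    for text_id in pub_times:
        groups.setdefault(find(text_id), []).append(text_id)

    # Keep the earliest-published member of each group (ties broken by id).
    return sorted(
        min(members, key=lambda t: (pub_times[t], t)) for members in groups.values()
    )


times = {
    "a": date(2023, 9, 1),
    "b": date(2023, 9, 3),
    "c": date(2023, 9, 2),
    "d": date(2023, 9, 5),
}
sims = {("a", "b"): 0.8, ("a", "c"): 0.6, ("b", "c"): 0.8, ("a", "d"): 0.2}
kept = filter_duplicates(times, sims, threshold=0.7)
```

In this hypothetical input, texts a, b, and c are linked through pairs scoring above 0.7, so only the earliest of them is kept, while d has no above-threshold link and is retained.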

[0083] In this embodiment, the first screening of published texts is achieved by comparing them with a preset similarity threshold, thereby selecting published text pairs with high similarity. By comparing the publication times of all texts corresponding to the published text, the earliest publication time is selected, thus determining the target text and enabling the deletion of a large number of duplicate texts.

[0084] It should be understood that the sequence number of each step in the above embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.

[0085] In one embodiment, a text deduplication device is provided, which corresponds one-to-one with the text deduplication method described in the above embodiments. As shown in Figure 3, the text deduplication device includes a segmented text module 10, a text matrix module 20, a bucketing processing module 30, and a text filtering module 40. Detailed descriptions of each functional module are as follows:

[0086] The segmented text module 10 is used to acquire at least one published text, segment all the published texts according to a preset text length, and obtain at least one segmented text corresponding to the published text.

[0087] The text matrix module 20 is used to count the frequency of the first letter of each character in each segmented text and normalize it to obtain a text matrix corresponding to each published text.

[0088] The bucketing processing module 30 is used to perform bucketing on the published text corresponding to each of the text matrices using a locality-sensitive hashing algorithm to obtain bucketing results.

[0089] The text filtering module 40 is used to filter all the published text based on the bucketing results to obtain the target text.
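The bucketing step performed by module 30 can be illustrated with a small Python sketch. The source only names a locality-sensitive hashing algorithm; the random-hyperplane (sign of dot product) scheme used here is one common LSH family chosen for illustration, and it is an assumption that each published text is summarized by a single 26-dimensional vector. The function name `lsh_buckets` and the parameters `num_planes` and `seed` are hypothetical.

```python
import random
from collections import defaultdict


def lsh_buckets(vectors, num_planes=8, seed=0):
    """Group text ids into buckets by the sign pattern of random projections.

    vectors: dict mapping text id -> list of floats (assumed 26-dimensional).
    Texts with the same bit pattern across all hyperplanes share a bucket.
    """
    rng = random.Random(seed)
    dim = len(next(iter(vectors.values())))
    # One random hyperplane (normal vector) per bit of the bucket key.
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]
    buckets = defaultdict(list)
    for text_id, vec in vectors.items():
        key = tuple(
            1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
            for plane in planes
        )
        buckets[key].append(text_id)
    return dict(buckets)


v = [0.0] * 26
v[0], v[1] = 0.6, 0.4
w = [0.0] * 26
w[25] = 1.0
result = lsh_buckets({"t1": v, "t2": list(v), "t3": w})
```

Identical vectors always receive the same key, so t1 and t2 land in the same bucket; only texts within one bucket then need to be compared pairwise, which is where the reduction in comparison time comes from.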

[0090] Optionally, the segmented text module 10 includes:

[0091] A text length unit is used to determine the text length corresponding to each of the published texts;

[0092] The text overlap rate unit is used to determine the text overlap rate based on the text length and the preset text length;

[0093] A text segmentation unit is used to segment the published text using the text overlap rate and the preset text length to obtain at least one segmented text; there is partial overlap between adjacent segments.
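The segmentation performed by these units can be sketched in Python as a sliding window whose step is shortened by the overlap rate. How the text overlap rate is derived from the text length and the preset text length is not described in this passage, so the sketch simply takes `overlap_rate` as an input; the function name `segment_text` and the rounding of the step are likewise assumptions.

```python
def segment_text(text, segment_length, overlap_rate):
    """Split text into segments of segment_length with partial overlap.

    overlap_rate: fraction of each segment shared with the next one, in [0, 1).
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    if not 0 <= overlap_rate < 1:
        raise ValueError("overlap_rate must be in [0, 1)")
    if not text:
        return []
    # Advance by the non-overlapping part of a segment, at least one character.
    step = max(1, int(segment_length * (1 - overlap_rate)))
    segments = []
    start = 0
    while True:
        segments.append(text[start:start + segment_length])
        if start + segment_length >= len(text):
            break  # the last segment reached the end of the text
        start += step
    return segments


parts = segment_text("abcdefghij", segment_length=4, overlap_rate=0.5)
# parts == ["abcd", "cdef", "efgh", "ghij"]
```

With a segment length of 4 and an overlap rate of 0.5, each segment shares its last two characters with the start of the next one, which matches the requirement that adjacent segments partially overlap.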

[0094] Optionally, the segmented text module 10 further includes:

[0095] The text cleaning unit is used to clean each of the published texts to remove stop words and punctuation marks from the published texts;

[0096] The pinyin conversion unit is used to convert each of the published texts into pinyin to obtain pinyin text corresponding to each of the published texts.

[0097] The initial letter extraction unit is used to extract the initial letter of each pinyin in the pinyin text to obtain the target pinyin text corresponding to each of the published texts.

[0098] Optionally, the text matrix module 20 includes:

[0099] The segmented pinyin text unit is used to determine the segmented pinyin text corresponding to each segmented text based on all segmented texts corresponding to the same published text and the target pinyin text;

[0100] The frequency statistics unit is used to count the number of times each pinyin first letter appears in the segmented pinyin text corresponding to the published text, and obtain the frequency of each first letter;

[0101] The mapping and normalization unit is used to map and normalize the frequencies of the first letters corresponding to the same published text to obtain the text matrix.
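The counting, mapping, and normalization units can be illustrated with the following Python sketch. It assumes each segment has already been reduced to a string of pinyin first letters, maps each letter a to z onto one of 26 columns, and normalizes each row so that it sums to 1. The sum-to-one normalization and the function name `initial_frequency_matrix` are assumptions, since this passage does not specify the normalization formula.

```python
def initial_frequency_matrix(segmented_initials):
    """Build a matrix with one 26-dimensional row per segment.

    segmented_initials: list of strings, each holding the pinyin first
    letters of one segment (hypothetical input layout).
    """
    matrix = []
    for segment in segmented_initials:
        counts = [0] * 26
        for ch in segment.lower():
            if "a" <= ch <= "z":  # ignore anything that is not a Latin letter
                counts[ord(ch) - ord("a")] += 1
        total = sum(counts)
        # Normalize counts to relative frequencies; an empty segment stays all zeros.
        row = [c / total for c in counts] if total else [0.0] * 26
        matrix.append(row)
    return matrix


m = initial_frequency_matrix(["wbqc", "bb"])
# m[0] holds 0.25 in the columns for w, b, q, and c; m[1] holds 1.0 in the column for b
```

Because every row has the same 26 columns regardless of segment length, texts of different lengths can be compared or hashed on a common footing.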

[0102] Optionally, the text filtering module 40 includes:

[0103] The similarity calculation unit is used to calculate the similarity between each of the published texts in the bucketing result to obtain the text similarity value corresponding to each of the published texts;

[0104] A text filtering unit is used to filter all the published texts based on the text similarity value to obtain the target text.

[0105] Optionally, the similarity calculation unit includes:

[0106] The random row shuffling subunit is used to randomly shuffle the text matrix corresponding to each of the published texts, and determine the minimum hash value of the text matrix after random row shuffling.

[0107] The hash signature subunit is used to collect the minimum hash values obtained for the same published text over a preset number of shuffles, and obtain the hash signature corresponding to each published text.

[0108] The similarity value subunit is used to calculate the similarity between each of the published texts in the bucketing result using all the hash signatures, and obtain the text similarity value.

[0109] Optionally, the text filtering unit includes:

[0110] A threshold comparison subunit is used to obtain a preset similarity threshold and compare the text similarity value with the preset similarity threshold.

[0111] The publication time subunit is used to obtain the text publication time corresponding to each of the published texts when the text similarity value is greater than the preset similarity threshold.

[0112] The target text subunit is used to determine the published text corresponding to the earliest text publication time as the target text.

[0113] In one embodiment, a computer device is provided, which may be a terminal, and its internal structure may be as shown in Figure 4. The computer device includes a processor, memory, network interface, display screen, and input device connected via a system bus. The processor provides computing and control capabilities. The memory includes a readable storage medium and internal memory. The readable storage medium stores the operating system and computer-readable instructions. The internal memory provides an environment for the operation of the operating system and computer-readable instructions in the readable storage medium. The network interface is used to communicate with an external server via a network connection. When the computer-readable instructions are executed by the processor, they implement a text deduplication method. The readable storage medium provided in this embodiment includes both non-volatile and volatile readable storage media.

[0114] In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the above-described text deduplication method when executing the computer-readable instructions.

[0115] In one embodiment, one or more computer-readable storage media storing computer-readable instructions are provided. The readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media. The computer-readable instructions stored on the readable storage media implement the above-described text deduplication method when executed by one or more processors.

[0116] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by instructing related hardware with computer-readable instructions. These computer-readable instructions can be stored in a non-volatile readable storage medium or a volatile readable storage medium. When executed, these computer-readable instructions can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

[0117] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is used as an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above.

[0118] The above-described embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention, and should all be included within the protection scope of the present invention.

Claims

1. A text deduplication method, characterized in that it comprises: obtaining at least one published text, and segmenting all the published texts according to a preset text length to obtain at least one segmented text corresponding to the published text; counting and normalizing the frequency of the first letter of each character in each segmented text to obtain a text matrix corresponding to each published text, wherein the text matrix is a 26-dimensional matrix obtained by mapping and normalizing the first-letter frequencies of the text content of the published text; bucketing the published texts corresponding to each of the text matrices using a locality-sensitive hashing algorithm to obtain bucketing results; and filtering all the published texts based on the bucketing results to obtain the target text.

2. The text deduplication method as described in claim 1, characterized in that, Before segmenting all the published texts according to a preset text length to obtain at least one segmented text corresponding to the published text, the method further includes: Each of the published texts is cleaned to remove stop words and punctuation marks; Each of the published texts is converted to Pinyin to obtain the Pinyin text corresponding to each of the published texts; The first letter of each pinyin syllable in the pinyin text is extracted to obtain the target pinyin text corresponding to each of the published texts.

3. The text deduplication method as described in claim 1, characterized in that, The step of segmenting all the published texts according to a preset text length to obtain at least one segmented text corresponding to the published text includes: Determine the text length corresponding to each of the published texts; The text overlap rate is determined based on the text length and the preset text length; The published text is segmented using the text overlap rate and the preset text length to obtain at least one segmented text; some content overlaps between adjacent segments.

4. The text deduplication method as described in claim 2, characterized in that, The process of statistically analyzing and normalizing the frequency of the first letter of each character in the segmented text to obtain a text matrix corresponding to each published text includes: Based on all segmented texts corresponding to the same published text and the target pinyin text, determine the segmented pinyin text corresponding to each segmented text; The frequency of the first letter of each syllable in the segmented pinyin text corresponding to the published text is obtained by counting the number of times the first letter is used. The frequency of each initial letter corresponding to the same published text is mapped and normalized to obtain the text matrix.

5. The text deduplication method as described in claim 1, characterized in that, The step of filtering all published texts based on the bucketing results to obtain the target text includes: The similarity between each published text in the bucketing result is calculated to obtain the text similarity value corresponding to each published text; The target text is obtained by filtering all the published texts using the text similarity value.

6. The text deduplication method as described in claim 5, characterized in that, The step of calculating the similarity between the published texts in the bucketing results to obtain the text similarity value corresponding to each published text includes: The text matrix corresponding to each of the published texts is randomly shuffled by row, and the minimum hash value of the text matrix after random row shuffling is determined; The minimum hash values obtained for the same published text over a preset number of shuffles are collected to obtain the hash signature corresponding to each published text; The similarity between the published texts in the bucketing results is calculated using all the hash signatures to obtain a text similarity value.

7. The text deduplication method as described in claim 5, characterized in that, The step of filtering all published texts using the text similarity value to obtain the target text includes: Obtain a preset similarity threshold, and compare the text similarity value with the preset similarity threshold; When the text similarity value is greater than the preset similarity threshold, the text publication time corresponding to each of the published texts is obtained; The published text corresponding to the earliest text publication time is determined as the target text.

8. A text deduplication device, characterized in that it comprises: a segmented text module, used to obtain at least one published text and segment all the published texts according to a preset text length to obtain at least one segmented text corresponding to the published text; a text matrix module, used to count and normalize the frequency of the first letter of each character in each segmented text to obtain a text matrix corresponding to each published text, wherein the text matrix is a 26-dimensional matrix obtained by mapping and normalizing the first-letter frequencies of the text content of the published text; a bucketing module, used to perform bucketing on the published text corresponding to each of the text matrices using a locality-sensitive hashing algorithm to obtain bucketing results; and a text filtering module, used to filter all the published texts based on the bucketing results to obtain the target text.

9. A computer device comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, characterized in that, When the processor executes the computer-readable instructions, it implements the text deduplication method as described in any one of claims 1 to 7.

10. One or more readable storage media storing computer-readable instructions, characterized in that, When the computer-readable instructions are executed by one or more processors, they cause the one or more processors to perform the text deduplication method as described in any one of claims 1 to 7.