A method, apparatus, device and medium for text clustering
By establishing a vocabulary and calculating word vectors, and randomly selecting center vectors for short text clustering, the problem of difficulty in representing semantic and word order information in existing technologies is solved, and efficient automatic short text clustering is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- JINAN INSPUR DATA TECH CO LTD
- Filing Date
- 2023-04-27
- Publication Date
- 2026-06-23
AI Technical Summary
Existing short text clustering methods are unable to fully represent semantic and word order information, resulting in poor clustering performance.
By building a vocabulary and calculating word vectors, a set of text vectors is obtained. The center vectors are randomly selected for clustering. The center vectors with the least disorder are selected for repeated division until the preset conditions are met.
It achieves efficient automatic clustering of short texts while preserving textual semantics and word order information, thus improving the accuracy and efficiency of clustering.
Figure CN116578702B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computers, and more specifically to a method, apparatus, device, and readable medium for text clustering.
Background Technology
[0002] Short text clustering can be used in many scenarios, such as automatic classification of news messages, automatic classification of logs in large-scale distributed systems, automatic classification of products, and user profile clustering for social products. Currently, most short text clustering methods map each short text to a TF-IDF feature vector and then apply classic machine learning or deep learning models for classification. This approach has two obvious shortcomings. First, a TF-IDF feature vector only considers keyword information and cannot fully represent semantic and word order information. Second, fixed-length feature vectors are inherently ill-suited to fully representing short texts of different lengths.
Summary of the Invention
[0003] In view of this, the purpose of this invention is to provide a method, apparatus, device and readable medium for text clustering. By using the technical solution of this invention, efficient semantic clustering of short texts can be achieved, and automatic clustering of short texts can be achieved while fully preserving the semantic and word order information of the text.
[0004] To achieve the above objectives, one aspect of the present invention provides a text clustering method, comprising the following steps:
[0005] Create a vocabulary list and calculate the word vector for each word in the vocabulary list;
[0006] Obtain the text vector of each text to be clustered and form a text vector set, and calculate the distance between every two text vectors in the text vector set;
[0007] Randomly select a threshold number of text vectors from the text vector set as candidate center vectors, and, taking every two of the candidate center vectors in turn as a pair of center vectors, divide the text vectors into two categories;
[0008] Among the divisions, take the pair of center vectors with the lowest disorder; from that pair, select the center vector with the highest disorder together with the text vectors classified under it, and repeat the previous step on the selected text vectors until a preset condition is met.
[0009] According to one embodiment of the present invention, establishing a vocabulary list and calculating the word vector for each word in the vocabulary list includes:
[0010] The text that has undergone preliminary manual screening and sampling is segmented into words, ignoring numbers and random strings in the text;
[0011] Build a vocabulary list using the segmented words;
[0012] Calculate the word vector for each word in the vocabulary.
[0013] According to one embodiment of the present invention, obtaining the text vector of each text to be clustered and forming a text vector set, and calculating the distance between every two text vectors in the text vector set includes:
[0014] Obtain each text to be clustered, and perform word segmentation on each text;
[0015] Look up the word vector of each word in the vocabulary list, following the order of the words in each segmented text;
[0016] Use the sequence of word vectors of the words in each text as the text vector of that text;
[0017] The text vectors of each text are grouped together to form a text vector set;
[0018] Calculate the distance between any two text vectors in the set of text vectors.
[0019] According to one embodiment of the present invention, calculating the distance between every two text vectors in the text vector set includes:
[0020] Calculate the distance between any two text vectors using a recursive formula (the formula itself is not reproduced in this text), where the boundary conditions are $\mathrm{distance}(0,0)=0$; $\mathrm{distance}(i,0)=|A_i|$; $\mathrm{distance}(0,j)=|B_j|$. Here $\mathrm{distance}(n,m)$ is the distance between the two vectors, A is the first text vector, B is the second text vector, n is the length of the first text vector, m is the length of the second text vector, and $0<i<n$, $0<j<m$.
[0021] According to one embodiment of the present invention, randomly selecting a threshold number of text vectors from the text vector set as candidate center vectors, and dividing the text vectors into two categories by taking every two of the candidate center vectors in turn as a pair of center vectors, includes:
[0022] Select two text vectors from the candidate center vectors as center vectors, and look up in turn the distances from the other text vectors to the two center vectors. If the distance from a text vector to the first center vector is less than its distance to the second center vector, assign the text vector to the subset of the first center vector; if it is greater, assign the text vector to the subset of the second center vector. The text vectors are thereby divided into two subsets;
[0023] Repeat the previous step so that the text vectors are divided into two subsets multiple times, where the pair of center vectors selected must differ each time.
[0024] According to one embodiment of the present invention, selecting, from the pair of center vectors with the lowest disorder among the divisions, the center vector with the highest disorder together with the text vectors classified under it, and repeating the previous step on the selected text vectors until a preset condition is met, includes:
[0025] Calculate the disorder of the two subsets in each division;
[0026] From the pair of center vectors with the lowest disorder, select the center vector with the highest disorder;
[0027] Take the elements of the subset corresponding to the center vector with the highest disorder as a new text vector set;
[0028] Randomly select a threshold number of text vectors from the new text vector set as candidate center vectors, and, taking every two of the candidate center vectors in turn as a pair of center vectors, divide the new text vector set into two categories;
[0029] Among the divisions, take the pair of center vectors with the lowest disorder; from that pair, select the center vector with the highest disorder together with the text vectors classified under it, and repeat the previous step on the selected text vectors until the preset condition is met.
[0030] According to one embodiment of the present invention, the preset conditions include the disorder level being less than a preset threshold or the total number of partitioned subsets reaching a threshold number.
[0031] Another aspect of the embodiments of the present invention also provides a text clustering apparatus, the apparatus comprising:
[0032] The module is configured to create a vocabulary list and calculate the word vector for each word in the vocabulary list.
[0033] The calculation module is configured to obtain the text vector of each text to be clustered and form a set of text vectors, and calculate the distance between every two text vectors in the set of text vectors;
[0034] The segmentation module is configured to randomly select a threshold number of text vectors from the text vector set as candidate center vectors, and, taking every two of the candidate center vectors in turn as a pair of center vectors, divide the text vectors into two categories;
[0035] The selection module is configured to take, among the divisions, the pair of center vectors with the lowest disorder, select from that pair the center vector with the highest disorder together with the text vectors classified under it, and repeat the previous step on the selected text vectors until a preset condition is met.
[0036] Another aspect of the embodiments of the present invention also provides a computer device, the computer device comprising:
[0037] At least one processor; and
[0038] A memory storing computer instructions executable on the processor, the instructions, when executed by the processor, implementing the steps of any of the methods described above.
[0039] In another aspect, embodiments of the present invention also provide a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of any of the above methods.
[0040] The present invention has the following beneficial technical effects. The text clustering method provided in the embodiments of the present invention establishes a vocabulary and calculates the word vector of each word in the vocabulary; obtains the text vector of each text to be clustered to form a text vector set, and calculates the distance between every two text vectors in the set; randomly selects a threshold number of text vectors from the set as candidate center vectors and, taking every two of the candidate center vectors in turn as a pair of center vectors, divides the text vectors into two categories; and, from the pair of center vectors with the lowest disorder among the divisions, selects the center vector with the highest disorder together with the text vectors classified under it, repeating the previous step on the selected text vectors until a preset condition is met. This technical solution achieves efficient semantic clustering of short texts and clusters them automatically while fully preserving the semantic and word order information of the text.
Attached Figure Description
[0041] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other embodiments can be obtained based on these drawings without creative effort.
[0042] Figure 1 is a schematic flowchart of a text clustering method according to an embodiment of the present invention;
[0043] Figure 2 is a schematic diagram of a text clustering method according to an embodiment of the present invention;
[0044] Figure 3 is a schematic diagram of a text clustering apparatus according to an embodiment of the present invention;
[0045] Figure 4 is a schematic diagram of a computer device according to an embodiment of the present invention;
[0046] Figure 5 is a schematic diagram of a computer-readable storage medium according to an embodiment of the present invention.
Detailed Implementation
[0047] To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention will be further described in detail below with reference to specific examples and the accompanying drawings.
[0048] Based on the above objectives, a first aspect of the embodiments of the present invention provides an embodiment of a text clustering method. Figure 1 is a schematic flowchart of the method.
[0049] As shown in Figure 1, the method may include the following steps:
[0050] S1: Establish a vocabulary and calculate the word vector for each word in the vocabulary. Text that has undergone preliminary manual screening and sampling is segmented into words, ignoring numbers and random strings. A vocabulary is then built from the segmented words, and the word vector for each word in the vocabulary is calculated.
[0051] S2: Obtain the text vector of each text to be clustered to form a text vector set, and calculate the distance between every two text vectors in the set. Each text to be clustered is segmented into words, the word vector of each word is looked up in the vocabulary list in the order of the segmented words, and the sequence of word vectors of the words in each text is used as its text vector. The text vectors of all texts are then collected into a text vector set, and the distance between every two text vectors in the set is calculated.
[0052] S3: Randomly select a threshold number of text vectors from the text vector set as candidate center vectors and, taking every two of the candidate center vectors in turn as a pair of center vectors, divide the text vectors into two categories. Two text vectors are selected from the candidate center vectors as center vectors, and the distances from the other text vectors to these two center vectors are looked up in turn. If the distance from a text vector to the first center vector is less than its distance to the second center vector, it is assigned to the subset of the first center vector; otherwise, it is assigned to the subset of the second center vector. Each selection of two center vectors therefore divides the text vector set into two subsets. Another two text vectors are then selected as center vectors and the set is divided again in the same way. This is repeated so that the text vector set is divided into two subsets multiple times, with a different pair of center vectors each time.
[0053] S4: From the pair of center vectors with the lowest disorder among the divisions, select the center vector with the highest disorder together with the text vectors classified under it, and repeat the previous step on the selected text vectors until a preset condition is met. After the previous step, the text vector set has been divided multiple times; each division has two center vectors and yields two subsets. The disorder of the two subsets in each division is calculated, and the pair of center vectors with the lowest disorder is identified. Within that pair, the center vector with the highest disorder is selected, and the elements of its subset are taken as a new text vector set. A threshold number of text vectors are randomly selected from the new set as candidate center vectors, every two of which are used in turn as a pair of center vectors to divide the new set into two categories. The selection is then repeated on the result until the preset condition is met.
[0054] By using the technical solution of the present invention, efficient semantic clustering of short texts can be achieved, and automatic clustering of short texts can be realized while fully preserving the semantic and word order information of the text.
[0055] In a preferred embodiment of the present invention, establishing a vocabulary list and calculating the word vector for each word in the vocabulary list includes:
[0056] The text that has undergone preliminary manual screening and sampling is segmented into words, ignoring numbers and random strings in the text;
[0057] Build a vocabulary list using the segmented words;
[0058] Calculate the word vector for each word in the vocabulary. The text, after initial manual screening and sampling, is segmented into words. A vocabulary is built based on these segmented words, ignoring irregular symbols such as numbers and random strings in the text. Then, the word2vec model is used to model the massive amount of text, calculating the word vector for each word in the vocabulary.
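The segmentation and vocabulary steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the regular expression, the `looks_random` heuristic (a token mixing letters and digits is treated as a random string), and all function names are hypothetical, since the description does not say how random strings are detected. Training the word vectors themselves (for example with word2vec) is left out.

```python
import re

# Assumption: words are runs of ASCII letters, digits, or underscores.
TOKEN_RE = re.compile(r"[A-Za-z0-9_]+")


def looks_random(token: str) -> bool:
    """Hypothetical heuristic: a token mixing letters and digits is a random string."""
    has_digit = any(c.isdigit() for c in token)
    has_alpha = any(c.isalpha() for c in token)
    return has_digit and has_alpha


def segment(text: str) -> list[str]:
    """Split text into lowercase words, skipping numbers and random-looking strings."""
    words = []
    for token in TOKEN_RE.findall(text.lower()):
        if token.isdigit() or looks_random(token):
            continue
        words.append(token)
    return words


def build_vocabulary(sample_texts: list[str]) -> list[str]:
    """Collect the distinct segmented words in first-seen order."""
    vocabulary: dict[str, int] = {}
    for text in sample_texts:
        for word in segment(text):
            vocabulary.setdefault(word, len(vocabulary))
    return list(vocabulary)
```

A real system would likely need a language-specific segmenter and a more careful filter for random strings; the point here is only the order of operations: filter, segment, then build the vocabulary.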
[0059] In a preferred embodiment of the present invention, obtaining the text vector of each text to be clustered and forming a text vector set, and calculating the distance between every two text vectors in the text vector set includes:
[0060] Obtain each text to be clustered, and perform word segmentation on each text;
[0061] Look up the word vector of each word in the vocabulary list, following the order of the words in each segmented text;
[0062] Use the sequence of word vectors of the words in each text as the text vector of that text;
[0063] The text vectors of each text are grouped together to form a text vector set;
[0064] Calculate the distance between any two text vectors in the text vector set. Segment the given short text, which can be understood as the text to be clustered. Query the word vector of each word in the order of the segmented words; that is, query the word vector of the corresponding word in the vocabulary, and define the sequence of word vectors as the short text vector. For example, given the short text "Unsure if Grafana is for you?", the segmentation result is [unsure, if, grafana, is, for, you]. Assuming that these words are all in the vocabulary, and their corresponding word vector values are {unsure:0.5, if:0.3, grafana:0.09, is:0.01, for:0.02, you:0.6}, then the text vector of this text is [0.5, 0.3, 0.09, 0.01, 0.02, 0.6]. If some words are not in the vocabulary, the word vector of the corresponding word is 0.
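The lookup in the example above can be written as a short sketch. The function name `text_vector` is hypothetical; the word vector values are the ones given in the example, and a word missing from the vocabulary maps to 0 as described.

```python
def text_vector(words, word_vectors):
    """Look up each word in order; out-of-vocabulary words get 0."""
    return [word_vectors.get(word, 0.0) for word in words]


# Values taken from the example in the description.
word_vectors = {"unsure": 0.5, "if": 0.3, "grafana": 0.09,
                "is": 0.01, "for": 0.02, "you": 0.6}
words = ["unsure", "if", "grafana", "is", "for", "you"]

print(text_vector(words, word_vectors))
# [0.5, 0.3, 0.09, 0.01, 0.02, 0.6]
```

Because the list follows the order of the segmented words, texts of different lengths produce text vectors of different lengths, which is why the distance in the next step has to handle unequal lengths.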
[0065] In a preferred embodiment of the present invention, calculating the distance between every two text vectors in the text vector set includes:
[0066] Calculate the distance between any two text vectors using a recursive formula (the formula itself is not reproduced in this text), where the boundary conditions are $\mathrm{distance}(0,0)=0$; $\mathrm{distance}(i,0)=|A_i|$; $\mathrm{distance}(0,j)=|B_j|$. Here $\mathrm{distance}(n,m)$ is the distance between the two vectors, A is the first text vector, B is the second text vector, n is the length of the first text vector, m is the length of the second text vector, and $0<i<n$, $0<j<m$. Therefore, the distance between the first text vector A and the second text vector B is $\mathrm{distance}(n,m)$, denoted as <A,B>.
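Since the recurrence is not available in this text, the following is only a sketch of what a dynamic-programming distance with these boundary conditions might look like. The recurrence used here (an edit-distance style minimum over skipping an element of A, skipping an element of B, or matching the two elements) is an assumption and may differ from the patented formula. The boundary values follow the description, and word vectors are treated as scalars as in the example of paragraph [0064].

```python
def text_distance(a, b):
    """Distance between two text vectors of possibly different lengths.

    Boundary conditions follow the description; the recurrence is assumed.
    """
    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = abs(a[i - 1])      # distance(i, 0) = |A_i|
    for j in range(1, m + 1):
        d[0][j] = abs(b[j - 1])      # distance(0, j) = |B_j|
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumed recurrence, not taken from the source.
            d[i][j] = min(
                d[i - 1][j] + abs(a[i - 1]),                 # skip A_i
                d[i][j - 1] + abs(b[j - 1]),                 # skip B_j
                d[i - 1][j - 1] + abs(a[i - 1] - b[j - 1]),  # match A_i with B_j
            )
    return d[n][m]
```

Under this assumed recurrence, two identical text vectors have distance 0 and the distance is symmetric in A and B. Computing the table costs time proportional to n times m, so precomputing the distance between every two text vectors once, as the method describes, avoids repeating this work during clustering.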
[0067] In a preferred embodiment of the present invention, randomly selecting a threshold number of text vectors from the text vector set as candidate center vectors, and dividing the text vectors into two categories by taking every two of the candidate center vectors in turn as a pair of center vectors, includes:
[0068] Select two text vectors from the candidate center vectors as center vectors, and look up in turn the distances from the other text vectors to the two center vectors. If the distance from a text vector to the first center vector is less than its distance to the second center vector, assign the text vector to the subset of the first center vector; if it is greater, assign the text vector to the subset of the second center vector. The text vectors are thereby divided into two subsets;
[0069] Repeat the previous step so that the text vectors are divided into two subsets multiple times, where the pair of center vectors selected must differ each time. A threshold number of short text vectors are randomly selected from all texts as candidate center vectors. Two optimal center vectors are then chosen from the candidates to divide the original text vector set into two categories, and these two optimal center vectors are taken as the cluster centers. The two optimal center vectors are determined by enumerating pairs of candidate center vectors. For example, select 10 text vectors from the text vector set as candidate center vectors. First, take vector 1 and vector 2 of the 10 as center vectors, and use the distances calculated above to look up the distance from every other vector to vector 1 and to vector 2. Each vector belongs to the category of the center it is closer to. For example, if vector 100 is closer to vector 1 than to vector 2, vector 100 is assigned to the subset of vector 1, and so on. Then select another pair of vectors as center vectors; each enumerated pair completes one division.
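The enumeration described above can be sketched as follows. The function names, the use of vector indices into a precomputed distance matrix `dist`, and the choice to place each center in its own subset are assumptions made for this illustration. Ties are sent to the second center, following the "otherwise" wording of step S3.

```python
import random
from itertools import combinations


def partition(n_vectors, dist, center_a, center_b):
    """Assign every vector index to the nearer of two center indices."""
    subset_a, subset_b = [center_a], [center_b]  # assumption: centers join their own subset
    for v in range(n_vectors):
        if v in (center_a, center_b):
            continue
        if dist[v][center_a] < dist[v][center_b]:
            subset_a.append(v)
        else:  # greater or equal: second center, as in step S3
            subset_b.append(v)
    return subset_a, subset_b


def enumerate_partitions(n_vectors, dist, threshold, rng):
    """Randomly pick candidate centers, then divide once per distinct pair."""
    candidates = rng.sample(range(n_vectors), min(threshold, n_vectors))
    return {
        (a, b): partition(n_vectors, dist, a, b)
        for a, b in combinations(candidates, 2)
    }
```

Using `itertools.combinations` guarantees that no pair of center vectors is enumerated twice. With 10 candidates this gives 45 divisions, each needing only lookups in the precomputed distance matrix.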
[0070] In a preferred embodiment of the present invention, selecting, from the pair of center vectors with the lowest disorder among the divisions, the center vector with the highest disorder together with the text vectors classified under it, and repeating the previous step on the selected text vectors until a preset condition is met, includes:
[0071] Calculate the disorder of the two subsets in each division;
[0072] From the pair of center vectors with the lowest disorder, select the center vector with the highest disorder;
[0073] Take the elements of the subset corresponding to the center vector with the highest disorder as a new text vector set;
[0074] Randomly select a threshold number of text vectors from the new text vector set as candidate center vectors, and, taking every two of the candidate center vectors in turn as a pair of center vectors, divide the new text vector set into two categories;
[0075] Among the divisions, take the pair of center vectors with the lowest disorder; from that pair, select the center vector with the highest disorder together with the text vectors classified under it, and repeat the previous step on the selected text vectors until the preset condition is met. For example, let $seed_i$ be the i-th candidate center vector of the vector set S. With $seed_a$ and $seed_b$ as the center vectors, each vector $S_i$ in S is assigned, based on the calculated distances, to the class of the closer center vector, so that S is divided into two subsets $S_a$ and $S_b$. The overall disorder of subsets $S_a$ and $S_b$ is given by a formula (not reproduced in this text) whose terms refer to the i-th text vector in subsets $S_a$ and $S_b$, respectively. An overall disorder is calculated for each pair of center vectors, and the pair whose division has the lowest overall disorder is selected as the two cluster centers. For example, if vector 1 and vector 5 are selected as the cluster centers, the disorder of vector 1 and of vector 5 is then calculated and the one with the higher disorder is selected; if the disorder of vector 1 is greater than that of vector 5, vector 1 is selected, together with the other vectors that were assigned to vector 1 when dividing with vector 1 and vector 5 as cluster centers. These vectors form a new text vector set. This process of dividing and calculating disorder is repeated until a preset condition is met. Note that all the vectors mentioned above are text vectors.
[0076] In a preferred embodiment of the present invention, the preset conditions include the disorder level being less than a preset threshold or the total number of partitioned subsets reaching a threshold number.
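The repeated division with these stopping conditions can be sketched as the loop below. Several parts are assumptions rather than the patented method: the disorder measure (mean distance from a subset's members to its center) stands in for the formula that is not available in this text, the overall disorder of a division is taken as the sum of the two subset values, and the names `split`, `disorder`, `bisect_once`, `cluster`, `max_disorder`, and `max_subsets` are hypothetical. The sketch expects `threshold` to be at least 2.

```python
import random
from itertools import combinations


def split(indices, dist, a, b):
    """Assign each index to the nearer of centers a and b (ties go to b)."""
    sub_a = [v for v in indices
             if v == a or (v != b and dist[v][a] < dist[v][b])]
    in_a = set(sub_a)
    return sub_a, [v for v in indices if v not in in_a]


def disorder(subset, center, dist):
    """Assumed measure: mean distance from the subset's members to its center."""
    return sum(dist[v][center] for v in subset) / len(subset)


def bisect_once(indices, dist, threshold, rng):
    """Try every pair of candidate centers; keep the division with the lowest overall disorder."""
    candidates = rng.sample(indices, min(threshold, len(indices)))
    best = None
    for a, b in combinations(candidates, 2):
        sub_a, sub_b = split(indices, dist, a, b)
        da, db = disorder(sub_a, a, dist), disorder(sub_b, b, dist)
        if best is None or da + db < best[0]:
            best = (da + db, (sub_a, da), (sub_b, db))
    return best[1], best[2]


def cluster(dist, threshold=10, max_disorder=1.0, max_subsets=8, seed=0):
    """Repeatedly divide the most disordered subset until a preset condition is met."""
    rng = random.Random(seed)
    finished, current = [], list(range(len(dist)))
    while len(current) >= 2:
        (sub_a, da), (sub_b, db) = bisect_once(current, dist, threshold, rng)
        # Keep the tidier subset; continue dividing the more disordered one.
        if da >= db:
            keep, current, worst = sub_b, sub_a, da
        else:
            keep, current, worst = sub_a, sub_b, db
        finished.append(keep)
        # Preset conditions: disorder below threshold, or enough subsets produced.
        if worst < max_disorder or len(finished) + 1 >= max_subsets:
            break
    finished.append(current)
    return finished
```

On six scalar points 0, 1, 10, 11, 50, 51 with absolute difference as the distance, this sketch first separates {50, 51} from the rest and then divides the remaining four points into {0, 1} and {10, 11}. Either stopping condition alone ends the loop, matching the "or" in the preset conditions.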
[0077] This invention accelerates text vector calculation during processing by modeling text that has undergone preliminary manual screening and sampling, extracting a lexicon, and pre-calculating word vectors for each word in the lexicon. Furthermore, the pre-screened text provides better feedback from human knowledge, resulting in higher model reliability. The distance between text vectors is calculated using dynamic programming, fully preserving the semantics of the text and the relative word order relationships between word vectors. Therefore, its calculation results are more accurate than methods such as direct text embedding and TF-IDF. Additionally, during clustering, this invention effectively prevents the model from getting trapped in local optima by using a combination of random search and enumeration.
[0078] It should be noted that those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The program can be stored in a computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. The storage medium can be a magnetic disk, optical disk, read-only memory (ROM), or random access memory (RAM), etc. The embodiments of the computer program described above can achieve the same or similar effects as any of the corresponding foregoing method embodiments.
[0079] Furthermore, the method disclosed in the embodiments of the present invention can also be implemented as a computer program executed by a CPU, which may be stored in a computer-readable storage medium. When the computer program is executed by the CPU, it performs the functions defined in the method disclosed in the embodiments of the present invention.
[0080] Based on the above objectives, a second aspect of the embodiments of the present invention provides a text clustering apparatus. As shown in Figure 3, the apparatus 200 includes:
[0081] The module is configured to create a vocabulary list and calculate the word vector for each word in the vocabulary list.
[0082] The calculation module is configured to obtain the text vector of each text to be clustered and form a set of text vectors, and calculate the distance between every two text vectors in the set of text vectors;
[0083] The segmentation module is configured to randomly select a threshold number of text vectors from the text vector set as candidate center vectors, and, taking every two of the candidate center vectors in turn as a pair of center vectors, divide the text vectors into two categories;
[0084] The selection module is configured to take, among the divisions, the pair of center vectors with the lowest disorder, select from that pair the center vector with the highest disorder together with the text vectors classified under it, and repeat the previous step on the selected text vectors until a preset condition is met.
[0085] In view of the above objectives, a third aspect of the present invention provides a computer device. Figure 4 is a schematic diagram of an embodiment of the computer device provided by the present invention. As shown in Figure 4, the embodiment includes at least one processor 21 and a memory 22, the memory 22 storing computer instructions 23 executable on the processor, the instructions implementing the above method when executed by the processor.
[0086] In view of the above objectives, a fourth aspect of the present invention provides a computer-readable storage medium. Figure 5 is a schematic diagram of an embodiment of the computer-readable storage medium provided by the present invention. As shown in Figure 5, the computer-readable storage medium 31 stores a computer program 32 that, when executed by a processor, performs the methods described above.
[0087] Furthermore, the method disclosed in the embodiments of the present invention can also be implemented as a computer program executed by a processor, which may be stored in a computer-readable storage medium. When the computer program is executed by the processor, it performs the functions defined in the method disclosed in the embodiments of the present invention.
[0088] Furthermore, the above-described method steps and system units can also be implemented using a controller and a computer-readable storage medium for storing a computer program that enables the controller to perform the functions of the above-described steps or units.
[0089] Those skilled in the art will also understand that the various exemplary logic blocks, modules, circuits, and algorithm steps described in conjunction with the disclosure herein can be implemented as electronic hardware, computer software, or a combination of both. To clearly illustrate this interchangeability between hardware and software, the functionality of various illustrative components, blocks, modules, circuits, and steps has been generally described. Whether this functionality is implemented as software or as hardware depends on the specific application and the design constraints imposed on the system as a whole. Those skilled in the art can implement the functionality in various ways for each specific application, but such implementation decisions should not be construed as departing from the scope of the embodiments disclosed herein.
[0090] In one or more exemplary designs, functionality may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, functionality may be stored as one or more instructions or code on or transmitted via a computer-readable medium. Computer-readable media include computer storage media and communication media, including any medium that facilitates the transfer of a computer program from one location to another. Storage media may be any available medium accessible to a general-purpose or special-purpose computer. By way of example, and not limitation, computer-readable media may include RAM, ROM, EEPROM, CD-ROM or other optical disc storage devices, disk storage devices or other magnetic storage devices, or any other medium that may be used to carry or store the required program code in the form of instructions or data structures and is accessible to a general-purpose or special-purpose computer or a general-purpose or special-purpose processor. Furthermore, any connection may be appropriately referred to as computer-readable media. For example, if software is transmitted from a website, server, or other remote source using coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the aforementioned coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are all included in the definition of media. As used herein, disks and optical discs include compact discs (CDs), laser discs, optical discs, digital versatile discs (DVDs), floppy disks, and Blu-ray discs, where disks typically reproduce data magnetically, while optical discs reproduce data optically using lasers. Combinations of the above should also be included within the scope of computer-readable media.
[0091] The above are exemplary embodiments disclosed in this invention. However, it should be noted that various changes and modifications can be made without departing from the scope of the embodiments of this invention as defined by the claims. The functions, steps, and / or actions of the methods according to the disclosed embodiments described herein do not need to be performed in any particular order. Furthermore, although the elements disclosed in the embodiments of this invention may be described or claimed individually, they may be understood as multiple unless explicitly limited to a singular number.
[0092] It should be understood that, as used herein, the singular form “a” is intended to include the plural form as well, unless the context clearly supports an exception. It should also be understood that, as used herein, “and / or” refers to any and all possible combinations of one or more of the associated listed items.
[0093] The embodiment numbers disclosed in the above embodiments of the present invention are merely for description and do not represent the superiority or inferiority of the embodiments.
[0094] Those skilled in the art will understand that all or part of the steps of the above embodiments can be implemented by hardware or by a program instructing related hardware. The program can be stored in a computer-readable storage medium, such as a read-only memory, a disk, or an optical disk.
[0095] Those skilled in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope of the invention (including the claims) is limited to these examples. Within the framework of the invention, technical features of the above embodiments or of different embodiments can be combined, and many other variations of different aspects of the invention exist, which are not described in detail for the sake of brevity. Therefore, any omissions, modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the invention should be included within the protection scope of the invention.
Claims
1. A text clustering method, characterized in that it comprises the following steps:
building a vocabulary and calculating a word vector for each word in the vocabulary;
obtaining a text vector for each text to be clustered to form a text vector set, and calculating the distance between every two text vectors in the text vector set;
randomly selecting a threshold number of text vectors from the text vector set as candidate center vectors, and, taking the candidate center vectors two at a time as a group of center vectors, dividing the text vectors into two classes with each group in turn;
selecting, from the group of center vectors with the lowest disorder among the divisions, the center vector with the highest disorder and the text vectors of its class, and repeating the previous step on the selected text vectors until a preset condition is met;
wherein obtaining a text vector for each text to be clustered to form a text vector set, and calculating the distance between every two text vectors in the text vector set, comprises:
obtaining each text to be clustered and performing word segmentation on each text;
looking up the word vector of each word in the vocabulary in the order of the words after segmentation of each text, and using the set of word vectors of the words in a text as the text vector of that text;
combining the text vectors of all texts into the text vector set;
calculating the distance between any two text vectors in the text vector set;
wherein calculating the distance between any two text vectors in the text vector set comprises:
calculating the distance between any two text vectors using a recursive formula [formula not reproduced in the source] with boundary conditions [not reproduced in the source], where A is the distance between the two text vectors, B is the first text vector, n is the length of the first text vector, m is the length of the second text vector, and 0<i<n, 0<j<m;
wherein randomly selecting a threshold number of text vectors from the text vector set as candidate center vectors, and, taking the candidate center vectors two at a time as a group of center vectors, dividing the text vectors into two classes with each group in turn, comprises:
selecting two text vectors from the candidate center vectors as center vectors;
querying in turn the distance from each of the other text vectors to the two center vectors; if the distance from a text vector to the first center vector is less than its distance to the second center vector, assigning the text vector to the subset of the first center vector; if the distance from a text vector to the first center vector is greater than its distance to the second center vector, assigning the text vector to the subset of the second center vector, so that the text vectors are divided into two subsets;
repeating the previous step until the text vectors have been divided into two subsets multiple times, wherein the two center vectors selected each time are not exactly the same.
2. The method according to claim 1, characterized in that building a vocabulary and calculating a word vector for each word in the vocabulary comprises:
performing word segmentation on texts obtained by preliminary manual screening and sampling, ignoring numbers and random strings in the texts;
building the vocabulary from the segmented words;
calculating a word vector for each word in the vocabulary.
3. The method according to claim 1, characterized in that selecting, from the group of center vectors with the lowest disorder among the divisions, the center vector with the highest disorder and the text vectors of its class, and repeating the previous step on the selected text vectors until a preset condition is met, comprises:
calculating the disorder of the two subsets in each division;
selecting the center vector with the highest disorder from the group of center vectors with the lowest disorder;
taking the elements of the subset corresponding to the center vector with the highest disorder as a new text vector set;
randomly selecting a threshold number of text vectors from the new text vector set as candidate center vectors, and, taking the candidate center vectors two at a time as a group of center vectors, dividing the new text vectors into two classes with each group in turn;
selecting, from the group of center vectors with the lowest disorder among the divisions, the center vector with the highest disorder and the text vectors of its class, and repeating the previous step on the selected text vectors until the preset condition is met.
4. The method according to claim 3, characterized in that the preset condition comprises the disorder being less than a preset threshold, or the total number of divided subsets reaching a threshold number.
5. A text clustering apparatus, characterized in that the apparatus comprises:
a module configured to build a vocabulary and calculate a word vector for each word in the vocabulary;
a calculation module configured to obtain a text vector for each text to be clustered to form a text vector set, and to calculate the distance between every two text vectors in the text vector set;
a partitioning module configured to randomly select a threshold number of text vectors from the text vector set as candidate center vectors, and, taking the candidate center vectors two at a time as a group of center vectors, to divide the text vectors into two classes with each group in turn;
a selection module configured to select, from the group of center vectors with the lowest disorder among the divisions, the center vector with the highest disorder and the text vectors of its class, and to repeat the previous step on the selected text vectors until a preset condition is met;
wherein the calculation module is further configured to:
obtain each text to be clustered and perform word segmentation on each text;
look up the word vector of each word in the vocabulary in the order of the words after segmentation of each text, and use the set of word vectors of the words in a text as the text vector of that text;
combine the text vectors of all texts into the text vector set;
calculate the distance between any two text vectors in the text vector set;
wherein calculating the distance between any two text vectors in the text vector set comprises:
calculating the distance between any two text vectors using a recursive formula [formula not reproduced in the source] with boundary conditions [not reproduced in the source], where A is the distance between the two text vectors, B is the first text vector, n is the length of the first text vector, m is the length of the second text vector, and 0<i<n, 0<j<m;
wherein the partitioning module is further configured to:
select two text vectors from the candidate center vectors as center vectors;
query in turn the distance from each of the other text vectors to the two center vectors; if the distance from a text vector to the first center vector is less than its distance to the second center vector, assign the text vector to the subset of the first center vector; if the distance from a text vector to the first center vector is greater than its distance to the second center vector, assign the text vector to the subset of the second center vector, so that the text vectors are divided into two subsets;
repeat the previous step until the text vectors have been divided into two subsets multiple times, wherein the two center vectors selected each time are not exactly the same.
6. A computer device, characterized in that it comprises: at least one processor; and a memory storing computer instructions executable on the processor, wherein the instructions, when executed by the processor, implement the steps of the method according to any one of claims 1-4.
7. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1-4.
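The division and selection steps recited in the method claims can be illustrated with a minimal Python sketch. It is not the claimed implementation, and several points are assumptions made only for illustration: the pairwise distances are taken as a precomputed matrix `dist` (the recursive distance formula is not reproduced here), the disorder of a subset is taken to be its mean pairwise distance, the disorder of a group of two center vectors is taken to be the sum of the disorders of its two subsets, and a text vector equidistant from both centers is assigned to the first. All function and parameter names are hypothetical.

```python
import random
from itertools import combinations


def split_by_centers(items, dist, c1, c2):
    """Assign every non-center item to the nearer of the two centers."""
    first, second = [c1], [c2]
    for t in items:
        if t in (c1, c2):
            continue
        # Assumption: ties go to the first center (the claims leave ties open).
        (first if dist[t][c1] <= dist[t][c2] else second).append(t)
    return first, second


def disorder(subset, dist):
    """Assumed disorder measure: mean pairwise distance inside a subset."""
    pairs = list(combinations(subset, 2))
    if not pairs:
        return 0.0
    return sum(dist[a][b] for a, b in pairs) / len(pairs)


def best_split(items, dist, n_candidates, rng):
    """Try each pair of candidate centers; keep the lowest-disorder division."""
    k = max(2, min(n_candidates, len(items)))
    candidates = rng.sample(items, k)
    best = None
    for c1, c2 in combinations(candidates, 2):  # each pair is used once
        a, b = split_by_centers(items, dist, c1, c2)
        # Assumption: a group's disorder is the sum over its two subsets.
        score = disorder(a, dist) + disorder(b, dist)
        if best is None or score < best[0]:
            best = (score, a, b)
    return best[1], best[2]


def cluster(items, dist, n_candidates=4, max_subsets=3,
            disorder_threshold=0.5, seed=0):
    """Repeatedly divide the higher-disorder subset until a stop condition."""
    rng = random.Random(seed)
    done, current = [], list(items)
    while (len(current) >= 2
           and len(done) + 1 < max_subsets
           and disorder(current, dist) >= disorder_threshold):
        a, b = best_split(current, dist, n_candidates, rng)
        low, high = sorted((a, b), key=lambda s: disorder(s, dist))
        done.append(low)   # the lower-disorder subset is kept as a cluster
        current = high     # the higher-disorder subset is divided again
    return done + [current]
```

For instance, with items 0 to 5 standing for the values 0, 1, 2, 50, 51, 52, `dist[i][j]` set to the absolute difference of the values, all six items used as candidates, and `max_subsets=2`, the sketch separates the first three items from the last three. The two stop conditions in the `while` loop correspond to the disorder threshold and the subset-count threshold named as preset conditions.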