A real estate industry data standardization method, system, device and storage medium

By segmenting, vectorizing, and clustering real estate industry leads, and combining Bayesian and cosine similarity algorithms, a standard tag library is constructed and updated. This solves the problem of incomplete data standardization in real estate marketing and achieves comprehensive coverage and rapid iteration of business scenarios.

CN116127073B · Active · Publication Date: 2026-06-12 · 金茂云科技服务(北京)有限公司

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
金茂云科技服务(北京)有限公司
Filing Date
2023-02-21
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

The real estate marketing sector lacks effective data standardization solutions. Existing technologies rely on manually constructed standard tag sets, resulting in incomplete coverage of business scenarios and difficulty in updating and iterating.

Method used

By segmenting and filtering clues reported by real estate industry practitioners, vectorization and normalization are performed using a word vector generation model. A standard tag library is constructed by combining K-means and probe density clustering, and then the standard tag library is updated by mapping based on Bayesian algorithm and cosine similarity algorithm.

Benefits of technology

It achieves comprehensive coverage of real estate industry business scenarios with standard tag library, and can be updated and iterated quickly and accurately based on subsequent data, improving the accuracy and adaptability of data mapping.

✦ Generated by Eureka AI based on patent content.


Abstract

The embodiment of the application discloses a real estate industry data standardization method, system, device and storage medium. First, first clue content reported by a real estate industry practitioner is acquired, and a first key word is extracted therefrom. The first key word is subjected to vectorization processing, and the word vector is subjected to normalization processing to obtain a normalized vector. The normalized vector is subjected to first clustering processing and second clustering processing respectively, and a comprehensive clustering result is obtained according to the first clustering result and the second clustering result. A standard label library is constructed based on the comprehensive clustering result. The second clue content to be standardized is subjected to mapping processing by using the standard label library to obtain a mapping result. The standard label library is updated according to the mapping result to obtain an updated standard label library. The standard label library constructed by the real estate industry data standardization can realize comprehensive coverage of the business scenarios of the real estate industry, and can be updated and iterated according to subsequent data.

Description

Technical Field

[0001] This invention relates to the field of machine learning, specifically to a method, system, device, and storage medium for data standardization in the real estate industry.

Background Technology

[0002] Currently, the real estate marketing field lacks an effective data standardization solution. When business scenarios that require data standardization are involved, they are often circumvented by subjectively defining and adding product design tags, which may lead to a series of problems such as product functions not being able to effectively match actual business scenarios.

[0003] Most existing real estate data standardization technologies rely on business personnel to build standard tag sets (standardized tag sets) based on their personal experience. Such solutions make it difficult for standard tag sets to comprehensively cover all business scenarios, which can easily lead to inconsistencies between actual business scenarios and alternative content. Furthermore, due to their over-reliance on personal experience, subsequent real estate data is difficult to accurately map to the standard tag set, making it impossible to quickly and accurately update and iterate the standard tag set.

Summary of the Invention

[0004] To address this issue, embodiments of the present invention provide a method, system, device, and storage medium for data standardization in the real estate industry, thereby resolving the problems of incomplete coverage of business scenarios and difficulty in updating and iterating the standard label sets constructed by existing real estate data standardization technologies.

[0005] To achieve the above objectives, the embodiments of the present invention provide the following technical solutions:

[0006] According to a first aspect of the present invention, a method for standardizing data in the real estate industry is provided, the method comprising:

[0007] Retrieve the first clue content reported by real estate industry practitioners from the database;

[0008] Word segmentation is performed on the first clue content to obtain a word segmentation result, and the word segmentation result is filtered to obtain the filtered first key words;

[0009] Based on the preset word vector generation model, the first key word segmentation is vectorized to obtain the corresponding first word vector, and the first word vector is normalized to obtain each normalized vector.

[0010] The normalized vector is subjected to a first clustering process and a second clustering process respectively to obtain a first clustering result and a second clustering result. Based on the first clustering result and the second clustering result, a comprehensive clustering result is obtained.

[0011] Based on the comprehensive clustering results, a standard label library is constructed;

[0012] The second key word is extracted from the second clue content to be standardized, and the second key word is mapped using the standardized tags in the standard tag library to obtain the mapping result;

[0013] Based on the mapping result, the standard tag library is updated to obtain the updated standard tag library.
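Paragraph [0009] does not state which normalization is applied to the word vectors. As a minimal sketch, assuming L2 (unit-length) normalization, which is a common choice before distance-based clustering, the step could look like this. The function name `l2_normalize` is hypothetical and not taken from the patent.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        # A zero vector has no direction; return it unchanged.
        return list(vec)
    return [x / norm for x in vec]

normalized = l2_normalize([3.0, 4.0])
```

After this step, every non-zero vector has length 1, so distances between vectors depend only on direction.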

[0014] Further, the first clue content is used for word segmentation to obtain word segmentation results. The word segmentation results are then filtered to obtain the filtered first key word segmentation, including:

[0015] A first word segmentation process is performed on the first clue content to obtain the first word segments;

[0016] For each of the first word segments, the first TF-IDF value of the first word segment is calculated based on the word frequency of the first word segment;

[0017] Determine whether the first TF-IDF value is greater than the first preset filtering threshold;

[0018] If the first TF-IDF value is less than or equal to the first preset filtering threshold, then the first word segment corresponding to the first TF-IDF value is discarded;

[0019] If the first TF-IDF value is greater than the first preset filtering threshold, then the first word segment corresponding to the first TF-IDF value is used as the first key word segment.
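The filtering in [0015] to [0019] can be sketched as follows. This is an illustration only: the text says the TF-IDF value is computed from word frequency but does not give the exact variant, so the sketch assumes relative term frequency multiplied by the natural log of (documents / documents containing the word). The function name `tfidf_filter`, the sample tokens, and the threshold value are hypothetical.

```python
import math
from collections import Counter

def tfidf_filter(docs, threshold):
    """docs: list of already-segmented token lists.
    Returns, per document, the tokens whose TF-IDF exceeds the threshold."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for doc in docs for tok in set(doc))
    kept = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        keep = []
        for tok, c in counts.items():
            tf = c / total
            idf = math.log(n_docs / df[tok])
            if tf * idf > threshold:   # at or below the threshold: discard
                keep.append(tok)
        kept.append(keep)
    return kept

docs = [
    ["customer", "interested", "unit", "follow"],
    ["customer", "visited", "showroom", "follow"],
    ["customer", "asked", "price", "follow"],
]
key_tokens = tfidf_filter(docs, threshold=0.1)
```

Words that occur in every clue, such as "customer" and "follow" here, receive an IDF of zero and are dropped, which is the intended effect of the threshold.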

[0020] Further, the normalized vector is subjected to a first clustering process and a second clustering process respectively to obtain a first clustering result and a second clustering result. Based on the first clustering result and the second clustering result, a comprehensive clustering result is obtained, including:

[0021] Set a preset number of cluster centroids in advance;

[0022] For each normalized vector, based on the distance between the normalized vector and each cluster centroid, the normalized vector is included in the first cluster set corresponding to the nearest cluster centroid, thus obtaining the first cluster set after clustering is completed.

[0023] Determine whether the first cluster set meets the preset clustering criteria;

[0024] If the first cluster set does not meet the preset clustering criteria, then for each of the first cluster sets, based on each normalized vector in the first cluster set, a preset number of cluster centroids are recalculated, and clustering is performed again;

[0025] If the first cluster set meets the preset clustering criteria, then each of the first cluster sets is taken as the first clustering result.
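Paragraphs [0021] to [0025] read like the usual K-means iteration. The sketch below is written under two assumptions that the text does not state: distance is Euclidean, and the "preset clustering criteria" are met when assignments stop changing. The function name `kmeans` and the sample vectors are hypothetical.

```python
import math

def kmeans(vectors, centroids, max_iter=100):
    """Assign each vector to its nearest centroid and recompute centroids
    until the assignment is stable. Returns clusters as lists of indices."""
    centroids = [list(c) for c in centroids]  # do not mutate the caller's list
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(v, centroids[k]))
            for v in vectors
        ]
        if new_assignment == assignment:  # assumed stopping criterion
            break
        assignment = new_assignment
        for k in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    clusters = [[] for _ in centroids]
    for i, a in enumerate(assignment):
        clusters[a].append(i)
    return clusters

vectors = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
clusters = kmeans(vectors, centroids=[[0.0, 0.0], [1.0, 1.0]])
```

Each returned cluster holds indices into `vectors`, which makes it easy to look up the key word that produced each vector later on.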

[0026] Further, the normalized vector is subjected to a first clustering process and a second clustering process respectively to obtain a first clustering result and a second clustering result. Based on the first clustering result and the second clustering result, a comprehensive clustering result is obtained, which also includes:

[0027] Map all the normalized vectors to a one-dimensional coordinate system to obtain the corresponding vector coordinates;

[0028] For each of the vector coordinates, the current vector coordinate is taken as the starting vector coordinate, and it is determined whether the distance between the starting vector coordinate and the next vector coordinate of the starting vector coordinate is less than a preset distance;

[0029] If the distance between the starting vector coordinates and the next vector coordinates of the starting vector coordinates is less than a preset distance, a second cluster set is generated, and the starting vector coordinates and the next vector coordinates of the starting vector coordinates are used as cluster vector coordinates, and the cluster vector coordinates are included in the second cluster set.

[0030] If the distance between the starting vector coordinates and the next vector coordinates of the starting vector coordinates is greater than or equal to a preset distance, then the second cluster set is not generated;

[0031] The first label density of the second cluster set is calculated based on the distance between the coordinates of each clustering vector;

[0032] The next vector coordinate outside the clustering vector coordinates is taken as the candidate vector coordinates, and the second label density is calculated using the clustering vector coordinates and the candidate vector coordinates.

[0033] Determine whether the ratio of the density of the second label to the density of the first label is greater than or equal to a preset density ratio;

[0034] If the ratio of the second label density to the first label density is greater than or equal to the preset density ratio, then the candidate vector coordinate is used as a clustering vector coordinate and included in the second cluster set, and the process returns to the step of taking the next vector coordinate outside the clustering vector coordinates as the candidate vector coordinate.

[0035] If the ratio of the second label density to the first label density is less than a preset density ratio, then the candidate vector coordinates are not included in the second cluster set, and the second cluster set with completed clustering is obtained.

[0036] The second cluster set of each completed clustering is taken as the second clustering result;

[0037] Based on the preset clustering weights, the comprehensive clustering result is obtained using the first clustering result and the second clustering result.
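The second clustering process in [0027] to [0036] can be sketched as below. The text does not define "label density", so this sketch assumes density means the number of points divided by the span they cover. It also assumes that the coordinates are processed in sorted order and that a rejected candidate becomes the next starting coordinate. The names `density`, `density_clusters`, `max_gap`, and `min_ratio` are hypothetical. The weighted combination in [0037] is not sketched because the text does not say how the two results are combined.

```python
def density(coords):
    """Assumed density: points per unit of span (not defined in the source)."""
    span = coords[-1] - coords[0]
    return len(coords) / span if span > 0 else float("inf")

def density_clusters(coords, max_gap, min_ratio):
    coords = sorted(coords)
    clusters, i = [], 0
    while i < len(coords) - 1:
        # A cluster is started only if the next coordinate is close enough.
        if coords[i + 1] - coords[i] >= max_gap:
            i += 1
            continue
        cluster = [coords[i], coords[i + 1]]
        j = i + 2
        while j < len(coords):
            first = density(cluster)                  # density before adding
            second = density(cluster + [coords[j]])   # density with candidate
            # Equivalent to second / first < min_ratio, without dividing.
            if second < min_ratio * first:
                break
            cluster.append(coords[j])
            j += 1
        clusters.append(cluster)
        i = j
    return clusters

coords = [0.0, 0.1, 0.2, 0.3, 2.0, 2.1, 5.0]
second_result = density_clusters(coords, max_gap=0.5, min_ratio=0.5)
```

In this example the point at 5.0 joins no cluster, because adding it to the pair at 2.0 and 2.1 would reduce the density far below the allowed ratio.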

[0038] Furthermore, based on the comprehensive clustering results, a standard label library is constructed, including:

[0039] Generate an initial tag library;

[0040] For each cluster set in the comprehensive clustering result, a corresponding tag set is generated in the initial tag library;

[0041] The first key word segment corresponding to the normalized vector in each cluster set is used as the standardized label;

[0042] All the standardized tags are stored in their respective tag sets to obtain the standard tag library.
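As a rough illustration of [0039] to [0042], the tag library can be modeled as a dictionary that maps each cluster to a set of key words. The data layout, the function name `build_tag_library`, and the sample key words are assumptions made for this sketch.

```python
def build_tag_library(clusters, key_words):
    """clusters: lists of vector indices; key_words[i] is the key word
    whose normalized vector has index i."""
    library = {}  # the initial, empty tag library
    for cluster_id, members in enumerate(clusters):
        # One tag set per cluster, holding its key words as standardized tags.
        library[cluster_id] = {key_words[i] for i in members}
    return library

key_words = ["three-bedroom", "unit type", "price", "discount"]
library = build_tag_library([[0, 1], [2, 3]], key_words)
```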

[0043] Further, a second key word is extracted from the content of the second clue to be standardized. Using standardized tags from the standard tag library, the second key word is mapped to obtain the mapping result, including:

[0044] A second word segmentation process is performed on the second clue content to be standardized to obtain the second word segments;

[0045] For each of the second word segments, the second TF-IDF value of the second word segment is calculated based on the word frequency of the second word segment;

[0046] Determine whether the second TF-IDF value is greater than the second preset filtering threshold;

[0047] If the second TF-IDF value is less than or equal to the second preset filtering threshold, the second word segment corresponding to the second TF-IDF value will be discarded.

[0048] If the second TF-IDF value is greater than the second preset filtering threshold, then the second word segment corresponding to the second TF-IDF value is used as the second key word segment;

[0049] Using the preset word vector generation model, the second key word segmentation is vectorized to obtain the corresponding second word vector;

[0050] For each of the second word vectors, a first mapping result p is calculated by comparing the second word vector with each standardized tag in each of the tag sets. The formula for calculating the first mapping result p is as follows:

[0051]

[0052] Wherein, A represents the standardized label; W represents the second word vector; P(A|W) represents the probability that the second word vector W is mapped to the standardized label A; P(W|A) represents the probability that the second word vector W appears in the label set corresponding to the standardized label A; P(A) represents the probability that the standardized label A appears in the corresponding label set; P(W) represents the probability that the second word vector W appears in the standard label library; P(A|B) represents the probability that the second word vector W was previously mapped to the original standardized label B and is now mapped to the standardized label A;
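The formula referred to in [0050] does not appear in this text. For orientation only, the first four quantities defined in [0052] are the terms of Bayes' rule, which in its standard form reads as follows. How the additional term P(A|B) enters the patent's formula cannot be recovered from the surrounding text, so it is left out here.

```latex
P(A \mid W) = \frac{P(W \mid A)\, P(A)}{P(W)}
```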

[0053] The first mean value is calculated using the first mapping result p corresponding to the tag set to obtain the first mapping score corresponding to the tag set.

[0054] For each of the second word vectors, a second mapping result q is calculated by relating the second word vector to each standardized tag in each of the tag sets. The formula for calculating the second mapping result q is as follows:

[0055]

[0056] Where m is the preset weight parameter, and the other symbol (not reproduced in this text) represents the word vector corresponding to the standardized label;

[0057] The second mean value is calculated using the second mapping result q corresponding to the tag set to obtain the second mapping score corresponding to the tag set.

[0058] Based on each of the first mapping scores and each of the second mapping scores, and according to the preset mapping weights, the comprehensive mapping scores corresponding to the second word vectors are calculated respectively.
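The formula for q in [0055] is also missing from this text. The technical summary names cosine similarity, so the sketch below assumes q is m times the cosine similarity between the second word vector and a tag's word vector. It further assumes that the comprehensive score is a weighted sum of the two per-tag-set scores. The function names, the default weights, and the value 0.4 used for the first mapping score are placeholders, not values from the patent.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def second_mapping_score(word_vec, tag_vecs, m=1.0):
    """Mean of q over one tag set, with q assumed to be m * cosine similarity."""
    qs = [m * cosine(word_vec, t) for t in tag_vecs]
    return sum(qs) / len(qs)

def composite_score(first_score, second_score, w_first=0.5, w_second=0.5):
    """Weighted combination of the two mapping scores (assumed form)."""
    return w_first * first_score + w_second * second_score

s2 = second_mapping_score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
total = composite_score(0.4, s2)  # 0.4 stands in for a first mapping score
```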

[0059] Furthermore, based on the mapping result, the standard tag library is updated to obtain an updated standard tag library, including:

[0060] For each second word vector, the comprehensive mapping scores corresponding to the second word vector are compared to obtain the comparison results.

[0061] Based on the comparison results, the tag set corresponding to the highest comprehensive mapping score is taken as the target tag set of the second word vector;

[0062] The second key word corresponding to the second word vector is added to the target tag set to obtain the updated standard tag library.
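The update in [0060] to [0062] amounts to picking the tag set with the highest comprehensive score and adding the key word to it. A minimal sketch, assuming the tag library is a dictionary of sets and using hypothetical tag set names and scores:

```python
def update_library(library, key_word, scores):
    """scores: tag set id -> comprehensive mapping score for this key word.
    Adds the key word to the best-scoring tag set and returns its id."""
    target = max(scores, key=scores.get)
    library[target].add(key_word)
    return target

library = {"layout": {"three-bedroom"}, "price": {"discount"}}
target = update_library(library, "unit type 142", {"layout": 0.71, "price": 0.22})
```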

[0063] According to a second aspect of the present invention, a data standardization system for the real estate industry is provided, the system comprising:

[0064] The real estate content acquisition module is used to retrieve the first clue content reported by real estate industry practitioners from the database;

[0065] The content preprocessing module is used to perform word segmentation processing on the first clue content to obtain word segmentation results, and to filter the word segmentation results to obtain the filtered first key word;

[0066] The vectorization module is used to vectorize the first key word based on a preset word vector generation model to obtain the corresponding first word vector, and to normalize the first word vector to obtain various normalized vectors.

[0067] The vector clustering module is used to perform a first clustering process and a second clustering process on the normalized vector to obtain a first clustering result and a second clustering result, and to obtain a comprehensive clustering result based on the first clustering result and the second clustering result;

[0068] The tag library generation module is used to construct a standard tag library based on the comprehensive clustering results;

[0069] The tag mapping module is used to extract the second key word from the second clue content to be standardized, and to map the second key word using the standardized tags in the standard tag library to obtain the mapping result;

[0070] The tag library update module is used to update the standard tag library according to the mapping result, so as to obtain the updated standard tag library.

[0071] Further, the first clue content is used for word segmentation to obtain word segmentation results. The word segmentation results are then filtered to obtain the filtered first key word segmentation, including:

[0072] A first word segmentation process is performed on the first clue content to obtain the first word segments;

[0073] For each of the first word segments, the first TF-IDF value of the first word segment is calculated based on the word frequency of the first word segment;

[0074] Determine whether the first TF-IDF value is greater than the first preset filtering threshold;

[0075] If the first TF-IDF value is less than or equal to the first preset filtering threshold, then the first word segment corresponding to the first TF-IDF value is discarded;

[0076] If the first TF-IDF value is greater than the first preset filtering threshold, then the first word segment corresponding to the first TF-IDF value is used as the first key word segment.

[0077] Further, the normalized vector is subjected to a first clustering process and a second clustering process respectively to obtain a first clustering result and a second clustering result. Based on the first clustering result and the second clustering result, a comprehensive clustering result is obtained, including:

[0078] Set a preset number of cluster centroids in advance;

[0079] For each normalized vector, based on the distance between the normalized vector and each cluster centroid, the normalized vector is included in the first cluster set corresponding to the nearest cluster centroid, thus obtaining the first cluster set after clustering is completed.

[0080] Determine whether the first cluster set meets the preset clustering criteria;

[0081] If the first cluster set does not meet the preset clustering criteria, then for each of the first cluster sets, based on each normalized vector in the first cluster set, a preset number of cluster centroids are recalculated, and clustering is performed again;

[0082] If the first cluster set meets the preset clustering criteria, then each of the first cluster sets is taken as the first clustering result.

[0083] Further, the normalized vector is subjected to a first clustering process and a second clustering process respectively to obtain a first clustering result and a second clustering result. Based on the first clustering result and the second clustering result, a comprehensive clustering result is obtained, which also includes:

[0084] Map all the normalized vectors to a one-dimensional coordinate system to obtain the corresponding vector coordinates;

[0085] For each of the vector coordinates, the current vector coordinate is taken as the starting vector coordinate, and it is determined whether the distance between the starting vector coordinate and the next vector coordinate of the starting vector coordinate is less than a preset distance;

[0086] If the distance between the starting vector coordinates and the next vector coordinates of the starting vector coordinates is less than a preset distance, a second cluster set is generated, and the starting vector coordinates and the next vector coordinates of the starting vector coordinates are used as cluster vector coordinates, and the cluster vector coordinates are included in the second cluster set.

[0087] If the distance between the starting vector coordinates and the next vector coordinates of the starting vector coordinates is greater than or equal to a preset distance, then the second cluster set is not generated;

[0088] The first label density of the second cluster set is calculated based on the distance between the coordinates of each clustering vector;

[0089] The next vector coordinate outside the clustering vector coordinates is taken as the candidate vector coordinates, and the second label density is calculated using the clustering vector coordinates and the candidate vector coordinates.

[0090] Determine whether the ratio of the density of the second label to the density of the first label is greater than or equal to a preset density ratio;

[0091] If the ratio of the second label density to the first label density is greater than or equal to the preset density ratio, then the candidate vector coordinate is used as a clustering vector coordinate and included in the second cluster set, and the process returns to the step of taking the next vector coordinate outside the clustering vector coordinates as the candidate vector coordinate.

[0092] If the ratio of the second label density to the first label density is less than a preset density ratio, then the candidate vector coordinates are not included in the second cluster set, and the second cluster set with completed clustering is obtained.

[0093] The second cluster set of each completed clustering is taken as the second clustering result;

[0094] Based on the preset clustering weights, the comprehensive clustering result is obtained using the first clustering result and the second clustering result.

[0095] Furthermore, based on the comprehensive clustering results, a standard label library is constructed, including:

[0096] Generate an initial tag library;

[0097] For each cluster set in the comprehensive clustering result, a corresponding tag set is generated in the initial tag library;

[0098] The first key word segment corresponding to the normalized vector in each cluster set is used as the standardized label;

[0099] All the standardized tags are stored in their respective tag sets to obtain the standard tag library.

[0100] Further, a second key word is extracted from the content of the second clue to be standardized. Using standardized tags from the standard tag library, the second key word is mapped to obtain the mapping result, including:

[0101] A second word segmentation process is performed on the second clue content to be standardized to obtain the second word segments;

[0102] For each of the second word segments, the second TF-IDF value of the second word segment is calculated based on the word frequency of the second word segment;

[0103] Determine whether the second TF-IDF value is greater than the second preset filtering threshold;

[0104] If the second TF-IDF value is less than or equal to the second preset filtering threshold, the second word segment corresponding to the second TF-IDF value will be discarded.

[0105] If the second TF-IDF value is greater than the second preset filtering threshold, then the second word segment corresponding to the second TF-IDF value is used as the second key word segment;

[0106] Using the preset word vector generation model, the second key word segmentation is vectorized to obtain the corresponding second word vector;

[0107] For each of the second word vectors, a first mapping result p is calculated by comparing the second word vector with each standardized tag in each of the tag sets. The formula for calculating the first mapping result p is as follows:

[0108]

[0109] Wherein, A represents the standardized label; W represents the second word vector; P(A|W) represents the probability that the second word vector W is mapped to the standardized label A; P(W|A) represents the probability that the second word vector W appears in the label set corresponding to the standardized label A; P(A) represents the probability that the standardized label A appears in the corresponding label set; P(W) represents the probability that the second word vector W appears in the standard label library; P(A|B) represents the probability that the second word vector W was previously mapped to the original standardized label B and is now mapped to the standardized label A;

[0110] The first mean value is calculated using the first mapping result p corresponding to the tag set to obtain the first mapping score corresponding to the tag set.

[0111] For each of the second word vectors, a second mapping result q is calculated by relating the second word vector to each standardized tag in each of the tag sets. The formula for calculating the second mapping result q is as follows:

[0112]

[0113] Where m is the preset weight parameter, and the other symbol (not reproduced in this text) represents the word vector corresponding to the standardized label;

[0114] The second mean value is calculated using the second mapping result q corresponding to the tag set to obtain the second mapping score corresponding to the tag set.

[0115] Based on each of the first mapping scores and each of the second mapping scores, and according to the preset mapping weights, the comprehensive mapping scores corresponding to the second word vectors are calculated respectively.

[0116] Furthermore, based on the mapping result, the standard tag library is updated to obtain an updated standard tag library, including:

[0117] For each second word vector, the comprehensive mapping scores corresponding to the second word vector are compared to obtain the comparison results.

[0118] Based on the comparison results, the tag set corresponding to the highest comprehensive mapping score is taken as the target tag set of the second word vector;

[0119] The second key word corresponding to the second word vector is added to the target tag set to obtain the updated standard tag library.

[0120] According to a third aspect of the present invention, a data standardization device for the real estate industry is provided, the device comprising: a processor and a memory;

[0121] The memory is used to store one or more program instructions;

[0122] The processor is configured to run one or more program instructions to perform the steps of a real estate industry data standardization method as described in any of the preceding claims.

[0123] According to a fourth aspect of the present invention, a computer-readable storage medium is provided, on which a computer program is stored, wherein when executed by a processor, the computer program implements the steps of a real estate industry data standardization method as described in any of the preceding claims.

[0124] The embodiments of the present invention have the following advantages:

[0125] This invention discloses a method, system, device, and storage medium for standardizing data in the real estate industry. The method involves: retrieving first clues reported by real estate industry practitioners from a database; extracting first key words from the first clues; vectorizing the first key words based on a preset word vector model and normalizing the word vectors to obtain normalized vectors; performing first and second clustering processes on the normalized vectors, and obtaining a comprehensive clustering result based on the first and second clustering results; constructing a standard tag library based on the comprehensive clustering result; mapping the second clues to be standardized using the standard tag library to obtain mapping results; and updating the standard tag library based on the mapping results to obtain an updated standard tag library. This invention's standard tag library, constructed through real estate industry data standardization, can comprehensively cover the business scenarios of the real estate industry and can be updated and iterated based on subsequent data.

Attached Figure Description

[0126] To more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are merely exemplary, and those skilled in the art can derive other embodiments based on the provided drawings without creative effort.

[0127] The structures, proportions, sizes, etc. illustrated in this specification are only for the purpose of assisting those skilled in the art in understanding and reading the content disclosed herein, and are not intended to limit the conditions under which the present invention can be implemented. Therefore, they have no substantial technical significance. Any modifications to the structure, changes in the proportions, or adjustments to the size, without affecting the effects and objectives that the present invention can produce, should still fall within the scope of the technical content disclosed in the present invention.

[0128] Figure 1 is a schematic diagram of the logical structure of a real estate industry data standardization system provided in an embodiment of the present invention;

[0129] Figure 2 is a flowchart of a real estate industry data standardization method provided in an embodiment of the present invention;

[0130] Figure 3 is a schematic diagram of the preprocessing flow for clue content provided in an embodiment of the present invention;

[0131] Figure 4 is the first schematic diagram of the process of clustering normalized vectors provided in an embodiment of the present invention;

[0132] Figure 5 is the second schematic diagram of the process of clustering normalized vectors provided in an embodiment of the present invention;

[0133] Figure 6 is a schematic diagram of the process of constructing a standard tag library provided in an embodiment of the present invention;

[0134] Figure 7 is a schematic diagram of the process of updating a standard tag library provided in an embodiment of the present invention.

Detailed Implementation

[0135] The following specific embodiments illustrate the implementation of the present invention. Those skilled in the art can easily understand other advantages and effects of the present invention from the content disclosed in this specification. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0136] Referring to Figure 1, this invention provides a data standardization system for the real estate industry, which specifically includes: a real estate content acquisition module 1, a content preprocessing module 2, a vectorization module 3, a vector clustering module 4, a tag library generation module 5, a tag mapping module 6, and a tag library update module 7.

[0137] Furthermore, the real estate content acquisition module 1 is used to retrieve the first clue content reported by real estate industry practitioners from the database; the content preprocessing module 2 is used to perform word segmentation processing on the first clue content to obtain the word segmentation results, and filter the word segmentation results to obtain the filtered first key word segmentation; the vectorization module 3 is used to perform vectorization processing on the first key word segmentation based on a preset word vector generation model to obtain the corresponding first word vector, and normalize the first word vector to obtain each normalized vector; the vector clustering module 4 is used to perform first clustering processing and second clustering processing on the normalized vectors respectively to obtain the first clustering result and the second clustering result, and obtain the comprehensive clustering result based on the first clustering result and the second clustering result; the tag library generation module 5 is used to construct a standard tag library based on the comprehensive clustering result; the tag mapping module 6 is used to extract the second key word segmentation from the second clue content to be standardized, and use the standardized tags in the standard tag library to perform mapping processing on the second key word segmentation to obtain the mapping result; the tag library update module 7 is used to update the standard tag library according to the mapping result to obtain the updated standard tag library.

[0138] This invention discloses a data standardization system for the real estate industry. The system retrieves first clues reported by real estate industry practitioners from a database; extracts first key words from the first clues; vectorizes the first key words based on a preset word vector model and normalizes the word vectors to obtain normalized vectors; performs first and second clustering processes on the normalized vectors, and obtains a comprehensive clustering result based on the first and second clustering results; constructs a standard tag library based on the comprehensive clustering result; maps the second clues to be standardized using the standard tag library to obtain mapping results; and updates the standard tag library based on the mapping results to obtain an updated standard tag library. This invention's standard tag library, constructed through real estate industry data standardization, can comprehensively cover the business scenarios of the real estate industry and can be updated and iterated based on subsequent data.

[0139] Corresponding to the aforementioned real estate industry data standardization system, this invention also discloses a real estate industry data standardization method. The following details the real estate industry data standardization method disclosed in this invention, in conjunction with the aforementioned real estate industry data standardization system.

[0140] Referring to Figure 2, the following describes the specific steps of the real estate industry data standardization method provided by an embodiment of the present invention.

[0141] The real estate content acquisition module 1 retrieves from the database the first clue content reported by real estate industry practitioners.

[0142] The above steps specifically include: real estate sales personnel inputting lead data from the sales process into a database. This lead data includes multiple fields such as customer ID, follow-up ID, follow-up time, and lead content; extracting the specific content fields from the lead data in the database to obtain the first lead content, for example, the first lead content is "The customer has been contacted by phone, is interested in unit type 142, has a high intention, will notify us as soon as there is news, and will continue to follow up."
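The extraction step above can be sketched as follows. This is a minimal illustration only: the `LeadRecord` class, its field names, and the choice to skip blank content are assumptions made for the example and are not specified in the patent.

```python
from dataclasses import dataclass


@dataclass
class LeadRecord:
    # Hypothetical field names mirroring the fields listed in the text.
    customer_id: str
    follow_up_id: str
    follow_up_time: str
    lead_content: str


def extract_clue_contents(records):
    """Return the content field of each record, skipping blank entries
    (skipping blanks is an assumption, not stated in the source)."""
    return [r.lead_content.strip() for r in records if r.lead_content.strip()]


records = [
    LeadRecord("C001", "F001", "2023-02-01 10:15:00",
               "The customer has been contacted by phone and is interested in unit type 142."),
    LeadRecord("C002", "F002", "2023-02-01 10:20:00", "   "),
]
first_clue_contents = extract_clue_contents(records)
```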

[0143] The content preprocessing module 2 performs word segmentation on the first clue content to obtain the word segmentation results, which are then filtered to obtain the filtered first key word segments.

[0144] Referring to Figure 3, the above step specifically includes: first, performing first word segmentation on the first clue content to obtain the individual first word segments; for each first word segment, calculating its first TF-IDF value based on its term frequency; determining whether the first TF-IDF value is greater than a first preset filtering threshold; if the first TF-IDF value is less than or equal to the first preset filtering threshold, discarding the corresponding first word segment; if the first TF-IDF value is greater than the first preset filtering threshold, using the corresponding first word segment as a first key word segment. Here, TF-IDF (term frequency-inverse document frequency) is a commonly used weighting technique in information retrieval and data mining, where TF (term frequency) is the frequency of a term and IDF (inverse document frequency) is the inverse document frequency index.

[0145] For example, first word segmentation is applied to the first clue content "The customer has been contacted by phone, is interested in unit type 142, has a high level of interest, will notify you as soon as there is news, and will continue to follow up". The resulting first word segments are: "customer", "already", "phone", "conducted a follow-up", "interested", "142", "unit type", "high level of interest", "have", "news", "come out", "first", "time", "notification", "continue", and "follow up". The first TF-IDF value of each first word segment is calculated. All first word segments whose first TF-IDF value is less than or equal to the first preset filtering threshold are filtered out, and those whose first TF-IDF value is greater than the threshold are selected as first key word segments. The first key word segments obtained through the above steps are: "customer", "phone", "conducted a follow-up", "interested", "142", "unit type", "high level of interest", "news", and "notification". TF-IDF evaluates the importance of a word to a document in a document set or corpus: the importance of a word increases in proportion to the number of times it appears in the document, but decreases as its frequency across the corpus increases.
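The filtering rule can be illustrated with a small sketch. It assumes the common variant TF = count / document length and IDF = ln(N / document frequency); the patent does not state which TF-IDF variant or threshold it uses, and the toy segments and the threshold of 0.15 below are hypothetical.

```python
import math
from collections import Counter


def tfidf_filter(documents, threshold):
    """Keep, per document, the word segments whose TF-IDF exceeds the threshold.

    Assumed variant: tf = count / len(doc), idf = ln(n_docs / doc_freq).
    """
    n_docs = len(documents)
    # Document frequency: number of documents that contain each segment.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    kept = []
    for doc in documents:
        keywords = []
        for word, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[word])
            if tf * idf > threshold:  # segments at or below the threshold are discarded
                keywords.append(word)
        kept.append(keywords)
    return kept


# Hypothetical, already-segmented clue contents.
docs = [
    ["customer", "already", "phone", "unit type", "142"],
    ["customer", "already", "visit", "showroom"],
    ["customer", "already", "phone", "notification"],
]
key_words = tfidf_filter(docs, threshold=0.15)
```

With this toy corpus, segments that occur in every document (such as "customer") receive an IDF of zero and are always discarded, which is the intended effect of the filter.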

[0146] The vectorization module 3 performs vectorization processing on the first key word segments based on the preset word vector generation model to obtain the corresponding first word vectors. The first word vectors are then normalized to obtain the normalized vectors.

[0147] The above step specifically includes: based on a preset Word2Vec model, performing vectorization and dimensionality reduction on the first key word segments to obtain the first word vector corresponding to each first key word segment, and normalizing all first word vectors to obtain the normalized vectors. Word2Vec (word to vector) is a group of related models used to generate word vectors. A Word2Vec model maps each word to a vector, and these vectors can be used to represent the relationships between words. Normalization is an important step in data preprocessing for deep learning: it brings values onto a common scale and prevents small-magnitude values from being swamped by large ones.
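As a sketch of the normalization step, the snippet below scales each word vector to unit length. L2 normalization is an assumption, since the patent does not say which normalization is applied, and the two-dimensional vectors are hypothetical stand-ins for the output of a trained Word2Vec model.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)  # leave an all-zero vector unchanged
    return [x / norm for x in vector]


# Hypothetical word vectors; a real system would obtain them from the model.
word_vectors = {"customer": [3.0, 4.0], "phone": [0.0, 2.0], "142": [1.0, 0.0]}
normalized = {word: l2_normalize(vec) for word, vec in word_vectors.items()}
```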

[0148] The vector clustering module 4 performs first clustering and second clustering on the normalized vectors respectively to obtain the first clustering result and the second clustering result. Based on the first clustering result and the second clustering result, the comprehensive clustering result is obtained.

[0149] Referring to Figure 4 and Figure 5, the above step specifically includes the following. First, K-means clustering is performed. A preset number of cluster centroids is defined in advance based on the actual business scenario. Each normalized vector is assigned, based on its distance to each cluster centroid, to the first cluster set corresponding to the nearest cluster centroid, yielding the first cluster sets. Next, it is determined whether the first cluster sets meet the preset clustering criteria. If they do not, the preset number of cluster centroids is recalculated, for each first cluster set, from the normalized vectors in that first cluster set, and clustering is performed again. If they do, each first cluster set is used as the first clustering result.
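The K-means loop described above might look like the following sketch. The patent leaves the "preset clustering criteria" unspecified; here it is assumed to mean that the assignment of vectors to centroids no longer changes. The function name, the initial centroids, and the sample vectors are all hypothetical.

```python
import math


def kmeans(vectors, initial_centroids, max_iter=100):
    """Assign each vector to its nearest centroid, then recompute centroids,
    until assignments stop changing (assumed stopping criterion)."""
    centroids = [list(c) for c in initial_centroids]
    assignment = []
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(v, centroids[k]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for k in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    clusters = [[] for _ in centroids]
    for index, k in enumerate(assignment):
        clusters[k].append(index)  # store vector indices per cluster
    return clusters


vectors = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
clusters = kmeans(vectors, initial_centroids=[[0.0, 0.0], [1.0, 1.0]])
```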

[0150] Next, density-based clustering is performed. All normalized vectors distributed in the polar coordinate system are mapped to a one-dimensional horizontal coordinate system to obtain the corresponding vector coordinates. For each vector coordinate, the current vector coordinate is selected as the starting vector coordinate, and the next vector coordinate is probed to determine whether it belongs to the same cluster category. The specific strategy is as follows: determine whether the distance between the starting vector coordinate and the next vector coordinate is less than a preset distance; if it is, the two belong to the same cluster category, a second cluster set is generated with the starting vector coordinate as the center, and the starting vector coordinate and the next vector coordinate are included in the second cluster set as cluster vector coordinates; if the distance is greater than or equal to the preset distance, no second cluster set is generated. Then, the first label density of the second cluster set is calculated based on the distances between the cluster vector coordinates; the next vector coordinate outside the cluster vector coordinates is taken as the candidate vector coordinate, and the second label density is calculated using the cluster vector coordinates together with the candidate vector coordinate; it is determined whether the ratio of the second label density to the first label density is greater than or equal to a preset density ratio; if it is, the candidate vector coordinate is included in the second cluster set as a cluster vector coordinate, and the step of taking the next vector coordinate outside the cluster vector coordinates as the candidate vector coordinate is performed again; if the ratio is less than the preset density ratio, the candidate vector coordinate is not included in the second cluster set, and the second cluster set is complete. After density detection has been completed for all vector coordinates, each completed second cluster set is used as the second clustering result. Finally, based on the first and second clustering results, overlapping vectors are eliminated according to the preset clustering weights to obtain the comprehensive clustering result.
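The probing strategy can be sketched as below, under several assumptions that the patent does not settle: the one-dimensional coordinates are taken as given, "label density" is interpreted as points per unit of covered length, the first label density is recomputed from the current cluster before each probe, and probing resumes from the first coordinate not yet clustered. All names and parameter values are hypothetical.

```python
def density(points, eps=1e-9):
    """Assumed density measure: number of points per unit of covered length.
    eps avoids division by zero when all points coincide."""
    return len(points) / (points[-1] - points[0] + eps)


def probe_density_clusters(coords, max_gap, min_ratio):
    coords = sorted(coords)
    clusters, i = [], 0
    while i < len(coords) - 1:
        if coords[i + 1] - coords[i] >= max_gap:
            i += 1  # no cluster is generated from this starting coordinate
            continue
        cluster = [coords[i], coords[i + 1]]
        j = i + 2
        while j < len(coords):
            first = density(cluster)                  # first label density
            second = density(cluster + [coords[j]])   # density with the candidate
            if second / first < min_ratio:
                break  # candidate rejected; this cluster is complete
            cluster.append(coords[j])
            j += 1
        clusters.append(cluster)
        i = j  # continue probing from the first coordinate not yet clustered
    return clusters


second_clusters = probe_density_clusters(
    [0.0, 0.1, 0.2, 0.3, 2.0, 2.1, 2.2, 5.0], max_gap=0.5, min_ratio=0.5
)
```

In this example the isolated coordinate 5.0 ends up in no cluster, because it never lies within the preset distance of a starting coordinate.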

[0151] In this embodiment of the invention, K-means clustering and probe density clustering are performed on the normalized word vectors. The combined results of the two clustering methods are used to obtain a comprehensive clustering result, which can effectively improve the accuracy of word clustering results compared with a single clustering method.

[0152] The tag library generation module 5 constructs a standard tag library based on the comprehensive clustering results.

[0153] Referring to Figure 6, the above step specifically includes: first, generating an initial tag library; for each cluster set in the comprehensive clustering result, generating a corresponding tag set in the initial tag library according to the business scenario; using the first key word segment corresponding to each normalized vector in each cluster set as a standardized tag; and storing all standardized tags in the corresponding tag sets to obtain the standard tag library.

[0154] For example, if the comprehensive clustering result includes ten cluster sets, then, according to the business scenario, the same number of tag sets is generated in the initial tag library, in one-to-one correspondence with the cluster sets. The tag sets are: "External customer acquisition", "Channel customer acquisition", "Incoming call", "Outgoing call", "WeChat", "SMS", "Customer pool maintenance", "Other", "On-site reception", and "Data supplementation". The first key word segment corresponding to each normalized vector is placed in the corresponding tag set. Once all first key word segments have been placed, the standard tag library is obtained.
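A minimal sketch of the library construction follows. It assumes each cluster set is a list of vector indices and that the tag-set names are supplied by a person who knows the business scenario; the dictionary layout and the sample words are hypothetical.

```python
def build_tag_library(cluster_sets, set_names, key_words):
    """Create one named tag set per cluster set and store the key word
    segment behind each clustered vector as a standardized tag."""
    if len(cluster_sets) != len(set_names):
        raise ValueError("each cluster set needs exactly one tag-set name")
    library = {name: [] for name in set_names}  # the initial, empty tag library
    for name, cluster in zip(set_names, cluster_sets):
        for vector_index in cluster:
            word = key_words[vector_index]
            if word not in library[name]:  # avoid storing duplicate tags
                library[name].append(word)
    return library


key_words = ["phone", "call back", "WeChat", "message"]  # hypothetical segments
cluster_sets = [[0, 1], [2, 3]]                          # indices into key_words
library = build_tag_library(cluster_sets, ["Outgoing call", "WeChat"], key_words)
```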

[0155] This invention utilizes comprehensive clustering results to construct a standard tag library, enabling rapid and accurate tag classification and storage, and allowing the standard set to be expanded based on the discovery of new business scenarios.

[0156] The tag mapping module 6 extracts the second key word from the second clue content to be standardized, and uses standardized tags in the standard tag library to map the second key word to obtain the mapping result.

[0157] Referring to Figure 7, the above step specifically includes: performing second word segmentation on the new second clue content to be standardized to obtain second word segments; calculating the second TF-IDF value of each second word segment based on its term frequency; determining whether the second TF-IDF value is greater than a second preset filtering threshold; discarding the corresponding second word segment if the second TF-IDF value is less than or equal to the second preset filtering threshold; using the corresponding second word segment as a second key word segment if the second TF-IDF value is greater than the second preset filtering threshold; and using the preset word vector generation model to vectorize the second key word segments to obtain the corresponding second word vectors.

[0158] For each second word vector, a first mapping result p is calculated between the second word vector and each standardized tag in each tag set. The formula for calculating the first mapping result p is as follows:

[0159]

[0160] Where A represents the standardized tag; W represents the second word vector; P(A|W) represents the probability that the second word vector W is mapped to the standardized tag A; P(W|A) represents the probability that the second word vector W appears in the tag set corresponding to the standardized tag A; P(A) represents the probability that the standardized tag A appears in the corresponding tag set; P(W) represents the probability that the second word vector W appears in the standard tag library; and P(A|B) represents the probability that the second word vector W, previously mapped to an original standardized tag B, is now mapped to the standardized tag A.

[0161] A first mean operation is performed on the first mapping results p corresponding to each tag set to obtain the first mapping score of that tag set.

[0162] For each second word vector, a second mapping result q is calculated between the second word vector and each standardized tag in each tag set. The formula for calculating the second mapping result q is as follows:
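For orientation only: the first four quantities defined above are related by the standard Bayes rule shown below. How the fifth quantity, P(A|B), enters the formula for p cannot be recovered from this text, so it is left out of the sketch.

```latex
P(A \mid W) = \frac{P(W \mid A)\, P(A)}{P(W)}
```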

[0163]

[0164] Where m is the preset weight parameter, and the remaining symbol in the formula denotes the word vector corresponding to the standardized tag.

[0165] A second mean operation is performed on the second mapping results q corresponding to each tag set to obtain the second mapping score of that tag set. Based on the first mapping scores and the second mapping scores, the comprehensive mapping score of the second word vector for each tag set is calculated according to the preset mapping weights. In the preset mapping weights, the weight of the first mapping score is usually greater than that of the second mapping score.
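The two-path scoring might be sketched as follows. Because both formulas are missing from this text, the sketch makes explicit assumptions: p is taken to be the plain Bayes posterior P(W|A)P(A)/P(W) without the P(A|B) term, q is taken to be plain cosine similarity without the weight parameter m, every tag set is assumed to be non-empty, and the probabilities are supplied as precomputed inputs. The weights 0.6 and 0.4 and all data are hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def bayes_posterior(p_w_given_a, p_a, p_w):
    """Plain Bayes rule; the P(A|B) term of the source is not modeled here."""
    return p_w_given_a * p_a / p_w


def comprehensive_scores(word_vector, tag_sets, w_bayes=0.6, w_cosine=0.4):
    """tag_sets maps a tag-set name to a non-empty list of tags, each a dict
    with keys "vector", "p_w_given_a", "p_a" and "p_w" (assumed layout)."""
    scores = {}
    for name, tags in tag_sets.items():
        p_values = [bayes_posterior(t["p_w_given_a"], t["p_a"], t["p_w"]) for t in tags]
        q_values = [cosine_similarity(word_vector, t["vector"]) for t in tags]
        first_score = sum(p_values) / len(p_values)    # mean of p per tag set
        second_score = sum(q_values) / len(q_values)   # mean of q per tag set
        scores[name] = w_bayes * first_score + w_cosine * second_score
    return scores


tag_sets = {
    "Incoming call": [{"vector": [1.0, 0.0], "p_w_given_a": 0.5, "p_a": 0.5, "p_w": 0.5}],
    "WeChat": [{"vector": [0.0, 1.0], "p_w_given_a": 0.1, "p_a": 0.5, "p_w": 0.5}],
}
scores = comprehensive_scores([1.0, 0.0], tag_sets)
```

The Bayesian weight is set higher than the cosine weight here only to mirror the statement that the first mapping score usually carries the greater weight.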

[0166] This invention uses Bayesian and cosine similarity algorithms for multi-path mapping, introduces mapping weights, and calculates the comprehensive mapping score of the second word vector, effectively improving the accuracy of the mapping results.

[0167] The tag library update module 7 updates the standard tag library based on the mapping results, resulting in the updated standard tag library.

[0168] Referring to Figure 7, the above step specifically includes: for each second word vector, comparing the comprehensive mapping scores corresponding to the second word vector to obtain a comparison result; based on the comparison result, taking the tag set with the highest comprehensive mapping score as the target tag set of the second word vector; and adding the second key word segment corresponding to the second word vector to the target tag set to obtain the updated standard tag library.
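The update step reduces to picking the highest-scoring tag set and appending the new key word segment to it, as in this sketch. The dictionary-based library and the score values are hypothetical, and skipping segments already present in the target set is an added assumption.

```python
def update_tag_library(library, key_word, scores):
    """Add key_word to the tag set with the highest comprehensive score."""
    target = max(scores, key=scores.get)
    if key_word not in library[target]:  # assumed: do not store duplicates
        library[target].append(key_word)
    return target


library = {"Incoming call": ["phone"], "WeChat": ["message"]}
target = update_tag_library(library, "call back", {"Incoming call": 0.7, "WeChat": 0.06})
```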

[0169] The embodiments of the present invention also include secondary standardization derived from actual business scenarios.

[0170] The specific process of the above-mentioned secondary standardization is as follows: conduct business data analysis based on actual business scenarios and business data; for abnormal tag sets, formulate corresponding secondary standardization strategies based on the analysis results; and use the secondary standardization strategies to correct the abnormal tag sets.

[0171] For example, in the "customer pool maintenance" scenario, sales consultants refresh the status of leads to prevent them from falling into the public pool due to prolonged inactivity. In this scenario, to maximize efficiency, consultants often fill in less lead information, and due to variations in the actions of multiple consultants, the content may differ from the actual situation. In this case, neither Bayesian algorithms, cosine similarity algorithms, nor other algorithms can effectively standardize the data. Therefore, strategy standardization based on the actual data corresponding to the business scenario is necessary. Data review and analysis show that in the "customer pool maintenance" scenario, sales consultants typically update or follow up on one lead every 5 seconds, while in other scenarios, it usually takes minutes. Therefore, the secondary standardization strategy for "customer pool maintenance" is as follows: if the time interval between two adjacent leads followed up by a sales consultant does not exceed half a minute, and more than 20 leads are followed up within a day, then the corresponding lead content is standardized and included in the "customer pool maintenance" tag set.
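The example rule above can be expressed as a short check. The sketch adopts one possible reading of the rule: for a single consultant on a single day, if more than 20 leads were followed up, every follow-up that lies within half a minute of an adjacent follow-up is flagged for the "customer pool maintenance" tag set. The function name and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def flag_pool_maintenance(times, max_gap=timedelta(seconds=30), min_count=20):
    """times: one consultant's follow-up times for a single day.
    Returns one flag per follow-up, in chronological order."""
    times = sorted(times)
    flags = [False] * len(times)
    if len(times) <= min_count:  # the rule requires more than 20 follow-ups
        return flags
    for i in range(1, len(times)):
        if times[i] - times[i - 1] <= max_gap:
            flags[i - 1] = flags[i] = True  # both neighbours belong to the burst
    return flags


start = datetime(2023, 2, 1, 9, 0, 0)
burst = [start + timedelta(seconds=5 * i) for i in range(25)]  # one lead every 5 s
normal = [start + timedelta(hours=2)]                          # an ordinary follow-up
flags = flag_pool_maintenance(burst + normal)
```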

[0172] The embodiments of the present invention supplement the standardization of normal real estate data with secondary standardization, which effectively improves the standardization accuracy of some feature data.

[0173] This invention discloses a data standardization method for the real estate industry. The method involves: retrieving first clues reported by real estate industry practitioners from a database; extracting first key words from the first clues; vectorizing the first key words based on a preset word vector model and normalizing the word vectors to obtain normalized vectors; performing first and second clustering processes on the normalized vectors, and obtaining a comprehensive clustering result based on the first and second clustering results; constructing a standard tag library based on the comprehensive clustering result; mapping the second clues to be standardized using the standard tag library to obtain mapping results; and updating the standard tag library based on the mapping results to obtain an updated standard tag library. This invention's standard tag library, constructed through real estate industry data standardization, can comprehensively cover the business scenarios of the real estate industry and can be updated and iterated based on subsequent data.

[0174] In addition, embodiments of the present invention also provide a data standardization device for the real estate industry, the device comprising: a processor and a memory; the memory being used to store one or more program instructions; the processor being used to run one or more program instructions to perform the steps of a data standardization method for the real estate industry as described in any of the preceding embodiments.

[0175] In addition, embodiments of the present invention also provide a computer-readable storage medium storing a computer program, which, when executed by a processor, implements the steps of a real estate industry data standardization method as described in any of the preceding embodiments.

[0176] In this embodiment of the invention, the processor can be an integrated circuit chip with signal processing capabilities. The processor can be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.

[0177] The processor can implement or execute the various methods, steps, and logic block diagrams disclosed in the embodiments of this invention. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this invention can be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. The software modules can reside in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other storage media well established in the art. The processor reads information from the storage medium and, in conjunction with its hardware, completes the steps of the above methods.

[0178] The storage medium can be memory, such as volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.

[0179] Among them, non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory.

[0180] Volatile memory can be random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), Synchlink dynamic random access memory (SLDRAM), and direct Rambus random access memory (DRRAM).

[0181] The storage media described in the embodiments of the present invention are intended to include, but are not limited to, these and any other suitable types of memory.

[0182] Those skilled in the art will recognize that, in one or more of the examples above, the functions described in this invention can be implemented using a combination of hardware and software. When applied as software, the corresponding functions can be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium. Computer-readable media include computer storage media and communication media, wherein communication media include any medium that facilitates the transmission of computer programs from one place to another. Storage media can be any available medium accessible to general-purpose or special-purpose computers.

[0183] Although the present invention has been described in detail above with general descriptions and specific embodiments, modifications or improvements can be made to it, which will be obvious to those skilled in the art. Therefore, all such modifications or improvements made without departing from the spirit of the present invention fall within the scope of protection claimed by the present invention.

Claims

1. A real estate industry data standardization method, characterized in that the method includes: retrieving from a database first clue content reported by real estate industry practitioners, specifically: the real estate industry practitioners first enter the clue data generated during the sales process into the database, the clue data including customer ID, follow-up ID, follow-up time, and clue content, and the specific content field in the clue data in the database is extracted to obtain the first clue content; performing word segmentation on the first clue content to obtain a word segmentation result, and filtering the word segmentation result to obtain filtered first key word segments; vectorizing the first key word segments based on a preset word vector generation model to obtain the corresponding first word vectors, and normalizing the first word vectors to obtain normalized vectors; performing first clustering processing and second clustering processing on the normalized vectors respectively to obtain a first clustering result and a second clustering result, and obtaining a comprehensive clustering result based on the first clustering result and the second clustering result; constructing a standard tag library based on the comprehensive clustering result; extracting second key word segments from second clue content to be standardized, and mapping the second key word segments using the standardized tags in the standard tag library to obtain a mapping result; and updating the standard tag library based on the mapping result to obtain an updated standard tag library; wherein the first clustering processing and the second clustering processing include the following. First, K-means clustering is performed. A preset number of cluster centroids is defined in advance based on the actual business scenario. 
For each normalized vector, based on the distance between the normalized vector and each cluster centroid, the normalized vector is included in the first cluster set corresponding to the nearest cluster centroid, resulting in the first cluster sets. It is then determined whether the first cluster sets meet the preset clustering criteria. If they do not, then for each first cluster set, the preset number of cluster centroids is recalculated based on the normalized vectors in that first cluster set, and clustering is performed again. If they do, each first cluster set is used as the first clustering result. Next, density-based clustering is performed, mapping all normalized vectors distributed in the polar coordinate system to a one-dimensional horizontal coordinate system to obtain the corresponding vector coordinates. For each vector coordinate, the current vector coordinate is selected as the starting vector coordinate, and the next vector coordinate is probed to determine whether it belongs to the same cluster category. The specific strategy is as follows: determine whether the distance between the starting vector coordinate and the next vector coordinate is less than a preset distance; if it is, the two belong to the same cluster category, a second cluster set is generated with the starting vector coordinate as the center, and the starting vector coordinate and the next vector coordinate are included in the second cluster set as cluster vector coordinates; if the distance is greater than or equal to the preset distance, no second cluster set is generated; then, the first label density of the second cluster set is calculated based on the distances between the cluster vector coordinates. The next vector coordinate outside the cluster vector coordinates is taken as the candidate vector coordinate. The second label density is calculated using the cluster vector coordinates and the candidate vector coordinate. It is determined whether the ratio of the second label density to the first label density is greater than or equal to a preset density ratio. If the ratio is greater than or equal to the preset density ratio, the candidate vector coordinate is included in the second cluster set as a cluster vector coordinate, and the step of taking the next vector coordinate outside the cluster vector coordinates as the candidate vector coordinate is performed again. If the ratio is less than the preset density ratio, the candidate vector coordinate is not included in the second cluster set, resulting in a completed second cluster set. After density detection of all vector coordinates is completed, each completed second cluster set is used as the second clustering result. The comprehensive clustering result is obtained as follows: based on the first clustering result and the second clustering result, overlapping vectors are eliminated according to the preset clustering weights to obtain the comprehensive clustering result.

2. The real estate industry data standardization method of claim 1, wherein performing word segmentation on the first clue content to obtain a word segmentation result, and filtering the word segmentation result to obtain filtered first key word segments, includes: performing first word segmentation on the first clue content to obtain first word segments; for each first word segment, calculating a first TF-IDF value of the first word segment based on its term frequency; determining whether the first TF-IDF value is greater than a first preset filtering threshold; if the first TF-IDF value is less than or equal to the first preset filtering threshold, discarding the first word segment corresponding to the first TF-IDF value; and if the first TF-IDF value is greater than the first preset filtering threshold, using the first word segment corresponding to the first TF-IDF value as a first key word segment.

3. The real estate industry data standardization method of claim 2, wherein performing first clustering processing and second clustering processing on the normalized vectors respectively to obtain a first clustering result and a second clustering result, and obtaining a comprehensive clustering result based on the first clustering result and the second clustering result, includes: defining a preset number of cluster centroids in advance; for each normalized vector, including the normalized vector, based on the distance between the normalized vector and each cluster centroid, in the first cluster set corresponding to the nearest cluster centroid, to obtain the first cluster sets; determining whether the first cluster sets meet the preset clustering criteria; if the first cluster sets do not meet the preset clustering criteria, recalculating, for each first cluster set, the preset number of cluster centroids based on the normalized vectors in that first cluster set, and performing clustering again; and if the first cluster sets meet the preset clustering criteria, taking each first cluster set as the first clustering result.

4. The real estate industry data standardization method of claim 3, wherein performing first clustering processing and second clustering processing on the normalized vectors respectively to obtain a first clustering result and a second clustering result, and obtaining a comprehensive clustering result based on the first clustering result and the second clustering result, further includes: mapping all the normalized vectors to a one-dimensional coordinate system to obtain the corresponding vector coordinates; for each vector coordinate, taking the current vector coordinate as the starting vector coordinate, and determining whether the distance between the starting vector coordinate and the next vector coordinate after it is less than a preset distance; if the distance is less than the preset distance, generating a second cluster set, and including the starting vector coordinate and the next vector coordinate in the second cluster set as cluster vector coordinates; if the distance is greater than or equal to the preset distance, not generating the second cluster set; calculating the first label density of the second cluster set based on the distances between the cluster vector coordinates; taking the next vector coordinate outside the cluster vector coordinates as the candidate vector coordinate, and calculating the second label density using the cluster vector coordinates and the candidate vector coordinate; determining whether the ratio of the second label density to the first label density is greater than or equal to a preset density ratio; if the ratio is greater than or equal to the preset density ratio, including the candidate vector coordinate in the second cluster set as a cluster vector coordinate, and continuing with the step of taking the next vector coordinate outside the cluster vector coordinates as the candidate vector coordinate; if the ratio is less than the preset density ratio, not including the candidate vector coordinate in the second cluster set, to obtain a completed second cluster set; taking each completed second cluster set as the second clustering result; and obtaining the comprehensive clustering result from the first clustering result and the second clustering result according to the preset clustering weights.

5. The real estate industry data standardization method of claim 4, wherein constructing the standard tag library based on the comprehensive clustering result includes: generating an initial tag library; for each cluster set in the comprehensive clustering result, generating a corresponding tag set in the initial tag library; taking the first key word corresponding to each normalized vector in each cluster set as a standardized tag; and storing all the standardized tags in their respective tag sets to obtain the standard tag library.
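
The construction in claim 5 can be sketched in a few lines of Python. This is only an illustration under assumptions: cluster sets are taken to be lists of vector tuples, `keyword_of` is a hypothetical lookup from each normalized vector to its first key word, and the tag set names (`tag_set_0`, ...) are invented for the example.

```python
def build_tag_library(cluster_sets, keyword_of):
    """Create one tag set per cluster and fill it with the clustered key words."""
    library = {}  # the initial, empty tag library
    for index, cluster in enumerate(cluster_sets):
        tags = []
        for vector in cluster:
            keyword = keyword_of[vector]
            if keyword not in tags:  # keep each standardized tag once per tag set
                tags.append(keyword)
        library[f"tag_set_{index}"] = tags
    return library


keyword_of = {
    (1.0, 0.0): "price",
    (0.9, 0.1): "discount",
    (0.0, 1.0): "school district",
}
library = build_tag_library(
    [[(1.0, 0.0), (0.9, 0.1)], [(0.0, 1.0)]], keyword_of
)
```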

6. The real estate industry data standardization method of claim 5, wherein extracting the second key word from the second clue content to be standardized, and mapping the second key word using the standardized tags in the standard tag library to obtain the mapping result, includes: performing second word segmentation on the second clue content to be standardized to obtain second word segments; for each second word segment, calculating a second TF-IDF value of the second word segment based on the word frequency of the second word segment; determining whether the second TF-IDF value is greater than a second preset filtering threshold; if the second TF-IDF value is less than or equal to the second preset filtering threshold, discarding the second word segment corresponding to the second TF-IDF value; if the second TF-IDF value is greater than the second preset filtering threshold, taking the second word segment corresponding to the second TF-IDF value as a second key word; vectorizing the second key word using the preset word vector generation model to obtain a corresponding second word vector; for each second word vector, calculating a first mapping result p between the second word vector and each standardized tag in each tag set, the first mapping result p being calculated by a formula [formula not reproduced in the source text], wherein A represents the standardized tag, W represents the second word vector, and the remaining terms of the formula, whose symbols are likewise not reproduced, represent respectively: the probability that the second word vector W is mapped to the standardized tag A; the probability that the second word vector W appears in the tag set corresponding to the standardized tag A; the probability that the standardized tag A appears in its corresponding tag set; the probability that the second word vector W appears in the standard tag library; and the probability that the second word vector W, previously mapped to an original standardized tag B, is now mapped to the standardized tag A; calculating a first mean value of the first mapping results p corresponding to each tag set to obtain a first mapping score for that tag set; for each second word vector, calculating a second mapping result q between the second word vector and each standardized tag in each tag set, the second mapping result q being calculated by a formula [formula not reproduced in the source text], wherein m is a preset weight parameter and a further term, whose symbol is not reproduced, represents the word vector corresponding to the standardized tag; calculating a second mean value of the second mapping results q corresponding to each tag set to obtain a second mapping score for that tag set; and, based on each first mapping score and each second mapping score, calculating the comprehensive mapping score of the second word vector for each tag set according to preset mapping weights.
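
The score aggregation in claim 6 can be sketched as follows. Because the formula for q is not reproduced in this text, the sketch assumes q = m × cosine similarity between the second word vector and the tag's word vector, which is a guess consistent with the cosine similarity algorithm named in the summary and not a confirmed form. The first mapping scores are passed in as already computed values, since the formula for p is also missing. All function and parameter names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def second_mapping_scores(word_vector, tag_sets, m=1.0):
    """Mean of q over the tags of each tag set, assuming q = m * cosine similarity."""
    scores = {}
    for name, tag_vectors in tag_sets.items():
        q_values = [m * cosine_similarity(word_vector, t) for t in tag_vectors]
        scores[name] = sum(q_values) / len(q_values)
    return scores


def comprehensive_scores(first_scores, second_scores, w_first=0.5, w_second=0.5):
    """Weighted combination of the two mapping scores for each tag set."""
    return {
        name: w_first * first_scores[name] + w_second * second_scores[name]
        for name in second_scores
    }


tag_sets = {"price": [(1.0, 0.0), (1.0, 0.0)], "location": [(0.0, 1.0)]}
second = second_mapping_scores((1.0, 0.0), tag_sets)
# The first mapping scores below are placeholder values, not outputs of the patent's formula.
combined = comprehensive_scores({"price": 0.6, "location": 0.2}, second)
```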

7. The real estate industry data standardization method of claim 6, wherein updating the standard tag library according to the mapping result to obtain the updated standard tag library includes: for each second word vector, comparing the comprehensive mapping scores corresponding to the second word vector to obtain a comparison result; based on the comparison result, taking the tag set with the highest comprehensive mapping score as the target tag set of the second word vector; and adding the second key word corresponding to the second word vector to the target tag set to obtain the updated standard tag library.
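
The update step of claim 7 amounts to selecting the best-scoring tag set and adding the key word to it. A minimal Python sketch, assuming the library is a dictionary of tag lists and that ties are broken by dictionary order (the claim does not address ties); `update_tag_library` is a hypothetical name:

```python
def update_tag_library(library, keyword, comprehensive_scores):
    """Add the key word to the tag set with the highest comprehensive mapping score."""
    # max() returns the first key with the highest score when scores tie.
    target = max(comprehensive_scores, key=comprehensive_scores.get)
    tags = library.setdefault(target, [])
    if keyword not in tags:  # avoid storing the same tag twice
        tags.append(keyword)
    return target


library = {"tag_set_0": ["price"], "tag_set_1": ["school district"]}
target = update_tag_library(
    library, "discount", {"tag_set_0": 0.8, "tag_set_1": 0.1}
)
```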

8. A real estate industry data standardization system, characterized in that the system includes: a real estate content acquisition module, used to retrieve from a database the first clue content reported by real estate industry practitioners, wherein the practitioners first enter the clue data produced during the sales process into the database, the clue data including customer ID, follow-up ID, follow-up time, and clue content, and the clue content field of the clue data in the database is extracted to obtain the first clue content; a content preprocessing module, used to perform word segmentation on the first clue content to obtain word segmentation results, and to filter the word segmentation results to obtain the filtered first key word; a vectorization module, used to vectorize the first key word based on a preset word vector generation model to obtain the corresponding first word vector, and to normalize the first word vector to obtain the normalized vectors; a vector clustering module, used to perform a first clustering process and a second clustering process on the normalized vectors to obtain a first clustering result and a second clustering result, and to obtain a comprehensive clustering result based on the first clustering result and the second clustering result; a tag library generation module, used to construct a standard tag library based on the comprehensive clustering result; a tag mapping module, used to extract the second key word from the second clue content to be standardized, and to map the second key word using the standardized tags in the standard tag library to obtain the mapping result; and a tag library update module, used to update the standard tag library according to the mapping result to obtain the updated standard tag library; wherein the first clustering process and the second clustering process include: first, K-means clustering is performed, in which a predetermined number of cluster centroids is defined in advance based on the actual business scenario; for each normalized vector, based on the distance between the normalized vector and each cluster centroid, the normalized vector is included in the first cluster set corresponding to the nearest cluster centroid, yielding the first cluster sets; it is then determined whether the first cluster sets meet the predetermined clustering criteria; if the first cluster sets do not meet the criteria, the predetermined number of new cluster centroids is calculated from the normalized vectors in each first cluster set, and clustering is performed again; if the first cluster sets meet the criteria, each first cluster set is used as the first clustering result; next, density-based clustering is performed, in which all the normalized vectors distributed in the polar coordinate system are mapped to a one-dimensional horizontal coordinate system to obtain the corresponding vector coordinates; for each vector coordinate, the current vector coordinate is selected as the starting vector coordinate, and the next vector coordinate is probed to determine whether it belongs to the same cluster category, specifically: it is determined whether the distance between the starting vector coordinate and the next vector coordinate is less than a preset distance; if the distance is less than the preset distance, the two coordinates belong to the same cluster category, a second cluster set is generated with the starting vector coordinate as the center, and the starting vector coordinate and the next vector coordinate are included in the second cluster set as cluster vector coordinates; if the distance is greater than or equal to the preset distance, no second cluster set is generated; then the first label density of the second cluster set is calculated based on the distances between the cluster vector coordinates; the next vector coordinate outside the cluster vector coordinates is taken as the candidate vector coordinate, and the second label density is calculated using the cluster vector coordinates and the candidate vector coordinate; it is determined whether the ratio of the second label density to the first label density is greater than or equal to a preset density ratio; if the ratio is greater than or equal to the preset density ratio, the candidate vector coordinate is included in the second cluster set as a cluster vector coordinate, and the process returns to the step of taking the next vector coordinate outside the cluster vector coordinates as the candidate vector coordinate; if the ratio is less than the preset density ratio, the candidate vector coordinate is not included in the second cluster set, and the completed second cluster set is obtained; after density probing of all vector coordinates is completed, each completed second cluster set is used as the second clustering result; and the comprehensive clustering result is obtained as follows: based on the first clustering result and the second clustering result, overlapping vectors are eliminated according to the preset clustering weights to obtain the comprehensive clustering result.
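
The first (K-means) clustering process described in claim 8 can be sketched in plain Python. The claim leaves the "predetermined clustering criteria" unspecified; this sketch assumes the criterion is that assignments stop changing, with an iteration cap as a safeguard. Initial centroids are supplied by the caller, standing in for the number chosen from the business scenario. The name `kmeans` and the parameter `max_iter` are hypothetical.

```python
import math


def kmeans(vectors, initial_centroids, max_iter=100):
    """Assign vectors to the nearest centroid and recompute centroids until stable."""
    centroids = [tuple(c) for c in initial_centroids]
    assignment = None
    for _ in range(max_iter):
        # Assign every vector to the index of its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(v, centroids[k]))
            for v in vectors
        ]
        if new_assignment == assignment:  # assumed criterion: assignments unchanged
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its members (kept as-is if empty).
        for k in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:
                centroids[k] = tuple(sum(x) / len(members) for x in zip(*members))
    clusters = [
        [v for v, a in zip(vectors, assignment) if a == k]
        for k in range(len(centroids))
    ]
    return clusters, centroids


vectors = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
clusters, centroids = kmeans(vectors, [(0.0, 0.0), (10.0, 10.0)])
```

Each returned cluster corresponds to one first cluster set. Combining it with the density-based result requires the preset clustering weights, which the claim does not define in enough detail to sketch.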

9. A real estate industry data standardization device, characterized in that the device includes a processor and a memory; the memory is used to store one or more program instructions; and the processor is configured to run the one or more program instructions to perform the steps of the real estate industry data standardization method of any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, implements the steps of the real estate industry data standardization method of any one of claims 1 to 7.

Citation Information

Patent Citations

  • Online text education resource label generation method integrating multi-source knowledge

    CN110688461A

  • Class labeling method and device for financial data assets

    CN113204603A