A health portrait construction method based on data mining
By constructing an urban health indicator system and using multi-feature fusion technology, a hierarchical urban health profile is generated, which solves the problem of insufficient semantic information integration in existing technologies and realizes accurate description and planning reference of urban health status.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- YTO EXPRESS CO LTD
- Filing Date
- 2023-03-01
- Publication Date
- 2026-06-23
AI Technical Summary
Existing technologies lack effective methods for accurate profiling in the construction of healthy cities, and text mining algorithms fail to fully integrate semantic information, affecting the semantic coherence and accuracy of topic modeling.
A city health status indicator system is established, an integer programming model is constructed, redundant features are eliminated, and multi-feature fusion and label co-occurrence technology are used to generate a hierarchical city health profile. Label recommendations are then made in conjunction with user cognition.
It improves the semantic coherence and accuracy of urban health profiles, enables a multi-dimensional depiction of urban health status, provides an important reference for urban planning, mitigates the curse of dimensionality, and strengthens the correlation between features and labels.
Figure CN116739402B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to big data application technology, specifically to a technology for constructing a health profile based on data mining.
Background Technology
[0002] To address the series of health challenges brought about by rapid urbanization, the construction of healthy cities and healthy towns is necessary. To promote the construction of healthy cities, establishing a scientific and objective set of urban health measurement standards to comprehensively assess urban health levels and accurately profile healthy city construction is of significant guiding importance for current and high-quality urbanization development.
[0003] To achieve the above requirements, currently popular text mining algorithms are usually used. However, text mining algorithms do not effectively combine relevant semantic information during topic modeling, which seriously affects the semantic coherence of topics and the accuracy of text semantic representation.
[0004] Therefore, there is currently no very effective method in the industry to achieve a precise profile of healthy city construction.
Summary of the Invention
[0005] The following provides a brief overview of one or more aspects to offer a basic understanding of them. This overview is not an exhaustive summary of all conceived aspects, nor is it intended to identify key or decisive elements of all aspects, nor to define the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form to prepare for the more detailed descriptions that follow.
[0006] The purpose of this invention is to solve the above problems and provide a health profile construction method based on data mining, which can establish an urban health status indicator system and depict the urban health profile from economic, environmental, population and social dimensions.
[0007] The technical solution of this invention is as follows: This invention discloses a method for constructing health profiles based on data mining, the method comprising:
[0008] Step 1: Establish an urban health status indicator system applicable to the required region, and construct an integer programming model for maximizing urban economic influence;
[0009] Step 2: Collect data based on the urban health status index system established in Step 1, and remove redundant features with weak correlation to the labels according to the feature filtering theory. Construct a subset of attribute features of the nearest neighbor sample set, introduce information acceptance to obtain the optimal feature subset to reduce the curse of dimensionality, enhance the correlation between attribute features and feature labels, and construct the optimal feature subset.
[0010] Step 3: Consider the potential semantic information of the text, obtain the multi-feature fusion dynamic weight of web page text words, filter out the topic distribution and keyword set with high interpretability of urban health profile, realize the clustering of text words at the topic level, and generate urban health profile with hierarchical structure.
[0011] Step 4: Construct a multi-feature fusion tag set and a city-tag matrix. Improve the collaborative filtering recommendation algorithm based on user cognition and tag expansion. Extract a tag set reflecting the health of the city from massive tags. Reveal the importance of tags in the semantic features of the city profile through tag co-occurrence. Generate city tag recommendations and delve into the hierarchical structure to develop a fine-grained description of the city profile.
[0012] According to an embodiment of the health profile construction method based on data mining of the present invention, step 1 further includes:
[0013] Step 1-1: Collect urban health status indicators for the required regions and establish an urban health status indicator system based on these indicators;
[0014] Step 1-2: In the initial stage, select p county-level cities and 1 prefecture-level city. Determine whether the influence of the prefecture-level city on the p county-level cities exceeds the set threshold. If so, it means that county-level city i is driven by the development of prefecture-level city j. When the influence diffusion ends, a total of T diffusion stages are experienced. Establish the objective function and constraints of the integer programming model.
[0015] According to an embodiment of the health profile construction method based on data mining of the present invention, step 2 further includes:
[0016] Step 2-1: Using the required region as sample points, collect corresponding data from the indicator set C of the urban health status indicators, and standardize the collected data to obtain the initial dataset $\mathrm{Data}_0$ and construct the initial attribute feature set $X_0 = [x_1, \dots, x_n]$;
[0017] Step 2-2: Arrange the collected initial dataset $\mathrm{Data}_0$ in order, set the attribute feature-label threshold $\eta$, and calculate the mutual information entropy value $I(x_r, c)$ between attribute feature $x_r$ and each tertiary indicator c in the indicator set C. Attribute features whose mutual information entropy value $I(x_r, c)$ is below the threshold $\eta$ are filtered out, and the filtered attribute features of the sample are represented as $X_1 = [x_1, \dots, x_m]$ $(m < n)$, where m is the number of filtered attribute features and n is the number of features in the initial attribute feature set;
[0018] Step 2-3-1: Randomly collect K samples from the set of prefecture-level cities with economic influence obtained in Step 1 to form the city sample set City. Randomly select a sample city j from this set, and then select its k neighboring prefecture-level cities at the same level of influence and at different levels of influence. Calculate the sample distance $d(i,j)$ between sample city j and each other city i on a given attribute feature $x_r$;
[0019] Step 2-3-2: If the sample distance between city j and its same-level samples on attribute feature $x_r$ is less than the sample distance between city j and its different-level samples on $x_r$, update the attribute feature weight $w(x_{j,r})$ of city j, where the initial weight of the attribute feature is $w(x_{j,r}) = 0$; otherwise, re-draw sample cities and recalculate the sample distances and attribute feature weights;
[0020] Step 2-3-3: Based on the attribute feature weight $w(x_{j,r})$ of city j, match the weight $w(x_{j,r}, c_{sec})$ between the attribute feature and each secondary indicator $c_{sec}$. By iterating through all secondary indicators in the indicator set C, obtain the sum $w_{sum}(x_{j,r})$ of the weights of the city's attribute feature $x_{j,r}$ over all secondary indicators;
[0021] Step 2-3-4: Arrange the values $w_{sum}(x_{j,r})$ in order and iterate through r = 1, 2, ..., p, where p represents the number of features in the column of attribute feature $x_r$, to construct the attribute feature vector $u_j = [w_{sum}(j,1), w_{sum}(j,2), \dots, w_{sum}(j,p)]$ of city j. Perform dimensionality reduction on the attribute features of city j in the sample set City, and traverse all cities in the sample set City to construct an l-dimensional feature subset $S_l$ $(l < m < n)$;
[0022] Step 2-4: Screen the optimal feature subset.
[0023] According to an embodiment of the health profile construction method based on data mining of the present invention, step 3 further includes:
[0024] Step 3-1: Filter images, videos, hyperlinks and unknown interference information in the original web page document, find the maximum segmentation combination based on word frequency for word segmentation, use a stop word list to remove stop words, and construct a preprocessed document set Text;
[0025] Step 3-2-1: Set adjustment coefficients for the preprocessed document i∈Text;
[0026] Step 3-2-2: Adjust the importance of word j to document i, and allocate the score of word j based on the adjustment coefficient.
[0027] Step 3-3: Divide document i into q segments, find the segment containing word j in the document and denote it as b. After the preprocessing in step 3-1, segment b still has d candidate words and the position of word j becomes c. Calculate the position information weight of word j, then apply minimization normalization to obtain the final position information weight;
[0028] Step 3-4: Perform a weighted fusion of the word score, the position information weight, and the word dynamic weight $w_i(j)$ to obtain the multi-feature fusion dynamic weight of each word;
[0029] Step 3-5: Using the multi-feature fusion dynamic weights, traverse the word set $T_i^{dele}$ and use the Transformer encoder to encode the words of the text after stop-word filtering, obtaining a text representation of the word set;
[0030] Step 3-6: Match each word in the KeyWord set to its corresponding topic according to the topic-word distribution matrix.
[0031] According to an embodiment of the health profile construction method based on data mining of the present invention, step 3-5 further includes:
[0032] Step 3-5-1: Use a multi-layer bidirectional Transformer encoder to compute the context-aware representation of each word and obtain the word embedding representation;
[0033] Step 3-5-2: Use Word2Vec pre-training to convert each word into a real-valued vector, and fuse the pre-trained language model and word vectors based on a gating fusion strategy to obtain word embedding vectors;
[0034] Step 3-5-3: Calculate the initial keyword representation;
[0035] Step 3-5-4: Decode the final keyword representation using a decoder to construct a keyword set.
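The gating fusion strategy mentioned in step 3-5-2 can be illustrated with a minimal sketch. The text does not give the gate's formula, so the version below assumes a common form: an element-wise sigmoid gate computed from the concatenation of the two embeddings, which then blends the contextual (Transformer) vector with the static (Word2Vec) vector. The function name `gated_fusion` and all parameter values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_fusion(context_vec, static_vec, gate_weights, gate_bias):
    """Blend two equal-length embeddings with an element-wise sigmoid gate.

    Assumed form (not taken from the patent text):
        g = sigmoid(W [context; static] + b)
        fused = g * context + (1 - g) * static
    """
    concat = context_vec + static_vec  # list concatenation
    fused = []
    for k, (c, s) in enumerate(zip(context_vec, static_vec)):
        z = sum(w * x for w, x in zip(gate_weights[k], concat)) + gate_bias[k]
        g = sigmoid(z)
        fused.append(g * c + (1.0 - g) * s)
    return fused

# Hypothetical 2-dimensional embeddings of one word.
context_vec = [1.0, 0.0]   # stand-in for the contextual encoder output
static_vec = [0.0, 1.0]    # stand-in for the pre-trained word vector
gate_weights = [[0.0] * 4, [0.0] * 4]  # zero weights make every gate 0.5
gate_bias = [0.0, 0.0]
fused = gated_fusion(context_vec, static_vec, gate_weights, gate_bias)
```

With an untrained (all-zero) gate, both sources contribute equally; in practice the gate parameters would be learned together with the rest of the model.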
[0036] According to an embodiment of the health profile construction method based on data mining of the present invention, step 3-6 further includes:
[0037] Step 3-6-1: Calculate the initial topic distribution of document i in the text set Text;
[0038] Step 3-6-2: For document i, generate dynamic weights for words according to the topic-word distribution matrix;
[0039] Step 3-6-3: Update the probability that document i belongs to each topic in the document-topic distribution, then update the probability of words in the topic-word distribution for each topic k. Repeat until the result converges, then output it;
[0040] Step 3-6-4: When the update iteration count t satisfies t % moment = 0, adjust the topic vector of the current word j and the dynamic weight of the word, then return to step 3-6-3, where moment represents the iteration interval at which word weights are dynamically updated;
[0041] Step 3-6-5: For the word set WORD in document i, iterate through j = 1, 2, ..., Q and k = 1, 2, ..., K, repeating steps 3-6-2 to 3-6-4 until the document-topic distribution under K topics and the topic-word distribution matrix for Q words under each topic are obtained;
[0042] Step 3-6-6: For each document in the text set Text, repeat steps 3-6-2 to 3-6-5 to obtain the document-topic distribution $\theta = (\theta_1, \theta_2, \dots, \theta_{TextNum})$ and the topic-word distribution $\varphi = (\varphi_1, \varphi_2, \dots, \varphi_{TextNum})$. Arrange them in order, select M topics that are highly semantically related to the urban health dimension to form a topic set, and select N words under each topic to form a topic-word set.
[0043] According to an embodiment of the health profile construction method based on data mining of the present invention, step 3-6-2 further includes:
[0044] Step 3-6-2-1: Randomly assign a topic to each word in the document to perform initialization.
[0045] Step 3-6-2-2: For word j in the document that belongs to topic k, update the probability $\mu_{i,j}(k)$ that word j belongs to topic k, and iterate through k = 1, 2, ..., K to obtain the topic distribution of word j;
[0046] Step 3-6-2-3: Normalize each component $\varphi_i^k(j)$ of the topic distribution $\varphi_i(j)$ of word j to construct the topic vector of word j.
[0047] According to an embodiment of the health profile construction method based on data mining of the present invention, step 4 further includes:
[0048] Step 4-1: Use web crawling technology to obtain user Q&A describing the main health characteristics of the required region from the web platform. Following the text data preprocessing steps in Step 3-1, perform lexical standardization and filtering on the raw data to obtain a tag set.
[0049] Step 4-2-1: Define the user set, the city set, and the set of tags for users to annotate cities;
[0050] Step 4-2-2: Define the local weight of the label based on the number of times the city is labeled;
[0051] Step 4-2-3: Determine the global weight of the label based on the information gain between the information entropy of the sample set and the conditional entropy of the label, in order to measure the ability of the label to distinguish different cities;
[0052] Step 4-2-4: Semantic dimension weights are used to explain the semantic ambiguity of tags;
[0053] Step 4-2-5: Calculate the element values of the city-label matrix and construct the city-label matrix;
[0054] Step 4-3: Calculate the elements in the k×k dimensional tag co-occurrence matrix based on the elements of the city-tag matrix, and use the tag co-occurrence matrix as the tag similarity matrix based on user cognition;
[0055] Step 4-4: Calculate the label similarity matrix $Sim_2$ based on label semantics using the WordNet semantic dictionary;
[0056] Step 4-5: Merge the label similarity matrices $Sim_1$ and $Sim_2$ to obtain a new tag similarity matrix;
[0057] Step 4-6: Expand labels using the new label similarity matrix. For a tag $t_z$ that has been marked on city $c_j$ but not on city $c_i$, find the co-occurrence distribution between tag $t_z$ and all labels already marked on city $c_j$, and estimate the probability that tag $t_z$ will be marked on the unmarked city $c_i$;
[0058] Step 4-7: Calculate city similarity to obtain the city-city similarity matrix B. Based on the city-city similarity matrix B, generate city tag recommendations for users according to the collaborative filtering algorithm.
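To make steps 4-3 and 4-7 more concrete, here is a minimal sketch. The exact formulas are not given in the text, so it assumes three things: the tag co-occurrence matrix is the product of the city-tag matrix with its own transpose, city similarity is cosine similarity between city rows, and recommendation scores are similarity-weighted sums over other cities. All function names and the toy matrix are hypothetical.

```python
import math

def cooccurrence(city_tag):
    """k x k tag co-occurrence matrix (assumed here to be A^T A)."""
    k = len(city_tag[0])
    return [[sum(row[a] * row[b] for row in city_tag) for b in range(k)]
            for a in range(k)]

def cosine(u, v):
    """Cosine similarity between two city rows; 0.0 if either row is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend_tags(city_tag, target, top_n=1):
    """Rank tags the target city lacks by similarity-weighted votes of other cities."""
    sims = [cosine(city_tag[target], row) for row in city_tag]
    scores = {}
    for t in range(len(city_tag[0])):
        if city_tag[target][t] == 0:
            scores[t] = sum(sims[c] * city_tag[c][t]
                            for c in range(len(city_tag)) if c != target)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical city-tag matrix: rows are cities, columns are tags.
city_tag = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
]
co = cooccurrence(city_tag)
rec = recommend_tags(city_tag, target=0)
```

In this toy case city 0 is most similar to city 1, so the tag that city 1 carries and city 0 lacks (column 2) is recommended first. The weighted city-tag values from steps 4-2-2 to 4-2-5 would replace the 0/1 entries used here.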
[0059] According to an embodiment of the health profile construction method based on data mining of the present invention, step 4-4 further includes:
[0060] Step 4-4-1: Preprocess the label set and use WordNet to generate the synonym sets $s_p$ and $s_q$ of tag $t_p$ and tag $t_q$, which constitute a synonym pair $(s_p, s_q)$;
[0061] Step 4-4-2: Traverse all existing synonym pairs, retrieve the annotations of each synonym set, use text preprocessing methods to extract the annotations $G_p$ and $G_q$ of $s_p$ and $s_q$, calculate the semantic similarity between tags, and construct the tag similarity matrix $Sim_2$.
[0062] Compared with the prior art, the present invention has the following beneficial effects:
(1) The present invention establishes an indicator system for urban health status and, based on the dynamic diffusion process of urban economic influence, establishes an integer programming model for maximizing influence. It judges whether prefecture-level cities with higher economic development potential are able to influence lower-level county-level cities, thereby determining the set of prefecture-level cities with greater economic influence. It focuses on the health status of such cities and portrays the urban health profile from the dimensions of economy, environment, population and society, serving as an important reference for government urban planning.
(2) Currently popular text mining algorithms do not combine relevant semantic information well in the process of topic modeling, which seriously affects the semantic coherence of topics and the accuracy of text semantic representation. The present invention integrates multiple features of text information and dynamically generates word weights during iteration, preserving semantic quality. To better describe text word vectors, it uses gated recurrent units to extract context features and combines attention mechanisms to learn the importance of words in the text. It uses capsule networks to learn text features and generate keyword sets, thereby improving the accuracy and efficiency of text learning.
(3) Based on feature filtering theory, this invention removes redundant features with weak correlation to tags, constructs a subset of attribute features from the nearest neighbor sample set, and introduces information acceptance to obtain the optimal feature subset, effectively reducing the curse of dimensionality and enhancing the correlation between attribute features and feature tags.
Through the above processing, this invention extracts a set of tags reflecting urban health from massive tags, reveals the importance of tags in the semantic features of urban profiles through tag co-occurrence, and combines users' common understanding of urban health status to develop a fine-grained description of urban profiles in a hierarchical structure.
Attached Figure Description
[0063] The above-described features and advantages of the present invention will be better understood after reading the following detailed description of embodiments of the present disclosure in conjunction with the accompanying drawings. In the drawings, components are not necessarily drawn to scale, and components having similar related characteristics or features may have the same or similar reference numerals.
[0064] Figure 1 shows a flowchart of an embodiment of the health profile construction method based on data mining of the present invention.
[0065] Figure 2 shows a flowchart of the process of obtaining the optimal feature subset.
[0066] Figure 3 illustrates the process of generating tag recommendations for users based on user cognition and semantic information.
[0067] Figure 4 shows a schematic diagram of the GRU-attention-Capsule hybrid model structure.
[0068] Figure 5 shows a flowchart of the process of creating a faceted urban health profile.
Detailed Implementation
[0069] The present invention will now be described in detail with reference to the accompanying drawings and specific embodiments. It should be noted that the aspects described below with reference to the accompanying drawings and specific embodiments are merely exemplary and should not be construed as limiting the scope of protection of the present invention in any way.
[0070] Figure 1 shows a flowchart of an embodiment of the data mining-based health profile construction method of the present invention. With reference to Figure 1, the implementation steps of the method in this embodiment are described in detail below.
[0071] Step 1: Establish an urban health status indicator system applicable to the required region, and construct an integer programming model for maximizing urban economic influence.
[0072] Step 1 further includes the following processing.
[0073] Step 1-1: Collect urban health status indicators for the required region. These indicators include economic health, environmental health, population health, social health, and public health. Then, establish an urban health status indicator system based on these indicators.
[0074] Table 1 shows an example of an urban health status indicator system:
[0075]
[0076]
[0077] Step 1-2: In the initial stage, select p county-level cities and 1 prefecture-level city, and determine whether the influence of the prefecture-level city on the p county-level cities exceeds the set threshold. If so, county-level city i is driven by the development status of prefecture-level city j. By the time the influence diffusion ends, it has gone through a total of T diffusion stages.
[0078] During the diffusion process, at least one county-level city will be affected at each stage. If the influence on a county-level city reaches 1, it indicates that it will promote development.
[0079] Objective function:
[0080] Constraints:
[0081]
[0082]
[0083] In the objective function, N represents the set to which county-level city i belongs. The 0-1 decision variable determines whether county-level city i will be driven by the development status of prefecture-level city j; its expression is formula (1-3).
[0084]
[0085] This indicates that county-level city i is driven by the development of prefecture-level city j. $W_{ji}$ is the index of the effective influence of prefecture-level city j on county-level city i and is calculated according to formula (1-4).
[0086]
[0087] where $d_{ij}$ represents the Euclidean distance between the latitude and longitude coordinates of the center point of prefecture-level city j and those of the center point of the subordinate county-level city i; $I_{j,\mathrm{own}}$ represents the annual economic development level of prefecture-level city j, calculated according to formula (1-5).
[0088]
[0089] The quantities in formula (1-5) represent, respectively, the per capita disposable income level of prefecture-level city j over the past year, the average income level of the primary, secondary, and tertiary industries in prefecture-level city j, and the registered unemployment rate in prefecture-level city j.
[0090] $\tau_i$ in formula (1-4) is the development potential of county-level city i itself, given by the formula:
[0091]
[0092] where the connectivity efficiency is that of the urban network G. Graph G is an undirected connected graph whose node set VP consists of the prefecture-level cities and county-level cities under the jurisdiction of provinces and autonomous regions, connected by influence relationships.
[0093] The network connectivity efficiency is calculated as follows:
[0094]
[0095] where $\rho$ represents the city's current economic ranking, and $d_{pq}$ is the shortest path between city nodes in the network graph, calculated after performing max-min normalization on the geographical paths. The distance $d_{pq}$ of the edges adjacent to the county-level city is then reset, and the connectivity efficiency is calculated again using formula (1-7). The update formula for $d_{pq}$ is as follows:
[0096]
[0097] In formula (1-2), constraint 1 indicates that there are p county-level cities in the initial stage;
[0098] Constraint 2 states that for county-level city i to be driven by prefecture-level city j in stage t, the effective influence $W_{ji}$ of prefecture-level city j on county-level city i in stage t-1 must exceed 1; only then can it influence the next county-level city. E represents the set of edges formed by the city center of the prefecture-level city and the city center of the county-level city;
[0099] Constraint 3 indicates that the county-level city i that is being promoted for development can maintain a healthy development momentum.
[0100] Step 2: Collect data based on the urban health status index system established in Step 1, and remove redundant features with weak correlation to the labels according to the feature filtering theory. Construct a subset of attribute features of the nearest neighbor sample set, introduce information acceptance degree to obtain the optimal feature subset, effectively reduce the curse of dimensionality, enhance the correlation between attribute features and feature labels, and construct the optimal feature subset.
[0101] As shown in Figure 2, the specific processing procedure for step 2 is as follows.
[0102] Step 2-1: Using the required region as sample points, collect corresponding data from the indicator set C of the urban health status indicators, and standardize the collected data to obtain the initial dataset $\mathrm{Data}_0$, constructing the initial attribute feature set $X_0 = [x_1, \dots, x_n]$. If there are missing attribute values in the initial dataset, impute them using the mean of the column containing the missing values.
[0103] Step 2-2: Arrange the collected initial dataset $\mathrm{Data}_0$ in order, set three-quarters of the initial dataset as the attribute feature-label threshold $\eta$, and calculate, according to the following formula (2-5), the mutual information entropy value $I(x_r, c)$ between attribute feature $x_r$ and each tertiary indicator c in the indicator set C. Attribute features whose mutual information entropy value $I(x_r, c)$ is below the threshold $\eta$ are filtered out, and the filtered attribute features of the sample are represented as $X_1 = [x_1, \dots, x_m]$ $(m < n)$, where m is the number of filtered attribute features and n is the number of features in the initial attribute feature set;
[0104]
[0105] where $\rho(x_i)$ represents the marginal probability of attribute feature $x_i$ occurring in the initial attribute feature set $X_0$; $\rho(c)$ represents the marginal probability that a feature term in the entire initial attribute feature set $X_0$ belongs to the tertiary indicator c; $\rho(x_i, c)$ represents the frequency with which attribute feature $x_i$ appears in the tertiary indicator c; i = 1, 2, ..., n; c = 1, 2, 3, 4;
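The filtering in step 2-2 can be sketched as follows. Formula (2-5) is not reproduced in the text, so this example assumes the standard mutual information definition, $I(x, c) = \sum \rho(x, c) \log \frac{\rho(x, c)}{\rho(x)\rho(c)}$, estimated from counts over discretized values. The function names, the toy columns, and the numeric threshold `eta=0.5` are hypothetical.

```python
import math
from collections import Counter

def mutual_information(feature_values, labels):
    """Empirical mutual information between a discrete feature and a discrete label."""
    n = len(feature_values)
    joint = Counter(zip(feature_values, labels))
    px = Counter(feature_values)
    pc = Counter(labels)
    mi = 0.0
    for (x, c), count in joint.items():
        p_xc = count / n
        mi += p_xc * math.log(p_xc / ((px[x] / n) * (pc[c] / n)))
    return mi

def filter_features(columns, labels, eta):
    """Keep the names of features whose mutual information with the label is at least eta."""
    return [name for name, values in columns.items()
            if mutual_information(values, labels) >= eta]

# Hypothetical toy data: two binned features and a four-class indicator label.
labels = [1, 1, 2, 2, 3, 3, 4, 4]
columns = {
    "informative": [0, 0, 1, 1, 2, 2, 3, 3],  # determines the label exactly
    "noise":       [0, 1, 0, 1, 0, 1, 0, 1],  # independent of the label
}
kept = filter_features(columns, labels, eta=0.5)
```

Continuous indicators would need to be binned before counts like these are meaningful; the binning scheme is a separate choice the text does not specify.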
[0106] Step 2-3-1: Randomly collect K samples from the set of prefecture-level cities with economic influence obtained in Step 1 to form the city sample set City. Randomly select a sample city j from this set, and then select its k neighboring prefecture-level cities at the same level of influence and at different levels of influence. Calculate, according to formula (2-1), the sample distance $d(i,j)$ between sample city j and each other city i on a given attribute feature $x_r$;
[0107]
[0108] where $x_{i,r}$ is the value of sample city i on the attribute feature $x_r$ filtered in step 2-2, and m is the number of attribute features after filtering.
[0109] Step 2-3-2: If the sample distance between city j and its same-level samples on attribute feature $x_r$ is less than the sample distance between city j and its different-level samples on $x_r$, update the attribute feature weight $w(x_{j,r})$ of city j according to formula (2-2), where the initial weight of the attribute feature is $w(x_{j,r}) = 0$. Otherwise, re-draw sample cities and recalculate the sample distances and attribute feature weights.
[0110]
[0111] where $dis_{j,s \in City}(x_{j,r}, x_{s,r})$ represents the coupling between sample city j and its same-level nearest-neighbor sample city s on attribute feature $x_r$; the corresponding term for g represents the coupling between sample j and its non-same-level nearest-neighbor sample g on attribute feature $x_r$; h represents the h-th extraction; p(g) represents the probability that sample city g is extracted; and k represents the number of neighboring prefecture-level cities at the same or different levels of influence as city j.
[0112] The formula for calculating coupling is formula (2-3):
[0113]
[0114] $r_{\max}$ represents the maximum value in the column of attribute feature $x_r$ of city j, and $r_{\min}$ represents the minimum value in that column.
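Steps 2-3-1 and 2-3-2 resemble the classic Relief feature-weighting scheme. Since formulas (2-1) to (2-3) are not reproduced in the text, the sketch below substitutes standard Relief with one nearest same-level and one nearest different-level neighbor per draw, and a range-normalized difference standing in for the coupling. It is an illustration of the idea, not the patent's exact update; all names and data are hypothetical.

```python
import random

def diff(a, b, r_min, r_max):
    """Range-normalized difference between two values of one feature."""
    return abs(a - b) / (r_max - r_min) if r_max > r_min else 0.0

def relief_weights(samples, levels, draws, seed=0):
    """Classic Relief: reward features that separate levels, penalize ones that split peers."""
    rng = random.Random(seed)
    n, n_feat = len(samples), len(samples[0])
    lo = [min(s[r] for s in samples) for r in range(n_feat)]
    hi = [max(s[r] for s in samples) for r in range(n_feat)]

    def distance(a, b):
        return sum(diff(a[r], b[r], lo[r], hi[r]) for r in range(n_feat))

    weights = [0.0] * n_feat  # initial weights are zero, as in step 2-3-2
    for _ in range(draws):
        j = rng.randrange(n)
        same = [i for i in range(n) if i != j and levels[i] == levels[j]]
        other = [i for i in range(n) if levels[i] != levels[j]]
        hit = min(same, key=lambda i: distance(samples[j], samples[i]))
        miss = min(other, key=lambda i: distance(samples[j], samples[i]))
        # Update only when the same-level neighbor is closer; otherwise discard the draw.
        if distance(samples[j], samples[hit]) >= distance(samples[j], samples[miss]):
            continue
        for r in range(n_feat):
            weights[r] += (diff(samples[j][r], samples[miss][r], lo[r], hi[r])
                           - diff(samples[j][r], samples[hit][r], lo[r], hi[r])) / draws
    return weights

# Hypothetical cities: feature 0 tracks the influence level, feature 1 does not.
samples = [[0.1, 0.5], [0.2, 0.4], [0.9, 0.6], [1.0, 0.3]]
levels = ["low", "low", "high", "high"]
w = relief_weights(samples, levels, draws=20)
```

Feature 0 ends up with the larger weight because it separates the two influence levels while staying nearly constant within each level.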
[0115] Step 2-3-3: Based on the attribute feature weight $w(x_{j,r})$ of city j, match the weight $w(x_{j,r}, c_{sec})$ between the attribute feature and each secondary indicator $c_{sec}$. By iterating through all secondary indicators in the indicator set C, obtain the sum $w_{sum}(x_{j,r})$ of the weights of the city's attribute feature $x_{j,r}$ over all secondary indicators.
[0116] Step 2-3-4: Arrange the values $w_{sum}(x_{j,r})$ in order and iterate through r = 1, 2, ..., p, where p represents the number of features in the column of attribute feature $x_r$, to construct the attribute feature vector $u_j = [w_{sum}(j,1), w_{sum}(j,2), \dots, w_{sum}(j,p)]$ of city j. The information acceptance degree $\delta$ is taken as 0.75. Reduce the dimensionality of the attribute features of sample city j according to the following formula (2-6). Traverse all cities in the sample set City and construct an l-dimensional feature subset $S_l$ $(l < m < n)$;
[0117]
[0118] where m represents the number of attribute features in the filtered attribute feature matrix $X_1$, l is the number of attribute features after dimensionality reduction, $\delta$ is the information acceptance degree, and $u_j$ is the attribute feature vector of city j.
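One plausible reading of the information acceptance degree is a cumulative-contribution cutoff. Formula (2-6) is not reproduced in the text, so the sketch below rests on an assumption: keep the smallest number l of top-weighted features whose summed weights reach a share $\delta$ of the total. The function name and the example vector are hypothetical, and the weights are assumed to be positive.

```python
def reduce_by_acceptance(weights, delta=0.75):
    """Return indices of the top-weighted features whose cumulative share first reaches delta."""
    order = sorted(range(len(weights)), key=lambda r: weights[r], reverse=True)
    total = sum(weights)
    kept, running = [], 0.0
    for r in order:
        kept.append(r)
        running += weights[r]
        if running / total >= delta:
            break
    return kept

# Hypothetical summed weights w_sum(j, r) for one city, m = 4 features.
u_j = [0.5, 0.1, 0.3, 0.1]
kept = reduce_by_acceptance(u_j)  # l = len(kept)
```

Here two of the four features already carry 80% of the total weight, so l = 2 under the 0.75 setting described above.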
[0119] Step 2-4: The specific screening of the optimal feature subset is as follows:
[0120] For feature subset $S_l$, the forward search method is used to obtain the (l+1)-th dimension attribute feature. The final feature subset is determined by the criterion of maximizing the correlation $I(x_j, c)$ between attribute features and the label while minimizing the redundancy $I(x_i, x_j)$ between attribute features; the criterion, formula (2-4), is:
[0121]
[0122] where max(z) represents the objective function for selecting the optimal feature subset, and $S_{l+1} - S_l$ indicates the degree of influence of the (l+1)-th attribute feature on the other attribute features and the label.
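The criterion described in step 2-4 matches the general shape of minimum-redundancy maximum-relevance (mRMR) selection. Formula (2-4) is not reproduced in the text, so the sketch below assumes the common difference form: score each candidate by its relevance to the label minus its mean redundancy with the features already selected. The function name and all mutual-information values are hypothetical.

```python
def forward_select(relevance, redundancy, selected, candidates):
    """Pick the candidate maximizing relevance minus mean redundancy with the selected set."""
    def score(j):
        if not selected:
            return relevance[j]
        mean_red = sum(redundancy[i][j] for i in selected) / len(selected)
        return relevance[j] - mean_red
    return max(candidates, key=score)

# Hypothetical mutual-information values for four attribute features.
relevance = [0.9, 0.8, 0.6, 0.2]   # I(x_j, c)
redundancy = [                     # I(x_i, x_j), symmetric
    [0.0, 0.7, 0.1, 0.1],
    [0.7, 0.0, 0.2, 0.1],
    [0.1, 0.2, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]
selected = [0]                     # current subset S_l
next_feature = forward_select(relevance, redundancy, selected, [1, 2, 3])
```

Feature 1 is more relevant than feature 2 but nearly duplicates feature 0, so the search adds feature 2 to form $S_{l+1}$.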
[0123] The aforementioned feature reduction techniques essentially involve filtering the feature attributes from the original sample dataset, selecting the most effective and representative features to reduce the dimensionality of the data's feature attributes. During feature reduction, such as reducing high-dimensional data to low-dimensional data, data loss can occur. Furthermore, selecting too many features or retaining a high dimensionality after reduction can lead to significant computational costs and introduce noisy data. Feature selection algorithms can remove a large number of redundant and irrelevant feature attributes from the feature attribute space, filtering out noise interference in the dataset. Feature extraction, on the other hand, involves transforming and combining features on the original dataset to create entirely new features, primarily addressing the problems of excessively high sample attributes, large computational costs, and high data dimensionality.
[0124] Step 3: Consider the potential semantic information of the text, obtain the multi-feature fusion dynamic weights of web page text words, filter out the topic distribution and keyword set with high interpretability of urban health profile, realize the clustering of text words at the topic level, and generate urban health profile with hierarchical structure.
[0125] As shown in Figure 3 and Figure 5, step 3 further includes the following processing.
[0126] Step 3-1: Compile policy notices related to urban health in the relevant regions (such as the Yangtze River Delta region) published in the past year on the China Urban Statistical Yearbook webpage. Filter out images, videos, hyperlinks, and various unknown interference information in the original webpage documents. Use Jieba word segmentation technology to find the maximum segmentation combination based on word frequency for word segmentation. Use a stop word list to remove stop words and construct a preprocessed document set Text.
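A minimal sketch of the preprocessing in step 3-1 follows. It uses only the standard library, so a simple regular-expression tokenizer on English text stands in for the word-frequency-based segmentation; for Chinese pages a segmenter such as Jieba (for example `jieba.lcut`) would take its place. The stop-word list and sample document are hypothetical.

```python
import re

URL_RE = re.compile(r"https?://\S+")
TAG_RE = re.compile(r"<[^>]+>")

def preprocess(raw_html, stop_words):
    """Strip markup (including image tags) and links, tokenize, and drop stop words."""
    text = URL_RE.sub(" ", TAG_RE.sub(" ", raw_html))
    tokens = re.findall(r"[a-z]+", text.lower())  # stand-in for word segmentation
    return [t for t in tokens if t not in stop_words]

# Hypothetical stop-word list and web page fragment.
stop_words = {"the", "a", "of", "in", "and", "see"}
doc = '<p>The city improved <img src="a.png"> air quality, see https://example.org/x</p>'
tokens = preprocess(doc, stop_words)
```

Applying `preprocess` to every crawled page yields the preprocessed document set Text used in the later steps.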
[0127] Step 3-2-1: For each preprocessed document i∈Text, set the adjustment coefficient $x_{ij}$, where the global frequency of word j is calculated according to formula (3-2).
[0128]
[0129] Where $f_{ij}$ represents the frequency of word j in document i; $f_i^{\max}$ represents the frequency of the most frequent word in document i; and $f_i^{\mathrm{sum}}$ represents the total number of words in document i. The minimum global term frequency of word j in document i is set to 0.2 here, and the maximum global term frequency of word j in document i is set to 0.8.
[0130] Step 3-2-2: Adjust the importance of word j to document i according to formula (3-1), and allocate the score of word j based on the adjustment coefficient.
[0131]
[0132] Where $T_i^{dele}$ represents the word set of document i after filtering out stop words, and ζ represents the filtering coefficient of document i after filtering out stop words, taken here as the ratio of the total number of words before and after filtering. The two remaining terms in formula (3-1) are the adjustment factor when no filtering is performed and the adjustment factor after filtering.
[0133] Step 3-3: Divide document i into q segments, find the segment containing word j in the document and denote it as b; after the preprocessing in step 3-1, segment b still has d candidate words, and the position of word j becomes c. Calculate the positional information weight of word j according to formula (3-3). After undergoing minimization normalization processing, the final position information weights are obtained.
[0134]
[0135] Step 3-4: Perform a weighted fusion of the word score, the position information weight, and the word dynamic weight $w_i(j)$ from step 3-6-2-3 to obtain the multi-feature fusion dynamic weight of each word. Here λ1+λ2+λ3=1, and λ1, λ2, and λ3 are the weight coefficients of the word score, the position information weight, and the word dynamic weight, respectively.
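As a minimal sketch of this weighted fusion, the three per-word quantities can be combined linearly. The function name `fuse_weights` and the λ values (0.4, 0.3, 0.3) are illustrative assumptions; the source only requires that the coefficients sum to 1.

```python
def fuse_weights(score, position, dynamic, lambdas=(0.4, 0.3, 0.3)):
    """Linear fusion of word score, position weight, and dynamic weight."""
    l1, l2, l3 = lambdas
    if abs(l1 + l2 + l3 - 1.0) > 1e-9:
        raise ValueError("lambda coefficients must sum to 1")
    return l1 * score + l2 * position + l3 * dynamic

# Hypothetical values for a single word j in document i.
fused = fuse_weights(score=0.8, position=0.5, dynamic=0.2)
print(round(fused, 4))
```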
[0136] Step 3-5: Using the multi-feature fusion dynamic weights, traverse the word set $T_i^{dele}$ and use a Transformer encoder to encode the words of the text after stop-word filtering, obtaining the text representation of the word set $H = (h_1, h_2, \ldots, h_n)$. Specifically,
[0137] Step 3-5-1: Use a multi-layer bidirectional Transformer encoder to compute the context-aware representation of each word, obtaining the word embedding representation $e_q = (e_{q1}, e_{q2}, \ldots, e_{qn})$,
[0138] Step 3-5-2: Use Word2Vec pre-training to convert each word into a real-valued vector $e_w = (e_{w1}, e_{w2}, \ldots, e_{wn})$. Based on the gating fusion strategy, the pre-trained language model and word vectors are fused according to formulas (3-4) and (3-5) to obtain the word embedding vector $Y = (y_1, y_2, \ldots, y_n)$,
[0139] $M' = \mathrm{sigmoid}(W_1 e_w + W_2 e_q)$ (3-4)
[0140] $Y = (1 - M') \cdot e_w + M' \cdot e_q$ (3-5)
[0141] In formula (3-4), $W_1$ and $W_2$ are both weight vectors. The embedding vector $Y = (y_1, y_2, \ldots, y_n)$ is then fed into a pre-trained encoder to generate the text representation $H = (h_1, h_2, \ldots, h_n)$. $M'$ represents the normalization coefficient of the fused word embedding vector and the real-valued vector.
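Formulas (3-4) and (3-5) can be illustrated with a small element-wise sketch. Because the source describes $W_1$ and $W_2$ as weight vectors, the products are taken element-wise here; that reading, the function names, and the toy inputs are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(e_w, e_q, w1, w2):
    """Fuse a Word2Vec vector e_w with a contextual vector e_q."""
    # Formula (3-4): gate M' = sigmoid(W1 * e_w + W2 * e_q), element-wise.
    gate = [sigmoid(a * x + b * y) for a, x, b, y in zip(w1, e_w, w2, e_q)]
    # Formula (3-5): Y = (1 - M') * e_w + M' * e_q.
    return [(1 - m) * x + m * y for m, x, y in zip(gate, e_w, e_q)]

# With zero weights the gate is 0.5, so Y is the plain average of both vectors.
y = gated_fusion(e_w=[1.0, 0.0], e_q=[0.0, 1.0], w1=[0.0, 0.0], w2=[0.0, 0.0])
print(y)
```

The gate value stays in (0, 1), so each component of Y lies between the corresponding components of $e_w$ and $e_q$.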
[0142] Step 3-5-3: Calculate the initial keyword representation H' according to formula (3-6).
[0143]
[0144] Where V is the output unit obtained from the GRU-attention-Capsule hybrid model, and the dimension of the encoder's hidden state is set to 3 layers here. The initial keyword representation of each word is obtained according to formula (3-6). To combine the initial keyword representation with the initial text representation, the weighted representation in formula (3-7) is used to obtain the final keyword representation, which both covers the initial text and extracts the keywords.
[0145] Randomly set a scaling factor p, and calculate the final keyword representation based on the initial text representation H and keyword representation H' according to formula (3-7).
[0146] Step 3-5-4: Use a decoder to decode the final keyword representation and form the keyword set KeyWord;
[0147] Step 3-6: Using the topic-word distribution matrix obtained in step 3-6-5, match each word in the keyword set KeyWord to its corresponding topic, assuming that M topics are matched in total. The entries of the matrix give the probability of each word under each topic, from the Q-th word under the first topic to the Q-th word under the K-th topic.
[0148] For topic a, construct the topic space vector $T_a = (t_1, t_2, \ldots, t_n)$, where $t_n$ represents the n-th word belonging to topic a, so that the vector represents the word sequence belonging to that topic. Traverse the M topics in sequence and represent each with a space vector. Using the cosine similarity formula $\cos(T_a, T_b)$, calculate the similarity between topic space vectors $T_a$ and $T_b$, and repeatedly cluster and merge the two topics with high similarity until the termination condition is met. The final output is a framework model of the urban health profile with a faceted structure.
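The repeated cosine-similarity merging can be sketched as a simple agglomerative loop. Several details are assumptions made for illustration, since the source does not specify them: the termination condition is taken to be a similarity threshold of 0.9, merged topics are represented by a size-weighted mean vector, and the topic space vectors are treated as numeric vectors.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_topics(vectors, threshold=0.9):
    """Merge the most similar pair of topics until no pair reaches the threshold."""
    clusters = [([i], list(v)) for i, v in enumerate(vectors)]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = cosine(clusters[a][1], clusters[b][1])
                if s > best:
                    best, pair = s, (a, b)
        if best < threshold:          # assumed termination condition
            break
        a, b = pair
        ids = clusters[a][0] + clusters[b][0]
        na, nb = len(clusters[a][0]), len(clusters[b][0])
        # Size-weighted mean of the two merged topic vectors (assumed merge rule).
        centroid = [(na * x + nb * y) / (na + nb)
                    for x, y in zip(clusters[a][1], clusters[b][1])]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append((ids, centroid))
    return [ids for ids, _ in clusters]

# Topics 0 and 1 point in nearly the same direction; topic 2 is orthogonal to topic 0.
groups = merge_topics([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(sorted(groups))
```

Each returned group lists the indices of the original topics that ended up in one facet of the profile.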
[0149] Step 3-6 involves constructing a topic distribution and word set that are highly semantically related to urban health, specifically including the following steps.
[0150] Step 3-6-1: Calculate the initial topic distribution of document i in the text set Text, which gives the probability of document i under each of the K topics. For each topic k there is a word distribution of length Q, which gives the probability of each of the Q words under the k-th topic.
[0151] Step 3-6-2: For document i, generate dynamic weights for words according to the topic-word distribution matrix, specifically as follows:
[0152] Step 3-6-2-1: Randomly assign a topic to each word in the document to perform initialization.
[0153] Step 3-6-2-2: For word j in the document that belongs to topic k, update the probability $\mu_{i,j}(k)$ of word j belonging to topic k according to formula (3-8), and iterate through k = 1, 2, ..., K to obtain the topic distribution of word j.
[0154]
[0155] Formula (3-8) uses the document-topic probability of the k-th topic in document i and the topic-word distribution probability of the j-th word under topic k. α and β are the prior parameters, set to α = 0.01 and β = 0.01, which follow a Dirichlet distribution, and K represents the number of topics in document i.
[0156] Step 3-6-2-3: Normalize each component $\varphi_i^k(j)$ of the topic distribution $\varphi_i(j)$ of word j to construct the topic vector of word j, where K represents the number of topics and each component gives the word weight of word j in document i under the corresponding topic. Based on the JS divergence principle, measure the similarity between the topic vector of word j and the interference vector according to formula (3-9).
[0157]
[0158] Where the mean vector is the average of the topic vector and the interference vector, $D_{JS}$ is the JS divergence between the word's topic vector and the interference vector, and $D_{KL}$ is the KL divergence between the topic vector and the mean vector.
[0159] The similarity is then standardized to yield the dynamic weight of word j; the standardization uses the frequency of word j in topic k.
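The JS divergence referred to here can be illustrated with its standard textbook definition, which matches the description above (a mean vector and KL divergences against it). Whether formula (3-9) takes exactly this form is an assumption, and the natural logarithm is an arbitrary choice.

```python
import math

def kl(p, q):
    """KL divergence D_KL(p || q); terms with p_k = 0 contribute nothing."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def js_divergence(p, q):
    """JS divergence: average KL divergence of p and q against their mean vector."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Identical vectors give 0; vectors with disjoint support give the maximum, ln 2.
same = js_divergence([0.5, 0.5], [0.5, 0.5])
far = js_divergence([1.0, 0.0], [0.0, 1.0])
print(same, far)
```

A small divergence means the topic vector is close to the interference vector, so the word carries little topic-specific information.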
[0160] Step 3-6-3: Update the probability of document i belonging to each topic in the document-topic distribution according to formula (3-10), then update the probability of words in the topic-word distribution for each topic k according to formula (3-11), repeating until the result converges and is output;
[0161]
[0162]
[0163] Where $\mu_{i,j}(k)$ represents the probability that word j belongs to topic k.
[0164] Step 3-6-4: When the update iteration count t satisfies t%moment = 0, adjust the topic vector of the current word j and the dynamic weight of the word, then return to step 3-6-3. Here moment represents the iteration interval at which word weights are dynamically updated, and is set to 20.
[0165] The prior parameters are set to α = 0.01 and β = 0.01, which follow a Dirichlet distribution. According to formulas (3-12) and (3-13), the final topic-word distribution probability and document-topic distribution probability are estimated from the converged probabilities;
[0166]
[0167]
[0168] Step 3-6-5: For the word set WORD in document i, iterate through j = 1, 2, ..., Q and k = 1, 2, ..., K, repeating steps 3-6-2 to 3-6-4 until the document-topic distribution under the K topics and the topic-word distribution matrix of the Q words under each topic are obtained.
[0169] Step 3-6-6: For each document in the text set Text, repeat steps 3-6-2 to 3-6-5 to obtain the document-topic distribution $\theta = (\theta_1, \theta_2, \ldots, \theta_{TextNum})$ and the topic-word distribution $\varphi = (\varphi_1, \varphi_2, \ldots, \varphi_{TextNum})$. Arrange them in order, select M topics that are highly semantically related to the urban health dimension to form a topic set, and select N words under each topic to form a topic-word set.
[0170] Furthermore, the specific method for calculating the interference vector in step 3-6-2-3 above is as follows.
[0171] Given an input word sequence $H = (h_1, h_2, \ldots, h_s)$, the interference vector is obtained by mapping through a chaotic mapping function $f_e(x)$. For each character $h_i$ in the word sequence in training round t, the interference noise mapping formula is:
[0172] $f_e^{t+1}(h_i) = f_e^{t}(h_i) \times \tau \times \left(1 - f_e^{t}(h_i)\right)$ (3-14)
[0173] Where P is a random number that follows a uniform distribution on [0,1]; ψ represents the fixed probability of introducing interference noise, which is set to 4; τ represents the logistic parameter in [0,4]; $f_e^{t}(h_i)$ is the mapping representation of each character $h_i$ in the t-th training round; $f_e(h_s)$ is the chaotic mapping representation of the word $h_s$; and $h_s$ is the s-th character in the source word sequence.
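The recurrence in formula (3-14) is the logistic map, which can be iterated as in the sketch below. The starting value 0.2 and τ = 4.0 are illustrative assumptions. The random draw P and the noise probability ψ are not modelled, because the source does not show how they enter the formula.

```python
def logistic_step(x, tau=4.0):
    """One application of formula (3-14): x_{t+1} = x_t * tau * (1 - x_t)."""
    return x * tau * (1.0 - x)

def chaotic_sequence(x0, rounds, tau=4.0):
    """Iterate the logistic map from x0 for the given number of training rounds."""
    values = [x0]
    for _ in range(rounds):
        values.append(logistic_step(values[-1], tau))
    return values

# Hypothetical initial mapping value for one character h_i.
seq = chaotic_sequence(0.2, 5)
print([round(v, 4) for v in seq])
```

For τ in [0,4] and a starting value in [0,1], every iterate stays in [0,1], which keeps the interference values bounded.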
[0174] Furthermore, as shown in Figure 4, the GRU-attention-Capsule hybrid model uses gated recurrent units to extract contextual features and combines an attention mechanism to learn the importance of words in the text. It learns text features through a capsule network to generate a keyword set. The GRU-attention-Capsule hybrid model specifically includes the following modules.
[0175] Module 1: Global Feature Extraction Module
[0176] The word set $T_i^{dele}$ obtained after filtering stop words in step 3-2-2 has length N, and its serialized representation is $K_a = (k_1, k_2, \ldots, k_N)$;
[0177] $K_a$ is fed as the training-set input variable into a GRU model with 50 units. After updating and resetting through the update and reset gates, the output feature $h_t$ of the GRU at the current time is obtained;
[0178] An attention mechanism is introduced: the output feature $h_t$ of the GRU is fed into the attention mechanism to obtain the current hidden layer representation $v_t = \tanh(W_a h_t + b_a)$, where $W_a$ is the weight matrix and $b_a$ is the bias matrix;
[0179] The hidden layer representation $v_t$ is standardized by the softmax function to redistribute the word weights, and the weighted summation over the words yields the output vector of the attention mechanism.
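The attention step of Module 1 can be sketched as below, with the GRU outputs supplied as plain lists. One simplification is assumed so that softmax yields one weight per word: $W_a$ is taken to be a single row vector `w_a` and $b_a$ a scalar, which makes each $v_t$ a scalar score. The source describes them as matrices, so this is only one possible reading.

```python
import math

def attention_pool(hidden_states, w_a, b_a):
    """Score each GRU output, normalise the scores with softmax, and sum."""
    # v_t = tanh(w_a . h_t + b_a), one scalar score per time step (assumed shape).
    scores = [math.tanh(sum(w * h for w, h in zip(w_a, h_t)) + b_a)
              for h_t in hidden_states]
    peak = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]          # redistributed word weights
    dim = len(hidden_states[0])
    pooled = [sum(a * h_t[d] for a, h_t in zip(alphas, hidden_states))
              for d in range(dim)]
    return alphas, pooled

# Two hypothetical GRU outputs; zero parameters give uniform attention.
alphas, pooled = attention_pool([[1.0, 0.0], [0.0, 1.0]], w_a=[0.0, 0.0], b_a=0.0)
print(alphas, pooled)
```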
[0180] Module 2: Capsule Network Classification Module
[0181] The bottom-level capsules are set to 24, the dynamic routing iterations are 3, the top-level capsules are 11, and the output unit V dimension is 10.
[0182] The output vector $u_i$ serves as the input to the first capsule layer. Dynamic routing iteration between adjacent capsule layers L and L+1 proceeds as follows:
[0183] The transformation matrix $W_{ij}$ transforms the input capsule unit $u_i$ to obtain the prediction vector. A weighted summation of the prediction vectors yields the total information of the output capsule unit $v_j$ of capsule layer L+1, where $c_{ij}$ is the weighting coefficient of the prediction vector.
[0184] According to formula (3-15), nonlinear compression is applied to the output information $m_j$ to obtain the output capsule unit $v_j$ of capsule layer L+1;
[0185]
[0186] According to formula (3-16), calculate the dot product of the output capsule unit $v_j$ of capsule layer L+1 and the prediction vector of the input capsule unit $u_i$ of capsule layer L;
[0187]
[0188] Where $b_{ij}$ is the log-odds, initialized to 0.
[0189] When the directions of the prediction vector and the output capsule unit $v_j$ tend to be consistent, adjust the coupling coefficient $c_{ij}$ according to formula (3-17). The number of dynamic routing iterations is set to 3, and by continuously refining the coupling coefficient $c_{ij}$, the output capsule unit $v_j^{*}$ of the next capsule layer is obtained;
[0190]
[0191] The dynamic routing iteration algorithm in module 2 (capsule network classification module) is executed sequentially from the bottom capsule layer to the top capsule layer to obtain the output unit V of the final top capsule layer. The modulus |V| of the output unit V represents the classification probability of the corresponding category.
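The routing procedure of Module 2 can be sketched with the widely used routing-by-agreement formulation (squash nonlinearity, softmax coupling coefficients, dot-product agreement). Formulas (3-15) to (3-17) are not legible in the source, so treating them as this standard form is an assumption. The prediction vectors `u_hat` are supplied directly, and the toy sizes stand in for the 24 bottom-level and 11 top-level capsules.

```python
import math

def squash(m):
    """Nonlinear compression: keeps the direction, maps the length into [0, 1)."""
    norm_sq = sum(x * x for x in m)
    if norm_sq == 0.0:
        return [0.0] * len(m)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in m]

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def dynamic_routing(u_hat, iterations=3):
    """u_hat[i][j] is the prediction vector from lower capsule i to upper capsule j."""
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]     # log-odds b_ij, initialised to 0
    v = []
    for _ in range(iterations):
        c = [softmax(b_i) for b_i in b]          # coupling coefficients c_ij
        v = []
        for j in range(n_out):
            m_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]
            v.append(squash(m_j))
        for i in range(n_in):
            for j in range(n_out):
                # Agreement between the prediction vector and the output capsule.
                b[i][j] += sum(x * y for x, y in zip(u_hat[i][j], v[j]))
    return v

# Both lower capsules agree on upper capsule 0 and disagree on upper capsule 1.
u_hat = [[[1.0, 0.0], [0.0, 0.1]], [[1.0, 0.0], [0.1, 0.0]]]
v = dynamic_routing(u_hat)
lengths = [math.sqrt(sum(x * x for x in v_j)) for v_j in v]
print([round(l, 3) for l in lengths])
```

The vector lengths play the role of the modulus |V| described above: the capsule on which the lower layer agrees ends up with the larger length.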
[0192] The text mining algorithm used in step 3 mainly involves word vectorization of text data. Preprocessing is required before classifying the text data. Text preprocessing primarily includes two steps: word segmentation and stop word removal. Based on a pre-established stop word library, some meaningless words need to be removed, converting the text information into a set of words and characters. After stop word removal, the selected feature words are vectorized using a text representation model. To differentiate the degree of distinction between different feature words for each category, a weighted evaluation function is often constructed.
[0193] The idea behind text classification models based on capsule networks is to use capsules instead of neurons in CNNs, allowing the model to learn the pose information and spatial relationships between objects. It primarily uses dynamic routing algorithms for parameter updates from lower to higher levels, rather than pooling operations, thus avoiding information loss. A compression function is used instead of the ReLU activation function, with multiple vector neurons jointly determining the relationship with the whole text, because capsule networks can better learn the correlation information between local and overall text. First, the input vector is initialized with pre-trained word vectors. Features are extracted in the convolutional layers using multiple convolutional kernels of different scales. Then, pooling operations are used to extract the main features, and finally, a softmax classifier is used for classification.
[0194] Step 4: Construct a multi-feature fusion tag set and a city-tag matrix. Improve the collaborative filtering recommendation algorithm based on user cognition and tag expansion. Extract a tag set reflecting the health of the city from massive tags. Reveal the importance of tags in the semantic features of the city profile through tag co-occurrence. Generate city tag recommendations and delve into the hierarchical structure to develop a fine-grained description of the city profile.
[0195] The specific processing procedure for step 4 is as follows.
[0196] Step 4-1: Utilize web crawling technology to obtain user Q&A from web platforms describing the main health characteristics of the desired region (e.g., cities in the Yangtze River Delta region), covering public opinions on various cities across multiple dimensions such as economy, society, environment, and population health. Following the text data preprocessing steps described in Step 3-1, perform lexical standardization and filtering on the raw data to obtain a tag set T.
[0197] Step 4-2-1: Define the user set $U = \{u_1, u_2, \ldots, u_m\}$, the set of cities $City = \{c_1, c_2, \ldots, c_n\}$, and the set of labels with which users annotate cities $T = \{t_1, t_2, \ldots, t_k\}$,
[0198] Step 4-2-2: According to the number of times P(i,j) that label $t_i$ annotates city $c_j$, define the local weight $P_w(i,j)$ of the label;
[0199] $P_w(i,j) = \log_2(P(i,j) + 1)$ (4-7)
[0200] Step 4-2-3: Based on the information gain between the information entropy of the sample set and the conditional entropy $H(c|t_i)$ of the label, determine the global weight $T_w(i)$ of the label, which measures the ability of label $t_i$ to differentiate between cities;
[0201]
[0202] Wherein, CityNum represents the total number of cities in the Yangtze River Delta region's city sample;
[0203] Step 4-2-4: The semantic dimension weight $R_w(j)$ is used to explain the semantic ambiguity of labels;
[0204]
[0205] Where TagNum represents the total number of tags that have been labeled, and $H(c_j)$ represents the information entropy of city $c_j$;
[0206] Step 4-2-5: Calculate the element values o(i,j) of the city-label matrix O, and construct the city-label matrix O;
[0207] $o(i,j) = P_w(i,j) \times T_w(i) \times R_w(j)$ (4-6)
[0208] In formula (4-6), $P_w(i,j)$ represents the local weight of label $t_i$ for city $c_j$, $T_w(i)$ represents the global weight of label $t_i$, and $R_w(j)$ represents the weight of city $c_j$ in the semantic dimension.
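Formulas (4-7) and (4-6) combine into a short computation, sketched below. The formulas for the global weight $T_w(i)$ and the semantic weight $R_w(j)$ are not legible in the source, so they are passed in as precomputed lists; the annotation counts and weight values are hypothetical.

```python
import math

def local_weight(count):
    """Formula (4-7): P_w(i, j) = log2(P(i, j) + 1)."""
    return math.log2(count + 1)

def city_label_matrix(counts, global_w, semantic_w):
    """Formula (4-6): o(i, j) = P_w(i, j) * T_w(i) * R_w(j).

    counts[i][j] is the number of times label i annotates city j.
    """
    return [[local_weight(counts[i][j]) * global_w[i] * semantic_w[j]
             for j in range(len(counts[i]))]
            for i in range(len(counts))]

# Two labels and two cities with hypothetical counts and weights.
counts = [[3, 0], [1, 7]]
O = city_label_matrix(counts, global_w=[0.5, 1.0], semantic_w=[1.0, 2.0])
print(O)
```

The logarithm dampens very frequent labels, and a label that never annotates a city contributes exactly 0.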
[0209] Step 4-3: Based on the elements of the city-tag matrix O, calculate the elements $c(t_p, t_q)$ of the k×k tag co-occurrence matrix C according to formula (4-1). The tag co-occurrence matrix C is used as the tag similarity matrix $Sim_1$ based on user perception;
[0210]
[0211] Where the first term indicates the number of times tag $t_p$ annotates city $c_j$; $N(t_p)$ is the set of cities annotated with tag $t_p$; $N(t_p) \cap N(t_q)$ is the set of cities annotated with both tag $t_p$ and tag $t_q$; and $c(t_p, t_q)$ is the frequency with which tag $t_p$ and tag $t_q$ annotate the same city.
[0212] Step 4-4: Calculate the tag similarity matrix $Sim_2$ based on tag semantics using the WordNet semantic dictionary. Step 4-4 further includes the following two steps.
[0213] Step 4-4-1: Preprocess the tag set and use WordNet to generate the synonym sets $s_p$ and $s_q$ of tags $t_p$ and $t_q$, which form the synonym pair $(s_p, s_q)$;
[0214] Step 4-4-2: Traverse all existing synonym pairs, retrieve the annotations of each synonym set, and use text preprocessing methods to extract the annotations $G_p$ and $G_q$ of $s_p$ and $s_q$. Calculate the semantic similarity between tags according to formula (4-2), and construct the tag similarity matrix $Sim_2$;
[0215]
[0216] $sim(t_p, t_q)$ indicates the semantic similarity between tag $t_p$ and tag $t_q$.
[0217] Step 4-5: Merge the tag similarity matrices $Sim_1$ and $Sim_2$ according to formula (4-3) to obtain a new tag similarity matrix M;
[0218] $m(t_p, t_q) = \eta \cdot c(t_p, t_q) + (1 - \eta) \cdot sim(t_p, t_q)$ (4-3)
[0219] Where η∈[0,1] is the factor used to adjust the merging weights, which increases by 0.1 in each iteration, and $c(t_p, t_q)$ indicates the similarity between tag $t_p$ and tag $t_q$ based on user perception.
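Formula (4-3) is a convex combination of the two similarity matrices. The sketch below evaluates it for η = 0.0, 0.1, ..., 1.0, mirroring the 0.1 step described above. The example matrices are hypothetical, and the source does not state how the final η is selected, so the code only enumerates the candidate matrices.

```python
def merge_similarity(c, sim, eta):
    """Formula (4-3): m = eta * c + (1 - eta) * sim, element by element."""
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    size = len(c)
    return [[eta * c[p][q] + (1 - eta) * sim[p][q] for q in range(size)]
            for p in range(size)]

# Hypothetical 2x2 co-occurrence (Sim1) and semantic (Sim2) similarity matrices.
C = [[1.0, 0.2], [0.2, 1.0]]
S = [[1.0, 0.6], [0.6, 1.0]]
candidates = {k / 10: merge_similarity(C, S, k / 10) for k in range(11)}
print(candidates[0.5])
```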
[0220] Step 4-6: Expand the tags using the new tag similarity matrix M. For a tag $t_z$ that has annotated city $c_j$ but has not annotated city $c_i$, use the co-occurrence distribution of tag $t_z$ with all tags of the already-annotated city $c_j$ to estimate the probability that tag $t_z$ will annotate the not-yet-annotated city $c_i$, calculated using formula (4-4):
[0221]
[0222] Where $T_i$ is the set of tags that have annotated city $c_i$; the remaining terms are the total number of tags annotating city $c_i$ and the probability that tag $t_t$ annotates city $c_i$.
[0223] Step 4-7: Calculate the city similarity according to formula (4-5) to obtain the city-city similarity matrix B. Based on the city-city similarity matrix B, generate city tag recommendations for users using the collaborative filtering algorithm.
[0224]
[0225] In the above formula, $sim(t_z, c_i)$ represents the number of times city $c_i$ is labeled with $t_z$; $simb(t_z, c_i)$ represents the number of times city $c_i$ is labeled with $t_z$ within the tag set $T_i$; and $r(c_i, c_j)$ represents the tag similarity between city $c_i$ and city $c_j$, which can be used to determine the tag category to which a city belongs.
[0226] The collaborative filtering algorithm mentioned in step 4 first constructs a rating matrix based on historical data, then selects a set of users with high similarity as nearest neighbors based on the similarity between users, and finally predicts the rating for target users for whom there is no historical data, and selects the top N high-rated items to complete the recommendation.
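The generic collaborative filtering procedure described in this paragraph can be sketched as follows. This is a plain user-based variant with cosine similarity and similarity-weighted averaging, not the improved algorithm of step 4. The function names, the neighbour count `k`, and the rating data are assumptions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def recommend(ratings, target, k=2, top_n=2):
    """ratings maps user -> {item: score}; returns top-N unseen items for target."""
    items = sorted({i for r in ratings.values() for i in r})

    def vec(user):
        return [ratings[user].get(i, 0.0) for i in items]

    # Nearest neighbours: the k users most similar to the target user.
    neighbours = sorted(((cosine(vec(target), vec(u)), u)
                         for u in ratings if u != target), reverse=True)[:k]
    scores = {}
    for item in items:
        if item in ratings[target]:
            continue
        rated = [(s, u) for s, u in neighbours if item in ratings[u]]
        den = sum(s for s, _ in rated)
        if den > 0:
            # Similarity-weighted average of the neighbours' ratings.
            scores[item] = sum(s * ratings[u][item] for s, u in rated) / den
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]

# Hypothetical tag ratings from three users.
ratings = {
    "u1": {"green": 5, "clean": 4},
    "u2": {"green": 5, "clean": 4, "walkable": 5, "noisy": 1},
    "u3": {"noisy": 5, "crowded": 4},
}
top = recommend(ratings, "u1")
print(top)
```

Items rated only by neighbours with zero similarity receive no prediction, so they are never recommended.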
[0227] Although the methods described above are illustrated and depicted as a series of actions for the sake of simplicity, it should be understood and appreciated that these methods are not limited by the order of the actions, as some actions may occur in a different order and / or concurrently with other actions from the illustrations and descriptions herein or not illustrated and described herein but which may be understood by those skilled in the art, according to one or more embodiments.
[0228] Those skilled in the art will further appreciate that the various illustrative logic blocks, modules, circuits, and algorithm steps described in conjunction with the embodiments disclosed herein can be implemented as electronic hardware, computer software, or a combination of both. To clearly illustrate this interchangeability between hardware and software, the various illustrative components, blocks, modules, circuits, and steps are described above in a generalized manner in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the specific application and the design constraints imposed on the overall system. Those skilled in the art may implement the described functionality in different ways for each specific application, but such implementation decisions should not be construed as departing from the scope of the invention.
[0229] The various illustrative logic blocks, modules, and circuits described in conjunction with the embodiments disclosed herein can be implemented or performed using a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. The general-purpose processor may be a microprocessor, but in alternatives, it may be any conventional processor, controller, microcontroller, or state machine. The processor may also be implemented as a combination of computing devices, such as a combination of a DSP and a microprocessor, multiple microprocessors, one or more microprocessors cooperating with a DSP core, or any other such configuration.
[0230] The steps of the methods or algorithms described in conjunction with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of both. The software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, removable disk, CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to a processor such that the processor can read and write information to / from the storage medium. In an alternative, the storage medium may be integrated into the processor. The processor and storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In an alternative, the processor and storage medium may reside as discrete components in the user terminal.
[0231] In one or more exemplary embodiments, the described functionality may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software as a computer program product, the functionality may be stored or transmitted as one or more instructions or code on or through a computer-readable medium. A computer-readable medium includes both computer storage media and communication media, encompassing any medium that facilitates the transfer of a computer program from one location to another. A storage medium may be any available medium accessible to a computer. By way of example and not limitation, such a computer-readable medium may include RAM, ROM, EEPROM, CD-ROM or other optical disc storage, disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and is accessible to a computer. Any connection is also legitimately referred to as a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of a medium. As used in this article, disk and disc include compact discs (CDs), laser discs, optical discs, digital multi-purpose discs (DVDs), floppy disks, and Blu-ray discs. Disks typically reproduce data magnetically, while discs reproduce data optically using lasers. Combinations of these should also be included within the scope of computer-readable media.
[0232] The prior description of this disclosure is provided to enable any person skilled in the art to make or use this disclosure. Various modifications to this disclosure will be apparent to those skilled in the art, and the general principles defined herein may be applied to other variations without departing from the spirit or scope of this disclosure. Therefore, this disclosure is not intended to be limited to the examples and designs described herein, but should be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims
1. A method for constructing health profiles based on data mining, characterized in that, The methods include: Step 1: Establish an urban health status indicator system applicable to the required region, and construct an integer programming model for maximizing urban economic influence; Step 2: Collect data based on the urban health status index system established in Step 1, and remove redundant features with weak correlation to the labels according to the feature filtering theory. Construct a subset of attribute features of the nearest neighbor sample set, introduce information acceptance to obtain the optimal feature subset to reduce the curse of dimensionality, enhance the correlation between attribute features and feature labels, and construct the optimal feature subset. Step 3: Consider the potential semantic information of the text, obtain the multi-feature fusion dynamic weight of web page text words, filter out the topic distribution and keyword set with high interpretability of urban health profile, realize the clustering of text words at the topic level, and generate urban health profile with hierarchical structure. Step 4: Construct a multi-feature fusion tag set and a city-tag matrix. Improve the collaborative filtering recommendation algorithm based on user cognition and tag expansion. Extract a tag set reflecting the health of the city from a massive number of tags. Reveal the importance of tags in the semantic features of the city profile through tag co-occurrence. Generate city tag recommendations and delve into the hierarchical structure to develop a fine-grained description of the city profile. Step 3 further includes: Step 3-1: Filter images, videos, hyperlinks, and unknown interference information in the original web page document; find the maximum segmentation combination based on word frequency for word segmentation; use a stop word list for stop word removal; and construct a preprocessed document set. 
Step 3-2-1: For each preprocessed document i in the document set, set the adjustment coefficient; Step 3-2-2: Adjust the importance of word j to document i, and allocate the score of word j based on the adjustment coefficient; Step 3-3: Divide document i into q segments, find the segment of the document containing word j and denote it as b; after the preprocessing in step 3-1, segment b still has d candidate words and the position of word j becomes c; calculate the position information weight of word j, and obtain the final position information weight after minimization normalization processing; Step 3-4: Perform a weighted fusion of the word score, the position information weight, and the word dynamic weight to obtain the multi-feature fusion dynamic weight of the word; Step 3-5: Using the multi-feature fusion dynamic weights, traverse the word set and use a Transformer encoder to encode the words of the text after filtering out stop words to obtain a text representation of the word set; Step 3-6: Based on the topic-word distribution matrix, match each word in the keyword set to its corresponding topic.
2. The health profile construction method based on data mining according to claim 1, characterized in that, Step 1 further includes: Step 1-1: Collect urban health status indicators for the required regions and establish an urban health status indicator system based on these indicators; Step 1-2: Initial Stage Selection One county-level city and one prefecture-level city, judge Does the influence of a prefecture-level city on a county-level city exceed a set threshold? If so, it indicates... The influence spread ended after a total of In the diffusion stage, the objective function and constraints of the integer programming model are established.
3. The health profile construction method based on data mining according to claim 1, characterized in that, Step 2 further includes: Step 2-1: Collect a set of indicators for urban health status, using the desired region as the sample point. The corresponding data in the dataset was used to standardize the collected data to obtain the initial dataset. Construct the initial attribute feature set ; Step 2-2: Process the collected initial dataset Arrange in order, and set attribute feature-label thresholds. Calculate attribute features For the indicator set Each of the three-level indicators mutual information entropy value The mutual information entropy value Below the attribute feature-label threshold The attribute features are filtered out, and the filtered attribute features of the sample are constructed. These filtered attribute features are represented as follows: Where m is the number of filtered attribute features, and n is the number of features in the initial attribute feature set; Step 2-3-1: Obtain the set of prefecture-level cities with economic influence from Step 1. Random collection These samples constitute the city sample set. And randomly select a sample city from them. With neighboring prefecture-level cities of equal or different levels of influence, Count the sample cities In a certain attribute feature Other cities Sample spacing ; Step 2-3-2: If the city Compared with samples of the same level in a certain attribute feature The distance between samples on a given level is less than the distance between it and samples from non-same level on a certain attribute feature. The distance between them updates the city. 
Attribute feature weights The initial weights of the attribute features Conversely, re-extract sample cities and calculate sample intervals and attribute feature weights; Step 2-3-3: Based on the city Attribute feature weights Matching attribute features to secondary indicators Weights between Traverse all indicator sets The secondary indicators in the city are obtained Attributes and characteristics The sum of the weights of all secondary indicators ; Steps 2-3-4: [Regarding...] Arrange in order and iterate through them sequentially. ,in p Representing attribute characteristics The number of features in a column determines the city's structure. Attribute feature vector For sample cities Dimension reduction of attribute features, traversing the sample set All cities in China, construct dimensional feature subset ; Steps 2-4: Optimal Feature Subset The screening process.
4. The health profile construction method based on data mining according to claim 1, characterized in that Step 3-5 further includes:
Step 3-5-1: Use a multi-layer bidirectional encoder to compute a context-aware representation of each word, obtaining word embedding representations;
Step 3-5-2: Use pre-training to convert each word into a real-valued vector, and then use a gated fusion strategy to fuse the pre-trained language model representation with the word vectors to obtain the word embedding vectors;
Step 3-5-3: Calculate the initial keyword representation;
Step 3-5-4: Decode the final keyword representation using a decoder to construct the keyword set.
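The gated fusion of Step 3-5-2 can be sketched as follows. The claim does not give the gate formula, so this example assumes a common form: a sigmoid gate computed from the concatenation of the two vectors, followed by an element-wise convex combination. The names `gated_fusion`, `gate_weights`, and `gate_bias` are hypothetical, and both input vectors are assumed to have the same dimension.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(contextual, static, gate_weights, gate_bias):
    """Fuse a contextual embedding with a static word vector of the same size.

    Assumed form: g = sigmoid(W [c; s] + b), fused = g * c + (1 - g) * s,
    where gate_weights has one row per output dimension and each row has
    len(contextual) + len(static) entries.
    """
    concat = list(contextual) + list(static)
    fused = []
    for i, (c, s) in enumerate(zip(contextual, static)):
        z = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        g = sigmoid(z)  # gate value in (0, 1) for this dimension
        fused.append(g * c + (1.0 - g) * s)
    return fused


# With zero weights and zero bias the gate is 0.5, so the result is the plain average.
zero_weights = [[0.0] * 4 for _ in range(2)]
averaged = gated_fusion([1.0, 0.0], [0.0, 1.0], zero_weights, [0.0, 0.0])
```

In a trained model the gate parameters would be learned, letting each dimension lean toward the contextual or the static representation as the data requires.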
5. The health profile construction method based on data mining according to claim 1, characterized in that Step 3-6 further includes:
Step 3-6-1: Calculate the initial topic distribution of a document in the text set;
Step 3-6-2: For the document, generate the dynamic weights of its words based on the topic-word distribution matrix;
Step 3-6-3: Update the probability that the document belongs to each topic in the document-topic distribution, then update the probability of each word under each topic in the topic-word distribution, and continue until the result converges and is output;
Step 3-6-4: When the number of update iterations satisfies the interval condition, adjust the topic vector of the current word and the dynamic weight of the word, and return to Step 3-6-3, where the interval is the iteration interval used when dynamically updating word weights;
Step 3-6-5: Traverse the document's vocabulary set sequentially, repeating Steps 3-6-2 to 3-6-4, until the document-topic distribution of the document under each topic and the topic-word distribution matrix of each word under each topic are obtained;
Step 3-6-6: For each web-page document in the text set, repeat Steps 3-6-2 to 3-6-5 to obtain the document-topic distribution and the topic-word distribution of all web-page documents; arrange them in order, select several topics that are highly semantically related to the urban health dimensions to form a topic set, and for each topic select several words to form a topic-word vocabulary set.
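The iterative update of the document-topic and topic-word distributions in Step 3-6-3 resembles collapsed Gibbs sampling for a topic model. The sketch below is plain LDA with symmetric priors and is offered only as an illustration: it omits the dynamic word weights of Steps 3-6-2 and 3-6-4, runs a fixed number of sweeps instead of testing convergence, and the function name and the `alpha` and `beta` parameters are assumptions.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for plain LDA over tokenized documents.

    Returns (theta, phi, vocab): document-topic distributions,
    topic-word distributions, and the sorted vocabulary.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Random initial topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][index[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                z, v = assignments[d][n], index[w]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic.
                probs = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=probs)[0]
                assignments[d][n] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + n_words * beta) for c in topic_word[k]]
        for k in range(n_topics)
    ]
    return theta, phi, vocab
```

A weighted variant along the lines of the claim would scale each token's contribution to the counts by its dynamic word weight and refresh those weights at the stated iteration interval; that extension is not shown here.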
6. The health profile construction method based on data mining according to claim 5, characterized in that Step 3-6-2 further includes:
Step 3-6-2-1: Randomly assign a topic to each word in the document as initialization;
Step 3-6-2-2: For each topic in the document and the words under it, update the probability that the word belongs to the topic, and traverse the topics to obtain the word's topic distribution;
Step 3-6-2-3: Normalize each component of the word's topic distribution to construct the word's topic vector.
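The normalization in Step 3-6-2-3 can be shown with a short Python sketch. It assumes the word's topic distribution is read as one column of a topic-word probability matrix; the function name and the uniform fallback for an all-zero column are illustrative choices, not taken from the claim.

```python
def word_topic_vector(topic_word_probs, word_index):
    """Normalize a word's per-topic probabilities so the components sum to 1.

    topic_word_probs[k][v] is the probability of word v under topic k.
    """
    column = [row[word_index] for row in topic_word_probs]
    total = sum(column)
    if total == 0:
        # Assumed fallback: a word unseen under every topic gets a uniform vector.
        return [1.0 / len(column)] * len(column)
    return [p / total for p in column]


# Two topics, two words: word 0 is three times as likely under topic 0 as under topic 1.
vector = word_topic_vector([[0.6, 0.4], [0.2, 0.8]], word_index=0)
```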
7. The health profile construction method based on data mining according to claim 1, characterized in that Step 4 further includes:
Step 4-1: Use web crawling technology to obtain, from the web platform, user Q&A describing the main health characteristics of the required region; following the text data preprocessing of Step 3-1, perform lexical standardization and filtering on the raw data to obtain a label set;
Step 4-2-1: Define the user set, the city set, and the set of labels with which users annotate cities;
Step 4-2-2: Define the local weight of a label based on the number of times the city is annotated with it;
Step 4-2-3: Determine the global weight of a label from the information gain between the information entropy of the sample set and the conditional entropy of the label, in order to measure the label's ability to distinguish different cities;
Step 4-2-4: Use semantic dimension weights to account for the semantic ambiguity of labels;
Step 4-2-5: Calculate the element values of the city-label matrix and construct the city-label matrix;
Step 4-3: Based on the elements of the city-label matrix, calculate the elements of the label co-occurrence matrix, and take the label co-occurrence matrix as the label similarity matrix based on user perception;
Step 4-4: Use a semantic dictionary to calculate the label similarity matrix based on label semantics;
Step 4-5: Combine the user-perception-based and semantics-based label similarity matrices and perform label merging to obtain a new label similarity matrix;
Step 4-6: Expand labels using the new label similarity matrix: for a label that has been used to annotate cities but has not been applied to a given city, estimate the probability that the label will annotate that city according to the co-occurrence distribution between the label and all labels already applied to the city;
Step 4-7: Calculate city similarity to obtain the city-city similarity matrix and, based on the city-city similarity matrix, generate city label recommendations for users using a collaborative filtering algorithm.
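The label co-occurrence matrix of Step 4-3 and the similarity-based recommendation of Step 4-7 can be sketched in Python as follows. The example assumes the city-label matrix is already populated with non-negative weights, uses cosine similarity between city rows, and scores a missing label by the similarity-weighted sum over other cities. These choices and all function names are assumptions, as the claim does not reproduce its formulas.

```python
import math


def cooccurrence(city_label):
    """Label co-occurrence: C[a][b] = sum over cities of M[city][a] * M[city][b]."""
    n_labels = len(city_label[0])
    return [
        [sum(row[a] * row[b] for row in city_label) for b in range(n_labels)]
        for a in range(n_labels)
    ]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 0.0 if norm_u == 0 or norm_v == 0 else dot / (norm_u * norm_v)


def city_similarity(city_label):
    """City-city similarity matrix from the rows of the city-label matrix."""
    return [[cosine(a, b) for b in city_label] for a in city_label]


def recommend_labels(city_label, city, top_k=2):
    """Recommend labels a city lacks, scored by similar cities that carry them."""
    sims = city_similarity(city_label)[city]
    scores = {}
    for label in range(len(city_label[city])):
        if city_label[city][label] > 0:
            continue  # the city already has this label
        scores[label] = sum(
            sims[other] * city_label[other][label]
            for other in range(len(city_label))
            if other != city
        )
    ranked = sorted(scores, key=lambda label: (-scores[label], label))
    return [label for label in ranked if scores[label] > 0][:top_k]


# Rows are cities, columns are labels (hypothetical weights).
matrix = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 0, 1]]
suggested = recommend_labels(matrix, city=0)
```

Here city 0 shares two labels with city 1 and none with city 2, so only the extra label carried by city 1 is suggested.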
8. The health profile construction method based on data mining according to claim 7, characterized in that Step 4-4 further includes:
Step 4-4-1: Preprocess the label set and, for each pair of labels, generate the synonyms of each label to form a thesaurus;
Step 4-4-2: Traverse all existing synonym pairs, retrieve the annotation of each synonym set, apply text preprocessing to the extracted annotations, calculate the semantic similarity between labels, and construct the label similarity matrix.
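One simple way to turn dictionary annotations into a label similarity matrix, in the spirit of Step 4-4-2, is to compare the word sets of the annotations. The sketch below uses Jaccard overlap, which is a substituted technique rather than the measure defined in the claim. The `glosses` mapping, the stopword list, and the function names are hypothetical.

```python
import re

# Hypothetical minimal stopword list for annotation preprocessing.
STOPWORDS = {"a", "an", "the", "of", "and", "with", "in", "for", "to"}


def gloss_tokens(gloss):
    """Lowercase the annotation, split it into words, and drop stopwords."""
    return {t for t in re.findall(r"[a-z]+", gloss.lower()) if t not in STOPWORDS}


def jaccard(a, b):
    """Jaccard overlap of two token sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def label_similarity_matrix(labels, glosses):
    """Pairwise label similarity from annotation overlap (diagonal fixed at 1.0)."""
    tokens = [gloss_tokens(glosses[label]) for label in labels]
    size = len(labels)
    return [
        [1.0 if i == j else jaccard(tokens[i], tokens[j]) for j in range(size)]
        for i in range(size)
    ]


# Hypothetical annotations standing in for entries retrieved from a semantic dictionary.
glosses = {
    "green": "a city with many parks and trees",
    "leafy": "an area with many trees",
    "busy": "heavy traffic",
}
similarity = label_similarity_matrix(["green", "leafy", "busy"], glosses)
```

A dictionary-based measure that uses the synonym-set hierarchy would normally be preferred over raw annotation overlap; the overlap version is shown only because it needs no external resource.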