Address data intelligent identification system based on cloud collaboration

The cloud-based address data intelligent identification system solves the problems of insufficient real-time performance and robustness of federated learning in dynamic and open environments, and achieves efficient and accurate identification and adaptive optimization of multi-source heterogeneous data.

CN121996992B · Active · Publication Date: 2026-06-23 · FUJIAN WUSE SHENNIU NETWORK TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
FUJIAN WUSE SHENNIU NETWORK TECH CO LTD
Filing Date
2026-04-10
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing federated-learning-based methods for collaborative address identification cannot achieve both real-time performance and robustness in dynamic, open environments. In emergency response and real-time logistics scenarios in particular, model updates lag behind on-site changes and are susceptible to contamination by low-quality data.

Method used

A cloud-based intelligent address data identification system is adopted. Through multi-source heterogeneous data collection, dual-dimensional feature quantification analysis, dynamic collaborative coefficient construction, and asynchronous weighted collaborative scheduling, the system dynamically evaluates the real-time value and quality of each data source, asynchronously processes high-value data and buffers low-quality data, and optimizes the address element parsing rules.

Benefits of technology

It achieves a balance between real-time performance and accuracy in data processing under complex multi-source data environments, avoids the impact of low-quality data, ensures high recognition accuracy and anti-interference capabilities, and adapts to changes in new address patterns.

✦ Generated by Eureka AI based on patent content.

Abstract

The application relates to the technical field of data processing, and specifically discloses an address data intelligent identification system based on cloud cooperation, which collects original address data from multiple terminals and forms a data stream, carries out time-dimension value decay simulation and content-dimension semantic dispersion analysis on each terminal data, respectively generates time sequence urgency and content credibility characteristic values, then maps the two to a unified space for correlation calibration, constructs a dynamic cooperation coefficient, and based on the coefficient, performs differentiated asynchronous scheduling on the data stream in the cloud: high-coefficient data immediately triggers an identification core update, and low-coefficient data is buffered and iteratively purified; the differentiated results are integrated, a mainstream change direction is identified by using density-based clustering, address resolution rules are adaptively corrected, and finally, the optimized rules are applied to output structured addresses; the method effectively unifies the real-time performance and robustness of cooperative processing, and significantly improves the accuracy and efficiency of address identification in complex scenarios.

Description

Technical Field

[0001] This invention relates to the field of data processing technology, and more specifically to a cloud-based intelligent address data identification system.

Background Technology

[0002] In many fields such as smart cities, modern logistics, emergency command, and population management, rapid and accurate intelligent identification and structured processing of address data are key technologies supporting efficient business operations. Traditional address identification methods are mostly based on single-point rule bases or models trained with limited samples, making it difficult to cope with the diversity, regional differences, and dynamic changes in address representations in reality. With the development of the Internet of Things and mobile Internet, address data is characterized by being multi-source, heterogeneous, massive, and generated in real time. This has prompted the industry to adopt collaborative processing technologies based on cloud computing platforms, attempting to integrate dispersed data and computing power to improve identification accuracy and coverage.

[0003] Existing technologies suffer from the following shortcomings: Current federated learning-based address collaborative identification methods face a core technical contradiction in achieving both real-time performance and robustness in dynamic, open environments. Specifically, in scenarios such as emergency response and real-time logistics, multi-source terminal data changes drastically and is of mixed quality. Traditional methods employ a fixed-period synchronous federated learning framework, whose model update speed lags significantly behind on-site changes (poor real-time performance). Furthermore, because updates from all participants are aggregated with equal weight, the model is highly susceptible to contamination by massive amounts of low-quality data (low robustness).

Summary of the Invention

[0004] The purpose of this invention is to provide a cloud-based intelligent address data identification system to solve the problems mentioned above.

[0005] The objective of this invention can be achieved through the following technical solutions:

[0006] The cloud-based address data intelligent recognition system has a multi-source heterogeneous data acquisition module, which is used to acquire raw address data sets in real time from multiple independently operating terminals. Each terminal collects address information containing unstructured text or images at its own physical location to form an initial data stream.

[0007] The dual-dimensional feature quantification analysis module simulates the continuous value decay of the timestamp sequence generated by the initial data stream of each terminal, estimates the current effective information density of the timestamp sequence, and generates the time urgency feature value; it extracts the semantic dispersion of the initial data stream generated by each terminal from the content dimension, compares it with the pre-stored benchmark pattern, uses probability divergence measurement to calculate the deviation, and generates the content credibility feature value.

[0008] The dynamic coordination coefficient construction module maps the time urgency feature value and the content credibility feature value to the same metric space, and performs normalization and correlation calibration to generate a set of dynamic coordination coefficients that uniquely correspond to each terminal.

[0009] The asynchronous weighted collaborative scheduling module performs differentiated scheduling on continuously flowing terminal address data in the cloud based on dynamic collaborative coefficients. For terminal data with high dynamic collaborative coefficients, the update process of the cloud identification core is immediately triggered. For terminal data with low dynamic collaborative coefficients, it is temporarily stored in a buffer queue for multiple rounds of iterative purification. The differentiated processing results of different terminals are integrated asynchronously to correct and optimize the address element parsing rules.

[0010] The structured address output module applies the stable parsing rules formed after asynchronous weighted collaborative identification steps to the real-time inflow or historical backlog of raw address data, and outputs structured address entries in a unified format.
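The differentiated scheduling described for the asynchronous weighted collaborative scheduling module can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `CoefficientScheduler`, the single `threshold` separating "high" from "low" coefficients, and the `purify` callback are all assumptions, since the text does not specify how the cut-off is chosen or what one purification round does.

```python
from collections import deque


class CoefficientScheduler:
    """Route terminal records by dynamic coefficient (hypothetical sketch)."""

    def __init__(self, threshold, update_core):
        self.threshold = threshold      # assumed cut-off between high and low
        self.update_core = update_core  # callback that updates the cloud core
        self.buffer = deque()           # queue of low-coefficient records

    def submit(self, terminal_id, record, coefficient):
        # High-coefficient data triggers the core update immediately.
        if coefficient >= self.threshold:
            self.update_core(terminal_id, record)
            return "immediate"
        # Low-coefficient data waits in the buffer for purification.
        self.buffer.append((terminal_id, record, coefficient))
        return "buffered"

    def purify_round(self, purify):
        """Run one purification round over the buffer.

        `purify` is a caller-supplied function returning a cleaned record
        and a refreshed coefficient; records that now pass are promoted.
        """
        remaining = deque()
        while self.buffer:
            terminal_id, record, coefficient = self.buffer.popleft()
            record, coefficient = purify(record, coefficient)
            if coefficient >= self.threshold:
                self.update_core(terminal_id, record)
            else:
                remaining.append((terminal_id, record, coefficient))
        self.buffer = remaining
```

Calling `purify_round` repeatedly corresponds to the "multiple rounds of iterative purification" applied to buffered data.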

[0011] As a further aspect of the present invention: the process for generating the timing urgency feature value is as follows:

[0012] A continuous decay mapping is performed on the timestamp sequence to generate an information value curve that characterizes the decay of each data point over time;

[0013] Based on a preset information validity threshold, the information value curve is dynamically truncated, retaining only the curve segments that exceed the information validity threshold to form a valid information window;

[0014] Calculate the area enclosed by the curve and the time axis within the effective information window, and quantify the area value as a time urgency feature value.

[0015] As a further aspect of the present invention: the process for generating the content credibility feature value is as follows:

[0016] The unstructured address text in the initial data stream is decomposed and the frequency of occurrence of each address element is counted according to the preset address element parsing rules, and the address element distribution of the corresponding terminal is generated.

[0017] The distribution of address elements is compared with the pre-stored standardized address element mapping table at each level, and the matching completeness and logical conflict of each level of elements are calculated.

[0018] Based on the matching completeness and logical conflict degree, a numerical value representing the overall deviation between the corresponding terminal data and the standard benchmark is synthesized through a predefined discrete quantification method, which serves as the content credibility feature value.

[0019] As a further aspect of the present invention: the calculation process of the dynamic synergy coefficient is as follows:

[0020] Construct a two-dimensional interaction matrix between time-series urgency feature values and content credibility feature values;

[0021] Spectral analysis is performed on the two-dimensional interaction matrix to extract the principal eigenvectors of the two-dimensional interaction matrix;

[0022] Project the two-dimensional coordinates composed of the time urgency feature value and the content credibility feature value onto the direction of the principal eigenvector;

[0023] The length value of the projection is scaled by a preset scale and quantified into the dynamic coordination coefficient corresponding to the terminal.

[0024] As a further aspect of the present invention: the spectral analysis of the two-dimensional interaction matrix to extract the principal eigenvectors of the two-dimensional interaction matrix specifically includes:

[0025] The two-dimensional interaction matrix is symmetrized and centered to obtain a standardized real symmetric matrix;

[0026] An orthogonal direction iterative approximation method is used for real symmetric matrices. Through multiple iterations, a stable vector direction that satisfies the preset convergence condition is obtained.

[0027] Calculate the Rayleigh quotient of the stable vector direction relative to the real symmetric matrix, and use the Rayleigh quotient as the final verification basis for convergence;

[0028] The unit vector in the direction of the verified stable vector is determined as the principal eigenvector.

[0029] As a further aspect of the present invention: the method of asynchronously integrating the differentiated processing results of different terminals to correct and optimize the address element parsing rules specifically includes:

[0030] Extract the latest output address element parsing results from the immediately triggered update process and the buffer queue of multiple rounds of iterative purification, respectively. Compare the address element parsing results with the currently used parsing rules to generate a difference vector cluster.

[0031] Density-based spatial clustering is performed on the difference vector cluster to identify the set of mainstream change directions.

[0032] By logically integrating the set of mainstream change directions with the current parsing rules and resolving conflicts between them, a draft rule update is formed.

[0033] The draft rule update will be tested on an independent historical address verification set. The draft rule update will only be adopted when the resolution accuracy improves by more than a preset threshold, thus completing the formal revision and optimization of the address element resolution rules.
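The adoption gate in the last step can be illustrated with a short sketch. The function name `adopt_if_improved`, the `accuracy` callback, and the default `min_gain` are hypothetical; the text only requires that the draft be adopted when accuracy on an independent historical verification set improves by more than a preset threshold.

```python
def adopt_if_improved(current_rules, draft_rules, validation_set, accuracy,
                      min_gain=0.01):
    """Return (rules_to_use, adopted) after testing the draft.

    `accuracy(rules, validation_set)` is a caller-supplied scoring function
    returning a value in [0, 1]; `min_gain` is the preset improvement threshold.
    """
    baseline = accuracy(current_rules, validation_set)
    candidate = accuracy(draft_rules, validation_set)
    # Adopt the draft only if it beats the current rules by more than min_gain.
    if candidate - baseline > min_gain:
        return draft_rules, True
    return current_rules, False
```

Keeping the validation set separate from the data that produced the draft is what lets this gate reject updates driven by noisy or malicious terminals.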

[0034] As a further aspect of the present invention: the density-based spatial clustering of the difference vector clusters to identify the set of change directions specifically includes:

[0035] For each vector in the difference vector cluster, count the other vectors lying within a preset radius of it to obtain the local density value of the vector, and construct the local density distribution map of the entire vector space;

[0036] Based on a preset density benchmark, vectors with local density values higher than the density benchmark are selected from the local density distribution map to form a core vector set;

[0037] Traverse the core vector set, establish a connection between any two core vectors in the core vector set that satisfy the spatial proximity relationship, and merge all core vectors that are accessible through direct or indirect connections into an independent core vector group;

[0038] All vector directions within each core vector group are synthesized to generate an aggregate vector representing the mainstream change direction of the core vector group. The set of all aggregate vectors is the set of mainstream change directions.
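The four clustering steps above can be sketched in plain Python. This is an illustrative reading under stated assumptions: Euclidean distance is used for both the density radius and the "spatial proximity relationship", and the aggregate vector of a group is taken to be the mean of its members. Neither choice is fixed by the text, and the function name `mainstream_directions` is hypothetical.

```python
import math


def mainstream_directions(vectors, radius, density_benchmark):
    """Return one aggregate vector per group of connected core vectors."""
    n = len(vectors)
    # Step 1: local density = number of other vectors within the radius.
    density = [
        sum(1 for j in range(n)
            if j != i and math.dist(vectors[i], vectors[j]) <= radius)
        for i in range(n)
    ]
    # Step 2: core vectors have density strictly above the benchmark.
    core = [i for i in range(n) if density[i] > density_benchmark]
    # Step 3: merge core vectors reachable through chains of close neighbours.
    groups, seen = [], set()
    for start in core:
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            i = stack.pop()
            group.append(i)
            for j in core:
                if j not in seen and math.dist(vectors[i], vectors[j]) <= radius:
                    seen.add(j)
                    stack.append(j)
        groups.append(group)
    # Step 4: aggregate each group (assumption: component-wise mean).
    dim = len(vectors[0]) if vectors else 0
    return [
        tuple(sum(vectors[i][d] for i in group) / len(group) for d in range(dim))
        for group in groups
    ]
```

Vectors that are not core, such as isolated outliers, contribute to no group, which is how individual erroneous differences are kept out of the mainstream change directions.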

[0039] As a further aspect of the present invention: the output of structured address entries in a unified format specifically includes:

[0040] The original address data is input into a hierarchical decision tree constructed based on stable resolution rules. The address element type and logical level are determined and marked layer by layer from top to bottom, generating temporary address tuples with hierarchical labels.

[0041] The address elements in the temporary address tuple are combined and verified. Based on the predefined regional constraint rules and logical completeness rules, the conflict of elements or missing levels are identified and corrected to form a verified intermediate address tuple.

[0042] The intermediate address tuple is assembled with delimiters in a fixed order according to a preset unified output standard.

[0043] Perform a final consistency check on the assembled address string to ensure that the address string conforms to the preset complete address paradigm, and then output the complete address paradigm as a structured address entry in a uniform format.
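The assembly step can be illustrated as follows. The level names in `LEVEL_ORDER`, the set of levels treated as mandatory, and the `", "` delimiter are assumptions made for the example; the text only states that elements are joined in a fixed order according to a preset output standard and that missing levels are detected.

```python
# Hypothetical fixed output order and mandatory levels.
LEVEL_ORDER = ("province", "city", "district", "street", "house_number")
REQUIRED_LEVELS = ("province", "city", "district")


def assemble_address(elements, delimiter=", "):
    """Join a verified address tuple (dict of level -> value) in fixed order."""
    missing = [level for level in REQUIRED_LEVELS if not elements.get(level)]
    if missing:
        # A missing mandatory level should have been corrected upstream.
        raise ValueError(f"missing address levels: {missing}")
    return delimiter.join(
        elements[level] for level in LEVEL_ORDER if elements.get(level)
    )
```

Optional levels that are absent are skipped, so the output order stays fixed regardless of the order in which elements were extracted.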

[0044] As a further aspect of the present invention: the final consistency check performed on the assembled address string specifically includes:

[0045] The address string is matched against a preset address character pattern library to verify whether the address character type sequence, segment length, and delimiter position conform to the standard pattern.

[0046] Based on a pre-constructed spatial relationship graph of address elements, verify whether there are valid spatial inclusion or adjacency relationships between address elements at different levels in the address string;

[0047] The address string is correlated and verified with the time and space stamp information in the original address data to ensure that the form of the address string is consistent with the time and space logic of the address string being collected;

[0048] The address string will be output as a structured address entry in a uniform format only if the address string passes all the aforementioned verification steps.
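Two of the checks above, the character pattern match and the spatial inclusion check, can be sketched as below. The regular expression, the tiny `CONTAINS` graph, and the function name `passes_checks` are hypothetical stand-ins for the preset pattern library and the spatial relationship graph. The time and space stamp correlation check is omitted because it depends on metadata not modelled here.

```python
import re

# Hypothetical pattern: three mandatory levels, then optional extra segments.
ADDRESS_PATTERN = re.compile(
    r"[\w .]+ Province, [\w .]+ City, [\w .]+ District(, [\w .]+)*"
)

# Hypothetical spatial inclusion graph: parent -> set of contained children.
CONTAINS = {
    "Fujian Province": {"Xiamen City"},
    "Xiamen City": {"Siming District", "Huli District"},
}


def passes_checks(address):
    """Return True only if the pattern and inclusion checks both pass."""
    # Check 1: segment types and delimiter positions follow the pattern.
    if not ADDRESS_PATTERN.fullmatch(address):
        return False
    # Check 2: each level must be contained in the level above it.
    province, city, district = address.split(", ")[:3]
    return (city in CONTAINS.get(province, set())
            and district in CONTAINS.get(city, set()))
```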

[0049] The beneficial effects of this invention are:

[0050] (1) Through a unique dual-dimensional feature analysis of time urgency and content credibility, and by calculating a dynamic collaboration coefficient, the system can intelligently and dynamically evaluate the real-time value and quality of each data source. This enables the cloud scheduling mechanism to process and update high-value, high-quality terminal data in real time, while buffering and purifying low-quality or non-urgent data (such as volunteer crowdsourced data), thus breaking the rigid synchronous waiting bottleneck of traditional federated learning. This differentiated asynchronous processing based on the characteristics of the data itself ensures that key information can be adopted by the system with minimal delay, while effectively avoiding the impact of low-quality data streams on the core processing flow. In scenarios with extremely high timeliness requirements, such as emergency response and logistics scheduling, the optimal balance between data processing speed and accuracy is achieved.

[0051] (2) This invention integrates differentiated results from different processing paths asynchronously and uses density-based spatial clustering technology to automatically identify the true mainstream trends and common patterns from massive, potentially noisy, individual differences. This process makes the update of address resolution rules no longer dependent on manual induction or simple majority voting, but rather on statistically verified, high-confidence group consensus. This not only improves the accuracy and reliability of rule optimization and effectively prevents the "poisoning" of rules due to individual erroneous data or malicious interference, but also enables the system to continuously learn from real business data, evolve, and adapt to new address patterns (such as new urban areas and temporary resettlement sites), thereby maintaining high recognition accuracy and strong anti-interference capabilities in a real-world environment with complex data sources and varying quality.

Attached Figure Description

[0052] The invention will now be further described with reference to the accompanying drawings.

[0053] Figure 1 is a system block diagram of the present invention.

Detailed Implementation

[0054] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0055] As shown in Figure 1, this invention is a cloud-based collaborative address data intelligent identification system, comprising:

[0056] The multi-source heterogeneous data acquisition module is used to acquire raw address data sets in real time from multiple independently operating terminals. Each terminal collects address information containing unstructured text or images at its own physical location to form an initial data stream.

[0057] The dual-dimensional feature quantification analysis module simulates the continuous value decay of the timestamp sequence generated by the initial data stream of each terminal, estimates the current effective information density of the timestamp sequence, and generates the time urgency feature value; it extracts the semantic dispersion of the initial data stream generated by each terminal from the content dimension, compares it with the pre-stored benchmark pattern, uses probability divergence measurement to calculate the deviation, and generates the content credibility feature value.

[0058] The dynamic coordination coefficient construction module maps the time urgency feature value and the content credibility feature value to the same metric space, and performs normalization and correlation calibration to generate a set of dynamic coordination coefficients that uniquely correspond to each terminal.

[0059] The asynchronous weighted collaborative scheduling module performs differentiated scheduling on continuously flowing terminal address data in the cloud based on dynamic collaborative coefficients. For terminal data with high dynamic collaborative coefficients, the update process of the cloud identification core is immediately triggered. For terminal data with low dynamic collaborative coefficients, it is temporarily stored in a buffer queue for multiple rounds of iterative purification. The differentiated processing results of different terminals are integrated asynchronously to correct and optimize the address element parsing rules.

[0060] The structured address output module applies the stable parsing rules formed after asynchronous weighted collaborative identification steps to the real-time inflow or historical backlog of raw address data, and outputs structured address entries in a unified format.

[0061] In the multi-source heterogeneous data acquisition module, raw address data sets are acquired in real time from multiple independently operating terminals. Each terminal acquires address information containing unstructured text or images at its respective physical location, forming an initial data stream, specifically including:

[0062] In this invention, the data acquisition step is completed through terminal devices deployed in different physical locations. Each terminal device operates independently, and its built-in positioning unit and data input interface are used to collect raw address information containing unstructured text descriptions or on-site images. The text descriptions come from user manual input, device-recognized text, or associated form information; the image information is obtained by directly capturing scene photos containing door numbers, road signs, or landmarks through the terminal's camera.

[0063] After data collection is complete, the terminal device's built-in communication module transmits a data packet containing a timestamp, device identifier, and the aforementioned original address information to the designated receiving endpoint in the cloud, either in real-time or near real-time, via a mobile network or wireless LAN. The data transmission process employs standard transmission control protocols and data encryption measures to ensure data integrity and transmission security.

[0064] After receiving data packets from various terminals, the cloud-based data receiving service first decrypts and parses them to extract valid address information and its metadata. Then, multiple address messages from the same terminal that have a continuous time sequence are logically aggregated to form a data sequence uniquely corresponding to that terminal and arranged in chronological order, namely the initial data stream. This data stream provides the basic data units for subsequent feature analysis and collaborative processing.
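The aggregation of parsed packets into per-terminal initial data streams can be sketched as follows. The packet field names `device_id` and `timestamp` are assumptions for the example; the text only says each packet carries a device identifier, a timestamp, and the original address information.

```python
from collections import defaultdict


def build_streams(packets):
    """Group decrypted packets by terminal and sort each group by time."""
    streams = defaultdict(list)
    for packet in packets:
        streams[packet["device_id"]].append(packet)
    for records in streams.values():
        # Chronological order within each terminal's stream.
        records.sort(key=lambda p: p["timestamp"])
    return dict(streams)
```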

[0065] In the dual-dimensional feature quantification analysis module, continuous value decay simulation is performed on the timestamp sequence of the initial data stream generated by each terminal, the current effective information density of the timestamp sequence is estimated, and a time-series urgency feature value is generated. The semantic dispersion of the initial data stream generated by each terminal is extracted from the content dimension, compared with a pre-stored benchmark pattern, and the deviation is calculated using probability divergence measurement to generate a content credibility feature value, specifically including:

[0066] The generation of the timing urgency feature value begins with processing the timestamp sequence attached to the initial data stream generated by the terminal. This timestamp sequence records the precise time when each address data point was collected by the terminal, forming a set of points arranged in chronological order. First, a continuous value decay simulation is performed on this sequence. To achieve this, each timestamp data point in the sequence is assigned an initial information value, which decays as the time difference between the current calculation time and the time recorded for that data point increases. Specifically, the current calculation time is defined as $t_{now}$ and the recording time of a certain historical data point is $t_i$. Then the instantaneous information value $V_i$ of that data point at the current moment is calculated using the following formula:

$$V_i = V_0 \cdot e^{-\lambda (t_{now} - t_i)}$$

[0067] Where e is the base of the natural logarithm, which is a mathematical constant. The parameter $V_0$ represents the initial information value of the data point at the time of its generation and can be uniformly set to a constant of 1. The parameter $\lambda$ is the value decay coefficient, with a value greater than 0. It is pre-calibrated by analyzing the typical changes in the validity of address information over time in historical data. After this calculation is completed for all timestamp data points, the corresponding $V_i$ values form an information value curve that represents the decay of each data point over time.

[0068] Subsequently, the curve is dynamically truncated with a window based on a preset information validity threshold $\theta$. The information validity threshold $\theta$ is a constant between 0 and the initial value $V_0$, determined based on the minimum timeliness requirements for address information in the business. The processing procedure is as follows: tracing back from the current time $t_{now}$ along the historical timeline, locate the last data point on the information value curve whose value is equal to or higher than $\theta$, and denote the time corresponding to that point as $t_s$. The time interval that starts at this historical moment $t_s$ and ends at the current moment $t_{now}$, i.e. $[t_s, t_{now}]$, is the valid information window. The curve segments outside this window are removed because the information value they represent is below the minimum acceptable standard for the business.

[0069] Finally, the area enclosed by the information value curve and the time axis (horizontal axis) within the effective information window is calculated. This area is calculated by discretizing and approximating the integration over continuous time intervals within the window: the window time interval is divided into n small time intervals, and the product of the approximate average height (value) of the information value curve in each time interval and the duration of that time interval is calculated. Finally, the products of all time intervals are summed to obtain the total area S. This area S directly reflects the total information value contributed by the terminal within the effective time window. The value of area S can be linearly scaled or directly used as a feature value, i.e., quantified as the time urgency feature value $U$ corresponding to the terminal. The higher the value of $U$, the more timely information the terminal has provided recently.
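The three steps (exponential decay, threshold truncation, area summation) can be sketched as follows. This is one possible reading, with assumptions stated plainly: the curve is sampled at the terminal's own data points, the small time intervals are the gaps between consecutive retained points, and the average height of each interval is taken as the mean of its two endpoint values (the trapezoidal rule). The function name and the default parameter values are hypothetical.

```python
import math


def time_urgency(timestamps, t_now, v0=1.0, decay=0.1, threshold=0.2):
    """Area under the truncated information value curve of one terminal.

    v0        -- initial information value of a data point
    decay     -- value decay coefficient (> 0)
    threshold -- information validity threshold, between 0 and v0
    """
    # Step 1: instantaneous value of every data point at t_now.
    points = sorted(
        (t, v0 * math.exp(-decay * (t_now - t)))
        for t in timestamps if t <= t_now
    )
    # Step 2: keep only the points whose value reaches the threshold.
    window = [(t, v) for t, v in points if v >= threshold]
    # Step 3: sum (average height) x (duration) over consecutive intervals.
    area = 0.0
    for (t_a, v_a), (t_b, v_b) in zip(window, window[1:]):
        area += 0.5 * (v_a + v_b) * (t_b - t_a)
    return area
```

Under this reading, a terminal whose recent data spans a longer part of the valid window accumulates a larger area than one whose data is sparse or stale.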

[0070] The generation of content credibility feature values focuses on analyzing the standardization and consistency of address text content in the initial data stream. First, based on preset address element parsing rules, the unstructured address text collected by the terminal is decomposed. The address element parsing rule is a predefined, hierarchical set of keywords or patterns; for example, it decomposes address text into element types such as "province-level," "city-level," "district-level," "street-level," "house number," and "landmark name." Through text matching and regular expression recognition, words or phrases matching these element types are extracted from each address text, and the frequency of each address element type in the entire initial data stream is counted. After the statistics are completed, a vector is generated, where each dimension of the vector corresponds to an address element type, and its value is the frequency of that element's occurrence or its normalized frequency. This vector is the address element distribution D corresponding to the terminal.
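A minimal sketch of building the address element distribution D with regular expressions follows. The patterns in `RULES` are toy examples for English-style address text and are purely hypothetical; a real rule set would be hierarchical and locale-specific. The normalization used here (share of records containing each element type) is one of the options the text allows.

```python
import re
from collections import Counter

# Hypothetical parsing rules: one pattern per address element type.
RULES = {
    "province": re.compile(r"\b\w+ Province\b"),
    "city": re.compile(r"\b\w+ City\b"),
    "district": re.compile(r"\b\w+ District\b"),
    "street": re.compile(r"\b\w+ (?:Road|Street)\b"),
    "house_number": re.compile(r"\bNo\.\s*\d+"),
}


def element_distribution(texts):
    """Return the normalized frequency of each element type in a stream."""
    counts = Counter()
    for text in texts:
        for element_type, pattern in RULES.items():
            if pattern.search(text):
                counts[element_type] += 1
    total = len(texts) or 1  # avoid division by zero on an empty stream
    return {element_type: counts[element_type] / total for element_type in RULES}
```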

[0071] Next, the address element distribution D is compared layer by layer with a pre-stored standardized address element mapping table. The standardized address element mapping table is a knowledge base that stores a standardized, complete address hierarchy structure and the possible canonical element values for each level. The comparison process includes the calculation of two quantitative indicators. The first is the matching completeness $M$: the number of element types that appear in distribution D and can be matched at the corresponding level of the standardized mapping table is divided by the total number of key element types that the standardized mapping table requires to appear in this business scenario. The second is the logical conflict degree $L$: combinations of elements that appear in distribution D but contradict each other geographically or logically are identified (e.g., place names of the same level belonging to different superior regions appear at the same time), and the degree of conflict is assessed from the number and severity weight of such contradictory combinations.

[0072] Finally, based on the matching completeness $M$ and the logical conflict degree $L$, the content credibility feature value $C$ is synthesized using a predefined discrete metric method. This synthesis aims to characterize the overall deviation of current terminal data from the standard benchmark. A specific synthesis method is defined by the following formula:

$$C = 1 + \alpha \cdot M \cdot (L_{max} - L)^{\gamma}$$

[0073] In this formula, the parameter $L_{max}$ represents the theoretical maximum value of the logical conflict degree or a preset upper limit constant. The parameter $\alpha$ is a scaling factor greater than 0, used to adjust the range of credibility values. The parameter $\gamma$ is an exponential coefficient used to control the non-linearity of the impact of reduced logical conflict on improved credibility. Its value is usually set according to the business's sensitivity to data errors; for example, when the business is extremely sensitive to errors, it can be set to $\gamma > 1$ to amplify the positive impact of low conflict levels. The formula first processes the logical conflict degree by calculating the difference between $L_{max}$ and $L$ and raising it to the power of $\gamma$; the result is multiplied by the matching completeness $M$, scaled by the coefficient $\alpha$, and 1 is then added as the base value to obtain the final $C$ value. The higher the value of $C$, the more compliant and logically consistent the address data provided by the terminal is in terms of content, and the higher its reliability.
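The synthesis can be sketched as below. The formula symbols did not survive in the source text, so this example assumes one reading of the verbal description: credibility = 1 + scale × completeness × (conflict_max − conflict) ^ exponent. That reading, the function names, and the default parameters are all assumptions rather than the confirmed formula.

```python
def matching_completeness(distribution, required_types):
    """Share of required element types that actually occur in the distribution."""
    present = sum(1 for t in required_types if distribution.get(t, 0) > 0)
    return present / len(required_types)


def content_credibility(completeness, conflict, conflict_max,
                        scale=1.0, exponent=1.0):
    """Assumed synthesis: 1 + scale * completeness * (max - conflict) ** exponent."""
    # Clamp the conflict degree into [0, conflict_max] before differencing.
    conflict = min(max(conflict, 0.0), conflict_max)
    return 1.0 + scale * completeness * (conflict_max - conflict) ** exponent
```

With this form, credibility rises as completeness rises or as conflict falls, and it bottoms out at 1 when nothing matches or conflict is at its maximum.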

[0074] In the dynamic coordination coefficient construction module, the time-series urgency feature value and the content credibility feature value are mapped to the same metric space, and normalized and relevance calibrated to generate a unique set of dynamic coordination coefficients for each terminal, specifically including:

[0075] The calculation of the dynamic synergy coefficient begins with a synergistic analysis of the temporal urgency feature value and the content credibility feature value corresponding to each terminal. These two feature values ​​characterize the different characteristics of terminal data in the time dimension and the content dimension, respectively. In order to compare and integrate them within a unified framework, it is first necessary to construct a mathematical structure that can reflect the interaction between the two.

[0076] Specifically, for all participating terminals, historical temporal urgency and content credibility feature values ​​are collected over several recent preset collaboration periods. This historical data is organized into a two-dimensional interaction matrix. The rows and columns of this matrix correspond to different collaboration periods, and each element is a combination of two values: one value is a correlation measure (e.g., covariance) among the temporal urgency feature values ​​of all terminals within that period, and the other value is a correlation measure among the content credibility feature values ​​of all terminals within the corresponding period. In this way, the two-dimensional interaction matrix simultaneously encodes the evolution patterns of each feature value over time and the covariance relationship between them.

[0077] Subsequently, spectral analysis is performed on the two-dimensional interaction matrix to extract its dominant variation pattern, i.e., the principal eigenvector. The original interaction matrix is first preprocessed into a standardized real symmetric matrix. Preprocessing includes symmetrization (ensuring the matrix equals its transpose) and centering (adjusting the matrix elements to have a mean of zero). An iterative approximation method is then used to solve for the principal eigenvector of the standardized real symmetric matrix. The method starts from a randomly generated initial two-dimensional vector, repeatedly multiplies it by the real symmetric matrix, and normalizes the resulting vector to unit length after each multiplication; each multiply-and-normalize step is one iteration. Iteration continues until the change in vector direction between two adjacent iterations is less than a preset minimum threshold. At that point the calculation is considered converged, and the resulting stable vector direction approximates the direction of the principal eigenvector. To verify the reliability of this approximation, its Rayleigh quotient is calculated: the dot product of the vector with its image under the real symmetric matrix, divided by the dot product of the vector with itself. The Rayleigh quotient is compared with a reference range for the largest eigenvalue of the matrix estimated by other methods; if it falls within this range, the verification succeeds. Finally, the converged and verified stable vector direction is normalized to a unit vector and formally determined as the principal eigenvector.
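The iteration and verification described above correspond to the classical power iteration method with a Rayleigh quotient check. The following Python sketch is illustrative only: the function names, the tolerance, the fixed random seed, and the sample matrix are assumptions, and the reference range uses one possible choice of "other methods" (the largest diagonal entry as a lower bound and the largest absolute row sum as an upper bound on the largest eigenvalue of a real symmetric matrix). The matrix is assumed to be already symmetrized and centered.

```python
import math
import random


def mat_vec(matrix, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in matrix]


def power_iteration(matrix, tol=1e-10, max_iter=1000, seed=0):
    """Approximate the principal eigenvector of a real symmetric matrix."""
    rng = random.Random(seed)  # fixed seed (assumption) for repeatability
    v = [rng.uniform(-1.0, 1.0) for _ in matrix]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(max_iter):
        w = mat_vec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("iterate collapsed to the zero vector")
        w = [x / norm for x in w]
        # Direction change between adjacent iterations; abs() ignores sign flips.
        change = 1.0 - abs(sum(a * b for a, b in zip(v, w)))
        v = w
        if change < tol:
            break
    return v


def rayleigh_quotient(matrix, v):
    """(v . Av) / (v . v) for the candidate eigenvector v."""
    av = mat_vec(matrix, v)
    return sum(a * b for a, b in zip(v, av)) / sum(a * a for a in v)


# Hypothetical standardized real symmetric matrix (eigenvalues 3 and 1).
A = [[2.0, 1.0], [1.0, 2.0]]
v = power_iteration(A)
lam = rayleigh_quotient(A, v)

# Reference range for the largest eigenvalue (assumed choice of bounds).
lower = max(A[i][i] for i in range(len(A)))
upper = max(sum(abs(a) for a in row) for row in A)
verified = lower - 1e-9 <= lam <= upper + 1e-9
```

Power iteration converges to the eigenvector whose eigenvalue has the largest absolute value, so the bounds above are only meaningful when that eigenvalue is positive, as it is in this example.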

[0078] After the principal eigenvector is obtained, for any terminal whose dynamic coordination coefficient needs to be calculated, its current temporal urgency feature value and content credibility feature value are treated as a two-dimensional coordinate point. The length of the projection of this coordinate point vector onto the principal eigenvector is then calculated as follows: first, the dot product of the coordinate point vector and the unit vector of the principal eigenvector is calculated, and then the absolute value of the dot product is taken.

[0079] Finally, the calculated projection length is quantified into the dynamic coordination coefficient of the terminal through a preset scaling function. The purpose of the scaling function is to map the projection length to a predetermined numerical range that suits business interpretation, such as between 0 and 1, or a specific integer range. A simple scaling method is linear scaling: a theoretical maximum or an observed historical maximum of the projection length is set, and the actual projection length is divided by this maximum, giving a ratio between 0 and 1. This ratio can be used as the dynamic coordination coefficient. The coefficient reflects the contribution intensity and priority level of the terminal's data at the current moment, relative to the overall historical collaboration pattern, under the combined effect of the temporal urgency and content credibility of the data.
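The projection and linear scaling steps can be sketched as follows. This is a minimal illustration: the function names, the sample feature values, and the chosen maximum are assumptions, and clamping the ratio at 1 for lengths above the historical maximum is an added safeguard not stated in the text.

```python
import math


def projection_length(point, principal):
    """|point . unit(principal)|: projection length onto the principal eigenvector."""
    norm = math.sqrt(sum(c * c for c in principal))
    if norm == 0.0:
        raise ValueError("principal eigenvector must be non-zero")
    unit = [c / norm for c in principal]
    return abs(sum(p * u for p, u in zip(point, unit)))


def linear_scale(length, max_length):
    """Divide by a theoretical or historical maximum; clamp at 1 (assumption)."""
    if max_length <= 0.0:
        raise ValueError("max_length must be positive")
    return min(length / max_length, 1.0)


# Hypothetical terminal: temporal urgency 0.6, content credibility 0.8.
principal = [1.0, 1.0]
# If both feature values lie in [0, 1], the largest possible projection
# onto (1, 1) / sqrt(2) is sqrt(2); this is used as the theoretical maximum.
length = projection_length([0.6, 0.8], principal)
coefficient = linear_scale(length, max_length=math.sqrt(2.0))
```

With these inputs the projection length is 1.4 / sqrt(2), so the resulting coefficient is 0.7.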

[0080] In the asynchronous weighted collaborative scheduling module, the terminal address data continuously flowing into the cloud is scheduled differentially based on the dynamic coordination coefficient. Terminal data with a high dynamic coordination coefficient immediately triggers the update process of the cloud identification core. Terminal data with a low dynamic coordination coefficient is temporarily stored in a buffer queue for multiple rounds of iterative purification. The differentiated processing results of different terminals are integrated asynchronously to correct and optimize the address element parsing rules, specifically including:

[0081] After terminal data has been scheduled and processed differentially based on the dynamic coordination coefficient, data from different terminals follows different paths: high-coefficient data produces new parsing results through the rapid update process, while low-coefficient data produces optimized parsing results after multiple rounds of iterative purification in the buffer queue. To integrate these differentiated results and optimize the core address parsing rules, the most recently output address element parsing results are first extracted from the two processing paths. These results are typically represented as a structured list of address elements or a sequence of tags. Each newly generated parsing result is then compared item by item with the address element parsing rules currently in use in the cloud. The comparison identifies elements, or relationships between elements, that are present in the new results but not covered by the current rules, as well as differences between the new results and the current rules in the values of, or constraints on, the same elements. Each difference is quantified as a numerical vector with direction and magnitude that points in the direction in which the rule should be adjusted. All such difference vectors are collected into a set called the difference vector cluster.

[0082] Subsequently, density-based spatial clustering analysis is performed on the difference vector cluster to extract general and consistent trends of change from potentially noisy and scattered individual differences. The specific steps are as follows. First, the local density of each vector in the cluster is calculated. For any vector, a spherical region is defined with the vector's position in the multidimensional vector space as the center and a predetermined fixed value as the radius. The number of other vectors that fall within this spherical region is counted and defined as the local density value of that vector. Once this calculation has been completed for all vectors, a local density distribution map is obtained that reflects how densely the vectors are distributed across the entire vector space. Second, a density screening threshold, called the density benchmark, is set. The density benchmark is usually chosen as a statistical quantile of the local density values of all vectors, such as the median or the third quartile. All vectors with local density values higher than the density benchmark are selected to form the core vector set. These vectors are closely surrounded by other vectors and are representative of the population. Third, connections between the core vectors are established. The core vector set is traversed and, for any two core vectors, the spatial distance between them is calculated. If this distance is less than or equal to a preset connection radius (which may be the same as or different from the counting radius in the first step), a connection edge is established between the two core vectors. All directly or indirectly connected core vectors are merged into the same set; each such set is called a core vector group, and the vectors within a group are spatially adjacent to each other. Fourth, direction synthesis is performed. For each core vector group, the directions of all the original difference vectors contained in the group are synthesized, either by calculating the average direction of all vectors in the group or by obtaining the most representative direction vector through another vector synthesis algorithm. The synthesized direction vector represents one mainstream change direction identified from the data. The set of synthesized direction vectors of all core vector groups is the final set of mainstream change directions.
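The four steps above can be sketched in Python as follows (Python 3.8 or later for `math.dist` and multi-argument `math.hypot`). The function name, the radii, the density benchmark, and the sample vectors are all illustrative assumptions. The benchmark is passed in directly rather than derived from a quantile, and the average direction is used for synthesis, which is one of the options the text allows.

```python
import math


def mainstream_directions(vectors, count_radius, link_radius, density_benchmark):
    """Return one unit direction per connected group of dense difference vectors."""
    n = len(vectors)
    # Step 1: local density = number of other vectors within count_radius.
    density = [
        sum(1 for j in range(n)
            if j != i and math.dist(vectors[i], vectors[j]) <= count_radius)
        for i in range(n)
    ]
    # Step 2: core vectors have density strictly above the benchmark.
    core = [i for i in range(n) if density[i] > density_benchmark]
    # Step 3: merge core vectors linked directly or indirectly within link_radius.
    groups, seen = [], set()
    for start in core:
        if start in seen:
            continue
        seen.add(start)
        group, stack = [], [start]
        while stack:
            i = stack.pop()
            group.append(i)
            for j in core:
                if j not in seen and math.dist(vectors[i], vectors[j]) <= link_radius:
                    seen.add(j)
                    stack.append(j)
        groups.append(group)
    # Step 4: average the unit directions within each group.
    directions = []
    for group in groups:
        total = [0.0] * len(vectors[group[0]])
        for i in group:
            norm = math.hypot(*vectors[i])
            if norm > 0.0:
                total = [t + c / norm for t, c in zip(total, vectors[i])]
        norm = math.hypot(*total)
        if norm > 0.0:
            directions.append([t / norm for t in total])
    return directions


# Hypothetical difference vectors: two tight groups plus two isolated outliers.
cluster = [
    (1.0, 0.0), (1.1, 0.1), (0.9, -0.1), (1.0, 0.1),
    (0.0, 1.0), (0.1, 1.1), (-0.1, 0.9),
    (-2.0, -2.0), (3.0, -3.0),
]
dirs = mainstream_directions(cluster, count_radius=0.3, link_radius=0.3,
                             density_benchmark=1)
```

In this example the two outliers have a local density of 0 and are excluded, leaving two core vector groups whose synthesized directions lie close to the positive x axis and the positive y axis.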

[0083] Next, rule fusion and conflict resolution are performed. The change intentions implied by the set of mainstream change directions obtained in the previous step are applied item by item to the address element parsing rules currently in use. This process involves logical integration: rule clauses that a direction indicates should be added or strengthened are adopted or enhanced, and clauses that a direction indicates should be weakened or deleted are adjusted accordingly. When different change directions contradict each other logically (for example, one direction requires stronger constraints on a certain element format while another requires relaxation), priority arbitration and conflict resolution are carried out based on the comprehensive weight of the original data sources that generated these directions (such as data volume and the average dynamic coordination coefficient of the source terminals) and on the density support of each change direction within the cluster, thereby forming a rule update draft that is internally logically consistent.

[0084] Finally, the validity of the rule update draft is verified to determine whether it should be formally adopted. A historical address validation set, independent of the training data used in this rule update, is selected. This set contains a large number of address samples with known correct parsing results. All samples in the validation set are parsed using both the current old parsing rules and the new rule draft, and the overall parsing accuracy of each is calculated by dividing the number of correctly parsed samples by the total number of samples in the validation set. A threshold for accuracy improvement is preset; this threshold is a percentage value greater than 0, such as 2%. Only if the improvement in parsing accuracy under the new rule draft over the baseline accuracy under the old rules exceeds this preset threshold is the rule update deemed valid and the draft formally adopted, completing the correction and optimization of the address element parsing rules. If the improvement does not reach the threshold, the draft is abandoned, the original rules remain unchanged, and a further update is attempted after more data has accumulated.
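The adoption gate can be sketched as follows. The function names, the toy parsers, and the validation samples are hypothetical, and the 2% threshold is interpreted here as an absolute gain of 0.02 in accuracy, which is an assumption since the text does not say whether the improvement is absolute or relative.

```python
def accuracy(parse, samples):
    """Fraction of validation samples whose parse equals the known result."""
    correct = sum(1 for text, expected in samples if parse(text) == expected)
    return correct / len(samples)


def should_adopt(old_parse, new_parse, samples, min_gain=0.02):
    """Adopt the draft only if accuracy improves by more than min_gain."""
    gain = accuracy(new_parse, samples) - accuracy(old_parse, samples)
    return gain > min_gain


# Toy stand-ins for the old rules and the rule update draft (assumptions).
def old_rules(text):
    return tuple(text.split(" "))


def draft_rules(text):
    return tuple(text.replace(",", " ").split())


# Hypothetical validation set: (raw text, known correct parse).
validation_set = [
    ("Fujian Xiamen", ("Fujian", "Xiamen")),
    ("Fujian,Xiamen", ("Fujian", "Xiamen")),
    ("Fujian  Xiamen", ("Fujian", "Xiamen")),
    ("Fujian Fuzhou", ("Fujian", "Fuzhou")),
]

adopt = should_adopt(old_rules, draft_rules, validation_set)
```

Here the old rules parse two of four samples correctly and the draft parses all four, so the gain of 0.5 exceeds the threshold and the draft would be adopted.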

[0085] In the structured address output module, the stable parsing rules formed after the asynchronous weighted collaborative identification process are applied to raw address data, whether arriving in real time or accumulated historically, to output structured address entries in a unified format, specifically including:

[0086] To apply the stable parsing rules formed after asynchronous weighted collaborative recognition steps to actual address data and generate the final output, a hierarchical decision structure needs to be constructed based on these rules. These rules clarify the recognition patterns, logical relationships, and hierarchical constraints of different address elements (such as province, city, district, street, and house number). Based on these rules, a tree-like decision process is constructed: the root node corresponds to the initial parsing of the address string; each non-leaf node represents a judgment condition for a type of address element, and its branches represent possible different results or the next level of element type under that condition; leaf nodes correspond to a specific address element value and its corresponding hierarchical label. During application, each piece of original address data (text or recognized text information) is input from the root node. From top to bottom, based on the matching of the text content with the node's judgment condition, the process traverses the decision tree layer by layer. For each successfully matched and judged element, its type and value are recorded as a data unit with a hierarchical label. When a leaf node is reached or matching fails, all recorded data units are combined according to the judgment order to generate a temporary address tuple with hierarchical labels.
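A minimal sketch of such a top-down traversal is given below. The tree layout, the regular expressions, and the level names are hypothetical stand-ins for the stable parsing rules, and Chinese sample addresses are used because the suffix-based judgment conditions are easiest to show in that form.

```python
import re


def node(level, pattern, *children):
    """A judgment node: the level it labels, its condition, and its branches."""
    return {"level": level, "pattern": re.compile(pattern), "children": children}


# Hypothetical decision tree; the root has no condition of its own.
house = node("house_number", r"\d+号")
street = node("street", r".+?(?:路|街)", house)
TREE = {
    "children": (
        node("province", r".+?省",
             node("city", r".+?市",
                  node("district", r".+?区", street),
                  node("county", r".+?县", street))),
    )
}


def parse(text):
    """Walk the tree from the root, recording (level, value) for each match."""
    units, pos, current = [], 0, TREE
    while current["children"]:
        for child in current["children"]:
            match = child["pattern"].match(text, pos)
            if match:
                units.append((child["level"], match.group(0)))
                pos = match.end()
                current = child
                break
        else:
            break  # no branch matched: stop and keep what was recorded
    return units


full = parse("福建省厦门市思明区湖滨南路12号")
partial = parse("福建省泉州市安溪县")
```

The first call reaches a leaf and yields five labeled units; the second takes the county branch and stops when no street matches, returning the three units recorded so far as the temporary address tuple.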

[0087] Next, perform a combined verification on the temporary address tuple to correct possible logical errors or missing hierarchies among elements in the preliminary parsing. This step depends on two types of predefined verification rules: one is the geographical constraint rule, which stores in the knowledge base the inclusion relationships of known valid superior and subordinate administrative regions or geographical regions (for example, a certain district must be subordinate to a certain city); the other is the logical completeness rule, which defines that certain elements in a complete address must appear in pairs or have a specific order (for example, if there is a house number, there should usually be a street name). The verification process first checks whether the values of any two elements with a hierarchical relationship in the temporary address tuple violate the inclusion relationship in the geographical constraint rule. If violated, a suggested correction is made to one of the elements or it is marked as pending according to the more general or authoritative relationship in the knowledge base. Secondly, check whether the tuple meets the logical completeness rule. For elements that are required by the rule but missing in the tuple, try to infer and complete them based on the context or associated information. After all verifications, corrections, and completions are done, a new set of address data with stronger internal logical consistency is formed, called the verified intermediate address tuple.
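The two kinds of verification rules can be sketched as follows. The knowledge-base entries, level names, and note format are illustrative assumptions. In particular, the sketch resolves a containment violation by trusting the lower-level element and correcting its parent, which is only one possible reading of "the more general or authoritative relationship in the knowledge base".

```python
# Hypothetical geographical constraint rules: (level, value) -> required parent.
PARENT_OF = {
    ("district", "Siming"): ("city", "Xiamen"),
    ("city", "Xiamen"): ("province", "Fujian"),
}
# Hypothetical logical completeness rule: a house number needs a street.
REQUIRES = {"house_number": "street"}
LEVELS = ["province", "city", "district", "street", "house_number"]


def verify_tuple(units):
    """Return (corrected tuple as a dict, list of correction or pending notes)."""
    fixed = dict(units)
    notes = []
    # Check containment from the lowest level upward so that a corrected
    # parent is itself checked against its own parent.
    for level in reversed(LEVELS):
        parent = PARENT_OF.get((level, fixed.get(level)))
        if parent is None:
            continue
        parent_level, parent_value = parent
        if fixed.get(parent_level) != parent_value:
            notes.append(f"{parent_level}: {fixed.get(parent_level)!r} -> {parent_value!r}")
            fixed[parent_level] = parent_value
    # Check completeness; elements that cannot be inferred are marked pending.
    for element, required in REQUIRES.items():
        if element in fixed and required not in fixed:
            notes.append(f"{required}: pending (required by {element})")
    return fixed, notes


temporary = {"province": "Fujian", "city": "Fuzhou",
             "district": "Siming", "house_number": "12"}
verified, notes = verify_tuple(temporary)
```

For this input the city is corrected to Xiamen because Siming District belongs to Xiamen, and the missing street is reported as pending rather than guessed.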

[0088] The address string is then assembled. According to a preset unified output specification, the intermediate address tuple is converted into a text string in standard format. This specification strictly defines the presentation format of the final address entry, mainly including: first, the fixed order of the address elements at each level, for example "province, city, district, street, house number, supplementary information"; second, the specific delimiters used between elements, for example using the Chinese characters for "province" (省), "city" (市), and "district" (区) as suffix delimiters for the corresponding levels, and using a half-width comma or space to separate the other levels. The assembly operation traverses the elements of the intermediate address tuple in the fixed order given by the specification, takes out the value text of each element in turn, adds the delimiter required by the specification before or after it, and concatenates all texts and delimiters in order to generate a preliminary address string.
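The assembly step can be sketched as follows. The level names, the choice of a half-width comma as the separator, and the sample tuple are assumptions made for illustration.

```python
# Hypothetical output specification: fixed order plus per-level delimiters.
ORDER = ["province", "city", "district", "street", "house_number", "extra"]
SUFFIX = {"province": "省", "city": "市", "district": "区"}


def assemble(units, separator=","):
    """Concatenate element values in the fixed order with their delimiters."""
    parts = []
    for level in ORDER:
        value = units.get(level)
        if not value:
            continue  # missing levels are skipped
        if level in SUFFIX:
            suffix = SUFFIX[level]
            # Suffix delimiter: append it unless the value already ends with it.
            parts.append(value if value.endswith(suffix) else value + suffix)
        else:
            # Other levels are separated by the half-width separator.
            parts.append(separator + value)
    return "".join(parts).lstrip(separator)


intermediate = {
    "province": "福建", "city": "厦门", "district": "思明",
    "street": "湖滨南路", "house_number": "12号",
}
address = assemble(intermediate)
```

For the sample tuple this produces `福建省厦门市思明区,湖滨南路,12号`.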

[0089] Finally, a consistency check is performed on the assembled address string to ensure its quality. This check consists of three sequential sub-steps. First, a pattern compliance check: the address string is compared with a predefined address character pattern library. This pattern library is compiled by analyzing a large number of standardized addresses and defines the character type sequences of compliant address strings (such as the allowed positions and combinations of Chinese characters, numbers, and specific symbols), the reasonable length range of each segment (usually separated by delimiters), and the precise positions where delimiters must appear. The address string is checked for conformity with these pattern characteristics. Second, a spatial relationship logic check: based on a pre-built spatial relationship map of address elements, the check verifies whether reasonable spatial inclusion or direct adjacency relationships actually exist between the place name elements at all levels extracted from the string. This map is constructed by integrating official administrative division data and geographic information system data. Third, a spatiotemporal correlation check: the check traces back the collection timestamp and geographic coordinates (if present) attached to the original address data corresponding to the address string, and verifies whether the textual description of the address is logically consistent with the collection time (e.g., a new administrative division name may only appear after its official implementation date) and with the approximate spatial range of the collection. Only when the address string passes all three checks in sequence without any conflict or anomaly is it output as a structured address entry in a unified format. If any check fails, the address string is marked as an abnormal result and transferred to a dedicated exception handling or manual review process instead of being included in the final output result set.
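The three sequential checks can be sketched as follows. The regular expression, the containment map, and especially the effective date are illustrative placeholders rather than real reference data, and the geographic coordinate comparison is omitted for brevity.

```python
import re
from datetime import date

# Hypothetical pattern library entry: suffixed levels, then comma-separated parts.
PATTERN = re.compile(
    r"[\u4e00-\u9fff]+省[\u4e00-\u9fff]+市[\u4e00-\u9fff]+区(,[\u4e00-\u9fff0-9]+)*"
)
# Hypothetical spatial relationship map: element -> element that contains it.
CONTAINS = {"厦门市": "福建省", "福州市": "福建省", "思明区": "厦门市"}
# Placeholder effective date, NOT the real implementation date of this name.
EFFECTIVE_FROM = {"思明区": date(2000, 1, 1)}


def check_pattern(address):
    return PATTERN.fullmatch(address) is not None


def check_spatial(address):
    match = re.match(r"(.+?省)(.+?市)(.+?区)", address)
    if match is None:
        return False
    province, city, district = match.groups()
    return CONTAINS.get(city) == province and CONTAINS.get(district) == city


def check_temporal(address, collected_on):
    return all(collected_on >= start
               for name, start in EFFECTIVE_FROM.items() if name in address)


def final_check(address, collected_on):
    """Run the checks in order; return (passed, name of the first failed check)."""
    if not check_pattern(address):
        return False, "pattern"
    if not check_spatial(address):
        return False, "spatial"
    if not check_temporal(address, collected_on):
        return False, "temporal"
    return True, None


ok = final_check("福建省厦门市思明区,湖滨南路,12号", date(2024, 5, 1))
bad_spatial = final_check("福建省福州市思明区,湖滨南路,12号", date(2024, 5, 1))
bad_time = final_check("福建省厦门市思明区,湖滨南路,12号", date(1999, 1, 1))
```

Returning the name of the first failed check gives the exception handling or manual review process a reason code to work from.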

[0090] The working principle of this invention is as follows. Raw address data containing unstructured text or images is collected in real time from multiple independent terminals to form an initial data stream. First, a dual-dimensional feature quantification analysis is performed on the data from each terminal: in the time dimension, a temporal urgency feature value reflecting data freshness is generated through continuous value decay simulation; in the content dimension, a content credibility feature value reflecting data standardization is generated by comparison with pre-stored benchmark patterns and calculation of a probability divergence. These two feature values are then mapped to the same metric space for normalization and correlation calibration, and a two-dimensional matrix reflecting their interaction is constructed. The principal eigenvector is extracted through spectral analysis, and a dynamic coordination coefficient is calculated that uniquely corresponds to each terminal and comprehensively evaluates the timeliness and quality of its data contribution. Based on this coefficient, the address data continuously flowing into the cloud is scheduled differentially: high-coefficient data immediately triggers an update of the identification core, while low-coefficient data is temporarily stored in a buffer queue for multiple rounds of iterative purification. The processing results from different terminals are then integrated asynchronously, and a density-based clustering method is used to identify the mainstream change directions, thereby correcting and optimizing the address element parsing rules. Finally, the optimized stable rules are applied, and the original address data is converted into structured address entries in a unified format through hierarchical decision-making, combined verification, and standardized assembly. This scheme effectively addresses the problems of real-time performance, noise robustness, and adaptive optimization in the collaborative processing of multi-source heterogeneous address data.

[0091] The foregoing has provided a detailed description of one embodiment of the present invention, but this description is merely a preferred embodiment and should not be construed as limiting the scope of the invention. All equivalent variations and modifications made within the scope of the claims of this invention should still fall within the patent coverage of this invention.

Claims

1. A cloud-based intelligent address data identification system, characterized by comprising: a multi-source heterogeneous data acquisition module, used to acquire raw address data sets in real time from multiple independently operating terminals, wherein each terminal collects address information containing unstructured text or images at its own physical location to form an initial data stream; a dual-dimensional feature quantification analysis module, which, in the time dimension, simulates the continuous value decay of the timestamp sequence generated by the initial data stream of each terminal, estimates the current effective information density of the timestamp sequence, and generates a temporal urgency feature value, and which, in the content dimension, extracts the semantic dispersion of the initial data stream generated by each terminal, compares the dispersion with a pre-stored benchmark pattern, and calculates the deviation using a probability divergence metric to generate a content credibility feature value; wherein the content credibility feature value is generated as follows: the unstructured address text in the initial data stream is decomposed according to the preset address element parsing rules and the frequency of occurrence of each address element is counted, generating the address element distribution of the corresponding terminal; the address element distribution is compared level by level with a pre-stored standardized address element mapping table, and the matching completeness and logical conflict degree of the elements at each level are calculated; based on the matching completeness and the logical conflict degree, a numerical value representing the overall deviation between the corresponding terminal data and the standard benchmark is synthesized through a predefined discrete quantification method and serves as the content credibility feature value; wherein the standardized address element mapping table is a knowledge base that stores a standardized, complete address hierarchy and the standard element values at each level, and the comparison includes the calculation of two quantitative indicators: the matching completeness is calculated by dividing the number of element types that appear in the address element distribution and can be matched at the corresponding level of the standardized mapping table by the total number of key element types that the standardized mapping table requires to appear in the corresponding business scenario; the logical conflict degree is calculated by identifying combinations of elements that appear in the address element distribution but contradict each other geographically or logically, and assessing the degree of conflict from the number and severity weights of such contradictory combinations; a dynamic coordination coefficient construction module, which maps the temporal urgency feature value and the content credibility feature value to the same metric space and performs normalization and correlation calibration to generate a set of dynamic coordination coefficients that uniquely corresponds to each terminal; an asynchronous weighted collaborative scheduling module, which performs differentiated scheduling of the terminal address data continuously flowing into the cloud based on the dynamic coordination coefficient, wherein terminal data with a high dynamic coordination coefficient immediately triggers the update process of the cloud identification core, terminal data with a low dynamic coordination coefficient is temporarily stored in a buffer queue for multiple rounds of iterative purification, and the differentiated processing results from different terminals are integrated asynchronously to correct and optimize the address element parsing rules; and a structured address output module, which applies the stable parsing rules formed after the asynchronous weighted collaborative identification process to raw address data arriving in real time or accumulated historically, and outputs structured address entries in a unified format.

2. The cloud-based intelligent address data identification system according to claim 1, characterized in that the temporal urgency feature value is generated as follows: a continuous decay mapping is performed on the timestamp sequence to generate an information value curve that characterizes the decay of each data point over time; based on a preset information validity threshold, the information value curve is dynamically truncated, retaining only the curve segments that exceed the information validity threshold to form an effective information window; the area enclosed by the curve and the time axis within the effective information window is calculated, and the area value is quantified as the temporal urgency feature value.

3. The cloud-based intelligent address data identification system according to claim 1, characterized in that the dynamic coordination coefficient is calculated as follows: a two-dimensional interaction matrix between the temporal urgency feature values and the content credibility feature values is constructed; spectral analysis is performed on the two-dimensional interaction matrix to extract the principal eigenvector of the two-dimensional interaction matrix; the two-dimensional coordinate point composed of the temporal urgency feature value and the content credibility feature value is projected onto the direction of the principal eigenvector; the length of the projection is scaled by a preset scaling function and quantified into the dynamic coordination coefficient of the corresponding terminal.

4. The cloud-based intelligent address data identification system according to claim 3, characterized in that performing spectral analysis on the two-dimensional interaction matrix to extract the principal eigenvector of the two-dimensional interaction matrix specifically includes: the two-dimensional interaction matrix is symmetrized and centered to obtain a standardized real symmetric matrix; an orthogonal-direction iterative approximation method is applied to the real symmetric matrix, and a stable vector direction satisfying a preset convergence condition is obtained through multiple iterations; the Rayleigh quotient of the stable vector direction with respect to the real symmetric matrix is calculated and used as the final verification basis for convergence; the unit vector in the verified stable vector direction is determined as the principal eigenvector.

5. The cloud-based intelligent address data identification system according to claim 1, characterized in that asynchronously integrating the differentiated processing results from different terminals to correct and optimize the address element parsing rules specifically includes: the most recently output address element parsing results are extracted from the immediately triggered update process and from the buffer queue for multiple rounds of iterative purification, respectively; the address element parsing results are compared with the parsing rules currently in use to generate a difference vector cluster; density-based spatial clustering is performed on the difference vector cluster to identify the set of mainstream change directions; the set of mainstream change directions is logically integrated with the current parsing rules and conflicts are resolved to form a rule update draft; the rule update draft is tested on an independent historical address validation set, and the rule update draft is adopted only when the parsing accuracy improves by more than a preset threshold, thereby completing the formal correction and optimization of the address element parsing rules.

6. The cloud-based intelligent address data identification system according to claim 5, characterized in that performing density-based spatial clustering on the difference vector cluster to identify the set of mainstream change directions specifically includes: for each vector in the difference vector cluster, the number of other vectors within a preset radius of the vector is counted to form the local density value of the vector, and the local density distribution map of the entire vector space is constructed; based on a preset density benchmark, vectors with local density values higher than the density benchmark are selected from the local density distribution map to form a core vector set; the core vector set is traversed, a connection is established between any two core vectors in the core vector set that satisfy the spatial proximity relationship, and all core vectors reachable through direct or indirect connections are merged into an independent core vector group; all vector directions within each core vector group are synthesized to generate an aggregate vector representing the mainstream change direction of the core vector group, and the set of all aggregate vectors is the set of mainstream change directions.

7. The cloud-based intelligent address data identification system according to claim 1, characterized in that outputting structured address entries in a unified format specifically includes: the original address data is input into a hierarchical decision tree constructed from the stable parsing rules, and the address element type and logical level are determined and labeled layer by layer from top to bottom, generating a temporary address tuple with hierarchical labels; the address elements in the temporary address tuple undergo combined verification, in which element conflicts and missing levels are identified and corrected based on predefined geographical constraint rules and logical completeness rules to form a verified intermediate address tuple; the intermediate address tuple is assembled with delimiters in a fixed order according to a preset unified output specification; a final consistency check is performed on the assembled address string to ensure that the address string conforms to the preset complete address paradigm, and the address string is then output as a structured address entry in a unified format.

8. The cloud-based intelligent address data identification system according to claim 7, characterized in that the final consistency check performed on the assembled address string specifically includes: the address string is matched against a preset address character pattern library to verify whether its character type sequence, segment lengths, and delimiter positions conform to the standard pattern; based on a pre-constructed spatial relationship map of address elements, it is verified whether valid spatial inclusion or adjacency relationships exist between the address elements at different levels in the address string; the address string is correlated with and verified against the temporal and spatial stamp information in the original address data to ensure that the address string is consistent with the temporal and spatial logic of its collection; the address string is output as a structured address entry in a unified format only if it passes all verification steps.