Batch import-based logistics data management warehousing method and system

The three-layer collaborative governance approach solves the problem of conflict merging, data quality classification and parallel fragmentation in the batch import of logistics data, realizes an efficient data governance pipeline, improves data consistency and governance efficiency, and forms a closed-loop governance system.

CN122220331A · Pending · Publication Date: 2026-06-16 · BEIJING CHENGZHI JUNRONG TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING CHENGZHI JUNRONG TECH CO LTD
Filing Date
2026-04-24
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

In the existing technology, during the batch import of logistics data, conflict merging, data quality classification, and parallel sharding are independent of each other, resulting in problems such as duplicate calculations, broken merging clues after sharding, and inconsistencies between quality classification and merging decisions in the governance process.

Method used

A three-layer collaborative governance approach is adopted, which involves pre-emptive conflict detection and initial quality screening, adaptive sharding based on conflict-connected components, and quality classification governance of rule chains within shards. This approach constructs a unified governance pipeline to achieve deep linkage optimization of conflict merging, parallel sharding processing, and data quality classification.

Benefits of technology

It improves the governance efficiency of bulk logistics data import and the overall quality of inbound data, forming a self-evolving closed-loop governance system. It avoids the problem of broken merging clues after segmentation in traditional solutions, eliminates redundant calculation steps in the governance process, and improves data consistency and overall governance efficiency.

Generated by Eureka AI based on patent content.


Abstract

The application discloses a logistics data governance and warehousing method and system based on batch import, and relates to the technical field of logistics data processing and data governance. Batch logistics data is pre-analyzed: conflict detection is performed based on logistics semantic similarity, and an initial screening quality score is obtained. Conflict-connected components are then used as the basic units for sharding, and governance priorities are assigned to the shards. Each shard independently undergoes rule-chain quality grading governance, and qualified data is conflict-merged according to field-level confidence and then warehoused. Meanwhile, the governance results are fed back to optimize the pre-analysis parameters, and a multidimensional data quality audit is triggered after the import is completed. The application achieves collaborative governance across conflict merging, quality grading, and parallel sharding, improves data consistency and governance efficiency, forms a closed-loop self-evolution mechanism, and comprehensively guarantees the overall quality of the warehoused data.

Description

Technical Field

[0001] This invention relates to the field of logistics data processing and data governance technology, specifically to a logistics data governance and warehousing method and system based on batch import.

Background Technology

[0002] With the accelerated digital transformation of the logistics industry, logistics business systems generate massive amounts of waybill data, trajectory data, and address data daily. This data is typically imported into the enterprise data center in batches from multiple sources for subsequent business analysis, route optimization, and operational management. However, due to inconsistent data standards and varying data quality across different source systems, the same waybill may appear repeatedly during batch import due to multiple source reports. Furthermore, significant differences in field completeness and format specifications between data from different sources pose challenges to unified management and analysis of the data after it is entered into the database.

[0003] To address the aforementioned issues, various technical solutions have been proposed within the industry. For example, patent publication number CN117312880A discloses a method and apparatus for processing basic vehicle information data. This method aggregates and groups multiple pieces of basic vehicle information data based on the vehicle's unique identifier, calculates the similarity between attribute field sets, and then merges them. This solution achieves conflict merging of multi-source data through similarity calculation, but its similarity calculation and merging decision rely solely on the vehicle's unique identifier and attribute field sets, without considering the quality classification processing of the merged data. Another example is patent publication number CN121705270A, which discloses a data quality monitoring and intelligent blocking system based on a rule engine. This system obtains express delivery data from grassroots outlets and performs legality verification, blocking illegal data before transmitting it to intermediate transfer centers. The intermediate transfer centers and headquarters data centers then perform multi-level verification and blocking. This solution uses a rule engine to achieve line-by-line verification and blocking of data quality, but its data quality classification and blocking decisions are independent of conflict merging and fragmentation processing. The same data still needs to go through a complete quality verification chain after the merging decision, resulting in duplicate calculations in the governance process.

[0004] The technical problem this application aims to solve is that in the process of batch importing logistics data, existing technologies for conflict merging, data quality classification, and parallel sharding are independent and lack a collaborative mechanism, leading to problems such as redundant calculations, broken merging clues after sharding, and inconsistencies between quality classification and merging decisions in the governance process. To address this problem, this application proposes a three-layer collaborative data governance method based on logistics semantics. This method integrates pre-emptive conflict detection and initial quality screening, adaptive sharding based on conflict-connected components, and intra-shard rule chain quality classification governance into a unified governance pipeline. This allows conflict merging decisions, sharding routing strategies, and quality classification execution to perceive each other and collaboratively optimize within the same process.

[0005] In summary, existing technologies have not yet proposed a batch import data governance scheme that can synergistically integrate conflict merging, data quality classification, and parallel sharding processing. This application aims to solve the technical problems caused by the fragmentation of the above three governance stages in existing technologies through a three-layer collaborative governance mechanism.

Summary of the Invention

[0006] The purpose of this invention is to overcome the shortcomings of existing technologies and provide a method and system for logistics data governance and warehousing based on batch import. By constructing a unified governance pipeline with three collaborative layers (pre-emptive conflict detection and initial quality screening, adaptive sharding based on conflict-connected components, and rule-chain quality grading governance within shards), it achieves deep linkage optimization of conflict merging, parallel sharding processing, and data quality grading. This effectively solves the core defect of existing technologies, namely that these three stages are fragmented, improves the governance efficiency of batch import of logistics data and the overall quality of warehoused data, and forms a self-evolving closed-loop governance system.

[0007] To solve the above-mentioned technical problems, the present invention provides the following technical solution: a method for governing and warehousing logistics data based on batch import, the method comprising the following steps:
Step 1: Perform pre-analysis on the batch logistics data, conduct conflict detection based on logistics semantic similarity, and obtain an initial screening quality score. The pre-analysis includes extracting the waybill number, trajectory point sequence, sender and receiver address, and timestamp fields of the batch logistics data, and calculating the weighted similarity between each record to be imported and the existing records.
Step 2: Construct conflict-connected components based on the conflict detection results, shard the batch logistics data using the conflict-connected components as the basic sharding unit, and assign a governance priority to each shard based on the initial screening quality score.
Step 3: Each shard independently enters the governance pipeline, executes rule-chain quality classification governance, and decides the data flow direction based on the quality score.
Step 4: After governance, data that meets the entry criteria is conflict-merged based on field-level confidence and then written to the target database.
Step 5: Collect the quality score, blocking reason, and repair operation record of each piece of data generated by the rule-chain quality classification governance, and feed them back to Step 1 for dynamic adjustment of the similarity threshold and the initial screening quality score parameters.
Step 6: After the batch import is completed, automatically trigger the data quality audit process to comprehensively evaluate the imported data along the dimensions of completeness, accuracy, consistency, and timeliness, and generate a visual quality report.

[0008] By establishing conflict relationships and initial quality screening through pre-analysis, and then driving sharding with conflict connectivity components, duplicate data can be identified synchronously at the data entry point and merging clues can be retained. This avoids the problem of merging breakage after sharding in traditional solutions and improves data consistency and governance efficiency in batch import.

[0009] Furthermore, the weighted similarity in step one is calculated as:

$$S = w_{\mathrm{id}} M_{\mathrm{id}} + w_{\mathrm{traj}} S_{\mathrm{traj}} + w_{\mathrm{addr}} S_{\mathrm{addr}} + w_{\mathrm{time}} S_{\mathrm{time}}$$

where $S$ is the weighted similarity; $w_{\mathrm{id}}$ is the exact-match weight of the waybill number; $M_{\mathrm{id}}$ is the exact-match value of the waybill number; $w_{\mathrm{traj}}$ is the similarity weight of the trajectory point sequence; $S_{\mathrm{traj}}$ is the similarity of the trajectory point sequences; $w_{\mathrm{addr}}$ is the semantic similarity weight of the addresses; $S_{\mathrm{addr}}$ is the semantic similarity of the addresses; $w_{\mathrm{time}}$ is the weight of the timestamp similarity; and $S_{\mathrm{time}}$ is the similarity of the timestamps. When the weighted similarity exceeds the similarity threshold, the corresponding record pair is marked as a conflict pair. A conflict relationship graph is constructed with records as nodes and conflict relationships as edges. Conflict-connected components are identified through a graph connected component algorithm. Lightweight quality pre-scoring is performed on each record to obtain an initial screening quality score.

[0010] By constructing a weighted similarity model that includes waybill number, trajectory, address, and timestamp, we can accurately identify multi-source conflict relationships in logistics data and obtain an initial screening quality score, providing a reliable basis for subsequent segmented routing and hierarchical governance.

[0011] Furthermore, the sharding in step two includes: using the unique identifier of each conflict-connected component as the hash key, and routing all records within the same conflict-connected component to the same shard through a consistent hashing algorithm; calculating the arithmetic mean of the initial screening quality scores of all records within a shard and using it as the overall quality score of the shard; when the overall quality score of a shard is greater than or equal to the first quality threshold, marking the shard as a high-priority shard and assigning it to the fast governance channel; when the overall quality score of a shard is less than the first quality threshold, marking the shard as a low-priority shard and assigning it to the deep governance channel; and when the number of records in a conflict-connected component exceeds a preset threshold, further subdividing the component by timestamp or geographic location while maintaining a connectivity index between the resulting sub-shards.

[0012] By using consistent hashing to partition data into conflict-connected components, it is possible to ensure that records with merging relationships are processed within the same partition. Priority channels are then allocated based on the initial screening quality, achieving a balance between parallel computing efficiency and merging integrity.
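
The routing and priority assignment described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `HashRing` class, the virtual-node count, the shard names, and the first quality threshold of 0.8 are all assumptions introduced for the example.

```python
import bisect
import hashlib
from statistics import mean


def _hash(key: str) -> int:
    """Stable 64-bit hash (the built-in hash() is salted per process)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, shard_ids, vnodes=64):
        self._ring = sorted(
            (_hash(f"{shard}#{v}"), shard)
            for shard in shard_ids
            for v in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def route(self, component_id: str) -> str:
        """Return the shard owning the first ring point at or after the key."""
        idx = bisect.bisect_left(self._points, _hash(component_id))
        return self._ring[idx % len(self._ring)][1]


def shard_priority(prescreen_scores, first_quality_threshold=0.8):
    """Map a shard's mean initial screening score to a governance channel.

    The 0.8 default is an assumed value; the source only names a
    "first quality threshold" without giving a number here.
    """
    overall = mean(prescreen_scores)
    return "fast" if overall >= first_quality_threshold else "deep"


ring = HashRing(["shard-0", "shard-1", "shard-2"])
# Every record carries its component id, so all records of one
# conflict-connected component land on the same shard.
print(ring.route("component-42"), shard_priority([0.9, 0.8]))
```

Because the hash key is the component identifier rather than the record identifier, records that may later need merging are never split across shards.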

[0013] Furthermore, the rule chain quality classification governance in step three includes: the rule engine dynamically adjusts the set of rules to be executed based on the shard priority, with high-priority shards executing a lightweight rule set and low-priority shards executing the full rule set. Each piece of data is validated in rule-chain order; each rule returns a validation result and a confidence level, and these are aggregated into a quality score. When the quality score is greater than or equal to the second quality threshold, the data is judged to be high-quality and directly enters the merging and storage process. When the quality score is greater than or equal to the third quality threshold and less than the second quality threshold, the data is judged to be medium-quality: the historical records of handling similar anomalies are queried, and when the repair success rate exceeds the repair success rate threshold, the repair operation is executed automatically; after the repair, the data is re-scored and, if the score meets the standard, entered into the database. When the quality score is less than the third quality threshold, the data is judged to be low-quality, blocked from entering the database, and a structured anomaly report is generated.

[0014] By dynamically adjusting the rule set depth through sharding priority and combining it with three quality thresholds to determine the data flow, computing resources can be allocated reasonably, and high-quality, medium-quality, and low-quality data can be processed differently to reduce invalid computation and improve governance throughput.
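
The three-threshold decision described above can be illustrated with a short sketch. The threshold values, the function name `route_record`, and the `"pending"` outcome are assumptions: the source gives no numeric thresholds in this passage and does not say what happens to medium-quality data whose repair success rate is too low.

```python
def route_record(
    quality_score: float,
    repair_success_rate: float,
    second_threshold: float = 0.85,       # assumed value
    third_threshold: float = 0.60,        # assumed value
    repair_rate_threshold: float = 0.90,  # assumed value
) -> str:
    """Decide the data flow direction for one record."""
    if quality_score >= second_threshold:
        # High-quality data goes straight to merging and storage.
        return "merge_and_store"
    if quality_score >= third_threshold:
        # Medium-quality data is repaired automatically only when similar
        # anomalies were repaired successfully often enough in the past.
        if repair_success_rate > repair_rate_threshold:
            return "auto_repair_then_rescore"
        # Not specified in the source; the sketch simply parks the record.
        return "pending"
    # Low-quality data is blocked and reported.
    return "block_and_report"


print(route_record(0.70, repair_success_rate=0.95))
```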

[0015] Furthermore, the structured anomaly report comprises three layers: the anomaly location layer, which contains the line number and field name of the data in the imported file; the root cause analysis layer, which contains a description of the violated business rule; and the correction suggestion layer, which generates actionable correction guidance based on historical successful cases.

[0016] By generating a three-tiered structured report that includes anomaly location, cause analysis, and correction suggestions from historical cases, operators can quickly locate problems and handle them according to the repair plan, reducing manual intervention time and improving the efficiency of batch importing anomaly handling.

[0017] Furthermore, the conflict merging in step four includes: performing a merging operation according to the merging rules within the conflict-connected fragments, and selecting the best value for each field based on the field-level confidence level. The field-level confidence level is ranked as follows: preset high-confidence data source > manually entered data > OCR-recognized data > third-party imported data.

[0018] By sorting data by field-level confidence level and then merging the best results, the system can select the most reliable data values from multiple sources, such as preset high-confidence data sources, manual input, OCR recognition, and third-party imports, thereby improving the overall accuracy and reliability of the merged data.
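
A minimal sketch of field-level merging under the confidence ranking given above. The record layout, the source labels in `SOURCE_RANK`, and the rule that empty values never win are assumptions made for the example.

```python
# Higher rank = more trusted, following the ordering in the text:
# preset high-confidence source > manual entry > OCR > third-party import.
SOURCE_RANK = {
    "preset_high_confidence": 3,
    "manual": 2,
    "ocr": 1,
    "third_party": 0,
}


def merge_component(records):
    """Merge the records of one conflict-connected component field by field.

    Each record is a dict: {"source": str, "fields": {name: value}}.
    """
    merged = {}
    best_rank = {}
    for record in records:
        rank = SOURCE_RANK[record["source"]]
        for name, value in record["fields"].items():
            if value in (None, ""):
                continue  # assumption: an empty value never overrides data
            if name not in merged or rank > best_rank[name]:
                merged[name] = value
                best_rank[name] = rank
    return merged


records = [
    {"source": "ocr", "fields": {"address": "A", "phone": "1"}},
    {"source": "manual", "fields": {"address": "B", "phone": ""}},
    {"source": "third_party", "fields": {"address": "C", "weight": "2kg"}},
]
print(merge_component(records))
```

Here the manually entered address wins over the OCR and third-party values, while the phone and weight fields are taken from the only sources that supply them.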

[0019] Furthermore, step five specifically includes: When the weighted similarity of historical blocked data is concentrated within a preset threshold range, the merging threshold corresponding to that threshold range is automatically lowered. When the deviation between the initial screening quality score and the final quality score output by the rule chain quality grading management exceeds the deviation threshold, the scoring parameters of the initial screening quality score are automatically calibrated.

[0020] By feeding back the governance results to the pre-analysis parameters, the conflict detection threshold and the initial screening scoring model can be dynamically optimized according to the actual treatment effect, forming a closed-loop self-evolution mechanism, which reduces false positives and false negatives after long-term use.

[0021] Furthermore, the completeness dimension in step six includes the field missing rate, the accuracy dimension includes the consistency with the reference data, the consistency dimension includes the inconsistency rate of field values of the same entity in different data tables, and the timeliness dimension includes the degree of conformity between the data update time and the preset time range.

[0022] By performing a comprehensive audit across four dimensions—completeness, accuracy, consistency, and timeliness—after batch import is completed and generating a visual report, users can be provided with a comprehensive data quality profile, facilitating the tracing of problematic batches and the optimization of source data quality.

[0023] In another aspect, a logistics data governance and warehousing system based on batch import is provided for carrying out the logistics data governance and warehousing method based on batch import. The system comprises: the data parsing and pre-analysis module, used to perform pre-analysis on batch logistics data, perform conflict detection based on logistics semantic similarity, and obtain the initial screening quality score; the adaptive sharding scheduling module, used to construct conflict-connected components based on the conflict detection results, shard the batch logistics data using the conflict-connected components as the basic sharding unit, and assign a governance priority to each shard based on the initial screening quality score; the rule engine and quality grading module, used to execute rule-chain quality grading governance after each shard independently enters the governance pipeline, and decide the data flow direction based on the quality score; and the conflict merging module, used to perform conflict merging, based on field-level confidence, on data that meets the entry conditions after governance.

[0024] By constructing a system architecture that includes pre-analysis, adaptive sharding, hierarchical rule engine, and conflict merging, three-tier collaborative governance can be achieved in a modular manner, facilitating system deployment and maintenance and improving the automation level of batch data import governance.

[0025] Furthermore, the system also includes an inbound and auditing module and a feedback optimization module: The data import and auditing module is used to write the merged data into the target database, record the batch version number, source information and governance link log of each data, and automatically trigger the data quality audit process after the batch import is completed. It comprehensively evaluates the imported data from the dimensions of completeness, accuracy, consistency and timeliness, and generates a visual quality report. The feedback optimization module is used to collect the governance results generated after the rule chain quality classification governance is completed, and feed the governance results back to the data parsing and pre-analysis module to dynamically adjust the similarity threshold and the initial screening quality score parameters.

[0026] By adding an inbound audit and feedback optimization module, it is possible to achieve full-link traceability of data writing and continuous optimization of governance parameters, forming a complete data governance closed loop and improving the long-term stability and intelligence level of the system.

[0027] Compared with existing technologies, this method and system for logistics data governance and warehousing based on batch import has the following advantages: First, this invention integrates pre-emptive conflict detection with initial quality screening, adaptive sharding based on conflict-connected components, and quality classification governance within shards into a unified collaborative governance pipeline. It uses conflict-connected components as the basic unit for sharding and allocates governance priorities based on the initial screening quality. This allows conflict merging decisions, sharding routing strategies, and quality classification execution to be mutually aware within the same process. This avoids the problem of broken merging clues after sharding in traditional solutions, eliminates redundant calculations in the governance process, resolves the inconsistency between quality classification and merging decisions, and significantly improves data consistency and overall governance efficiency during the import of bulk logistics data.

[0028] Second, this invention establishes a closed-loop feedback mechanism for governance results, feeding back the quality scores, blocking reasons, and repair operation records generated by the rule chain quality grading governance to the pre-analysis stage, dynamically adjusting the similarity threshold and initial screening quality score parameters. At the same time, after batch import is completed, it automatically triggers multi-dimensional data quality audit, generates a visualized quality report, and retains the entire chain governance log. This enables the continuous self-evolution of the system's governance capabilities, reduces misjudgments and omissions, and comprehensively ensures the integrity, accuracy, consistency, and timeliness of the data entering the warehouse, providing reliable data support for subsequent logistics business analysis and operation management.

[0029] Other advantages, objectives and features of the invention will be set forth in part in the description which follows, and in part will be apparent to those skilled in the art from the following examination or study, or may be learned from the practice of the invention.

Attached Figure Description

[0030] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the accompanying drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are merely some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without any creative effort.

[0031] Figure 1 is a flowchart of the overall process of the batch import-based logistics data governance and warehousing method of this invention.
Figure 2 is a schematic diagram of the construction of conflict-connected components and adaptive sharding in this invention.
Figure 3 is a schematic diagram of the rule-chain quality classification governance and data flow of this invention.

Detailed Implementation

[0032] To further illustrate the technical means and effects of the present invention in achieving its intended purpose, the following detailed description of the specific implementation methods, structures, features, and effects of the present invention, in conjunction with the accompanying drawings and preferred embodiments, is provided below.

Example

[0033] This embodiment provides a batch import-based method for logistics data governance and warehousing. This method runs on an enterprise-level data governance platform deployed on a cloud server cluster, with the number of cluster nodes elastically scalable based on business scale. The platform supports receiving batch logistics data files in CSV, JSON, and XML formats, with a maximum batch size of 100 GB and a processing throughput of over 100,000 data entries per second.

[0034] Specifically, the overall process of this embodiment is shown in Figure 1. The process begins with parsing and pre-analysis of the batch-uploaded logistics data files. Key fields are extracted, and weighted similarity is calculated to complete conflict detection and initial quality scoring. Then, a conflict relationship graph is constructed based on the conflict detection results. Conflict-connected components are identified and used as the basic units for sharding. Different governance priorities are assigned based on the overall quality score of each shard. Each shard independently enters its corresponding governance pipeline, executes rule-chain quality grading governance, and determines the data flow direction based on the final quality score. Data meeting the entry criteria is merged based on field-level confidence and then written to the target database. All data and operation records generated during the governance process are collected and fed back to the pre-analysis stage to dynamically adjust the relevant parameters. Finally, after batch import is complete, a data quality audit process is automatically triggered, generating a visual quality report.

[0035] Step 1, Data Parsing and Pre-analysis: First, the system receives the batch logistics data files uploaded by users and performs format validation and parsing on them. For CSV files, data is read line by line and split on commas, and the headers are automatically identified and mapped to standard logistics data fields. For JSON files, data is parsed hierarchically according to the JSON object structure, extracting logistics information from nested structures. For XML files, a DOM parser is used to traverse the XML nodes and extract the values of the corresponding fields. After parsing, all data is converted into a unified internal data format. Each data entry contains four core fields (waybill number, trajectory point sequence, sender and recipient address, and timestamp) as well as other optional extended fields.
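
The parsing step can be sketched with the Python standard library. The header aliases in `HEADER_MAP`, the `<record>` element name, and the flat JSON layout are assumptions for illustration; the source describes nested JSON structures and automatic header identification without giving their details.

```python
import csv
import io
import json
from xml.dom import minidom

# Hypothetical mapping from source headers to standard field names.
HEADER_MAP = {"waybill_no": "waybill_number", "addr": "address", "ts": "timestamp"}


def _normalize(raw: dict) -> dict:
    """Rename known header aliases; unknown keys are kept as extended fields."""
    return {HEADER_MAP.get(key, key): value for key, value in raw.items()}


def parse_csv(text: str) -> list:
    """Read comma-separated rows, using the first line as headers."""
    return [_normalize(row) for row in csv.DictReader(io.StringIO(text))]


def parse_json(text: str) -> list:
    """Assumes a top-level list of flat objects (a simplification)."""
    return [_normalize(obj) for obj in json.loads(text)]


def parse_xml(text: str) -> list:
    """Traverse <record> elements with a DOM parser and read child values."""
    doc = minidom.parseString(text)
    records = []
    for node in doc.getElementsByTagName("record"):
        raw = {
            child.tagName: (child.firstChild.data if child.firstChild else "")
            for child in node.childNodes
            if child.nodeType == child.ELEMENT_NODE
        }
        records.append(_normalize(raw))
    return records


print(parse_csv("waybill_no,ts\nW1,100\n"))
```

Note that CSV and XML values arrive as strings, so a real pipeline would still need a type-conversion pass before the fields reach the unified internal format.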

[0036] Extract the following fields from each record to be imported: Waybill Number, Track Point Sequence, Sending / Receiving Address, and Timestamp. The Waybill Number field is a string used to uniquely identify a logistics waybill. The Track Point Sequence field is an array, with each element containing three subfields: longitude, latitude, and timestamp, arranged in chronological order. The Sending / Receiving Address field is a string containing information such as province, city, district, street, and house number. The Timestamp field is an integer representing the Unix timestamp generated by the data, in seconds.

[0037] Calculate the weighted similarity between each record to be imported and the existing records in the database. The weighted similarity is calculated as:

$$S = w_{\mathrm{id}} M_{\mathrm{id}} + w_{\mathrm{traj}} S_{\mathrm{traj}} + w_{\mathrm{addr}} S_{\mathrm{addr}} + w_{\mathrm{time}} S_{\mathrm{time}}$$

where $S$ is the weighted similarity score, ranging from 0 to 1. A higher value indicates a higher similarity between the two records, and a greater likelihood that they are duplicate data from the same waybill.

[0038] $w_{\mathrm{id}}$ is the exact-match weight for the waybill number, with a value ranging from 0 to 1. In this embodiment, $w_{\mathrm{id}}$ is set to 0.5. The waybill number is the unique identifier of the logistics waybill, so it is given the highest weight.

[0039] $M_{\mathrm{id}}$ is the exact-match value for the waybill number. When two records have exactly the same waybill number, $M_{\mathrm{id}} = 1$. When the waybill numbers of the two records are different, $M_{\mathrm{id}} = 0$.

[0040] $w_{\mathrm{traj}}$ is the similarity weight of the trajectory point sequence, with a value ranging from 0 to 1. In this embodiment, $w_{\mathrm{traj}}$ is set to 0.25. The sequence of trajectory points reflects the actual path of logistics transportation and is an important basis for determining whether waybills are identical.

[0041] $S_{\mathrm{traj}}$ is the similarity of the trajectory point sequences. To calculate the similarity of the trajectory point sequences of two records, the two trajectories are first time-aligned, and the trajectory points in the overlapping time region are selected. Then, the Euclidean distance between the points at corresponding times is calculated, and the average of all distances is normalized to the interval between 0 and 1. Finally, the trajectory point sequence similarity is obtained by subtracting the normalized average distance from 1.

[0042] The Euclidean distance of the latitude and longitude of the i-th corresponding point pair is calculated as:

$$d_i = \sqrt{(\mathrm{lon}_{1,i} - \mathrm{lon}_{2,i})^2 + (\mathrm{lat}_{1,i} - \mathrm{lat}_{2,i})^2}$$

where $d_i$ is the Euclidean distance in latitude and longitude of the i-th corresponding trajectory point pair, in km; $\mathrm{lon}_{1,i}$ is the longitude of the i-th trajectory point in the first record, with a value ranging from -180 to 180; $\mathrm{lat}_{1,i}$ is the latitude of the i-th trajectory point in the first record, with a value ranging from -90 to 90; $\mathrm{lon}_{2,i}$ is the longitude of the i-th trajectory point in the second record, with a value ranging from -180 to 180; and $\mathrm{lat}_{2,i}$ is the latitude of the i-th trajectory point in the second record, with a value ranging from -90 to 90.

[0043] The average distance over all corresponding point pairs is calculated as:

$$\bar{d} = \frac{1}{k} \sum_{i=1}^{k} d_i$$

where $\sum_{i=1}^{k}$ denotes summation over all terms for $i$ from 1 to $k$; $\bar{d}$ is the average latitude and longitude Euclidean distance of all corresponding point pairs in the time overlap region of the two trajectories, in km; and $k$ is the total number of corresponding point pairs in the time overlap region of the two trajectories, a positive integer.

[0044] The normalized average distance is calculated as:

$$d_{\mathrm{norm}} = \min\left(\frac{\bar{d}}{D_{\max}},\ 1\right)$$

where $d_{\mathrm{norm}}$ is the normalized average distance, with values ranging from 0 to 1; $\bar{d}$ is the average Euclidean distance in latitude and longitude between all corresponding point pairs in the time overlap region of the two trajectories; and $D_{\max}$ is the preset maximum distance threshold, which is set to 100 km in this embodiment.

[0045] The similarity of the trajectory point sequences is calculated as:

$$S_{\mathrm{traj}} = 1 - d_{\mathrm{norm}}$$

When the two trajectories have no time overlap, $S_{\mathrm{traj}}$ is 0.
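
The trajectory similarity described above can be sketched as follows. Three points are assumptions rather than statements from the source: trajectory points are treated as "corresponding" only when their timestamps are identical, coordinate differences in degrees are converted to kilometres with an approximate constant `KM_PER_DEGREE`, and the normalized distance is capped at 1 so that the similarity stays within the stated 0 to 1 range.

```python
import math

D_MAX_KM = 100.0       # maximum distance threshold of this embodiment
KM_PER_DEGREE = 111.0  # assumed degree-to-km factor; not given in the source


def trajectory_similarity(track_a, track_b):
    """Similarity of two trajectories given as lists of (lon, lat, ts)."""
    by_time_b = {ts: (lon, lat) for lon, lat, ts in track_b}
    distances = []
    for lon_a, lat_a, ts in track_a:
        if ts in by_time_b:  # keep only points in the time overlap
            lon_b, lat_b = by_time_b[ts]
            degrees = math.hypot(lon_a - lon_b, lat_a - lat_b)
            distances.append(degrees * KM_PER_DEGREE)
    if not distances:
        return 0.0  # no time overlap
    mean_distance = sum(distances) / len(distances)
    normalized = min(mean_distance / D_MAX_KM, 1.0)
    return 1.0 - normalized


a = [(116.0, 39.0, 0), (116.1, 39.1, 60)]
b = [(116.0, 39.0, 0), (116.1, 39.1, 60)]
print(trajectory_similarity(a, b))
```

A production system would more likely interpolate between samples and use a geodesic distance, since a plain Euclidean distance on degrees distorts east-west distances away from the equator.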

[0046] $w_{\mathrm{addr}}$ is the semantic similarity weight of addresses, with a value ranging from 0 to 1. In this embodiment, $w_{\mathrm{addr}}$ is set to 0.15. The sender and recipient addresses are important attributes of logistics waybills, used to help determine whether waybills are identical.

[0047] $S_{\mathrm{addr}}$ is the address semantic similarity, obtained through the following steps: First, the sending and receiving addresses of the two records are segmented and standardized, converting them into a five-level structure: province, city, district, street, and house number. Then, the matching degree of each address level is calculated. A perfect match at the province, city, or district level results in a matching degree of 1, while partial matches are assigned values between 0 and 1 based on the degree of matching. The matching degree of the street and house number is calculated using edit distance; a smaller edit distance indicates a higher matching degree. Finally, the matching degrees of all address levels are weighted and summed to obtain the address semantic similarity.

[0048] The address semantic similarity is calculated as:

$$S_{\mathrm{addr}} = \lambda_{p} m_{p} + \lambda_{c} m_{c} + \lambda_{d} m_{d} + \lambda_{s} m_{s} + \lambda_{h} m_{h}$$

where $\lambda_{p}$ is the province-level address matching weight, set to 0.3 in this embodiment, and $m_{p}$ is the province-level address matching degree; $\lambda_{c}$ is the city-level address matching weight, set to 0.3 in this embodiment, and $m_{c}$ is the city-level address matching degree; $\lambda_{d}$ is the district-level address matching weight, set to 0.2 in this embodiment, and $m_{d}$ is the district-level address matching degree; $\lambda_{s}$ is the street-level address matching weight, set to 0.15 in this embodiment, and $m_{s}$ is the street-level address matching degree; and $\lambda_{h}$ is the house number matching weight, set to 0.05 in this embodiment, and $m_{h}$ is the house number matching degree.
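
The five-level address similarity can be sketched as below, using the level weights of this embodiment. Two simplifications are assumptions: province, city, and district are scored by exact match only (the source also allows partial matches but does not define them), and the edit-distance matching degree is taken as one minus the Levenshtein distance divided by the longer string length.

```python
LEVEL_WEIGHTS = {
    "province": 0.3,
    "city": 0.3,
    "district": 0.2,
    "street": 0.15,
    "house_number": 0.05,
}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_match(a: str, b: str) -> float:
    """Assumed matching degree: 1 - distance / longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def address_similarity(addr_a: dict, addr_b: dict) -> float:
    """Weighted sum of per-level matching degrees for standardized addresses."""
    total = 0.0
    for level, weight in LEVEL_WEIGHTS.items():
        if level in ("street", "house_number"):
            match = edit_match(addr_a[level], addr_b[level])
        else:
            match = 1.0 if addr_a[level] == addr_b[level] else 0.0
        total += weight * match
    return total


base = {"province": "P", "city": "C", "district": "D",
        "street": "Main Road", "house_number": "12"}
other = dict(base, house_number="13")
print(address_similarity(base, other))
```

With only the house number differing by one character out of two, the house number matching degree is 0.5 and the overall similarity is 0.975.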

[0049] The timestamp similarity weight, denoted $w_{\mathrm{time}}$, takes a value ranging from 0 to 1; in this embodiment, $w_{\mathrm{time}}$ is set to 0.1. The similarity of the timestamps at which the data was generated helps determine whether two waybills are the same.

[0050] The timestamp similarity, denoted $S_{\mathrm{time}}$, is obtained as follows. Calculate the absolute value of the difference between the timestamps of the two records and normalize it to the interval between 0 and 1. Then subtract the normalized difference from 1 to obtain the timestamp similarity.

[0051] The formula for calculating the absolute value of the timestamp difference is: $$\Delta t = \left| t_{1} - t_{2} \right|$$ where $\Delta t$ denotes the absolute value of the timestamp difference between the two records, in seconds; $t_{1}$ denotes the timestamp of the first record, in seconds; and $t_{2}$ denotes the timestamp of the second record, in seconds.

[0052] The formula for calculating the normalized timestamp difference is: $$\Delta t_{\mathrm{norm}} = \frac{\Delta t}{T_{\max}}$$ where $\Delta t_{\mathrm{norm}}$ denotes the normalized timestamp difference, with a value ranging from 0 to 1, and $T_{\max}$ denotes the preset maximum time difference threshold; in this embodiment, $T_{\max}$ is set to 86400 seconds, which is 24 hours.

[0053] The formula for calculating timestamp similarity is: $$S_{\mathrm{time}} = 1 - \Delta t_{\mathrm{norm}}$$ When $\Delta t$ is greater than $T_{\max}$, $S_{\mathrm{time}}$ is 0.
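The three timestamp formulas above combine into a single small function. The sketch below is illustrative only; the function and constant names are hypothetical, and timestamps are assumed to be expressed in seconds.

```python
MAX_TIME_DIFF_SECONDS = 86400  # preset maximum time difference (24 hours)


def timestamp_similarity(t1: float, t2: float,
                         max_diff: float = MAX_TIME_DIFF_SECONDS) -> float:
    """Return 1 minus the normalized absolute timestamp difference."""
    delta = abs(t1 - t2)              # absolute difference, in seconds
    if delta > max_diff:              # beyond the threshold: similarity is 0
        return 0.0
    return 1.0 - delta / max_diff     # normalized difference lies in [0, 1]
```

For example, two records generated 12 hours apart (43200 seconds) have a timestamp similarity of 0.5.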

[0054] A similarity threshold is set; in this embodiment, the similarity threshold is set to 0.7. When the weighted similarity S of two records is ≥ 0.7, these two records are marked as a conflict pair. A conflict relationship graph is constructed with each record as a node and conflict relationships as edges. All connected components in the conflict relationship graph are identified using a graph connected component algorithm; each connected component is a conflicting connected component. A conflicting connected component contains all records that are linked to each other, directly or through other records, by conflict relationships.
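One way to identify the conflicting connected components is a union-find (disjoint-set) structure, sketched below. The text only names "a graph connected component algorithm", so union-find is an assumed choice; the `similarity` callback stands in for the weighted similarity S, and the sketch compares all record pairs for brevity, whereas the method compares records to be imported with existing records.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Sequence, Set

SIMILARITY_THRESHOLD = 0.7  # value used in this embodiment


def conflict_components(
    record_ids: Sequence[Hashable],
    similarity: Callable[[Hashable, Hashable], float],
    threshold: float = SIMILARITY_THRESHOLD,
) -> List[Set[Hashable]]:
    """Group records into connected components of the conflict graph."""
    parent: Dict[Hashable, Hashable] = {r: r for r in record_ids}

    def find(x: Hashable) -> Hashable:
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ids = list(record_ids)
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            # An edge exists when the weighted similarity reaches the threshold.
            if similarity(ids[i], ids[j]) >= threshold:
                root_a, root_b = find(ids[i]), find(ids[j])
                if root_a != root_b:
                    parent[root_b] = root_a

    groups: Dict[Hashable, Set[Hashable]] = defaultdict(set)
    for r in ids:
        groups[find(r)].add(r)
    return list(groups.values())
```

A record with no conflict pair ends up in a component of size one. Note that records a and c land in the same component when a conflicts with b and b conflicts with c, even if a and c are not themselves a conflict pair.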

[0055] A lightweight pre-quality score is performed on each data entry to obtain an initial screening quality score. The initial screening quality score is calculated from two dimensions: field completeness and format conformity.

[0056] The formula for calculating the initial screening quality score is: $$Q_{\mathrm{init}} = \max\left(0,\; 100 - D_{\mathrm{comp}} - D_{\mathrm{fmt}}\right)$$ where $Q_{\mathrm{init}}$ denotes the initial screening quality score, with a minimum of 0 points. $D_{\mathrm{comp}}$ denotes the total points deducted for field completeness: 25 points are deducted for each missing core field, where the core fields are the waybill number, trajectory point sequence, sender / receiver address, and timestamp. $D_{\mathrm{fmt}}$ denotes the total points deducted for format compliance: 20 points for a non-standard waybill number format; 10 points for each point in the trajectory point sequence whose latitude or longitude exceeds the reasonable range, up to a maximum deduction of 30 points; 20 points for an address that cannot be parsed into the five-level structure; and 10 points for a timestamp outside the reasonable range.
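The deduction scheme can be sketched as a single function. The sketch assumes a base score of 100, which is inferred from the 0 to 100 range of the shard score rather than stated outright, and it takes the individual checks (format validity, range checks) as precomputed inputs because the text does not define them; all parameter names are hypothetical.

```python
def initial_screening_score(
    missing_core_fields: int,
    waybill_format_ok: bool,
    out_of_range_points: int,
    address_parsable: bool,
    timestamp_in_range: bool,
) -> int:
    """Lightweight pre-score: base score minus deductions, floored at 0."""
    # Field completeness: 25 points per missing core field.
    completeness_deduction = 25 * missing_core_fields

    # Format compliance deductions.
    format_deduction = 0
    if not waybill_format_ok:
        format_deduction += 20
    # 10 points per out-of-range trajectory point, capped at 30.
    format_deduction += min(10 * out_of_range_points, 30)
    if not address_parsable:
        format_deduction += 20
    if not timestamp_in_range:
        format_deduction += 10

    # ASSUMPTION: base score of 100.
    return max(0, 100 - completeness_deduction - format_deduction)
```

For instance, a record with one missing core field, a malformed waybill number, and five out-of-range trajectory points scores 100 − 25 − 20 − 30 = 25.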

[0057] Step 2, Adaptive Sharding and Priority Allocation: Based on the conflicting connected components obtained in the pre-analysis step, the batch logistics data is sharded. As shown in Figure 2, the unique identifier of each conflicting connected component is used as the hash key, and a consistent hashing algorithm routes all records within the same conflicting connected component to the same shard. The consistent hashing algorithm uses the MD5 hash function, mapping the hash key to the integer space 0 to ...; this integer space is divided into as many intervals as there are cluster nodes, with each interval corresponding to one cluster node. The hash value of the unique identifier of the conflicting connected component is calculated, and the conflicting connected component is assigned to the corresponding node based on the interval in which the hash value lies.
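The interval-based routing can be sketched as below. The upper bound of the integer space is not legible in the text, so the sketch assumes the full 128-bit range of an MD5 digest; `HASH_SPACE` and `node_for_component` are hypothetical names.

```python
import hashlib

# ASSUMPTION: the hash space is the full 128-bit MD5 output range.
HASH_SPACE = 2 ** 128


def node_for_component(component_id: str, num_nodes: int) -> int:
    """Map a conflicting connected component to a cluster node index."""
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    digest = hashlib.md5(component_id.encode("utf-8")).hexdigest()
    hash_value = int(digest, 16)
    # The hash space is split into num_nodes equal intervals;
    # the interval index is the node index.
    return hash_value * num_nodes // HASH_SPACE
```

Because every record carries the identifier of its component, all records of one component resolve to the same node. With equal-width intervals, changing the number of nodes moves many keys; ring-based variants with virtual nodes reduce that movement, but the text does not describe them.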

[0058] Calculate the arithmetic mean of the initial screening quality scores of all records within each shard, and use it as the overall quality score for that shard.

[0059] The formula for calculating the overall quality score of each shard is: $$Q_{\mathrm{shard}} = \frac{1}{n}\sum_{i=1}^{n} Q_{\mathrm{init},i}$$ where $Q_{\mathrm{shard}}$ denotes the overall quality score of the shard, ranging from 0 to 100; a higher value indicates a better overall initial quality of the data within that shard. $n$ denotes the total number of records in the shard and is a positive integer. $Q_{\mathrm{init},i}$ denotes the initial screening quality score of the i-th record within the shard.

[0060] A first quality threshold is set; in this embodiment, it is set to 80 points. When a shard's overall quality score is greater than or equal to 80 points, the shard is marked as a high-priority shard and assigned to the fast governance channel. The fast governance channel has a higher computing resource quota and processes high-priority shards first. When a shard's overall quality score is less than 80 points, the shard is marked as a low-priority shard and assigned to the deep governance channel. The deep governance channel has a lower computing resource quota and processes low-priority shards.
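The shard score and channel assignment can be sketched as follows; the channel labels `"fast"` and `"deep"` and the function name are hypothetical.

```python
from statistics import fmean
from typing import Sequence, Tuple

FIRST_QUALITY_THRESHOLD = 80  # value used in this embodiment


def assign_channel(initial_scores: Sequence[float]) -> Tuple[float, str]:
    """Return (overall shard score, governance channel)."""
    if not initial_scores:
        raise ValueError("a shard must contain at least one record")
    overall = fmean(initial_scores)  # arithmetic mean of the pre-scores
    channel = "fast" if overall >= FIRST_QUALITY_THRESHOLD else "deep"
    return overall, channel
```

A shard whose records score 90 and 70 averages exactly 80 and therefore goes to the fast governance channel, since the threshold comparison is inclusive.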

[0061] A record-count threshold is set; in this embodiment, the threshold is set to 1000 records. When the number of records in a conflicting connected component exceeds 1000, the conflicting connected component is further subdivided according to timestamp. The records within the conflicting connected component are arranged in ascending order by timestamp, and every 1000 records form one sub-shard. A unique identifier is generated for each sub-shard, and a connectivity index is established between the sub-shards. The connectivity index records the identifiers of all sub-shards belonging to the same original conflicting connected component, which are used to associate related sub-shards during subsequent merging operations.
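The timestamp-based subdivision and the connectivity index can be sketched as below. The sub-shard identifier format (`component_id#k`) and the return shape are assumptions made for illustration; the text only requires that each sub-shard has a unique identifier.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

MAX_COMPONENT_SIZE = 1000  # threshold used in this embodiment


def split_component(
    component_id: str,
    records: Sequence[Tuple[Hashable, float]],  # (record id, timestamp)
    max_size: int = MAX_COMPONENT_SIZE,
) -> Tuple[Dict[str, List[Hashable]], Dict[str, List[str]]]:
    """Return (sub-shards, connectivity index) for one component."""
    ordered = sorted(records, key=lambda rec: rec[1])  # ascending timestamp
    if len(ordered) <= max_size:
        # Small components are kept whole; no index entry is needed.
        return {component_id: [rid for rid, _ in ordered]}, {}

    sub_shards: Dict[str, List[Hashable]] = {}
    for k, start in enumerate(range(0, len(ordered), max_size)):
        sub_id = f"{component_id}#{k}"  # hypothetical identifier format
        sub_shards[sub_id] = [rid for rid, _ in ordered[start:start + max_size]]

    # The index maps the original component to all of its sub-shards so
    # that they can be re-associated at merge time.
    return sub_shards, {component_id: list(sub_shards)}
```

A component of 2500 records yields three sub-shards of 1000, 1000, and 500 records, all listed under the original component identifier in the index.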

[0062] Step 3, Rule Chain Quality Classification Governance: Each shard independently enters its corresponding governance pipeline to execute rule chain quality grading governance. The rule engine dynamically adjusts the set of rules to be executed based on the shard's priority. High-priority shards execute a lightweight rule set, which includes three rules: waybill number format verification, timestamp range verification, and basic address format verification. Low-priority shards execute the full rule set, which includes all rules from the lightweight rule set plus five additional rules: trajectory point rationality verification, sender and receiver address matching verification, waybill status logic verification, duplicate field verification, and data type verification.

[0063] Each data entry is validated sequentially according to the rule chain. Each rule returns a validation result and a confidence score. The validation result is a Boolean value indicating whether the data passes the validation for that rule. The confidence score ranges from 0 to 1, representing the reliability of the rule's validation result. The validation results and confidence scores of all rules are aggregated to obtain the final quality score for the data.

[0064] In the final quality score calculation formula, $P$ denotes the final quality score of the data, ranging from 0 to 100; a higher value indicates better data quality after rule chain governance. $k$ denotes the total number of rules executed on the data and is a positive integer. $r_{j}$ denotes the validation result of the j-th rule, with a value of 1 if it passes and 0 if it fails. $c_{j}$ denotes the confidence level of the j-th rule, with a value ranging from 0 to 1.

[0065] A second quality threshold and a third quality threshold are set. In this embodiment, the second quality threshold is set to 90 points, and the third quality threshold is set to 60 points.

[0066] When the final quality score of the data is greater than or equal to 90, it is judged as high-quality data and directly enters the merging and storage stage.

[0067] Data with a final quality score greater than or equal to 60 and less than 90 is classified as medium quality data. Historical records of similar anomalies are queried to calculate the success rate of repairing that type of anomaly. The success rate is calculated as the number of times this type of anomaly was successfully repaired historically divided by the total number of anomalies of that type. A success rate threshold is set; in this embodiment, it is set to 0.8. When the success rate of repairing this type of anomaly is greater than or equal to 0.8, the corresponding repair operation is automatically executed. Repair operations include format conversion, missing value imputation, and error value correction. After repair, the data is re-processed using the rule chain quality grading system, and the final quality score is calculated. If the recalculated quality score is greater than or equal to 90, the data proceeds to the merging and storage stage. If the recalculated quality score is still less than 90, the data is classified as low quality data.

[0068] When the final quality score of the data is less than 60 points, it is judged as low-quality data, blocked from entering the database, and a structured anomaly report is generated. The structured anomaly report consists of three levels: the anomaly location layer contains the row numbers and field names of the data in the imported file; the root cause analysis layer contains descriptive information about the violated business rules; and the correction suggestion layer generates actionable correction guidance based on historical successful cases. All structured anomaly reports are summarized and exported as an Excel file for manual review and processing.
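The three-way grading and the repair loop for medium quality data can be sketched as follows. The return labels are hypothetical, and the text does not say what happens to medium quality data whose historical repair success rate is below the threshold; the sketch returns `"hold"` for that case purely as a placeholder.

```python
from typing import Callable

SECOND_QUALITY_THRESHOLD = 90   # high-quality boundary
THIRD_QUALITY_THRESHOLD = 60    # low-quality boundary
REPAIR_SUCCESS_THRESHOLD = 0.8  # minimum historical repair success rate


def repair_success_rate(successful_repairs: int, total_anomalies: int) -> float:
    """Historical successes divided by total anomalies of that type."""
    return successful_repairs / total_anomalies if total_anomalies else 0.0


def route(score: float, success_rate: float,
          repair_and_rescore: Callable[[], float]) -> str:
    """Decide the data flow direction from the final quality score."""
    if score >= SECOND_QUALITY_THRESHOLD:
        return "merge"                  # high quality: merge and store
    if score < THIRD_QUALITY_THRESHOLD:
        return "block"                  # low quality: block and report
    # Medium quality: attempt automatic repair if history supports it.
    if success_rate >= REPAIR_SUCCESS_THRESHOLD:
        new_score = repair_and_rescore()  # repair, then rerun the rule chain
        return "merge" if new_score >= SECOND_QUALITY_THRESHOLD else "block"
    # ASSUMPTION: this branch is not specified in the text.
    return "hold"
```

The `repair_and_rescore` callback stands in for the repair operation followed by a second pass through the rule chain, so the routing logic stays independent of any particular repair.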

[0069] Step 4, Conflict Merging and Data Entry: After governance, data meeting the entry criteria is merged according to the merging rules within each conflicting connected component. For each conflicting connected component, all records meeting the entry criteria are merged. During merging, field values are selected based on field-level confidence. The field-level confidence ranking is as follows: preset high-confidence data source > manually entered data > OCR-recognized data > third-party imported data.

[0070] The preset high-confidence data sources include data generated by the enterprise's internal core business systems. Manually entered data includes data entered by customer service personnel, warehouse managers, and similar staff. OCR-recognized data includes data extracted from paper documents such as waybills and invoices using optical character recognition technology. Third-party imported data includes data imported from third-party systems such as those of partners and courier companies.

[0071] For each field, iterate through the values of that field in all records within the conflicting connected component, and select the field value from the data source with the highest confidence level as the merged field value. If multiple records supply that field from data sources with the same confidence level, select the field value that appears most frequently as the merged field value. If the frequencies of occurrence are the same, select the field value with the latest timestamp as the merged field value.
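The three-stage selection (source confidence, then frequency, then recency) can be sketched for a single field as below. The source labels and numeric ranks are hypothetical encodings of the ranking given in the text, and field values are assumed to be hashable.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple

# Hypothetical labels; a higher rank means higher field-level confidence.
SOURCE_RANK = {"core_system": 3, "manual": 2, "ocr": 1, "third_party": 0}


def merge_field(candidates: Sequence[Tuple[Hashable, str, float]]) -> Hashable:
    """Pick the merged value from (value, source, timestamp) candidates."""
    if not candidates:
        raise ValueError("no candidate values for this field")

    # 1. Keep only values from the highest-confidence source present.
    top_rank = max(SOURCE_RANK[source] for _, source, _ in candidates)
    best = [(value, ts) for value, source, ts in candidates
            if SOURCE_RANK[source] == top_rank]

    # 2. Among those, keep the most frequent value(s).
    counts = Counter(value for value, _ in best)
    max_count = max(counts.values())
    tied = [(value, ts) for value, ts in best if counts[value] == max_count]

    # 3. Break remaining ties by the latest timestamp.
    return max(tied, key=lambda pair: pair[1])[0]
```

Calling this once per field across all admissible records of a conflicting connected component yields the merged record.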

[0072] After the merge is complete, the merged data is written to the target database. The target database uses a distributed relational database, supporting high-concurrency read / write and horizontal scaling. During the write process, the batch version number, source information, and governance log are recorded for each data entry. The batch version number identifies the import batch to which the data belongs. The source information records the original file name, upload time, and upload user. The governance log records all governance steps, execution times, and operation results that the data underwent, used for data traceability and problem investigation.

[0073] Step 5, Feedback and Optimization: Collect the quality score, blocking reason, and remediation operation record for each piece of data generated after the rule chain quality grading governance is completed. Store this data in the governance results database for subsequent parameter adjustment and model optimization.

[0074] The weighted similarity distribution of historically blocked data is statistically analyzed. When the proportion of historically blocked records whose weighted similarity falls within a given threshold interval, relative to the total number of blocked records, exceeds a preset proportion threshold, the merging threshold corresponding to that threshold interval is automatically lowered. In this embodiment, the proportion threshold is set to 0.3.

[0075] The deviation between the initial screening quality score and the final quality score output by the rule chain quality grading governance is statistically analyzed. The average deviation is calculated as the average of the absolute values of the differences between the initial screening quality score and the final quality score for all data.

[0076] The formula for calculating the average deviation is: $$\bar{D} = \frac{1}{N}\sum_{i=1}^{N} \left| Q_{\mathrm{init},i} - P_{i} \right|$$ where $\bar{D}$ denotes the average deviation between the initial screening quality score and the final quality score, expressed in points; a larger value indicates a greater deviation between the initial screening quality score and the actual quality. $N$ denotes the total number of records imported and is a positive integer. $Q_{\mathrm{init},i}$ denotes the initial screening quality score of the i-th record, and $P_{i}$ denotes the final quality score of the i-th record.

[0077] A deviation threshold is set; in this embodiment, the deviation threshold is set to 15 points. When the average deviation is greater than or equal to 15 points, the scoring parameters of the initial screening quality score are automatically calibrated. The calibration method uses a linear regression algorithm, with the final quality score as the dependent variable and each deduction item of the initial screening quality score as the independent variable, to train a linear regression model. The weights of the deduction items in the initial screening quality score are updated using the trained model parameters, making the initial screening quality score closer to the final quality score.
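The calibration trigger can be sketched as a mean absolute deviation check. Only the trigger is shown; the linear regression step itself is omitted, and the function names are hypothetical.

```python
from typing import Sequence

DEVIATION_THRESHOLD = 15.0  # points, value used in this embodiment


def average_deviation(initial_scores: Sequence[float],
                      final_scores: Sequence[float]) -> float:
    """Mean absolute difference between pre-scores and final scores."""
    if not initial_scores or len(initial_scores) != len(final_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    total = sum(abs(q - p) for q, p in zip(initial_scores, final_scores))
    return total / len(initial_scores)


def needs_calibration(initial_scores: Sequence[float],
                      final_scores: Sequence[float],
                      threshold: float = DEVIATION_THRESHOLD) -> bool:
    """True when the average deviation reaches the deviation threshold."""
    return average_deviation(initial_scores, final_scores) >= threshold
```

For two records with pre-scores 80 and 60 and final scores 90 and 30, the average deviation is (10 + 30) / 2 = 20 points, which triggers calibration.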

[0078] Step 6, Data Quality Audit: The data quality audit process is automatically triggered after the batch import is completed. The imported data is comprehensively evaluated from the dimensions of completeness, accuracy, consistency, and timeliness.

[0079] The completeness dimension assessment primarily calculates the field missing rate. The field missing rate is the number of records missing that field divided by the total number of records. The missing rates for the four core fields (waybill number, trajectory point sequence, sender / receiver address, and timestamp) are calculated separately, along with the average missing rate for all extended fields.

[0080] The accuracy dimension assessment primarily calculates the consistency with reference data. A certain percentage of the imported data is randomly selected and compared with the corresponding data in the enterprise's core business system. The proportion of records with consistent field values out of the total number of selected records is used as the accuracy indicator.

[0081] The consistency dimension assessment primarily calculates the inconsistency rate of field values for the same entity across different data tables. The inconsistency rate is calculated by counting the number of records where corresponding field values for the same waybill are inconsistent across the waybill table, tracking table, and address table, and then dividing this number by the total number of records.

[0082] The timeliness dimension assessment primarily calculates the degree of conformity between the data update time and a preset time range. The timeliness index is obtained by counting the records for which the difference between the data generation time and the import time falls within the preset time range, and dividing this count by the total number of records. In this embodiment, the preset time range is 24 hours.
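Two of the audit indicators, the field missing rate and the timeliness index, can be sketched as below. Records are modeled as dictionaries; the keys `generated_at` and `imported_at` are hypothetical, a field counts as missing when it is absent, `None`, or an empty string, and the time difference is taken as an absolute value, which is an assumption.

```python
from typing import Any, Dict, Sequence

TIMELINESS_WINDOW_SECONDS = 86400  # preset time range: 24 hours


def field_missing_rate(records: Sequence[Dict[str, Any]], field: str) -> float:
    """Records missing the field divided by the total number of records."""
    if not records:
        raise ValueError("no records to audit")
    missing = sum(1 for r in records if r.get(field) in (None, ""))
    return missing / len(records)


def timeliness_index(records: Sequence[Dict[str, Any]],
                     window: float = TIMELINESS_WINDOW_SECONDS) -> float:
    """Share of records imported within the window of their generation."""
    if not records:
        raise ValueError("no records to audit")
    timely = sum(1 for r in records
                 if abs(r["imported_at"] - r["generated_at"]) <= window)
    return timely / len(records)
```

Both functions return a ratio between 0 and 1, so the results can be placed side by side with the accuracy and consistency indicators in the quality report.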

[0083] The evaluation results from the four dimensions are used to generate a visualized quality report. The report includes specific indicator values for each dimension, comparative analysis with historical batches, distribution of problematic data, and improvement suggestions. The quality report is generated in HTML format and can be viewed and downloaded online.

[0084] This embodiment also provides a logistics data governance and warehousing system based on batch import. This system is used to execute the above-described logistics data governance and warehousing method based on batch import. The system consists of a data parsing and pre-analysis module, an adaptive sharding and scheduling module, a rule engine and quality grading module, a conflict merging module, a warehousing and auditing module, and a feedback optimization module.

[0085] The data parsing and pre-analysis module receives batch logistics data files, performs format validation and parsing. It extracts the waybill number, trajectory point sequence, sender / receiver address, and timestamp fields for each record. It calculates the weighted similarity between the records to be imported and existing records, performs conflict detection, and constructs a conflict relationship graph. It identifies conflicting connected components and performs a lightweight quality pre-scoring for each data entry to obtain an initial quality score.

[0086] The adaptive sharding and scheduling module is used to shard batch logistics data using conflicting connected components as the basic unit. A consistent hashing algorithm is used to route all records within the same conflicting connected component to the same shard. An overall quality score is calculated for each shard, and shards are assigned to either the fast governance channel or the deep governance channel based on this score. Conflicting connected components whose record count exceeds the threshold are further subdivided, and a connectivity index is established between the sub-shards.

[0087] The rule engine and quality grading module is used to execute rule chain quality grading governance. The set of rules to be executed is dynamically adjusted based on shard priority. Each data item is validated in the order of the rule chain, and the results are aggregated to yield the final quality score. Based on the final quality score, the data is categorized into high-quality, medium-quality, and low-quality data, and a different processing flow is executed for each category. A structured anomaly report is generated for low-quality data.

[0088] The conflict merging module is used to perform conflict merging operations on data that meets the entry criteria. It selects the best value for each field based on field-level confidence level, generating merged records.

[0089] The warehousing and auditing module is used to write the merged data into the target database. It records the batch version number, source information, and governance log for each data entry. Upon completion of the batch import, a data quality audit process is automatically triggered, comprehensively evaluating the imported data from four dimensions and generating a visual quality report.

[0090] The feedback optimization module collects all data and operation records generated during the governance process. It statistically analyzes the weighted similarity distribution of historical blocking data and the deviation between the initial screening quality score and the final quality score. It dynamically adjusts the similarity threshold and initial screening quality score parameters to achieve system self-evolution and optimization.

[0091] As shown in Figure 3, data enters the system from the data parsing and pre-analysis module, then passes through the adaptive sharding and scheduling module, the rule engine and quality grading module, and the conflict merging module, before finally being written to the database by the warehousing and auditing module. The feedback optimization module obtains governance results from the rule engine and quality grading module and the warehousing and auditing module, and feeds them back to the data parsing and pre-analysis module, forming a complete closed-loop governance process.

[0092] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention in any way. Although the present invention has been disclosed above with reference to preferred embodiments, it is not intended to limit the present invention. Any person skilled in the art can make some modifications or alterations to the above-disclosed technical content to create equivalent embodiments without departing from the scope of the present invention. Any simple modifications, equivalent changes and alterations made to the above embodiments based on the technical essence of the present invention without departing from the scope of the present invention shall still fall within the scope of the present invention.

Claims

1. A method for managing and storing logistics data based on batch import, characterized in that the method includes the following steps: Step 1: Perform pre-analysis on the batch logistics data, conduct conflict detection based on logistics semantic similarity, and obtain an initial screening quality score, wherein the pre-analysis includes extracting the waybill number, trajectory point sequence, sender and receiver address, and timestamp fields of the batch logistics data, and calculating the weighted similarity between each record to be imported and the existing records; Step 2: Construct conflicting connected components based on the conflict detection results, shard the batch logistics data using the conflicting connected components as the basic sharding unit, and assign a governance priority to each shard based on the initial screening quality score; Step 3: Each shard independently enters the governance pipeline, executes rule chain quality classification governance, and decides the data flow direction based on the quality score; Step 4: After governance, data that meets the entry criteria is written to the target database after conflict merging based on field-level confidence levels; Step 5: Collect the quality score, blocking reason, and repair operation record of each piece of data generated after the completion of the rule chain quality classification governance, and feed them back to Step 1 for dynamic adjustment of the similarity threshold and the initial screening quality score parameters; Step 6: After the batch import is completed, the data quality audit process is automatically triggered to comprehensively evaluate the imported data from the dimensions of completeness, accuracy, consistency, and timeliness, and generate a visual quality report.

2. The method for managing and storing logistics data based on batch import as described in claim 1, characterized in that the weighted similarity calculation formula in step one is: $$S = w_{\mathrm{no}} S_{\mathrm{no}} + w_{\mathrm{traj}} S_{\mathrm{traj}} + w_{\mathrm{addr}} S_{\mathrm{addr}} + w_{\mathrm{time}} S_{\mathrm{time}}$$ where $S$ denotes the weighted similarity; $w_{\mathrm{no}}$ denotes the waybill number exact matching weight; $S_{\mathrm{no}}$ denotes the waybill number exact match value; $w_{\mathrm{traj}}$ denotes the trajectory point sequence similarity weight; $S_{\mathrm{traj}}$ denotes the trajectory point sequence similarity; $w_{\mathrm{addr}}$ denotes the address semantic similarity weight; $S_{\mathrm{addr}}$ denotes the address semantic similarity; $w_{\mathrm{time}}$ denotes the timestamp similarity weight; and $S_{\mathrm{time}}$ denotes the timestamp similarity; when the weighted similarity exceeds the similarity threshold, the corresponding record pair is marked as a conflict pair; a conflict relationship graph is constructed with records as nodes and conflict relationships as edges; conflicting connected components are identified through a graph connected component algorithm; and lightweight quality pre-scoring is performed on each piece of data to obtain an initial screening quality score.

3. The method for managing and storing logistics data based on batch import as described in claim 1, characterized in that the sharding in step two includes: using the unique identifier of the conflicting connected component as the hash key, a consistent hashing algorithm is used to route all records within the same conflicting connected component to the same shard; the arithmetic mean of the initial screening quality scores of all records within a shard is calculated and used as the overall quality score of the shard; when the overall quality score of a shard is greater than or equal to the first quality threshold, the shard is marked as a high-priority shard and assigned to the fast governance channel; when the overall quality score of a shard is less than the first quality threshold, the shard is marked as a low-priority shard and assigned to the deep governance channel; when the number of records in a conflicting connected component exceeds a preset threshold, the conflicting connected component is further subdivided by timestamp or geographic location, while maintaining the connectivity index between the resulting sub-shards.

4. The method for managing and storing logistics data based on batch import as described in claim 1, characterized in that, The rule chain quality classification governance in step three includes: The rule engine dynamically adjusts the set of rules to be executed based on the shard priority. High-priority shards execute a lightweight set of rules, while low-priority shards execute the full set of rules. Each piece of data is validated in the order of the rule chain. Each rule returns the validation result and confidence level. The aggregated calculations yield a quality score. When the quality score is greater than or equal to the second quality threshold, it is judged as high-quality data and directly enters the merging and storage process; When the quality score is greater than or equal to the third quality threshold and less than the second quality threshold, it is judged as medium quality data. Historical records of handling similar anomalies are queried. When the repair success rate exceeds the repair success rate threshold, the repair operation is automatically executed. After the repair is completed, the score is re-evaluated. If the score meets the standard, the data is entered into the database. When the quality score is less than the third quality threshold, it is judged as low-quality data, blocking its entry into the database and generating a structured anomaly report.

5. The method for managing and storing logistics data based on batch import according to claim 4, characterized in that the structured anomaly report includes three levels: the anomaly location layer contains the row numbers and field names of the data in the imported file; the root cause analysis layer contains descriptive information about the violated business rules; and the correction suggestion layer generates actionable correction guidance based on historical successful cases.

6. The method for managing and storing logistics data based on batch import according to claim 1, characterized in that the conflict merging in step four includes: performing a merging operation according to the merging rules within the conflicting connected component, and selecting the best value for each field based on the field-level confidence level, wherein the field-level confidence level is ranked as follows: preset high-confidence data source > manually entered data > OCR-recognized data > third-party imported data.

7. The method for managing and storing logistics data based on batch import according to claim 1, characterized in that, Step five specifically includes: When the weighted similarity of historical blocked data is concentrated within a preset threshold range, the merging threshold corresponding to that threshold range is automatically lowered. When the deviation between the initial screening quality score and the final quality score output by the rule chain quality grading management exceeds the deviation threshold, the scoring parameters of the initial screening quality score are automatically calibrated.

8. The method for managing and storing logistics data based on batch import according to claim 1, characterized in that the completeness dimension in step six includes the field missing rate, the accuracy dimension includes the consistency with the reference data, the consistency dimension includes the inconsistency rate of field values of the same entity in different data tables, and the timeliness dimension includes the degree of conformity between the data update time and the preset time range.

9. A logistics data management and warehousing system based on batch import, applicable to the logistics data management and warehousing method based on batch import as described in any one of claims 1 to 8, characterized in that the system consists of: a data parsing and pre-analysis module, used to perform pre-analysis on batch logistics data, perform conflict detection based on logistics semantic similarity, and obtain the initial screening quality score; an adaptive sharding and scheduling module, used to construct conflicting connected components based on the conflict detection results, shard the batch logistics data using the conflicting connected components as the basic sharding unit, and assign a governance priority to each shard based on the initial screening quality score; a rule engine and quality grading module, used to execute rule chain quality grading governance after each shard independently enters the governance pipeline, and decide the data flow direction based on the quality score; and a conflict merging module, used to perform conflict merging, based on field-level confidence, on data that meets the entry conditions after governance.

10. A logistics data governance and warehousing system based on batch import as described in claim 9, characterized in that the system also includes a warehousing and auditing module and a feedback optimization module; the warehousing and auditing module is used to write the merged data into the target database, record the batch version number, source information, and governance log of each piece of data, automatically trigger the data quality audit process after the batch import is completed, comprehensively evaluate the imported data from the dimensions of completeness, accuracy, consistency, and timeliness, and generate a visual quality report; the feedback optimization module is used to collect the governance results generated after the rule chain quality classification governance is completed, and feed the governance results back to the data parsing and pre-analysis module to dynamically adjust the similarity threshold and the initial screening quality score parameters.