A case intelligent analysis method based on multi-dimensional data fusion and knowledge graph
By performing differentiated preprocessing and fusion processing on multi-source data, a case heterogeneous graph with multiple entity types and relationships is constructed, which solves the problem of low efficiency in data fusion and analysis in existing technologies and achieves efficient and accurate case analysis.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NORTHEASTERN UNIV CHINA
- Filing Date
- 2026-04-10
- Publication Date
- 2026-06-26
AI Technical Summary
Existing graph technologies cannot efficiently integrate case data from different sources and of different types, cannot fully present the complex relationships between entities, struggle to complete missing multi-source data elements, and rely on inefficient manual judgment.
By performing differentiated preprocessing and fusion processing on multi-source data, a case heterogeneous graph of multiple entity types and relationships is constructed. Anomaly detection algorithms and visualization maps are used to achieve deep fusion and comprehensive correlation analysis of multiple types of data.
It achieves efficient integration of data from different sources and types, comprehensively presents the complex relationships between various entities in a case, improves the accuracy and efficiency of case analysis, and reduces the misjudgment rate.
Figure CN122286584A_ABST
Description
Technical Field
[0001] This application relates to the field of data analysis technology, and in particular to a method for intelligent case analysis based on multi-dimensional data fusion and knowledge graph.
Background Technology
[0002] In case analysis, the elements involved often involve multiple dimensions such as relevant personnel, various assets, transactions and key events, presenting a high degree of complexity. The connections between the various elements are highly concealed, the logical chains are complex, and there are often hidden and complex networks of connections between them. The connection between transactions and key events is difficult to capture. Various elements are intertwined and covered up layer by layer, which brings great obstacles to case verification and problem tracing.
[0003] To address the problems of low efficiency and easy omission of clues in traditional manual data sorting, graph technology has been introduced into case analysis. By using core elements of the case as nodes and the relationships between elements as edges, a case relationship network is constructed. This network integrates, cleans, and maps case data scattered in various tables, helping investigators to intuitively view relationships and locate breakthroughs in the investigation.
[0004] However, in addition to structured data such as tables, case data also comes from semi-structured and unstructured data such as logs, XML files, images, audio, and video. The sources of case data are scattered and diverse, and there is a lack of unified format standards for case data from different sources and of different types. Their formats vary greatly, and existing graph technology cannot achieve efficient integration of data from different sources and of different types. Most of them are simple associations of single-type entities and cannot fully present the complex relationships between various entities.
Summary of the Invention
[0005] To address the shortcomings of existing technologies, this invention provides a case intelligent analysis method based on multi-dimensional data fusion and knowledge graphs. This method solves the problems that existing graph technologies cannot efficiently fuse data from different sources and of different types, and cannot fully present the complex relationships between various entities.
[0006] This application provides a method for intelligent case analysis based on multi-dimensional data fusion and knowledge graphs, including:
[0007] Acquiring multi-source data, wherein the multi-source data includes first structured data, first semi-structured data, and first unstructured data; performing differentiated preprocessing on the first structured data, the first semi-structured data, and the first unstructured data to obtain differentiated preprocessed data; fusing the differentiated preprocessed data to obtain fused data, and constructing a comprehensive feature set based on the fused data; defining multiple entity types and constructing association relationships based on the association characteristics of the entity types and case data; constructing a case heterogeneous graph based on the comprehensive feature set, the multiple entity types, and the association relationships; constructing an anomaly detection algorithm based on the case heterogeneous graph, the multiple entity types, and the association relationships; and constructing a visualization graph to display the detection results output by the anomaly detection algorithm.
[0008] In one feasible implementation, the differentiated preprocessing of the first structured data, the first semi-structured data, and the first unstructured data to obtain differentiated preprocessed data includes: performing data cleaning and standardization on the first structured data to obtain second structured data; performing tag parsing and field extraction on the first semi-structured data to obtain second semi-structured data; performing OCR recognition, speech-to-text conversion, and video frame extraction on the first unstructured data to obtain second unstructured data; and constructing differentiated preprocessed data comprising the second structured data, the second semi-structured data, and the second unstructured data.
[0009] In one feasible implementation, the step of fusing the differentiated preprocessed data to obtain fused processed data includes: The second structured data, the second semi-structured data, and the second unstructured data are fused in terms of format, element, and feature to obtain fused data including the third structured data, the third semi-structured data, and the third unstructured data.
[0010] In one feasible implementation, after fusing the differentiated preprocessed data to obtain fused processed data, the process includes: The third structured data is stored in a first database, which is a relational database; The third semi-structured data and the third unstructured data are stored in the second database, which is a distributed storage system. The comprehensive feature set is constructed based on the first database and the second database.
[0011] In one feasible implementation, the anomaly detection algorithm constructed based on the case heterogeneity graph, the various entity types, and the association relationships includes: The various entity types and their relationships are stored in a third database, which is a Neo4j graph database. An anomaly detection algorithm is constructed based on the case heterogeneous graph and the third database.
[0012] In one feasible implementation, the process of format fusion, element fusion, and feature fusion of the second structured data, the second semi-structured data, and the second unstructured data includes: The second structured data, the second semi-structured data, and the second unstructured data are mapped to a preset standard data model, wherein the standard data model performs data format standardization processing through a data format mapping algorithm; By constructing multi-source data element association rules, cosine similarity matching, and semantic association matching, the same second structured data, second semi-structured data, and second unstructured data are deduplicated and fused. Through a cross-data source association reasoning mechanism, the missing data in the second structured data, second semi-structured data, and second unstructured data are supplemented. Numerical features from the second structured data, label features from the second semi-structured data, and text features from the second unstructured data are extracted using a feature extraction algorithm. The numerical features, label features, and text features are then normalized, and feature fusion is performed based on a feature association analysis algorithm.
[0013] In one feasible implementation, constructing a case heterogeneity graph based on the comprehensive feature set, the multiple entity types, and the association relationships includes: Based on the various entity types, basic nodes are constructed, and edges between the corresponding basic nodes are constructed based on the relationships. The data is then updated and optimized based on the dynamic data in the comprehensive feature set to construct a case heterogeneous graph.
[0014] In one feasible implementation, the multiple entity types include: person entities, behavior entities, and location entities; the relationships include: relationships between people, relationships between people and behavior, and relationships between people and location.
[0015] In one feasible implementation, the anomaly detection algorithm includes a k-nearest neighbor algorithm and a local outlier factor (LOF) algorithm, and the anomaly detection algorithm outputs a detection result based on the judgment results of the k-nearest neighbor algorithm and the local outlier factor algorithm.
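As a rough illustration of how two such detectors might be combined, the following sketch computes a k-nearest-neighbor distance score and the local outlier factor (LOF, sometimes translated as "local anomaly factor") over plain feature vectors. The function names, the thresholds, and the rule that a point is reported only when both detectors flag it are assumptions made for this example; the text does not specify how the two judgments are combined.

```python
import math


def knn_table(points, k):
    """For each point, its k nearest neighbors as sorted (distance, index) pairs."""
    table = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        table.append(dists[:k])          # ties beyond k are ignored in this sketch
    return table


def knn_scores(table):
    """k-nearest-neighbor score: mean distance to the k nearest neighbors."""
    return [sum(d for d, _ in nbrs) / len(nbrs) for nbrs in table]


def lof_scores(table):
    """Local outlier factor computed from the k-nearest-neighbor table."""
    k_dist = [nbrs[-1][0] for nbrs in table]
    lrd = []                             # local reachability density per point
    for nbrs in table:
        reach = sum(max(k_dist[j], d) for d, j in nbrs) / len(nbrs)
        lrd.append(1.0 / reach if reach > 0 else math.inf)
    return [sum(lrd[j] for _, j in nbrs) / (len(nbrs) * lrd[i])
            for i, nbrs in enumerate(table)]


def detect(points, k=3, knn_factor=2.0, lof_threshold=1.5):
    """Indices flagged by both detectors (the AND rule is an assumption)."""
    table = knn_table(points, k)
    knn, lof = knn_scores(table), lof_scores(table)
    median = sorted(knn)[len(knn) // 2]
    return [i for i in range(len(points))
            if knn[i] > knn_factor * median and lof[i] > lof_threshold]


# Hypothetical two-dimensional feature vectors: five clustered points and one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]
flagged = detect(points)
lof = lof_scores(knn_table(points, 3))
```

Requiring agreement between the two scores trades recall for fewer false alarms; a union of the two flag sets would be the opposite choice.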
[0016] In one feasible implementation, the first structured data includes tabular data; the first semi-structured data includes forms, logs, or XML format files; and the first unstructured data includes documents, images, audio, or video.
[0017] This invention provides a case intelligent analysis method based on multi-dimensional data fusion and knowledge graph, which has the following beneficial effects: This invention eliminates format differences in case data from different sources and of different types by performing differentiated preprocessing and fusion processing on multi-source data, achieving deep fusion of structured, semi-structured, and unstructured multi-source data. Based on a comprehensive feature set, it constructs a heterogeneous case graph containing multiple types of entities and relationships, improving the accuracy and completeness of graph construction, comprehensively and realistically presenting the complex relationships between various entities in a case, uncovering potential hidden relationships, and overcoming the limitations of existing graph technologies in simple associations of single-type entities.
Attached Figure Description
[0018] The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the implementation of the invention and, together with the description, serve to explain the principles of the embodiments of the invention. It is obvious that the drawings described below are merely some embodiments of the invention, and those skilled in the art can obtain other drawings based on these drawings without any inventive effort.
[0019] Figure 1 is a flowchart of a case intelligent analysis method based on multi-dimensional data fusion and knowledge graph provided in an embodiment of the present invention; Figure 2 is a case heterogeneous graph for the case intelligent analysis method based on multi-dimensional data fusion and knowledge graph provided in an embodiment of the present invention.
Detailed Implementation
[0020] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0021] In the following description, the terms "first," "second," etc., are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Therefore, a feature defined with "first," "second," etc., may explicitly or implicitly include one or more of that feature. In the description of this application, unless otherwise stated, "a plurality of" means two or more.
[0022] In case analysis, the elements involved often involve multiple dimensions such as relevant personnel, various assets, transactions and key events, presenting a high degree of complexity. The connections between the various elements are highly concealed, the logical chains are complex, and there are often hidden and complex networks of connections between them. The connection between transactions and key events is difficult to capture. Various elements are intertwined and covered up layer by layer, which brings great obstacles to case verification and problem tracing.
[0023] To address the problems of low efficiency and easy omission of clues in traditional manual data sorting, graph technology has been introduced into case analysis. By using core elements of the case as nodes and the relationships between elements as edges, a case relationship network is constructed. This network integrates, cleans, and maps case data scattered in various tables, helping investigators to intuitively view relationships and locate breakthroughs in the investigation.
[0024] However, in addition to structured data such as tables, case data also comes from semi-structured and unstructured data such as logs, XML files, images, audio, and video. The sources of case data are scattered and diverse, and there is a lack of unified format standards for case data from different sources and of different types. Their formats vary greatly, and existing graph technology cannot achieve efficient integration of data from different sources and of different types. Most of them are simple associations of single-type entities and cannot fully present the complex relationships between various entities.
[0025] Furthermore, because case data comes from diverse and scattered sources, core elements are often missing or incomplete in individual data sources. Existing methods cannot accurately complete these multi-source data elements, which leaves the data incomplete and limits the comprehensiveness of case correlation analysis. Existing analysis capabilities are also weak: existing system algorithms mostly remain at the level of basic correlation display and lack abnormal correlation detection algorithms specific to case scenarios. They still rely on investigators to make manual judgments based on experience, which is not only inefficient but also prone to overlooking suspicious clues due to subjective factors, and fails to meet the current demands for accuracy, efficiency, and comprehensiveness in handling complex cases.
[0026] This invention provides a case intelligent analysis method based on multi-dimensional data fusion and knowledge graph, which solves the problems of existing graph technology being unable to achieve efficient fusion of data from different sources and of different types, as well as being unable to fully present the complex relationships between various entities, the difficulty in completing multi-source data elements, and the low efficiency of manual judgment.
[0027] The embodiments of this application will now be described with reference to the accompanying drawings.
[0028] Please refer to Figure 1, which is a flowchart of a case intelligent analysis method based on multi-dimensional data fusion and knowledge graph provided in an embodiment of the present invention.
[0029] This application provides a case intelligent analysis method based on multi-dimensional data fusion and knowledge graph, including: Step 101: Obtain multi-source data, which includes first structured data, first semi-structured data, and first unstructured data.
[0030] The sources of multi-source data can include: data from internal case-handling systems, shared data from external collaborating units, data obtained from third-party data service providers, and data collected from on-site investigations, to ensure the comprehensiveness and diversity of the data. The types of data collected include first structured data, first semi-structured data, and first unstructured data. The first structured data can include various tabular data, which has clearly defined fields and data formats and serves as the foundation for case analysis. The first semi-structured data can include various forms, logs, XML files, etc., which have a certain structural framework, but whose field formats are not uniform and require standardization. The first unstructured data can include documents, images, audio, video, etc., which have no fixed structure and need to be converted into an analyzable data format using specific technologies.
[0031] Step 102: Perform differential preprocessing on the first structured data, the first semi-structured data, and the first unstructured data to obtain differential preprocessed data. By performing differential preprocessing on data of different formats and sources, the unified standardization and deep integration of multi-source data can be achieved, solving the problems of inconsistent formats, missing elements, and data disorder in existing technologies.
[0032] In some embodiments, differential preprocessing is performed on the first structured data, the first semi-structured data, and the first unstructured data to obtain differential preprocessed data, including: Step 1021: Perform data cleaning and standardization on the first structured data to obtain the second structured data.
[0033] Specifically, cleaning and standardizing the first structured data handles missing values, duplicates, and outliers, ensuring the accuracy and integrity of the data. At the same time, field names, data formats, and coding standards are unified to eliminate format differences between structured data from different sources, laying the foundation for subsequent fusion.
[0034] Step 1022: Perform tag parsing and field extraction on the first semi-structured data to obtain the second semi-structured data.
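The cleaning and standardization described above can be sketched as follows. This is a minimal illustration, not the method's actual implementation: the field aliases, the accepted date formats, the duplicate key, and the fixed amount threshold used to flag outliers are all hypothetical choices made for the example.

```python
from datetime import datetime

# Hypothetical mapping from source-specific column names to standard names.
FIELD_ALIASES = {"Name": "name", "full_name": "name", "ID No.": "id_number",
                 "id": "id_number", "Amount": "amount", "Date": "date"}
# Hypothetical set of date formats seen across sources.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%d.%m.%Y")


def standardize_date(value):
    """Return the date as ISO 8601 text, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_rows(rows, max_amount=1_000_000.0):
    """Rename fields, normalize values, flag outliers, and drop duplicates."""
    seen, cleaned = set(), []
    for row in rows:
        rec = {FIELD_ALIASES.get(k, k): v for k, v in row.items()}
        rec["name"] = (rec.get("name") or "").strip() or None
        rec["date"] = standardize_date(rec.get("date") or "")
        try:
            rec["amount"] = float(rec.get("amount"))
        except (TypeError, ValueError):
            rec["amount"] = None          # missing or unparseable value
        # Simple threshold rule standing in for a real outlier test.
        rec["amount_outlier"] = rec["amount"] is not None and abs(rec["amount"]) > max_amount
        key = (rec.get("id_number"), rec["date"], rec["amount"])
        if key in seen:                   # duplicate after normalization
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned


raw = [
    {"Name": " Zhang San ", "ID No.": "A001", "Amount": "1200.50", "Date": "2024/03/01"},
    {"full_name": "Zhang San", "id": "A001", "Amount": "1200.5", "Date": "2024-03-01"},
    {"Name": "Li Si", "ID No.": "A002", "Amount": "", "Date": "01.03.2024"},
]
rows = clean_rows(raw)
```

Here the first two rows collapse into one record once their field names, dates, and amounts are normalized, and the third row keeps an explicit `None` for its missing amount so that later element completion can find it.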
[0035] Specifically, by parsing tags and extracting fields, core fields and key information can be extracted from the first semi-structured data and transformed into standardized first structured data. By combining first structured data of the same type, element completion can be performed to prevent missing fields and ensure data integrity.
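A minimal sketch of tag parsing and field extraction for an XML source is shown below, using Python's standard `xml.etree.ElementTree` module. The `<calls>` log layout, the tag names, and the target field names are hypothetical; real case logs would follow their own schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical call log; the element and attribute names are illustrative only.
SAMPLE_XML = """
<calls>
  <call id="c1"><from>13800000001</from><to>13800000002</to><start>2024-03-01T10:00:00</start><seconds>95</seconds></call>
  <call id="c2"><from>13800000001</from><to>13800000003</to><start>2024-03-01T11:30:00</start></call>
</calls>
"""

# Tag name -> standardized field name (assumed mapping).
TAG_TO_FIELD = {"from": "caller", "to": "callee",
                "start": "start_time", "seconds": "duration_s"}


def extract_call_records(xml_text):
    """Parse <call> elements into flat records with a fixed set of fields."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for call in root.iter("call"):
        rec = {"call_id": call.get("id")}
        for tag, field in TAG_TO_FIELD.items():
            rec[field] = call.findtext(tag)   # None when the tag is absent
        if rec["duration_s"] is not None:
            rec["duration_s"] = int(rec["duration_s"])
        records.append(rec)
    return records


records = extract_call_records(SAMPLE_XML)
```

Every record carries the same field set, with `None` marking fields the source omitted, so missing elements stay visible for the completion step rather than silently disappearing.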
[0036] Step 1023: Perform OCR recognition, speech-to-text conversion, and video frame extraction on the first unstructured data to obtain the second unstructured data.
[0037] Through OCR recognition, speech-to-text conversion, and video frame extraction, the first unstructured data can be transformed into structured or semi-structured data; for document data, text segmentation and keyword extraction can be used to extract core information related to the case and transform it into a form that can be associated and analyzed.
[0038] Step 1024: Construct differentiated preprocessed data including second structured data, second semi-structured data, and second unstructured data.
[0039] Step 103: Perform fusion processing on the differentiated preprocessed data to obtain fused data, and construct a comprehensive feature set based on the fused data. By performing comprehensive and multi-level fusion of multi-source and multi-format data, the problems of missing data elements, inconsistent format specifications, and weak data correlation can be effectively solved.
[0040] In some embodiments, fusing the differentiated preprocessed data to obtain fused data includes: performing format fusion, element fusion, and feature fusion on the second structured data, the second semi-structured data, and the second unstructured data to obtain fused data that includes third structured data, third semi-structured data, and third unstructured data. Format fusion can eliminate the fusion barriers between data of different formats, element fusion can fill in missing elements in multi-source data and make data elements consistent, and feature fusion can uncover potential relationships between data.
[0041] The format fusion, element fusion, and feature fusion of the second structured data, the second semi-structured data, and the second unstructured data include: Step 1031: Map the second structured data, the second semi-structured data, and the second unstructured data to a preset standard data model. The standard data model performs data format standardization processing through a data format mapping algorithm.
[0042] The second structured data, second semi-structured data, and second unstructured data, after differential preprocessing, are uniformly mapped to a preset standard data model. The standard data model pre-defines unified field specifications, data types, encoding formats, and value ranges based on the core needs of case analysis. Through a data format mapping algorithm, fields of different format data are accurately matched with fields of the standard model. Mismatched fields are converted, split, or merged to ensure that data from all sources and in all formats follows unified specifications, thus achieving standardized and unified data formats.
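One way to picture the mapping onto a standard data model is a small rule table per source, with rules that rename, split, or merge fields, followed by type conversion. The sketch below is illustrative only: the standard model, the source names, and the rule format are assumptions for this example, not the data format mapping algorithm itself.

```python
# Hypothetical standard model: field name -> expected Python type.
STANDARD_MODEL = {"person_name": str, "phone": str, "amount_cny": float,
                  "city": str, "district": str}

# Hypothetical per-source rules:
#   ("rename", src, dst), ("split", src, sep, dsts), ("merge", srcs, sep, dst)
SOURCE_RULES = {
    "bank_table": [("rename", "cust_name", "person_name"),
                   ("rename", "amt", "amount_cny"),
                   ("split", "location", "/", ("city", "district"))],
    "call_log": [("merge", ("surname", "given_name"), " ", "person_name"),
                 ("rename", "msisdn", "phone")],
}


def to_standard(record, source):
    """Map one source record onto the standard model; unmapped fields stay None."""
    out = dict.fromkeys(STANDARD_MODEL)
    for rule in SOURCE_RULES[source]:
        kind = rule[0]
        if kind == "rename":
            _, src, dst = rule
            out[dst] = record.get(src)
        elif kind == "split":
            _, src, sep, dsts = rule
            parts = (record.get(src) or "").split(sep)
            for dst, part in zip(dsts, parts):
                out[dst] = part.strip() or None
        elif kind == "merge":
            _, srcs, sep, dst = rule
            out[dst] = sep.join(record.get(s, "") for s in srcs) or None
    # Convert each populated field to the type the standard model declares.
    for name, typ in STANDARD_MODEL.items():
        if out[name] is not None:
            out[name] = typ(out[name])
    return out


a = to_standard({"cust_name": "Zhang San", "amt": "1200.50",
                 "location": "Shenyang / Heping"}, "bank_table")
b = to_standard({"surname": "Zhang", "given_name": "San",
                 "msisdn": "13800000001"}, "call_log")
```

After mapping, both records expose the same keys, so downstream fusion code never needs to know which source a record came from.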
[0043] Step 1032: By constructing multi-source data element association rules, cosine similarity matching, and semantic association matching, deduplication and fusion are performed on the same second structured data, second semi-structured data, and second unstructured data. Through cross-data source association reasoning mechanism, missing data in the second structured data, second semi-structured data, and second unstructured data are supplemented.
[0044] By constructing association rules for multi-source data elements and combining them with the inherent correlation characteristics of case data, the system automatically identifies identical or similar core elements from different data sources. It employs a combination of cosine similarity matching and semantic association matching to deduplicate and fuse identical elements, and normalizes similar elements. Furthermore, addressing the issue of missing core elements in some data sources, a cross-data source association reasoning mechanism leverages the inherent relationships between different data sources, combined with existing complete element information, to accurately complete missing elements. For example, it uses payer information from transaction data to complete missing related asset information in personnel data, effectively solving the pain point of difficulty in completing multi-source data elements in existing technologies and ensuring the integrity and consistency of the fused data.
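The deduplication and completion described above can be sketched under simplifying assumptions. In this example, cosine similarity is computed over character bigrams of person names, duplicates are merged by filling empty fields, and a missing account is completed from transactions that name the person as payer. Matching on names alone, the 0.8 threshold, and all field names are hypothetical simplifications; semantic association matching is not shown.

```python
import math
from collections import Counter


def bigrams(text):
    """Character-bigram counts of a lowercased string with whitespace removed."""
    s = "".join(text.lower().split())
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def fuse_persons(records, threshold=0.8):
    """Merge records whose names are similar; fill empty fields from duplicates."""
    fused = []
    for rec in records:
        for kept in fused:
            if cosine(bigrams(rec["name"]), bigrams(kept["name"])) >= threshold:
                for key, value in rec.items():
                    if kept.get(key) is None:     # element completion
                        kept[key] = value
                break
        else:
            fused.append(dict(rec))
    return fused


def complete_accounts(persons, transactions):
    """Fill a missing account from transactions that name the person as payer."""
    by_payer = {t["payer_id"]: t["payer_account"] for t in transactions}
    for p in persons:
        if p.get("account") is None:
            p["account"] = by_payer.get(p["id"])
    return persons


persons = [
    {"id": "A001", "name": "Zhang San", "phone": None, "account": None},
    {"id": "A001", "name": "zhang  san", "phone": "13800000001", "account": None},
    {"id": "A002", "name": "Li Si", "phone": None, "account": "6222-02"},
]
transactions = [{"payer_id": "A001", "payer_account": "6222-01", "amount": 500.0}]
fused = complete_accounts(fuse_persons(persons), transactions)
```

A production rule set would normally combine several elements (identifier, name, contact information) before treating two records as the same person, since name similarity alone can merge distinct individuals.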
[0045] Step 1033: Extract numerical features from the second structured data, label features from the second semi-structured data, and text features from the second unstructured data using a feature extraction algorithm. Normalize the numerical features, label features, and text features, and perform feature fusion based on a feature association analysis algorithm.
[0046] Feature extraction algorithms are used to extract the numerical features of the second structured data, the label features of the second semi-structured data, and the text features of the second unstructured data. The extracted features are then normalized to eliminate differences in units of measurement between features. Next, feature association analysis algorithms are used to explore the potential relationships between different types of features, and the scattered features are deeply integrated to form a comprehensive feature set that fully reflects the core information of the case. This provides high-quality and highly correlated data support for entity recognition, relationship mining, and intelligent anomaly detection in the subsequent construction of heterogeneous graphs, realizing the deep integration of multi-source and multi-format data and breaking down data silos.
[0047] It is worth noting that after the data fusion is completed, the fused data is verified from four dimensions: accuracy, completeness, consistency, and relevance. A multi-dimensional data quality verification mechanism is adopted to eliminate unqualified data, thereby ensuring the high quality of the fused data and providing a reliable guarantee for graph construction and intelligent analysis.
[0048] Differentiated preprocessing and fusion processing of the first structured data, the first semi-structured data, and the first unstructured data can achieve unified standardization and efficient fusion of data from different sources and in different formats, make up for the shortcomings of missing elements in multi-source data, ensure that case handlers can obtain complete case-related information, and achieve penetrating and multi-dimensional case verification, so as to solve the problems of "inability to achieve efficient fusion of data from different sources and of different types", "difficulty in completing elements of multi-source data" and "high difficulty in structured data analysis".
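To make the normalization and fusion step concrete, the sketch below builds one fused vector per record from a numerical feature, a label feature, and keyword counts taken from text. Min-max scaling, one-hot encoding, keyword counting, and plain concatenation are stand-ins chosen for this example; the specific feature extraction and feature association analysis algorithms are not given in the text and are not reproduced here.

```python
def min_max(values):
    """Scale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def one_hot(labels):
    """One-hot encode labels against their sorted vocabulary."""
    vocab = sorted(set(labels))
    return [[1.0 if lab == v else 0.0 for v in vocab] for lab in labels], vocab


def keyword_counts(texts, keywords):
    """Count occurrences of each keyword in each lowercased text."""
    return [[float(t.lower().count(k)) for k in keywords] for t in texts]


def fuse_features(amounts, labels, texts, keywords):
    """Normalize each feature family and concatenate them per record."""
    num = min_max(amounts)
    lab, vocab = one_hot(labels)
    # Normalize text features column by column so each keyword is on [0, 1].
    columns = [min_max(list(col)) for col in zip(*keyword_counts(texts, keywords))]
    txt = [list(row) for row in zip(*columns)]
    return [[n] + l + t for n, l, t in zip(num, lab, txt)], vocab


# Hypothetical inputs: one value of each feature family per record.
amounts = [100.0, 5000.0, 2550.0]
labels = ["transfer", "withdrawal", "transfer"]
texts = ["cash handed over at hotel",
         "transfer to overseas account",
         "hotel booking and cash deposit"]
fused, vocab = fuse_features(amounts, labels, texts, keywords=["cash", "hotel"])
```

Because every component is scaled to the same range before concatenation, no single feature family dominates distance-based analysis simply because of its units.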
[0049] Step 104: Define multiple entity types and construct association relationships based on the association characteristics of multiple entity types and case data.
[0050] By combining the inherent characteristics of case data with the core needs of case analysis, multiple core entities are defined. For example, there can be three types of entities. Each type of entity carries a clear case semantic and is given a unique attribute system to ensure the integrity and relevance of entity information, providing a solid foundation for the construction of heterogeneous graphs.
[0051] In some embodiments, the multiple entity types include: person entities, behavior entities, and location entities; the relationships include: relationships between people, relationships between people and behavior, and relationships between people and location.
[0052] Among them, the person entity, as the core entity of case analysis, encompasses various personnel related to the case, including criminal suspects, related persons, and victims. It is a key target node for subsequent identification of suspicious persons, mining of relationships, and prediction of links. Its core attributes include unique identifiers, names, contact information, identity information, case-related identifications, and records of related behaviors, which can comprehensively depict the basic information and case-related characteristics of the person, providing basic support for subsequent anomaly detection and clue analysis.
[0053] Behavior entities are used to abstract and quantify the specific activities of individuals. They are the core carriers for depicting behavioral patterns and uncovering clues related to cases, directly linking person entities with location entities and carrying key semantic information for case analysis. Based on the actual needs of case analysis, this embodiment focuses on three types of behaviors with high value: transaction behavior, call behavior, and travel behavior. Transaction behavior primarily reflects fund flows and economic connections, supporting the investigation of abnormal funds; call behavior primarily reflects social connections and communication patterns, helping to uncover hidden associates; and travel behavior primarily reflects spatiotemporal trajectories and activity ranges, providing support for reconstructing the individual's activity path. The core attributes of a behavior entity include unique identifiers, behavior type, time of occurrence, duration, associated objects, and related data, comprehensively capturing the key characteristics of the behavior.
[0054] Location entities represent the specific places where various behaviors occur, serving as a crucial link connecting individuals and behaviors. They are essential for case analysis, reconstructing activity paths, and pinpointing key locations involved in a case. The types of locations they encompass include banks, hotels, intersections, offices, and other crime-related venues. Core attributes include unique identifiers, location type, specific address, geographical coordinates, records of location-related behaviors, and information on personnel associated with the location. This allows for precise location pinpointing and association of relevant individuals and behaviors, providing support for reconstructing the crime scene.
[0055] The relationships cover various core related scenarios in case analysis, clarify the semantic connotation and related logic of various relationships, and ensure that the heterogeneous graph can comprehensively and accurately present the complex relationships between entities, providing support for the discovery and analysis of case clues.
[0056] The relationships between individuals encompass various connections, including associates, co-defendants, relatives, and social connections. Among these, associate relationships are used to depict connections between individuals who are not directly involved in the case but have indirect connections; co-defendant relationships are used to clarify connections between individuals jointly involved in the case; kinship relationships are used to uncover hidden connections between relatives; and social connection relationships are used to present social networks between individuals by combining data such as call behavior, providing support for the discovery of suspicious connections.
[0057] The relationship between individuals and behaviors encompasses two core connections: acts of execution and acts of participation. Acts of execution clarify that an individual directly performs a specific type of behavior (such as initiating a transaction, making a call, or traveling independently), while acts of participation clarify that an individual participates in the actions of others (such as jointly participating in a transaction, participating in a call, or traveling together). By associating the specific details of the behaviors, a precise binding between individuals and behaviors is achieved, providing support for characterizing behavioral patterns.
[0058] The relationship between people and locations encompasses three types of connections: transit points, places of stay, and locations involved in the case. Transit points record the places a person passes through; places of stay record the places where a person clearly stayed and the duration of their stay; and locations involved in the case mark the places that are related to the core plot of the case and where a person is involved in the case, providing support for reconstructing the person's activity trajectory and identifying key locations involved in the case.
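The entity types and relationships described above can be written down as a small typed schema. The following sketch is one possible encoding; the class names, enum values, and the endpoint check are hypothetical and serve only to show how each relationship is tied to a specific pair of entity types.

```python
from dataclasses import dataclass, field
from enum import Enum


class EntityType(Enum):
    PERSON = "person"
    BEHAVIOR = "behavior"
    LOCATION = "location"


class RelationType(Enum):
    ASSOCIATE = "associate"              # person to person
    CO_DEFENDANT = "co_defendant"        # person to person
    RELATIVE = "relative"                # person to person
    SOCIAL = "social"                    # person to person
    EXECUTES = "executes"                # person to behavior
    PARTICIPATES = "participates"        # person to behavior
    PASSES_THROUGH = "passes_through"    # person to location
    STAYS_AT = "stays_at"                # person to location
    CASE_LOCATION = "case_location"      # person to location


_P, _B, _L = EntityType.PERSON, EntityType.BEHAVIOR, EntityType.LOCATION
# Allowed (source type, target type) pair for each relationship.
ENDPOINTS = {
    RelationType.ASSOCIATE: (_P, _P), RelationType.CO_DEFENDANT: (_P, _P),
    RelationType.RELATIVE: (_P, _P), RelationType.SOCIAL: (_P, _P),
    RelationType.EXECUTES: (_P, _B), RelationType.PARTICIPATES: (_P, _B),
    RelationType.PASSES_THROUGH: (_P, _L), RelationType.STAYS_AT: (_P, _L),
    RelationType.CASE_LOCATION: (_P, _L),
}


@dataclass
class Entity:
    uid: str
    etype: EntityType
    attrs: dict = field(default_factory=dict)


@dataclass
class Relation:
    source: Entity
    target: Entity
    rtype: RelationType
    attrs: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject relations whose endpoints do not match the schema.
        if (self.source.etype, self.target.etype) != ENDPOINTS[self.rtype]:
            raise ValueError(f"{self.rtype.value} cannot link "
                             f"{self.source.etype.value} to {self.target.etype.value}")


zhang = Entity("p1", EntityType.PERSON, {"name": "Zhang San"})
transfer = Entity("b1", EntityType.BEHAVIOR, {"behavior_type": "transaction"})
bank = Entity("l1", EntityType.LOCATION, {"location_type": "bank"})
rel = Relation(zhang, transfer, RelationType.EXECUTES, {"role": "payer"})
```

Validating endpoints at construction time keeps malformed associations, such as a location "staying at" a person, out of the graph before any analysis runs.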
[0059] Step 105: Construct a case heterogeneity graph based on the comprehensive feature set, multiple entity types, and relationships.
[0060] Please refer to Figure 2, which shows a case heterogeneous graph for the case intelligent analysis method based on multi-dimensional data fusion and knowledge graph provided in an embodiment of the present invention.
[0061] Specifically, based on the data, multiple entity types, and relationships in the comprehensive feature set after multi-source fusion, the framework of the case heterogeneous graph is determined, and the node types, edge types, and attribute definitions of the case heterogeneous graph are clarified. At the same time, standards and specifications for the construction of heterogeneous graphs can be formulated to ensure the standardization and consistency of the construction process.
[0062] In some embodiments, constructing the case heterogeneous graph based on the comprehensive feature set, the multiple entity types, and the relationships includes: constructing basic nodes based on the multiple entity types, constructing edges between the corresponding basic nodes based on the relationships, and updating and optimizing with dynamic data from the comprehensive feature set to obtain the case heterogeneous graph.
[0063] A layered construction strategy can be adopted, gradually building a complete heterogeneous graph of cases from the basic layer to the related layers, ensuring the completeness and relevance of the graph. Specifically, multiple types of entities are used as the basic nodes of the heterogeneous graph, classified and labeled according to entity type, and the attribute information of the entities is associated with the corresponding nodes to complete the construction of the basic layer nodes. The basic layer is the core of the heterogeneous graph, ensuring that all case-related entities are included in the graph without omission. Based on the definition of the relationship between entities and combined with the association information, the edges of the heterogeneous graph are constructed. Differentiated association construction methods are adopted for the relationship between different types of entities. For example, the association between people and assets is constructed by integrating asset ownership information from the data; the association between transactions and people is constructed by using the participant information from the transaction data. At the same time, association attributes are added to each edge to enrich the association information and enhance the analytical value of the graph.
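The layered strategy described above can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions, not the implementation of the invention: the class name `CaseGraph`, the entity and relationship labels, and the sample records are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CaseGraph:
    """Toy heterogeneous graph: typed nodes plus typed, attributed edges."""
    nodes: dict = field(default_factory=dict)  # node_id -> {"type": ..., "attrs": {...}}
    edges: list = field(default_factory=list)  # (src, dst, rel_type, attrs)

    def add_node(self, node_id, node_type, **attrs):
        # Basic layer: every entity becomes a node labeled with its type.
        self.nodes[node_id] = {"type": node_type, "attrs": attrs}

    def add_edge(self, src, dst, rel_type, **attrs):
        # Association layer: edges may only connect nodes that already exist.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added to the basic layer first")
        self.edges.append((src, dst, rel_type, attrs))


# Hypothetical fused records standing in for the comprehensive feature set.
assets = [{"id": "card_1", "owner": "p1", "kind": "bank_card"}]
transactions = [{"id": "t1", "participants": ["p1", "p2"], "amount": 5000}]

graph = CaseGraph()

# Layer 1: basic nodes, classified by entity type.
for pid in ("p1", "p2"):
    graph.add_node(pid, "person")
for a in assets:
    graph.add_node(a["id"], "asset", kind=a["kind"])
for t in transactions:
    graph.add_node(t["id"], "transaction", amount=t["amount"])

# Layer 2: edges, built differently per relationship type.
for a in assets:  # person-asset edges from ownership information
    graph.add_edge(a["owner"], a["id"], "owns")
for t in transactions:  # transaction-person edges from participant information
    for pid in t["participants"]:
        graph.add_edge(pid, t["id"], "participates_in", amount=t["amount"])
```

Rejecting edges whose endpoints are missing is one way to enforce the rule that the basic layer is complete before associations are added.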
[0064] It is worth noting that after the case heterogeneous graph is constructed, a dynamic optimization mechanism can be established to receive new multi-source data in real time and to update and optimize the graph. When new data arrives, preprocessing and fusion are completed automatically, and new entities and relationships are extracted and added to the graph. The graph is also verified and optimized at regular intervals to remove erroneous associations and supplement missing ones, so that it reflects the latest relationship status of the case and provides accurate graph support for case analysis.
[0065] By constructing a case heterogeneous graph that reflects multiple types of entities and their relationships, the graph can comprehensively present the complex relationships between various entities, providing reliable graph support for case analysis and solving the problem that "existing technologies cannot comprehensively present the complex relationships between various entities".
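A minimal sketch of such an update step is given below, assuming the graph is held as a node dictionary and an edge set. The function name `apply_update`, the record layout, and the `invalid_edges` argument are hypothetical; the sketch also assumes the verification pass that finds erroneous associations is done elsewhere.

```python
def apply_update(nodes, edges, new_records, invalid_edges=()):
    """Merge new entities and relationships into the graph, then prune bad edges.

    nodes: dict mapping node_id -> entity type
    edges: set of (src, dst, rel_type) tuples
    new_records: iterable of dicts with keys "src", "dst", "rel", "types"
    invalid_edges: edges flagged as erroneous by a periodic verification pass
    """
    for rec in new_records:
        # Add any entity not yet present; existing nodes are left untouched.
        for node_id, node_type in rec["types"].items():
            nodes.setdefault(node_id, node_type)
        # Using a set makes repeated ingestion of the same relationship idempotent.
        edges.add((rec["src"], rec["dst"], rec["rel"]))
    # Periodic verification: drop associations found to be erroneous.
    edges.difference_update(invalid_edges)
    return nodes, edges


nodes = {"p1": "person", "t1": "transaction"}
edges = {("p1", "t1", "participates_in"), ("p1", "p9", "knows")}

batch = [
    {"src": "p2", "dst": "t1", "rel": "participates_in",
     "types": {"p2": "person", "t1": "transaction"}},
]
nodes, edges = apply_update(nodes, edges, batch,
                            invalid_edges={("p1", "p9", "knows")})
```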
[0066] Step 106: Construct an anomaly detection algorithm based on the case heterogeneous graph, multiple entity types, and relationships.
[0067] In some embodiments, the anomaly detection algorithm includes the k-nearest neighbor algorithm and the local outlier factor (LOF) algorithm, and outputs the detection result based on the judgment results of both algorithms.
[0068] By integrating the k-nearest neighbor algorithm and the local outlier factor (LOF) algorithm, algorithm parameters can be optimized for the characteristics of case analysis scenarios. Relying on the diverse entities and relationships within the heterogeneous graph, abnormal characteristics of entities and relationships can be analyzed quantitatively along multiple dimensions, including transactions, behaviors, and asset holdings. This achieves comprehensive coverage, automatic identification, and tiered early warning for various suspicious patterns, overcoming the limitations of manual analysis. The core logic of the judgment is to calculate the statistical distribution of key indicators for entities or relationships and to set normal-range thresholds based on conventional case analysis standards. When an indicator value falls outside this normal range, it is judged abnormal and a warning is issued.
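The threshold logic in the preceding paragraph can be illustrated with a simple range check. This is a sketch under stated assumptions: the text does not specify how the normal range is derived, so a mean plus or minus `k` standard deviations rule is assumed here, and the function name `flag_out_of_range` and the sample indicator values are hypothetical.

```python
from statistics import mean, stdev


def flag_out_of_range(values, k=2.0):
    """Return indices whose value lies outside mean +/- k * sample standard deviation."""
    mu, sigma = mean(values), stdev(values)
    low, high = mu - k * sigma, mu + k * sigma
    return [i for i, v in enumerate(values) if v < low or v > high]


# Hypothetical indicator: monthly transaction counts for ten accounts.
monthly_tx_counts = [12, 15, 11, 14, 13, 12, 16, 13, 14, 95]
suspicious = flag_out_of_range(monthly_tx_counts)  # -> [9]
```

In practice the multiplier `k` would be tuned to the case analysis standard in use; a robust alternative such as a median-based range may be preferable when extreme values distort the mean.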
[0069] Specifically, the k-nearest neighbor algorithm is used to calculate the association distance between a given entity node and other nodes. This association distance is calculated as a weighted average over factors such as the number of associations and the strength of the associations. If the average distance between the node and its k nearest neighbors is significantly greater than the corresponding average distance for other nodes, the node can be identified as an abnormally isolated node and becomes a key target for investigation.
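One reading of this k-nearest neighbor check is sketched below. Several assumptions are made that the text does not confirm: nodes are represented as hypothetical two-dimensional feature vectors, Euclidean distance stands in for the weighted association distance, and "significantly greater" is taken to mean more than `ratio` times the median score across all nodes.

```python
from math import dist
from statistics import median


def knn_scores(points, k=2):
    """Mean distance from each point to its k nearest neighbors."""
    scores = {}
    for name, p in points.items():
        d = sorted(dist(p, q) for other, q in points.items() if other != name)
        scores[name] = sum(d[:k]) / k
    return scores


def isolated_nodes(points, k=2, ratio=3.0):
    """Flag nodes whose k-NN score exceeds `ratio` times the median score."""
    scores = knn_scores(points, k)
    baseline = median(scores.values())
    return sorted(n for n, s in scores.items() if s > ratio * baseline)


# Hypothetical embedding of entity nodes (e.g. association count, association strength).
points = {
    "a": (1.0, 1.0), "b": (1.2, 0.9), "c": (0.9, 1.1),
    "d": (1.1, 1.2), "e": (8.0, 8.0),
}
flagged = isolated_nodes(points)  # -> ["e"]
```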
[0070] The local outlier factor (LOF) algorithm is used to calculate the density of an entity node within its local neighborhood, for example from the number of nodes per unit of association space and the degree of association. If the local density of a node is significantly lower than the local density of its neighboring nodes, it is identified as a low-density anomaly node, which is likely to have hidden associations or illegal behavior and needs to be included in the scope of in-depth analysis.
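The density comparison described above can be sketched with a simplified local outlier factor computation. The assumptions are as follows: nodes are hypothetical two-dimensional feature vectors, Euclidean distance is used, exactly `k` neighbors are taken (ties at the k-th distance are not expanded), and no two points coincide. A score well above 1 means the node is much less dense than its neighbors.

```python
from math import dist


def lof_scores(points, k=2):
    """Simplified local outlier factor for each point."""
    names = list(points)
    # k nearest neighbors of each point as (distance, neighbor) pairs.
    knn = {
        n: sorted((dist(points[n], points[m]), m) for m in names if m != n)[:k]
        for n in names
    }
    k_dist = {n: knn[n][-1][0] for n in names}  # distance to the k-th neighbor

    def lrd(n):
        # Local reachability density: inverse of the mean reachability distance.
        reach = [max(k_dist[m], d) for d, m in knn[n]]
        return len(reach) / sum(reach)

    density = {n: lrd(n) for n in names}
    # LOF: average density of the neighbors divided by the point's own density.
    return {
        n: sum(density[m] for _, m in knn[n]) / (len(knn[n]) * density[n])
        for n in names
    }


points = {
    "a": (1.0, 1.0), "b": (1.2, 0.9), "c": (0.9, 1.1),
    "d": (1.1, 1.2), "e": (8.0, 8.0),
}
scores = lof_scores(points)
low_density = [n for n, s in scores.items() if s > 5.0]  # cutoff 5.0 is an assumption
```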
[0071] This invention employs an anomaly voting mechanism, integrating the detection results of multiple algorithms to double-mark high-confidence anomaly nodes, thereby reducing the false positive rate and improving the accuracy of anomaly clue identification. Simultaneously, it provides tiered warnings for anomaly nodes based on their anomaly confidence level, facilitating investigators to prioritize high-priority anomaly clues. This overcomes the shortcomings of existing technologies that often use a single algorithm for anomaly detection, resulting in a high false positive rate.
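The voting and tiering step can be illustrated as a set operation over the two detectors' outputs. The function name `vote`, the tier labels "high" and "low", and the rule that a node flagged by only one detector receives the lower tier are assumptions made for this sketch; the text only states that doubly marked nodes are treated as high confidence.

```python
def vote(knn_flags, lof_flags):
    """Combine two detector outputs into tiered warnings.

    Nodes flagged by both detectors get the "high" tier (double-marked);
    nodes flagged by exactly one detector get the "low" tier.
    """
    both = knn_flags & lof_flags
    either = knn_flags | lof_flags
    return {n: ("high" if n in both else "low") for n in sorted(either)}


# Hypothetical detector outputs: sets of flagged node identifiers.
warnings = vote({"e", "b"}, {"e", "c"})
# -> {"b": "low", "c": "low", "e": "high"}
```

Requiring agreement for the top tier is what lowers the false positive rate: a node that only one algorithm considers abnormal is still surfaced, but with lower priority.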
[0072] Step 107: Construct a visualization map, which is used to display the detection results output by the anomaly detection algorithm.
[0073] By displaying the detection results of anomaly detection algorithms through a visual graph, complex entity relationships, the distribution characteristics of anomalous nodes, and anomaly confidence levels can be presented to investigators in an intuitive graphical way. This helps investigators quickly clarify the case details, accurately identify key targets and suspicious clues, and efficiently complete problem verification, significantly improving the accuracy, efficiency, and depth of case investigations, thus solving the problem of "low efficiency of manual judgment." The visual graph can be configured with multiple layout display modes, integrates multiple sub-graph types, and supports interactive functions.
[0074] Specifically, layout display modes can include circular layout, grouped layout, and display of unrelated objects.
[0075] The circular layout arranges the core entities involved in the case and various related entities (such as related personnel, assets, events, and government affairs) in a circular pattern around the center. This is suitable for analyzing the direct relationship network of the core entities and quickly sorting out the core relationship.
[0076] The group layout is based on the entity's government affiliation, association type, and business scope, and only displays the association between entities in the same group, which facilitates classified supervision and verification.
[0077] The display of unrelated objects is based on the existing relationships between entities. It shows objects in a case that are not directly related, which makes it easier to discover potential hidden connections and concealed clues, and achieve in-depth supervision.
[0078] The types of integrated sub-graphs include relationship graphs, transaction fund graphs, communication behavior graphs, and comprehensive graphs.
[0079] The relationship graph, centered on individuals and government agencies, integrates various relationships between entities, including custom relationships, related person relationships, common related person relationships, and government business relationships. This graph visually displays the network of relationships between people and organizations, helping to uncover potential related entities and hidden relationships.
[0080] The transaction fund map, centered on bank cards and fund accounts, integrates the relationships between personnel and assets, assets and assets, and institutions and funds, including common counterparties, transaction frequency, amount range, and fund flow paths. This map clearly presents the counterparties, scale range, and complete flow paths of government-related funds, helping to identify abnormal fund transactions and illegal fund flows.
[0081] The communication behavior graph centers on communication event entities, integrating the relationships between people and events, and between events themselves. This includes shared communication partners, call count filtering, shared base station relationships, and communication time period analysis. The graph can display the correlation between individuals' communication behavior trajectories and events, helping to discover abnormal communication patterns and hidden contact clues.
[0082] The comprehensive mapping system can integrate the relationships between multiple entities based on dimensions such as government office addresses, associated vehicles, fund-related addresses, business processing addresses, and network-related addresses, enabling cross-scenario and multi-dimensional supervision and helping to uncover potential related clues from multiple scenarios.
[0083] Interactive functionality means that the visualization graph supports hovering the mouse to view details (e.g., person nodes display name, position, employing organization, and number of associated entities; relationship edges display association type, core information, and government association identifier), dragging the mouse to adjust node positions, and freely zooming the graph to view the overall network or local details. It also supports highlighting functionality. For example, selecting a person or organization node automatically highlights all its associated entities and relationship edges, while non-associated entities and relationship edges are displayed in grayscale, facilitating focused analysis of the target entity's relationship network and accurately identifying key areas for supervision.
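The highlighting behavior can be sketched as a partition of the graph around the selected node. This is an illustration only: the function name `highlight`, the edge tuple layout, and the sample edges are hypothetical, and only directly associated entities are highlighted.

```python
def highlight(edges, selected):
    """Split nodes and edges into highlighted and grayed-out sets for a selected node."""
    hot_edges = [e for e in edges if selected in e[:2]]
    hot_nodes = {selected} | {n for e in hot_edges for n in e[:2]}
    all_nodes = {n for e in edges for n in e[:2]}
    return {
        "highlight_nodes": hot_nodes,
        "highlight_edges": hot_edges,
        "gray_nodes": all_nodes - hot_nodes,
        "gray_edges": [e for e in edges if e not in hot_edges],
    }


# Hypothetical relationship edges: (source, target, relationship type).
edges = [
    ("p1", "org1", "employed_by"),
    ("p1", "p2", "relative"),
    ("p2", "org2", "employed_by"),
]
view = highlight(edges, "p1")
```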
[0084] It is worth noting that the visualization graph provides multi-dimensional, high-precision relational query functions, enabling investigators to quickly retrieve all related information about a target entity. Leveraging the diverse entity types and relationships within the heterogeneous graph, it achieves accurate and efficient queries. These functions include precise search, fuzzy search, graph display, list display, detail display, graph export, and analysis report export.
[0085] Specifically, precise search supports inputting the unique identifier of an entity to directly locate the target entity and query all its related entities; fuzzy search supports inputting non-unique attributes of an entity, and the system returns a list of all entities containing the keyword. After the investigator selects the target entity, its related information is displayed. The graph display supports presenting query results in a visual heterogeneous graph format, with the target entity centered and related entities distributed around it, and the relationship edges labeled with the relationship type and key information. The list display supports providing detailed information in a list format, and supports sorting by relevant fields for quick filtering of key information. The detail display allows clicking on a related entity or relationship to pop up a details window displaying complete information. The graph export supports exporting the currently displayed heterogeneous graph in various image formats, allowing selection of either a complete graph or a partial graph. The analysis report export supports exporting structured data generated during the analysis process as a table format, with table fields completely consistent with those displayed in the system, and supports data filtering and sorting functions for further data processing by investigators.
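The difference between precise and fuzzy search can be sketched over an in-memory entity table. All identifiers, attribute names, and sample records here are hypothetical, and case-insensitive substring matching is assumed for the fuzzy case since the matching rule is not specified.

```python
# Hypothetical entity records keyed by unique identifier.
entities = {
    "P-001": {"type": "person", "name": "Alice Chen", "org": "Finance Office"},
    "P-002": {"type": "person", "name": "Bob Chen", "org": "Audit Office"},
    "A-100": {"type": "asset", "name": "Bank card 6222", "org": ""},
}
relations = [("P-001", "A-100", "owns"), ("P-001", "P-002", "relative")]


def related(entity_id):
    """All entities directly connected to entity_id, with the relationship type."""
    out = []
    for src, dst, rel in relations:
        if src == entity_id:
            out.append((dst, rel))
        elif dst == entity_id:
            out.append((src, rel))
    return out


def precise_search(entity_id):
    """Exact lookup by unique identifier; returns None if the identifier is unknown."""
    if entity_id not in entities:
        return None
    return {"entity": entities[entity_id], "related": related(entity_id)}


def fuzzy_search(keyword):
    """Case-insensitive substring match over all attribute values."""
    kw = keyword.casefold()
    return sorted(
        eid for eid, attrs in entities.items()
        if any(kw in str(v).casefold() for v in attrs.values())
    )


candidates = fuzzy_search("chen")  # investigator picks one entry from this list
hit = precise_search("P-001")      # then its related entities are displayed
```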
[0086] In some embodiments, after the differentiated preprocessed data is fused to obtain fused data, the method further includes: storing the third structured data in the first database, which is a relational database; storing the third semi-structured data and the third unstructured data in the second database, which is a distributed storage database; and constructing the comprehensive feature set based on the first and second databases.
[0087] By storing the preprocessed and fused third structured data in the first database, establishing primary and foreign key relationships, and setting data access control, the integrity, consistency, and security of the third structured data can be ensured. At the same time, the database index design can be optimized to improve the query efficiency of structured data and meet the need for rapid retrieval of structured data during case analysis.
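The role of primary keys, foreign keys, and indexes can be illustrated with a small relational schema. SQLite is used here only as a convenient stand-in, since the text does not name a specific relational database; the table and column names are hypothetical, and access control is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE person (
    person_id TEXT PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE asset (
    asset_id  TEXT PRIMARY KEY,
    owner_id  TEXT NOT NULL REFERENCES person(person_id),
    kind      TEXT NOT NULL
);
-- Index on the foreign key to speed up lookups of assets by owner.
CREATE INDEX idx_asset_owner ON asset(owner_id);
""")

conn.execute("INSERT INTO person VALUES (?, ?)", ("p1", "Alice"))
conn.execute("INSERT INTO asset VALUES (?, ?, ?)", ("card_1", "p1", "bank_card"))

# An asset that points at a non-existent person violates referential integrity.
try:
    conn.execute("INSERT INTO asset VALUES (?, ?, ?)", ("card_2", "p404", "bank_card"))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

owned = conn.execute(
    "SELECT asset_id FROM asset WHERE owner_id = ?", ("p1",)
).fetchall()
```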
[0088] By storing the third semi-structured data and the third unstructured data in the second database, a distributed storage approach combined with data compression technology is adopted to reduce storage usage while supporting fast retrieval and access, ensuring efficient storage and convenient use of this type of data.
[0089] In some embodiments, constructing the anomaly detection algorithm based on the case heterogeneous graph, the multiple entity types, and the relationships includes: storing the multiple entity types and their relationships in a third database, which is a Neo4j graph database; and building the anomaly detection algorithm based on the case heterogeneous graph and the third database.
[0090] Defined core entities of various types and their relationships are stored in the Neo4j graph database. The Neo4j graph database uses a native graph storage structure, storing heterogeneous graph data in the form of nodes (representing different types of entities) and edges (representing different types of relationships). It supports efficient graph traversal and deep relational queries, enabling rapid response to query requirements for multiple types of entities and relationships in heterogeneous graphs, significantly improving the efficiency of graph visualization and case analysis. Furthermore, the storage structure is optimized for the characteristics of heterogeneous graphs to ensure the integrity and relevance of the heterogeneous graph data.
[0091] It adopts a hybrid storage architecture that combines relational databases, graph databases, and distributed storage to adapt to the storage needs of multi-source fused data and heterogeneous graphs, taking into account data security, integrity, and query efficiency, and achieving efficient storage and fast retrieval of various types of data and heterogeneous graphs.
[0092] This invention eliminates format differences in case data from different sources and of different types by performing differentiated preprocessing and fusion processing on multi-source data, achieving deep fusion of structured, semi-structured, and unstructured multi-source data. Based on a comprehensive feature set, it constructs a heterogeneous case graph containing multiple types of entities and relationships, improving the accuracy and completeness of graph construction, comprehensively and realistically presenting the complex relationships between various entities in a case, uncovering potential hidden relationships, and overcoming the limitations of existing graph technologies in simple associations of single-type entities.
[0093] Furthermore, this invention employs an anomaly voting mechanism, integrating the detection results of multiple algorithms to double-mark high-confidence anomaly nodes, reducing the false positive rate and improving the accuracy of anomaly clue identification. Anomalous nodes receive tiered warnings based on their anomaly confidence levels, helping investigators handle high-priority anomaly clues first. By constructing a case heterogeneous graph that reflects multiple types of entities and their relationships, the graph comprehensively presents the complex relationships between various entities, providing reliable graph support for case analysis and improving the efficiency of manual judgment.
[0094] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.
Claims
1. An intelligent case analysis method based on multi-dimensional data fusion and knowledge graph, characterized in that it comprises: acquiring multi-source data, wherein the multi-source data includes first structured data, first semi-structured data, and first unstructured data; performing differentiated preprocessing on the first structured data, the first semi-structured data, and the first unstructured data to obtain differentiated preprocessed data; fusing the differentiated preprocessed data to obtain fused data, and constructing a comprehensive feature set based on the fused data; defining multiple entity types, and constructing association relationships based on the association characteristics of the entity types and case data; constructing a case heterogeneous graph based on the comprehensive feature set, the multiple entity types, and the association relationships; constructing an anomaly detection algorithm based on the case heterogeneous graph, the multiple entity types, and the association relationships; and constructing a visualization graph to display the detection results output by the anomaly detection algorithm.
2. The intelligent case analysis method based on multi-dimensional data fusion and knowledge graph as described in claim 1, characterized in that the step of performing differentiated preprocessing on the first structured data, the first semi-structured data, and the first unstructured data to obtain differentiated preprocessed data includes: cleaning and standardizing the first structured data to obtain second structured data; processing the first semi-structured data by tag parsing and field extraction to obtain second semi-structured data; processing the first unstructured data by OCR recognition, speech-to-text conversion, and video frame extraction to obtain second unstructured data; and constructing differentiated preprocessed data comprising the second structured data, the second semi-structured data, and the second unstructured data.
3. The intelligent case analysis method based on multi-dimensional data fusion and knowledge graph as described in claim 2, characterized in that, The process of fusing the differentiated preprocessed data to obtain fused processed data includes: The second structured data, the second semi-structured data, and the second unstructured data are fused in terms of format, element, and feature to obtain fused data including the third structured data, the third semi-structured data, and the third unstructured data.
4. The intelligent case analysis method based on multi-dimensional data fusion and knowledge graph as described in claim 3, characterized in that, after the differentiated preprocessed data is fused to obtain fused data, the method further includes: storing the third structured data in a first database, which is a relational database; storing the third semi-structured data and the third unstructured data in a second database, which is a distributed storage database; and constructing the comprehensive feature set based on the first database and the second database.
5. The intelligent case analysis method based on multi-dimensional data fusion and knowledge graph as described in claim 1 or 4, characterized in that constructing the anomaly detection algorithm based on the case heterogeneous graph, the multiple entity types, and the association relationships includes: storing the multiple entity types and their relationships in a third database, which is a Neo4j graph database; and constructing the anomaly detection algorithm based on the case heterogeneous graph and the third database.
6. The intelligent case analysis method based on multi-dimensional data fusion and knowledge graph as described in claim 3, characterized in that, The process of format fusion, element fusion, and feature fusion of the second structured data, the second semi-structured data, and the second unstructured data includes: The second structured data, the second semi-structured data, and the second unstructured data are mapped to a preset standard data model, wherein the standard data model performs data format standardization processing through a data format mapping algorithm; By constructing multi-source data element association rules, cosine similarity matching, and semantic association matching, the same second structured data, second semi-structured data, and second unstructured data are deduplicated and fused. Through a cross-data source association reasoning mechanism, the missing data in the second structured data, second semi-structured data, and second unstructured data are supplemented. Numerical features from the second structured data, label features from the second semi-structured data, and text features from the second unstructured data are extracted using a feature extraction algorithm. The numerical features, label features, and text features are then normalized, and feature fusion is performed based on a feature association analysis algorithm.
7. The intelligent case analysis method based on multi-dimensional data fusion and knowledge graph as described in claim 1, characterized in that constructing the case heterogeneous graph based on the comprehensive feature set, the multiple entity types, and the association relationships includes: constructing basic nodes based on the multiple entity types, constructing edges between the corresponding basic nodes based on the association relationships, and updating and optimizing based on the dynamic data in the comprehensive feature set to construct the case heterogeneous graph.
8. The intelligent case analysis method based on multi-dimensional data fusion and knowledge graph as described in claim 1, characterized in that, The various entity types include: person entities, behavior entities, and location entities; the relationships include: relationships between people, relationships between people and behavior, and relationships between people and location.
9. The intelligent case analysis method based on multi-dimensional data fusion and knowledge graph as described in claim 1, characterized in that the anomaly detection algorithm includes the k-nearest neighbor algorithm and the local outlier factor (LOF) algorithm, and the anomaly detection algorithm outputs the detection result based on the judgment results of the k-nearest neighbor algorithm and the local outlier factor algorithm.
10. The intelligent case analysis method based on multi-dimensional data fusion and knowledge graph as described in claim 1, characterized in that, The first structured data includes data in tabular form; the first semi-structured data includes forms, logs, or XML format files; the first unstructured data includes documents, images, audio, or video.