Aggregation method and system for multi-source heterogeneous intelligence data

By calculating structured entropy and source domain fingerprints to filter data, and using knowledge graphs and cross-modal hash networks for entity recognition and alignment, the problem of deep integration of multi-source heterogeneous intelligence data is solved, achieving efficient and accurate data processing and analysis.

CN122309744A · Pending · Publication Date: 2026-06-30 · CHINA TELECOM CONSTR 4TH ENG


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: CHINA TELECOM CONSTR 4TH ENG
Filing Date: 2026-03-24
Publication Date: 2026-06-30

Application Information

Patent Timeline

24 Mar 2026: Application
30 Jun 2026: Publication (CN122309744A)

IPC: G06F16/353; G06N5/022; G06F40/279; G06F18/213; G06N5/046; G06F40/205; G06F40/30; G06F16/3331; G06F16/334; G06F16/31

AI Tagging

Technology Topics

Semantic vector · Theoretical computer science

AI Technical Summary

Technical Problem

Traditional methods for processing scientific and technological intelligence data cannot adapt to the rapid access and analysis needs of new data sources, resulting in information loss and inability to conduct in-depth analysis. Furthermore, existing methods cannot effectively integrate multiple types of information across modalities, and the generated analysis results are one-sided and inaccurate.

Method used

By acquiring multi-source heterogeneous intelligence data, calculating structured entropy and source domain fingerprints for preliminary screening, and using knowledge graphs and cross-modal hash networks for entity identification and alignment, structured scientific and technological intelligence data is generated.

Benefits of technology

It achieves lossless storage and efficient processing of multi-source heterogeneous data, improves the accuracy of entity referencing resolution and cross-source fact alignment, and generates data that can accurately match user intent.

✦ Generated by Eureka AI based on patent content.

Smart Images

Figure CN122309744A_ABST

Patent Text Reader

Abstract

This invention provides a method and system for aggregating multi-source heterogeneous intelligence data. The aggregation method includes: acquiring raw multi-source heterogeneous scientific and technological intelligence data; extracting structured entropy representing structural complexity and source domain fingerprint representing source reliability; storing all three in a data lake for unstructured storage; receiving a query request; parsing the query entity and relational pattern; initially screening candidate data units by combining structured entropy and source domain fingerprint; performing entity identification on the candidate data and linking it to a pre-set scientific and technological knowledge graph; constructing a confidence model through graph topology and semantic information to complete entity referencing resolution; extracting semantic vectors from multimodal data instances of the same knowledge graph entity using the multimodal model; mapping them to a shared Hamming space via a cross-modal hash network; aligning cross-source facts based on Hamming distance; retrieving isomorphic subgraphs in the knowledge graph as aggregation patterns based on query relational patterns; and integrating the aligned data to generate structured scientific and technological intelligence.


Description

Technical Field

[0001] This application belongs to the field of data aggregation, and in particular relates to a method and system for aggregating multi-source heterogeneous intelligence data.

Background Technology

[0002] Scientific and technological intelligence data comes from a wide range of sources and takes diverse forms, encompassing patent documents, academic papers, technical reports, market analyses, news information, expert databases, and more, exhibiting typical characteristics of being multi-source, heterogeneous, and massive. Traditional methods for processing scientific and technological intelligence data largely rely on data warehouse technology, employing a "schema-first" ETL process. This method requires pre-defining a strict and unified data schema before data is entered into the database, cleaning and converting data from different sources into a standardized structured format. Predefined schemas lack flexibility and cannot adapt to the rapid access of new data sources and changing analytical needs; each schema adjustment requires high development and maintenance costs. The forced structured conversion process may lead to the loss of rich contextual information contained in the original data, resulting in information loss. Furthermore, data systems from different sources are often physically and logically isolated, hindering comprehensive information correlation and in-depth analysis.

[0003] Accurately identifying and linking entities from heterogeneous data to a unified knowledge base is fundamental to data aggregation. However, the ambiguity and diversity of entity names make entity linking and referential resolution a core technical challenge. Traditional methods based on string matching or simple rules have limited accuracy when dealing with complex contexts. Furthermore, the same scientific fact or intelligence information may be distributed in different modalities across different documents or different parts of the same document. Bridging the modal divide, determining whether different data instances point to the same objective fact, and performing cross-source fact alignment are key challenges in achieving deep information fusion. Existing methods often focus on information extraction and alignment within a single modality, lacking a unified framework capable of integrating information from multiple modalities such as text, images, and tables. This results in intelligence analysis results that are often fragmented and incomplete, failing to provide complete, accurate, and in-depth decision support.

Summary of the Invention

[0004] This disclosure provides a method and system for aggregating multi-source heterogeneous intelligence data.

[0005] According to one aspect of this disclosure, a method for aggregating multi-source heterogeneous intelligence data is provided, comprising: acquiring raw multi-source heterogeneous scientific and technological intelligence data, extracting and calculating the structured entropy representing the structural complexity of each data unit and the source domain fingerprint representing the reliability of the source of each data unit, and storing the raw data, structured entropy and source domain fingerprint together in a data lake for unstructured storage; receiving a query request, parsing the request to obtain query entities and relational patterns, and, based on the query entities, relational patterns, structured entropy and source domain fingerprint, initially screening candidate data units from the data lake; performing entity recognition on the candidate data units and linking the recognized entities to a preset science and technology knowledge graph, wherein, when there are multiple candidate linked entities, a confidence calculation model is constructed using the topological structure and semantic information of the knowledge graph as contextual constraints, the confidence of linking each recognized entity to each candidate entity in the knowledge graph is calculated, and the link result with the highest confidence is selected to complete the entity referencing resolution; for multiple data instances that belong to different modalities and refer to the same knowledge graph entity, extracting the context semantic vector of each data instance using a pre-trained unified multimodal model, mapping the semantic vectors to a shared Hamming space through a cross-modal hashing network, determining that data instances refer to the same fact when the Hamming distance between their hash codes is less than a preset threshold, and performing cross-source fact alignment; and, based on the relational pattern of the query request, retrieving from the science and technology knowledge graph a subgraph structure that is isomorphic to the pattern, using the subgraph structure as an instant aggregation pattern, and organizing and integrating the aligned data instances according to this pattern to generate structured science and technology intelligence data.

[0006] Optionally, the step of acquiring multi-source heterogeneous scientific and technological intelligence raw data, extracting and calculating the structured entropy representing the structural complexity of each data unit and the source domain fingerprint representing the reliability of the source of each data unit includes: For semi-structured data, parse the DOM tree or JSON tree structure of the data, count the total number of nodes and the maximum level depth, and calculate the normalized structured entropy by multiplying the ratio of the total number of nodes to the preset maximum number of nodes by the ratio of the maximum level depth to the preset maximum level depth. Extract the main domain name of the data source and generate a 256-bit hash value using the SHA-256 algorithm, which serves as the source domain fingerprint.

[0007] Optionally, the preliminary screening of candidate data units from the data lake based on the query entity, relation schema, and the structured entropy and source domain fingerprint includes: Using an inverted index, the text content of the query entity contained in the data lake is searched in full-text search to obtain the initial set of data units; Based on the complexity of the query relationship pattern, a target range for structured entropy is set, and data units with entropy values within the range are selected. Based on the source domain fingerprint, a pre-set list of authoritative information sources is loaded, and only data units whose source is an information source within the list are retained to obtain candidate data units.

[0008] Optionally, the construction of the confidence calculation model, which calculates the confidence of each identified entity linked to a candidate entity in the knowledge graph, includes: A set of feature functions is constructed for the confidence calculation model. The set of feature functions includes: the string similarity between the identified entity string and the candidate entity name in the knowledge graph, the centrality index of the candidate entity in the knowledge graph, the part-of-speech tag sequence of the identified entity in the original text, and the cosine similarity between the entity context word vector and the candidate entity representation text vector. The feature function is input into the trained gradient boosting decision tree model, and the model outputs a normalized probability value, which is the confidence level.
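As a rough illustration of the feature set described above, the following Python sketch computes three of the four features for one candidate entity. It is a minimal sketch under stated assumptions: `difflib.SequenceMatcher` stands in for the unspecified string similarity, degree centrality stands in for the unspecified centrality index, the part-of-speech feature is omitted because it needs a tagger, and the gradient boosting model itself is not shown. All names and sample data are hypothetical.

```python
import math
from difflib import SequenceMatcher


def string_similarity(mention, name):
    """Similarity in [0, 1] between the recognized string and a candidate name."""
    return SequenceMatcher(None, mention.lower(), name.lower()).ratio()


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def degree_centrality(graph, node):
    """Degree centrality of a node in an adjacency-dict graph (assumed index)."""
    n = len(graph)
    return len(graph.get(node, ())) / (n - 1) if n > 1 else 0.0


def feature_vector(mention, mention_vec, candidate, graph):
    """Features that would be passed to a trained gradient boosting model."""
    return [
        string_similarity(mention, candidate["name"]),
        degree_centrality(graph, candidate["id"]),
        cosine(mention_vec, candidate["vec"]),
    ]


# Hypothetical toy knowledge graph and candidate entity.
graph = {"Q1": {"Q2", "Q3"}, "Q2": {"Q1"}, "Q3": {"Q1"}}
candidate = {"id": "Q1", "name": "graphene", "vec": [1.0, 0.0]}
features = feature_vector("Graphene", [1.0, 0.0], candidate, graph)
print(features)
```

In a full pipeline, one such vector would be built per candidate entity and the candidate with the highest model output would be kept as the link result.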

[0009] Optionally, for multiple data instances belonging to different modalities that refer to the same knowledge graph entity, extracting the contextual semantic vector of the data instances using a pre-trained unified multimodal model includes: For data instances of different modalities such as text, image, and table, the data instances are input into a unified multimodal pre-trained model, and the 768-dimensional output vector of the fused multimodal information corresponding to the start symbol [CLS] in the last encoder layer of the model is extracted as the context semantic vector of the data instance.

[0010] Optionally, the step of mapping the semantic vector to a shared Hamming space via a cross-modal hashing network, and determining that the data instances refer to the same fact based on the Hamming distance between hash codes being less than a preset threshold, includes: Construct a twin hash network with two weight-sharing branches, each branch consisting of three fully connected layers and one output layer; The 768-dimensional semantic vector is input into the network, and a 128-dimensional real number vector is generated through forward propagation. The vector is then processed by the sign function to obtain a 128-bit binary hash code. Calculate the number of different bits between the hash codes of two data instances to obtain the Hamming distance; When the Hamming distance is less than or equal to 10, the two data instances are considered to refer to the same fact.
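The binarization and comparison steps described above can be sketched in plain Python. This is only an illustration: the network that produces the 128-dimensional real vector is not shown, the vectors below are made-up inputs, and mapping a component of exactly zero to bit 1 is an assumption, since the text does not say how the sign function treats zero.

```python
THRESHOLD = 10  # Hamming distance threshold taken from the text


def to_hash_code(vector):
    """Apply the sign function and pack the bits into one integer.

    A non-negative component becomes bit 1, a negative one bit 0
    (the handling of exactly 0.0 is an assumption).
    """
    code = 0
    for x in vector:
        code = (code << 1) | (1 if x >= 0 else 0)
    return code


def hamming_distance(a, b):
    """Number of bit positions in which two hash codes differ."""
    return bin(a ^ b).count("1")


def same_fact(vec_a, vec_b, threshold=THRESHOLD):
    """True when the two instances are considered to refer to the same fact."""
    return hamming_distance(to_hash_code(vec_a), to_hash_code(vec_b)) <= threshold


# Made-up 128-dimensional network outputs for three data instances.
text_vec = [1.0] * 128
table_vec = [-1.0] * 8 + [1.0] * 120    # differs from text_vec in 8 bits
image_vec = [-1.0] * 11 + [1.0] * 117   # differs from text_vec in 11 bits

print(same_fact(text_vec, table_vec), same_fact(text_vec, image_vec))
```

Packing the code into an integer lets the distance be computed with one XOR and a bit count, which is the usual reason for comparing in Hamming space.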

[0011] Optionally, the step of retrieving subgraph structures that are isomorphic to the relational pattern of the query request in the science and technology knowledge graph, and using the subgraph structures as instantaneous aggregation patterns, includes: The relational schema of the query request is transformed into a graph query schema consisting of entity type nodes and relation type edges; The VF2++ algorithm is used to match subgraphs that are isomorphic to the graph query pattern in the science and technology knowledge graph. The structure of the first successfully matched subgraph instance that contains the specific entity in the query request is used as the immediate aggregation mode for this aggregation.
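To make the matching step concrete, the sketch below finds the first subgraph instance that fits a typed graph query pattern and contains a given entity. It is not VF2++: a naive backtracking search is swapped in for readability, and it checks only that every pattern edge exists in the knowledge graph (a monomorphism), not induced isomorphism. The graph, type names, and relation names are hypothetical.

```python
def match_pattern(pattern_nodes, pattern_edges, kg_types, kg_edges, anchor=None):
    """Yield mappings {pattern variable: knowledge graph node}.

    pattern_nodes: {variable: entity type}
    pattern_edges: [(source variable, relation type, target variable)]
    kg_types:      {node: entity type}
    kg_edges:      set of (source node, relation type, target node)
    anchor:        optional {variable: node} fixing the queried entity
    """
    variables = list(pattern_nodes)
    anchor = anchor or {}

    def consistent(mapping):
        # Every pattern edge with both endpoints mapped must exist in the graph.
        return all(
            (mapping[s], rel, mapping[d]) in kg_edges
            for s, rel, d in pattern_edges
            if s in mapping and d in mapping
        )

    def extend(index, mapping):
        if index == len(variables):
            yield dict(mapping)
            return
        var = variables[index]
        candidates = [anchor[var]] if var in anchor else sorted(kg_types)
        for node in candidates:
            if kg_types.get(node) != pattern_nodes[var] or node in mapping.values():
                continue
            mapping[var] = node
            if consistent(mapping):
                yield from extend(index + 1, mapping)
            del mapping[var]

    yield from extend(0, {})


# Hypothetical knowledge graph fragment.
kg_types = {"InstA": "Institution", "InstZ": "Institution",
            "ProjB": "Project", "PatC": "Patent"}
kg_edges = {("InstA", "funds", "ProjB"), ("ProjB", "outputs", "PatC")}

pattern_nodes = {"i": "Institution", "p": "Project", "t": "Patent"}
pattern_edges = [("i", "funds", "p"), ("p", "outputs", "t")]

first = next(match_pattern(pattern_nodes, pattern_edges, kg_types, kg_edges,
                           anchor={"i": "InstA"}), None)
print(first)
```

Taking only the first match mirrors the text, which uses the first successfully matched subgraph instance containing the queried entity as the aggregation pattern.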

[0012] Optionally, the process of organizing and integrating aligned data instances according to this pattern to generate structured scientific and technological intelligence data includes: Create a JSON object template based on the entities and relationships defined in the instant aggregation pattern; Information from multiple data instances that have been aligned and refer to the same fact is populated into the corresponding entity attribute fields in the JSON template according to the knowledge graph entities corresponding to the data instances. For fields with conflicting content, the source authority score is queried based on the source domain fingerprint of the data instance. Combined with the information release time, the information with the highest authority and the most recent time is selected as the final value to complete the data integration and output a structured JSON format science and technology intelligence data.
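The conflict-resolution rule above can be illustrated with a short sketch. It assumes that authority is compared first and that publication time only breaks ties, which is one reading of "highest authority and most recent time"; the field names, fingerprints, and scores are made up for the example.

```python
import json
from datetime import date


def resolve(candidates, authority):
    """Pick the value with the highest authority, then the most recent date."""
    best = max(
        candidates,
        key=lambda c: (authority.get(c["fingerprint"], 0.0), c["published"]),
    )
    return best["value"]


def fill_template(template_fields, aligned, authority):
    """Populate each template field from its aligned data instances."""
    return {
        field: resolve(aligned[field], authority)
        for field in template_fields
        if aligned.get(field)
    }


# Hypothetical authority scores keyed by source domain fingerprint.
authority = {"fpA": 0.6, "fpB": 0.9}
aligned = {
    "yield": [
        {"value": "85%", "fingerprint": "fpA", "published": date(2023, 1, 1)},
        {"value": "90%", "fingerprint": "fpB", "published": date(2024, 5, 1)},
        {"value": "88%", "fingerprint": "fpB", "published": date(2022, 5, 1)},
    ],
    "method": [
        {"value": "CVD", "fingerprint": "fpA", "published": date(2021, 3, 1)},
    ],
}

record = fill_template(["yield", "method", "catalyst"], aligned, authority)
print(json.dumps(record, indent=2))
```

Fields with no aligned instance (here the hypothetical `catalyst`) are simply left out of the output rather than filled with a guess.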

[0013] According to another aspect of this disclosure, a system for aggregating multi-source heterogeneous intelligence data is provided, comprising the following modules: an acquisition module, used to acquire raw multi-source heterogeneous scientific and technological intelligence data, extract and calculate the structured entropy representing the structural complexity of each data unit and the source domain fingerprint representing the reliability of the source of each data unit, and store the raw data, structured entropy and source domain fingerprint together in the data lake for unstructured storage; to receive a query request and parse the request to obtain query entities and relational patterns; and, based on the query entities, relational patterns, structured entropy and source domain fingerprint, to initially screen candidate data units from the data lake; a calculation module, used to perform entity recognition on the candidate data units and link the recognized entities to a preset science and technology knowledge graph, wherein, when there are multiple candidate linked entities, a confidence calculation model is constructed using the topological structure and semantic information of the knowledge graph as contextual constraints, the confidence of linking each recognized entity to each candidate entity in the knowledge graph is calculated, and the link result with the highest confidence is selected to complete the entity referencing resolution; a determination module, used to extract, with a pre-trained unified multimodal model, the context semantic vectors of multiple data instances that belong to different modalities and refer to the same knowledge graph entity, map the semantic vectors to a shared Hamming space through a cross-modal hash network, determine that data instances refer to the same fact when the Hamming distance between their hash codes is less than a preset threshold, and perform cross-source fact alignment; and a generation module, used to retrieve from the science and technology knowledge graph subgraph structures that are isomorphic to the relational pattern of the query request, use the subgraph structure as an instant aggregation pattern, and organize and integrate the aligned data instances according to this pattern to generate structured science and technology intelligence data.

[0014] Furthermore, the step of acquiring multi-source heterogeneous scientific and technological intelligence raw data, extracting and calculating the structured entropy representing the structural complexity of each data unit and the source domain fingerprint representing the reliability of the source of each data unit includes: For semi-structured data, parse the DOM tree or JSON tree structure of the data, count the total number of nodes and the maximum level depth, and calculate the normalized structured entropy by multiplying the ratio of the total number of nodes to the preset maximum number of nodes by the ratio of the maximum level depth to the preset maximum level depth. Extract the main domain name of the data source and generate a 256-bit hash value using the SHA-256 algorithm, which serves as the source domain fingerprint.

[0015] Furthermore, the preliminary screening of candidate data units from the data lake based on the query entity, relation schema, structured entropy, and source domain fingerprint includes: Using an inverted index, the text content of the query entity contained in the data lake is searched in full-text search to obtain the initial set of data units; Based on the complexity of the query relationship pattern, a target range for structured entropy is set, and data units with entropy values within the range are selected. Based on the source domain fingerprint, a pre-set list of authoritative information sources is loaded, and only data units whose source is an information source within the list are retained to obtain candidate data units.

[0016] Furthermore, the construction of the confidence calculation model, which calculates the confidence of each identified entity linked to a candidate entity in the knowledge graph, includes: A set of feature functions is constructed for the confidence calculation model. The set of feature functions includes: the string similarity between the identified entity string and the candidate entity name in the knowledge graph, the centrality index of the candidate entity in the knowledge graph, the part-of-speech tag sequence of the identified entity in the original text, and the cosine similarity between the entity context word vector and the candidate entity representation text vector. The feature function is input into the trained gradient boosting decision tree model, and the model outputs a normalized probability value, which is the confidence level.

[0017] Furthermore, for multiple data instances belonging to different modalities that refer to the same knowledge graph entity, the extraction of the contextual semantic vector of the data instances using a pre-trained unified multimodal model includes: For data instances of different modalities such as text, image, and table, the data instances are input into a unified multimodal pre-trained model, and the 768-dimensional output vector of the fused multimodal information corresponding to the start symbol [CLS] in the last encoder layer of the model is extracted as the context semantic vector of the data instance.

[0018] Furthermore, the step of mapping the semantic vector to a shared Hamming space via a cross-modal hashing network, and determining that the data instances refer to the same fact based on the Hamming distance between hash codes being less than a preset threshold, includes: Construct a twin hash network with two weight-sharing branches, each branch consisting of three fully connected layers and one output layer; The 768-dimensional semantic vector is input into the network, and a 128-dimensional real number vector is generated through forward propagation. The vector is then processed by the sign function to obtain a 128-bit binary hash code. Calculate the number of different bits between the hash codes of two data instances to obtain the Hamming distance; When the Hamming distance is less than or equal to 10, the two data instances are considered to refer to the same fact.

[0019] Further, the step of retrieving subgraph structures that are isomorphic to the relational pattern of the query request in the science and technology knowledge graph, and using the subgraph structures as instantaneous aggregation patterns, includes: The relational schema of the query request is transformed into a graph query schema consisting of entity type nodes and relation type edges; The VF2++ algorithm is used to match subgraphs that are isomorphic to the graph query pattern in the science and technology knowledge graph. The structure of the first successfully matched subgraph instance that contains the specific entity in the query request is used as the immediate aggregation mode for this aggregation.

[0020] Furthermore, the process of organizing and integrating aligned data instances according to this pattern to generate structured scientific and technological intelligence data includes: Create a JSON object template based on the entities and relationships defined in the instant aggregation pattern; Information from multiple data instances that have been aligned and refer to the same fact is populated into the corresponding entity attribute fields in the JSON template according to the knowledge graph entities corresponding to the data instances. For fields with conflicting content, the source authority score is queried based on the source domain fingerprint of the data instance. Combined with the information release time, the information with the highest authority and the most recent time is selected as the final value to complete the data integration and output a structured JSON format science and technology intelligence data.

[0021] This invention employs a deferred schema binding strategy: raw data is stored in a data lake together with its structured entropy and source domain fingerprint, achieving lossless, schema-free storage of multi-source heterogeneous data. During the query phase, structured entropy and source domain fingerprints are used for initial data screening, improving the quality and relevance of candidate data units and increasing processing efficiency. By using the topological and semantic information of the knowledge graph as contextual constraints, a confidence model is constructed, improving the accuracy of entity referencing resolution. Furthermore, a cross-modal hashing network maps multimodal data to a shared Hamming space for comparison, achieving accurate alignment of facts from different sources and modalities. Aligned data is organized using aggregation patterns generated in real time from user queries, so that the generated structured intelligence matches user intent and the problem that heterogeneous data cannot be deeply integrated and fully utilized is solved.

Attached Figure Description

[0022] Figure 1 is a flowchart of a method for aggregating multi-source heterogeneous intelligence data provided in this application; Figure 2 is a schematic diagram of the multi-modal sensing network hierarchical architecture provided in this application.

Detailed Implementation

[0023] The features and exemplary embodiments of various aspects of this application will be described in detail below. To make the objectives, technical solutions, and advantages of this application clearer, the application will be further described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are only intended to explain this application and not to limit it. For those skilled in the art, this application can be implemented without some of these specific details. The following description of the embodiments is merely to provide a better understanding of this application by illustrating examples.

[0024] It should be noted that, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes said element.

[0025] As shown in Figure 1 of this application, a method for aggregating multi-source heterogeneous intelligence data includes the following steps: A) Obtain raw data of multi-source heterogeneous scientific and technological intelligence, extract and calculate the structured entropy representing the structural complexity of each data unit and the source domain fingerprint representing the reliability of the source of each data unit, and store the raw data, structured entropy and source domain fingerprint together in the data lake for unstructured storage. Web crawling frameworks such as Scrapy or WebMagic are used for targeted crawling of scientific literature websites, patent databases, and industry reports. Structured data is obtained through RESTful API calls, and data is extracted from relational databases using pyodbc or JDBC connectors. Those skilled in the art should understand that such crawling must be conducted in accordance with legal and regulatory requirements. Alternatively, information can be obtained from the target website by calling its API; see Figure 2. Technological intelligence can also include patents, industry reports, etc. For each acquired data unit, such as an HTML document, PDF file, or data table, structural parsing is performed. For document-type data, the lxml library is used to parse the DOM tree of the data, and the frequency of occurrence of tags at each level, such as h1, p, and table, is counted. The structured entropy is then calculated from these frequencies using the Shannon entropy formula $H = -\sum_i p_i \log p_i$, where $p_i$ is the relative frequency of the $i$-th tag. For PDFs and images, information is first obtained through image recognition and other methods, followed by structure parsing. For source information, metadata such as source URL, publishing institution, and publication time are extracted. A reliability score is calculated using preset weighting rules, such as journal impact factor and website authority. A unique source domain fingerprint is generated from the source metadata string using the SHA-256 hash algorithm. The original data file, along with a JSON metadata file containing structured entropy, source domain fingerprint, and reliability score, is then uploaded to a data lake built on Hadoop HDFS or Amazon S3.
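The tag-frequency entropy described in this step can be sketched as follows. This is a simplified illustration rather than the method's actual implementation: it uses the standard library `html.parser` in place of lxml, pools tags from all levels into a single frequency distribution, and assumes a base-2 logarithm, which the text does not specify.

```python
import math
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how often each opening tag occurs in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def tag_entropy(html):
    """Shannon entropy (in bits) of the tag-frequency distribution."""
    parser = TagCounter()
    parser.feed(html)
    total = sum(parser.counts.values())
    if total == 0:
        return 0.0  # no tags at all, so no structural variety to measure
    return -sum(
        (c / total) * math.log2(c / total) for c in parser.counts.values()
    )


# Tiny made-up page: tag counts are html=1, body=1, p=2.
sample = "<html><body><p>a</p><p>b</p></body></html>"
print(tag_entropy(sample))
```

A page dominated by one tag type scores near zero, while a page mixing many tag types in similar proportions scores higher.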

[0026] In some embodiments, the step of acquiring multi-source heterogeneous scientific and technological intelligence raw data, extracting and calculating the structured entropy representing the structural complexity of each data unit and the source domain fingerprint representing the reliability of the source of each data unit includes: For semi-structured data, parse the DOM tree or JSON tree structure of the data, count the total number of nodes and the maximum level depth, and calculate the normalized structured entropy by multiplying the ratio of the total number of nodes to the preset maximum number of nodes by the ratio of the maximum level depth to the preset maximum level depth. Extract the main domain name of the data source and generate a 256-bit hash value using the SHA-256 algorithm, which serves as the source domain fingerprint.

[0027] The calculation process for structured entropy adapts to different data types. For semi-structured data units, such as an HTML page, the DOM tree of the data is parsed using the lxml library, and the total number of nodes $N$ and the maximum level depth $D$ are counted. To achieve normalization, a preset maximum total number of nodes $N_{max} = 5000$ and a preset maximum level depth $D_{max} = 30$ are set based on statistical analysis of web pages in the field of scientific and technological intelligence. The structured entropy $H$ is calculated through weighted combination: $H = w_1 \cdot \frac{N}{N_{max}} + w_2 \cdot \frac{D}{D_{max}}$. The preferred values of the weights are $w_1 = 0.5$ and $w_2 = 0.5$, to balance the contribution of the number of nodes and the depth of structure to complexity. For example, if $N = 1200$ and $D = 15$, then $H = 0.5 \times 0.24 + 0.5 \times 0.5 = 0.37$. For unstructured data such as plain text, which has a simple structure, the structured entropy can be set to a fixed low value, such as $H = 0.01$, to distinguish it from data with complex hierarchical structures.
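The calculation in this paragraph can be checked with a short sketch. It assumes the weighted combination is a weighted sum of the two normalized ratios, which is the reading that reproduces the stated result H=0.37 for N=1200 and D=15. Clipping the ratios at 1.0 and the JSON tree walker are additions for illustration only.

```python
import json

N_MAX = 5000   # preset maximum total number of nodes (from the text)
D_MAX = 30     # preset maximum level depth (from the text)
W_NODES = 0.5  # weight of the node-count term (from the text)
W_DEPTH = 0.5  # weight of the depth term (from the text)


def count_and_depth(node, depth=1):
    """Return (total nodes, maximum depth) of a parsed JSON tree."""
    if isinstance(node, dict):
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return 1, depth  # scalar leaf
    total, deepest = 1, depth
    for child in children:
        n, d = count_and_depth(child, depth + 1)
        total += n
        deepest = max(deepest, d)
    return total, deepest


def structured_entropy(n_nodes, max_depth):
    """Weighted sum of the normalized ratios (clipping is an assumption)."""
    node_ratio = min(n_nodes / N_MAX, 1.0)
    depth_ratio = min(max_depth / D_MAX, 1.0)
    return W_NODES * node_ratio + W_DEPTH * depth_ratio


# Worked example from the text: N=1200, D=15.
print(round(structured_entropy(1200, 15), 2))

# Small made-up JSON document to show the tree statistics.
doc = json.loads('{"title": "t", "authors": ["a", "b"]}')
print(count_and_depth(doc))
```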

[0028] The generation of source domain fingerprints provides a stable and unique identifier for each data source. This process extracts the source URL from the metadata of the data unit, such as https://www.xxx.com/articles/s41586-021-03947-8. The host domain is extracted using a URL parsing library and normalized; for example, www.xxx.com and staging.xxx.com are unified into the main domain xxx.com to ensure source consistency. This normalized domain string is then used as input to the cryptographic hash function SHA-256. This algorithm converts inputs of arbitrary length into a fixed-length 256-bit binary hash value, typically represented as a 64-character hexadecimal string. This hash value is the source domain fingerprint, which, compared to using a domain string, has a fixed length, is easy to index, and reliably identifies the data source.
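A minimal sketch of the fingerprint generation, using only the Python standard library. The normalization shown here simply keeps the last two labels of the host name, which is an assumption made for brevity; it would mishandle multi-part suffixes such as co.uk, where a public-suffix-aware library would be needed.

```python
import hashlib
from urllib.parse import urlparse


def main_domain(url):
    """Naive normalization: keep the last two labels of the host name."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def source_domain_fingerprint(url):
    """SHA-256 of the normalized domain as a 64-character hex string."""
    return hashlib.sha256(main_domain(url).encode("utf-8")).hexdigest()


fp_www = source_domain_fingerprint(
    "https://www.xxx.com/articles/s41586-021-03947-8"
)
fp_staging = source_domain_fingerprint("https://staging.xxx.com/other/page")
print(fp_www == fp_staging, len(fp_www))
```

Because both host names normalize to the same domain, the two URLs share one fingerprint, which is what allows sources to be grouped and looked up by a fixed-length key.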

[0029] B) Receive a query request, parse the request to obtain the query entity and relation schema; based on the query entity, relation schema, and the structured entropy and source domain fingerprint, initially screen candidate data units from the data lake. A web service interface is built using the Flask or Spring Boot framework to receive users' natural language query requests. Natural language understanding tools based on the BERT model, such as Google's BERT or pre-trained models integrated in spaCy, are used to perform named entity recognition and relation extraction on the query statements, parsing out the core query entities and expected relational patterns. For example, the query "methods for preparing graphene" is parsed into the entity "graphene" and the relation "preparation methods". A composite query instruction is constructed and submitted to the Elasticsearch search engine deployed on the data lake. This query instruction uses a match query to match query entity keywords, range queries to filter on structured entropy and source domain reliability scores, and a bool query to combine these conditions, thereby retrieving a list of highly relevant and reliable candidate data units from massive amounts of data.
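The composite query described above might look like the following request body, built here as a plain Python dictionary without calling any client library. The index field names `content`, `structured_entropy`, and `reliability_score`, as well as the threshold values, are hypothetical; the source does not give the index mapping.

```python
import json


def build_candidate_query(entity, entropy_range, min_reliability):
    """Build an Elasticsearch bool query combining match and range clauses."""
    low, high = entropy_range
    return {
        "query": {
            "bool": {
                # Full-text match on the query entity keyword.
                "must": [{"match": {"content": entity}}],
                # Non-scoring filters on the stored metadata.
                "filter": [
                    {"range": {"structured_entropy": {"gte": low, "lte": high}}},
                    {"range": {"reliability_score": {"gte": min_reliability}}},
                ],
            }
        }
    }


body = build_candidate_query("graphene", (0.05, 0.3), 0.8)
print(json.dumps(body, indent=2))
```

Placing the range clauses under `filter` rather than `must` keeps them from affecting relevance scoring, so ranking is driven by the text match alone.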

[0030] In some embodiments, the preliminary screening of candidate data units from the data lake based on the query entity, relation schema, and the structured entropy and source domain fingerprint includes: Using an inverted index, the text content of the query entity contained in the data lake is searched in full-text search to obtain the initial set of data units; Based on the complexity of the query relationship pattern, a target range for structured entropy is set, and data units with entropy values within the range are selected. Based on the source domain fingerprint, a pre-set list of authoritative information sources is loaded, and only data units whose source is an information source within the list are retained to obtain candidate data units.

[0031] Recall is performed using a full-text search engine based on inverted indexes, such as Elasticsearch. When a request for the query entity "carbon nanotube" is received, documents that exactly match "carbon nanotube" are retrieved. To improve recall, the query is also expanded using a thesaurus with synonyms and abbreviations, such as "CNT". All potentially relevant documents are retrieved from the massive dataset, forming a large initial set of data units.

[0032] A two-stage filtering process is applied to the initial data unit set. In the first stage, the complexity of the query relation schema is quantified and mapped to a target interval of structured entropy. The complexity C is computed from the number of entity nodes and relation edges in the relation schema; for example, the complexity... For a simple query such as Einstein - Publication -> Paper, C=1.6, and the entropy interval is set to [0.05, 0.3] to match paper abstract pages. For a complex query such as Institution A - Funding -> Project B - Output -> Patent C, C=3.0, and the entropy interval is set to [0.4, 0.8] to match structured project reports. In the second stage, a pre-defined list of authoritative sources is maintained for each field and stored in the form of source domain fingerprints. For example, the authoritative source list in the field of physics includes Nature, Science, etc., and the list for the patent field includes Google Patents, USPTO, etc. According to the field of the query content, the corresponding authoritative source list is selected and compared with the source domain fingerprint of each data unit. The list is typically held in a hash set so that each membership check takes O(1) time, and only the data units with matching sources are retained, yielding a candidate data unit set of controllable size and highly relevant content.
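
The two filtering stages can be sketched as follows. The cut-off of 2.0 between simple and complex queries, the domain strings, and the record layout are assumptions for illustration only; the text gives just the two example points C=1.6 and C=3.0 and does not state how intermediate values are mapped.

```python
import hashlib


def fingerprint(domain: str) -> str:
    return hashlib.sha256(domain.encode("utf-8")).hexdigest()


# Hypothetical authoritative list for one field, stored as fingerprints.
AUTHORITATIVE_PHYSICS = {fingerprint(d) for d in ("nature.com", "science.org")}


def entropy_interval(complexity: float) -> tuple:
    """Map schema complexity C to a target entropy interval.

    Assumption: a single cut-off at 2.0 separates the two example cases.
    """
    return (0.05, 0.3) if complexity < 2.0 else (0.4, 0.8)


def screen(units, complexity, authoritative):
    low, high = entropy_interval(complexity)
    # Stage 1: keep units whose structured entropy lies in the interval.
    stage1 = [u for u in units if low <= u["entropy"] <= high]
    # Stage 2: O(1) set membership test on the source fingerprint.
    return [u for u in stage1 if u["fingerprint"] in authoritative]


units = [
    {"id": 1, "entropy": 0.10, "fingerprint": fingerprint("nature.com")},
    {"id": 2, "entropy": 0.60, "fingerprint": fingerprint("nature.com")},
    {"id": 3, "entropy": 0.20, "fingerprint": fingerprint("example.org")},
]
print([u["id"] for u in screen(units, 1.6, AUTHORITATIVE_PHYSICS)])
```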

[0033] C) Entity recognition is performed on the candidate data units, and they are linked to a pre-set science and technology knowledge graph. In the case of multiple candidate linked entities, a confidence calculation model is constructed based on the topological structure and semantic information of the knowledge graph as context constraints. The confidence of each identified entity linked to the candidate entity in the knowledge graph is calculated, and the link result with the highest confidence is selected to complete the entity referencing resolution. Named Entity Recognition (NER) is performed on the text content of candidate data units using a pre-trained SciBERT model specific to the technology domain. For each identified entity mention, multiple candidate link entities are retrieved from a pre-built technology knowledge graph stored in Neo4j using fuzzy string matching algorithms such as Jaro-Winkler distance or index lookup. A confidence calculation model is constructed, taking candidate links as input. BERT is used to obtain the contextual semantic vector of the entity mention and the semantic vector of the candidate entity's text representation in the knowledge graph, and the cosine similarity between the two is calculated as a semantic similarity feature. Other clearly linked entities in the text are extracted, and the shortest path length between these entities and the current candidate entity is calculated in the knowledge graph, or the relevance is calculated using the Personalized PageRank algorithm, as a topological consistency feature. A pre-trained logistic regression classifier or a small feedforward neural network is used to calculate the confidence score for each candidate link based on the above features, and the candidate entity with the highest score is selected as the link result.

[0034] In some embodiments, constructing a confidence calculation model to calculate the confidence level of each identified entity linked to a candidate entity in the knowledge graph includes: A set of feature functions is constructed for the confidence calculation model. The set of feature functions includes: the string similarity between the identified entity string and the candidate entity name in the knowledge graph, the centrality index of the candidate entity in the knowledge graph, the part-of-speech tag sequence of the identified entity in the original text, and the cosine similarity between the entity context word vector and the candidate entity representation text vector. The feature function is input into the trained gradient boosting decision tree model, and the model outputs a normalized probability value, which is the confidence level.

[0035] The core of the confidence calculation model is to construct a comprehensive feature vector. The Sentence-BERT model used in this embodiment is a neural network model based on the Transformer architecture. It consists of multiple stacked encoder units, each containing a multi-head self-attention module and a feedforward neural network, and is fine-tuned from the pre-trained BERT model using a Siamese network structure. The output token vector sequence of the last encoder layer is aggregated into a single sentence vector through average pooling. The input of the model is a text, such as the sentence containing the entity reference or the representation text of the candidate entity; after tokenization, the text is converted into a sequence of integer token identifiers. The output is a 768-dimensional real vector, which is the semantic representation of the input text. For each identified entity and each candidate linked entity in the knowledge graph, five features are computed. First, the Jaro-Winkler algorithm is used to calculate the string similarity between the entity reference and the candidate entity name; for example, MIT and Massachusetts Institute of Technology have a similarity score of 0.91. Second, the PageRank score or in-degree of the candidate entity in the knowledge graph is queried, for example, a PageRank value of 0.0085. Third, the Sentence-BERT model is used to generate 768-dimensional vectors for the sentence containing the entity reference and for the candidate entity representation text, and their cosine similarity is calculated, for example, 0.88. Fourth, the entity type predicted by the entity recognition system, such as organization, is checked for consistency or compatibility with the type of the candidate entity in the knowledge graph, such as dbo:EducationalInstitution, generating a binary feature value of 1 or 0. Fifth, as a supplementary feature, the number of entities that appear in the context window of the entity reference (for example, the 10 words before and after it) and that are also neighboring nodes of the candidate entity in the knowledge graph is counted.
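
A minimal sketch of how such a feature vector might be assembled is given below. It assumes the string similarity, PageRank value, and sentence vectors have already been produced by the respective tools, and it treats each context entity as a single token for simplicity; all function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def neighbor_overlap(tokens, mention_index, kg_neighbors, window=10):
    """Count context-window tokens that name a neighbor of the candidate."""
    lo = max(0, mention_index - window)
    before = tokens[lo:mention_index]
    after = tokens[mention_index + 1:mention_index + 1 + window]
    return sum(1 for t in before + after if t in kg_neighbors)


def build_features(string_sim, pagerank, mention_vec, candidate_vec,
                   mention_type, compatible_types, overlap):
    """Return [string sim, centrality, semantic sim, type match, overlap]."""
    type_match = 1 if mention_type in compatible_types else 0
    semantic = cosine_similarity(mention_vec, candidate_vec)
    return [string_sim, pagerank, semantic, type_match, overlap]


tokens = "researchers at MIT worked with Harvard and Stanford on lasers".split()
overlap = neighbor_overlap(tokens, 2, {"Harvard", "Stanford", "Caltech"})
features = build_features(0.91, 0.0085, [1.0, 0.0], [1.0, 0.0],
                          "organization", {"organization"}, overlap)
print(features)
```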

[0036] The Gradient Boosting Decision Tree (GBDT) model is an ensemble learning model consisting of a series of sequentially constructed decision trees. Each tree learns to fit the residuals of the ensemble prediction of all preceding trees, and the model output is a weighted sum of the predictions from all trees. The input to this model is a feature vector, which is five-dimensional in this embodiment; the output is a normalized probability value between 0 and 1. The five feature values described above are concatenated into a feature vector, for example, [0.91, 0.0085, 0.88, 1, 3], and input into the pre-trained GBDT model, such as LightGBM. This model is trained on a training set containing tens of thousands of manually labeled triples of entity reference, candidate entity, and whether the link is correct, from which it automatically learns the optimal combination and weights of the features. The model performs forward computation on the input feature vector and outputs a probability value between 0 and 1, such as 0.97, which is the link confidence. This process is repeated for all candidate entities, and the link result with the highest confidence is selected as the decision for entity referencing resolution. If the highest confidence is lower than a preset threshold, such as 0.6, the entity is marked as unlinkable (NIL).
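
The final selection rule can be sketched independently of the GBDT model itself. In the snippet below the confidence values are assumed to have been produced already by the trained model; the candidate identifiers are made up for illustration.

```python
def resolve_link(scored_candidates, threshold=0.6):
    """Pick the candidate with the highest confidence, or None for NIL.

    scored_candidates: list of (candidate_id, confidence) pairs. The
    confidences would come from the trained GBDT model (not shown here).
    """
    if not scored_candidates:
        return None
    best_id, best_conf = max(scored_candidates, key=lambda pair: pair[1])
    # Below the threshold, the mention is treated as unlinkable (NIL).
    return best_id if best_conf >= threshold else None


print(resolve_link([("kg:MIT", 0.97), ("kg:MITRE", 0.41)]))
print(resolve_link([("kg:Unknown", 0.35)]))
```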

[0037] D) For multiple data instances belonging to different modalities that refer to the same knowledge graph entity, the context semantic vector of each data instance is extracted using a pre-trained unified multimodal model. The semantic vector is mapped to a shared Hamming space through a cross-modal hashing network. When the Hamming distance between hash codes is less than a preset threshold, the data instances are determined to refer to the same fact, and cross-source fact alignment is performed. For data instances linked to the same knowledge graph entity, such as a text describing an experimental result and the corresponding result graph, a pre-trained multimodal model such as OpenAI's CLIP or Google's ALIGN is used. The data instances are input into the model's text encoder and image encoder, respectively, generating floating-point feature vectors that have the same dimension and are semantically aligned. These feature vectors are then input into a pre-trained deep cross-modal hashing network (DCMH). This network uses fully connected layers and a tanh activation function to convert the high-dimensional vector into a fixed-length binary hash code, such as 64 bits. The Hamming distance between the hash codes of two data instances is calculated, that is, the number of 1s in the result of a bitwise XOR operation. If this distance is less than a pre-set threshold, the two data instances from different sources or different modalities are determined to represent the same objective fact, thus achieving alignment at the factual level.
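
The XOR-and-count computation of the Hamming distance can be shown in a few lines of Python. Storing each hash code as an integer is an assumption of this sketch, and the function names are illustrative.

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of differing bits: XOR the codes, then count the 1s."""
    return bin(code_a ^ code_b).count("1")


def same_fact(code_a: int, code_b: int, threshold: int) -> bool:
    """Strict comparison, following the 'less than a threshold' wording."""
    return hamming_distance(code_a, code_b) < threshold


print(hamming_distance(0b1011, 0b0010))
print(same_fact(0b1011, 0b0010, threshold=3))
```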

[0038] In some embodiments, extracting the context semantic vector of multiple data instances belonging to different modalities that refer to the same knowledge graph entity using a pre-trained unified multimodal model includes: For data instances of different modalities such as text, image, and table, the data instances are input into a unified multimodal pre-trained model, and the 768-dimensional output vector of the fused multimodal information corresponding to the start symbol [CLS] in the last encoder layer of the model is extracted as the context semantic vector of the data instance.

[0039] The processing flow begins with standardized preprocessing of data from different modalities to adapt it to the input format of the UNITER model. The image feature extractor, ViT, is a neural network based on a Transformer encoder architecture. It segments the input image into a series of fixed-size image patches, performs a linear embedding transformation on each patch to obtain a patch vector, and adds learnable positional embedding information to the patch vector. The resulting vector sequence is then fed into a multi-layer stacked Transformer encoder for processing. The model's input is a digital image; the output is a sequence of visual features, where each vector represents a high-level semantic feature of the corresponding image patch after global context modeling. For text instances, a WordPiece tokenizer is used to convert the text into a token sequence, and [CLS] and [SEP] tags are added. For image instances, the ViT feature extractor encodes the instance into a fixed-length visual feature sequence in which each token represents an image patch, which captures image information more comprehensively than RoI-based object detection methods. For a table instance, the instance is linearized into a string with special markers, such as: [CLS][TABLE][COL]Temperature[COL]Pressure[ROW]300K[CELL]1atm[SEP], explicitly preserving the two-dimensional structure information.
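
The table linearization can be sketched as below. The function name is hypothetical, and the handling of multiple rows (one [ROW] marker per row, cells separated by [CELL]) is an assumption extrapolated from the single-row example in the text.

```python
def linearize_table(headers, rows):
    """Flatten a table into a marker-delimited string."""
    parts = ["[CLS]", "[TABLE]"]
    for header in headers:
        parts.append("[COL]" + header)
    for row in rows:
        # One [ROW] marker per row; [CELL] separates cells within the row.
        parts.append("[ROW]" + "[CELL]".join(row))
    parts.append("[SEP]")
    return "".join(parts)


print(linearize_table(["Temperature", "Pressure"], [["300K", "1atm"]]))
```

For the single-row input above, the result reproduces the example string from the text.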

[0040] The UNITER model is a unified multimodal pre-trained model based on the Transformer architecture. Its core structure consists of 12 stacked Transformer encoder layers, each containing a multi-head self-attention module and a feedforward neural network. This structure enables joint modeling of concatenated text and visual feature sequences. The model's input is a unified sequence composed of concatenated embedding vectors from different modalities, including word embeddings for text, visual feature embeddings for images, and their respective positional and modality type embeddings. The output is a context vector sequence of the same length as the input sequence, with the 768-dimensional output vector corresponding to the sequence start symbol [CLS] used as a global contextual semantic representation. Preprocessed modal data are embedded into vectors and concatenated into a unified input sequence, which is then fed into the UNITER model. This model's multi-head self-attention mechanism can detect deep semantic associations across modalities, such as the relationship between the peak value at 300K in text and the corresponding coordinate region in a chart. After the data flows through the model's 12 encoder layers, the 768-dimensional output vector output by the last encoder layer, which corresponds to the input sequence start symbol [CLS], is regarded as the global context representation of the entire multimodal input.

[0041] In some embodiments, the step of mapping the semantic vector to a shared Hamming space via a cross-modal hashing network, and determining that the data instances refer to the same fact based on the Hamming distance between hash codes being less than a preset threshold, includes: Construct a twin hash network with two weight-sharing branches, each branch consisting of three fully connected layers and one output layer; The 768-dimensional semantic vector is input into the network, and a 128-dimensional real number vector is generated through forward propagation. The vector is then processed by the sign function to obtain a 128-bit binary hash code. Calculate the number of different bits between the hash codes of two data instances to obtain the Hamming distance; When the Hamming distance is less than or equal to 10, the two data instances are considered to refer to the same fact.

[0042] The cross-modal hashing network employs a weight-sharing Siamese neural network architecture. Each branch is a multilayer perceptron consisting of three fully connected layers: the first layer maps the input to 1024 dimensions using ReLU activation; the second layer reduces the dimension to 512 using ReLU activation; and the third layer outputs 128 dimensions using Tanh activation. The network's input is a 768-dimensional context semantic vector; the output is a 128-bit binary hash code, obtained by applying a sign function to the 128-dimensional real-valued vector output by the third layer. Training relies on a training set containing a large number of labeled triples (A, B, label), where A and B are data instances, a label of 1 indicates that A and B refer to the same fact, and -1 indicates different facts. A combined loss function is used during training: $L = L_{pair} + \lambda L_{quant}$, where $L_{pair}$ is a pairwise loss, such as contrastive loss or triplet loss, used to pull samples of the same fact closer together and push samples of different facts further apart; $L_{quant}$ is the quantization loss, used to reduce information loss in the process of converting real-valued vectors into binary hash codes; and $\lambda$ is a hyperparameter used to balance the two loss terms, with an optimal value of 0.1.
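
One possible concrete form of the combined loss is sketched below in plain Python for a single labeled pair. The specific choices here are assumptions, not taken from the text: a contrastive loss on Euclidean distance with margin 2.0 as the pairwise term, and the squared distance between each output and its sign as the quantization term.

```python
import math


def sign(x):
    return 1.0 if x >= 0 else -1.0


def pairwise_loss(u, v, label, margin=2.0):
    """Contrastive loss on Euclidean distance; label is 1 or -1."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    if label == 1:
        return d ** 2                      # same fact: pull together
    return max(0.0, margin - d) ** 2       # different facts: push apart


def quantization_loss(v):
    """Penalty for outputs that are far from the binary values -1 and +1."""
    return sum((x - sign(x)) ** 2 for x in v)


def combined_loss(u, v, label, lam=0.1):
    quant = quantization_loss(u) + quantization_loss(v)
    return pairwise_loss(u, v, label) + lam * quant


print(combined_loss([0.5, -0.5], [0.5, -0.5], label=1))
```

In an actual training setup these terms would be averaged over a batch and differentiated by a deep learning framework; the scalar version only shows how the two terms are balanced by the weighting hyperparameter.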

[0043] During the inference phase, a 768-dimensional semantic vector is transformed into a 128-dimensional real vector v through forward propagation in a single branch of the network. Subsequently, v is binarized using the sign function sign(x): if the i-th component $v_i$ is positive, the i-th bit $h_i$ of the hash code is 1, and otherwise it is 0. Thus, each data instance acquires a 128-bit binary hash code. To determine whether two data instances refer to the same fact, the Hamming distance between their hash codes $h_A$ and $h_B$ is calculated, which can be done efficiently using bitwise operations. The Hamming distance threshold T is determined from the precision-recall curve on the validation set at the point of optimal balance, for example, T=10. When the Hamming distance between two hash codes is less than or equal to 10, at least 118 of the 128 bits, approximately 92.2%, are the same, and the two data instances are determined to be aligned at the factual level.
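
The binarization and threshold test can be sketched as follows. Packing the bits into a Python integer and treating exactly zero as bit 0 are assumptions of this sketch.

```python
def binarize(v):
    """Map a real vector to a hash code stored as an int.

    Assumption: bit i is 1 when v[i] > 0 and 0 otherwise.
    """
    code = 0
    for x in v:
        code = (code << 1) | (1 if x > 0 else 0)
    return code


def is_aligned(code_a, code_b, threshold=10):
    """True when the Hamming distance is at most the threshold T."""
    return bin(code_a ^ code_b).count("1") <= threshold


# With 128-bit codes and T=10, at least 118 bits must agree.
min_agreement = (128 - 10) / 128
print(binarize([0.3, -0.2, 0.9]), round(min_agreement * 100, 1))
```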

[0044] Since a hash network can only compare data instances in pairs, in one embodiment, a clustering approach based on Hamming distance is used to group more than two instances. Specifically, for N data instances linked to the same knowledge graph entity, the Hamming distance between each pair is calculated, constructing an N×N distance matrix. Subsequently, a clustering algorithm, such as the density-based algorithm DBSCAN or hierarchical clustering, is used to group data instances whose Hamming distance is within a preset threshold, such as less than or equal to 10, into the same semantic cluster. Further, to ensure that data instances within the same semantic cluster are not only similar in semantic topic but also consistent in objective facts, the key-value pairs inherent in structured and semi-structured data are used for fact verification. For data instances grouped into the same cluster, key fact constraints, such as time, value, or location, are extracted from their structured fields. If the deviation between instances on these key constraints exceeds the allowable tolerance, for example, the times do not fall on the same day or the core values are unequal, the instances are determined to refer to different facts and are removed from the current cluster or split into a new fact cluster.
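
The grouping step can be illustrated with a simplified stand-in for DBSCAN: instances are linked whenever their Hamming distance is within the threshold, and each connected group becomes a cluster. This is a deliberate simplification (it corresponds to the special case in which every point counts as a core point), and the fact verification step is not shown.

```python
def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def cluster_by_hamming(codes, threshold=10):
    """Group indices of codes into connected components (union-find)."""
    parent = list(range(len(codes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Visit every pair once, i.e. the upper triangle of the N x N matrix.
    for i in range(len(codes)):
        for j in range(i + 1, len(codes)):
            if hamming(codes[i], codes[j]) <= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(codes)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Small 8-bit codes and a threshold of 1, purely for illustration.
print(cluster_by_hamming([0b00000000, 0b00000001, 0b11111111, 0b11111110], 1))
```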

[0045] E) Based on the relational pattern of the query request, retrieve subgraph structures that are isomorphic to the pattern in the science and technology knowledge graph, use the subgraph structures as an instant aggregation pattern, and organize and integrate aligned data instances according to this pattern to generate structured science and technology intelligence data.

[0046] The parsed query relationship pattern, such as Entity A - Relation R -> Entity B, is converted into a Cypher query statement for the Neo4j graph database. For example: MATCH (a:EntityTypeA)-[r:RelationTypeR]->(b:EntityTypeB) WHERE a.name = <name of query entity A> RETURN a, r, b. This query is executed to retrieve all subgraph instances in the science and technology knowledge graph that satisfy the pattern. Each returned subgraph instance constitutes the skeleton of an instant aggregation pattern. Each subgraph instance is traversed, and for each entity node, all data instances aligned to that entity node in the previous step, including text fragments, image URLs, and table data, are used as content evidence to populate the node. The populated subgraph structure is serialized into an array of structured JSON objects, where each JSON object represents a piece of aggregated science and technology intelligence, containing a clearly defined subject, relationship, object, and the multimodal data evidence supporting the fact.
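
A sketch of the query construction and of the serialized output record is shown below. Passing the entity name as a `$name` query parameter, validating labels with a regular expression, and the key names of the JSON record are all assumptions of this example; the query is only built as a string and is not executed against a database here.

```python
import json
import re

_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_pattern_query(type_a: str, relation: str, type_b: str) -> str:
    """Build a Cypher pattern query for Entity A - Relation R -> Entity B.

    Labels and relationship types cannot be passed as Cypher parameters,
    so they are validated and interpolated; the name stays a parameter.
    """
    for name in (type_a, relation, type_b):
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return (
        f"MATCH (a:{type_a})-[r:{relation}]->(b:{type_b}) "
        "WHERE a.name = $name RETURN a, r, b"
    )


def to_intelligence_record(subject, relation, obj, evidence):
    """One aggregated item: subject, relation, object, and its evidence."""
    return {"subject": subject, "relation": relation,
            "object": obj, "evidence": evidence}


query = build_pattern_query("Institution", "Published", "Paper")
record = to_intelligence_record(
    "Institution X", "Published", "Paper A",
    [{"modality": "text", "content": "example fragment"}],
)
print(query)
print(json.dumps([record]))
```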

[0047] In some embodiments, the step of retrieving subgraph structures that are isomorphic to the relational pattern of the query request in the science and technology knowledge graph, and using the subgraph structures as instant aggregation patterns, includes: The relational schema of the query request is transformed into a graph query pattern consisting of entity type nodes and relation type edges; the VF2++ algorithm is used to match subgraphs that are isomorphic to the graph query pattern in the science and technology knowledge graph; and the structure of the first successfully matched subgraph instance that contains the specific entity in the query request is used as the instant aggregation pattern for this aggregation.

[0048] The system parses user queries, whether in natural language or structured form, into a standard graph query pattern, which can be represented as a statement in a graph query language such as Cypher or SPARQL. For example, the query to find papers published by Tsinghua University that cite patents of AA Technology Co., Ltd. can be converted into a Cypher query: MATCH (n1:Institution {name: "Tsinghua University"})-[:Published]->(n2:Paper)-[:Citation]->(n3:Patent)-[:Applicant]->(n4:Institution {name: "AA Technology Co., Ltd."}) RETURN n1, n2, n3, n4. This pattern defines the types of nodes, attribute constraints, and relationships between nodes.

[0049] This graph query is executed on a graph database such as Neo4j. The query engine of the graph database has a built-in optimized subgraph matching algorithm, such as a variant based on VF2++. The algorithm starts from the nodes with the strongest constraints in the graph query pattern, namely nodes n1 and n4 with the specified name attribute, and quickly locates the starting entity in the knowledge graph through indexing. It then performs directed graph traversal and expansion along the relational path defined in the query pattern, such as :Published and :Citation, while checking whether the types of the nodes encountered on the path match. This process uses various pruning strategies to avoid invalid searches. The system is configured to obtain the first successfully matched complete subgraph instance. For example, the found instance is Tsinghua University - Published -> Paper A - Citation -> Patent B - Applicant -> AA. The topology of this instance is extracted as Institution - Published -> Paper - Citation -> Patent - Applicant -> Institution, and this topology is used as the instant aggregation pattern for this data aggregation task.
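
The idea of expanding along a typed path and stopping at the first complete match can be illustrated with a small in-memory graph. The sketch below uses a plain depth-first search rather than VF2++ or a database engine, and the graph contents (including the dead-end node Paper X) are invented for illustration.

```python
def first_match(graph, node_types, start, pattern):
    """Depth-first search for the first path from `start` matching `pattern`.

    graph: {node: [(relation, neighbor), ...]}
    node_types: {node: type}
    pattern: [(relation, expected_type), ...] to follow in order.
    """
    def dfs(node, depth, path):
        if depth == len(pattern):
            return path
        rel, expected = pattern[depth]
        for edge_rel, nxt in graph.get(node, []):
            # Prune edges whose relation or target type does not match.
            if edge_rel == rel and node_types.get(nxt) == expected:
                found = dfs(nxt, depth + 1, path + [nxt])
                if found is not None:
                    return found
        return None

    return dfs(start, 0, [start])


graph = {
    "Tsinghua University": [("Published", "Paper X"), ("Published", "Paper A")],
    "Paper A": [("Citation", "Patent B")],
    "Patent B": [("Applicant", "AA")],
}
node_types = {"Tsinghua University": "Institution", "Paper X": "Paper",
              "Paper A": "Paper", "Patent B": "Patent", "AA": "Institution"}
pattern = [("Published", "Paper"), ("Citation", "Patent"),
           ("Applicant", "Institution")]

path = first_match(graph, node_types, "Tsinghua University", pattern)
topology = [node_types[n] for n in path]
print(path, topology)
```

Paper X has no outgoing citation edge, so the search backtracks and returns the path through Paper A; the list of node types along that path plays the role of the extracted topology.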

[0050] In yet another embodiment, for each successfully matched candidate subgraph structure, a comprehensive score S is calculated from two components. The frequency score is the normalized frequency of the subgraph topology in the entire science and technology knowledge graph and represents the universality of the structure. The information density score is the proportion of the subgraph's attribute fields that can be filled with non-empty values by all currently aligned data instances. All candidate subgraphs are traversed, and the subgraph instance structure with the highest comprehensive score S is selected as the instant aggregation pattern for this data organization and integration.

[0051] In some embodiments, organizing and integrating aligned data instances according to this pattern to generate structured science and technology intelligence data includes: Create a JSON object template based on the entities and relationships defined in the instant aggregation pattern; Information from multiple data instances that have been aligned and refer to the same fact is populated into the corresponding entity attribute fields in the JSON template according to the knowledge graph entities corresponding to the data instances. For fields with conflicting content, the source authority score is queried based on the source domain fingerprint of the data instance. Combined with the information release time, the information with the highest authority and the most recent time is selected as the final value to complete the data integration and output a structured JSON format science and technology intelligence data.

[0052] Based on the acquired instant aggregation pattern of organization-publication-paper, a nested JSON object template is generated. The structure of this template reflects the topological relationship of the pattern and reserves key attribute fields obtained from the knowledge graph schema for each entity, such as: {"organization":{"name":null,"country":null,"type":null},"relationship":{"type":"publication","date":null},"publication":{"title":null,"doi":null,"authors":[],"abstract":null}}.

[0053] All aligned data instances are iterated over, and, based on the knowledge graph entity linked to each instance, the information of that instance is populated into the corresponding position in the JSON template. When multiple data instances provide different values for the same attribute field, a configurable conflict resolution strategy is triggered. The default strategy prioritizes authority over timeliness: for each instance, it queries the preset source authority score (in the range 0 to 1) corresponding to the instance's source domain fingerprint, together with the information publication time. For example, if instance A has an authority score of 0.9 and a publication date of 2022-01-10, and instance B has an authority score of 0.7 and a publication date of 2022-01-12, and the two provide different values for `publication.date`, the value of instance A, which has the higher authority score, is selected. For specific data types, more refined strategies can be applied: for numerical data such as funding amounts, an authority-weighted average can be used; for textual data such as summaries, content from multiple sources can be concatenated, or a text summarization algorithm can be used to generate a fused summary. After all population and conflict resolution is complete, a complete and consistent structured JSON object is output.
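
The default strategy and the authority-weighted average can be sketched as below. The record layout and the conflicting field values are invented for illustration; only the authority scores and publication dates of instances A and B follow the example in the text.

```python
from datetime import date


def resolve_conflict(candidates):
    """Pick the value with the highest authority; break ties by recency.

    candidates: dicts with "value", "authority" (0 to 1) and "published".
    """
    best = max(candidates, key=lambda c: (c["authority"], c["published"]))
    return best["value"]


def authority_weighted_average(candidates):
    """Weighted mean of numeric values, weighted by source authority."""
    total = sum(c["authority"] for c in candidates)
    return sum(c["value"] * c["authority"] for c in candidates) / total


instance_a = {"value": "2021-12-01", "authority": 0.9,
              "published": date(2022, 1, 10)}
instance_b = {"value": "2021-12-03", "authority": 0.7,
              "published": date(2022, 1, 12)}
chosen = resolve_conflict([instance_a, instance_b])
print(chosen)

amounts = [{"value": 100.0, "authority": 0.9},
           {"value": 200.0, "authority": 0.7}]
print(authority_weighted_average(amounts))
```

Sorting on the tuple (authority, publication time) makes recency matter only when authority scores are equal, which matches the stated priority of authority over timeliness.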

[0054] Building on the aggregation method for multi-source heterogeneous intelligence data of any of the above embodiments, this application also provides an aggregation system for multi-source heterogeneous intelligence data, including the following modules: The acquisition module is used to acquire raw data of multi-source heterogeneous scientific and technological intelligence, extract and calculate the structured entropy representing the structural complexity of each data unit and the source domain fingerprint representing the reliability of the source of each data unit, and store the raw data, structured entropy and source domain fingerprint together in the data lake for unstructured storage; receive query requests, parse the requests to obtain query entities and relational patterns; and based on the query entities, relational patterns and the structured entropy and source domain fingerprint, initially screen candidate data units from the data lake. The calculation module is used to perform entity recognition on the candidate data units and link them to a pre-set science and technology knowledge graph. When there are multiple candidate linked entities, a confidence calculation model is constructed based on the topological structure and semantic information of the knowledge graph as context constraints. The confidence of each identified entity linked to the candidate entity in the knowledge graph is calculated, and the link result with the highest confidence is selected to complete the entity referencing resolution. The determination module is used to extract the context semantic vector of multiple data instances belonging to different modalities that refer to the same knowledge graph entity using a pre-trained unified multimodal model, map the semantic vector to a shared Hamming space through a cross-modal hash network, determine that the data instances refer to the same fact when the Hamming distance between hash codes is less than a preset threshold, and perform cross-source fact alignment. The generation module is used to retrieve subgraph structures that are isomorphic to the relational pattern of the query request in the science and technology knowledge graph, use the subgraph structures as an instant aggregation pattern, and organize and integrate aligned data instances according to this pattern to generate structured science and technology intelligence data.

[0055] It should be clarified that this application is not limited to the specific configurations and processes described above and shown in the figures. For the sake of brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method process of this application is not limited to the specific steps described and shown. Those skilled in the art can make various changes, modifications, and additions, or change the order of steps, after understanding the spirit of this application.

[0056] It should also be noted that the exemplary embodiments mentioned in this application describe methods or systems based on a series of steps or apparatus. However, this application is not limited to the order of the above steps; that is, the steps can be performed in the order mentioned in the embodiments, or in a different order, or several steps can be performed simultaneously.

[0057] The above description is merely a specific implementation of this application. Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the systems, modules, and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here. It should be understood that the protection scope of this application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in this application, and these modifications or substitutions should all be covered within the protection scope of this application.

Claims

1. A method for aggregating multi-source heterogeneous intelligence data, characterized in that it includes the following steps: obtaining raw data of multi-source heterogeneous scientific and technological intelligence, extracting and calculating the structured entropy representing the structural complexity of each data unit and the source domain fingerprint representing the reliability of the source of each data unit, and storing the raw data, structured entropy and source domain fingerprint together in the data lake for unstructured storage; receiving a query request and parsing the request to obtain the query entity and relation schema; initially screening candidate data units from the data lake based on the query entity, relation schema, structured entropy, and source domain fingerprint; performing entity recognition on the candidate data units and linking them to a pre-set science and technology knowledge graph, wherein, in the case of multiple candidate linked entities, a confidence calculation model is constructed based on the topological structure and semantic information of the knowledge graph as contextual constraints, the confidence of each identified entity linked to the candidate entity in the knowledge graph is calculated, and the link result with the highest confidence is selected to complete the entity referencing resolution; for multiple data instances belonging to different modalities that refer to the same knowledge graph entity, extracting the context semantic vector of each data instance using a pre-trained unified multimodal model, mapping the semantic vector to a shared Hamming space through a cross-modal hashing network, determining that the data instances refer to the same fact when the Hamming distance between hash codes is less than a preset threshold, and performing cross-source fact alignment; and, based on the relational pattern of the query request, retrieving a subgraph structure that is isomorphic to the pattern in the science and technology knowledge graph, using the subgraph structure as an instant aggregation pattern, and organizing and integrating aligned data instances according to this pattern to generate structured science and technology intelligence data.

2. The method according to claim 1, characterized in that the process of acquiring multi-source heterogeneous scientific and technological intelligence raw data, extracting and calculating the structured entropy representing the structural complexity of each data unit and the source domain fingerprint representing the reliability of the source of each data unit includes: for semi-structured data, parsing the DOM tree or JSON tree structure of the data, counting the total number of nodes and the maximum level depth, and calculating the normalized structured entropy by multiplying the ratio of the total number of nodes to the preset maximum number of nodes by the ratio of the maximum level depth to the preset maximum level depth; and extracting the main domain name of the data source and generating a 256-bit hash value using the SHA-256 algorithm, which serves as the source domain fingerprint.

3. The method according to claim 1, characterized in that preliminarily screening candidate data units from the data lake based on the query entity, the relation schema, the structured entropy, and the source domain fingerprint comprises: using an inverted index to perform a full-text search of the data lake for text content containing the query entity, thereby obtaining an initial set of data units; setting a target range for the structured entropy based on the complexity of the relation schema of the query, and selecting the data units whose entropy values fall within that range; and loading a preset list of authoritative information sources and, based on the source domain fingerprint, retaining only the data units whose source is an information source on the list, thereby obtaining the candidate data units.
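
The three screening stages of claim 3 could be sketched as below. This is a minimal illustration: the `DataUnit` record, the whitespace tokenizer, and the way the entropy range is passed in as a ready-made pair are assumptions, since the claim does not say how the range is derived from the schema complexity.

```python
from dataclasses import dataclass


@dataclass
class DataUnit:
    """Hypothetical record for one data unit stored in the data lake."""
    unit_id: str
    text: str
    entropy: float
    fingerprint: str


def build_inverted_index(units):
    """Map each lowercase token to the ids of the units containing it."""
    index = {}
    for unit in units:
        for token in set(unit.text.lower().split()):
            index.setdefault(token, set()).add(unit.unit_id)
    return index


def screen_candidates(units, index, query_entity, entropy_range, trusted_fingerprints):
    """Apply the full-text, entropy-range, and authoritative-source filters in turn."""
    tokens = query_entity.lower().split()
    if not tokens:
        return []
    # Stage 1: units containing every token of the query entity.
    hit_ids = set.intersection(*(index.get(token, set()) for token in tokens))
    low, high = entropy_range
    return [
        unit for unit in units
        if unit.unit_id in hit_ids                    # full-text hit
        and low <= unit.entropy <= high               # Stage 2: entropy range
        and unit.fingerprint in trusted_fingerprints  # Stage 3: authoritative source
    ]
```

Because the filters are conjunctive, their order affects only cost, not the result; the index lookup is placed first here so that the later checks run on a smaller set.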

4. The method according to claim 1, characterized in that constructing the confidence calculation model and calculating the confidence of linking each recognized entity to each candidate entity in the knowledge graph comprises: constructing a set of feature functions for the confidence calculation model, the set including the string similarity between the recognized entity string and the candidate entity name in the knowledge graph, the centrality index of the candidate entity in the knowledge graph, the part-of-speech tag sequence of the recognized entity in the original text, and the cosine similarity between the word vector of the entity context and the text vector representing the candidate entity; and inputting the values of the feature functions into a trained gradient boosting decision tree model, which outputs a normalized probability value serving as the confidence.
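
A rough sketch of how the feature values of claim 4 might be assembled and used to pick a link is given below. Several simplifications are assumed: string similarity uses `difflib.SequenceMatcher`, centrality is plain degree centrality, the part-of-speech feature is omitted because it requires a tagger, and the trained gradient boosting decision tree is represented by a caller-supplied `predict_confidence` function rather than a real model.

```python
import math
from difflib import SequenceMatcher


def string_similarity(mention, candidate_name):
    """Similarity ratio in [0, 1] between the mention and a candidate name."""
    return SequenceMatcher(None, mention.lower(), candidate_name.lower()).ratio()


def degree_centrality(adjacency, node):
    """Fraction of the other graph nodes that `node` is connected to."""
    if len(adjacency) < 2:
        return 0.0
    return len(adjacency.get(node, ())) / (len(adjacency) - 1)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def feature_vector(mention, context_vec, candidate, adjacency):
    """Feature values for one (mention, candidate entity) pair."""
    return [
        string_similarity(mention, candidate["name"]),
        degree_centrality(adjacency, candidate["id"]),
        cosine_similarity(context_vec, candidate["vector"]),
    ]


def link_entity(mention, context_vec, candidates, adjacency, predict_confidence):
    """Return (candidate id, confidence) for the highest-confidence link.

    `predict_confidence` is a hypothetical stand-in for the trained model:
    any callable mapping a feature list to a probability in [0, 1].
    """
    scored = [
        (c["id"], predict_confidence(feature_vector(mention, context_vec, c, adjacency)))
        for c in candidates
    ]
    return max(scored, key=lambda pair: pair[1])
```

In practice `predict_confidence` would wrap the positive-class probability of a fitted gradient boosting classifier; that wiring is left out so the sketch stays free of third-party dependencies.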

5. The method according to claim 1, characterized in that, for multiple data instances that belong to different modalities and refer to the same knowledge graph entity, extracting the context semantic vector of each data instance using the pre-trained unified multimodal model comprises: inputting data instances of different modalities, such as text, images, and tables, into the unified multimodal pre-trained model, and extracting, as the context semantic vector of each data instance, the 768-dimensional output vector that fuses the multimodal information and corresponds to the start token [CLS] in the last encoder layer of the model.

6. The method according to claim 1, characterized in that mapping the semantic vector to a shared Hamming space through the cross-modal hashing network, and determining that data instances refer to the same fact based on the Hamming distance between their hash codes being less than a preset threshold, comprises: constructing a Siamese (twin) hash network with two weight-sharing branches, each branch consisting of three fully connected layers and one output layer; inputting the 768-dimensional semantic vector into the network, generating a 128-dimensional real-valued vector through forward propagation, and applying the sign function to that vector to obtain a 128-bit binary hash code; counting the number of differing bits between the hash codes of two data instances to obtain the Hamming distance; and, when the Hamming distance is less than or equal to 10, determining that the two data instances refer to the same fact.
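
The binarization and comparison steps of claim 6 can be sketched as follows, taking the 128-dimensional real-valued outputs of the hash network as given (the network itself is not modelled). Two details are assumptions of the sketch: hash codes are packed into Python integers, and a component equal to zero is mapped to bit 1.

```python
def sign_hash(real_vector):
    """Pack the signs of a real-valued vector into an integer hash code.

    Assumption: components >= 0 map to bit 1, negative components to bit 0.
    """
    code = 0
    for x in real_vector:
        code = (code << 1) | (1 if x >= 0 else 0)
    return code


def hamming_distance(code_a, code_b):
    """Number of bit positions at which the two hash codes differ."""
    return bin(code_a ^ code_b).count("1")


def same_fact(vec_a, vec_b, threshold=10):
    """True when the Hamming distance between the hash codes is <= threshold."""
    return hamming_distance(sign_hash(vec_a), sign_hash(vec_b)) <= threshold
```

The XOR of the two codes has a 1 exactly where the bits differ, so counting its set bits yields the Hamming distance in a single pass over the 128 bits.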

7. The method according to claim 1, characterized in that retrieving, in the science and technology knowledge graph, subgraph structures that are isomorphic to the relation schema of the query request, and using the subgraph structure as the instant aggregation pattern, comprises: transforming the relation schema of the query request into a graph query pattern consisting of entity-type nodes and relation-type edges; using the VF2++ algorithm to match, in the science and technology knowledge graph, subgraphs that are isomorphic to the graph query pattern; and using the structure of the first successfully matched subgraph instance that contains the specific entity named in the query request as the instant aggregation pattern for the current aggregation.
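
To make the matching step of claim 7 concrete, the sketch below finds the first typed subgraph match that contains a required entity. It is a naive backtracking search used as a stand-in, not VF2++, and it checks only that every pattern edge is present in the knowledge graph (a non-induced match). The data layout (type dictionaries and edge triples) and all names are assumptions of the sketch.

```python
def find_first_match(pattern_nodes, pattern_edges, kg_types, kg_edges, required_entity):
    """Return the first assignment {pattern variable: entity} that satisfies
    every typed node and typed edge of the pattern and uses `required_entity`,
    or None if there is no such assignment.

    pattern_nodes: {variable: entity type}
    pattern_edges: set of (variable, relation type, variable)
    kg_types:      {entity: entity type}
    kg_edges:      set of (entity, relation type, entity)
    """
    variables = list(pattern_nodes)

    def consistent(assignment):
        # Every pattern edge whose endpoints are both bound must exist in the graph.
        for src, rel, dst in pattern_edges:
            if src in assignment and dst in assignment:
                if (assignment[src], rel, assignment[dst]) not in kg_edges:
                    return False
        return True

    def extend(assignment):
        if len(assignment) == len(variables):
            if required_entity in assignment.values():
                return dict(assignment)
            return None
        var = variables[len(assignment)]
        for entity, entity_type in kg_types.items():
            # Types must agree, and an entity may be bound to one variable only.
            if entity_type != pattern_nodes[var] or entity in assignment.values():
                continue
            assignment[var] = entity
            if consistent(assignment):
                found = extend(assignment)
                if found is not None:
                    return found
            del assignment[var]
        return None

    return extend({})
```

VF2++ reaches the same kind of answer far more efficiently through a better node ordering and stronger pruning rules; the backtracking version is shown only because it makes the matching conditions explicit.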

8. The method according to claim 1, characterized in that organizing and integrating the aligned data instances according to the instant aggregation pattern to generate structured science and technology intelligence data comprises: creating a JSON object template based on the entities and relationships defined in the instant aggregation pattern; populating the corresponding entity attribute fields of the JSON template with information from the multiple data instances that have been aligned and refer to the same fact, according to the knowledge graph entity to which each data instance corresponds; and, for fields with conflicting content, querying the source authority score based on the source domain fingerprint of each data instance and, in combination with the release time of the information, selecting the information with the highest authority and the most recent release time as the final value, thereby completing the data integration and outputting structured science and technology intelligence data in JSON format.
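
The template filling and conflict resolution of claim 8 might look like the sketch below. The claim does not say how authority and recency are combined, so the sketch assumes a lexicographic rule: the highest authority score wins and the release date breaks ties. The instance layout and field names are hypothetical.

```python
import json
from datetime import date


def resolve_field(observations, authority_by_fingerprint):
    """Pick one value among conflicting observations.

    Assumption: compare by authority score first, then by release date.
    Unknown fingerprints receive an authority score of 0.0.
    """
    best = max(
        observations,
        key=lambda o: (authority_by_fingerprint.get(o["fingerprint"], 0.0), o["released"]),
    )
    return best["value"]


def fill_template(template, aligned_instances, authority_by_fingerprint):
    """Populate {entity: [field, ...]} from aligned instances, resolving conflicts."""
    result = {}
    for entity, fields in template.items():
        result[entity] = {}
        for field in fields:
            observations = [
                {
                    "value": inst["attributes"][field],
                    "fingerprint": inst["fingerprint"],
                    "released": inst["released"],
                }
                for inst in aligned_instances
                if inst["entity"] == entity and field in inst["attributes"]
            ]
            # Fields that no instance supplies are left as None.
            result[entity][field] = (
                resolve_field(observations, authority_by_fingerprint)
                if observations else None
            )
    return result


def to_json(filled):
    """Serialize the filled template as the structured JSON output."""
    return json.dumps(filled, ensure_ascii=False, sort_keys=True)
```

A weighted combination of authority and recency would be an equally plausible reading of the claim; only the `key` function in `resolve_field` would need to change.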

9. A system for aggregating multi-source heterogeneous intelligence data, characterized in that it comprises the following modules:
an acquisition module, configured to acquire raw multi-source heterogeneous scientific and technological intelligence data, extract and calculate, for each data unit, a structured entropy representing the structural complexity of the data unit and a source domain fingerprint representing the reliability of its source, and store the raw data, the structured entropy, and the source domain fingerprint together in a data lake as unstructured storage; to receive a query request and parse it to obtain a query entity and a relation schema; and to preliminarily screen candidate data units from the data lake based on the query entity, the relation schema, the structured entropy, and the source domain fingerprint;
a calculation module, configured to perform entity recognition on the candidate data units and link the recognized entities to a preset science and technology knowledge graph, wherein, when a recognized entity has multiple candidate linked entities, a confidence calculation model is constructed using the topological structure and semantic information of the knowledge graph as contextual constraints, the confidence of linking each recognized entity to each candidate entity in the knowledge graph is calculated, and the link with the highest confidence is selected to complete entity reference resolution;
a determination module, configured to extract, using a pre-trained unified multimodal model, the context semantic vectors of multiple data instances that belong to different modalities and refer to the same knowledge graph entity, map the semantic vectors to a shared Hamming space through a cross-modal hashing network, determine that data instances refer to the same fact when the Hamming distance between their hash codes is less than a preset threshold, and perform cross-source fact alignment accordingly; and
a generation module, configured to retrieve, in the science and technology knowledge graph, subgraph structures that are isomorphic to the relation schema of the query request, use the subgraph structure as an instant aggregation pattern, and organize and integrate the aligned data instances according to this pattern to generate structured science and technology intelligence data.

10. The system according to claim 9, characterized in that acquiring the raw multi-source heterogeneous scientific and technological intelligence data, and extracting and calculating the structured entropy representing the structural complexity of each data unit and the source domain fingerprint representing the reliability of the source of each data unit, comprises: for semi-structured data, parsing the DOM tree or JSON tree structure of the data, counting the total number of nodes and the maximum level depth, and calculating the normalized structured entropy as the product of the ratio of the total number of nodes to a preset maximum number of nodes and the ratio of the maximum level depth to a preset maximum level depth; and extracting the main domain name of the data source and generating from it a 256-bit hash value using the SHA-256 algorithm, the hash value serving as the source domain fingerprint.