A multi-source heterogeneous data fusion processing method and system of an industrial internet platform

By calculating the number of path edges and guidance coefficients from words to standard classification nodes in the industrial internet platform, the topic distribution is automatically aligned with the industry standard classification, solving the problem of manual annotation and mapping required in existing technologies, and realizing efficient and automatic data fusion processing.

CN122196185A · Pending · Publication Date: 2026-06-12 · NINGBO LANYUAN IND & CITY GROUP CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: NINGBO LANYUAN IND & CITY GROUP CO LTD
Filing Date: 2026-05-15
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

In existing technologies, when industrial internet platforms process data fusion, the topics generated by the topic model cannot directly correspond to industry standard classifications, requiring manual annotation and mapping, which results in low processing efficiency and difficulty in meeting real-time requirements.

Method used

A structure decay factor is constructed from the number of shortest path edges from each word to each standard classification node. The guiding coefficient is built from the distribution significance of words and the resulting path association density, and is incorporated into Gibbs sampling as a prior constraint, so that the topic distribution is automatically aligned with the industry standard classification and the topic allocation process is automated and standardized.

Benefits of technology

The system generates topic distributions that are naturally aligned with the standard classification without manual annotation. This improves the automation and real-time performance of data fusion processing, makes the results directly usable by downstream business systems, and reduces manual maintenance costs and classification bias.

✦ Generated by Eureka AI based on patent content.

Abstract

The present application relates to the field of data processing, and more particularly to a multi-source heterogeneous data fusion processing method and system for an industrial internet platform. The method includes: collecting multi-source text from multiple heterogeneous data sources and preprocessing it to obtain a global vocabulary; calculating, based on an industry standard classification tree, the number of shortest path edges from each word to each standard classification node; calculating the distribution significance of each word, combining it with a structure decay factor to obtain the path association density, and then constructing the guiding coefficient of each text for each standard classification node; setting each standard classification node as a topic and incorporating the guiding coefficient as a prior constraint into the probability calculation of the Gibbs sampling process, to obtain the probability that each text belongs to each topic; and outputting the fusion label set of each text after confidence threshold screening. The present application realizes automatic alignment of multi-source heterogeneous data with the industry standard classification system, and improves the automation level and business availability of data fusion.

Description

Technical Field

[0001] This invention relates to the field of data processing. More specifically, this invention relates to a method and system for multi-source heterogeneous data fusion processing on an industrial internet platform.

Background Technology

[0002] During operation, industrial internet platforms need to integrate data from multiple sources, such as enterprise demand pools, patent databases, and academic paper repositories, and transform them into standardized information categorized and labeled according to industry standards for use by downstream business systems.

[0003] Currently, commonly used data fusion processing methods mainly include keyword matching and topic models based on Latent Dirichlet Allocation (LDA). While the standard LDA algorithm can automatically discover potential topics in text based on word co-occurrence statistics, the topics it generates are mathematical clustering results, lacking interpretability at the business level. When these topics need to be applied to industrial scenarios, they cannot be directly mapped to the platform's pre-set industry standard classification system.

[0004] In existing technologies, each cluster topic typically requires manual interpretation and mapping to standard classifications. However, manual annotation is inefficient and requires repeated annotation and maintenance as data grows dynamically or industry standards are updated, making it difficult to meet the requirements of industrial internet platforms for automated and real-time data fusion.

Summary of the Invention

[0005] To address the aforementioned technical problems, the present invention provides solutions in the following aspects.

[0006] In the first aspect, a method for multi-source heterogeneous data fusion processing of an industrial internet platform includes: collecting multi-source text from multiple heterogeneous data sources and preprocessing the multi-source text to obtain a global vocabulary; determining, based on a preset industry standard classification tree, the number of shortest path edges from each word in the global vocabulary to each standard classification node in the industry standard classification tree; calculating the distribution significance of each word in each text; for each word and each standard classification node, constructing a structure decay factor based on the number of shortest path edges, and multiplying the distribution significance by the structure decay factor to obtain the path association density between each word and each standard classification node; obtaining, based on the path association density, the guiding coefficient for each text to belong to each standard classification node; setting a topic for each standard classification node and, during the Gibbs sampling iteration over the topics, incorporating the guiding coefficient as a prior constraint into the probability calculation to obtain the probability that each text belongs to each topic; and, based on the probability that each text belongs to each topic, filtering the results with a confidence threshold and outputting the resulting fusion tag set for each text as the fusion processing result.

[0007] Optionally, the calculation process for the distribution significance includes: selecting any word from the global vocabulary as the target word, and selecting any text as the target text; obtaining the frequency of occurrence of the target word in the target text, the total number of words contained in the target text, and the number of all texts containing the target word; calculating the importance of the target word in the target text based on its frequency of occurrence in the target text and the total number of words contained in the target text; calculating the scarcity of the target word across all texts based on the total number of texts and the number of texts containing the target word; and multiplying the importance of the target word in the target text by its scarcity across all texts to obtain the distribution significance of the target word in the target text.

[0008] Optionally, the structural decay factor is the reciprocal of the sum of the number of shortest path edges from a word to a standard classification node and one.

[0009] Optionally, the calculation process of the guiding coefficient includes: The total evidence score is obtained by summing the path association densities of all words in the text with the same standard classification node; the global evidence sum is obtained by summing the path association densities of all words in the text with all standard classification nodes; the evidence score is then divided by the global evidence sum to obtain the guiding coefficient that the text belongs to the standard classification node.

[0010] Optionally, incorporating the guiding coefficient as a priori constraint into the probability calculation includes: Assign a topic to each word in each text. During the iteration of Gibbs sampling, for the current word, its old topic assignment is revoked; Construct text-topic items and topic-word items separately; Multiply the text-topic item by the topic-word item and normalize to obtain the probability of the word being assigned to each topic; The topic assignment for the current word is updated based on the given probability.

[0011] Optionally, the text-topic item for a given topic is the number of words in a single text that have been assigned to that topic, plus the product of the guiding coefficient of that text for that topic and the constraint strength coefficient; the constraint strength coefficient is the total number of words in the current text.

[0012] Optionally, the process of constructing the topic-word item includes: calculating a first parameter and a second parameter separately, and dividing the first parameter by the second parameter to obtain the topic-word item. The first parameter is the sum of the number of times the current word has appeared under a single topic and the preset smoothing parameter; the second parameter is the total number of times all words under that topic have appeared, plus the global vocabulary size multiplied by the smoothing parameter.

[0013] Optionally, the screening by confidence threshold includes: Calculate the distribution weight of each topic based on the number of words assigned to each topic in each text after iterative convergence; A pre-set confidence threshold is used to include the standard classification nodes corresponding to topics with distribution weights greater than or equal to the confidence threshold into the fusion label set; if the fusion label set is empty, the standard classification node corresponding to the topic with the highest distribution weight is selected.

[0014] In the second aspect, a multi-source heterogeneous data fusion processing system for an industrial internet platform includes: a processor and a memory, wherein the memory stores computer program instructions, and when the computer program instructions are executed by the processor, the multi-source heterogeneous data fusion processing method for the industrial internet platform described in any one of the claims is implemented.

[0015] The present invention has the following beneficial effects: 1. This invention constructs a structural decay factor by calculating the number of shortest path edges from words to standard classification nodes. The path association density is obtained by multiplying the distribution salience of words by this decay factor. This results in words semantically closer to a particular classification node having a greater contribution weight to that node, while effectively suppressing the contribution of cross-domain words. Based on this, a guiding coefficient is directly used as a prior constraint in Gibbs sampling, ensuring that the topic allocation process is always guided by the structure of the industry classification tree. The resulting topic distribution naturally aligns with standard classification nodes, eliminating the need for manually establishing a mapping relationship between clustered topics and the classification system, thus reducing the cost of manual annotation and maintenance.

[0016] 2. This invention calculates the distribution significance by multiplying the frequency of a word in a text by its inverse text frequency in the entire corpus. This automatically suppresses frequently occurring general words, while core technical words that appear only in a few texts but have high weight in the target text receive higher significance. This improves the model's ability to perceive the core content of the text and avoids classification bias caused by differences in text length and interference from general words.

[0017] 3. This invention uses a confidence threshold screening mechanism to output all topics whose weight is greater than or equal to the threshold as the fusion tag set of each text. When the tag set would otherwise be empty, it automatically falls back to the topic with the highest weight, which ensures the integrity and reliability of the output. Downstream business systems can therefore obtain the weighted multi-tag classification results directly and use them for supply and demand matching, risk warning, and other scenarios without additional processing.

Attached Figure Description

[0018] Figure 1 is a flowchart of steps S1-S3 in a multi-source heterogeneous data fusion processing method for an industrial internet platform according to an embodiment of the present invention.

[0019] Figure 2 is a flowchart illustrating the calculation of the probability distribution of each text belonging to each topic in a multi-source text data set, in a multi-source heterogeneous data fusion processing method for an industrial internet platform according to an embodiment of the present invention.

[0020] Figure 3 is a structural block diagram of a multi-source heterogeneous data fusion processing system for an industrial internet platform according to an embodiment of the present invention.

Detailed Implementation

[0021] The technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are some embodiments of the present invention, but not all embodiments.

[0022] The industrial internet platform used in this embodiment needs to integrate the technical requirements, relevant patent abstracts, and academic paper abstracts published by enterprises during its operation, and automatically label these heterogeneous data as predefined industry standard classification codes, such as sealing materials, transmission bearings, and industrial software, for subsequent supply and demand matching and risk warning system calls.

[0023] To address this, this invention provides a method for multi-source heterogeneous data fusion processing for industrial internet platforms. The goal of the method is to automatically convert heterogeneous texts of varying formats and semantics, collected from sources such as enterprise demand pools, patent databases, and academic paper repositories, into structured data with clear industry standard classification labels, and to store this data in a dynamic trusted data pool where downstream business systems such as risk control and supply-demand matching can access it directly. This solves the problem in existing technologies where topics generated by topic models cannot directly correspond to industry standard classifications and require manual annotation and mapping, so the data fusion results can be used immediately without secondary manual annotation.

[0024] Referring to Figure 1, a method for fusion processing of multi-source heterogeneous data on an industrial internet platform includes steps S1-S3, as detailed below. S1: Collect multi-source text from multiple heterogeneous data sources and perform preprocessing operations.

[0025] Within each data fusion task cycle, for example, triggered once every 24 hours, the system reads technical requirement text from the enterprise demand pool, patent abstract text from the patent database, and research conclusion text from the academic paper database to form a multi-source text data set. In this embodiment of the invention, it is assumed that the multi-source text data set contains 1,000 texts.

[0026] Simultaneously, a pre-configured industry-standard classification tree is loaded from the platform configuration center. This classification tree is stored as a hierarchical object, with each node containing a unique classification code, a parent node reference identifier, and a hierarchy depth value. The system loads this into memory through a data reading interface and constructs it into a tree-like index structure that supports fast queries.

[0027] For each text in the above multi-source text dataset, the following preprocessing operations are performed in sequence. Text cleaning is performed to remove HTML tags, special symbols, and stop words.

[0028] The word segmentation process is performed to obtain the standard word sequence.

[0029] A global vocabulary list is created for all unique words obtained after cleaning and word segmentation. The frequency of each word in the global vocabulary list in the text is counted, as well as the total number of texts containing the word.
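The cleaning, segmentation, and counting operations of paragraphs [0027]-[0029] can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the stop-word set, the whitespace split standing in for a real word segmenter, and the function names `preprocess` and `build_vocabulary` are all assumptions.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real deployment would load a full one.
STOP_WORDS = {"the", "a", "of", "and", "for", "is"}


def preprocess(text):
    """Clean one text and return its word sequence."""
    text = re.sub(r"<[^>]+>", " ", text)   # remove HTML tags
    text = re.sub(r"[^\w\s]", " ", text)   # remove special symbols
    tokens = text.lower().split()          # assumption: whitespace split replaces a segmenter
    return [t for t in tokens if t not in STOP_WORDS]


def build_vocabulary(texts):
    """Return word sequences, the global vocabulary, per-text word
    frequencies, and the number of texts containing each word."""
    docs = [preprocess(t) for t in texts]
    term_freq = [Counter(doc) for doc in docs]
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))          # count each word once per text
    vocab = sorted(doc_freq)
    return docs, vocab, term_freq, doc_freq
```

The per-text counters and the document-frequency counter are the two statistics the later distribution significance calculation needs.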

[0030] Using a pre-built dictionary of technical terms and standard classification nodes, and through exact matching and synonym expansion rules, the system determines the node in the industry-standard classification tree for each cleaned and segmented word. If multiple candidate nodes are matched, the node with the deepest hierarchy is selected; if the match fails, the node is marked as null.
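A minimal sketch of the home-node lookup described in paragraph [0030], assuming the term dictionary, the synonym table, and the node depths are available as plain Python dictionaries. The names `home_node`, `term_to_nodes`, `synonyms`, and `depth_of` are hypothetical and do not come from the source.

```python
def home_node(word, term_to_nodes, synonyms, depth_of):
    """Return the classification node a word maps to, or None if unmatched.

    term_to_nodes: term -> list of candidate node codes (exact matching)
    synonyms:      word -> canonical term (synonym expansion)
    depth_of:      node code -> hierarchy depth in the classification tree
    """
    canonical = synonyms.get(word, word)
    candidates = term_to_nodes.get(canonical, [])
    if not candidates:
        return None                        # match failed: mark as null
    # Several candidates: keep the one deepest in the hierarchy.
    return max(candidates, key=lambda node: depth_of[node])
```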

[0031] For each successfully mapped word, the breadth-first search algorithm is used to calculate the number of shortest path edges from the corresponding home node to each standard classification node in the industry standard classification tree. For unmapped words, the number of shortest path edges is treated as infinite and is represented by a large sentinel value, for example 9999.
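The shortest-path computation can be illustrated with a breadth-first search over the classification tree treated as an undirected graph. The sketch assumes the tree is given as a child-to-parent dictionary (consistent with the parent node reference described in [0026]); the function names and the constant name `UNREACHABLE` are assumptions, while the value 9999 is the example sentinel from the text.

```python
from collections import deque

UNREACHABLE = 9999  # example sentinel for unmapped words


def build_adjacency(parent_of):
    """Turn a child -> parent mapping into an undirected adjacency map."""
    adj = {node: set() for node in parent_of}
    for node, parent in parent_of.items():
        if parent is not None:
            adj[node].add(parent)
            adj.setdefault(parent, set()).add(node)
    return adj


def shortest_path_edges(adj, home):
    """Number of edges on the shortest path from `home` to every node."""
    if home is None:                       # unmapped word
        return {node: UNREACHABLE for node in adj}
    dist = {home: 0}
    queue = deque([home])
    while queue:
        current = queue.popleft()
        for neighbor in adj[current]:
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)
    return {node: dist.get(node, UNREACHABLE) for node in adj}
```

Because the structure is a tree, the path between two nodes is unique, so breadth-first search is used here mainly for simplicity; one traversal per home node yields the distances to all classification nodes at once.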

[0032] Through the above operations, we finally obtain the multi-source text data set after cleaning and word segmentation, as well as the frequency of each word in each text in the global vocabulary, the total number of texts in which each word appears in the entire multi-source text data set, and the number of shortest path edges from the node to which each word belongs to each classification node in the industry standard classification tree.

[0033] S2: Based on the preset industry standard classification constraints and the preprocessed multi-source text, calculate the probability that each text belongs to each topic; each topic corresponds one-to-one with an industry standard classification.

[0034] S1 above has provided preprocessed multi-source text data and the industry standard classification tree. In this embodiment of the invention, topics refer to probabilistic clusters generated by the Latent Dirichlet Allocation (LDA) model. Each topic needs to correspond to an industry standard classification node. However, topics generated by the standard LDA algorithm cannot automatically align with classification nodes. Therefore, it is necessary to use the structural knowledge of the classification tree to constrain the topic allocation process, so that the generated topic distribution directly corresponds to each industry standard classification.

[0035] Referring to Figure 2, the following sub-steps S20-S24 calculate the probability distribution of each text in the multi-source text dataset over the topics, so that the generated topic distribution directly corresponds to the industry standard classifications.

[0036] S20: Initialize the number of topics and establish a mapping relationship with the standard classification nodes.

[0037] Because the topic numbers in the standard LDA algorithm are arbitrary and cannot directly correspond to business categories, a one-to-one mapping between topic indices and category node indices is established, and the total number of topics is set equal to the number of category nodes. Specifically, the total number of topics k is set equal to the total number of category nodes c used for business tagging in the industry standard classification tree. In this embodiment, c is set to 15, so k = c = 15, and each topic index is mapped to exactly one category node index. For example, topic 1 corresponds to sealing materials, topic 2 to transmission bearings, topic 3 to industrial software, and so on. This ensures that each topic directly represents a specific industry standard classification code.

[0038] S21: Based on the established mapping relationship, calculate the distribution significance of each word in the multi-source text dataset.

[0039] Furthermore, considering that raw word frequency is affected by text length and common vocabulary, it cannot truly reflect the representativeness of words to the text's theme. Therefore, it is necessary to calculate an index that comprehensively considers both the local importance and global rarity of words.

[0040] Specifically, any word in the global vocabulary is selected as the target word, any text is selected as the target text, and the frequency of occurrence of the target word in the target text and the total number of words contained in the target text are obtained.

[0041] The ratio of the frequency of a target word in a target text to the total number of words in the target text is used to determine the importance of the target word in the target text.

[0042] Furthermore, the texts containing the target word are counted. The ratio of the total number of texts in the multi-source text dataset to the number of texts containing the target word is calculated, and the base-10 logarithm of this ratio is taken as the scarcity of the target word within the entire multi-source text dataset.

[0043] Furthermore, by multiplying the importance of the target words in the target text obtained above by the scarcity of the target words in the entire multi-source text dataset, we obtain the distribution significance of the target words in the target text.

[0044] The significance of the distribution of the aforementioned target words increases with the frequency of their occurrence in the target text and decreases with the number of texts containing the word; that is, the rarer the word and the more important it is in the current target text, the higher its distribution significance. This indicator can highlight core technical vocabulary and suppress interference from common vocabulary and differences in text length.

[0045] By analogy, the saliency of the distribution of all words in all texts in a multi-source text dataset can be obtained through the above operations.
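The calculation in [0040]-[0043] is a term-frequency times inverse-document-frequency product with a base-10 logarithm. A minimal sketch, assuming the four counts have already been gathered during preprocessing (the function and parameter names are illustrative, not from the source):

```python
import math


def distribution_significance(count_in_text, text_length, num_texts, texts_with_word):
    """Distribution significance of one word in one text.

    importance = occurrences of the word in the text / words in the text
    scarcity   = log10(total texts / texts containing the word)
    """
    importance = count_in_text / text_length
    scarcity = math.log10(num_texts / texts_with_word)
    return importance * scarcity
```

For instance, with the embodiment's 1,000 texts, a word that occurs 5 times in a 100-word text and appears in 10 texts has importance 0.05 and scarcity log10(100) = 2, giving a significance of 0.1. A word that appears in every text has scarcity 0 and therefore contributes nothing.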

[0046] S22: Calculate the path association density between each word and each standard classification node in the multi-source text dataset.

[0047] Knowing only the distribution significance of words is insufficient; it is also necessary to clarify their semantic distance from each standard classification node. The higher the distribution significance of words calculated in S21 above, the better it reflects the core technical content of the text. Therefore, the distribution significance of a word is used as the strength of evidence for that word to the standard classification node. However, the larger the number of edges in the shortest path from the word's home node to the standard classification node, the weaker the semantic connection between the two, and the smaller the contribution of this strength of evidence to the standard classification node should be. To this end, a structural attenuation factor is defined that decreases as the number of edges in the shortest path increases. The larger the number of edges in the shortest path, the smaller the structural attenuation factor, and the more significant the attenuation of the evidence contribution. When there is a direct hit, i.e., the number of edges in the shortest path is 0, the structural attenuation factor is 1, and the evidence is completely preserved.

[0048] Furthermore, taking the target text and target words as examples again, and selecting any standard classification node as the target node, the salience of the target word's distribution in the target text is multiplied by the aforementioned structural attenuation factor to obtain the path association density between the target word's belonging node and the target node in the target text. The higher this path association density, the closer the target word is to the target node semantically, and the more important the target word itself is in the target text. This ensures that only words that are truly semantically close to the target node and have high salience can have a significant impact, avoiding cross-domain noise interference.

[0049] The aforementioned structural attenuation factor is specifically taken as the reciprocal of the sum of one and the number of edges on the shortest path from the target word's home node to the target node, that is, 1 divided by (number of shortest path edges + 1).

[0050] Similarly, based on the above operations, the path association density from the belonging node of each word to each standard classification node in each text can be obtained. For unmapped words, the number of edges of their shortest path is infinite, so the structure decay factor approaches 0, and the path association density also approaches 0, which does not affect subsequent calculations.
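A short sketch of the attenuation factor and the path association density from [0048]-[0050]. The function names are assumptions; the formula 1 / (edges + 1) follows the text.

```python
def structure_decay_factor(path_edges):
    """Reciprocal of (number of shortest path edges + 1)."""
    return 1.0 / (path_edges + 1)


def path_association_density(significance, path_edges):
    """Distribution significance attenuated by structural distance."""
    return significance * structure_decay_factor(path_edges)
```

A direct hit (0 edges) keeps the evidence unchanged, one edge halves it, and the 9999 sentinel used for unmapped words scales it by 1/10000, which is small but not exactly zero.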

[0051] S23: Construct the guiding coefficients for each text for each standard classification node based on the calculated path association density.

[0052] The path association density of each word to each standard classification node only reflects local evidence, which is scattered across different words and cannot be directly used to determine the classification of the entire text. In order to aggregate the scattered word-level evidence into text-level classification tendency, the guiding coefficient for classifying the text to each topic is calculated.

[0053] Specifically, for each standard classification node, the path association density of all words in the text to that standard classification node is summed to obtain the total evidence score for that standard classification node.

[0054] Furthermore, the sum of the path association densities of all words in the text to all standard classification nodes is taken as the global evidence sum.

[0055] Furthermore, the total evidence score of a single standard classification node is divided by the total global evidence score to obtain the guiding coefficient for classifying the text as belonging to that standard classification node.

[0056] Through the above calculations, the guiding coefficient increases with the total evidence score for a single standard classification node in the text, and the sum of the guiding coefficients for all standard classification nodes is 1. When highly significant words in the text point to a certain standard classification node, the guiding coefficient corresponding to that standard classification node is close to 1, effectively guiding subsequent topic assignments. Therefore, this guiding coefficient is essentially the probability tendency for the text to belong to various topics.
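The aggregation in [0053]-[0055] can be sketched as below for a single text, assuming the path association densities are held in a nested dictionary `density[word][node]`. The uniform fallback when a text has no evidence at all is an assumption added for robustness; the source does not specify this case.

```python
def guiding_coefficients(density):
    """Guiding coefficient of one text for each standard classification node.

    density: word -> {node -> path association density} for this text.
    """
    nodes = sorted({n for per_node in density.values() for n in per_node})
    # Total evidence score per node: sum over all words in the text.
    evidence = {
        n: sum(per_node.get(n, 0.0) for per_node in density.values())
        for n in nodes
    }
    total = sum(evidence.values())         # global evidence sum
    if total == 0:
        # Assumption (not in the source): no evidence -> uniform guidance.
        return {n: 1.0 / len(nodes) for n in nodes}
    return {n: evidence[n] / total for n in nodes}
```

By construction the returned coefficients sum to 1 across nodes, matching the property stated in [0056].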

[0057] S24: Perform improved topic assignment iterative calculation based on the calculated guiding coefficients.

[0058] The standard LDA algorithm relies solely on internal statistical counts and cannot incorporate business constraints. The guiding coefficients obtained from S23 are incorporated as external knowledge into the probability formula for Gibbs sampling. Through iterative optimization, topic assignments are gradually corrected to align the final result with industry-standard classifications. The iterative process is executed as follows: The first step is to randomly assign a topic to each word in each text within the multi-source text dataset. Record the initial topic label for each word.

[0059] The second step is to calculate the text-topic count based on the initial allocation results, which is to calculate the number of words assigned to a single topic in a single text, and to calculate the topic-word count, which is the frequency of a single word in a single topic.

[0060] The third step is to set a maximum number of iterations, such as 200, and a convergence threshold, for example, a change in the topic of words between two adjacent iterations of less than 1%. Repeat the following steps until the convergence condition is met: Get the current topic assigned to any word in a single text, decrement the text-topic count in the second step by 1, and also decrement the topic-word count in the second step by 1, thereby revoking the old state and updating to obtain the new state.

[0061] Then, based on the newly updated state and the guiding coefficient calculated in S23 above, the probability of assigning the word to a certain topic in a single text is calculated, as follows: First, construct a text-topic item, whose value is equal to the number of words in the text that have been assigned to the topic plus the product of the text's guidance coefficient and constraint strength coefficient. The constraint strength coefficient is taken as the total number of words in the current text.

[0062] Secondly, construct the topic-word item by calculating the number of times the current word has appeared under the topic and adding a smoothing parameter, for example, 0.001, to obtain the first parameter; calculate the total number of occurrences of all words under the topic and add the global vocabulary size multiplied by the same smoothing parameter to obtain the second parameter; divide the first parameter by the second parameter to obtain the topic-word item.

[0063] Then, the text-topic item and the topic-word item are multiplied to obtain the unnormalized score for the topic.

[0064] After performing the same calculation on all topics, the score for each topic is divided by the sum of the scores for all topics to obtain the probability of that word being assigned to each topic.
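Paragraphs [0061]-[0064] can be written as a single expression. The symbols are introduced here for illustration and are not the patent's notation: $n_{d,k}$ is the number of words in text $d$ currently assigned to topic $k$ (with the current word removed), $N_d$ is the total number of words in text $d$ (the constraint strength coefficient), $g_{d,k}$ is the guiding coefficient, $n_{k,w}$ is the number of times word $w$ is assigned to topic $k$, $n_k$ is the total number of words assigned to topic $k$, $V$ is the global vocabulary size, $\beta$ is the smoothing parameter, and $K$ is the number of topics.

```latex
\[
P(z_{d,i} = k \mid \cdot)
  = \frac{\bigl(n_{d,k} + N_d\, g_{d,k}\bigr)\,\dfrac{n_{k,w} + \beta}{n_k + V\beta}}
         {\sum_{j=1}^{K} \bigl(n_{d,j} + N_d\, g_{d,j}\bigr)\,\dfrac{n_{j,w} + \beta}{n_j + V\beta}}
\]
```

Compared with the usual collapsed Gibbs update for LDA, the symmetric document-topic prior is replaced by the text-specific term $N_d\, g_{d,k}$, so the prior mass for each topic scales with the length of the text and with the tree-derived evidence.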

[0065] The fourth step is to randomly select a topic as the new topic for the word using a roulette wheel method, based on the calculated probability distribution.

[0066] Fifth, update the topic tag of the word to the new topic, and increment the text-topic count and topic-word count of the new topic by 1.

[0067] After completing one round of updates for all words in all texts, calculate the proportion of words whose topics have changed compared to the previous round.

[0068] If the change rate is less than 1% or the maximum number of iterations has been reached, the iteration is terminated; otherwise, the next round of iterations continues.
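The five steps and the stopping rule can be put together in a compact sketch. It assumes texts are lists of integer word ids and that `guide[d][k]` holds the guiding coefficient of text `d` for topic `k` with each row summing to 1; the function name, argument names, and the fixed random seed are illustrative choices, not part of the source. `random.Random.choices` with `weights` performs the roulette-wheel draw and normalizes the scores internally.

```python
import random


def guided_gibbs(docs, vocab_size, guide, num_topics,
                 beta=0.001, max_iters=200, tol=0.01, seed=0):
    """Guided topic assignment; returns (assignments, text-topic counts)."""
    rng = random.Random(seed)
    # Step 1: random initial topic for every word.
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    # Step 2: text-topic and topic-word counts.
    n_dk = [[0] * num_topics for _ in docs]
    n_kw = [[0] * vocab_size for _ in range(num_topics)]
    n_k = [0] * num_topics

    def bump(d, k, w, delta):
        n_dk[d][k] += delta
        n_kw[k][w] += delta
        n_k[k] += delta

    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            bump(d, z[d][i], w, +1)

    total_words = sum(len(doc) for doc in docs)
    for _ in range(max_iters):
        changed = 0
        for d, doc in enumerate(docs):
            strength = len(doc)            # constraint strength coefficient
            for i, w in enumerate(doc):
                old = z[d][i]
                bump(d, old, w, -1)        # step 3: revoke the old state
                scores = [
                    (n_dk[d][k] + strength * guide[d][k])              # text-topic item
                    * (n_kw[k][w] + beta) / (n_k[k] + vocab_size * beta)  # topic-word item
                    for k in range(num_topics)
                ]
                # Step 4: roulette-wheel selection proportional to the scores.
                new = rng.choices(range(num_topics), weights=scores)[0]
                bump(d, new, w, +1)        # step 5: record the new state
                z[d][i] = new
                if new != old:
                    changed += 1
        # Stop when fewer than `tol` of the words changed topic this round.
        if total_words and changed / total_words < tol:
            break
    return z, n_dk
```

The returned text-topic counts are what step S3 needs to compute distribution weights. The sketch recomputes all topic scores for every word, which is simple but costs O(K) per word per round.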

[0069] S3: Based on the probability that each text belongs to each topic, a set of fusion tags for each text is generated by filtering through a confidence threshold to complete the fusion processing of multi-source heterogeneous data.

[0070] For a single text, obtain the number of words assigned to each topic in the text after iterative convergence. Divide the number of words assigned to a single topic in the text by the sum of the number of words assigned to all topics in the text to obtain the distribution weight of each topic.

[0071] A confidence threshold is preset, for example, set to 0.15. All topics are iterated through: if the distribution weight of a topic is greater than or equal to the confidence threshold, the industry standard classification node code and name corresponding to that topic are included in the fusion tag set of the text, and the corresponding distribution weight is recorded. If the tag set is empty after filtering, the single topic with the highest distribution weight is selected as the main fusion tag.
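A minimal sketch of the weighting and threshold screening in [0070]-[0071], assuming `topic_counts[k]` is the number of words of one text assigned to topic `k` after convergence and `node_names[k]` is the classification node mapped to that topic. The function name and the return shape (a list of name and weight pairs) are assumptions.

```python
def fusion_labels(topic_counts, node_names, threshold=0.15):
    """Return [(node name, distribution weight), ...] for one text."""
    total = sum(topic_counts)
    if total == 0:
        return []                          # empty text: nothing to label
    weights = [count / total for count in topic_counts]
    labels = [
        (node_names[k], weight)
        for k, weight in enumerate(weights)
        if weight >= threshold
    ]
    if not labels:
        # Fallback: keep the single topic with the highest weight.
        best = max(range(len(weights)), key=weights.__getitem__)
        labels = [(node_names[best], weights[best])]
    return labels
```

With 15 topics and a threshold of 0.15, a text whose words are spread evenly would have every weight near 0.067, so the fallback branch is what guarantees a non-empty tag set.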

[0072] Finally, the original text content, the fusion tag set (including classification code, name, and weight), and metadata (such as data source and collection time) are encapsulated into a JSON object and stored in the dynamic trusted data pool of the industrial internet platform for direct access by downstream business systems such as supply and demand matching and risk warning.
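The encapsulation step might look like the following. The source only states which pieces of information go into the JSON object; every field name here (`text`, `fusion_tags`, `code`, `name`, `weight`, `metadata`, `source`, `collected_at`) is a hypothetical choice, as is the function name.

```python
import json


def build_record(text, labels, source, collected_at):
    """Serialize one fused text as a JSON string.

    labels: iterable of (classification code, name, weight) tuples.
    """
    record = {
        "text": text,
        "fusion_tags": [
            {"code": code, "name": name, "weight": round(weight, 4)}
            for code, name, weight in labels
        ],
        "metadata": {"source": source, "collected_at": collected_at},
    }
    # ensure_ascii=False keeps non-ASCII text (e.g. Chinese) readable.
    return json.dumps(record, ensure_ascii=False)
```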

[0073] Thus, the embodiments of the present invention fully realize the multi-source heterogeneous data fusion process, from multi-source heterogeneous data acquisition, through topic distribution calculation under industry standard constraints, to the generation of fusion tags.

[0074] This invention also provides a multi-source heterogeneous data fusion processing system for an industrial internet platform. As shown in Figure 3, the system includes a processor and a memory. The memory stores computer program instructions. When the computer program instructions are executed by the processor, the multi-source heterogeneous data fusion processing method of the industrial internet platform according to the first aspect of the present invention is implemented.

[0075] The system also includes other components well known to those skilled in the art, such as communication buses and communication interfaces, the settings and functions of which are known in the art and will not be described in detail here.

[0076] It should be noted that those skilled in the art can make various modifications and improvements without departing from the inventive concept, and these all fall within the scope of protection of this invention. Therefore, the scope of protection of this patent should be determined by the appended claims.

Claims

1. A method for multi-source heterogeneous data fusion processing in an industrial internet platform, characterized in that the method comprises: collecting multi-source text from multiple heterogeneous data sources, and preprocessing the multi-source text to obtain a global vocabulary; based on a preset industry standard classification tree, determining the number of shortest path edges from each word in the global vocabulary to each standard classification node in the industry standard classification tree; calculating the distribution saliency of each word in each text; for each word and each standard classification node, constructing a structure decay factor based on the number of shortest path edges, and multiplying the distribution saliency by the structure decay factor to obtain the path association density between each word and each standard classification node; based on the path association density, obtaining the guiding coefficient for each text to belong to each standard classification node; setting a topic for each standard classification node and, during the Gibbs sampling iteration over the topics, incorporating the guiding coefficient as a prior constraint into the probability calculation to obtain the probability that each text belongs to each topic; and, based on the probability that each text belongs to each topic, filtering by a confidence threshold and outputting the resulting fusion tag set of each text as the fusion processing result.

2. The method for multi-source heterogeneous data fusion processing of an industrial internet platform according to claim 1, characterized in that the calculation process for the distribution saliency includes: selecting any word from the global vocabulary as the target word, and selecting any text as the target text; obtaining the frequency of occurrence of the target word in the target text, the total number of words contained in the target text, and the number of texts containing the target word; calculating the importance of the target word in the target text based on its frequency of occurrence in the target text and the total number of words contained in the target text; calculating the scarcity of the target word across all texts based on the total number of texts and the number of texts containing the target word; and multiplying the importance of the target word in the target text by its scarcity across all texts to obtain the distribution saliency of the target word in the target text.

3. The method for multi-source heterogeneous data fusion processing of an industrial internet platform according to claim 1, characterized in that the structure decay factor is the reciprocal of the sum of one and the number of edges in the shortest path from a word to a standard classification node.

4. The method for multi-source heterogeneous data fusion processing of an industrial internet platform according to claim 1, characterized in that the calculation process of the guiding coefficient includes: summing the path association densities between all words in the text and the same standard classification node to obtain the evidence score of that node; summing the path association densities between all words in the text and all standard classification nodes to obtain the global evidence sum; and dividing the evidence score by the global evidence sum to obtain the guiding coefficient for the text to belong to the standard classification node.

5. The method for multi-source heterogeneous data fusion processing of an industrial internet platform according to claim 1, characterized in that incorporating the guiding coefficient as a prior constraint into the probability calculation includes: assigning a topic to each word in each text; during the Gibbs sampling iteration, revoking the old topic assignment of the current word; constructing a text-topic item and a topic-word item separately; multiplying the text-topic item by the topic-word item and normalizing to obtain the probability of the word being assigned to each topic; and updating the topic assignment of the current word based on the obtained probability.

6. The method for multi-source heterogeneous data fusion processing of an industrial internet platform according to claim 5, characterized in that the text-topic item is the number of words in a single text that have been assigned to a given topic, plus the product of the guiding coefficient of that text for that topic and the constraint strength coefficient corresponding to that text; the constraint strength coefficient is the total number of words in the current text.

7. The method for multi-source heterogeneous data fusion processing of an industrial internet platform according to claim 6, characterized in that the process of constructing the topic-word item includes: calculating a first parameter and a second parameter separately, and dividing the first parameter by the second parameter to obtain the topic-word item; the first parameter is the sum of the number of times the current word has appeared under a single topic and a preset smoothing parameter; the second parameter is the total number of times all words have appeared under that topic, plus the product of the global vocabulary size and the smoothing parameter.

8. The method for multi-source heterogeneous data fusion processing of an industrial internet platform according to claim 1, characterized in that the filtering by a confidence threshold includes: calculating the distribution weight of each topic based on the number of words assigned to each topic in each text after iterative convergence; presetting a confidence threshold, and including in the fusion tag set the standard classification nodes corresponding to topics whose distribution weights are greater than or equal to the confidence threshold; and, if the fusion tag set is empty, selecting the standard classification node corresponding to the topic with the highest distribution weight.

9. A multi-source heterogeneous data fusion processing system for an industrial internet platform, characterized in that the system comprises: a processor and a memory, wherein the memory stores computer program instructions which, when executed by the processor, implement the multi-source heterogeneous data fusion processing method for the industrial internet platform according to any one of claims 1-8.
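
For illustration, the calculations recited in claims 2 to 4 can be sketched in Python as follows. Several points are assumptions rather than claimed features: the scarcity is taken to be log(N / n_w), where N is the total number of texts and n_w the number of texts containing the word, since the claims do not fix its exact form; the sum over "all words in the text" is taken over distinct words; the shortest path edge counts are supplied as a precomputed table `hops`; and uniform coefficients are returned when the global evidence sum is zero. All names are hypothetical.

```python
import math
from collections import Counter

def guiding_coefficients(texts, hops, nodes):
    """Return, for each text, a dict mapping node -> guiding coefficient.

    texts: list of token lists (already preprocessed)
    hops[word][node]: number of shortest path edges from word to node
    nodes: list of standard classification node identifiers
    """
    n_texts = len(texts)
    # Number of texts containing each word.
    doc_freq = Counter(word for tokens in texts for word in set(tokens))

    result = []
    for tokens in texts:
        evidence = {node: 0.0 for node in nodes}
        for word, freq in Counter(tokens).items():
            importance = freq / len(tokens)
            scarcity = math.log(n_texts / doc_freq[word])  # assumed form
            saliency = importance * scarcity               # claim 2
            for node in nodes:
                decay = 1.0 / (hops[word][node] + 1)       # claim 3
                evidence[node] += saliency * decay         # path association density
        total = sum(evidence.values())                     # global evidence sum
        if total > 0:
            coeffs = {node: score / total for node, score in evidence.items()}
        else:
            # Assumption: fall back to uniform coefficients when no evidence exists.
            coeffs = {node: 1.0 / len(nodes) for node in nodes}
        result.append(coeffs)
    return result
```

By construction the coefficients of each text sum to one, so they can be scaled by the constraint strength coefficient of claim 6 without further normalization.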