A big data management system for industrial digital operation information

By calculating lexical ambiguity and semantic relevance, dynamically correcting word segmentation, and constructing entity degree indicators, the problem of information fragmentation in industrial digital operations is solved, enabling accurate identification of key information and improved decision-making efficiency.

CN121543585B · Active · Publication Date: 2026-06-30 · SHANDONG WEIXUANKANG TECH INNOVATION CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: SHANDONG WEIXUANKANG TECH INNOVATION CO LTD
Filing Date: 2025-12-25
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing technologies struggle to effectively identify and integrate key information from unstructured text data in industrial digital operations, leading to information fragmentation and low decision-making efficiency.

Method used

By calculating the ambiguity and semantic relevance of words, dynamically correcting the word segmentation results, constructing an entity degree index, integrating word frequency and cross-document distribution features, and merging synonyms using an agglomerative hierarchical clustering algorithm, a knowledge graph is formed.

Benefits of technology

It enables accurate identification and filtering of key information in industry data, forming a unified entity database, and improving the level of intelligent operation and decision-making efficiency.

✦ Generated by Eureka AI based on patent content.

Figure CN121543585B_ABST

Abstract

This invention relates to the field of big data management technology, specifically to a big data management system for industrial digital operation information. The system includes: a data acquisition module for acquiring various texts of industrial digital operation information and performing sentence and word segmentation processing; a word segmentation processing module for correcting word segmentation and identifying key entities by calculating the ambiguity and semantic relevance of words, and for determining the lexical information similarity by utilizing the mutual exclusivity between entities, combined with entity degree differences and character differences; and an information management module for extracting feature entities from the text based on the lexical information similarity to manage the industrial digital operation information. This application, through intelligent identification and normalization of key entities, transforms text data into a knowledge graph supporting intelligent decision-making, thereby improving the intelligence level and decision-making efficiency of industrial digital operations.

Description

Technical Field

[0001] This invention relates to the field of big data management technology, specifically to a big data management system for industrial digital operation information.

Background Technology

[0002] In the wave of industrial digital transformation, massive amounts of unstructured text data are generated during production and operation, such as production logs, maintenance reports, shift handover records, technical documents, and meeting minutes. This text data contains key information such as equipment status, process parameters, quality anomalies, and expert experience, and is a valuable asset for optimizing operations, predicting risks, and supporting decision-making. However, this information exists in natural language, is highly fragmented, has inconsistent formats, and is highly specialized, making it impossible for computer systems to directly understand and utilize it, thus forming "data silos" and "information fog."

[0003] Currently, extracting structured knowledge from such texts typically relies on Natural Language Processing (NLP) technology, whose basic processes include text segmentation, entity recognition, and relation extraction. However, in specific industry scenarios, existing general-purpose technical solutions face significant challenges: simple entity recognition methods based on word frequency or dictionaries struggle to distinguish between entities carrying key information and ordinary words in the text; simultaneously, the same entity may have multiple expressions, and existing methods lack effective mechanisms to merge and standardize these synonyms and near-synonyms, leading to information fragmentation and an inability to form a unified knowledge view, thereby reducing the level of intelligence and decision-making efficiency in industrial digital operations.

Summary of the Invention

[0004] To address the aforementioned technical problems, the present invention aims to provide an industrial digital operation information big data management system, the specific technical solution of which is as follows:

[0005] This invention proposes an industrial digital operation information big data management system, the system comprising:

[0006] The data acquisition module is used to acquire various texts of industrial digital operation information;

[0007] The word segmentation module is used to obtain the various adjacent words of each word in all sentences in each text. By analyzing the difference between the frequency of each word co-occurring with any adjacent word in the same sentence and the frequency of each word co-occurring with all other adjacent words in the same sentence, the ambiguity of each word is determined.

[0008] Based on the ambiguity, erroneous words are filtered out from all words in each text; by analyzing the frequency in all texts of each character in each erroneous word and of its next adjacent character, as well as the frequency in all texts of the word formed by each character and its next adjacent character, the semantic relevance between each character in each erroneous word and its next adjacent character is determined, so as to re-segment each erroneous word in each text.

[0009] In each text after the vocabulary re-segmentation, the frequency of each word in all texts and the probability that each word is contained in all sentences in each text are calculated. Combined with the ambiguity and semantic relevance, the entity degree of each word is determined in order to filter out entities from all words.

[0010] Based on the frequency of each entity appearing in all texts, and the frequency of each entity appearing in the same text as any other entity, the mutual exclusivity between each entity and any other entity is determined. Combined with the entity degree difference and character difference between each entity and any other entity, the lexical information similarity between each entity and any other entity is determined.

[0011] The information management module is used to extract feature entities from the text based on the lexical information similarity in order to manage the industrial digital operation information.

[0012] Preferably, the method for determining the degree of ambiguity of each word is as follows:

[0013] The sum of the differences between the frequency of each word co-occurring with any adjacent word in the same sentence and the frequency of each word co-occurring with all other adjacent words in the same sentence is calculated. The sum is multiplied by the total number of all adjacent word types and the result is recorded as the co-occurrence probability difference of each word.

[0014] The total frequency of each word co-occurring with all adjacent words in the same sentence is counted. The ambiguity of each word is positively correlated with the maximum co-occurrence frequency of the word with any of its adjacent words in the same sentence and with the co-occurrence probability difference.

[0015] Preferably, the step of filtering out erroneous words from all words in each text includes:

[0016] Calculate the mean of the ambiguity of all words in each text, and mark words with an ambiguity greater than the mean as incorrect words.

[0017] Preferably, the semantic relevance of each character in each erroneous word to its next adjacent character is given by an expression defined over the following quantities: the semantic relevance between the A-th character in the erroneous word Q and its next adjacent character; the frequency, in all texts, of the word formed by the A-th character in the erroneous word Q and its next adjacent character; the frequency of the A-th character in the erroneous word Q across all texts, and the frequency across all texts of the character immediately following the A-th character; the character difference between the sentence containing each occurrence of the word formed by the A-th character of the erroneous word Q and its next adjacent character and the sentences containing all other occurrences of that word in all texts; M, the total number of occurrences in all texts of the word formed by the A-th character of the erroneous word Q and its next adjacent character; norm(), the normalization function; and a preset constant greater than 0.

[0018] Preferably, the step of re-segmenting each erroneous word in each text includes:

[0019] In each erroneous word, the average of the semantic relevance between each character and its next adjacent character is recorded as the relevance threshold.

[0020] Starting from the first character of each erroneous word, if the semantic relevance of the word formed by the first character and its next adjacent character is greater than the relevance threshold, then proceed to Step 1; otherwise, proceed to Step 2.

[0021] Step 1: Combine the first character with its next adjacent character to form a word, and use this word as the first character. Calculate the semantic relevance of this word with its next adjacent character according to the semantic relevance calculation method. Determine whether the semantic relevance between this word and its next adjacent character is greater than the relevance threshold. If yes, continue to Step 1; otherwise, proceed to Step 2.

[0022] Step 2: Take the second character as the first character, and calculate the semantic relevance of the word formed by the character and its next adjacent character according to the semantic relevance calculation method. Determine whether the semantic relevance between the character and its next adjacent character is greater than the relevance threshold. If yes, continue to Step 1; otherwise, continue to Step 2.

[0023] The above loop continues until the second-to-last character of each erroneous word is reached, resulting in the re-segmentation of each erroneous word. This process iterates through all erroneous words in each text and re-segments the words in each text.

[0024] Preferably, the entity degree of each word is positively correlated with the frequency of each word in all texts, the proportion of all sentences containing each word in each text to the total number of sentences in the text, and the semantic relevance of each word, and negatively correlated with the ambiguity of each word. For words other than erroneous words, their semantic relevance is a preset value.

[0025] Preferably, the step of filtering entities from all words includes:

[0026] In each text after the vocabulary is re-segmented, the average entity degree of all words is recorded as the entity degree threshold, and words with an entity degree greater than or equal to the entity degree threshold are treated as entities.

[0027] Preferably, the method for determining the mutual exclusion degree between each entity and any other entity is as follows:

[0028] Calculate the product of the frequency of each entity in all texts and the frequency of any other entity in all texts. Add the product to a preset factor and record the result as the frequency feature value of the pair of entities.

[0029] The mutual exclusion degree between each entity and any other entity is obtained by dividing the frequency feature value of the pair by the frequency with which the two entities appear in the same text across all texts.

[0030] Preferably, the lexical information similarity between each entity and any other entity is positively correlated with the mutual exclusivity between each entity and any other entity, and negatively correlated with the entity degree difference and character difference.
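The relationships in [0028] to [0030] can be illustrated with a minimal Python sketch. The text only fixes the direction of each correlation, so the concrete ratio and product used here are assumptions, as are the function names, the default `factor` value, and the choice to add `factor` to the denominator to avoid division by zero when two entities never share a text. The character difference is passed in as a precomputed number.

```python
def mutual_exclusion(f_u: float, f_w: float, f_together: float,
                     factor: float = 0.01) -> float:
    """Mutual exclusion degree of two entities (hypothetical form).

    f_u, f_w    -- frequency of each entity in all texts
    f_together  -- frequency with which both appear in the same text
    factor      -- preset factor; its value here is an assumption
    """
    feature = f_u * f_w + factor          # frequency feature value, per [0028]
    # Assumption: the factor is also added below so the ratio is always defined.
    return feature / (f_together + factor)


def lexical_similarity(mutex: float, degree_u: float, degree_w: float,
                       char_diff: float) -> float:
    """Lexical information similarity (hypothetical form).

    Increases with mutual exclusion; decreases with the entity degree
    difference and with the character difference, as stated in [0030].
    """
    degree_gap = abs(degree_u - degree_w)
    return mutex / ((1.0 + degree_gap) * (1.0 + char_diff))
```

Under this sketch, two frequent expressions that almost never occur in the same text, have similar entity degrees, and differ by few characters receive a high similarity, which matches the intuition that authors tend to pick one synonym per document.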

[0031] Preferably, the step of extracting feature entities from the text includes:

[0032] Cluster all entities in all texts, using the absolute difference between lexical information similarities as the distance metric, and output all clusters;

[0033] In each cluster, the mean of the lexical similarity between each entity and all entities is multiplied by the entity degree of each entity, and the result is recorded as the feature value of each entity. The entity with the largest feature value in each cluster is taken as the feature entity.
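The selection rule in [0033] can be sketched as follows, assuming the clusters have already been produced. The function and parameter names are hypothetical, and two details are assumptions not fixed by the text: the mean is taken over the other members of the cluster, and a single-member cluster uses a mean of 1.0 so that its only entity is returned.

```python
from statistics import mean
from typing import Callable, Mapping, Sequence


def feature_entity(cluster: Sequence[str],
                   similarity: Callable[[str, str], float],
                   degree: Mapping[str, float]) -> str:
    """Return the entity with the largest feature value in one cluster."""

    def feature_value(entity: str) -> float:
        # Mean lexical information similarity to the other cluster members.
        others = [similarity(entity, o) for o in cluster if o != entity]
        avg = mean(others) if others else 1.0   # assumption for singletons
        # Feature value = mean similarity x entity degree, per [0033].
        return avg * degree[entity]

    return max(cluster, key=feature_value)


# Toy data (invented for illustration only).
sims = {
    frozenset(("淬火油", "淬火液")): 0.9,
    frozenset(("淬火油", "淬火介质")): 0.5,
    frozenset(("淬火液", "淬火介质")): 0.4,
}
degrees = {"淬火油": 0.8, "淬火液": 0.6, "淬火介质": 0.7}
chosen = feature_entity(["淬火油", "淬火液", "淬火介质"],
                        lambda a, b: sims[frozenset((a, b))], degrees)
```

The chosen entity then serves as the normalized name under which the other expressions in the cluster are merged.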

[0034] The present invention has the following beneficial effects:

[0035] This application proposes and calculates the ambiguity of words and the semantic relevance between characters / words to automatically detect and correct errors in the initial word segmentation. This method does not rely solely on a static dictionary; it also makes dynamic judgments based on the contextual statistical features of the text data itself. It is particularly adept at handling domain-specific terminology and newly emerging compound words, thus providing a high-quality lexical foundation for subsequent processing. Furthermore, by fusing semantic relevance, ambiguity, word frequency, and cross-document distribution features, it constructs the entity degree of words, intelligently distinguishing key information-carrying entities from ordinary words in the text. The entity degree quantifies the importance and uniqueness of words in a specific industry context, making the entity recognition process more accurate, effectively filtering noise, and focusing on elements that are substantially meaningful for operation and management. Furthermore, by calculating the lexical information similarity between entities and employing an agglomerative hierarchical clustering algorithm, this application automatically discovers and merges different textual expressions representing the same real-world object. This solves the common synonym problem in industrial data, forming a unified and clean entity library and laying a solid foundation for building a consistent knowledge system. Finally, through relation extraction and event extraction, this application integrates these into an industrial knowledge graph. This connects and organizes scattered information from countless text fragments, forming networked knowledge describing equipment, processes, events, and causal relationships. The analysis results are integrated and displayed through a visual dashboard, supporting real-time monitoring, intelligent question answering, root cause analysis, and predictive maintenance, thereby improving the intelligence level and decision-making efficiency of industrial digital operations.

Attached Figure Description

[0036] To more clearly illustrate the technical solutions and advantages in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0037] Figure 1 is a block diagram of an industrial digital operation information big data management system provided in one embodiment of this application;

[0038] Figure 2 is a flowchart of an entity screening process provided in one embodiment of this application.

Detailed Implementation

[0039] To further illustrate the technical means and effects adopted by the present invention to achieve its intended purpose, the following, in conjunction with the accompanying drawings and preferred embodiments, details the specific implementation, structure, features, and effects of an industrial digital operation information big data management system proposed according to the present invention. In the following description, different "one embodiment" or "another embodiment" do not necessarily refer to the same embodiment. Furthermore, specific features, structures, or characteristics in one or more embodiments can be combined in any suitable form.

[0040] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains.

[0041] The following description, in conjunction with the accompanying drawings, details the specific solution of the industrial digital operation information big data management system provided by this invention.

[0042] Please refer to Figure 1, which shows a block diagram of an industrial digital operation information big data management system according to an embodiment of the present invention. The system includes: a data acquisition module 101, a word segmentation processing module 102, and an information management module 103.

[0043] The data acquisition module 101 is used to acquire various texts of industrial digital operation information.

[0044] In the wave of industrial digital transformation, massive amounts of valuable unstructured text data are generated every moment on the production and operation site. This data exists in the form of production logs, maintenance reports, shift handover records, technical documents, and meeting minutes, serving as the core carriers of equipment status, process parameters, quality anomalies, and expert experience. However, these valuable knowledge assets are highly dispersed in natural language, with inconsistent formats and strong technicalities, forming "data silos" and "information fog" that are difficult for computers to directly understand and utilize, hindering their effective transformation into insights to support operational decisions. To break down this information barrier, the primary task of this module is to acquire and standardize this multi-source, heterogeneous data from its source, building a high-quality corpus foundation for subsequent in-depth analysis. Specifically:

[0045] First, production logs, technician shift handover records, maintenance reports, R&D test reports, SOP documents, and meeting minutes related to industrial digital operation are acquired. To ensure consistency in subsequent processing, all collected documents undergo format standardization: electronic files in .doc, .pdf, etc. formats are converted to plain text formats such as .txt or .json using text extraction tools; historical scanned documents or handwritten records are converted to text using OCR (Optical Character Recognition) technology. Ultimately, the various texts of industrial digital operation information are obtained. Document format conversion and OCR technologies are well-known in the field and will not be elaborated further.

[0046] Furthermore, after conversion to text, each text is split into independent sentences, and the sentences are then segmented into words. Specifically: first, punctuation marks are used as sentence boundary markers to divide the text into multiple sentences. Then, the jieba Chinese word segmentation tool, combined with a pre-built domain-specific dictionary, is used to perform initial word segmentation on the sentences. This domain-specific dictionary is constructed by collecting professional terms from industry-related process documents, equipment manuals, and expert experience to ensure accurate segmentation of professional compound terms such as "cold rolling temperature," "quenching oil model," and "CNC-1000," laying a solid foundation for subsequent precise analysis.

[0047] The process of dividing text into sentences and using the jieba Chinese word segmentation tool and domain-specific dictionaries to segment text are well-known techniques and will not be elaborated further.

[0048] The word segmentation processing module 102 is used to correct word segmentation and identify key entities by calculating the ambiguity and semantic relevance of words; and to determine the similarity of word information by utilizing the mutual exclusivity between entities and combining entity degree differences and character differences.

[0049] The initial word segmentation results often contain ambiguities. For example, a general-purpose word segmentation tool may incorrectly segment "combined into" as "knot / synthesis" instead of the correct "combined / into". To solve this problem, this embodiment first obtains the various adjacent words of each word in all sentences of each text in which the word appears, and determines the ambiguity degree of each word by analyzing the difference between the co-occurrence frequency of the word with any one adjacent word in the same sentence and its co-occurrence frequency with each of the remaining adjacent words in the same sentence. The core idea is that the degree of variation among a word's adjacent words is closely related to the reliability of its segmentation. Taking any word as an example, if its adjacent words are diverse and vary widely (for example, the general word "temperature" can be collocated with words such as "high temperature", "measurement", and "decrease"), this indicates that the word is an independent and semantically clear word, and the possibility that it was incorrectly segmented is very low. On the contrary, if the adjacent words of the word are highly fixed and vary very little (for example, "synthesis" always appears adjacent to "knot"), this strongly implies that the word is not an independent word but part of a larger word (such as "combined into"), and the risk that it was incorrectly segmented rises sharply. Therefore, by quantifying the degree of variation of such adjacent words, the words most likely to contain segmentation errors can be effectively identified. The specific process is as follows:

[0050] First, in this embodiment, the sum of the differences between the co-occurrence frequency of each word and any adjacent word in the same sentence compared with the co-occurrence frequency of each word and the rest of all kinds of adjacent words in the same sentence is calculated, and the result of multiplying the sum by the total number of types of all adjacent words is recorded as the co-occurrence probability difference of each word;

[0051] Furthermore, the total co-occurrence frequency of each word and various adjacent words in the same sentence is statistically analyzed. The ambiguity degree of each word is positively correlated with the maximum value within the co-occurrence frequency of each word and all kinds of adjacent words in the same sentence and the co-occurrence probability difference.

[0052] It should be understood that a positive correlation means that the dependent variable increases as the independent variable increases and decreases as the independent variable decreases. The specific relationship can be an additive relationship or a multiplicative relationship, etc., which is determined by the actual application and is not specially limited in this application; a negative correlation means that the dependent variable decreases as the independent variable increases and increases as the independent variable decreases, and can be a subtractive relationship or a divisive relationship, etc., which is determined by the actual application.

[0053] Preferably, as one implementation, the ambiguity degree of word G in this embodiment is given by an expression defined over the following quantities: the maximum value among the co-occurrence frequencies of word G with each kind of adjacent word in the same sentence; the co-occurrence probability difference of word G; the total number of kinds of adjacent words of word G; and exp(), the exponential function with the natural constant as its base.

[0054] Ambiguity is an inverse indicator that quantifies the reliability of the initial word segmentation results: the higher the ambiguity, the more likely it is that the word was incorrectly segmented in the initial word segmentation. It is mainly affected by two factors: first, the co-occurrence frequency of the word with its most frequent adjacent word; and second, the dispersion of the distribution over all adjacent word types. Specifically, when a word (such as "synthesis") always co-occurs with one specific word (such as "knot"), and its adjacent words are of very few types and concentrated in distribution, its ambiguity increases significantly. This strongly suggests that the word is not an independent word: the initial segmentation tool incorrectly treated it as a single semantic unit, whereas it is likely part of a larger, more stable compound word (such as "combined into"). Conversely, when a word (such as "temperature") has a wide variety of adjacent words, and its co-occurrence frequency with any single word is not dominant, its ambiguity is very low. This indicates that the word is semantically independent and flexible in context, and that the segmentation tool very likely segmented it correctly. Therefore, the core role of ambiguity is to locate potential segmentation errors caused by splitting fixed collocations, providing a clear target for the subsequent correction steps.

[0055] Furthermore, this embodiment filters out erroneous words from all words in each text based on the aforementioned ambiguity. By analyzing the frequency in all texts of each character in each erroneous word and of its next adjacent character, as well as the frequency in all texts of the word formed by each character and its next adjacent character, the semantic relevance between each character in each erroneous word and its next adjacent character is determined, so as to re-segment each erroneous word in each text. The specific process is as follows:

[0056] First, this embodiment filters out erroneous words from all words in each text based on the aforementioned ambiguity level. Specifically:

[0057] Calculate the mean of the ambiguity of all words in each text, and mark words with an ambiguity greater than the mean as incorrect words.
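The mean-threshold rule in [0057] is simple enough to state directly in code. The following sketch assumes the ambiguity values of one text are available as a dictionary; the function name is hypothetical.

```python
from statistics import mean


def flag_erroneous(ambiguity_by_word: dict[str, float]) -> set[str]:
    """Return the words of one text whose ambiguity exceeds the text's mean."""
    threshold = mean(ambiguity_by_word.values())
    return {w for w, a in ambiguity_by_word.items() if a > threshold}
```

Because the threshold is the mean of the text itself, the rule adapts to each text, but it also flags some words in every text whose ambiguity values are not all equal.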

[0058] Furthermore, this embodiment determines the semantic relevance between each character in each erroneous word and its next adjacent character by analyzing the frequency of each character in all texts and the frequency, in all texts, of the word formed by each character and its next adjacent character. Specifically:

[0059] In one implementation of this embodiment, the semantic relevance between the A-th character in the erroneous word Q and its next adjacent character is given by an expression defined over the following quantities: the semantic relevance between the A-th character in the erroneous word Q and its next adjacent character; the frequency, in all texts, of the word formed by the A-th character in the erroneous word Q and its next adjacent character; the frequency of the A-th character in the erroneous word Q across all texts, and the frequency across all texts of the character immediately following the A-th character; the character difference between the sentence containing each occurrence of the word formed by the A-th character of the erroneous word Q and its next adjacent character and the sentences containing all other occurrences of that word in all texts; M, the total number of occurrences in all texts of the word formed by the A-th character of the erroneous word Q and its next adjacent character; norm(), the normalization function; and a preset constant greater than 0, used to prevent the denominator from being 0. The constant is set manually and is 0.01 in this embodiment. In practical applications, as other implementation methods, implementers can also set it according to specific circumstances. This embodiment does not impose any special restrictions.

[0060] Specifically, for non-erroneous words, their semantic relevance is set to a preset value, which is 1 in this embodiment.

[0061] It should be noted that there are many commonly used methods for measuring the character differences between words. In this embodiment, the edit distance between the j-th occurrence of entity U and the j-th occurrence of entity W in all texts is taken as the character difference between the j-th occurrence of entity U and the j-th occurrence of entity W in all texts. In practical applications, as other implementation methods, implementers may also use other methods such as the reciprocal of cosine similarity to measure the differences between words, depending on the specific circumstances. This embodiment does not impose any special restrictions.

[0062] The calculation process for the edit distance is a well-known technique and will not be elaborated further.

[0063] It should be noted that, unless otherwise specified, in this embodiment, all content involving the measurement of character differences between words uses the edit distance calculation method.
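For reference, the edit distance used as the character difference can be computed with the standard Levenshtein dynamic program. This is a generic textbook sketch, not code taken from the embodiment; the function name is an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

The function works on any pair of strings, so it applies equally to two entity names or to two sentences.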

[0064] It should be noted that there are many commonly used normalization methods. In this embodiment, the maximum-minimum normalization method is used to normalize the semantic relevance values of all pairs of adjacent characters in each erroneous word. In practical applications, as other implementation methods, implementers may also choose other normalization methods according to specific circumstances. This embodiment does not impose any special restrictions on the selection of normalization methods.

[0065] The process of normalizing data using the maximum-minimum normalization method is a well-known technique and will not be elaborated further.
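A minimal sketch of maximum-minimum normalization follows. Mapping a constant list to zeros is an assumed convention for the degenerate case, which the text does not address; the function name is hypothetical.

```python
def min_max_normalize(values: list[float]) -> list[float]:
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumed convention: all values equal, so no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```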

[0066] Semantic relevance measures how strongly adjacent characters or words bind together to form a meaningful vocabulary item. The higher the semantic relevance, the more likely these adjacent elements are to form a meaningful and identifiable word. When a pair of characters (such as "quenching") not only always appears as a fixed collocation but is also widely used across different sentences and documents, and the latter character ("fire") is not a particularly common function word, its semantic relevance will be very high. This reflects that "quenching" is a stable and widely used professional term. Conversely, if two adjacent characters are combined only by accident, appear together only in a very few similar sentences, or the latter character is a high-frequency function word, their semantic relevance will be very low. Therefore, the core role of semantic relevance is to act as a dynamic, statistically based "glue" judgment during word segmentation correction, determining which adjacent characters should be merged into a meaningful word.
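The behavior described in [0066] can be illustrated with a hypothetical pre-normalization score. The exact expression of the embodiment is not recoverable from the text, so the form below is an assumption chosen only to match the stated tendencies: it rises with the frequency of the combined word, falls with the individual character frequencies, rises with the average character difference among the sentences containing the word, and uses the preset constant (0.01 in this embodiment) to keep the denominator nonzero. The sentence differences are passed in precomputed, one per occurrence, and all names are hypothetical.

```python
from statistics import mean


def raw_relevance(f_pair: float, f_first: float, f_next: float,
                  sentence_diffs: list[float], eps: float = 0.01) -> float:
    """Unnormalized semantic relevance of a character pair (hypothetical form).

    f_pair          -- frequency of the word formed by the two characters
    f_first, f_next -- frequencies of the first and the following character
    sentence_diffs  -- per-occurrence character difference between the
                       sentence of that occurrence and the other sentences
    eps             -- preset constant that keeps the denominator nonzero
    """
    # Averaging over the M occurrences; an empty list counts as no spread.
    spread = mean(sentence_diffs) if sentence_diffs else 0.0
    return f_pair / (f_first * f_next + eps) * spread


# Toy numbers (invented): a stable term versus an accidental combination
# whose second character is a high-frequency function word.
stable = raw_relevance(40, 45, 50, [10.0, 14.0, 12.0])
accidental = raw_relevance(2, 45, 900, [1.0, 1.0])
```

In the embodiment the raw values of one erroneous word would then be passed through norm() before being compared with the relevance threshold.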

[0067] Furthermore, based on the aforementioned semantic relevance, each erroneous word in each text is re-segmented, specifically:

[0068] In each erroneous word, the average of the semantic relevance between each character and its next adjacent character is recorded as the relevance threshold.

[0069] Starting from the first character of each erroneous word, if the semantic relevance of the word formed by the first character and its next adjacent character is greater than the relevance threshold, then proceed to Step 1; otherwise, proceed to Step 2.

[0070] Step 1: Combine the first character with its next adjacent character to form a word, and use this word as the first character. Calculate the semantic relevance of this word with its next adjacent character according to the semantic relevance calculation method. Determine whether the semantic relevance between this word and its next adjacent character is greater than the relevance threshold. If yes, continue to Step 1; otherwise, proceed to Step 2.

[0071] Step 2: Take the second character as the first character, and calculate the semantic relevance of the word formed by the character and its next adjacent character according to the semantic relevance calculation method. Determine whether the semantic relevance between the character and its next adjacent character is greater than the relevance threshold. If yes, continue to Step 1; otherwise, continue to Step 2.

[0072] The above loop continues until the second-to-last character of each erroneous word is reached, resulting in the reclassification of each erroneous word. This process iterates through all erroneous words in each text and reclassifies the words in each text.
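The merge loop of Steps 1 and 2 can be sketched as a single left-to-right pass. This is a minimal sketch under stated assumptions: the semantic relevance calculation is passed in as a hypothetical callable `relevance(left, right)`, and the names `resegment`, `current`, and `segments` are illustrative only.

```python
from typing import Callable, List


def resegment(word: str, relevance: Callable[[str, str], float]) -> List[str]:
    """Greedily re-segment one erroneous word.

    `relevance(left, right)` is an assumed stand-in for the semantic
    relevance of the word formed by `left` and its next adjacent character.
    """
    chars = list(word)
    if len(chars) < 2:
        return chars
    # Relevance threshold: mean relevance over all adjacent character pairs.
    pair_scores = [relevance(chars[i], chars[i + 1]) for i in range(len(chars) - 1)]
    threshold = sum(pair_scores) / len(pair_scores)

    segments: List[str] = []
    current = chars[0]
    for nxt in chars[1:]:
        if relevance(current, nxt) > threshold:
            current += nxt            # Step 1: merge and keep extending
        else:
            segments.append(current)  # Step 2: close the segment, restart at nxt
            current = nxt
    segments.append(current)
    return segments
```

With a toy relevance table in which only the pairs ("a", "b") and ("c", "d") score above the threshold, `resegment("abcd", ...)` yields `["ab", "cd"]`.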

[0073] Furthermore, in each re-segmented text, the frequency of each word in all texts and the probability that each word is contained in the sentences of each text are calculated. Combined with the ambiguity and semantic relevance, these determine the entity degree of each word, which is used to filter entities from all words. The specific process is as follows:

[0074] First, the entity degree of each word is determined from the above frequency and sentence-containment probability together with the ambiguity and semantic relevance. Specifically:

[0075] The entity degree of each word is positively correlated with the frequency of each word in all texts, the proportion of all sentences containing each word in each text to the total number of sentences in the text, and the semantic relevance of each word, and negatively correlated with the ambiguity of each word. For words other than erroneous words, their semantic relevance is a preset value.

[0076] Preferably, as one implementation, in this embodiment the entity degree of word U is expressed in terms of the following quantities: the frequency of word U across all texts; the proportion of sentences in text v that contain word U, relative to the total number of sentences in text v; the total number of texts; and the semantic relevance and the ambiguity of word U.
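The exact expression is not reproduced in this text, so the following is only one possible form that satisfies the stated correlations (positive in frequency, sentence coverage, and semantic relevance; negative in ambiguity). The product form, the averaging over texts, the `1 + ambiguity` denominator, and the name `entity_degree` are all assumptions.

```python
from typing import Sequence


def entity_degree(
    frequency: float,
    sentence_ratios: Sequence[float],
    relevance: float,
    ambiguity: float,
) -> float:
    """Assumed entity-degree form, not the patent's confirmed formula.

    sentence_ratios holds, for each text v, the share of sentences in v
    that contain the word; its length is the total number of texts.
    """
    coverage = sum(sentence_ratios) / len(sentence_ratios)
    # Dividing by (1 + ambiguity) keeps the result finite when ambiguity is 0.
    return frequency * coverage * relevance / (1.0 + ambiguity)
```

Under this assumed form, a word that is frequent but concentrated in few texts gets a low coverage term, which pulls its entity degree down as the description requires.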

[0077] Entity degree is a core indicator used to comprehensively evaluate the importance of a word as a key information entity in a specific industry scenario. The higher the entity degree, the more likely the word is to carry key business information rather than being an irrelevant generic word. Entity degree is a multi-dimensional composite function, mainly affected by four factors: the frequency of the word's occurrence in all texts; the number of different texts in which the word appears; semantic stability (reflected by semantic relevance; the more accurate the word segmentation, the more stable the semantics); and word segmentation reliability (reflected by ambiguity; the lower the ambiguity, the higher the reliability). Specifically, if a word (such as "CNC-1000") appears frequently in all documents and is widely distributed across many different reports and logs, its word segmentation result is highly reliable (low ambiguity), and its internal semantics are stable (high relevance), then its entity degree will be extremely high. This reflects that it is a core, stable, and important entity in the entire information system. Conversely, even if a word has a high frequency, if it is concentrated in only a few documents, or if its word segmentation is unreliable and its semantics are ambiguous, its entity degree will be very low. Therefore, the core value of entity degree lies in intelligently filtering out the truly valuable "key roles" from massive amounts of vocabulary, laying the foundation for building a high-quality knowledge base.

[0078] Furthermore, based on the aforementioned entity degree, entities are filtered from all words, specifically:

[0079] In each text after the vocabulary is re-segmented, the average entity degree of all words is recorded as the entity degree threshold, and words with an entity degree greater than or equal to the entity degree threshold are treated as entities.
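The mean-threshold filter described above can be sketched as follows; the function name `filter_entities` and the dictionary input format are assumptions.

```python
from typing import Dict, List


def filter_entities(degrees: Dict[str, float]) -> List[str]:
    """Keep words whose entity degree is at least the mean entity degree."""
    if not degrees:
        return []
    threshold = sum(degrees.values()) / len(degrees)
    return [word for word, degree in degrees.items() if degree >= threshold]
```

For example, with degrees of 3.0, 0.5, and 2.5 the threshold is 2.0, so the word scored 0.5 is dropped and the other two are kept as entities.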

[0080] Preferably, the flowchart of the entity screening process provided in this embodiment is shown in Figure 2.

[0081] Furthermore, based on the frequency of each entity appearing in all texts, and the frequency of each entity appearing in the same text as any other entity, the mutual exclusion degree between each entity and any other entity is determined. Then, combining the entity degree difference and character difference between each entity and any other entity, the lexical information similarity between each entity and any other entity is determined. The specific process is as follows:

[0082] In this embodiment, firstly, based on the frequency of each entity appearing in all texts, and the frequency of each entity appearing in the same text as any other entity, the mutual exclusion degree between each entity and any other entity is determined. Specifically:

[0083] Calculate the product of the frequency of each entity in all texts and the frequency of any other entity in all texts. Add the preset factor to the product and record the result as the frequency feature value of each entity.

[0084] The frequency feature value of each entity is divided by the frequency with which the entity and the other entity appear together in the same text across all texts, and the result is taken as the mutual exclusivity between the two entities.

[0085] It should be noted that the preset factor is used to prevent the denominator from being 0. Its value is set manually. In this embodiment, the preset factor is set to 1. Under the premise of ensuring that the denominator is not 0 and does not excessively affect the calculation result, the implementer can also set it according to the specific situation. This embodiment does not impose any special restrictions.
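A small sketch of the mutual exclusivity computation follows. One point is an interpretation rather than a confirmed detail: the text adds the preset factor to the product, yet states that its purpose is to keep the denominator non-zero, so this sketch follows the stated purpose and adds the factor to the co-occurrence count. The function name and the representation of each text as a list of entity tokens are also assumptions.

```python
from collections import Counter
from typing import Sequence


def mutual_exclusivity(
    texts: Sequence[Sequence[str]],
    u: str,
    w: str,
    preset_factor: float = 1.0,
) -> float:
    """Assumed form: freq(u) * freq(w) / (co-occurrence count + preset_factor)."""
    # Frequency of every entity over all texts.
    freq = Counter(token for text in texts for token in text)
    # Number of texts in which both entities appear together.
    co_occurrence = sum(1 for text in texts if u in text and w in text)
    # Assumption: the preset factor guards the denominator against zero.
    return freq[u] * freq[w] / (co_occurrence + preset_factor)
```

Two entities that never share a text get a small denominator and therefore a high mutual exclusivity, which is the signal later used to spot aliases of the same object.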

[0086] Furthermore, this embodiment determines the lexical information similarity between each entity and any other entity based on the mutual exclusion degree between each entity and any other entity, combined with the entity degree difference and character difference between each entity and any other entity. Specifically:

[0087] In this embodiment, the lexical information similarity between each entity and any other entity is positively correlated with the mutual exclusivity between each entity and any other entity, and negatively correlated with the entity degree difference and character difference.

[0088] Preferably, as one implementation, in this embodiment the lexical information similarity between entity U and entity W is expressed in terms of the following quantities: the mutual exclusivity between entity U and entity W; the absolute difference between the entity degrees of entity U and entity W, which expresses the entity degree difference; the character difference between the j-th occurrence of entity U in all texts and the j-th occurrence of entity W in all texts; J, the smaller of the numbers of times entity U and entity W appear in all texts; and a preset constant greater than 0, used to prevent the denominator from being 0. In this embodiment, the value of the preset constant is 0.01. Provided that the denominator is not zero and the calculation result is not excessively affected, the implementer may also set it according to the specific situation; this embodiment imposes no special restriction.
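As with the entity degree, the exact expression is not reproduced in this text. The sketch below shows one form consistent with the stated correlations: mutual exclusivity in the numerator, and the entity degree difference, the character difference, and the preset constant in the denominator. Using Levenshtein edit distance as the character difference and averaging the J per-occurrence differences are assumptions, as are all function names.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance, used here as an assumed measure of character difference."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def lexical_similarity(
    mutual_excl: float,
    degree_u: float,
    degree_w: float,
    char_diffs: Sequence[float],
    eps: float = 0.01,
) -> float:
    """Assumed form: higher exclusivity raises the score; larger degree or
    character differences lower it. eps keeps the denominator positive."""
    mean_diff = sum(char_diffs) / len(char_diffs) if char_diffs else 0.0
    return mutual_excl / (abs(degree_u - degree_w) + mean_diff + eps)
```

Under these assumptions, two aliases with close entity degrees and near-identical spelling score far higher than an unrelated pair with the same mutual exclusivity.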

[0089] Lexical information similarity is a similarity index used to measure whether two different words refer to the same real-world object. The higher the lexical information similarity between two words, the more likely they are different aliases of the same entity (e.g., "CNC-1000" and "1000th CNC machine tool"). Lexical information similarity is a composite index, mainly influenced by three factors: mutual exclusion (the two words rarely appear together in the same document, indicating that they may be mutually exclusive aliases), entity degree difference (whether the importance of the two words is similar), and character difference (whether the literal forms of the two words are similar). Specifically, when the entity degrees of two words are very close, their literal forms are also very similar, and they exhibit a strong "mutual exclusion" phenomenon in the documents (i.e., documents mentioning A rarely mention B, and vice versa), their lexical information similarity will be very high, which conforms to the business logic of "different aliases of the same entity". Conversely, if two words have significantly different entity degrees, completely different literal forms, or frequently appear side by side in the same document, this indicates that they are two different but related entities, and their lexical information similarity will be very low. Therefore, the core function of lexical information similarity is to achieve entity normalization and disambiguation, automatically clustering and merging synonyms and aliases scattered across different texts to form a unified and clean entity library.

[0090] Thus, this embodiment achieves dynamic self-checking and correction of word segmentation errors in industrial texts by calculating ambiguity and semantic relevance, providing a high-quality lexical foundation for subsequent analysis. Furthermore, it integrates multi-dimensional features to construct entity degree indicators, accurately selects key information entities, and effectively solves the normalization problem of synonyms and aliases by utilizing lexical information similarity. Finally, it constructs a unified and clean entity library, laying the foundation for forming a structured knowledge graph and supporting intelligent decision-making.

[0091] The information management module 103 is used to extract feature entities from the text based on the lexical information similarity, in order to manage the industrial digital operation information.

[0092] Furthermore, this embodiment extracts feature entities from the text based on the lexical information similarity obtained from the word segmentation processing module in order to manage industrial digital operation information. The specific process is as follows:

[0093] In this embodiment, all entities in all texts are clustered, with the absolute difference between lexical information similarities used as the distance metric, and all clusters are output.

[0094] In each cluster, the mean of the lexical information similarity between each entity and all entities is multiplied by the entity degree of that entity, and the result is recorded as the feature value of the entity. The entity with the largest feature value in each cluster is taken as the feature entity.
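Given a cluster, the feature entity selection can be sketched as below. It assumes the mean is taken over the other entities of the same cluster and that pairwise similarities are stored under unordered pairs; the name `select_feature_entity` and the input layout are illustrative only.

```python
from typing import Dict, FrozenSet, Sequence


def select_feature_entity(
    cluster: Sequence[str],
    similarity: Dict[FrozenSet[str], float],
    degree: Dict[str, float],
) -> str:
    """Return the entity maximizing mean similarity times entity degree.

    Assumption: the mean runs over the other members of the same cluster.
    """
    best_entity, best_value = cluster[0], float("-inf")
    for entity in cluster:
        others = [o for o in cluster if o != entity]
        if others:
            mean_sim = sum(similarity[frozenset((entity, o))] for o in others) / len(others)
        else:
            mean_sim = 0.0  # a singleton cluster has no partner to compare with
        value = mean_sim * degree[entity]
        if value > best_value:
            best_entity, best_value = entity, value
    return best_entity
```

The product favors an entity that is both central to its cluster (high mean similarity) and important in the corpus (high entity degree), so it serves as the canonical name for its aliases.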

[0095] It should be noted that there are many commonly used clustering algorithms. In this embodiment, the agglomerative hierarchical clustering algorithm is used to cluster entities. In the actual application process, as other implementation manners, the implementer can also adopt other clustering algorithms according to specific situations, and this embodiment does not make special restrictions.

[0096] Among them, the process of clustering entities by using the agglomerative hierarchical clustering algorithm is a well-known technology and will not be elaborated herein.

[0097] Furthermore, a machine learning algorithm based on rule matching of predefined patterns is adopted to dynamically add attributes to each feature entity from the context or from associated structured data. For example, attributes such as "device type", "production parameters", and "department to which it belongs" are attached to a device entity. The relationships between feature entities are then extracted from text statements to form structured facts, for example the relationship between a device and a fault, <CNC-1000, occurrence, spindle overheat>, or the relationship between a process and its parameters, <quenching process, parameter, 850°C>. Subsequently, the key events in all texts (such as device downtime, process adjustment, and quality abnormality) are extracted, the feature entities and relationships are organized into an event graph, and, combined with context information such as timestamps and production lines, the event sequence and its influence are analyzed to support root cause analysis or predictive maintenance. Each feature entity, relationship, and event is integrated into a knowledge graph to form a queryable and inferable structured knowledge base. Finally, the entity recognition and subsequent analysis results are integrated into the big data management platform, which provides a visual dashboard that displays key entities (such as device status and production indicators) and abnormal alarms in real time.
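The rule-matching part of the relationship extraction can be illustrated with a few regular expressions that turn sentences into triples. The patterns, the sample sentence wording, and the names `RULES` and `extract_triples` are hypothetical; the embodiment does not specify its predefined patterns, and this sketch covers only rule matching, not any learned component.

```python
import re
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical predefined patterns: (compiled pattern, relation label).
RULES = [
    # "<DEVICE-ID> reported <fault>" -> <device, occurrence, fault>
    (re.compile(r"(?P<head>[A-Z]+-\d+) reported (?P<tail>[a-z ]+)"), "occurrence"),
    # "<x> process ... set to <N>°C" -> <process, parameter, value>
    (re.compile(r"(?P<head>[a-z]+ process) .*?set to (?P<tail>\d+°C)"), "parameter"),
]


def extract_triples(sentence: str) -> List[Triple]:
    """Apply every rule to the sentence and collect <head, relation, tail> facts."""
    triples: List[Triple] = []
    for pattern, relation in RULES:
        for match in pattern.finditer(sentence):
            triples.append((match.group("head"), relation, match.group("tail").strip()))
    return triples
```

Applied to "CNC-1000 reported spindle overheat", the first rule yields the triple (CNC-1000, occurrence, spindle overheat), matching the structured fact given as an example above.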

[0098] So far, this embodiment clusters entities based on the lexical information similarity, extracts the most representative feature entities, and dynamically attaches attributes and relationships to them. Finally, an industrial knowledge graph integrating entities, relationships, and events is constructed. The key information and abnormal alarms are displayed in real time through the visual dashboard, converting the massive text data into an intelligent decision-making basis that can support root cause analysis and predictive maintenance, thereby improving the intelligent level and decision-making efficiency of industrial digital operation.

[0099] It should be noted that the above sequence of the embodiments of the present invention is only for description and does not represent the superiority or inferiority of the embodiments. The processes depicted in the drawings do not necessarily require the specific order or continuous order shown to achieve the desired results. In some implementation manners, multitasking and parallel processing are also possible or may be advantageous.

[0100] Each embodiment in this specification is described in a progressive manner, and the same or similar parts between each embodiment can be referred to each other. Each embodiment focuses on the differences from other embodiments.

[0101] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the principles of the present invention should be included within the protection scope of the present invention.

Claims

1. An industrial digital operation information big data management system, characterized in that the system includes:
a data acquisition module, used to acquire various texts of industrial digital operation information;
a word segmentation module, used to: obtain the adjacent words of each word in all sentences of each text; determine the ambiguity of each word by analyzing the difference between the frequency with which the word co-occurs with any one adjacent word in the same sentence and the frequency with which it co-occurs with all other adjacent words in the same sentence; filter out erroneous words from all words in each text based on the ambiguity; determine the semantic relevance of each character in each erroneous word and its next adjacent character by analyzing the frequencies, in all texts, of each character in each erroneous word and of its adjacent character, as well as the frequency, in all texts, of the word formed by each character and its next adjacent character, so as to re-segment each erroneous word in each text; calculate, in each re-segmented text, the frequency of each word in all texts and the probability that each word is contained in all sentences of each text, and, combined with the ambiguity and semantic relevance, determine the entity degree of each word in order to filter entities from all words; determine the mutual exclusivity between each entity and any other entity based on the frequency of each entity in all texts and the frequency with which each entity appears in the same text as any other entity; and determine the lexical information similarity between each entity and any other entity by combining the mutual exclusivity with the entity degree difference and character difference between each entity and any other entity;
an information management module, used to extract feature entities from the text based on the lexical information similarity in order to manage the industrial digital operation information;
wherein the ambiguity of each word is determined as follows: the sum of the differences between the frequency with which each word co-occurs with any one adjacent word in the same sentence and the frequency with which it co-occurs with all other adjacent words in the same sentence is calculated; the sum is multiplied by the total number of adjacent word types, and the result is recorded as the co-occurrence probability difference of the word; the total frequency with which each word co-occurs with all adjacent words in the same sentence is counted; and the ambiguity of each word is positively correlated with the maximum co-occurrence frequency of the word with all adjacent words in the same sentence and with the co-occurrence probability difference;
wherein the semantic relevance of the A-th character in erroneous word Q and its next adjacent character is expressed in terms of the following quantities: the frequency, in all texts, of the word formed by the A-th character of erroneous word Q and its next adjacent character; the frequency of the A-th character of erroneous word Q in all texts and the frequency, in all texts, of the character immediately following the A-th character; the character difference between the sentence containing an occurrence of that word in all texts and the sentences containing all of its other occurrences; M, the total number of occurrences of that word in all texts; norm(), the normalization function; and a preset constant greater than 0;
wherein the lexical information similarity between each entity and any other entity is positively correlated with the mutual exclusivity between each entity and any other entity, and negatively correlated with the entity degree difference and character difference;
wherein extracting feature entities from the text includes: clustering all entities in all texts, with the absolute difference between lexical information similarities used as the distance metric, and outputting all clusters; and, in each cluster, multiplying the mean of the lexical information similarity between each entity and all entities by the entity degree of that entity, recording the result as the feature value of the entity, and taking the entity with the largest feature value in each cluster as the feature entity.

2. The industrial digital operation information big data management system according to claim 1, characterized in that the process of filtering out erroneous words from all words in each text includes: calculating the mean of the ambiguity of all words in each text, and marking words with an ambiguity greater than the mean as erroneous words.

3. The industrial digital operation information big data management system according to claim 1, characterized in that the process of re-segmenting each erroneous word in each text includes: for each erroneous word, the average of the semantic relevance values of every character and its next adjacent character is recorded as the relevance threshold. Starting from the first character of each erroneous word, if the semantic relevance of the word formed by the first character and its next adjacent character is greater than the relevance threshold, proceed to Step 1; otherwise, proceed to Step 2. Step 1: Combine the first character with its next adjacent character to form a word, and treat this word as the new first character. Calculate the semantic relevance of this word and its next adjacent character according to the semantic relevance calculation method, and determine whether it is greater than the relevance threshold. If yes, repeat Step 1; otherwise, proceed to Step 2. Step 2: Take the next character as the new first character, and calculate the semantic relevance of the word formed by this character and its next adjacent character according to the semantic relevance calculation method. Determine whether it is greater than the relevance threshold. If yes, proceed to Step 1; otherwise, repeat Step 2. The loop continues until the second-to-last character of the erroneous word is reached, which completes the re-segmentation of that erroneous word. The process is repeated for all erroneous words in each text, so that the vocabulary of each text is re-segmented.

4. The industrial digital operation information big data management system according to claim 3, characterized in that the entity degree of each word is positively correlated with the frequency of each word in all texts, the proportion of all sentences containing each word in each text to the total number of sentences in the text, and the semantic relevance of each word, and negatively correlated with the ambiguity of each word. For words other than erroneous words, their semantic relevance is a preset value.

5. The industrial digital operation information big data management system according to claim 1, characterized in that the process of filtering entities from all words includes: in each re-segmented text, the average entity degree of all words is recorded as the entity degree threshold, and words with an entity degree greater than or equal to the entity degree threshold are treated as entities.

6. The industrial digital operation information big data management system according to claim 1, characterized in that the mutual exclusivity between each entity and any other entity is determined as follows: calculate the product of the frequency of each entity in all texts and the frequency of any other entity in all texts; add the preset factor to the product and record the result as the frequency feature value of each entity; and divide the frequency feature value of each entity by the frequency with which the entity and the other entity appear together in the same text across all texts, the result being taken as the mutual exclusivity between the two entities.