Target entity business type identification method and system based on multi-source data fusion and dynamic weight adjustment

By using multi-source data fusion and dynamic weight adjustment, the problem of insufficient accuracy and reliability in business type identification caused by fixed weights in existing technologies is solved, and efficient and automated identification and interpretable analysis of the business direction of target entities are achieved.

CN122196370A · Pending · Publication Date: 2026-06-12 · OXFORD INTELLIGENT (HANGZHOU) TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
OXFORD INTELLIGENT (HANGZHOU) TECHNOLOGY CO LTD
Filing Date
2026-03-23
Publication Date
2026-06-12


Abstract

The application provides a target entity business type identification method and system based on multi-source data fusion and dynamic weight adjustment, comprising: collecting original data of a plurality of heterogeneous data sources, wherein the plurality of heterogeneous data sources at least include a financial quantitative time series data source, a technical attribute data source, an operating activity trajectory data source and a text description data source; preprocessing the collected original data to extract feature vectors corresponding to each data source; calculating weight coefficients of each data source in the current analysis scene according to a preset dynamic weight adjustment strategy, wherein the dynamic weight adjustment strategy is generated according to at least one of a data source quality attribute parameter, an industry stage characteristic parameter or an external system configuration parameter; and performing comprehensive weighted calculation based on the feature vectors of each data source and the corresponding weight coefficients to output a business type identification result of a target entity and a corresponding confidence degree. The application realizes automatic identification of the business direction of the target entity.

Description

Technical Field

[0001] This invention relates to the fields of artificial intelligence and information processing technology, and in particular to a method and system for identifying target entity business types based on multi-source data fusion and dynamic weight adjustment.

Background Technology

[0002] In the fields of industry analysis, fintech, and enterprise information services, identifying the core business direction of a target entity (such as a company or institution) is a fundamental task. Current mainstream technical solutions typically achieve this in the following ways: The first approach is text classification based on keyword matching. This approach crawls the target entity's official website text, annual reports, or business registration information to extract business scope descriptions, uses natural language processing to extract keywords, and matches them with a pre-set industry tag library to determine the business type. However, this approach has technical drawbacks: text descriptions often lag behind the entity's actual business development and suffer from broad expressions and keyword ambiguity, resulting in low classification accuracy and difficulty in reflecting the entity's true business focus.

[0003] The second approach is based on statistical analysis techniques using single-dimensional data. For example, analyzing intellectual property data (such as the distribution of patent IPC classification numbers) can infer the direction of technological layout, or the revenue composition in financial reports can quantify revenue sources. However, this approach suffers from a significant information silo effect: while intellectual property data can reflect the direction of technological research and development, it cannot be directly equated with businesses that have already generated stable revenue; and while structured revenue data can reflect the revenue structure, for non-listed companies or entities with incomplete financial disclosures, data acquisition is difficult and time-sensitive.

[0004] The third approach employs a rule-based engine-based fusion analysis technique. This approach attempts to integrate multi-source data through weighted calculations using preset fixed weight rules (e.g., 0.4 weight for revenue structured data and 0.3 weight for intellectual property data). However, this approach suffers from several technical bottlenecks: the weight settings lack dynamic adaptability, failing to respond to fluctuations in data source reliability (e.g., missing or abnormal data sources), changes in industry stage characteristics (e.g., differences in indicator sensitivity between emerging and mature industries), and personalized needs of external system analysis preferences. This results in insufficient robustness of the identification model across different scenarios. Furthermore, when multiple weight adjustment strategies are triggered simultaneously, the existing technology lacks an effective weight fusion and conflict resolution mechanism, which can easily lead to computational logic confusion and affect the stability of the system output.

[0005] Therefore, how to construct a technical solution that can dynamically integrate multi-source heterogeneous data, adaptively adjust the analysis weights of each data source, and automatically correct biases when weights are abnormal, in order to improve the accuracy, reliability, and interpretability of identifying the business direction of target entities, has become a technical problem that urgently needs to be solved in this field.

Summary of the Invention

[0006] To address the aforementioned technical problems, the technical solution adopted by this invention is as follows: According to a first aspect of the present invention, a method for identifying target entity business types based on multi-source data fusion and dynamic weight adjustment is provided, comprising the following steps: Raw data is collected from multiple heterogeneous data sources, including at least financial quantitative time-series data sources, technical attribute data sources, business activity trajectory data sources, and text description data sources.

[0007] The collected raw data is preprocessed to extract the feature vectors corresponding to each data source.

[0008] According to the preset dynamic weight adjustment strategy, the weight coefficient of each data source in the current analysis scenario is calculated; the dynamic weight adjustment strategy is generated based on at least one of the data source quality attribute parameters, industry stage characteristic parameters, or external system configuration parameters.

[0009] Based on the feature vectors and corresponding weight coefficients of each data source, a comprehensive weighted calculation is performed to output the business type identification result of the target entity and the corresponding confidence level.

[0010] According to a second aspect of the present invention, an electronic device is provided, including a processor and a memory; the processor executes the steps of the method described in the first aspect of the present invention by invoking a program or instructions stored in the memory.

[0011] According to a third aspect of the present invention, a computer-readable storage medium is provided that stores a program or instructions that cause a computer to perform the steps of the method described in the first aspect of the present invention.

[0012] This invention overcomes the information limitations of a single data source by fusing multi-source heterogeneous data, thereby improving the accuracy of business type identification; it enhances the dynamic adaptability of the method by using a dynamic weight adjustment strategy to enable weight allocation to respond to changes in data quality, differences in industry stages, and external configuration requirements; it improves identification efficiency through automated processing; and it provides a quantifiable representation of the reliability of the identification results by outputting confidence scores.

[0013] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of the present invention, nor is it intended to limit the scope of the invention. Other features of the invention will become readily apparent from the following description.

Attached Figure Description

[0014] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0015] Figure 1 is a flowchart of a target entity business type identification method based on multi-source data fusion and dynamic weight adjustment provided in an embodiment of the present invention.

Detailed Implementation

[0016] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0017] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. The terminology used herein in the description of this invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and / or" as used herein includes any and all combinations of one or more of the associated listed items.

[0018] It should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although the flowcharts describe the steps as sequential processes, many of these steps can be performed in parallel, concurrently, or simultaneously. Furthermore, the order of the steps can be rearranged. A process can be terminated when its operation is complete, but it may also have additional steps not included in the figures. A process can correspond to a method, function, procedure, subroutine, subprogram, etc.

[0019] The technical problem this invention aims to solve is the shortcomings of existing technologies in multi-source data fusion analysis, such as fixed weights and the inability to adaptively adjust weights based on dynamic changes in data source quality attributes, differences in industry stage characteristics, and personalized configuration requirements of external systems. This invention provides a method and system for identifying target entity business types based on multi-source data fusion and dynamic weight adjustment. It achieves automated identification of the business direction of target entities and dynamically adjusts the analysis weights of each data source according to data source quality attribute parameters, industry stage characteristic parameters, and external system configuration parameters, thereby improving the accuracy, reliability, and interpretability of business direction identification.

To clearly illustrate the technical solution of this invention, the relevant terms are defined as follows:

Target entity: refers to the object being analyzed, including enterprises, institutions, organizations, or other legal or non-legal entities with independent business activities. The target entity is the data acquisition and analysis object of the technical solution of this invention, and its business characteristics are characterized through multi-source heterogeneous data.

[0020] Business type: refers to the category of core economic activities engaged in by the target entity, used to characterize the target entity's main revenue sources, technological layout direction, or business behavior characteristics. Business types can be classified according to industry classification standards, or a custom classification system can be defined according to the application scenario. The technical solution of this invention outputs the business type identification result of the target entity and the corresponding confidence level.

[0021] Quality attribute parameters: These refer to a set of quantitative indicators used to characterize the quality features of the data source itself, including scores for authenticity, completeness, and timeliness. These quality attribute parameters reflect the credibility of the data source at a specific point in time and are one of the fundamental bases for dynamic weight adjustment.

[0022] In this invention, business type identification is essentially a multi-classification problem. The business characteristics of the target entity are quantified and represented through multi-source heterogeneous data sources, forming feature vectors in a high-dimensional feature space. Through a dynamic weight adjustment strategy, the feature vectors from different data sources are weighted and fused, ultimately mapping them to a preset business type label space. The business type label space can be a standardized industry classification system or a classification system customized for a specific application scenario.

[0023] This invention provides a method for identifying target entity business types based on multi-source data fusion and dynamic weight adjustment. As shown in Figure 1, the method includes the following steps: S100: collect raw data from multiple heterogeneous data sources.

[0024] The multiple heterogeneous data sources include at least financial quantitative time-series data sources, technical attribute data sources, business activity trajectory data sources, and text description data sources. Specifically, the financial quantitative time-series data source includes numerical data such as the target entity's revenue composition, profit distribution, and cash flow structure at successive time points. Its technical features are the computability of the data and its time-series attributes.

[0025] The technical attribute data source includes structured or semi-structured data such as the distribution of patent IPC classification numbers, frequency of technical subject terms, and patent citation relationships of the target entity. Its technical feature is that it reflects the dimensional attributes of the technical layout.

[0026] The business activity trajectory data source includes behavioral sequence data such as the target entity's bidding records, product launch events, and supply chain relationships. Its technical feature lies in the spatiotemporal attributes of the behavioral trajectory.

[0027] The text description data source includes natural language text data such as the target entity's company name, business scope, and promotional text, and its technical feature is the extractability of semantic information.

[0028] S200 preprocesses the collected raw data and extracts the feature vectors corresponding to each data source.

[0029] The preprocessing includes:
Data cleaning: remove duplicates, detect and correct outliers, and filter noise from the raw data to ensure the quality of the input data;
Format conversion: convert unstructured or semi-structured data from heterogeneous data sources into a unified vector computing format, including converting text data into numerical representations and standardizing non-standard date formats;
Missing value handling: handle missing data items using mean imputation, interpolation, or deletion strategies to ensure the integrity of the feature vector.
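The cleaning and missing-value steps can be illustrated with a minimal Python sketch. It assumes records are flat dictionaries and shows only exact-duplicate removal and mean imputation for one numeric field; the function name `preprocess` and the record layout are hypothetical and are not taken from the application.

```python
from statistics import mean


def preprocess(records, field):
    """Drop exact duplicate records, then mean-impute missing values of `field`."""
    # Data cleaning: remove exact duplicates while preserving input order.
    seen, unique = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(rec))
    # Missing value handling: mean imputation over the observed values.
    observed = [r[field] for r in unique if r.get(field) is not None]
    fill = mean(observed) if observed else 0.0  # assumption: 0.0 if nothing observed
    for r in unique:
        if r.get(field) is None:
            r[field] = fill
    return unique


rows = [
    {"year": 2023, "revenue": 100.0},
    {"year": 2023, "revenue": 100.0},  # duplicate
    {"year": 2024, "revenue": None},   # missing value
    {"year": 2025, "revenue": 200.0},
]
cleaned = preprocess(rows, "revenue")
```

Interpolation or deletion, which the text also allows, would replace the imputation loop.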

[0030] The feature vector extraction employs corresponding feature engineering methods tailored to the technical characteristics of different data sources, specifically including:

(1) Extract the business revenue feature vector from the financial quantitative time series data source.

[0031] From the target entity's financial reports or structured revenue data, extract the revenue amount, revenue share, and revenue growth rate of each business segment to construct a business revenue feature vector. This vector includes information on the absolute size, relative importance, and growth trend of each business category, used to characterize the target entity's revenue structure.
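As a rough illustration of the revenue feature vector, the sketch below derives the amount, share, and period-over-period growth rate per business segment. The input layout (two dictionaries mapping segment names to revenue) and the choice to report a growth rate of 0.0 when no prior-period figure exists are assumptions made for this example.

```python
def revenue_features(prev, curr):
    """Map each segment to (amount, share of total revenue, growth vs. prior period)."""
    total = sum(curr.values())
    features = {}
    for segment, amount in curr.items():
        share = amount / total if total else 0.0
        base = prev.get(segment, 0.0)
        # Assumption: growth is reported as 0.0 when there is no prior-period base.
        growth = (amount - base) / base if base else 0.0
        features[segment] = (amount, share, growth)
    return features


feats = revenue_features(
    prev={"cloud": 100.0, "hardware": 200.0},
    curr={"cloud": 150.0, "hardware": 150.0},
)
```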

[0032] (2) Extract the patent technology topic distribution vector from the technology attribute data source.

[0033] From the target entity's patent data, multiple features are extracted to characterize the direction of technology layout and the intensity of R&D investment, specifically including:
IPC classification distribution characteristics: count the number of patents under each IPC classification number and construct an IPC classification distribution vector to reflect the breadth of the target entity's patent layout in different technology fields;
Technical topic distribution characteristics: train an LDA topic model on the patent text to extract the probability distribution vector of patent technical topics. Each dimension of the vector corresponds to the probability value of a technical topic and reflects the core technical direction of the target entity;
R&D activity characteristics: calculate the annual growth rate of the number of patents to construct technology R&D activity characteristics, which characterize the target entity's investment trend in technology R&D.

[0034] The aforementioned IPC classification distribution vector, patent technology topic probability distribution vector, and R&D activity characteristics together constitute a set of patent technology topic distribution vectors extracted from the technology attribute data source, which are used to characterize the target entity's technology layout direction and R&D investment intensity.
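Two of these patent features, the IPC count vector and the annual growth rate of patent counts, can be sketched with the standard library alone. Grouping at the four-character IPC subclass level and the fixed `vocabulary` ordering are assumptions of this example; the LDA topic distribution is omitted because it needs a trained topic model.

```python
from collections import Counter


def ipc_distribution(ipc_codes, vocabulary):
    """Count patents per IPC subclass, ordered by a fixed vocabulary."""
    # Assumption: group by the first four characters, e.g. "G06F 16/35" -> "G06F".
    counts = Counter(code[:4] for code in ipc_codes)
    return [counts[subclass] for subclass in vocabulary]


def annual_growth_rates(counts_by_year):
    """Year-over-year growth rate of patent counts, keyed by the later year."""
    years = sorted(counts_by_year)
    rates = {}
    for prev, curr in zip(years, years[1:]):
        base = counts_by_year[prev]
        if base:  # skip years whose predecessor has no patents
            rates[curr] = (counts_by_year[curr] - base) / base
    return rates


vector = ipc_distribution(
    ["G06F 16/35", "G06F 40/30", "H04L 9/32"], ["G06F", "G06N", "H04L"]
)
growth = annual_growth_rates({2022: 10, 2023: 15, 2024: 12})
```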

[0035] (3) Extract business behavior feature vectors from the business activity trajectory data source.

[0036] From the target entity's bidding records, product launch information, and supply chain data, multiple features are extracted to characterize market behavior and position within the industry chain. These features include:
Product operation frequency characteristics: count the occurrence frequency of each product category term in bidding projects, select the product category terms that exceed the preset frequency threshold or fall in the top N% of the frequency distribution, and construct a product operation frequency vector to reflect the activity level of the target entity in various product markets;
Supply chain association characteristics: extract the main customers and suppliers of the target entity from the supply chain association data. The main customers and suppliers are partners whose share of transaction amount exceeds a preset threshold or whose transaction frequency ranks in the top 10%. Analyze the industry distribution of these partners and construct a supply chain association feature vector to reflect the upstream and downstream position of the target entity in the supply chain and the distribution of related industries;
Business activity level characteristics: perform time series analysis of product launch events to extract business activity level characteristics, which characterize the frequency and trend of the target entity's behavior in the market.

[0037] The aforementioned product operation frequency vector, supply chain association feature vector, and business activity level characteristics together constitute a set of business behavior feature vectors extracted from the business activity trajectory data source, which are used to characterize the market behavior characteristics and industry chain position of the target entity.
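The product operation frequency vector can be sketched as a thresholded term count. The function below is illustrative only: it assumes product category terms have already been extracted from bidding records, and it reads "exceed the preset frequency threshold" as a strict comparison.

```python
from collections import Counter


def product_frequency_vector(bid_terms, threshold=5):
    """Keep product category terms whose count exceeds `threshold`."""
    counts = Counter(bid_terms)
    # Strict ">" is an interpretation of "exceed"; sorting gives a stable order.
    kept = sorted(term for term, n in counts.items() if n > threshold)
    return kept, [counts[term] for term in kept]


terms = ["server"] * 6 + ["router"] * 5 + ["cable"] * 2
kept_terms, freq_vector = product_frequency_vector(terms)
```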

[0038] (4) Extract text semantic embedding vectors from text description data sources.

[0039] From the natural language text of the target entity, such as its company name, business scope, and promotional materials, multiple features are extracted to characterize its self-description and market positioning. These features include:
Keyword features: the TF-IDF algorithm is used to calculate the word weights of the text description. Words whose TF-IDF weight exceeds the preset weight threshold, or that rank in the top 20 by weight, are selected to construct bag-of-words feature vectors, which capture the distinctive core words in the text;
Semantic embedding features: the text description is encoded by a pre-trained language model to extract the text semantic embedding vector. Each dimension of the vector corresponds to a latent feature in the semantic space and captures the deep semantic information of the text.

[0040] The bag-of-words feature vectors and text semantic embedding vectors mentioned above together constitute a set of text semantic embedding vectors extracted from text description data sources, which are used to characterize the self-descriptive features and market positioning of the target entity.
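The keyword feature can be illustrated with a small standard-library TF-IDF ranking. The smoothed IDF variant used here, the pre-tokenized input, and the function name `top_tfidf_terms` are assumptions of this sketch; the application does not specify a particular TF-IDF formulation. The semantic embedding feature is not shown because it depends on a pre-trained language model.

```python
import math
from collections import Counter


def top_tfidf_terms(doc_tokens, corpus, k=20):
    """Rank the terms of one document by TF-IDF and return the top k."""
    n_docs = len(corpus)
    tf = Counter(doc_tokens)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for doc in corpus if term in doc)
        # Assumption: smoothed inverse document frequency.
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[term] = (count / len(doc_tokens)) * idf
    # Sort by descending score, then alphabetically for deterministic ties.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


corpus = [
    ["cloud", "storage", "services"],
    ["cloud", "software"],
    ["steel", "trading", "services"],
]
ranked = top_tfidf_terms(corpus[0], corpus, k=2)
```

In this toy corpus "storage" ranks first because it appears in only one document.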

[0041] It should be noted that the aforementioned preset thresholds (frequency threshold, percentage threshold, TF-IDF weight threshold, etc.) can be dynamically adjusted according to the data scale, industry characteristics, and application scenarios. In a preferred embodiment, the frequency threshold is set to 5 times, the transaction amount percentage threshold is set to 5%, and the TF-IDF weight threshold is set to 0.1. In another embodiment, an adaptive threshold algorithm can be used to automatically determine the threshold based on the quantiles of the data distribution. For example, the top 20% of the frequency distribution can be defined as high-frequency, and the top 10% of the transaction amount percentage distribution can be defined as major customers.
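One simple reading of the adaptive threshold is to take the smallest value that still falls within the top fraction of the observed distribution. The sketch below follows that reading; the rounding rule (truncate, but keep at least one item) is an assumption.

```python
def top_fraction_threshold(values, fraction):
    """Smallest value that still falls within the top `fraction` of `values`."""
    if not values:
        raise ValueError("values must not be empty")
    ranked = sorted(values, reverse=True)
    # Assumption: truncate the cut-off index but always keep at least one item.
    keep = max(1, int(len(ranked) * fraction))
    return ranked[keep - 1]


frequencies = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
high_frequency_cutoff = top_fraction_threshold(frequencies, 0.2)
```

Terms with a frequency at or above `high_frequency_cutoff` would then be treated as high-frequency.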

[0042] The extracted multi-dimensional feature vectors will serve as the quantitative representation of the target entity in the multi-source data space. In step S300, dynamic weight coefficients will be assigned to the feature vectors of different data sources based on the credibility assessment parameters or industry stage feature parameters of each data source. In step S400, a comprehensive weighted calculation will be performed on each feature vector based on the weight coefficients, and the business type identification result and corresponding confidence level of the target entity will be output.

[0043] S300 calculates the weight coefficients of each data source in the current analysis scenario based on a preset dynamic weight adjustment strategy.

[0044] The dynamic weight adjustment strategy is generated based on at least one of the following: data source quality attribute parameters, industry stage characteristic parameters, or external system configuration parameters. When multiple parameters exist simultaneously, the final strategy is determined according to the following priority order: external system configuration parameters have the highest priority, followed by data source quality attribute parameters, and industry stage characteristic parameters serve as the basic adjustment strategy.

[0045] Specifically, this invention supports the independent or combined application of three strategies, with the following selection rules:

(1) Strategy selection priority

If weight configuration parameters are received from an external system via an application programming interface, the adjustment strategy based on the external system's configuration parameters is adopted first to meet the external system's personalized analysis needs.
If no external system configuration parameters are received, but a significant change in the quality attribute parameters of any data source is detected (such as fluctuations in authenticity, completeness, or timeliness scores exceeding preset thresholds), the adjustment strategy based on the data source quality attribute parameters is triggered.
If no configuration parameters from the external system are received and the quality attribute parameters of the data source have not changed significantly, but the stage characteristic parameters of the target entity's industry have changed (such as updates to indicators like industry growth rate and market concentration), the adjustment strategy based on the industry stage characteristic parameters is triggered.
If none of the above conditions is triggered, the currently effective weighting coefficients are used.

[0046] (2) Multi-strategy conflict handling

When multiple strategies simultaneously meet the triggering conditions, the final strategy is determined according to the following priority order: adjustment strategies based on external system configuration parameters > adjustment strategies based on data source quality attribute parameters > adjustment strategies based on industry stage characteristic parameters. That is, external system configuration has the highest priority, followed by data source quality attributes, and industry stage characteristics serve as the basic adjustment strategy.

[0047] (3) Strategy execution

After determining the appropriate strategy based on the above rules, the process proceeds to the corresponding sub-step to calculate the weight coefficients. This invention provides three illustrative embodiments, each corresponding to a specific implementation of one of the three strategies.
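The selection and conflict-handling rules above reduce to a fixed priority chain, which can be sketched as follows. The function name, the boolean trigger flags, and the returned labels are hypothetical; detecting the triggers themselves is outside this sketch.

```python
def select_strategy(external_config, quality_changed, industry_changed):
    """Return the label of the strategy to run, following the stated priority."""
    if external_config is not None:
        return "external_config"   # highest priority
    if quality_changed:
        return "data_quality"      # second priority
    if industry_changed:
        return "industry_stage"    # basic adjustment strategy
    return "keep_current"          # no trigger: reuse the effective weights


# All three triggers fire at once: the external configuration wins.
chosen = select_strategy({"finance": 0.5}, quality_changed=True, industry_changed=True)
```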

[0048] Example 1: Adjustment strategy based on data source quality attribute parameters

This embodiment dynamically adjusts the weighting coefficients based on the quality attributes of each data source, ensuring that higher-quality data sources receive higher analysis weights.

S310: obtain the score of each data source on the preset credibility assessment dimensions.

[0049] The preset credibility assessment dimensions include authenticity, completeness, and timeliness. The score for each dimension is calculated as follows. The authenticity score is a weighted sum of the cross-validation score, the data source tracing score, and the logical verification score: $t_i = k_1 \times t_{cross} + k_2 \times t_{trace} + k_3 \times t_{logic}$, where $t_i$ is the authenticity score of the i-th data source, i ranges from 1 to n, n is the number of data sources, $t_{cross}$ is the cross-validation score, $t_{trace}$ is the data source tracing score, and $t_{logic}$ is the logical verification score. Each score lies in [0, 1]. $k_1$ to $k_3$ are preset coefficients. In an illustrative embodiment, $k_1 = 0.4$ and $k_2 = k_3 = 0.3$.
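As a worked illustration of the authenticity formula, the sketch below applies the illustrative coefficients 0.4, 0.3, and 0.3 to three sub-scores. The function name and the sample inputs are made up for the example.

```python
def authenticity_score(t_cross, t_trace, t_logic, k=(0.4, 0.3, 0.3)):
    """Weighted sum of cross-validation, source tracing, and logic check scores."""
    return k[0] * t_cross + k[1] * t_trace + k[2] * t_logic


# 0.4 * 0.8 + 0.3 * 1.0 + 0.3 * 0.5 = 0.77
t_i = authenticity_score(t_cross=0.8, t_trace=1.0, t_logic=0.5)
```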

[0050] In this invention, cross-validation scores are used to measure the degree of consistency between the content of a data source and other independent data sources. The determination method is as follows: For the data source to be evaluated, select at least two other independent data sources as validation benchmarks. For example, when evaluating the authenticity of a financial data source, supply chain data sources and tax announcement data sources can be selected as cross-validation benchmarks.

[0051] Key information items in the data source to be evaluated (such as operating revenue, number of patents, bidding amount, etc.) are compared with corresponding information items in other data sources.

[0052] Count the number of consistent information items and calculate the consistency rate.

[0053] The calculation formula is: $t_{cross} = N_{match} / N_{total}$, where $N_{match}$ is the number of information items that match other data sources and $N_{total}$ is the total number of information items included in the comparison.

[0054] In a preferred embodiment, if there are multiple validation benchmark data sources, the average value of the validation results of each benchmark is taken as the final cross-validation score.
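The consistency rate and the averaging over several benchmarks can be sketched as below. The sketch assumes each data source is a dictionary of key information items, compares only the items a benchmark actually contains, and uses exact equality as the match test; a real system would likely need tolerance-based comparison.

```python
def cross_validation_score(candidate, benchmarks):
    """Average, over benchmarks, of the share of shared items that match."""
    rates = []
    for bench in benchmarks:
        shared = [key for key in candidate if key in bench]
        if not shared:
            continue  # assumption: a benchmark with no comparable items is skipped
        matches = sum(1 for key in shared if candidate[key] == bench[key])
        rates.append(matches / len(shared))  # N_match / N_total for this benchmark
    return sum(rates) / len(rates) if rates else 0.0


financial = {"revenue": 100, "patents": 12, "bids": 3}
supply_chain = {"revenue": 100, "patents": 10}  # 1 of 2 items match
tax_notice = {"revenue": 100, "bids": 3}        # 2 of 2 items match
t_cross = cross_validation_score(financial, [supply_chain, tax_notice])
```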

[0055] In this invention, the data source traceability score is used to measure the authority and traceability of the data source. The determination method includes:

1. Establish a tiered scoring standard based on the authority of the data publishing organization and the traceability of the data:
If the data comes from authoritative channels such as official government agencies or legally mandated disclosure platforms of stock exchanges, the traceability score is 1.0;
If the data comes from authoritative industry associations, international organizations, or well-known third-party data service providers, the traceability score is 0.8;
If the data comes from general industry websites or local information platforms, the traceability score is 0.5;
If the data originates from self-media or unverified online sources, the traceability score is 0.2;
If the source of the data cannot be traced or is unknown, the traceability score is 0.0.

[0056] 2. Based on the actual source of the data source, query the authority level of that source and directly map it to obtain the source tracing score.

[0057] 3. If the data source contains multiple sub-sources (for example, financial data comes from both company annual reports and securities firm research reports), the weighted average of the scores of each sub-source is taken as the final source tracing score, and the weights are determined according to the proportion of data volume.
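The tier lookup and the volume-weighted average over sub-sources can be sketched as follows. The tier labels in `TRACE_SCORES` are hypothetical names for the five levels described above, and record counts stand in for "proportion of data volume".

```python
# Hypothetical tier labels mapped to the scores given in the text.
TRACE_SCORES = {
    "official": 1.0,
    "authoritative_third_party": 0.8,
    "general_site": 0.5,
    "unverified": 0.2,
    "unknown": 0.0,
}


def traceability_score(sub_sources):
    """Volume-weighted average tier score; sub_sources is [(tier, record_count)]."""
    total = sum(count for _, count in sub_sources)
    if total == 0:
        return 0.0
    return sum(TRACE_SCORES[tier] * count for tier, count in sub_sources) / total


# 300 records from annual reports, 100 from brokerage research reports.
t_trace = traceability_score([("official", 300), ("authoritative_third_party", 100)])
```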

[0058] In this invention, the logical verification score is used to measure whether the data content conforms to basic business logic and numerical patterns. The determination method includes:

1. Pre-set corresponding logical validation rules for different types of data sources:
The validation rules for financial data include: whether the sum of the revenue shares of each business line equals 100%, whether the profit is less than the revenue, and whether the growth rate is within a reasonable range;
The validation rules for patent data include: whether the patent application date is earlier than the grant date, and whether the IPC classification number conforms to the conventional distribution of the technical field;
The validation rules for bidding data include: whether the winning bid amount is within a reasonable range, and whether the winning bid time conforms to the project timeline;
The validation rules for text description data include: whether the company name contains technical terms not mentioned in the business scope (which may indicate inconsistencies), etc.

[0059] 2. Execute all applicable logical validation rules on the data source to be evaluated and record the number of rules that pass.

[0060] The formula for calculating the logical verification score is: $t_{logic} = R_{pass} / R_{total}$.

[0061] Here, $R_{pass}$ is the number of rules that passed the validation and $R_{total}$ is the total number of applicable rules.

[0062] 4. For critical logical rules (such as the sum of financial data percentages not being equal to 1), a veto mechanism can be set up. If such a rule fails, $t_{logic}$ is set directly to 0.
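The pass-rate calculation with a veto for critical rules can be sketched as below. Representing each rule as a `(check, is_critical)` pair and the two sample rules are assumptions made for illustration.

```python
def logic_check_score(record, rules):
    """Share of rules passed; any failed critical rule vetoes the score to 0."""
    passed = 0
    for check, is_critical in rules:
        ok = bool(check(record))
        if not ok and is_critical:
            return 0.0  # veto mechanism
        passed += ok
    return passed / len(rules) if rules else 0.0


financial_rules = [
    (lambda r: abs(sum(r["shares"]) - 1.0) < 1e-9, True),  # critical rule
    (lambda r: r["profit"] < r["revenue"], False),
]
ok_record = {"shares": [0.6, 0.4], "profit": 120, "revenue": 100}
bad_record = {"shares": [0.6, 0.3], "profit": 20, "revenue": 100}
score_ok = logic_check_score(ok_record, financial_rules)
score_bad = logic_check_score(bad_record, financial_rules)
```

Here `ok_record` passes the critical rule but fails the profit check, so it scores 0.5, while `bad_record` fails the critical rule and is vetoed.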

[0063] The completeness score is calculated based on the proportion of missing data and the completeness of key information items. The calculation formula is: $c_i = 1 - \frac{m_{miss}}{m_{total}} \times b_1 - \frac{k_{miss}}{k_{total}} \times b_2$, where $c_i$ is the completeness score of the i-th data source, $m_{miss}$ is the number of missing data items, $m_{total}$ is the total number of data items, $k_{miss}$ is the number of missing key information items, $k_{total}$ is the total number of key information items, and $b_1$ and $b_2$ are preset coefficients. In an illustrative embodiment, $b_1 = 0.6$ and $b_2 = 0.4$.
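A short worked example of the completeness score follows. It reads the first penalty term as the ratio of missing to total data items, consistent with "the proportion of missing data"; that reading and the sample numbers are assumptions of this sketch.

```python
def completeness_score(m_miss, m_total, k_miss, k_total, b1=0.6, b2=0.4):
    """1 minus weighted penalties for missing items and missing key items."""
    return 1 - (m_miss / m_total) * b1 - (k_miss / k_total) * b2


# 10 of 100 items missing, 1 of 5 key items missing:
# 1 - 0.1 * 0.6 - 0.2 * 0.4 = 0.86
c_i = completeness_score(m_miss=10, m_total=100, k_miss=1, k_total=5)
```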

[0064] The timeliness score is calculated with an exponential decay function of the time elapsed since the most recent update: $f_i = e^{-\lambda (T_{now} - T_{last})}$, where $f_i$ is the timeliness score of the i-th data source, $\lambda$ is the preset decay rate, $T_{now}$ is the current time, and $T_{last}$ is the time the data was last updated.
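The decay can be illustrated as below. The time unit (days) and the decay rate of 0.01 per day are assumptions chosen for the example; the text does not fix either.

```python
import math
from datetime import datetime

def timeliness_score(t_last: datetime, t_now: datetime, decay_rate: float) -> float:
    """f_i = exp(-lambda * (T_now - T_last)), with elapsed time in days."""
    elapsed_days = (t_now - t_last).total_seconds() / 86400.0
    # A timestamp in the future is treated as "just updated" (assumption).
    return math.exp(-decay_rate * max(elapsed_days, 0.0))

# Data last updated 30 days ago, assumed decay rate 0.01 per day.
f = timeliness_score(datetime(2026, 1, 1), datetime(2026, 1, 31), 0.01)
```

With these values the score is exp(-0.3), roughly 0.74; a source updated at the current time scores exactly 1.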

[0065] The overall credibility score is calculated as a weighted product: $r_i = t_i^{\alpha} \times c_i^{\beta} \times f_i^{\gamma}$,

[0066] where $r_i$ is the overall credibility of the i-th data source, $t_i$, $c_i$ and $f_i$ are its authenticity, completeness and timeliness scores, and $\alpha$, $\beta$ and $\gamma$ are preset weighting coefficients satisfying $\alpha + \beta + \gamma = 1$. In a preferred embodiment, $\alpha = 0.4$, $\beta = 0.3$ and $\gamma = 0.3$, reflecting the principle of prioritizing authenticity.
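A minimal sketch of the weighted product, using the preferred exponents from the text. The input scores are illustrative.

```python
def credibility(t: float, c: float, f: float,
                alpha: float = 0.4, beta: float = 0.3, gamma: float = 0.3) -> float:
    """r_i = t^alpha * c^beta * f^gamma, with alpha + beta + gamma = 1."""
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("exponents must sum to 1")
    return (t ** alpha) * (c ** beta) * (f ** gamma)

r_typical = credibility(0.8, 0.9, 0.7)
r_vetoed = credibility(0.0, 0.9, 0.9)   # a zero in any dimension gives 0
```

Because the dimensions are multiplied rather than added, a score of 0 in any one of them (for instance a vetoed logic check) drives the overall credibility to 0.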

[0067] S312, based on the comprehensive credibility score of each data source, adjust the preset basic weight coefficients to obtain the preliminary adjusted weights.

[0068] The basic weight coefficients are the preset initial weights of the data sources; the basic weight of the i-th data source is denoted $w^{base}_i$. The preliminary adjusted weight $w^{init}_i$ of the i-th data source is calculated as: $w^{init}_i = w^{base}_i \times r_i$.

[0069] If the overall credibility score $r_i$ of a data source is 0, its preliminary adjusted weight is 0 and the data source does not participate in subsequent weight allocation.

[0070] S313, normalize the preliminary adjusted weights to obtain the weight coefficients of each data source after adjustment based on the data source quality attribute parameters.

[0071] The normalization formula is: $w^{q}_i = w^{init}_i / \sum_{j=1}^{n} w^{init}_j$,

[0072] where $w^{q}_i$ is the weight coefficient of the i-th data source after adjustment based on the data source quality attribute parameters, satisfying $\sum_{i=1}^{n} w^{q}_i = 1$, and $w^{init}_j$ is the preliminary adjusted weight of the j-th data source, with j ranging from 1 to n.
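Steps S312 and S313 together can be sketched as follows. The base weights and credibility scores are hypothetical values for four data sources.

```python
from typing import List, Sequence

def quality_adjusted_weights(base: Sequence[float],
                             credibility: Sequence[float]) -> List[float]:
    """Scale base weights by credibility (S312), then normalize (S313)."""
    init = [w * r for w, r in zip(base, credibility)]
    total = sum(init)
    if total == 0.0:
        raise ValueError("all data sources have zero credibility")
    return [w / total for w in init]

base_weights = [0.4, 0.3, 0.2, 0.1]          # hypothetical w_base
credibility_scores = [0.9, 0.0, 0.5, 0.8]    # hypothetical r_i
w_q = quality_adjusted_weights(base_weights, credibility_scores)
```

The second source has zero credibility, so its weight stays 0 after normalization and the remaining weight is shared among the other three.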

[0073] Example 2: Adjustment strategy based on industry stage characteristic parameters. This embodiment dynamically adjusts the weighting coefficients according to the development stage of the industry in which the target entity is located, so that the weight allocation is adapted to the business characteristics of different industry stages.

[0074] S320: Obtain quantitative values of multiple stage characteristic indicators of the industry to which the target entity belongs.

[0075] The multiple stage characteristic indicators include at least two of the following: Time-series data on industry growth rate: the average annual revenue growth rate of the industry to which the target entity belongs over the past three years; Temporal characteristics of entity lifecycle: the number of years the target entity has been established; Structured characteristic data of R&D investment ratio: the proportion of the target entity's R&D investment to its operating revenue in the past year; Quantitative characteristics of market concentration: CR5 market concentration index of the industry to which the target entity belongs.

[0076] S321: Normalize the quantitative values to construct an industry stage feature vector.

[0077] The normalization uses the min-max normalization formula: $x^{norm}_{uv} = [x_{uv} - \min(x_v)] / [\max(x_v) - \min(x_v)]$,

[0078] where $x_{uv}$ is the original value of the u-th target entity on the v-th stage characteristic indicator, $x^{norm}_{uv}$ is the normalized value, whose range is [0, 1], and $\max(x_v)$ and $\min(x_v)$ are the maximum and minimum values of all target entities on the v-th stage characteristic indicator. u ranges from 1 to p, where p is the number of target entities, and v ranges from 1 to q, where q is the number of stage characteristic indicators.

[0079] For the u-th target entity, its normalized values across all q indicators form the industry stage feature vector $s_u = [x^{norm}_{u1}, x^{norm}_{u2}, ..., x^{norm}_{uq}]$. This vector characterizes the position of the target entity in the industry stage feature space.
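The column-wise normalization and the resulting feature vectors can be sketched as below. The raw indicator values are invented, and mapping a constant column to 0.0 is an assumption, since the text does not cover the case max = min.

```python
from typing import List, Sequence

def min_max_normalize(raw: Sequence[Sequence[float]]) -> List[List[float]]:
    """raw[u][v] is the value of entity u on indicator v; returns s_u per row."""
    columns = list(zip(*raw))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    result = []
    for row in raw:
        result.append([
            0.0 if hi == lo else (x - lo) / (hi - lo)  # constant column: assumed 0.0
            for x, lo, hi in zip(row, lows, highs)
        ])
    return result

# Columns: growth rate, years established, R&D ratio, CR5 (hypothetical data).
raw = [
    [0.10, 2, 0.05, 0.30],
    [0.30, 10, 0.15, 0.60],
    [0.20, 6, 0.10, 0.45],
]
stage_vectors = min_max_normalize(raw)
```

Each row of `stage_vectors` is one entity's feature vector; the first entity sits at the minimum of every indicator and the second at the maximum.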

[0080] S322, Input the industry stage feature vector into the pre-trained stage classification model, and output the industry stage category of the target entity.

[0081] The stage classification model employs a support vector machine, random forest, or neural network classifier, trained using historical enterprise data labeled with industry stage categories as the training set. These industry stage categories include startup, growth, maturity, or decline.

[0082] After the normalized values of the stage characteristic indicators are obtained, a comprehensive industry stage score $S_{life}$ can further be calculated to quantify the development stage of the industry in which the target entity operates. The score is a weighted sum: $S_{life} = c1 \times G_{norm} + c2 \times (1 - Y_{norm}) + c3 \times R_{norm} + c4 \times (1 - C_{norm})$, where $G_{norm}$ is the normalized industry growth rate, $Y_{norm}$ is the normalized number of years the target entity has been established, $R_{norm}$ is the normalized ratio of R&D investment to operating revenue, and $C_{norm}$ is the normalized market concentration index.

[0083] c1 to c4 are the weight coefficients corresponding to each indicator, and they satisfy c1+c2+c3+c4=1. In an illustrative embodiment, each weight coefficient can be set to c1=0.3, c2=0.25, c3=0.25, and c4=0.2.

[0084] In the above formula, the establishment period indicator and the market concentration indicator enter as the complements $1 - Y_{norm}$ and $1 - C_{norm}$, because the longer an enterprise has been established and the higher its market concentration, the more mature its industry stage usually is. Taking the complement keeps the relationship between the overall score and the industry stage consistent (the higher the score, the earlier the stage).

[0085] The industry stage overall score $S_{life}$ takes values in [0, 1]. Based on this score, the industry stage category of the target entity can be determined: startup stage: $S_{life} \in [0.7, 1.0]$; growth stage: $S_{life} \in [0.4, 0.7)$; maturity stage: $S_{life} \in [0.2, 0.4)$; decline stage: $S_{life} \in [0, 0.2)$.
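The score and the interval lookup can be sketched as follows, using the illustrative coefficients c1 to c4 given above. The normalized inputs are made-up values.

```python
from typing import Sequence

def stage_score(g: float, y: float, r: float, c: float,
                coeffs: Sequence[float] = (0.3, 0.25, 0.25, 0.2)) -> float:
    """S_life = c1*G + c2*(1 - Y) + c3*R + c4*(1 - C), inputs already in [0, 1]."""
    c1, c2, c3, c4 = coeffs
    return c1 * g + c2 * (1.0 - y) + c3 * r + c4 * (1.0 - c)

def stage_category(s_life: float) -> str:
    """Map S_life to a stage using the intervals in the text."""
    if s_life >= 0.7:
        return "startup"
    if s_life >= 0.4:
        return "growth"
    if s_life >= 0.2:
        return "maturity"
    return "decline"

# Fast-growing industry, young entity, high R&D ratio, low concentration:
# 0.3*0.8 + 0.25*0.9 + 0.25*0.9 + 0.2*0.8 = 0.85
s_life = stage_score(g=0.8, y=0.1, r=0.9, c=0.2)
category = stage_category(s_life)
```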

[0086] S323: Call the preset adjustment coefficient matrix corresponding to the currently determined industry stage category and adjust the preset basic weight vector to obtain the initial adjustment weights.

[0087] The preset adjustment coefficient matrix is denoted $A^{stage}$ and is a g×n matrix, where g is the number of industry stage categories and n is the number of data sources. Each row of the matrix corresponds to an industry stage category, and each column corresponds to a data source. The matrix element $\alpha_{di}$ is the adjustment factor for the i-th data source under the d-th industry stage category, where d ranges from 1 to g.

[0088] The adjustment coefficient vectors corresponding to each stage can be preset according to industry characteristics, for example: Startup stage: $\alpha_{start}$ = [0.25, 1.33, 1.5, 2.0, 0, 0], used to weaken the weight of financial data and strengthen the weight of intellectual property and product operation data; Growth stage: $\alpha_{grow}$ = [0.75, 1.17, 1.25, 1.5, 0, 0], used to appropriately increase the weight of non-financial data sources; Maturity stage: $\alpha_{mature}$ = [1.0, 1.0, 1.0, 1.0, 0, 0], used to maintain the baseline weight; Decline stage: $\alpha_{decline}$ = [0.5, 0.67, 1.0, 0.5, 1.0, 1.5], used to de-emphasize financial and intellectual property data and to enable the supply chain and public opinion data sources.

[0089] In step S323, based on the industry stage category d output in step S322, the corresponding row vector $\alpha_d$ is extracted from the preset adjustment coefficient matrix and multiplied element-wise with the preset base weight vector $w^{base}$ to obtain the initial adjustment weight vector $w^{init}_d = w^{base} \odot \alpha_d$, where ⊙ denotes element-wise multiplication.

[0090] S324, normalize the initial adjustment weights to obtain the weight coefficients of each data source after adjustment based on industry stage characteristic parameters.

[0091] Specifically, each element in the initial adjusted weight vector is divided by the sum of all elements in the vector, and the result is the normalized weight coefficient.
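The row lookup, element-wise multiplication and normalization can be sketched as below. The coefficient rows are the example vectors from the text; the six base weights are hypothetical, as the text does not list them.

```python
from typing import Dict, List, Sequence

# Example adjustment rows from the description, one per industry stage.
ADJUSTMENT: Dict[str, List[float]] = {
    "startup":  [0.25, 1.33, 1.5, 2.0, 0.0, 0.0],
    "growth":   [0.75, 1.17, 1.25, 1.5, 0.0, 0.0],
    "maturity": [1.0, 1.0, 1.0, 1.0, 0.0, 0.0],
    "decline":  [0.5, 0.67, 1.0, 0.5, 1.0, 1.5],
}

def stage_adjusted_weights(base: Sequence[float], stage: str) -> List[float]:
    """w_init = w_base (element-wise) alpha_d, then divide by the sum."""
    init = [w * a for w, a in zip(base, ADJUSTMENT[stage])]
    total = sum(init)
    if total == 0.0:
        raise ValueError("adjusted weights sum to zero")
    return [w / total for w in init]

base_weights = [0.3, 0.2, 0.2, 0.1, 0.1, 0.1]   # hypothetical base vector
w_startup = stage_adjusted_weights(base_weights, "startup")
w_decline = stage_adjusted_weights(base_weights, "decline")
```

For the startup row, the last two data sources drop out (coefficient 0) and the first source ends up with a smaller share than its base weight.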

[0092] To further improve the accuracy of weight adjustment, this embodiment also includes a self-learning optimization step: (1) obtain the historical business type identification accuracy, which is updated monthly; (2) when the historical accuracy is lower than the target accuracy, iteratively update the preset adjustment coefficient matrix according to the reinforcement learning algorithm. The update formula is: $\alpha^{stage}_{t+1} = \alpha^{stage}_t + \eta \times (Acc_{target} - Acc_t) \times \alpha^{stage}_t$, where $\alpha^{stage}_t$ is the adjustment coefficient matrix at the t-th iteration, $\eta$ is the learning rate (e.g., 0.05), $Acc_t$ is the identification accuracy at the t-th iteration, quantifying the performance of the current coefficient matrix on historical business identification tasks, and $Acc_{target}$ is the target accuracy, set to 0.9. When $|Acc_t - Acc_{target}| < 0.01$, the current accuracy is close to the target and iteration is paused. The accuracy of subsequent business type identification is then continuously monitored; if a decrease is detected (i.e., the historical accuracy falls below the target value), iteration is restarted and the historical accuracy again drives the coefficient updates. The historical business type identification accuracy therefore serves both as an input to the iteration and as the criterion for starting and stopping it.
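One update step of the formula above can be sketched as follows. The function name and the choice to leave the matrix untouched when accuracy is at or above target are assumptions drawn from the start and stop conditions described in the text.

```python
from typing import List

Matrix = List[List[float]]

def update_coefficients(alpha: Matrix, acc: float, acc_target: float = 0.9,
                        eta: float = 0.05, tol: float = 0.01) -> Matrix:
    """Apply alpha_{t+1} = alpha_t + eta * (Acc_target - Acc_t) * alpha_t."""
    # Pause when accuracy is within tol of the target or not below it.
    if acc >= acc_target or abs(acc - acc_target) < tol:
        return [row[:] for row in alpha]
    # As written, the formula multiplies every coefficient by one common factor.
    scale = 1.0 + eta * (acc_target - acc)
    return [[a * scale for a in row] for row in alpha]

alpha_t = [[0.25, 1.33, 1.5, 2.0, 0.0, 0.0]]
alpha_next = update_coefficients(alpha_t, acc=0.80)    # updated
alpha_same = update_coefficients(alpha_t, acc=0.895)   # within tolerance, paused
```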

[0093] Example 3: Adjustment strategy based on external system configuration parameters. This embodiment allows external systems to dynamically configure the weight coefficients of each data source through an application programming interface according to personalized analysis needs.

[0094] S330: Receive weight configuration parameters from an external system via the application programming interface.

[0095] The weight configuration parameters take at least one of the following three forms. Target weight coefficient format: the external system directly passes in the target weight coefficients of the data sources, $w^{target} = [w^{target}_1, w^{target}_2, ..., w^{target}_i, ..., w^{target}_n]$, where $w^{target}_i$ is the target weight coefficient of the i-th data source. This form is suitable for scenarios where the external system has already defined the expected weight of each data source.

[0096] Weight adjustment magnitude format: the external system passes in the weight adjustment magnitudes of the data sources, $\Delta w = [\Delta w_1, \Delta w_2, ..., \Delta w_i, ..., \Delta w_n]$, where $\Delta w_i$ is the increase or decrease in the weight of the i-th data source relative to the preset base weight. This form is suitable for scenarios where the external system only wants to fine-tune the existing weights. The adjustment magnitudes are added to the preset base weight vector, and the result is then normalized.

[0097] Policy identifier format: the external system passes in a predefined policy identifier $id_{strategy}$, which is used to invoke a preset weight template. The weight template is a pre-configured vector of weight coefficients corresponding to a specific analysis scenario (such as investment decision-making, competitor analysis, or supply chain research). This format is suitable for scenarios where the external system wants to quickly switch to a preset analysis mode.

[0098] The three parameter formats mentioned above can be used independently or in combination. When multiple formats of parameters are passed in simultaneously, the final weight coefficient is determined according to priority rules. For example, the target weight coefficient format has the highest priority, followed by the strategy identifier format, and the weight adjustment magnitude format has the lowest priority.

[0099] S331, Perform a validity check on the weight configuration parameters.

[0100] The validity check includes: Data structure validation: check whether the data format of the input parameters conforms to the API specification and is valid JSON or XML; Numerical domain verification: check whether each weight coefficient is within [0, 1], whether each weight adjustment magnitude is within a reasonable range (e.g., [-0.2, 0.2]), and whether the strategy identifier is a defined identifier; Weight sum verification: for target weight coefficients, check whether they sum to 1, and reject the configuration if they do not.

[0101] If the verification fails, an error code and error message will be returned to the external system, and the current weight coefficient will remain unchanged.
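The three checks in step S331 can be sketched for a JSON payload as below. The field names `target_weights`, `weight_deltas` and `strategy_id` are hypothetical; the text does not define the API schema.

```python
import json
from typing import Any, Dict, List, Set

def validate_weight_config(payload: str, n_sources: int,
                           known_strategies: Set[str]) -> List[str]:
    """Return error messages; an empty list means the configuration is valid."""
    try:
        config: Dict[str, Any] = json.loads(payload)   # data structure validation
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(config, dict):
        return ["top-level JSON value must be an object"]

    errors: List[str] = []
    target = config.get("target_weights")
    if target is not None:
        if not isinstance(target, list) or len(target) != n_sources:
            errors.append("target_weights has the wrong shape")
        elif any(not 0.0 <= w <= 1.0 for w in target):
            errors.append("target weight outside [0, 1]")
        elif abs(sum(target) - 1.0) > 1e-6:            # weight sum verification
            errors.append("target weights do not sum to 1")
    deltas = config.get("weight_deltas")
    if deltas is not None:
        if not isinstance(deltas, list) or len(deltas) != n_sources:
            errors.append("weight_deltas has the wrong shape")
        elif any(not -0.2 <= d <= 0.2 for d in deltas):
            errors.append("weight delta outside [-0.2, 0.2]")
    strategy = config.get("strategy_id")
    if strategy is not None and strategy not in known_strategies:
        errors.append("unknown strategy_id")
    return errors

ok = validate_weight_config('{"target_weights": [0.5, 0.3, 0.2]}', 3, {"investment"})
bad = validate_weight_config('{"target_weights": [0.5, 0.3, 0.1]}', 3, {"investment"})
```

A caller would keep the current weights and return the messages to the external system whenever the list is non-empty.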

[0102] S332, after the verification is passed, the weight coefficients of each data source in the current analysis scenario are determined according to the weight configuration parameters, and the confidence level of the business type identification result is recalculated in real time.

[0103] The weight determination method depends on the format of the weight configuration parameters. For the target weight coefficient format: the target weight coefficients passed in by the external system are used directly as the weight coefficients $w^{user}$ of the data sources in the current analysis scenario, i.e., $w^{user} = w^{target}$.

[0104] For the weight adjustment magnitude format: starting from the preset base weight vector, the weights are adjusted incrementally by the input adjustment magnitudes to obtain the adjusted weight vector $w^{adj} = w^{base} + \Delta w$, which is then normalized to obtain the weight coefficients $w^{user}$ of the data sources in the current analysis scenario.

[0105] For the policy identifier format: the preset weight template corresponding to the policy identifier is invoked to obtain the weight coefficient vector defined in the template, i.e., $w^{user} = w^{template}(id_{strategy})$.
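The three determination methods, combined with the example priority order (target weights, then strategy template, then adjustment magnitudes), can be sketched as follows. Field names and the template contents are hypothetical, and clamping negative adjusted weights to 0 is an assumption.

```python
from typing import Any, Dict, List, Sequence

def resolve_weights(config: Dict[str, Any], base: Sequence[float],
                    templates: Dict[str, List[float]]) -> List[float]:
    """Return w_user for an already validated configuration."""
    if "target_weights" in config:                     # highest priority
        return list(config["target_weights"])
    if "strategy_id" in config:                        # preset template
        return list(templates[config["strategy_id"]])
    if "weight_deltas" in config:                      # w_adj = w_base + delta
        adjusted = [max(w + d, 0.0) for w, d in zip(base, config["weight_deltas"])]
        total = sum(adjusted)
        if total == 0.0:
            raise ValueError("adjusted weights sum to zero")
        return [w / total for w in adjusted]
    return list(base)                                  # nothing configured

base = [0.4, 0.3, 0.3]
templates = {"investment": [0.6, 0.2, 0.2]}            # hypothetical template
w_delta = resolve_weights({"weight_deltas": [0.1, -0.1, 0.0]}, base, templates)
w_both = resolve_weights({"target_weights": [0.2, 0.3, 0.5],
                          "weight_deltas": [0.1, -0.1, 0.0]}, base, templates)
```

When both a target vector and adjustment magnitudes are supplied, the target vector wins and the magnitudes are ignored.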

[0106] After the weight coefficients are determined, they are immediately substituted into the confidence formula $Conf = \sum_{i=1}^{n} (w^{user}_i \times p_i)$ to recalculate the confidence of the business type identification result in real time, where $w^{user}_i$ is the weight coefficient of the i-th data source in the current analysis scenario and $p_i$ is the business matching score corresponding to the feature vector of the i-th data source. The business matching score can be obtained by calculating the similarity between the feature vector of each data source and a preset business category feature library. Specific calculation methods include cosine similarity, Euclidean distance, or the output probability of a classification model.

[0107] To further enhance the traceability and auditability of the system, this embodiment also includes a step of saving weight configuration operation records: each time weight configuration parameters are received from an external system via the application programming interface and successfully applied, a weight configuration operation record is automatically saved. The record includes: Caller ID: uniquely identifies the external system or user that initiated the configuration request; Operation timestamp: the time when the configuration operation occurred; Operation instruction type: the specific type of the operation, such as "target weight configuration", "incremental adjustment" or "strategy template call"; Weight parameter values before and after the operation: the weight coefficients of each data source before and after the configuration takes effect; Weight configuration parameter content: the original configuration parameters passed in from the external system; Operation reason description: the configuration reason provided by the external system.

[0108] The saved weight configuration operation records can be used for subsequent audit traceability, anomaly diagnosis, and historical weight recovery.

[0109] S400: Perform a comprehensive weighted calculation based on the feature vectors of the data sources and the corresponding weight coefficients, and output the business type identification result of the target entity and the corresponding confidence level.

[0110] The feature vectors of the data sources extracted in step S200 are fused with the weight coefficients calculated in step S300 to compute the matching score of the target entity in each preset business category. Suppose there are Q preset business categories and the matching score between the feature vector of the i-th data source and the k-th business category is $s_{ik}$. The overall score of the target entity in the k-th business category is then $Score_k = \sum_{i=1}^{n} (w_i \times s_{ik})$,

[0111] where $w_i$ is the weight coefficient of the i-th data source in the current analysis scenario, and $s_{ik}$ can be obtained by calculating the similarity between the feature vector and the feature library of the business category, using cosine similarity, Euclidean distance, or the output probability of a classification model.

[0112] Based on the comprehensive score of each business category, the business type identification result of the target entity is determined. The determination of the identification result includes at least one of the following methods: Maximum value method: The business category with the highest comprehensive score is used as the identification result; Threshold method: If the combined score of multiple business categories exceeds the preset threshold, then all of these categories will be used as the identification results to characterize the diversified business characteristics of the target entity. Ranking method: Output the top M business categories from high to low based on the comprehensive score as the identification results, and mark the relative importance of each business category.

[0113] The confidence level is used to characterize the reliability of the recognition result, and its value ranges from [0,1]. The calculation method of the confidence level varies depending on how the recognition result is determined: When using the maximum value method, the confidence level can be defined as the normalized result of the difference between the highest comprehensive score and the second highest comprehensive score, or the highest comprehensive score itself can be used directly. When using the threshold method or the ranking method, the confidence level can be defined as the weighted average of the comprehensive scores of each output business category, or the probability value output by the classification model.

[0114] In one illustrative embodiment, the confidence level Conf is taken directly as the highest overall score, i.e., $Conf = \max_k \{Score_k\}$.
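The weighted fusion and the three determination methods can be sketched as below. The weights, matching scores and category labels are invented for illustration; in practice the matching scores would come from the similarity or classifier step.

```python
from typing import List, Sequence, Tuple

def category_scores(weights: Sequence[float],
                    match: Sequence[Sequence[float]]) -> List[float]:
    """Score_k = sum_i w_i * s_ik, where match[i][k] = s_ik."""
    n_categories = len(match[0])
    return [sum(w * row[k] for w, row in zip(weights, match))
            for k in range(n_categories)]

def max_method(scores: Sequence[float], labels: Sequence[str]) -> Tuple[str, float]:
    """Best category and Conf = max_k Score_k."""
    k = max(range(len(scores)), key=lambda idx: scores[idx])
    return labels[k], scores[k]

def threshold_method(scores, labels, threshold: float) -> List[str]:
    """All categories whose score exceeds the threshold."""
    return [label for label, s in zip(labels, scores) if s > threshold]

def ranking_method(scores, labels, m: int) -> List[Tuple[str, float]]:
    """Top-M categories, highest score first."""
    return sorted(zip(labels, scores), key=lambda p: p[1], reverse=True)[:m]

weights = [0.5, 0.3, 0.2]                 # three data sources
match = [[0.9, 0.2, 0.6],                 # s_ik for source 1
         [0.6, 0.4, 0.7],                 # source 2
         [0.3, 0.8, 0.5]]                 # source 3
labels = ["security hardware", "cloud services", "industrial software"]

scores = category_scores(weights, match)  # about [0.69, 0.38, 0.61]
label, conf = max_method(scores, labels)
multi = threshold_method(scores, labels, 0.6)
top2 = ranking_method(scores, labels, 2)
```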

[0115] The output results are presented in structured data format, including: Target entity identifier; Identified business type tags (one or more); Confidence level or overall score for each business type; The key chain of evidence includes the data sources and feature vector information that contribute the most to the identification results.

[0116] Furthermore, the method also includes: S500, cross-validation step. After the business type identification result is output in step S400, a preset cross-validation rule set is called to verify the credibility of the identification result, so as to enhance the robustness and interpretability of the identification result.

[0117] S510: After outputting the business type identification result, a preset cross-validation rule set is invoked to verify the credibility of the business type identification result.

[0118] The cross-validation rule set includes at least one of the following validation logics, each of which makes a judgment based on the correlation or consistency between different data sources: (1) Verification of the correlation between the first data source and the second data source.

[0119] The first data source is a financial quantitative time-series data source, and the second data source is a technical attribute data source. This verification logic is used to examine whether the target entity's high-revenue business aligns with its technological layout direction, and specifically includes the following sub-steps: Extracting high-revenue business categories: From the financial quantitative time-series data source, extract business categories whose numerical percentage exceeds a preset quantitative threshold θrev (e.g., 5%), denoted as set Crev. The numerical percentage refers to the proportion of revenue from this business category to total revenue.

[0120] Extracting the patent technology topic distribution: Extract the patent technology topic probability distribution vector vpat from the technology attribute data source. This vector is obtained by training the patent text using the LDA topic model. Each dimension corresponds to the probability value of a technology topic.

[0121] Map business categories to the technology topic space: For each high-revenue business category, first obtain its text description (e.g., "smart security camera", "cloud computing service", etc.), and then map it to the technology topic space using one of the following methods: Method 1: Keyword matching. Extract keywords from the business category text, count the frequency of these keywords under each technical topic, and normalize the counts to obtain the technical topic vector corresponding to the business category.

[0122] Method 2: Semantic Embedding. A pre-trained word vector model (such as Word2Vec or BERT) is used to encode business category texts into semantic vectors. Then, the similarity between the semantic vector and the center vector of each technology topic (obtained by averaging the semantic vectors of patent texts belonging to that topic) is calculated, and technology topic vectors are constructed using the similarity as weights.

[0123] Through the above mapping, each high-revenue business category is represented as a vector in the technology theme space, which reflects the distribution characteristics of the business category in the technology theme dimension.

[0124] Calculate association similarity: Calculate the cosine similarity between the technology topic vector corresponding to each business category and the overall patent technology topic distribution vector of the target entity. This similarity measures the degree of consistency between the business category and the technology layout direction.

[0125] Comprehensive Judgment: Calculate the weighted average of the similarities among all high-revenue business categories (the weights can be the revenue share of each business category) to obtain the overall relevance. If the overall relevance is lower than the preset similarity threshold θsim (e.g., 0.3), the verification logic is deemed to have failed, indicating that the high-revenue businesses lack corresponding technical support.
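The revenue-weighted relevance check can be sketched as below. The topic vectors are toy three-topic values standing in for LDA output and for the mapped business categories; the function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 0.0 if norm == 0.0 else sum(x * y for x, y in zip(a, b)) / norm

def revenue_technology_relevance(
        categories: Sequence[Tuple[float, Sequence[float]]],
        v_pat: Sequence[float], theta_rev: float = 0.05) -> Optional[float]:
    """Revenue-weighted mean similarity over categories with share > theta_rev."""
    high = [(share, vec) for share, vec in categories if share > theta_rev]
    if not high:
        return None  # no high-revenue category to verify
    total_share = sum(share for share, _ in high)
    return sum(share * cosine(vec, v_pat) for share, vec in high) / total_share

v_pat = [0.7, 0.2, 0.1]                       # patent topic distribution (toy)
categories = [
    (0.60, [0.8, 0.1, 0.1]),                  # aligned with the patents
    (0.37, [0.1, 0.1, 0.8]),                  # weakly aligned
    (0.03, [0.0, 1.0, 0.0]),                  # below theta_rev, ignored
]
relevance = revenue_technology_relevance(categories, v_pat)
passed = relevance is not None and relevance >= 0.3   # theta_sim = 0.3
```

In this toy case the dominant business line matches the patent topics closely, so the overall relevance stays well above the 0.3 threshold despite the weakly aligned second line.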

[0126] (2) Semantic consistency verification between the third data source and the second data source.

[0127] The third data source is a business activity trajectory data source, and the second data source is a technical attribute data source. This verification logic is used to check whether the target entity's business activities are consistent with its technological layout direction, and specifically includes the following sub-steps: Extracting Business Item Category Text: Extracting category description texts for business activities such as bidding projects and product launch events from the business activity trajectory data source, forming a text set Top.

[0128] Extracting a set of technical keywords: Extracting technical keywords (such as high-frequency keywords extracted via TF-IDF) from the technical attribute data source to form a set of technical keywords, Ktech.

[0129] Mapping to the semantic space: A pre-trained word vector model (such as Word2Vec, BERT) is used to map the text description of each business item in Top to a semantic vector, and each technical keyword in Ktech to a semantic vector. All vectors are in the same semantic space and have the same dimension.

[0130] Calculate the semantic consistency score: For each business item text vector, calculate its cosine similarity with each technical keyword vector and take the maximum value as the matching degree between the business item and the target entity's technical layout. Then take the average of the matching degrees of all business items to obtain the overall semantic consistency score.

[0131] Judgment result: If the overall semantic consistency score is lower than the preset semantic consistency threshold θsem (e.g., 0.4), the verification logic is deemed to have failed, indicating that there is a semantic gap between business activities and technology layout.
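A minimal sketch of the semantic consistency score, assuming the matching degree of a business item is its maximum cosine similarity over the technical keywords. The two-dimensional vectors are toy stand-ins for Word2Vec or BERT embeddings.

```python
import math
from typing import Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 0.0 if norm == 0.0 else sum(x * y for x, y in zip(a, b)) / norm

def semantic_consistency(item_vectors: Sequence[Sequence[float]],
                         keyword_vectors: Sequence[Sequence[float]]) -> float:
    """Mean over business items of the best match against any keyword."""
    if not item_vectors or not keyword_vectors:
        raise ValueError("both vector sets must be non-empty")
    matches = [max(cosine(item, kw) for kw in keyword_vectors)
               for item in item_vectors]
    return sum(matches) / len(matches)

items = [[1.0, 0.0], [0.0, 1.0]]        # toy business item embeddings
keywords = [[1.0, 0.0], [1.0, 1.0]]     # toy technical keyword embeddings
score = semantic_consistency(items, keywords)
failed = score < 0.4                    # theta_sem = 0.4
```

The first item matches a keyword exactly (1.0) and the second matches best at 1/√2, so the overall score is their mean.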

[0132] S520, verification result processing. If any of the above verification logics fails, at least one of the following processing measures is executed: Reduce confidence level: multiply the confidence level output by step S400 by a discount factor λ (e.g., 0.8) to obtain a corrected confidence level that reflects the unreliable factors found by cross-validation.

[0133] Triggering the secondary verification process: Initiating a more in-depth verification mechanism, specifically including: Invoke supplementary verification rules: Based on the type of verification logic that failed, invoke the corresponding supplementary verification rules. For example, if the financial-technology relevance verification fails, the supplementary verification rules may include: checking the citation frequency of the patent, technological hotspot trends, etc.

[0134] Collect supplementary data sources: Introduce additional data sources for secondary verification, such as industry research reports, expert review data, etc.

[0135] Secondary validation calculation: Based on supplementary rules and data sources, a correction factor is recalculated to adjust the confidence level or generate a validation report.

[0136] Generate verification exception prompts: Output verification exception reports to external systems, including the verification logic that failed, key evidence (such as similarity values and thresholds), and suggested review measures.

[0137] If all verification logic passes, the original confidence level remains unchanged, and the verification success flag is recorded in the results report.

[0138] Furthermore, the method also includes: S600, weight deviation monitoring and automatic recovery. This step monitors whether the weight coefficients used in step S400 deviate from a reasonable baseline range and automatically restores the weight configuration when a significant deviation is detected, so as to ensure the rationality of the weight allocation and the stability of the identification results.

[0139] S610, when the confidence level corresponding to the business type identification result of the target entity is lower than the preset confidence threshold, calculate the deviation between the current weight coefficient of each data source and the preset benchmark weight coefficient.

[0140] When the confidence level of the target entity business type identification result output by step S400 is lower than the preset confidence threshold, the weight deviation calculation process is triggered. The confidence threshold can be preset according to the application scenario; for example, a higher threshold can be set for high-risk decision-making scenarios. In one illustrative embodiment, this threshold is set to 0.6. The preset benchmark weight coefficient refers to the weight configuration vector deemed reasonable in the current analysis scenario. This vector can be determined through one of the following methods: Preset benchmark weight coefficients derived from data source quality attribute parameters: According to the method described in steps S310 to S313, calculate the comprehensive credibility score based on the current authenticity score, integrity score and timeliness score of each data source, and normalize the comprehensive credibility score of each data source to obtain the preset benchmark weight coefficients.

[0141] Preset benchmark weight coefficients derived from industry stage feature parameters: According to the method described in steps S320 to S324, based on the current industry stage category of the target entity, the corresponding preset adjustment coefficient matrix is called to adjust and normalize the basic weight vector to obtain the preset benchmark weight coefficients.

[0142] Preset baseline weight coefficients based on historical configurations of external systems: Extract the weight coefficients that were most recently configured by external systems and showed good confidence performance (e.g., the confidence level has been consistently higher than the threshold after the configuration took effect) from the weight configuration operation records saved in step S333, and use them as preset baseline weight coefficients.

[0143] In a preferred embodiment, a preset benchmark weight coefficient derived from data source quality attribute parameters is preferred, as it directly reflects the real-time reliability of the data source at the current moment.

[0144] Deviation is used to quantitatively measure the degree of difference between the current weight coefficient of each data source and the preset benchmark weight coefficient. This invention supports the following deviation calculation methods: Absolute Deviation: Calculates the absolute value of the difference between the current weight of each data source and the baseline weight. This method is suitable for scenarios that require precise control over the weight deviation of individual data sources.

[0145] Relative Deviation: Calculate the ratio of the absolute value of the difference between the current weight and the benchmark weight for each data source to the benchmark weight. When the benchmark weight is zero, a very small positive number is introduced as the denominator to avoid division by zero errors. This method is suitable for cases where the benchmark weight is small or large, and can reflect the relative magnitude of weight changes.

[0146] Vector space distance: This calculates the overall deviation between all weight vectors, and can be achieved using conventional vector distance metrics such as Euclidean distance or cosine distance. This method is suitable for assessing the degree of deviation in the overall weight distribution.

[0147] In practical applications, one or more deviation calculation methods can be selected according to the specific scenario, and corresponding deviation thresholds can be set. The deviation calculation result will serve as the basis for deviation determination in step S620.

[0148] S620, if the deviation of any data source exceeds the preset deviation threshold, the weight coefficient of that data source is automatically replaced with the preset benchmark weight coefficient, and the weight coefficients of all enabled data sources are renormalized.

[0149] For different deviation calculation methods, corresponding deviation thresholds are preset: For absolute deviation, a deviation threshold θabs = 0.1 can be set, meaning that a deviation of more than 0.1 in the weight of a single data source is considered a significant deviation. For relative deviation, a deviation threshold θrel = 0.2 can be set, meaning that a weight deviation exceeding 20% of the benchmark weight is considered a significant deviation. For vector space distance, the Euclidean distance threshold θeuclidean = 0.3 or the cosine distance threshold θcosine = 0.15 can be set.

[0150] If absolute or relative deviation is used, check if any data source has a deviation exceeding the corresponding preset deviation threshold. If so, it is determined that the weight has significantly deviated.

[0151] If vector space distance is used, check whether the overall distance exceeds a preset threshold. If it does, it is determined that the weights have deviated significantly.
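The determination rule can be sketched as a single check. The threshold values are the examples given in paragraph [0149]; the dictionary layout and the function name `has_deviated` are assumptions made for illustration.

```python
# Example thresholds from the description; the key names are hypothetical.
THRESHOLDS = {"absolute": 0.1, "relative": 0.2, "euclidean": 0.3, "cosine": 0.15}


def has_deviated(method, deviation):
    """Return True when the deviation counts as significant.

    For "absolute" and "relative", `deviation` is a list with one value per
    data source and any single value above the threshold is enough.
    For "euclidean" and "cosine", `deviation` is one overall distance.
    """
    threshold = THRESHOLDS[method]
    if method in ("absolute", "relative"):
        return any(d > threshold for d in deviation)
    return deviation > threshold
```

A per-source method flags the weights as soon as one data source drifts, whereas a distance method only reacts to a shift in the distribution as a whole, which is why the two families use differently scaled thresholds.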

[0152] When a significant deviation in weights is detected, the following automatic recovery operation is performed: Weight replacement: the current weight coefficients of all data sources are automatically replaced with the preset benchmark weight coefficients.

[0153] Renormalization: the replaced weight vector is renormalized to ensure that the sum of all weight coefficients is 1. Since the benchmark weights are already a normalized vector, this step usually requires no additional calculation, but a single normalization check can be performed to guard against numerical error.

[0154] Recalculate the identification results: Using the recovered weight coefficients, return to step S400 to recalculate the comprehensive weighted results, and output the corrected business type identification results and corresponding confidence scores.
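A minimal sketch of the replacement and renormalization step follows. It implements the per-source variant stated in step S620 (only deviating sources are reset) with absolute deviation as the measure; treating disabled sources as weight 0 and the name `recover_weights` are assumptions introduced here.

```python
def recover_weights(current, benchmark, enabled, threshold=0.1):
    """Reset deviating weights to their benchmark, then renormalize.

    current, benchmark: per-source weight lists of equal length.
    enabled: per-source booleans; disabled sources get weight 0 (assumption).
    """
    restored = []
    for w, b, on in zip(current, benchmark, enabled):
        if not on:
            restored.append(0.0)
        elif abs(w - b) > threshold:
            restored.append(b)  # significant deviation: fall back to benchmark
        else:
            restored.append(w)
    total = sum(restored)
    if total == 0:
        raise ValueError("no enabled data source has a positive weight")
    return [w / total for w in restored]


weights = recover_weights(
    [0.55, 0.25, 0.15, 0.05], [0.40, 0.30, 0.20, 0.10], [True, True, True, True]
)
```

If every weight is replaced, as in the operation described in paragraph [0152], the result is simply the benchmark vector and the final division acts only as the numerical check mentioned in paragraph [0153].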

[0155] Optionally, after each automatic weight recovery is triggered, the recovery operation record is automatically saved, including the trigger time, current confidence level, deviation calculation result, weight coefficients before and after recovery, etc., and a weight anomaly recovery notification can be sent to an external system for users to trace and review.

[0156] Furthermore, the method also includes the following steps: S710: after the automatic replacement and renormalization of the weight coefficients is complete, recalculate the business type identification result and the corresponding confidence level of the target entity.

[0157] After completing the automatic replacement and renormalization of the weight coefficients in step S630, the weighted calculation is re-executed in step S400 based on the restored weight coefficients.

[0158] Specifically, the feature vectors of each data source extracted in step S200 are weighted and fused with the recovered weight coefficients. The matching score of the target entity in each preset business category is recalculated according to the comprehensive score calculation formula described in step S400. Based on the matching score, the business type identification result and the corresponding confidence level are re-determined. The recalculation process is completely consistent with step S400, ensuring that a more reliable identification result is obtained using the corrected weights.

[0159] S720: If the recalculated confidence level is still lower than the preset confidence threshold, a weight anomaly report is generated and sent to an external system.

[0160] The recalculated confidence level Confnew is compared with the preset confidence threshold θconf (e.g., 0.6). If Confnew is still lower than θconf, the identification result remains insufficiently reliable even after automatic weight recovery, and there may be deeper problems such as data quality or model adaptability. The anomaly report generation process is then triggered. Report content: the generated weight anomaly report should include at least the following information: the target entity identifier; the timestamp at which the anomaly was triggered; the confidence level before and after recovery; the weight coefficient vector before and after recovery; the weight deviation calculation results (including the deviation value of each data source and the deviation calculation method used); the source of the preset benchmark weight coefficients (e.g., based on quality attributes, industry stage, or historical configuration); and suggested areas for investigation (e.g., "the timeliness score of financial data sources is too low" or "the completeness of technical attribute data sources is insufficient").

[0161] Report format: Anomaly reports are generated in a structured data format (such as JSON or XML) to facilitate automatic parsing and processing by external systems.
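One possible JSON layout for such a report is sketched below. The description lists the required information but not a schema, so every field name here, and the function name `build_anomaly_report`, is a hypothetical choice.

```python
import json
from datetime import datetime, timezone


def build_anomaly_report(entity_id, conf_before, conf_after, weights_before,
                         weights_after, deviations, method, benchmark_source,
                         suggestions):
    """Serialize a weight anomaly report to JSON (field names are assumptions)."""
    report = {
        "entity_id": entity_id,
        "triggered_at": datetime.now(timezone.utc).isoformat(),
        "confidence": {"before_recovery": conf_before, "after_recovery": conf_after},
        "weights": {"before_recovery": weights_before, "after_recovery": weights_after},
        "deviation": {"method": method, "per_source": deviations},
        "benchmark_source": benchmark_source,
        "suggested_checks": suggestions,
    }
    # ensure_ascii=False keeps non-ASCII entity names readable in the output.
    return json.dumps(report, ensure_ascii=False, indent=2)
```

Grouping the before and after values under one key keeps each comparison the report is meant to support (confidence, weights) in a single place for the consuming system.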

[0162] Sending method: The generated anomaly report is sent to a pre-configured external system (such as a monitoring platform, user terminal, or operations and maintenance center) via an application programming interface (API), and the sending status is recorded. If sending fails, the report is cached locally and sending is retried until it succeeds or the maximum number of retries is reached.
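The send, cache, and retry behaviour can be sketched without committing to a transport. In this illustration `send` is any caller-supplied function that raises on failure, `cache` is a plain list standing in for local storage, and the name `send_report` and the default of three retries are assumptions.

```python
import time


def send_report(report, send, cache, max_retries=3, delay_seconds=0.0):
    """Deliver `report` through `send`, caching it locally while delivery fails.

    Returns the recorded sending status as a dict.
    """
    cached = False
    # One initial attempt plus up to max_retries retries.
    for attempt in range(1, max_retries + 2):
        try:
            send(report)
        except Exception:  # broad on purpose: any transport error triggers a retry
            if not cached:
                cache.append(report)
                cached = True
            if attempt <= max_retries:
                time.sleep(delay_seconds)
            continue
        if cached:
            cache.remove(report)  # delivered, so the local copy is no longer needed
        return {"status": "sent", "attempts": attempt}
    return {"status": "failed", "attempts": max_retries + 1}
```

When all attempts fail the report stays in the cache, so a later process can pick it up; a production version would likely add backoff between attempts and persist the cache to disk.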

[0163] With the steps above, when the confidence threshold still cannot be reached after the weights are automatically restored, a timely warning is sent to the external system so that manual intervention or further model optimization can be carried out, which enhances the reliability and maintainability of the entire recognition system.

[0164] The present invention is described in further detail below, using "XX Intelligent Technology Co., Ltd." as an example and in conjunction with the accompanying drawings. This embodiment only explains the present invention and is not intended to limit its scope of protection.

Specific Implementation

[0166] 1. Data acquisition and feature extraction (corresponding to steps S100 and S200). Raw data of the target entity is collected from multiple heterogeneous data sources: Financial quantitative time-series data source: revenue composition data obtained from the company's publicly disclosed annual financial reports, in which "intelligent security cameras" account for 65% of revenue, "technical consulting services" for 20%, and "other" for 15%.

[0167] Technical attribute data source: Patent information obtained from the State Intellectual Property Office shows that the company has 15 invention patents, of which 12 involve technical topics such as "image recognition algorithm" and "video compression transmission", and 3 involve "shell structure design".

[0168] Business activity trajectory data source: recent business projects obtained from bidding websites, including the successful bid for the "Procurement Project of Security Monitoring System for Smart Park in a Certain City".

[0169] Text description data source: Business scope text obtained from the business registration system: "Technical development and sales of computer software and hardware, electronic products and communication equipment".

[0170] The above raw data is preprocessed (data cleaning, format conversion, missing value handling), and the feature vector corresponding to each data source is extracted: A revenue share vector is extracted from the financial data, in which the share for "intelligent security cameras" is 0.65.

[0171] A patent technology topic distribution vector is extracted from the patent data, with the probability of each topic obtained through an LDA topic model; the "image recognition" topic has a notably high probability.

[0172] A category feature vector is extracted from the business project data, mapping "smart park security monitoring" to semantic features.

[0173] Text semantic embedding vectors are extracted from the text description, and the business scope is encoded into a fixed-dimensional vector using the BERT model.

[0174] 2. Dynamic weight adjustment and preliminary identification (corresponding to steps S300 and S400). The system calculates the weight coefficients of each data source in the current analysis scenario according to a preset dynamic weight adjustment strategy. In this embodiment, the target entity is in a growth stage (rapid industry growth and a high R&D investment ratio), so the adjustment strategy based on industry stage characteristic parameters is adopted (see S320-S325).

[0175] Based on the industry stage characteristic parameters, the preset adjustment coefficient matrix corresponding to the growth stage, $\alpha_{\text{grow}} = [0.75, 1.17, 1.25, 1.5, 0, 0]$, is invoked to adjust the basic weight vector $w_{\text{base}} = [0.4, 0.3, 0.2, 0.1, 0, 0]$ and obtain the initial weights. After normalization, the weight coefficients of each data source are obtained: financial quantitative time series data source 0.294, technical attribute data source 0.343, business activity trajectory data source 0.245, and text description data source 0.118.
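As a worked check of the arithmetic, the sketch below assumes that "adjusting" means element-wise multiplication of the basic weight vector by the adjustment coefficients, followed by division by the sum. The exact operation is defined in S320-S325 and is not restated in this paragraph, so this is an assumption, and the function name `adjust_and_normalize` is hypothetical.

```python
alpha_grow = [0.75, 1.17, 1.25, 1.5, 0, 0]  # growth-stage adjustment coefficients
w_base = [0.4, 0.3, 0.2, 0.1, 0, 0]         # basic weight vector


def adjust_and_normalize(base, alpha):
    """Element-wise product (assumed adjustment rule), then scale to sum to 1."""
    raw = [b * a for b, a in zip(base, alpha)]
    total = sum(raw)
    return [r / total for r in raw]


weights = adjust_and_normalize(w_base, alpha_grow)
rounded = [round(w, 3) for w in weights]
```

Under this assumption the four active weights come out at roughly 0.285, 0.334, 0.238, and 0.143. These are close to, but not identical with, the figures quoted above, so the embodiment may use a different adjustment rule or rounded coefficients.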

[0176] Based on the aforementioned weight coefficients and the feature vectors of each data source, a comprehensive weighted calculation is performed. Each feature vector is matched against a predefined business category feature library to obtain a comprehensive score for each business category. The "Intelligent Security Hardware and Systems" category has the highest comprehensive score (0.85), significantly higher than other categories (e.g., "Technical Consulting Services" scores 0.32). The primary business is therefore preliminarily identified as "Intelligent Security Hardware and Systems", and the highest comprehensive score, 0.85, is assigned as the confidence level.
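The fusion step can be illustrated as a weighted sum over per-source matching scores. The comprehensive score formula itself belongs to step S400 and is not reproduced here, so the weighted-sum form is an assumption; the per-source matching scores below are invented for illustration and are not values from the embodiment.

```python
# Weight coefficients quoted in the embodiment
# (financial, technical attribute, business activity, text description).
weights = [0.294, 0.343, 0.245, 0.118]

# Hypothetical per-source matching scores for two business categories.
match_scores = {
    "Intelligent Security Hardware and Systems": [0.90, 0.85, 0.90, 0.60],
    "Technical Consulting Services": [0.30, 0.20, 0.30, 0.70],
}


def composite_scores(weights, match_scores):
    """Weighted sum of per-source matching scores for every category."""
    return {
        category: sum(w * s for w, s in zip(weights, scores))
        for category, scores in match_scores.items()
    }


def identify(weights, match_scores):
    """Return the best-scoring category and its score, used as the confidence."""
    scores = composite_scores(weights, match_scores)
    best = max(scores, key=scores.get)
    return best, scores[best]


category, confidence = identify(weights, match_scores)
```

Because the text description source carries the smallest weight, its comparatively strong match with the consulting category in this made-up input barely moves the ranking, which is the behaviour the growth-stage weighting is meant to produce.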

[0177] 3. Cross-validation (corresponding to step S500). The predefined cross-validation rule set is invoked to verify the credibility of the preliminary identification result: Financial-technology correlation verification: business categories with a revenue share exceeding 5% (e.g., "intelligent security cameras") are extracted from the financial data and mapped to the technology theme space to obtain the corresponding technology theme vector. The cosine similarity between this vector and the overall technology theme distribution vector of the patents is calculated to be 0.78, which is higher than the preset threshold of 0.3, so the verification passes.

[0178] Operational-technical semantic consistency verification: semantic vectors are extracted from the business project text "Smart Park Security Monitoring System Procurement Project" and matched against the semantic vectors of the patent technology keyword set. The average maximum similarity is calculated to be 0.65, which is higher than the preset threshold of 0.4, so the verification passes.

[0179] All cross-validations passed, maintaining the original confidence level of 0.85.
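Both verification rules reduce to similarity tests against a threshold and can be sketched as below. The thresholds 0.3 and 0.4 come from the embodiment; the function names are hypothetical, and "average maximum similarity" is read here as: for each project vector, take its best cosine match among the keyword vectors, then average those best matches.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def financial_technology_check(business_topic_vec, patent_topic_vec, threshold=0.3):
    """Pass when the mapped business vector resembles the patent topic distribution."""
    return cosine_similarity(business_topic_vec, patent_topic_vec) > threshold


def average_max_similarity(project_vecs, keyword_vecs):
    """Mean, over project vectors, of the best match among keyword vectors."""
    best = [max(cosine_similarity(p, k) for k in keyword_vecs) for p in project_vecs]
    return sum(best) / len(best)


def semantic_consistency_check(project_vecs, keyword_vecs, threshold=0.4):
    """Pass when project semantics are, on average, covered by patent keywords."""
    return average_max_similarity(project_vecs, keyword_vecs) > threshold
```

With the values reported in the embodiment, 0.78 against a threshold of 0.3 and 0.65 against 0.4, both checks would return a pass.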

[0180] 4. Result generation and output (corresponding to the latter half of step S400 and the post-processing of S500). Based on the above analysis, the system generates the following structured report: Target entity identifier: XX Intelligent Technology Co., Ltd.; Main business identification result: research and development and sales of intelligent security hardware (cameras) and system solutions; Confidence level: 85% (based on the weighted composite score, with cross-validation passed); Key chain of evidence: the financial quantitative time-series data source shows that "intelligent security cameras" account for 65% of revenue, making them the primary source of income; the technical attribute data source shows 12 patents related to image recognition and video compression, which are highly compatible with the security business; the business activity trajectory data source shows that the company recently won a bid for a smart park security monitoring project, confirming its actual business activities in this area.

[0181] Business model assessment: technology-driven (high R&D investment ratio, possessing core algorithm patents). Risk warning: the business scope described in the text description data source is quite broad (covering computer software and hardware, electronic products, etc.) and does not fully match the current main business. The entity has a legal basis for business diversification, but its business is currently highly focused on the security field.

[0182] The final identification results are output to the user terminal through the system interface and saved to the results database for subsequent model optimization and audit traceability.

[0183] This embodiment fully demonstrates the application of the method of the present invention in a real-world scenario, and verifies the effectiveness of the core technical features such as multi-source data fusion, dynamic weight adjustment, and cross-validation.

[0184] Compared with the prior art, the present invention has the following beneficial effects: First, the invention offers significant benefits in terms of accuracy. By integrating multiple heterogeneous data sources, including financial quantitative time-series data, technical attribute data, operational activity trajectory data, and textual description data, it constructs a multi-dimensional entity business feature space, overcoming the information limitations of a single data source. Through a dynamic weight adjustment strategy, the weight coefficients of each data source are adaptively allocated based on data source quality attribute parameters, industry stage characteristic parameters, or external system configuration parameters, making the weight allocation during feature fusion dynamically adaptable. A cross-validation mechanism verifies the correlation and semantic consistency between different data sources, providing verifiable grounds for correcting the initial identification results.

[0185] Second, the invention offers advantages in processing efficiency. It implements a streamlined process from multi-source data acquisition, preprocessing, feature extraction, and dynamic weight calculation to comprehensive weighted identification and cross-validation, enabling the identification of the target entity's business direction without manual intervention. This streamlined processing method supports batch identification operations for multiple target entities.

[0186] Third, the invention offers beneficial effects in terms of interpretability. Through a dynamic weighting mechanism, the contribution of each data source to the final identification result is presented in the form of weighted coefficients. The key evidence chain generated by the cross-validation step, including the revenue share of business categories in financial data, the distribution of technical themes in patent data, and project information in operational activity data, provides traceable support for the identification results. The confidence score numerically characterizes the reliability of the identification results.

[0187] Fourth, the invention offers beneficial effects in terms of dynamic adaptability. It provides three dynamic weight adjustment strategies: an adjustment strategy based on data source quality attribute parameters enables weight allocation to respond to changes in data authenticity, completeness, and timeliness; an adjustment strategy based on industry stage characteristic parameters enables weight allocation to adapt to the business characteristics of entities at different industry stages (startup, growth, maturity, and decline); and an adjustment strategy based on external system configuration parameters allows external systems to configure weights according to analysis needs. Through a self-learning optimization mechanism, the adjustment coefficient matrix can be iteratively updated based on historical recognition accuracy.

[0188] Fifth, the invention offers beneficial effects in terms of system stability. It establishes a weight deviation monitoring and automatic recovery mechanism. When the confidence level of the identification result is lower than a preset confidence threshold, the deviation between the current weight coefficient and the preset benchmark weight coefficient is calculated. If the deviation exceeds the preset deviation threshold, the current weight coefficient is automatically replaced with the preset benchmark weight coefficient and renormalized. This mechanism handles abnormal weight fluctuations. The weight anomaly report generation and sending function provides anomaly information to external systems.

[0189] Based on the same inventive concept, embodiments of the present invention also provide a target entity business type identification system based on multi-source data fusion and dynamic weight adjustment, comprising: a multi-source data acquisition module, used to collect raw data from multiple heterogeneous data sources, including at least a financial quantitative time-series data source, a technical attribute data source, a business activity trajectory data source, and a text description data source; a data preprocessing and feature extraction module, used to preprocess the collected raw data and extract the feature vector corresponding to each data source; a weight analysis and modeling module, used to calculate the weight coefficient of each data source in the current analysis scenario according to a preset dynamic weight adjustment strategy, the dynamic weight adjustment strategy being generated based on at least one of data source quality attribute parameters or industry stage characteristic parameters, and to perform a comprehensive weighted calculation based on the feature vectors of each data source and the corresponding weight coefficients and output the business type identification result of the target entity and the corresponding confidence level; a cross-validation and logical reasoning module, used to invoke preset cross-validation rules to verify the credibility of the business type identification result; and a results generation and output module, used to generate and output a structured report that includes the business type identification result, the confidence level, and the key evidence chain.

[0190] It should be noted that the system embodiments and the corresponding method embodiments are based on the same inventive concept. Therefore, the relevant technical features, implementation details and the technical effects that can be achieved in the method embodiments are also applicable to the system embodiments, and will not be repeated here.

[0191] Those skilled in the art will understand that the above-described processing unit can be implemented by hardware, software, firmware, or any combination thereof. When implemented in software, the system may further include a memory storing a computer program, which, when executed by the processing unit, performs the operations described above.

[0192] This invention also provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being configured to perform the method described in this invention.

[0193] This invention also provides a computer-readable storage medium storing computer-executable instructions for performing the methods described in this invention.

[0194] It should be understood that, using the various forms of processes shown above, steps may be reordered, added, or deleted. For example, the steps described in this invention can be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution disclosed in this invention can be achieved; this is not limited herein.

[0195] The specific embodiments described above do not constitute a limitation on the scope of protection of this invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this invention should be included within the scope of protection of this invention.

Claims

1. A method for identifying target entity business types based on multi-source data fusion and dynamic weight adjustment, characterized in that it includes the following steps: collecting raw data from multiple heterogeneous data sources, including at least a financial quantitative time-series data source, a technical attribute data source, a business activity trajectory data source, and a text description data source; preprocessing the collected raw data to extract the feature vector corresponding to each data source; calculating the weight coefficient of each data source in the current analysis scenario according to a preset dynamic weight adjustment strategy, the dynamic weight adjustment strategy being generated based on at least one of data source quality attribute parameters, industry stage characteristic parameters, or external system configuration parameters; and performing a comprehensive weighted calculation based on the feature vectors and corresponding weight coefficients of each data source to output the business type identification result of the target entity and the corresponding confidence level.

2. The target entity business type identification method according to claim 1, characterized in that the dynamic weight adjustment strategy includes an adjustment strategy based on data source quality attribute parameters, which further includes: obtaining the score value of each data source on the preset credibility evaluation dimensions; calculating the comprehensive credibility score of each data source based on these score values; adjusting the preset base weight coefficients based on the comprehensive credibility score of each data source to obtain the preliminary adjusted weights; and normalizing the preliminary adjusted weights to obtain the weight coefficients of each data source after adjustment based on the data source quality attribute parameters.

3. The target entity business type identification method according to claim 2, characterized in that the preset credibility evaluation dimensions include authenticity, completeness, and timeliness; the score for the authenticity dimension is calculated by weighting the cross-validation score, the data source tracing score, and the logic verification score; the score for the completeness dimension is calculated based on the proportion of missing data and the completeness of key information items; and the score for the timeliness dimension is calculated using an exponential decay function based on the difference between the most recent update time of the data and the current time.

4. The target entity business type identification method according to claim 1, characterized in that the dynamic weight adjustment strategy includes an adjustment strategy based on industry stage characteristic parameters, which further includes: obtaining quantitative values of multiple stage characteristic indicators of the industry to which the target entity belongs; normalizing the quantitative values to construct an industry stage feature vector; inputting the industry stage feature vector into a pre-trained stage classification model, which outputs the industry stage category of the target entity; invoking the preset adjustment coefficient matrix corresponding to the determined industry stage category to adjust the preset basic weight vector and obtain the initial adjusted weights; and normalizing the initial adjusted weights to obtain the weight coefficients of each data source after adjustment based on the industry stage characteristic parameters.

5. The target entity business type identification method according to claim 4, characterized in that the adjustment strategy based on industry stage characteristic parameters also includes a self-learning optimization step: obtaining the historical business type identification accuracy; and, when the historical business type identification accuracy is lower than the target accuracy, iteratively updating the preset adjustment coefficient matrix according to a reinforcement learning algorithm, with the update formula $\alpha_{\text{stage}}^{t+1} = \alpha_{\text{stage}}^{t} + \eta \times (Acc_{\text{target}} - Acc_{t}) \times \alpha_{\text{stage}}^{t}$, where $\alpha_{\text{stage}}^{t}$ is the adjustment coefficient matrix at the t-th iteration, $\eta$ is the learning rate, $Acc_{t}$ is the recognition accuracy at the t-th iteration, and $Acc_{\text{target}}$ is the target accuracy.

6. The target entity business type identification method according to claim 1, characterized in that the dynamic weight adjustment strategy includes an adjustment strategy based on external system configuration parameters, which further includes: receiving weight configuration parameters from an external system via an application programming interface; and verifying the validity of the weight configuration parameters and, after the verification passes, determining the weight coefficients of each data source in the current analysis scenario according to the weight configuration parameters and recalculating the confidence level of the business type identification result in real time.

7. The target entity business type identification method according to claim 6, characterized in that the adjustment strategy based on external system configuration parameters also includes a weight adjustment log saving step: each time a target weight coefficient is received from an external system via an application programming interface, a weight configuration operation record is automatically saved, the record including the caller's identity, the operation timestamp, the operation instruction type, the weight parameter values before and after the operation, and a description of the reason for the operation.

8. The target entity business type identification method according to claim 1, characterized in that the method also includes a cross-validation step: after the business type identification result is output, invoking a preset cross-validation rule set to verify the credibility of the business type identification result; and, if the credibility verification fails, performing at least one of the following actions: reducing the confidence level of the business type identification result, triggering a secondary verification process, or generating a verification exception message.

9. The target entity business type identification method according to claim 1 or 2, characterized in that the method also includes a weight deviation monitoring and automatic recovery step: when the confidence level of the business type identification result of the target entity is lower than the preset confidence threshold, calculating the deviation between the current weight coefficient of each data source and the preset benchmark weight coefficient; and, if the deviation of any data source exceeds the preset deviation threshold, automatically replacing the weight coefficient of that data source with the preset benchmark weight coefficient and renormalizing the weight coefficients of all enabled data sources.

10. A target entity business type identification system based on multi-source data fusion and dynamic weight adjustment, characterized in that it comprises: a multi-source data acquisition module, used to collect raw data from multiple heterogeneous data sources, including at least a financial quantitative time-series data source, a technical attribute data source, a business activity trajectory data source, and a text description data source; a data preprocessing and feature extraction module, used to preprocess the collected raw data and extract the feature vector corresponding to each data source; a weight analysis and modeling module, used to calculate the weight coefficient of each data source in the current analysis scenario according to a preset dynamic weight adjustment strategy, the dynamic weight adjustment strategy being generated based on at least one of data source quality attribute parameters or industry stage characteristic parameters, and to perform a comprehensive weighted calculation based on the feature vectors of each data source and the corresponding weight coefficients and output the business type identification result of the target entity and the corresponding confidence level; a cross-validation and logical reasoning module, used to invoke preset cross-validation rules to verify the credibility of the business type identification result; and a results generation and output module, used to generate and output a structured report that includes the business type identification result, the confidence level, and the key evidence chain.